Spam Detection with Natural Language Processing (NLP) – Part 1

     Part 1: Data Cleaning and Exploratory Data Analysis


Natural language processing (NLP) is a subfield of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages.

When I first began learning NLP, I found it difficult to process text and generate insights from it. I knew some of the basic NLP techniques, but I could never connect them together to view them as an end-to-end process of generating insights out of text data.

In this blog, we will build a simple machine learning classifier that helps identify whether a given SMS is spam or not. In parallel, we will also cover a few basic components of Natural Language Processing (NLP) for readers who are new to it.

Building an SMS Spam Classifier

In this section, we will be building a spam classifier step by step.

Step 1: Importing Libraries

We will be using pandas, numpy and a Multinomial Naive Bayes classifier for building the spam detector. Pandas will be used for performing operations on data frames, and numpy for the necessary numerical operations.
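
A minimal sketch of the imports, assuming scikit-learn provides the Multinomial Naive Bayes classifier that the series builds up to:

```python
import pandas as pd
import numpy as np
from sklearn.naive_bayes import MultinomialNB  # classifier used later in the series
```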

Step 2: Reading the dataset and preparing it for basic processing in NLP

First, we read the CSV using the pandas read_csv function. We then modify the column names for easy reference. In this dataset, the target variable is categorical (ham, spam) and we need to convert it into a binary variable. Remember, machine learning models always take numbers as input, not text, hence we need to convert all categorical variables into numerical ones.

We replace ham with 0 (meaning the SMS is not spam) and spam with 1 (meaning the SMS is spam).
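
A minimal sketch of this step; the file name, encoding and data-frame name are assumptions and may differ for your copy of the dataset:

```python
# Read the SMS dataset and keep only the label and message columns.
df = pd.read_csv("spam.csv", encoding="latin-1")
df = df.iloc[:, :2]
df.columns = ["label", "text"]   # rename columns for easy reference

# Encode the target: ham -> 0 (not spam), spam -> 1 (spam).
df["label"] = df["label"].map({"ham": 0, "spam": 1})
```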

Step 3: Cleaning Data

Cleaning text is one of the most interesting and important steps before performing any kind of analysis on it. Text from social media and other platforms may contain many irregularities. People tend to express their feelings while writing, and you may end up with words like gooood or goood or goooooooooooood in your dataset. Essentially these are all the same word, but we need to regularize the data first. I have made a function below which works fairly well in removing all such inconsistencies from the data.

The clean_data() function takes a sentence as its input and returns a cleaned sentence. This function takes care of the following:

  1. Removing web links from the text data as they are not very useful
  2. Correcting words like poooor and baaaaaad to poor and bad
  3. Removing punctuation from the text
  4. Removing apostrophes from the text to expand words like I’m to I am
  5. Correcting spelling mistakes

Below is the snippet for the clean_data function.
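
A minimal, illustrative sketch of such a function, assuming regular expressions are used for the replacements. The two helpers it calls are sketched in the next snippets, and the spelling-correction step (item 5) is left out here:

```python
import re

def clean_data(sentence):
    """Return a cleaned, lower-cased version of the input sentence."""
    sentence = sentence.lower()
    sentence = re.sub(r"http\S+|www\.\S+", "", sentence)  # 1. remove web links
    sentence = re.sub(r"(.)\1{2,}", r"\1\1", sentence)    # 2. collapse runs: poooor -> poor
    sentence = remove_apostrophes(sentence)               # 4. expand words like i'm -> i am
    sentence = remove_punctuation(sentence)               # 3. strip punctuation
    # 5. spelling correction is omitted in this sketch (a library such as TextBlob could be used)
    return sentence.strip()
```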

Function to remove punctuations from the sentence
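
A sketch of the punctuation helper (the exact implementation in the original may differ):

```python
import re

def remove_punctuation(sentence):
    """Strip punctuation, keeping only word characters and whitespace."""
    return re.sub(r"[^\w\s]", "", sentence)
```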

Function to remove apostrophes from the sentences
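
A sketch of the apostrophe helper; the contraction list here is only a small illustrative subset:

```python
CONTRACTIONS = {"i'm": "i am", "don't": "do not", "can't": "cannot", "it's": "it is"}

def remove_apostrophes(sentence):
    """Expand common contractions so that words like i'm become i am."""
    for short, full in CONTRACTIONS.items():
        sentence = sentence.replace(short, full)
    return sentence
```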

Example of using the clean_data function
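
For instance, with the sketch above:

```python
clean_data("Hey!!! I'm sooooo happy, check http://example.com")
# -> 'hey i am soo happy check'
```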

Now, in order to process and clean all the text data in our dataset, we iterate over every text in the dataset and apply the clean_data function to retrieve cleaner texts.
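
Assuming the data frame and column names from the earlier sketch:

```python
# Clean every message and store the result in a new column.
df["clean_text"] = df["text"].apply(clean_data)
```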

Step 4: Understanding text data and finding Important words

After cleaning our text data, we want to analyze it, but how do we analyze text data? In the case of numbers, we could compute the mean, median, standard deviation, and other statistics to understand the data, but how do we go about it here?

We cannot take a whole sentence and generate meaning from it directly. However, we can take the words from those sentences and find the ones that occur frequently in the text, or the ones that hold relatively higher importance in helping us understand what the complete sentence is about. In the case of identifying a message as spam, we need to understand whether there are any specific words or sequences of words that determine whether an SMS is spam or not.

Tokenization and Lemmatization

We start by breaking each sentence into individual words, so a sentence like “Hey, You are awesome” will be broken into an array [‘hey’, ‘you’, ‘are’, ‘awesome’]. This process is known as tokenization, and each individual word is known as a token. After getting the tokens, we reduce each token to its most basic form (lemmatization). For example, words like studies and goes become study and go respectively. Also, remember that we need to remove stopwords like I, you, her, him etc., as these words are very frequent in the text and hardly contribute to deciding whether a message is spam or not!

Given below is a tokenizer function that takes each sentence as input. It splits the sentence into individual tokens and then lemmatizes them. In the end, we remove stop words from the tokens and return the remaining tokens as an array.
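
A minimal sketch of such a tokenizer, assuming NLTK is used for tokenization, lemmatization and the stop-word list:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time download of the required NLTK resources.
nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("stopwords", quiet=True)

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def my_tokeniser(sentence):
    """Split a sentence into lemmatized tokens with stop words removed."""
    tokens = nltk.word_tokenize(sentence.lower())
    tokens = [lemmatizer.lemmatize(tok) for tok in tokens if tok.isalpha()]
    return [tok for tok in tokens if tok not in stop_words]
```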

Example showing the working of my_tokeniser function
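
For instance, with the sketch above:

```python
my_tokeniser("Hey, You are awesome")
# -> ['hey', 'awesome']  ("you" and "are" are dropped as stop words)
```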

Understanding n-grams

An n-gram is a contiguous sequence of n items from a given sequence of text. Given a sentence s, we can construct a list of n-grams from s by finding pairs of words that occur next to each other. For example, given the sentence “I am Kartik”, you can construct bigrams (n-grams of length 2) by finding consecutive pairs of words, which will be (“I”, “am”) and (“am”, “Kartik”).

A consecutive sequence of three words is known as a tri-gram. This will help us understand how a sequence of tokens together determines whether an incoming message is spam or not. In natural language processing (NLP), n-grams hold a lot of importance, as they determine how sequences of words affect the meaning of a sentence.

We will find the most common bi-grams and tri-grams from the messages in the dataset, separately for spam and non-spam messages, and then look at the most commonly occurring sequences of text in each category.

Code for finding out bi-grams and tri-grams

Below is a Python function which takes two input parameters, label and n. The “label” parameter is the target label of the message: for spam messages it is 1, whereas for non-spam messages it is 0. The “n” parameter selects whether we want to extract bi-grams or tri-grams from the sentences. A very high value of n will not make much sense, as long sequences of text are rarely repeated throughout the data.
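
A sketch of such a function; the function name and the data-frame columns (label, clean_text) are assumptions carried over from the earlier sketches:

```python
from collections import Counter
from nltk.util import ngrams

def get_ngrams(label, n):
    """Count n-grams over all messages carrying the given target label (0 or 1)."""
    counts = Counter()
    for message in df.loc[df["label"] == label, "clean_text"]:
        tokens = my_tokeniser(message)
        counts.update(ngrams(tokens, n))
    return counts
```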

We will call the below function to directly plot all the common bigrams or trigrams as a word cloud. This function calls the above function to get all the bi-grams or tri-grams from the messages and then plots them.
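
A sketch of such a plotting helper, assuming the wordcloud and matplotlib packages are available; get_ngrams is the function sketched above:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def plot_ngram_cloud(label, n):
    """Plot the most frequent n-grams for a given label as a word cloud."""
    counts = get_ngrams(label, n)
    frequencies = {" ".join(gram): count for gram, count in counts.items()}
    cloud = WordCloud(background_color="white").generate_from_frequencies(frequencies)
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
```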

Most frequent words in spam messages


Most frequent words in non-spam messages


Top 15 frequent bigrams for non-spam messages


Top 15 frequent bigrams for spam messages


Visualizing most frequent trigrams for non-spam messages


Visualizing most frequent trigrams for spam messages

Conclusion

Till now, we have learned how to start with cleaning and understanding data. This process needs to be done before any kind of text analysis. One should always start with cleaning the text and then move on to extracting tokens out of it. Getting tokens out of the text also requires excluding stop words, and we need to reduce all the other words to their basic morphological form using lemmatization. In the next blog, we will look at finding important words in the text data and learn about word embeddings. In the end, we will finally build a classifier to filter out spam SMS.

Stay tuned!

Data Cleaning, Categorization and Normalization


Data cleaning, categorization and normalization are the most important steps in preparing data. Data that is captured is generally dirty and unfit for statistical analysis. It has to be cleaned, standardized, categorized and normalized first, and then explored.

Definition of Clean Data

Happy families are all alike; every unhappy family is unhappy in its own way – Leo Tolstoy

Like families, clean datasets are all alike, but every messy dataset is unreadable by our modeling algorithms in its own way. Clean datasets provide a standardized way to link the structure of a dataset with its semantics.

We will take the following text file with dirty data:

%% Data
Sonu ,1861, 1892, male
Arun , 1892, M
1871, Monica, 1937, Female
1880, RUCHI, F
Geetu, 1850, 1950, fem
BaLa, 1893,1863
% Names, birth dates, death dates, gender

Let us start with tidying the above data. We’ll adopt the following steps:

1. Read the unclean data from the text file and analyse the structure, content, and quality of data.

The following functions let us read data that is technically correct or close to it:

  • read.table
  • read.csv
  • read.csv2
  • read.delim
  • read.delim2

When the rows in the data file are not uniformly formatted, we can consider reading the text in line by line and transforming the data to rectangular form ourselves.
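
A minimal sketch of reading the raw lines; the file name is an assumption:

```r
# Read the raw, non-rectangular text line by line.
txt <- readLines("dirty_data.txt")
```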

The variable txt is a vector of type “character” having 8 elements, equal to the number of lines in our text file.

2. Delete the irrelevant/duplicate data. This improves data protection, increases the speed of processing and reduces the overall costs.

In our example, we will delete the comments from the data. Comments are the lines starting with the “%” sign.

Following is the code:
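
A minimal sketch, assuming grepl is used to match the comment marker:

```r
# Drop comment lines, i.e. lines starting with the "%" sign.
txt <- txt[!grepl("^%", txt)]
```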

3. Split lines into separate fields. This can be done using the strsplit function:
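
For example:

```r
# Split each line into fields on commas.
fields <- strsplit(txt, split = ",")
```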

Here, txt was a character vector, while after splitting, the output stored in the variable “fields” is of type “list”.

4. Standardize and Categorize fields

This is a crucial process in which the data is defined, formatted, represented and structured across all data layers. It involves developing a schema or set of attributes.

The goal of this step is to make sure every row has the same number of fields, and that the fields are in the same order. With the read.table command, any row that has fewer than the maximum number of fields gets padded with NA. One advantage of the do-it-yourself approach is that we don’t have to make this assumption. The easiest way to standardize rows is to write a function that takes a single character vector as input and assigns the values in the right order.
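
A minimal sketch of such a function, following the rules described below (the exact implementation in the original may differ):

```r
Categorize <- function(x) {
  # Standard output order: Name, Birth year, Death year, Gender.
  out <- rep(NA_character_, 4)
  i <- grep("[[:alpha:]]", x)      # positions of the alphabetical fields
  out[1] <- x[i][1]                # first alphabetical field: name
  out[4] <- x[i][2]                # second alphabetical field: gender (may be absent)
  num <- as.numeric(x[-i])         # the remaining fields are years
  out[2] <- num[num < 1890][1]     # birth year: less than 1890
  out[3] <- num[num > 1890][1]     # death year: greater than 1890
  out
}
```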

The above function takes the fields of each line as its input (as x). The function returns a character vector in a standard order: Name, Birth Year, Death Year, Gender. Wherever a field is missing, NA is introduced. The grep statement is used to get the positions of the alphabetical fields in our input x. There are two kinds of alphabetical fields in our input string, one representing the name and the other representing the gender. Both are assigned to the 1st and 4th positions of the out vector accordingly. The year of birth and the year of death are recognized as the numbers “less than 1890” and “greater than 1890” respectively.

To retrieve the fields for each row in the example, we need to apply this function to every row of fields.

stdfields <- lapply(fields, Categorize)

Our Categorize function here is quite fragile; for example, it fails when the input vector contains three or more character fields. The data analyst has to determine how generalized the Categorize function should be.

5. Transform the Data to the Data Frame Type

First, we will copy the elements of the list to a matrix, which is then coerced into a data frame.
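
A minimal sketch; the column names follow the fields used throughout this example:

```r
# Bind the standardized rows into a matrix (filled by row), then coerce to a data frame.
m <- matrix(unlist(stdfields), nrow = length(stdfields), byrow = TRUE)
colnames(m) <- c("Name", "Birth", "Death", "Gender")
dat <- as.data.frame(m, stringsAsFactors = FALSE)
```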

Here, we have filled the matrix by row. The number of rows in the matrix will be the same as the number of elements in stdfields. Thereafter, we have converted the matrix into the data.frame format.

6. Data Normalization

It is the systematic process to ensure the data structure is suitable or serves the purpose. Here the undesirable characteristics of the data are eliminated or updated to improve the consistency and the quality. The goal of this process is to reduce redundancy, inaccuracy and to organize the data.

String normalization techniques are aimed at transforming a variety of strings to a smaller set of string values which are more easily processed.

  • Remove the white spaces. We can use the str_trim function from the stringr library.
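
A minimal sketch:

```r
library(stringr)

# Trim leading and trailing white space from every column.
dat <- as.data.frame(sapply(dat, str_trim), stringsAsFactors = FALSE)
```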

The sapply function will return an object of type matrix, so we reconvert it into the data frame type.

  • Converting all the letters in the column “Name” to upper-case for standardization
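
For example:

```r
# Upper-case the Name column for standardization.
dat$Name <- toupper(dat$Name)
```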

  • Normalize the gender variable. We have to normalize the gender variable, which appears in 5 different formats. Following are two ways to do it:

The first uses the “^” anchor in a regular expression. This will find the values in the Gender column which begin with m and f respectively.

Following is the code:
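
A minimal sketch of this first approach:

```r
# Anything starting with m/M becomes "male"; anything starting with f/F becomes "female".
dat$Gender[grepl("^[mM]", dat$Gender)] <- "male"
dat$Gender[grepl("^[fF]", dat$Gender)] <- "female"
```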

There is another method: approximate string matching using string distances. A string distance measures how much two strings differ from each other, i.e. how many operations are required to turn one string into the other. For example, two operations are required to convert pqr to qpr:

  • replace q: pqr -> ppr
  • replace p: ppr -> qpr

We’ll use the adist function on the gender column in the following manner:
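
A minimal sketch of this second approach:

```r
codes <- c("male", "female")

# Edit distance between every Gender value and each of the two codes.
dist_matrix <- adist(dat$Gender, codes)
colnames(dist_matrix) <- codes
dist_matrix
```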

The first value, male, is 0 replacements away from “male” and 2 replacements away from “female”.
The second, M, is 4 replacements away from “male” and 6 replacements away from “female”.
The third, Female, is 2 replacements away from “male” and 1 replacement away from “female”.

We can use the which.min() function on each row to get the code with the minimum distance.

It contains the column number where the distance from our codes is minimum. If the distance from column number 1 is minimum, we substitute the gender with code[1], and if the distance from column number 2 is minimum, we substitute the gender with code[2]. Following is the code:
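
A minimal sketch:

```r
# For each row, pick whichever code ("male" or "female") is closest; NA stays NA.
best_match <- apply(dist_matrix, 1, function(d) if (all(is.na(d))) NA else which.min(d))
dat$Gender <- codes[best_match]
```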

  • Normalize the data-types

Let’s retrieve the classes of each column in the data frame dat:
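
For example:

```r
sapply(dat, class)   # every column is currently of class "character"
```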

We will convert the Birth and Death columns to numeric data types using the transform function.
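
A minimal sketch:

```r
# Convert Birth and Death to numeric using transform.
dat <- transform(dat, Birth = as.numeric(Birth), Death = as.numeric(Death))
```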

The data as represented in the dat data frame is now clean, standardized, categorized and normalized.

Cleaning the Date Variable

When we talk about data cleaning, we cannot miss cleaning and normalizing the date variable. Dates are generally keyed in by different people in different formats, so we always face problems standardizing them.

In the following example, we’ll standardize and normalize the date variable:

We notice the date variable is in a mixed format. We will first substitute “/” and ” ” in the date variable with “-”, in our attempt to standardize the dates.
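
A minimal sketch, assuming the raw date strings are stored in a character vector called date:

```r
# Replace "/" and " " separators with "-" to standardize the delimiter.
date <- gsub("[/ ]", "-", date)
```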

In the below code, we will make the assumption that the month cannot be in the first two or the last two characters. In all cases the month comes in the middle, between the date and the year, or between the year and the date.

Next, we will write a function to find the first two digits of each date (a sketch follows the list below):

  • Split the date string and store the output in dt. Eg. “110798” to dt = ‘1’ ‘1’ ‘0’ ‘7’ ‘9’ ‘8’
  • If dt[2]=“-”, then first_2 numbers will be 0 and dt[1]
  • We will convert the first_2 characters to numeric type.
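
A sketch of such a helper; the function name is an assumption:

```r
find_first_2 <- function(d) {
  dt <- strsplit(d, split = "")[[1]]   # e.g. "110798" -> '1' '1' '0' '7' '9' '8'
  if (dt[2] == "-") {
    first_2 <- paste0("0", dt[1])      # single-digit first field: pad with a leading 0
  } else {
    first_2 <- paste0(dt[1], dt[2])
  }
  as.numeric(first_2)
}
```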

Next, we will write a function to find the last two digits of each date (a sketch follows the list below):

  • Split the date string and store the output in dt. Eg. “110798” to dt = ‘1’ ‘1’ ‘0’ ‘7’ ‘9’ ‘8’
  • If the second-last character of dt is “-”, then the last_2 digits will be 0 and the last character of dt
  • We will convert the last_2 characters to numeric type.
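
A sketch of the corresponding helper for the last two digits (the name is again an assumption):

```r
find_last_2 <- function(d) {
  dt <- strsplit(d, split = "")[[1]]
  n <- length(dt)
  if (dt[n - 1] == "-") {
    last_2 <- paste0("0", dt[n])       # single-digit last field: pad with a leading 0
  } else {
    last_2 <- paste0(dt[n - 1], dt[n])
  }
  as.numeric(last_2)
}
```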

The middle 2 characters can be alphabetic or numeric. We will write a function to find the middle 2 characters, adopting the following steps (a sketch follows the list):

  • Split the date string and store the output in dt. Eg. “110798” to dt = ‘1’ ‘1’ ‘0’ ‘7’ ‘9’ ‘8’
  • If dt[3] = “-”, then dt[4:5] are the middle characters. But if dt[5] is a “-”, then dt[4] (single digit) is the middle character.
  • If dt[3] != “-”, then dt[3:4] are the middle characters. But if dt[4] is a “-”, then dt[3] (single digit) is the middle character.
  • If the middle character is a single digit, then we need to append 0 to the single digit, to return a 2 digit middle character.
  • Later in the code, we will evaluate if the middle character is an alphabetic or numeric, and process accordingly.
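
A sketch of the middle-character helper following the steps above (the name is an assumption):

```r
find_mid_2 <- function(d) {
  dt <- strsplit(d, split = "")[[1]]
  if (dt[3] == "-") {
    mid <- if (dt[5] == "-") dt[4] else paste0(dt[4], dt[5])
  } else {
    mid <- if (dt[4] == "-") dt[3] else paste0(dt[3], dt[4])
  }
  if (nchar(mid) == 1) {
    mid <- paste0("0", mid)            # pad a single-digit middle character with 0
  }
  mid
}
```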

Next, we will initialize the variables before applying a for-loop to process and segregate the date, month and year for each date field (a sketch follows the list):

  • df is a dummy data.frame object.
  • datf will be the final data frame object, comprising the fields dd, mm, year, month format (i.e. numeric or alphabetical) and the concatenated date in the standard format.
  • snum refers to the serial number of each date in the vector “date”.
  • We will keep the default date as 01, the default month as 01 and the default year as 1990 in case of missing fields in these variables.
  • mon_format refers to whether the month is in numeric or alphabetical format.
  • The cat_date field contains the date, month and year concatenated.
  • format_date will contain cat_date converted to the “date” format.
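
A sketch of the main initializations used by the loop further below (names follow the list above; snum is set by the loop itself):

```r
datf <- data.frame()                      # final frame: dd, mm, year, mon_format, cat_date
formatted_date <- as.Date(character(0))   # standardized dates in "Date" format
default_dd <- "01"; default_mm <- "01"; default_year <- "1990"
```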

Now, for each date field, we will find the first-two numbers, the last-two numbers and the middle characters. Then we’ll apply following rules:

  • If number of characters in our date is 4, then this represents the year.
  • If first_2>31, then the first-two numbers represent the year.
  • If first_2 < 31, then first_2 represents the date.
  • If 2nd number is a “-”, then 1st number represents the single digit date.
  • If last_2>31, then the last-two numbers represent the year.
  • If last_2<31, then the last-two numbers represent the date.
  • If the 2nd last number is a “-”, then the last number represents the single digit date.
  • If middle characters are non-alphabetical (i.e if middle characters are numeric), then mid-num represents the numeric middle characters. Mon_Format will be labelled numeric in such case.
  • If mid-num > 31, then mid-num represents the year; else, if mid-num is 12 or less, then mid-num represents the month.
  • If the date has an alphabetical character, then the first 3 alphabets represent the month. The Mon_Format will be alphabetical in this case.
  • Next, we will concatenate the date in the following format: dd-mm-yy, where mm can be numeric (with mon_format as “num”) or alphabetic (with mon_format as “alpha”), and store it in the variable “cat_date”.
  • We will change the format of cat_date to “date” data-type and store in the variable “formatted_date”.

Following is the code:
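
A simplified, illustrative sketch of the loop, using the helper functions and initializations sketched above; it follows the stated rules but is not the original implementation and does not try to cover every edge case:

```r
for (snum in seq_along(date)) {
  d <- date[snum]
  dd <- default_dd; mm <- default_mm; yy <- default_year; mon_format <- "num"

  if (nchar(d) == 4) {                      # a bare 4-character field is just the year
    yy <- d
  } else {
    first_2 <- find_first_2(d)
    last_2  <- find_last_2(d)
    if (first_2 > 31) yy <- as.character(first_2) else dd <- sprintf("%02d", as.integer(first_2))
    if (last_2  > 31) yy <- as.character(last_2)  else dd <- sprintf("%02d", as.integer(last_2))

    if (grepl("[[:alpha:]]", d)) {          # alphabetical month: take its first three letters
      mm <- substr(gsub("[^[:alpha:]]", "", d), 1, 3)
      mon_format <- "alpha"
    } else {
      mid_num <- as.numeric(find_mid_2(d))
      if (mid_num > 31) {
        yy <- as.character(mid_num)
      } else if (mid_num <= 12) {
        mm <- sprintf("%02d", as.integer(mid_num))
      }
    }
  }

  cat_date <- paste(dd, mm, yy, sep = "-")  # dd-mm-yy(yy)
  fmt <- paste0("%d-", if (mon_format == "alpha") "%b" else "%m",
                if (nchar(yy) == 4) "-%Y" else "-%y")
  datf <- rbind(datf, data.frame(dd = dd, mm = mm, year = yy,
                                 mon_format = mon_format, cat_date = cat_date))
  formatted_date[snum] <- as.Date(cat_date, format = fmt)
}
```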

The vector formatted_date contains the date in the correct format.

The data.frame datf comprises all the entries segregated into date, month and year.

We note that the formatted_date vector now contains all the dates in the standard format. The class of this variable is “Date”.

I would like to conclude this article with:

“Bad Data leads to Bad Decisions”

“Good Data improves Efficient Decisions”

Also read:

Data Exploration and Uni-Variate Analysis
Bi-Variate Analysis