Spam Detection with Natural Language Processing (NLP) – Part 1
Part 1: Data Cleaning and Exploratory Data Analysis
Natural language processing (NLP) is a subfield of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages.
When I first began learning NLP, it was difficult for me to process text and generate insights out of it. Before actually diving deep into NLP, I knew some of the basic techniques in NLP before but never could connect them together to view it as an end to end process of generating insights out of the text data.
In this blog, we will try to build a simple classifier using machine learning which will help in identifying whether a given SMS is a spam or not. Parallely, we will also be understanding a few basic components of Natural Language Processing (NLP) for the readers who are new to natural language processing.
Building SMS SPAM Classifier
In this section, we will be building a spam classifier step by step.
Step 1: Importing Libraries
We will be using pandas, numpy and Multinomial naive Bayes classifier for building a spam detector. Pandas will be used for performing operations on data frames. Furthermore using numpy, we will perform necessary mathematical operations.
from sklearn.naive_bayes import MultinomialNB import pandas as pd import numpy as np
Step 2: Reading the dataset and preparing it for basic processing in NLP
First, we read the csv using pandas read_csv function. We then modify the column names for easy references. In this dataset, the target variable is categorical (ham, spam) and we need to convert into a binary variable. Remember, machine learning models always take numbers as input and not the text hence we need to convert all categorical variables into numerical ones.
We replace ham with 0 (meaning not a spam) and spam with 1 (meaning that the SMS is a spam)
## Reading the dataset as a csv file training_dataset = pd.read_csv("spam.csv", encoding="ISO-8859-1") ## Renaming columns training_dataset.columns=["labels","comment"] ## Adding a new column to contain target variable training_dataset["b_labels"] = [0 if x=="ham" else 1 for x in final_data["labels"] ] Y = training_dataset["b_labels"].as_matrix() training_dataset.head()
Step 3: Cleaning Data
Well, Cleaning text is one of the interesting and very important steps before performing any kind of analysis over it. Text from social media and another platform may contain many irregularities in it. People tend to express their feeling while writing and you may end up with words like gooood or goood or goooooooooooood in your dataset. Essentially all are same but we need to regularize this data first. I have made a function below which works fairly well in removing all the inconsistencies from the data.
Clean_data() function takes a sentence as it’s input and returns a cleaned sentence. This function takes care of the following
- Removing web links from the text data as they are not pretty much useful
- Correcting words like poooor and baaaaaad to poor and bad
- Removing punctuations from the text
- Removing apostrophes from the text to correct words like I’m to I am
- Correcting spelling mistakes
Below is the snippet for clean_data function
def clean_data(sentence): ## removing web links s = [ re.sub(r'http\S+', '', sentence.lower())] ## removing words like gooood and poooor to good and poor s = [''.join(''.join(s)[:2] for _, s in itertools.groupby(s[0]))] ## removing appostophes s = [remove_appostophes(s[0])] ## removing punctuations from the code s = [remove_punctuations(s[0])] return s[0]
Function to remove punctuations from the sentence
def remove_punctuations(my_str): punctuations = '''!()-[]{};:'"\,./?@#$%^&@*_~''' no_punct = "" for char in my_str: if char not in punctuations: no_punct = no_punct + char return no_punct
Function to remove apostrophes from the sentences
def remove_appostophes(sentence): APPOSTOPHES = {"s" : "is", "re" : "are", "t": "not", "ll":"will","d":"had","ve":"have","m": "am"} words = nltk.tokenize.word_tokenize(sentence) final_words=[] for word in words: broken_words=word.split("'") for single_words in broken_words: final_words.append(single_words) reformed = [APPOSTOPHES[word] if word in APPOSTOPHES else word for word in final_words] reformed = " ".join(reformed) return reformed
Example of using the clean_data function
## Sample Sentence to be cleaned sentence="Goooood Morning! My Name is Joe & I'm going to watch a movie today https://youtube.com. ##" ## Using clean_data function clean_data(sentence) ## Output ## good morning my name is joe i am going to watch a movie today
Now in order to process and clean all the text data in our dataset, we iterate over every text in the dataset and apply the clean_data function to retriever cleaner texts
for index in range(0,len(training_dataset["comment"])): training_dataset.loc[index,"comment"] = clean_data(training_dataset["comment"].iloc[index])
Step 4: Understanding text data and finding Important words
After cleaning our text data, we want to analyze it but how de analyze text data? In the case of numbers, we could have gone with finding out mean, median, standard deviation, and other statistics to understand the data but how do we go about here?
We can not take a whole sentence up and generate meaning from it. Although, we can take words from those sentences and try to find out words that are frequently occurring in the text document or finding out the words which hold relatively higher importance in helping us understand what the complete sentence is about. In case of identifying a message as spam, we need to understand that are there any specific words or sequence of words that determine whether an SMS is a spam or not.
Tokenization and Lemmatization
We start by breaking each sentence into individual words. So a sentence like “Hey, You are awesome” will be broken into individual words into an array [‘hey’, ‘you’, ‘are’, ‘awesome’]. This process is known as tokenization and every single word is known as tokens. After getting each token, we try to get each token into its most basic form. For example, words like studies and goes will become study and go respectively. Also, remember that we need to remove stopwords like I, you, her, him etc as these words are very frequent in the text and hardly lead to any interpretation about any message being a spam or not!
Given below, I have made a tokenizer function which will take each sentence as input. It splits the sentence into individual tokens and then lemmatizes those words. In the end, we remove stop words from the tokens we have and return these tokens as an array.
def my_tokeniser(s): s = clean_data(s) s = s.lower() tokens = nltk.tokenize.word_tokenize(s) tokens = [t for t in tokens if len(t)>2] tokens = [wordnet_lemmatizer.lemmatize(t) for t in tokens] tokens = [t for t in tokens if t not in stopwords] return tokens
Example showing the working of my_tokeniser function
## Sample Sentence sentence="The car is speeding down the hill" ## Tokenising the sentence my_tokeniser(sentence) ## Output Array: ["car", "speed", "down", "hill"]
Understanding n-grams
An n-gram is a contiguous sequence of n items from a given sequence of text. Given a sentence, s
we can construct a list of n-grams from s finding pairs of words that occur next to each other. For example, given the sentence “I am Kartik” you can construct bigrams (n-grams of length 2) by finding consecutive pairs of words which will be (“I”, “am”), (“am”, “Kartik”).
A consecutive pair of three words is known as tri-grams. This will help us to understand how exactly a sequence of tokens together determines whether an incoming message is a spam or not. In natural language processing (NLP), n-grams hold a lot of importance as they determine how sequences of words affect the meaning of a sentence.
We will be finding out most common bi-grams and tri-grams from the messages we have in the dataset separately for both spam and non-spam messages and consecutively will have a look at most commonly occurring sequences of text in each category.
Code for finding out bi-grams and tri-grams
Below is a python function which takes two input parameters i.e. label and n. The “label” parameter is the target label of the message. For spam messages, it is 1 whereas for non-spam messages it is 0. The “n” parameter is for selecting whether we want to extract bi-grams out or tri-grams out from the sentences. A too much high value for n will not make any sense as long sequences of text are majorly not common throughout the data
def get_grams(label,n): bigrams = [] for sentence in training_dataset[training_dataset["Sentiment"]==sentiment_label]["Phrase"]: tokens = my_tokeniser(sentence) bigrams.append(tokens) bigrams_final=[] bigrams_values=0 bigrams_labels=0 if(n==2): for bigram in bigrams: for i in range(0,len(bigram)-1): bigram_list_basic=bigram[i]+" "+bigram[i+1] bigrams_final.append(bigram_list_basic) else: for bigram in bigrams: for i in range(0,len(bigram)-2): bigram_list_basic=bigram[i]+" "+bigram[i+1]+" "+bigram[i+2] bigrams_final.append(bigram_list_basic) bigrams_final = pd.DataFrame(bigrams_final) bigrams_final.columns=["bigrams"] bigrams_values=bigrams_final.groupby("bigrams")["bigrams"].count() bigrams_labels=bigrams_final.groupby("bigrams").groups.keys() bigrams_final_result = pd.DataFrame( { "bigram":[*bigrams_labels], "count":bigrams_values } ) return bigrams_final_result
We will call the below function to directly plot all the common bigrams or trigrams as a word cloud. This function calls the above function to get all the bi_grams or tri_grams from the messages we have and will then plot it
def plot_grams(sentiment_label,gram_n,height=4, width=14): bigrams_final = get_grams(sentiment_label,n) bigrams_final = bigrams_final.sort_values("count",ascending=False).iloc[:15] plt.barh(bigrams_final["bigram"],bigrams_final["count"], align="center", alpha=0.7) plt.xlabel('Count') plt.title('Most common bigrams') fig_size = plt.rcParams["figure.figsize"] fig_size[0] = width fig_size[1] = height plt.show() plt.rcParams["figure.figsize"] = fig_size
Most frequent words in spam messages
visualise_word_map(label=0)
Most frequent words in non-spam messages
visualise_word_map(label=1)
Top 15 frequent bigrams for non-spam messages
plot_grams(spam_label=0, gram_n=2)
Top 15 frequent bigrams for spam messages
plot_grams(spam_label=1, gram_n=2)
Visualizing most frequent trigrams for non-spam messages
plot_grams(spam_label=0, gram_n=3)
Visualizing most frequent trigrams for spam messages
plot_grams(spam_label=1, gram_n=3)
Conclusion
Till now we have learned how to start with cleaning and understanding data. This process needs to be done before any kind of text analysis. One should always start with cleaning the text and then move on to fetch tokens out of the text. Getting tokens out of the text also requires to exclude stop words. Also, we need to get all the other words into their basic morphological form using lemmatization. In the next blog, we will have a look at finding out important words from the text data. We will also learn the word embeddings. In the end, we will finally build a classifier to segregate spam SMS out.
Stay tuned!