
Spam Detection with Natural Language Processing-Part 2

Understanding TF-IDF and Word Embeddings


In the last blog, we looked at visualizing text data and understood some basic concepts of tokenization and lemmatization. We wrote Python functions to perform all these operations for us. If you are jumping directly to this blog, I highly recommend that you go through the previous post, in which we discussed the problem statement and some foundational concepts of NLP.

We will be covering the following topics:

  1. Understanding TF-IDF
  2. Finding important words using TF-IDF
  3. Understanding Bag of Words
  4. Understanding Word Embedding
  5. Different Types of word embeddings
  6. Difference between word embeddings and Bag of words model
  7. Preparing a word embedding for SPAM classifier

Introduction

Previously, we found the most frequently occurring words, bigrams, and trigrams in the messages, separately for spam and non-spam messages. Now we also need to find important words that can by themselves indicate whether a message is spam or not. Note that the most frequent words in a set of messages are not necessarily the keywords that determine what the entire sentence is about.

For example, in a business article, words like business, investment, and acquisition are important words that may relate a sentence to a business article. Other words like money, good, and building may be frequent in the messages, but they do not provide much relevant information.

To find the important words, we will be using the method known as Term Frequency-Inverse Document Frequency (TF-IDF).

What is TF-IDF?

TF-IDF stands for term frequency-inverse document frequency, and the TF-IDF weight is a statistic often used in information retrieval and text mining.

TF means Term Frequency. It measures how frequently a term occurs in a document. Since every document is different in length, a term is likely to appear many more times in long documents than in shorter ones. Thus, the term frequency is often divided by the document length as a way of normalization.

TF = (Number of times term w appears in a document) / (Total number of terms in the document)

The second part, IDF, stands for Inverse Document Frequency. It measures how important a term is. While computing TF, all terms are treated as equally important. However, certain terms, such as “is”, “of”, and “that”, may appear many times but carry little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones.

IDF =  log_e(Total number of documents / Number of documents with term w in it)

We calculate the final TF-IDF score by multiplying the TF score with the IDF score for every word, and we can then filter out important words by selecting those with a higher TF-IDF score.

Code Implementation

An example to calculate the TF-IDF score for different words
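As a minimal sketch of the idea (the corpus, words, and helper names below are purely illustrative, not the exact code from the post), the two formulas above can be combined as follows:

```python
import math

# A toy corpus of three "documents" (purely illustrative)
documents = [
    "free entry to win a prize now",
    "are you free to meet now",
    "call now to claim your free prize",
]

def tf(term, document):
    # Term frequency: occurrences of the term divided by the document length
    words = document.split()
    return words.count(term) / len(words)

def idf(term, corpus):
    # Inverse document frequency: log of (total documents / documents containing the term)
    containing = sum(1 for doc in corpus if term in doc.split())
    return math.log(len(corpus) / containing)

def tf_idf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

# "free" appears in every document, so its IDF (and hence TF-IDF) is 0;
# "win" appears in only one document, so it gets the highest weight here
for term in ["free", "prize", "win"]:
    print(term, round(tf_idf(term, documents[0], documents), 4))
```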

Finding Important Words using TF-IDF

Now we need to find the most important words in both spam and non-spam messages and then look at those words in the form of a word cloud. Analyzing these words will help us understand why a particular message has been marked as spam and another as non-spam.

First, we import the necessary libraries. Then I have written a function that returns a TF-IDF score for every word in the corpus.

Then we map these scores to the words in the corpus in order to find the most important words.

Finally, we plot all the important words in the form of a word cloud.
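As a rough sketch of these steps (the dataframe data with its label and text columns is assumed to follow the naming used in Part 1, and the exact implementation may differ), we can use scikit-learn's TfidfVectorizer together with the wordcloud package:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def top_tfidf_words(messages):
    # Fit TF-IDF over the given messages and map each word to its mean score
    vectorizer = TfidfVectorizer(stop_words="english")
    scores = vectorizer.fit_transform(messages)
    mean_scores = scores.mean(axis=0).A1  # average TF-IDF score per word
    # get_feature_names_out() needs scikit-learn >= 1.0 (get_feature_names() on older versions)
    return dict(zip(vectorizer.get_feature_names_out(), mean_scores))

def plot_word_cloud(word_scores):
    # Word size in the cloud reflects the TF-IDF weight
    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(word_scores)
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()

# 'data' is assumed to have a cleaned 'text' column and a binary 'label' column
plot_word_cloud(top_tfidf_words(data[data["label"] == 0]["text"]))  # non-spam messages
plot_word_cloud(top_tfidf_words(data[data["label"] == 1]["text"]))  # spam messages
```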

Plotting Important words for non-spam messages

Plotting Important words for spam messages

Understanding Bag of Words

We need a way to represent text data for the machine learning algorithm and the bag-of-words model helps us to achieve that task. The bag-of-words model is simple to understand and implement. It is a way of extracting features from the text for use in machine learning algorithms.

A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

  1. A vocabulary of known words.
  2. A measure of the presence of known words.

The vocabulary can be obtained by tokenizing the messages into unique tokens. After getting each token, we need to score it. This can be done in the following ways:

  • Counts. Count the number of times each word appears in a document.
  • Frequencies. Calculate the frequency that each word appears in a document out of all the words in the document.
  • TF-IDF. Multiply the TF score by the IDF score for each word.

How BoW works

Forming the vector

Take for example two text samples: “The quick brown fox jumps over the lazy dog” and “Never jump over the lazy dog quickly”.

The corpus (text samples) then forms a dictionary:

Vectors are then formed to represent the count of each word. In this case, each text (i.e. each sentence) will generate a 10-element vector like so:

Each element represents the number of occurrences of a word in the corpus (text sample). So, in the first sentence, there is 1 count for “brown”, 1 count for “dog”, 1 count for “fox”, and so on (represented by the first vector). The second vector shows that there is 0 count for “brown”, 1 count for “dog”, 0 counts for “fox”, and so forth.
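Below is a minimal sketch of this example using scikit-learn's CountVectorizer. Note that without lemmatization, “jump”/“jumps” and “quick”/“quickly” remain distinct tokens, so this particular vocabulary ends up with 11 entries:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The quick brown fox jumps over the lazy dog",
    "Never jump over the lazy dog quickly",
]

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
# ['brown' 'dog' 'fox' 'jump' 'jumps' 'lazy' 'never' 'over' 'quick' 'quickly' 'the']
print(vectors.toarray())
# [[1 1 1 0 1 1 0 1 1 0 2]
#  [0 1 0 1 0 1 1 1 0 1 1]]
```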

Understanding Word Vectors

Word vectors are simply vectors of numbers that represent the meaning of a word.

Traditional approaches to NLP, such as one-hot encodings, do not capture syntactic (structure) and semantic (meaning) relationships across collections of words and, therefore, represent language in a very naive way.

Word vectors represent words as multidimensional continuous floating point numbers where semantically similar words are mapped to proximate points in geometric space. In simpler terms, a word vector is a row of real-valued numbers (as opposed to dummy numbers) where each point captures a dimension of the word’s meaning and where semantically similar words have similar vectors. This means that words such as wheel and engine should have similar word vectors to the word car (because of the similarity of their meanings), whereas the word banana should be quite distant.

A simple representation of word vectors

Now we will look at an example of using word vectors, where we will group words with similar semantics together.
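As a rough sketch of how such a grouping can be produced, the snippet below trains gensim's Word2Vec on a tiny hand-made corpus; the sentences and parameters are purely illustrative, and meaningful groupings normally require a much larger corpus or pre-trained vectors such as GloVe:

```python
from gensim.models import Word2Vec

# A tiny toy corpus; real word vectors need far more text (or pre-trained embeddings)
sentences = [
    ["car", "wheel", "engine", "drive", "road"],
    ["banana", "apple", "fruit", "eat", "sweet"],
    ["car", "engine", "drive", "fast"],
    ["apple", "banana", "eat"],
]

# vector_size is the dimensionality of each word vector (called "size" in gensim < 4.0)
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=200)

print(model.wv["car"][:5])                   # first 5 dimensions of the vector for 'car'
print(model.wv.most_similar("car", topn=3))  # words whose vectors are closest to 'car'
```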

Preparing a bag of words model for Analysis

Below is the code snippet for converting our messages into a table of numerical word vectors. Only after achieving this can we build our classifier using machine learning, since machine learning always needs numerical inputs!
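A hedged sketch of this step is shown below. It assumes the cleaned messages live in data["text"] and reuses the my_tokeniser function from Part 1; the exact pipeline may differ:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Build the vocabulary from the cleaned messages and count word occurrences,
# reusing the tokenizer written in Part 1 (assumed to be in scope)
vectorizer = CountVectorizer(tokenizer=my_tokeniser)
features = vectorizer.fit_transform(data["text"])

# A numerical table: one row per message, one column per word in the vocabulary
bow_table = pd.DataFrame(
    features.toarray(),
    columns=vectorizer.get_feature_names_out(),
)
print(bow_table.shape)
```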

Conclusion and Further steps

Till now, we have learnt to perform EDA over text data. We have also learnt about important terms in NLP like tokenization, lemmatization, stop words, TF-IDF, the bag of words, and word vectors. These terms are essential for mastering NLP. With our word embedding ready, we will proceed to actually build machine learning models. They will help us predict whether a message is spam or not. In the next blog, we will build machine learning and neural network models and compare their performance. We will understand the shortcomings of the neural net in the case of text mining. Finally, we will move to recurrent neural networks and LSTMs to wrap up the series!

Click Here for Part 1 of the article.

Stay tuned!

Spam Detection with Natural Language Processing (NLP) – Part 1

     Part 1: Data Cleaning and Exploratory Data Analysis

Spam detection with NLP

Predicting whether an SMS is a spam

 

Natural language processing (NLP) is a subfield of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages.

When I first began learning NLP, it was difficult for me to process text and generate insights out of it. Before diving deep into NLP, I knew some of the basic techniques, but I could never connect them together to view it as an end-to-end process of generating insights out of text data.

In this blog, we will try to build a simple classifier using machine learning which will help in identifying whether a given SMS is spam or not. In parallel, we will also cover a few basic components of Natural Language Processing (NLP) for readers who are new to it.

Building SMS SPAM Classifier

In this section, we will be building a spam classifier step by step.

Step 1: Importing Libraries

We will be using pandas, numpy, and the multinomial naive Bayes classifier for building a spam detector. Pandas will be used for performing operations on data frames, while numpy will handle the necessary mathematical operations.
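A minimal set of imports along these lines might look as follows (train_test_split is an assumption about how the data will later be split):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
```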

Step 2: Reading the dataset and preparing it for basic processing in NLP

First, we read the CSV using pandas' read_csv function. We then modify the column names for easy reference. In this dataset, the target variable is categorical (ham, spam) and we need to convert it into a binary variable. Remember, machine learning models always take numbers as input, not text, hence we need to convert all categorical variables into numerical ones.

We replace ham with 0 (meaning the SMS is not spam) and spam with 1 (meaning that the SMS is spam).
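A sketch of this step is given below; the file name spam.csv and the original column names v1/v2 are assumptions based on the commonly distributed version of the SMS spam dataset:

```python
# File name, encoding, and column names are assumptions; the dataset is often
# distributed with the label in column "v1" and the message text in column "v2"
data = pd.read_csv("spam.csv", encoding="latin-1")[["v1", "v2"]]
data.columns = ["label", "text"]

# Convert the categorical target into a binary variable: ham -> 0, spam -> 1
data["label"] = data["label"].map({"ham": 0, "spam": 1})
print(data.head())
```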

Step 3: Cleaning Data

Cleaning text is one of the most interesting and very important steps before performing any kind of analysis over it. Text from social media and other platforms may contain many irregularities. People tend to express their feelings while writing, and you may end up with words like gooood or goood or goooooooooooood in your dataset. Essentially, these are all the same word, but we need to regularize the data first. I have written a function below which works fairly well in removing all such inconsistencies from the data.

The clean_data() function takes a sentence as its input and returns a cleaned sentence. This function takes care of the following:

  1. Removing web links from the text data, as they are not particularly useful
  2. Correcting words like poooor and baaaaaad to poor and bad
  3. Removing punctuations from the text
  4. Removing apostrophes from the text to correct words like I’m to I am
  5. Correcting spelling mistakes

Below is the snippet for clean_data function

Function to remove punctuations from the sentence

Function to remove apostrophes from the sentences

Example of using the clean_data function
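A hedged reconstruction of these snippets is given below; the contraction map, the regular expressions, and the decision to leave spelling correction as a comment are simplifications of my own:

```python
import re
import string

# A small, illustrative contraction map; the original mapping may be larger
CONTRACTIONS = {"i'm": "i am", "can't": "can not", "don't": "do not", "it's": "it is"}

def remove_apostrophes(sentence):
    # Expand contractions such as "i'm" -> "i am"
    for short, full in CONTRACTIONS.items():
        sentence = sentence.replace(short, full)
    return sentence

def remove_punctuations(sentence):
    # Strip every punctuation character from the sentence
    return sentence.translate(str.maketrans("", "", string.punctuation))

def clean_data(sentence):
    sentence = sentence.lower()
    sentence = re.sub(r"http\S+|www\.\S+", " ", sentence)  # remove web links
    sentence = re.sub(r"(.)\1{2,}", r"\1\1", sentence)     # goooood -> good, baaaaad -> baad
    sentence = remove_apostrophes(sentence)                 # i'm -> i am
    sentence = remove_punctuations(sentence)                # strip punctuation
    # A spell checker (e.g. TextBlob's correct()) could fix leftovers like "baad" -> "bad"
    return re.sub(r"\s+", " ", sentence).strip()

print(clean_data("I'm feeling goooood!! Visit http://spam.example.com now"))
# 'i am feeling good visit now'
```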

Now, in order to process and clean all the text data in our dataset, we iterate over every text in the dataset and apply the clean_data function to retrieve cleaner texts.
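Assuming the messages sit in a text column of the dataframe, this is a single pandas apply:

```python
# Clean every message in the dataset
data["text"] = data["text"].apply(clean_data)
```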

Step 4: Understanding text data and finding Important words

After cleaning our text data, we want to analyze it, but how do we analyze text data? In the case of numbers, we could compute the mean, median, standard deviation, and other statistics to understand the data, but how do we go about it here?

We cannot take a whole sentence and generate meaning from it directly. However, we can take words from those sentences and try to find the words that occur frequently in the text, or the words that hold relatively higher importance in helping us understand what the complete sentence is about. In the case of identifying a message as spam, we need to understand whether there are any specific words or sequences of words that determine whether an SMS is spam or not.

Tokenization and Lemmatization

We start by breaking each sentence into individual words. So a sentence like “Hey, You are awesome” will be broken into individual words in an array [‘hey’, ‘you’, ‘are’, ‘awesome’]. This process is known as tokenization, and every single word is known as a token. After getting each token, we try to reduce it to its most basic form. For example, words like studies and goes become study and go respectively. Also, remember that we need to remove stopwords like I, you, her, and him, as these words are very frequent in text and hardly help in interpreting whether a message is spam or not!

Below, I have written a tokenizer function which takes each sentence as input. It splits the sentence into individual tokens and then lemmatizes those words. In the end, we remove stop words from the tokens and return the remaining tokens as an array.

Example showing the working of my_tokeniser function
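A sketch of such a my_tokeniser function using NLTK is shown below; the choice of lemmatizer settings and stop-word list is an assumption:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads (safe to re-run)
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def my_tokeniser(sentence):
    # Split the sentence into tokens, lemmatize each token (verb form here),
    # and drop stopwords and non-alphabetic tokens
    tokens = nltk.word_tokenize(sentence.lower())
    lemmas = [lemmatizer.lemmatize(token, pos="v") for token in tokens]
    return [token for token in lemmas if token not in stop_words and token.isalpha()]

print(my_tokeniser("Hey, You are awesome"))
# ['hey', 'awesome']  -- 'you' and 'are' are removed as stopwords
```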

Understanding n-grams

An n-gram is a contiguous sequence of n items from a given sequence of text. Given a sentence s, we can construct a list of n-grams from s by finding pairs of words that occur next to each other. For example, given the sentence “I am Kartik”, you can construct bigrams (n-grams of length 2) by finding consecutive pairs of words, which will be (“I”, “am”) and (“am”, “Kartik”).

A consecutive sequence of three words is known as a tri-gram. This will help us understand how exactly a sequence of tokens together determines whether an incoming message is spam or not. In natural language processing (NLP), n-grams hold a lot of importance as they determine how sequences of words affect the meaning of a sentence.

We will find the most common bi-grams and tri-grams in the messages, separately for spam and non-spam messages, and then have a look at the most commonly occurring sequences of text in each category.

Code for finding out bi-grams and tri-grams

Below is a Python function which takes two input parameters, i.e. label and n. The “label” parameter is the target label of the message: for spam messages it is 1, whereas for non-spam messages it is 0. The “n” parameter selects whether we want to extract bi-grams or tri-grams from the sentences. Too high a value of n will not make much sense, as long sequences of text are mostly not common throughout the data.

We will call the function below to directly plot all the common bigrams or trigrams as a word cloud. This function calls the above function to get all the bi-grams or tri-grams from the messages and will then plot them.
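A hedged sketch of both functions is shown below; it reuses the assumed data dataframe and the my_tokeniser function, and relies on NLTK's ngrams helper together with the wordcloud package:

```python
from collections import Counter
import matplotlib.pyplot as plt
from nltk import ngrams
from wordcloud import WordCloud

def get_ngrams(label, n):
    # Collect all n-grams from messages with the given target label (0 = ham, 1 = spam)
    grams = []
    for message in data[data["label"] == label]["text"]:
        tokens = my_tokeniser(message)
        grams += [" ".join(gram) for gram in ngrams(tokens, n)]
    # counts.most_common(15) gives the data behind the top-15 bar plots shown below
    return Counter(grams)

def plot_ngram_cloud(label, n):
    # Plot the most common n-grams for a label as a word cloud
    counts = get_ngrams(label, n)
    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(counts)
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()

plot_ngram_cloud(label=1, n=2)  # most common bigrams in spam messages
plot_ngram_cloud(label=0, n=3)  # most common trigrams in non-spam messages
```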

Most frequent words in spam messages


Most frequent words in non-spam messages


Top 15 frequent bigrams for non-spam messages


Top 15 frequent bigrams for spam messages


Visualizing most frequent trigrams for non-spam messages


Visualizing most frequent trigrams for spam messages

Conclusion

Till now, we have learned how to start with cleaning and understanding data. This process needs to be done before any kind of text analysis. One should always start with cleaning the text and then move on to fetching tokens out of it. Getting tokens out of the text also requires excluding stop words. Also, we need to get all the other words into their basic morphological form using lemmatization. In the next blog, we will have a look at finding important words in the text data. We will also learn about word embeddings. In the end, we will finally build a classifier to separate spam SMS out.

Stay tuned!