
Spam Detection with Natural Language Processing-Part 2

Understanding TF-IDF and Word Embeddings


In the last blog, we looked at visualizing text data and covered some basic concepts of tokenization and lemmatization. We wrote Python functions to perform all of those operations for us. If you are jumping straight into this blog, I highly recommend you go through the previous post, in which we discussed the problem statement and some foundational concepts of NLP.

We will be covering the following topics:

  1. Understanding TF-IDF
  2. Finding important words using TF-IDF
  3. Understanding the Bag of Words model
  4. Understanding word embeddings
  5. Different types of word embeddings
  6. Difference between word embeddings and the Bag of Words model
  7. Preparing a word embedding for the spam classifier

Introduction

Previously, we found the most frequent words, bigrams, and trigrams in the spam and non-spam messages separately. Now we also need to find important words that can by themselves indicate whether a message is spam or not. Note that the most frequent word in a set of messages is not necessarily a keyword that determines what the sentence is about.

For example, in a business article, words like business, investment, and acquisition are important words that relate a sentence to a business article. Other words like money, good, and building may occur frequently in the messages, but they do not provide much relevant information.

To find the important words, we will use a method known as Term Frequency-Inverse Document Frequency (TF-IDF).

What is TF-IDF?

TF-IDF stands for term frequency-inverse document frequency, and the TF-IDF weight is often used in information retrieval and text mining.

TF means Term Frequency. It measures how frequently a term occurs in a document. Since every document is different in length, it is possible that a term appears many more times in long documents than in shorter ones. Thus, the term frequency is often divided by the document length as a way of normalization.

TF = (Number of times term w appears in a document) / (Total number of terms in the document)

The second part, IDF, stands for Inverse Document Frequency. It measures how important a term is. While computing TF, all terms are treated as equally important. However, certain terms, such as “is”, “of”, and “that”, may appear many times but carry little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones.

IDF =  log_e(Total number of documents / Number of documents with term w in it)

We calculate the final TF-IDF score by multiplying the TF score with the IDF score for every word. Finally, we can filter out important words by selecting the words with a higher TF-IDF score.
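Before turning to a library, the two formulas above can be verified by hand. Below is a minimal sketch in plain Python, using a small made-up corpus of three messages purely for illustration.

import math

documents = [
    "free entry in a weekly competition",
    "call me when you are free",
    "see you at the meeting tomorrow",
]
tokenised = [doc.split() for doc in documents]

def tf(word, doc_tokens):
    # number of times the word appears, divided by the document length
    return doc_tokens.count(word) / len(doc_tokens)

def idf(word, corpus):
    # log of (total documents / documents containing the word)
    n_containing = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / n_containing)

# TF-IDF of "free" in the first document: (1/6) * log(3/2)
print(tf("free", tokenised[0]) * idf("free", tokenised))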

Code Implementation

An example of calculating the TF-IDF score for different words:

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

sentences = ["Ironman movie is really good. Ironman is always my favourite",
             "Titanic movie is very boring",
             "Thor movie is really good"]

# Fit the TF-IDF vectoriser on the corpus and view each word's score per sentence
tfidf = TfidfVectorizer()
features = tfidf.fit_transform(sentences)
pd.DataFrame(features.toarray(), columns=tfidf.get_feature_names_out())  # use get_feature_names() on scikit-learn < 1.0

Finding Important words using Tf-IDF

Now we need to find the most important words in both the spam and the non-spam messages, and then look at those words in the form of a word cloud. Analysing those words will help us understand why a particular message has been marked as spam and another as non-spam.

First, we import the necessary libraries. Then we write a function that returns the TF-IDF score for every word in the corpus, along with the dictionary that maps word ids back to words.

from gensim import corpora, models

def get_tfidf_matrix(documents):
    # my_tokeniser was defined in the previous part of this series
    documents = [my_tokeniser(document) for document in documents]
    # Map every unique token to an integer id
    dictionary = corpora.Dictionary(documents)
    # Convert each tokenised message into a bag-of-words representation
    corpus = [dictionary.doc2bow(text) for text in documents]
    # Fit the TF-IDF model and transform the whole corpus
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    # Return the dictionary as well, so scores can be mapped back to words
    return dictionary, corpus_tfidf

Then we need to map all the scores back to the words in the corpus in order to find the most important words.

def get_tfidf_score_dataframe(sentiment_label):
    messages = training_dataset[training_dataset["Sentiment"] == sentiment_label]["Phrase"]
    dictionary, frames = get_tfidf_matrix(messages)
    all_words = []
    all_scores = []
    # Each frame is a list of (word id, TF-IDF score) pairs for one message
    for frame in frames:
        for word_id, score in frame:
            all_words.append(dictionary[word_id])   # map the id back to the word
            all_scores.append(score)
    tf_idf_frame = pd.DataFrame({
        'Words': all_words,
        'Score': all_scores
    })
    return tf_idf_frame

Finally, we plot all the important words in the form of a word cloud.

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

def plot_tf_idf_wordcloud(sentiment_label):
    tf_idf_frame = get_tfidf_score_dataframe(sentiment_label)
    sorted_tf_idf_frame = tf_idf_frame.sort_values("Score", ascending=False)
    # Keep the words with the maximum possible TF-IDF score of 1
    important_words = sorted_tf_idf_frame[sorted_tf_idf_frame["Score"] == 1]["Words"].unique()
    comment_words = ' '.join(important_words)
    wordcloud = WordCloud(width=800, height=800,
                          background_color='white',
                          stopwords=STOPWORDS,
                          min_font_size=10).generate(comment_words)
    plt.figure(figsize=(8, 8), facecolor=None)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()

Plotting Important words for non-spam messages

plot_tf_idf_wordcloud(sentiment_label=0)


Plotting Important words for spam messages

plot_tf_idf_wordcloud(sentiment_label=1)

Understanding Bag of Words

We need a way to represent text data for the machine learning algorithm and the bag-of-words model helps us to achieve that task. The bag-of-words model is simple to understand and implement. It is a way of extracting features from the text for use in machine learning algorithms.

A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

  1. A vocabulary of known words.
  2. A measure of the presence of known words.

The vocabulary can be obtained by tokenising the messages into unique tokens. After getting the tokens, we need to score each one. This can be done in the following ways:

  • Counts. Count the number of times each word appears in a document.
  • Frequencies. Calculate the frequency that each word appears in a document out of all the words in the document.
  • TF-IDF. Multiply the TF score by the IDF score for each word, as described above.

How BoW works

Forming the vector

Take for example two text samples: “The quick brown fox jumps over the lazy dog” and “Never jump over the lazy dog quickly”.

The corpus (the text samples) then forms a dictionary:

{
    'brown': 0,
    'dog': 1,
    'fox': 2,
    'jump': 3,
    'jumps': 4,
    'lazy': 5,
    'never': 6,
    'over': 7,
    'quick': 8,
    'quickly': 9,
    'the': 10,
}

Vectors are then formed to represent the count of each word. In this case, each text (i.e. each sentence) generates an 11-element vector like so:

[1,1,1,0,1,1,0,1,1,0,2]
[0,1,0,1,0,1,1,1,0,1,1]

Each element represents the number of occurrences of the corresponding dictionary word in the text sample. So, in the first sentence, there is 1 count for “brown”, 1 count for “dog”, 1 count for “fox” and so on (the first vector), whereas the second vector shows 0 counts for “brown”, 1 count for “dog”, 0 counts for “fox”, and so forth.
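The same vocabulary and count vectors can be reproduced with scikit-learn's CountVectorizer; here is a minimal sketch using the two sample sentences above.

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["The quick brown fox jumps over the lazy dog",
          "Never jump over the lazy dog quickly"]

# Build the vocabulary and the count vectors in one step
vectoriser = CountVectorizer()
vectors = vectoriser.fit_transform(corpus)

print(vectoriser.vocabulary_)   # word -> column index mapping
print(vectors.toarray())        # one count vector per sentence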

Understanding Word Vectors

Word vectors are simply vectors of numbers that represent the meaning of a word.

Traditional approaches to NLP, such as one-hot encodings, do not capture syntactic (structure) and semantic (meaning) relationships across collections of words and, therefore, represent language in a very naive way.

Word vectors represent words as multidimensional continuous floating point numbers where semantically similar words are mapped to proximate points in geometric space. In simpler terms, a word vector is a row of real-valued numbers (as opposed to the dummy 0/1 values of a one-hot encoding) where each dimension captures an aspect of the word’s meaning and where semantically similar words have similar vectors. This means that words such as wheel and engine should have word vectors similar to that of car (because of the similarity of their meanings), whereas the word banana should be quite distant.

A simple representation of word vectors

Now we will look at an example of using word vectors, where we group words with similar semantics together.

import numpy as np
import spacy
import pandas as pd
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load a spaCy model that ships with word vectors, e.g. en_core_web_md
# (the older shorthand "en" has been removed in spaCy 3)
nlp = spacy.load("en_core_web_md")

sentence = "Tiger was driving a car when he saw a fox taking the charge on a bike but in the end giraffe won the race using his aircraft"
tokens = nlp(sentence)

# Keep only tokens that have a vector, along with the matching words
words = [token.text for token in tokens if token.has_vector]
vectors = np.vstack([token.vector for token in tokens if token.has_vector])

# Project the high-dimensional word vectors down to 2 dimensions for plotting
pca = PCA(n_components=2)
vecs_transformed = pca.fit_transform(vectors)

d = pd.DataFrame({"Name": words, "V1": vecs_transformed[:, 0], "V2": vecs_transformed[:, 1]})

plt.figure(figsize=(16, 10))
plt.scatter(d["V1"], d["V2"])
for i, txt in enumerate(d["Name"]):
    plt.annotate(txt, (d["V1"][i], d["V2"][i]))
plt.show()

Preparing a bag of words model for Analysis

Below is the code snippet for converting our messages into a table of numerical word vectors (here scored with TF-IDF weights rather than raw counts). Only after achieving this can we build our classifier using machine learning, since machine learning always needs numerical inputs!

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
vectoriser = TfidfVectorizer(decode_error="ignore")
X = vectoriser.fit_transform(list(training_dataset["comment"]))
y = training_dataset["b_labels"]

## Output
print(repr(X))
## <5572x8672 sparse matrix of type '<class 'numpy.float64'>'
##	with 73916 stored elements in Compressed Sparse Row format>
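One point worth noting: when new messages arrive, they must be transformed with the already fitted vectoriser so that the same vocabulary (and hence the same 8672 columns) is used. A small illustration with a made-up message:

# Reuse the fitted vocabulary; do not call fit_transform again on new data
new_message = ["Congratulations! You have won a free prize, call now"]
X_new = vectoriser.transform(new_message)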

Conclusion and Further steps

Till now, we have learnt to perform EDA on text data. We have also learnt about important terms in NLP like tokenization, lemmatization, stop-words, TF-IDF, the bag of words model, and word vectors. These terms are essential to mastering NLP. After having our word embedding ready, we will proceed to actually build machine learning models. They will help us predict whether a message is spam or not. In the next blog, we will build machine learning and neural network models and compare their performance. We will understand the shortcomings of a plain neural net in the case of text mining. Finally, we will move to recurrent neural networks and LSTMs to wrap up the series!

Click Here for Part 1 of the article.

Stay tuned!

Machine Learning and Deep Learning: Differences

Are you intrigued by the buzzwords Machine Learning and Deep Learning, but have always found them ambiguous and often used interchangeably?

Machine learning vs Deep Learning

If yes, you are at the right place.

Let’s discuss the terms Machine Learning (ML) and Deep Learning (DL) and understand the subtle differences between them.

The formal definition of ML, given by Arthur Samuel, says it “provides computers with the capability to learn and take decisions without being explicitly programmed”.

Applications of ML include fraud detection, Netflix movie recommendations, etc., whereas Deep Learning can be defined as an “advanced subset of Machine Learning” in which neural networks adapt and learn from vast amounts of data. It can be used to solve complex real-world problems such as self-driving cars and cancer detection. Let’s discuss each of them in detail.


What is Machine Learning?

As the name suggests, Machine Learning is all about machines that learn. The question here is: how do they learn? Machine Learning uses a mathematical function to construct a model based on training data, which is then used to make predictions for unseen data.

ML can be applied to a variety of domains such as finance, HR, aerospace, and pharmaceuticals. There is a huge number of sophisticated algorithms available today to train computers, depending on the business problem: Linear Regression, Logistic Regression, Random Forests, Support Vector Machines, neural networks, and so on. When there is a humongous amount of data available, the most intricate part is selecting the correct algorithm to solve the problem. Each model has its own pros and cons and should be selected depending on the type of problem at hand and the data available. We will not go into the nitty-gritty of each one of them.

Let’s try to build a predictive model for the HR department of company XYZ to understand the concept of Machine Learning better. The aim of the model is to predict the number of employees who will leave the company in the next five years, based on factors such as work satisfaction, salary increment, number of hours spent in the office, promotion rate, last evaluation, etc. The model also predicts the major cause due to which employees are leaving the company. In this way, the machine learning model will help the company take the best measures to retain its employees over the next five years.
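A minimal sketch of such a model with scikit-learn is given below. The hr_data DataFrame and its column names are hypothetical placeholders for whatever the HR department actually records.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Hypothetical HR dataset: column names are placeholders for illustration only
features = hr_data[["work_satisfaction", "salary_increment", "hours_in_office",
                    "promotion_rate", "last_evaluation"]]
target = hr_data["left_company"]   # 1 if the employee left, 0 otherwise

X_train, X_test, y_train, y_test = train_test_split(features, target,
                                                    test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))   # accuracy on held-out employees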

Neural Networks

One of the most important concepts in Machine Learning is the Neural Network. Let’s talk a bit about neural networks to take one step towards the term “Deep Learning”. The idea of neural networks evolved from the dream of developing algorithms that try to mimic the human brain. The simplest definition of a neural network, provided by Dr. Robert Hecht-Nielsen, is “a computing system made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs”.

Why do we need Neural Networks?

ML algorithms such as Linear Regression and Logistic Regression become convoluted when learning complex non-linear hypotheses. These algorithms hold good when the number of features is small, which is rarely the case in real-world ML problems. Usually, the number of features is more than 100, so a quadratic polynomial over them already has on the order of 5,000 terms (as the quick calculation below shows), which becomes computationally expensive. Moreover, the problem of overfitting arises, because of which the hypothesis produced by these algorithms cannot be generalized to new inputs. In such cases, these machine learning algorithms can perform poorly.
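The quick calculation below shows where that figure comes from: with 100 features, keeping every original feature, every squared term, and every pairwise product already gives more than 5,000 terms.

from math import comb

n_features = 100
# original features + squared terms + pairwise products
n_quadratic_terms = n_features + n_features + comb(n_features, 2)
print(n_quadratic_terms)   # 5150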

Let’s understand the big picture of Neuron model:

 

Machine Learning Model

Deep Learning Model

Neuron Model:

 

The neuron model basically consists of three layers: an input layer, a hidden layer, and an output layer. Training data is provided to the input layer. The hidden layer is the computational unit, commonly using a sigmoid activation function. The computational unit is nothing but the model that takes the input from the input layer and develops the hypothesis, which is channelled down to the output layer. An aggregation of a large number of such neuron models is collectively known as a neural network.
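A single neuron of this kind can be written in a few lines of NumPy; the weights and inputs below are made-up numbers purely for illustration.

import numpy as np

def sigmoid(z):
    # squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-z))

# input layer: three feature values for one example (made-up numbers)
x = np.array([0.7, 0.2, 0.5])
# one weight per input, plus a bias term
w = np.array([0.4, -0.6, 0.9])
b = 0.1

# the neuron computes a weighted sum of its inputs and applies the activation
output = sigmoid(np.dot(w, x) + b)
print(output)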

Let’s refer back to the model that predicts the number of employees who will leave the company in the next five years. Machine Learning algorithms such as Linear Regression and Logistic Regression hold good in this case because we have a limited number of features, such as work satisfaction and salary increment, and limited training data available. But if we add a hundred more features, such as department, time spent in the company, and work accidents, the model becomes non-linear and can no longer be handled well by the above-stated algorithms. This limitation can be overcome by the implementation of neural networks.

As the number of features and the amount of training data grow, the complexity of the problem increases and the efficiency of a shallow neural network starts degrading. To improve the performance and capabilities of neural networks, the number of hidden layers is increased. That’s where the buzzword “Deep Learning” comes into the picture.

Deep Learning Models:

 

Artificial Neural Networks used in classical Machine Learning typically have one or two hidden layers, whereas Deep Learning models consist of multiple hidden layers, which has helped improve the state of the art to a great extent. Adding multiple hidden layers makes the network deep, and that is what is called Deep Learning.
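To make the idea of multiple hidden layers concrete, here is a minimal Keras sketch; the 100-feature input and the layer sizes are illustrative assumptions, not a tuned architecture.

from tensorflow import keras
from tensorflow.keras import layers

# a deep model: several hidden layers stacked between input and output
model = keras.Sequential([
    layers.Input(shape=(100,)),            # 100 input features (assumed)
    layers.Dense(64, activation="relu"),   # hidden layer 1
    layers.Dense(32, activation="relu"),   # hidden layer 2
    layers.Dense(16, activation="relu"),   # hidden layer 3
    layers.Dense(1, activation="sigmoid")  # output layer for a binary label
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()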

Another major difference between Machine Learning and Deep Learning is feature extraction. Domain expertise is critical for a machine learning algorithm, since a lot of preprocessing is required to clean the data and extract the useful features that are then used to train the model. On the contrary, Deep Learning models learn to extract features from the raw data themselves. As discussed above, DL models consist of multiple hidden layers, and each hidden layer learns progressively more abstract features. DL models thus cut down the time-consuming and arduous task of manual feature extraction.

There is a huge number of ML algorithms that can be applied across a wide range of domains, for example K-means clustering and K-nearest neighbours. The choice of algorithm varies widely in ML depending on the application, whereas in DL the same piece of software can be trained for language translation as well as voice cloning; it all depends on the type of data that is fed to the computer.

Let’s talk about the real-world example of language translators to compare the efficiency of ML and DL models. Google launched a Chinese-to-English translator that translated sentences phrase by phrase, which is far from how humans translate; its accuracy was around 78%. An upgraded version of the translator, based on a Deep Learning model, was launched more recently. It translates sentence by sentence rather than phrase by phrase, which increased the accuracy of the model to about 91%. DL models have also helped increase accuracy in image recognition: roughly 89% for an ML-based model, rising to about 93% with a Deep Learning model.

This high accuracy comes at the expense of stringent requirements:

  1. Requirement of a huge amount of data to train the neural network with multiple layers.
  2. Computationally expensive to train large scale neural networks.
  3. It can take hours (or even days) to train a Deep Learning model, since a large amount of data must be processed to fit the model's parameters.

Although the above requirements can be met with technological advancements, it is very important to choose the optimal model depending upon the business problem and the data available. Otherwise, DL models can be overkill for trivial applications such as spam email detection or movie recommendations.