KNNs (K-Nearest-Neighbours) in Python

The Nearest Neighbours algorithm addresses a search problem that was first formulated in the technical literature by Donald Knuth. The key idea is to find out to which class a new point in the search space belongs, whether the setting is binary classification, multi-class classification, regression (continuous output), unsupervised learning, or semi-supervised learning. Sounds mathematical? Let’s make it simple.

Imagine you are a shopkeeper who sells online, and you are trying to group your customers so that products recommended for them appear on the same page. A customer in India who buys a laptop is likely to also buy a mouse, a mouse-pad, speakers, laptop sleeves, laptop bags, and so on. So you are trying to group this customer into a category, a class. How do you do this when you have millions of customers and over 100,000 products? Manual programming would not be the way to go. Here, the nearest neighbours method comes to the rescue.

You can group your customers into classes (e.g. Laptop-Buyer, Gaming-Buyer, New-Mother, Children~10-years-old) and, based upon what other people in those classes have bought in the past, show them the items they are most likely to buy next, making their online shopping experience much easier and much more streamlined. How do you choose those items? By grouping your customers into classes and, when a new customer arrives, deciding which class they belong to and showing them the products relevant to that class.

This is the essence of the ML algorithm that platforms such as Amazon and Flipkart use for every customer. Their algorithms are much more complex, but this is their essence. 

The Nearest Neighbours topic can be divided into the following sub-topics:

  1. Brute-Force Search
  2. KD-Trees
  3. Ball-Trees
  4. K-Nearest Neighbours

Out of all of these, K-Nearest Neighbours (commonly abbreviated as KNN) is by far the most widely used.

K-Nearest Neighbours (KNNs)

A KNN algorithm is very simple, yet it can be used for some very complex applications and arcane dataset distributions. It can be used for binary classification, multi-class classification, regression, clustering, and even for creating new algorithms that are state-of-the-art research techniques (e.g. https://www.hindawi.com/journals/aans/2010/597373/ – a research paper on a fusion of KNNs and SVMs). Here, we will describe an application of KNNs to binary classification, on an extremely interesting dataset from the UCI Repository (sonar, mines vs. rocks).

Implementation

The algorithm of a KNN ML model is given below:

K-Nearest Neighbours

Again, mathematical! Let’s break it into small steps one at a time:

How the Algorithm Works

This explanation is for supervised learning binary classification.

Here we have two classes. We’ll call them A and B.

So the dataset is a collection of values which belong either to class A or class B.

A visual plot of the (arbitrary) data might look something like this:

Now, look at the star data point in the centre. To which class does it belong? A or B?

The answer? It varies according to the hyperparameters we use. In the above diagram, k is a hyperparameter.

Hyperparameters such as k significantly affect the output of a machine learning (ML) algorithm, which is why they need to be correctly tuned (set to the right values).

The algorithm then computes the ‘k’ points closest to the new point. The output is shown above when k = 3 and when k = 6 (k being the number of closest neighbouring points to indicate which class the new point belongs to).

Finally, we return as output the class held by the majority of those ‘k’ points closest to the new data point, where closeness is judged by a distance measure. The measures used include Euclidean distance among others.

This is how the K Nearest Neighbours algorithm works in principle. As you can see, visualizing the data is a big help to get an intuitive picture of what the k values should be.
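
To make these steps concrete, here is a minimal from-scratch sketch of the prediction step. The function and variable names are illustrative (not from the article’s own code), and it assumes NumPy arrays, Euclidean distance, and simple majority voting as described above:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority class among those k neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]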

Now, let’s see the K-Nearest-Neighbours Algorithm work in practice.

Note: This algorithm is powerful and highly versatile. It can be used for binary classification, multi-class classification, regression, clustering, and so on.  Many use-cases are available for this algorithm which is quite simple but remarkably powerful, so make sure you learn it well so that you can use it in your projects.

Obtain the Data and Preprocess it

We shall use the data from the UCI Repository, available at the following link:

http://archive.ics.uci.edu/ml/datasets/connectionist+bench+(sonar,+mines+vs.+rocks) 

It needs to be manually converted into a CSV file, which is available at the following link:

https://github.com/selva86/datasets/blob/master/Sonar.csv

This dataset consists of 208 underwater sonar readings that have to be classified as rocks or mines. Save the CSV file in the same directory as your Python source file and perform the following operations:

Import the required packages first:
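
The original imports were shown as an image; a minimal reconstruction for this walkthrough might look like the following (the exact imports in the original source file may differ):

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report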

Read the CSV dataset into your Python environment and check out the top 5 rows using the head() Pandas DataFrame function.
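
For example, assuming the file was saved as Sonar.csv in the working directory:

df = pd.read_csv('Sonar.csv')   # add header=None if your copy of the CSV has no header row
print(df.head())                # first 5 rows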

Sonar Reading for Classification ML Problem

Check how much data you have and what its dimensions are:
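
For example:

print(df.shape)   # roughly (208, 61): 60 numeric features plus the class-label column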

Now, the last column is a letter. We need to encode it into a numerical value. For this, we can use LabelEncoder, as below:
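
One way to do this, assuming the class label is the last column as in the linked CSV:

X = df.iloc[:, :-1].values                                 # the 60 numeric sonar features
y = LabelEncoder().fit_transform(df.iloc[:, -1].values)    # 'M'/'R' encoded as 0/1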

Now have a look at your target dataset. R (rock) and M (mine) have been converted into 1 and 0 respectively.

Examine your input numpy array, X, before any scaling:
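
For instance:

print(X[:5])              # first few rows of the raw feature array
print(X.min(), X.max())   # the sonar readings already lie in a small range (roughly 0 to 1)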

Execute the train_test_split partition function. This splits the inputs into four separate numpy arrays. We can control how the input data is split using the test_size or train_size parameters. Here the test_size parameter is set to 0.3, so 30% of the data goes into the test set and the remaining 70% into the training set. We train (fit) the ML model on the training arrays and see how accurate our models are on the test set. By default, the value is set to 0.25 (25% test, 75% train). Normally this sampling is randomized, so different results appear each time the code is run. Setting random_state to a fixed value (any fixed value) makes sure that the same split is obtained every time we execute the model.
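
A sketch of that call (random_state=0 is an arbitrary fixed value chosen for illustration):

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)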

Fit the KNN classifier to the dataset.

As of now, it is all right (at this level) to leave the defaults as they are. The KNeighborsClassifier has two parameters that you do need to know about: metric and p. Right now we just need the Manhattan distance, specified by metric = "minkowski" together with p = 1. Manhattan distance is the distance between two points measured along axes at right angles: in a plane with p1 at (x1, y1) and p2 at (x2, y2), it is |x1 – x2| + |y1 – y2|. (Source: https://xlinux.nist.gov/dads/HTML/manhattanDistance.html)
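
With the other defaults kept, the fit might look like this (n_neighbors=5 is simply scikit-learn's default, since the post does not state a different value of k):

knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=1)   # p=1 -> Manhattan distance
knn.fit(X_train, y_train)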

Output the statistical scoring of this classification model.
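
For example, using the metrics imported earlier:

y_pred = knn.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))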

Accuracy on the test set was 82%. Not bad, since my first implementation used a random forest classifier and its top score was just 72%!

The entire Python program is available as a downloadable text file (rename *.txt to *.py and you’re good to go) at the following link:

https://dimensionless.in/wp-content/uploads/2018/11/sonar.txt

To learn more about how k-nearest neighbours are used in practice, do check out the following excellent article on our blog:

https://dimensionless.in/spam-detection-with-natural-language-processing-part-3/ 

The following article is also an excellent reference for KNNs:

https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm 

Takeaways

K-Nearest-Neighbours is a powerful algorithm to have in your machine learning classification arsenal. It is so simple and effective that it is often the first model tried on a new classification problem. Use it, learn it in depth, and it will be incredibly useful to you throughout your data science career. I highly recommend the Wikipedia article, since it covers nearly all applications of KNNs and much more.

Finally, Understand the Power of Machine Learning

Imagine trying to analyse these sonar readings, each with 60 features, outside a machine learning environment. You would have to load roughly 208 × 61 ≈ 12,700 values and then develop an algorithm by hand to analyse the data!

Scikit-Learn, TensorFlow, Keras, PyTorch, and AutoKeras bring fantastic abilities to computers for problems that simply could not be solved before ML came along.

And this is just the beginning!

Automation is the Future

As ML and AI applications take root in our world more and more, many human tasks will be taken over by ‘intelligent’ software programs that perform operations like a human. We already have chatbots in many companies. Self-driving cars daily push the very limits of what we think a machine can do. The only question is, will you be on the side that is being replaced, or will you be at the new forefront of technological progress? Get reskilled. Or just start learning, today, as soon as you can.

Spam Detection with Natural Language Processing – Part 3

Building a spam detection classifier using Machine Learning and Neural Networks

Introduction

On our path to building an SMS SPAM classifier, we have so far converted our text data into numeric form with the help of a bag-of-words model. Using the TF-IDF approach, we now have numeric vectors that describe our text data.

In this blog, we will build a classifier that will help us identify whether an incoming message is spam or not. We will use both machine learning and neural network approaches to build the classifier. If you are jumping directly to this blog, then I recommend you go through part 1 and part 2 of this SPAM-classifier series first. The data used can be found here

Assessing the problem

Before jumping to machine learning, we need to identify what we actually wish to do! We need to build a binary classifier which will look at a text message and tell us whether that message is spam or not. So we need to pick machine learning models that will help us perform a classification task. Also note that this is a binary classification problem, as there are only two output classes into which our model will classify the texts (0 – the message is not spam, 1 – the message is spam).

We will build 3 machine learning classifiers, namely SVM, KNN, and Naive Bayes! We will implement each of them one by one and, in the end, have a look at the performance of each.

Building an SVM classifier (Support Vector Machine)

A Support Vector Machine (SVM) is a discriminative classifier which separates classes by forming hyperplanes. In other words, given labelled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. In two-dimensional space, this hyperplane is a line dividing the plane into two parts, with each class lying on either side.
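
The original code snippet was embedded as an image; the sketch below shows what such a classifier could look like on the TF-IDF matrix built in Part 2 of this series. X_tfidf and y are placeholder names for that feature matrix and the spam/ham label vector, and the 10-fold cross-validation, test_size and random_state values are illustrative choices:

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, cross_val_score

X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.3, random_state=0)

svm_clf = SVC(kernel='linear')
scores = cross_val_score(svm_clf, X_train, y_train, cv=10)   # 10-fold cross-validation
print(scores.mean(), scores.std())

svm_clf.fit(X_train, y_train)
print(svm_clf.score(X_test, y_test))                         # accuracy on the held-out test set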

Till now, we have trained our model on the training dataset and evaluated it on a test set (data which our model has never seen). We have also performed cross-validation over the classifier to make sure our trained model is free from bias and variance issues!

Our SVM model with the linear kernel on this data has a mean cross-validation accuracy of 97.61% with a standard deviation of 0.85%. Cross-validation is important for tuning the parameters of the model; in this case, we try the different kernels available with SVM and find the best-performing kernel in terms of accuracy. We have reserved a separate test set to measure how well the tuned model works on never-before-seen data points.

Building a KNN classifier (K-Nearest Neighbour)

K-Nearest Neighbours (KNN) is one of the simplest algorithms we use in machine learning for regression and classification problems. KNN classifies new data points based on a similarity measure (e.g. a distance function): classification is done by a majority vote among a point’s neighbours, and the data point is assigned to the class with the most members among its nearest neighbours. As you increase the number of nearest neighbours, the value of k, accuracy might increase.

Below is the code snippet for the KNN classifier.
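
Since the original snippet was an image, the following is an equivalent sketch, reusing the same (assumed) train/test split as in the SVM section; n_neighbors=5 is an illustrative default:

from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train, y_train)
print(knn_clf.score(X_test, y_test))   # accuracy on the test set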

Building a Naive Bayes Classifier

Naive Bayes Classifiers rely on the Bayes’ Theorem, which is based on conditional probability or in simple terms, the likelihood that an event (A) will happen given that another event (B) has already happened. Essentially, the theorem allows a hypothesis to be updated each time new evidence is introduced. The equation below expresses Bayes’ Theorem in the language of probability:
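
P(A | B) = [ P(B | A) × P(A) ] / P(B)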

Let’s explain what each of these terms means.

  • “P” is the symbol to denote probability.
  • P(A | B) = The probability of event A (hypothesis) occurring given that B (evidence) has occurred.
  • P(B | A) = The probability of event B (evidence) occurring given that A (hypothesis) has occurred.
  • P(A) = The probability of event A (hypothesis) occurring on its own.
  • P(B) = The probability of event B (evidence) occurring on its own.

Below is the code snippet for the multinomial Naive Bayes classifier.
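
Again as a sketch, on the same assumed train/test split of the TF-IDF features:

from sklearn.naive_bayes import MultinomialNB

nb_clf = MultinomialNB()
nb_clf.fit(X_train, y_train)
print(nb_clf.score(X_test, y_test))   # accuracy on the test set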

Evaluating the performance of our 3 classifiers

So far, we have implemented 3 classification algorithms for detecting SPAM messages:

  1. SVM (Support Vector Machine)
  2. KNN (K nearest neighbor)
  3. Multinomial Naive Bayes

SVM, with the highest accuracy (97%), looks like the most promising model for identifying SPAM messages. Anyone could say this just by looking at the accuracy, right? But that may not actually be the case. For classification problems, accuracy may not be the only metric you want to look at. Feeling confused? Allow me to introduce you to our friend the Confusion Matrix, which will eventually sort all your confusion out.

Confusion Matrix

A confusion matrix, also known as error matrix, is a table which we use to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm.
It allows easy identification of confusion between classes e.g. one class is commonly mislabeled as the other. Most performance measures are computed from the confusion matrix.

A confusion matrix is a summary of prediction results on a classification problem. The number of correct and incorrect predictions are summarized with count values and broken down by each class. This is the key to the confusion matrix. The confusion matrix shows the ways in which your classification model is confused when it makes predictions. It gives us insight not only into the errors being made by a classifier but more importantly the types of errors that are being made.

A sample confusion matrix for 2 classes

Definition of the Terms:

• Positive (P): Observation is positive (for example: is a SPAM).
• Negative (N): Observation is not positive (for example: is not a SPAM).
• True Positive (TP): Observation is positive, and the model predicted positive.
• False Negative (FN): Observation is positive, but the model predicted negative.
• True Negative (TN): Observation is negative, and the model predicted negative.
• False Positive (FP): Observation is negative, but the model predicted positive.

Let us bring in two other metrics, apart from accuracy, which will help us take a better look at our 3 models.

Recall:

Recall is the number of correctly classified positive examples divided by the total number of actual positive examples: Recall = TP / (TP + FN). A high recall indicates that the positive class is correctly recognized (a small number of FN).

Precision:

To get the value of precision, we divide the number of correctly classified positive examples by the total number of examples predicted as positive: Precision = TP / (TP + FP). A high precision indicates that an example labelled as positive is indeed positive (a small number of FP).

Let us have a look at the confusion matrix of our SVM classifier and try to understand it. After that, we will summarise the confusion matrices of all 3 of our classifiers.

Given below is the confusion matrix for the predictions our SVM model made on the test data (reading off the matrix: TN = 1446, FP = 3, FN = 19, TP = 204). Let us find the accuracy, precision and recall in this case.

Accuracy = (1446 + 204) / (1446 + 3 + 19 + 204) = 1650/1672 = 0.987, i.e. 98.7% accuracy

Recall = 204 / (204 + 19) = 204/223 = 0.915, i.e. 91.5% recall

Precision = 204 / (204 + 3) = 204/207 = 0.986, i.e. 98.6% precision
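
For reference, these quantities can also be computed directly with scikit-learn; y_test and the SVM predictions are the assumed inputs, and label 1 is taken to mean spam:

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

y_pred = svm_clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred))   # positive class = spam (label 1)
print(recall_score(y_test, y_pred))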

Understanding the ROC Curve

In Machine Learning, performance measurement is an essential task. So when it comes to a classification problem, we can count on the AUC – ROC curve. It is one of the most important evaluation metrics for checking any classification model’s performance. It is also written as AUROC (Area Under the Receiver Operating Characteristic curve).

The AUC – ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability: it tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. By analogy, the higher the AUC, the better the model is at distinguishing between patients with a disease and patients without it.

We plot the ROC curve with TPR (true positive rate) on the y-axis against FPR (false positive rate) on the x-axis.

Plotting ROC curves for the SVM classifier

Let us have a look at the ROC curve of our SVM classifier
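
The plot itself was an image in the original post; a typical way to produce such a curve with scikit-learn and matplotlib is sketched below. It assumes the svm_clf and test split defined earlier and uses decision_function, which SVC provides:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

svm_scores = svm_clf.decision_function(X_test)
fpr, tpr, _ = roc_curve(y_test, svm_scores)
print('AUC:', auc(fpr, tpr))

plt.plot(fpr, tpr, label='SVM')
plt.plot([0, 1], [0, 1], linestyle='--')   # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()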

Always remember that the closer AUC (Area under the curve) is to value 1, the better the classification ability of the classifier. Furthermore, let us also have a look at the ROC curve of our KNN and Naive Bayes classifier too!

The graph on the left is for KNN and on the right is for Naive Bayes classifier. This clearly indicates that Naive Bayes classifier, in this case, is much more efficient than our KNN classifier as it has a higher AUC value!

Conclusion

In this series, we went from understanding NLP from scratch to building our own SPAM classifier over text data. This is an ideal way to start learning NLP, as it covers the basics of NLP, word embeddings, numeric representations of text data, and modelling over those numeric representations. You can also try neural networks for NLP, as they are able to achieve good performance! Stay tuned for more on NLP in coming blogs.