Nearest-neighbour search is an optimization problem that was first posed in the technical literature by Donald Knuth (as the "post-office problem"). The key idea is to work out which group, or class, a new point in the search space belongs to, whether the setting is binary, multi-class, continuous (regression), unsupervised, or semi-supervised. Sounds mathematical? Let’s make it simple.
Imagine you are a shopkeeper who sells online, and you are trying to group your customers so that the products recommended to them appear on the same page. Thus, a customer in India who buys a laptop is also likely to buy a mouse, a mouse-pad, speakers, laptop sleeves, laptop bags, and so on. You are trying to group this customer into a category: a class. How do you do this when you have millions of customers and over 100,000 products? Manual programming is not the way to go. Here, the nearest neighbours method comes to the rescue.
You can group your customers into classes (e.g. Laptop-Buyer, Gaming-Buyer, New-Mother, Children-around-10-years-old) and, based on what other people in those classes have bought in the past, show them the items they are most likely to buy next, making their online shopping experience easier and more streamlined. How do you do that? By grouping your customers into classes and, when a new customer arrives, deciding which class he belongs to and showing him the products relevant to that class.
This is the essence of the ML algorithm that platforms such as Amazon and Flipkart use for every customer. Their algorithms are much more complex, but this is their essence.
The Nearest Neighbours topic can be divided into the following sub-topics:
- Brute-Force Search
- KD-Trees
- Ball-Trees
- K-Nearest Neighbours
Out of all of these, K-Nearest Neighbours (commonly abbreviated to KNN) is by far the most widely used.
K-Nearest Neighbours (KNNs)
A KNN algorithm is very simple, yet it can be used for some very complex applications and unusual dataset distributions. It can be used for binary classification, multi-class classification, regression, clustering, and even for building new state-of-the-art research techniques (e.g. https://www.hindawi.com/journals/aans/2010/597373/, a research paper on a fusion of KNNs and SVMs). Here, we will describe an application of KNN to binary classification, on an extremely interesting dataset from the UCI Repository (sonar, mines vs. rocks).
Implementation
The algorithm of a KNN ML model is given below:
Again, mathematical! Let’s break it into small steps one at a time:
How the Algorithm Works
This explanation is for supervised learning binary classification.
Here we have two classes. We’ll call them A and B.
So the dataset is a collection of points, each of which belongs either to class A or class B.
A visual plot of the (arbitrary) data might look something like this:
Now, look at the star data point in the centre. To which class does it belong? A or B?
The answer? It varies according to the hyperparameters we use. In the above diagram, k is a hyperparameter.
Hyperparameters are settings we choose before training; when correctly tuned (set to the right values), they significantly affect the output of a machine learning (ML) algorithm.
The algorithm then finds the ‘k’ points closest to the new point, k being the number of neighbouring points that vote on which class the new point belongs to. The output is shown above for k = 3 and for k = 6.
Finally, we assign the new data point to the class that is most common among its k nearest neighbours. Closeness can be measured in various ways; Euclidean distance is the most common choice.
This is how the K Nearest Neighbours algorithm works in principle. As you can see, visualizing the data is a big help to get an intuitive picture of what the k values should be.
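To make this concrete, here is a minimal from-scratch sketch in plain NumPy. It assumes Euclidean distance and a simple majority vote; the function knn_predict and the toy points are purely illustrative, not part of any library:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, new_point, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - new_point, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those k neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: class 'A' clustered near the origin, class 'B' near (5, 5)
X_train = np.array([[1, 1], [1, 2], [2, 1], [5, 5], [6, 5], [5, 6]])
y_train = np.array(['A', 'A', 'A', 'B', 'B', 'B'])

print(knn_predict(X_train, y_train, np.array([1.5, 1.5]), k=3))  # 'A'
print(knn_predict(X_train, y_train, np.array([5.5, 5.5]), k=3))  # 'B'

Scikit-learn does exactly this (plus clever indexing structures such as KD-Trees and Ball-Trees so that it scales to large datasets), which is what we use next.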
Now, let’s see the K-Nearest-Neighbours Algorithm work in practice.
Note: This algorithm is powerful and highly versatile. It can be used for binary classification, multi-class classification, regression, clustering, and so on. Many use-cases exist for this simple but remarkably powerful algorithm, so make sure you learn it well enough to use it in your own projects.
Obtain the Data and Preprocess it
We shall use the data from the UCI Repository, available at the following link:
http://archive.ics.uci.edu/ml/datasets/connectionist+bench+(sonar,+mines+vs.+rocks)
It needs to be manually converted into a CSV file, which is available at the following link:
https://github.com/selva86/datasets/blob/master/Sonar.csv
This data is a set of 208 underwater sonar readings that have to be classified as rocks or mines. Save the CSV file in the same directory as your Python source file and perform the following operations:
Import the required packages first:
import numpy as np
import pandas as pd
import scipy as sp
from datetime import datetime
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
Read the CSV dataset into your Python environment and check out the top 5 rows using the head() Pandas DataFrame function. (Note that the file has no header row; as read below, the first reading becomes the column labels, leaving 207 rows. Pass header=None to read_csv if you want to keep all 208.)
> df = pd.read_csv("sonar.all-data.csv")
> df.head()
   0.0200  0.0371  0.0428  0.0207  0.0954  ...  0.0180  0.0084  0.0090  0.0032  R
0  0.0453  0.0523  0.0843  0.0689  0.1183  ...  0.0140  0.0049  0.0052  0.0044  R
1  0.0262  0.0582  0.1099  0.1083  0.0974  ...  0.0316  0.0164  0.0095  0.0078  R
2  0.0100  0.0171  0.0623  0.0205  0.0205  ...  0.0050  0.0044  0.0040  0.0117  R
3  0.0762  0.0666  0.0481  0.0394  0.0590  ...  0.0072  0.0048  0.0107  0.0094  R
4  0.0286  0.0453  0.0277  0.0174  0.0384  ...  0.0057  0.0027  0.0051  0.0062  R
Check how much data you have and what its dimensions are:
> df.describe()
           0.0200      0.0371  ...      0.0090      0.0032
count  207.000000  207.000000  ...  207.000000  207.000000
mean     0.029208    0.038443  ...    0.007936    0.006523
std      0.023038    0.033040  ...    0.006196    0.005038
min      0.001500    0.000600  ...    0.000100    0.000600
25%      0.013300    0.016400  ...    0.003650    0.003100
50%      0.022800    0.030800  ...    0.006300    0.005300
75%      0.035800    0.048100  ...    0.010350    0.008550
max      0.137100    0.233900  ...    0.036400    0.043900

> df.shape
(207, 61)
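It is also worth checking how balanced the two classes are before training. A quick (illustrative) way is to count the labels in the last column:

> df.iloc[:, -1].value_counts()

In the full dataset the two classes are roughly balanced (about 111 mines to 97 rocks), so plain accuracy is a reasonable first metric here.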
Now, the last column is a letter. We need to encode it into a numerical value. For this, we can use LabelEncoder, as below:
# Inputs (data values): the sonar readings. All columns except the last are features.
X = df.values[:, 0:-1].astype(float)

# Classification target: the last column (labelled "R" in this frame)
target = df.R

# Convert the classes M (Mine) and R (Rock) to numbers, since they're categorical values
le = LabelEncoder()
le.fit(["R", "M"])
y = le.transform(target)
Now have a look at your target array. R (rock) and M (mine) have been converted into 1 and 0 respectively.
y array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
Examine your (unscaled) input NumPy array, X:
X
array([[0.0453, 0.0523, 0.0843, ..., 0.0049, 0.0052, 0.0044],
       [0.0262, 0.0582, 0.1099, ..., 0.0164, 0.0095, 0.0078],
       [0.01  , 0.0171, 0.0623, ..., 0.0044, 0.004 , 0.0117],
       ...,
       [0.0522, 0.0437, 0.018 , ..., 0.0138, 0.0077, 0.0031],
       [0.0303, 0.0353, 0.049 , ..., 0.0079, 0.0036, 0.0048],
       [0.026 , 0.0363, 0.0136, ..., 0.0036, 0.0061, 0.0115]])
Execute the train_test_split partition function. This splits the inputs into 4 separate NumPy arrays. We control how the data is split using the test_size (or train_size) parameter. Here test_size is set to 0.3, so 30% of the data goes into the test set and the remaining 70% into the training set. We train (fit) the ML model on the training arrays and then see how accurate our model is on the test set. By default, test_size is 0.25 (a 25%/75% split). Normally this sampling is randomized, so different results appear each time the code is run; setting random_state to a fixed value (any fixed value) ensures that the same split is obtained every time we execute the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
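As a quick sanity check (purely illustrative), you can print the shapes of the resulting arrays; with 207 rows and test_size=0.3 the split works out to 144 training and 63 test rows:

> print(X_train.shape, X_test.shape)  # should give (144, 60) (63, 60)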
Fit the KNN classifier to the dataset.
# Train a k-neighbours classifier
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=1)

# Fit the model
clf.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=1,
                     weights='uniform')
For now, it is fine (at this level) to leave the defaults as they are. The output of the KNeighborsClassifier has two values that you do need to know: metric and p. Setting metric = "minkowski" with p = 1 gives the Manhattan distance, which is the distance between two points measured along axes at right angles: in a plane with p1 at (x1, y1) and p2 at (x2, y2), it is |x1 – x2| + |y1 – y2|. (Source: https://xlinux.nist.gov/dads/HTML/manhattanDistance.html)
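As a quick check of that formula, here is a small (illustrative) snippet that computes the Manhattan distance between two points by hand and via SciPy's cityblock function; the points p1 and p2 are made up for the example:

from scipy.spatial.distance import cityblock

p1, p2 = (1.0, 2.0), (4.0, 6.0)
# |x1 - x2| + |y1 - y2| = 3 + 4 = 7
manual = abs(p1[0] - p2[0]) + abs(p1[1] - p2[1])
print(manual, cityblock(p1, p2))  # both print 7.0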
Output the statistical scoring of this classification model.
predicted = clf.predict(X_test)

print("Accuracy:")
print(accuracy_score(y_test, predicted))
print("Confusion Matrix:")
print(confusion_matrix(y_test, predicted))
print("Classification Report:")
print(classification_report(y_test, predicted))

Accuracy:
0.8253968253968254
Confusion Matrix:
[[32  3]
 [ 8 20]]
Classification Report:
             precision    recall  f1-score   support

          0       0.80      0.91      0.85        35
          1       0.87      0.71      0.78        28

avg / total       0.83      0.83      0.82        63
Accuracy on the test set was 82%. Not bad, considering that my first attempt used a random forest classifier and its top score was just 72%!
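If you want to squeeze out a little more accuracy, the natural next step is to tune the hyperparameter k. A minimal sketch using cross-validation on the training set is shown below; the exact best value of k will depend on your random split, so treat this as illustrative rather than definitive:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Try odd values of k to avoid ties in the majority vote
for k in [1, 3, 5, 7, 9, 11]:
    scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k, metric="minkowski", p=1),
        X_train, y_train, cv=5)
    print(k, scores.mean())

Pick the k with the best mean cross-validation score, refit on the full training set, and only then evaluate once on the held-out test set.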
The entire program as source code in Python is available here as a downloadable sonar-classification.txt file (rename *.txt to *.py and you’re good to go.):
https://dimensionless.in/wp-content/uploads/2018/11/sonar.txt
To learn more about how k-nearest neighbours are used in practice, do check out the following excellent article on our blog:
https://dimensionless.in/spam-detection-with-natural-language-processing-part-3/
The following article is also an excellent reference for KNNs:
https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
Takeaways
K-Nearest-Neighbours is a powerful algorithm to have in your machine learning classification arsenal, and it is often the first model practitioners reach for when exploring a new classification problem. Use it, learn it in depth, and it will serve you throughout your data science career. I highly recommend the Wikipedia article, since it covers nearly all applications of KNNs and much more.
Finally, Understand the Power of Machine Learning
Imagine trying to classify these sonar readings, each with 60 features, without machine learning. You would have to load roughly 207 x 61 ≈ 12,600 values and then develop an algorithm by hand to analyse the data!
Scikit-Learn, TensorFlow, Keras, PyTorch, and AutoKeras bring such fantastic abilities to computers, tackling problems that simply could not be solved before ML came along.
And this is just the beginning!
Automation is the Future
As ML and AI applications take root in our world, more and more human tasks will be performed by ‘intelligent’ software programs. We already have chatbots in many companies. Self-driving cars push the very limits of what we think a machine can do every day. The only question is: will you be on the side that is being replaced, or will you be at the new forefront of technological progress? Get reskilled. Or just start learning, today, as soon as you can.