KNNs (K-Nearest-Neighbours) in Python

The Nearest Neighbours algorithm addresses a problem first posed in the technical literature by Donald Knuth (the "post-office problem"). The key idea is to find out which group of classes a given point in the search space belongs to, whether in a binary-class, multi-class, continuous, unsupervised, or semi-supervised setting. Sounds mathematical? Let's make it simple.

Imagine you are a shopkeeper who sells online, and you are trying to group your customers so that the products recommended for them appear on the same page. A customer in India who buys a laptop will likely also buy a mouse, a mouse-pad, speakers, laptop sleeves, laptop bags, and so on. So you are trying to group this customer into a category, a class. How do you do this when you have millions of customers and over 100,000 products? Manual programming would not be the way to go. Here, the nearest neighbours method comes to the rescue.

You can group your customers into classes (e.g. Laptop-Buyer, Gaming-Buyer, New-Mother, Children~10-years-old), and, based upon what other people in those classes have bought in the past, you can show each customer the items they are most likely to buy next, making their online shopping experience easier and more streamlined. How do you do that? By grouping your customers into classes, and, when a new customer arrives, deciding which class he belongs to and showing him the products relevant to that class.

This is the essence of the ML algorithm that platforms such as Amazon and Flipkart use for every customer. Their algorithms are much more complex, but this is their essence. 

The Nearest Neighbours topic can be divided into the following sub-topics:

  1. Brute-Force Search
  2. KD-Trees
  3. Ball-Trees
  4. K-Nearest Neighbours

Out of all of these, K-Nearest Neighbours (usually referred to as KNN) is by far the most commonly used.

K-Nearest Neighbours (KNNs)

A KNN algorithm is very simple, yet it can be used for some very complex applications and arcane dataset distributions. It can be used for binary classification, multi-class classification, regression, clustering, and even for creating new algorithms that are state-of-the-art research techniques (e.g. https://www.hindawi.com/journals/aans/2010/597373/ – a research paper on a fusion of KNNs and SVMs). Here, we will apply KNN to binary classification, on an extremely interesting dataset from the UCI Repository (sonar, mines vs. rocks).

Implementation

The algorithm of a KNN ML model is given below:

[Figure: the K-Nearest Neighbours algorithm]

Again, mathematical! Let’s break it into small steps one at a time:

How the Algorithm Works

This explanation is for supervised learning binary classification.

Here we have two classes. We’ll call them A and B.

So the dataset is a collection of values which belong either to class A or class B.

A visual plot of the (arbitrary) data might look something like this:

[Figure: scatter plot of classes A and B, with a new unlabelled (star) point and its k = 3 and k = 6 neighbourhoods]

Now, look at the star data point in the centre. To which class does it belong? A or B?

The answer? It varies according to the hyperparameters we use. In the above diagram, k is a hyperparameter.

Hyperparameters significantly affect the output of a machine learning (ML) algorithm when correctly tuned (set to the right values).

The algorithm computes the k points closest to the new point. The output is shown above for k = 3 and for k = 6 (k being the number of nearest neighbouring points that vote on which class the new point belongs to).

Finally, we return as output the class that dominates among the k nearest neighbours, where "nearest" is defined by a distance measure such as the Euclidean distance.

This is how the K Nearest Neighbours algorithm works in principle. As you can see, visualizing the data is a big help to get an intuitive picture of what the k values should be.
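To make the principle concrete, here is a minimal from-scratch sketch in plain NumPy (the two-class toy dataset is made up for the demo; later in this article we use scikit-learn's optimized implementation instead):

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the k nearest labels
    return np.bincount(y_train[nearest]).argmax()

# Toy data: class 0 clustered near the origin, class 1 near (5, 5)
X_train = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.5])))  # 1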

Now, let’s see the K-Nearest-Neighbours Algorithm work in practice.

Note: This algorithm is powerful and highly versatile. It can be used for binary classification, multi-class classification, regression, clustering, and so on.  Many use-cases are available for this algorithm which is quite simple but remarkably powerful, so make sure you learn it well so that you can use it in your projects.

Obtain the Data and Preprocess it

We shall use the data from the UCI Repository, available at the following link:

http://archive.ics.uci.edu/ml/datasets/connectionist+bench+(sonar,+mines+vs.+rocks) 

It needs to be manually converted into a CSV file, which is available at the following link:

https://github.com/selva86/datasets/blob/master/Sonar.csv

This data is a set of 207 underwater sonar readings that have to be classified as rocks or underwater mines. Save the CSV file in the same directory as your Python source file and perform the following operations:

Import the required packages first:

import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

Read the CSV dataset into your Python environment and check out the top 5 rows using the head() Pandas DataFrame function.

> df = pd.read_csv("sonar.all-data.csv")
df.head()
0.0200  0.0371  0.0428  0.0207  0.0954 ...  0.0180  0.0084  0.0090  0.0032  R
0  0.0453  0.0523  0.0843  0.0689  0.1183 ...  0.0140  0.0049  0.0052  0.0044  R
1  0.0262  0.0582  0.1099  0.1083  0.0974 ...  0.0316  0.0164  0.0095  0.0078  R
2  0.0100  0.0171  0.0623  0.0205  0.0205 ...  0.0050  0.0044  0.0040  0.0117  R
3  0.0762  0.0666  0.0481  0.0394  0.0590 ...  0.0072  0.0048  0.0107  0.0094  R
4  0.0286  0.0453  0.0277  0.0174  0.0384 ...  0.0057  0.0027  0.0051  0.0062

Sonar Reading for Classification ML Problem
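One caveat worth noting before moving on: pd.read_csv treats the first line of the file as a header by default, which is why one sonar reading shows up as the column names above and df.shape reports 207 rows (the raw UCI file has 208). If you want to keep every row, you could read the file as below; the rest of this walkthrough sticks with the 207-row version so the outputs match what is shown.

# Alternative read that preserves all 208 rows;
# column names then become the integers 0..60
df = pd.read_csv("sonar.all-data.csv", header=None)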

Check how much data you have and what its dimensions are:

> df.describe()
0.0200      0.0371     ...          0.0090      0.0032
count  207.000000  207.000000     ...      207.000000  207.000000
mean     0.029208    0.038443     ...        0.007936    0.006523
std      0.023038    0.033040     ...        0.006196    0.005038
min      0.001500    0.000600     ...        0.000100    0.000600
25%      0.013300    0.016400     ...        0.003650    0.003100
50%      0.022800    0.030800     ...        0.006300    0.005300
75%      0.035800    0.048100     ...        0.010350    0.008550
max      0.137100    0.233900     ...        0.036400    0.043900
> df.shape
(207, 61)

Now, the last column is a letter. We need to encode it into a numerical value. For this, we can use LabelEncoder, as below:

# Inputs (data values): sonar readings from an underwater vessel. Cool!
X = df.values[:, 0:-1].astype(float)

# Classification target: the last column holds the classes
# M (Mine) and R (Rock), which are categorical values
target = df.values[:, -1]

# Convert the classes to numbers with a LabelEncoder
le = LabelEncoder()
y = le.fit_transform(target)

Now have a look at your target array. R (rock) and M (mine) have been converted into 1 and 0 respectively.

y
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0])

Examine your (as yet unscaled) input NumPy array, X:

X
array([[0.0453, 0.0523, 0.0843, ..., 0.0049, 0.0052, 0.0044],
       [0.0262, 0.0582, 0.1099, ..., 0.0164, 0.0095, 0.0078],
       [0.01  , 0.0171, 0.0623, ..., 0.0044, 0.004 , 0.0117],
       ...,
       [0.0522, 0.0437, 0.018 , ..., 0.0138, 0.0077, 0.0031],
       [0.0303, 0.0353, 0.049 , ..., 0.0079, 0.0036, 0.0048],
       [0.026 , 0.0363, 0.0136, ..., 0.0036, 0.0061, 0.0115]])

Execute the train_test_split partition function. This splits the inputs into 4 separate numpy arrays. We can control how the input data is split using the test_size or train_size parameters. Here the test_size parameter is set to 0.3: 30% of the data goes into the test set and the remaining 70% (the complement) into the training set. We train (fit) the ML model on the training arrays and see how accurate our models are on the test set. By default, test_size is 0.25 (a 25%/75% split). Normally this sampling is randomized, so different results appear each time the code is run. Setting random_state to a fixed value (any fixed value) makes sure that the same split is obtained every time we execute the model.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
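A quick shape check confirms the split: with 207 rows, a 0.3 test fraction gives 63 test samples (exactly the support total you will see in the classification report later) and 144 training samples.

print(X_train.shape, X_test.shape)  # (144, 60) (63, 60)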

Fit the KNN classifier to the dataset.

# Train a k-nearest-neighbours classifier
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors = 5, metric = "minkowski", p = 1)

# Fit the model
clf.fit(X_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=1,
           weights='uniform')

For now, it is fine to leave the defaults as they are. The output of the KNeighborsClassifier has two values that you do need to know: metric and p. Setting metric = "minkowski" with p = 1 specifies the Manhattan distance: the distance between two points measured along axes at right angles. In a plane with p1 at (x1, y1) and p2 at (x2, y2), it is |x1 - x2| + |y1 - y2|. (Source: https://xlinux.nist.gov/dads/HTML/manhattanDistance.html)
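As a quick sanity check of that formula, with two made-up points:

import numpy as np

p1 = np.array([1.0, 2.0])
p2 = np.array([4.0, 6.0])

# Manhattan (L1) distance: |1 - 4| + |2 - 6| = 3 + 4 = 7
print(np.sum(np.abs(p1 - p2)))  # 7.0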

Output the statistical scoring of this classification model.

predicted = clf.predict(X_test)
print("Accuracy:")
print(accuracy_score(y_test, predicted))
print("Confusion Matrix:")
print(confusion_matrix(y_test, predicted))
print("Classification Report:")
print(classification_report(y_test, predicted))
Accuracy:
0.8253968253968254
Confusion Matrix:
[[32  3]
 [ 8 20]]
Classification Report:
             precision    recall  f1-score   support

          0       0.80      0.91      0.85        35
          1       0.87      0.71      0.78        28

avg / total       0.83      0.83      0.82        63

Accuracy on the test set was 82%. Not bad, since my first attempt, with a random forests classifier, topped out at just 72%!

The entire program as source code in Python is available here as a downloadable sonar-classification.txt file (rename *.txt to *.py and you're good to go):

https://dimensionless.in/wp-content/uploads/2018/11/sonar.txt

To learn more about how k-nearest neighbours are used in practice, do check out the following excellent article on our blog:

https://dimensionless.in/spam-detection-with-natural-language-processing-part-3/ 

The following article is also an excellent reference for KNNs:

https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm 

Takeaways

K-Nearest-Neighbours is a powerful algorithm to have in your machine learning classification arsenal; it is frequently one of the first models tried on a new classification problem. Use it, learn it in depth, and it will be incredibly useful to you throughout your data science career. I highly recommend the Wikipedia article since it covers nearly all applications of KNNs and much more.

Finally, Understand the Power of Machine Learning

Imagine trying to classify these sonar readings, each with 60 features, without machine learning. You would have to load roughly 207 × 61 ≈ 12,600 values and then develop an algorithm by hand to analyze the data!

Scikit-Learn, TensorFlow, Keras, PyTorch, and AutoKeras bring fantastic abilities to computers with respect to problems that simply could not be solved before ML came along.

And this is just the beginning!

Automation is the Future

As ML and AI applications take root in our world more and more, many human roles will be taken over by "intelligent" software programs that perform operations like a human. We already have chatbots in many companies. Self-driving cars daily push the very limits of what we think a machine can do. The only question is, will you be on the side that is being replaced, or will you be on the new forefront of technological progress? Get reskilled. Or just start learning – today, as soon as you can.

Data Science Interview Questions with Answers

Expertise Critical for Every Data Scientist

https://dimensionless.in/wp-content/uploads/2018/10/Data-Science-topics.pdf


The Best Way to Prepare for Interview Questions

Now suppose you read a question about a topic like overfitting. You can read the text and memorize the answer. Articles with this kind of heading (Interview Questions and Answers) are usually constructed that way, with plain-text questions and answers. You could follow that route for interview preparation, but it is simply not the right thing to do. I can give you a list of important questions with answers, which is exactly what I will do later in this article.

But you need to understand one thing clearly.

You cannot learn programming and data science from books alone.

You can learn the heading and the words. But the concept will truly be understood only in a practical manner; in a mini-project or in a worked-out example on the computer.

Data science is similar to programming in this regard.

Books are meant to just start your journey.

The real learning begins only when you implement it in code by yourself.

To take an example:

Question from the Interviewer:

“What is cross-validation and why is it important? How does it eliminate overfitting?”

A Good Answer:

“Cross-validation guards against overfitting by exposing the model to the entire data set in a statistically uniform manner. Overfitting happens when the training and test sets are not properly selected. If a model like LogisticRegression is trained until its error rate is very small, it may not be able to generalize to the pattern of data found in the test set. The performance of the model will then be excellent on the training set but poor on the test set, because the model has overfitted itself to the training data. Thus, when presented with test data, error values increase because the generalization capacity of the model has decreased and the model cannot discover the patterns of the test data.”

“K-fold Cross-Validation prevents this by first dividing the total data into k sections, using one section as the test set and the remaining sections as the training set. We train k models, each time using a different fold as the test set and the remaining folds as the training set. Thus, we cover as many combinations of training and test data as possible. Finally, we average the results of the k models and return that as the output. So overfitting is controlled by using the entire data as input, one section (one of the k folds) being left out at a time to serve as a test set. A common value for k is 10.”

Question:

“Can you show me how that works by coding it on a 10 by 10 array of integers? In Python?”

Worst Case Answer:

“Ummmmmmmm…..”

 “Sorry sir, I just studied that in a textbook. I am not sure how I could work through that by code.”

(!!!)
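By contrast, a prepared candidate could sketch the idea in a few lines. Here is one minimal way to answer, using scikit-learn's KFold (the 10 × 10 integer array and the dummy labels are invented purely for the demo):

import numpy as np
from sklearn.model_selection import KFold

# A 10 x 10 array of integers: 10 samples with 10 features each
X = np.arange(100).reshape(10, 10)
y = np.array([0, 1] * 5)  # dummy binary labels

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # Each of the 5 folds holds out 2 samples for testing
    # and trains on the remaining 8
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")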


You Can’t Study Without Implementation

Data science should be studied the way programming is studied: by working at it on a computer, running all the models in your textbook, and finally doing your own mini-project on every topic that could be important. Can you learn to drive a car by reading about it in a book? You need practical experience! Otherwise, all your preparation is meaningless. That is the point I wanted to make.

Now, having established this, I assume from here on that you are a data scientist in training who has worked through the fundamental details on a computer and is familiar with the basics. You just need the finishing touches on your interview preparation. If that is the case, here are your topics for mini-projects and experiments – and interview questions with answers.

Interview Practice Resources

Python Practice

https://www.testdome.com/d/python-interview-questions/9

This is a site that allows you to sharpen your skills in Python for interviews. There are many more sites like these, all you need to do is Google ‘Python Interview Questions’.

R Practice

https://www.computerworld.com/article/2497143/business-intelligence/business-intelligence-beginner-s-guide-to-r-introduction.html

Many people know Python, but R is not as commonly known. The above tutorial spans 30 pages that you can work through with your R console to learn the basics. Alternatively, you could try Swirl (link given below), which is also highly recommended for beginners.

https://swirlstats.com/ 

Kaggle

Work through Kaggle competitions. No better way to establish yourself in the data science universe.

https://www.kaggle.com/competitions


Also, if you have basic data science skills, try your hand with the hands-on Kernels section. Cash prizes awarded every week!

https://www.kaggle.com/kernels


Oh, what are kernels? Kaggle Kernels are online Jupyter notebooks that let you run Python and R code interactively in your browser, without any local setup. All computation is done on the Kaggle servers.

Top Ten Essential Data Science Questions with Answers

1. What is a normal distribution? And how is it significant in data science?

The normal distribution is a probability distribution characterized by its mean and standard deviation (or variance). The normal distribution with a mean of 0 and a variance of 1 looks like a bell, hence it is also referred to as the bell curve. The central limit theorem makes the normal distribution ubiquitous in data science. In essence, the central limit theorem states that the sum (or mean) of many independent random samples tends toward the normal distribution as the number of samples increases without limit. This theorem is used nearly everywhere in data science, because it gives you an "expected" shape for an aggregate statistic of an arbitrary dataset that has, say, n = one thousand samples: as n increases, the distribution of such averages will look more and more like the bell curve.
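You can see the theorem at work with a few lines of NumPy (a quick empirical sketch, sampling from a deliberately non-normal uniform distribution):

import numpy as np

rng = np.random.default_rng(0)
# 10,000 experiments, each averaging 50 uniform(0, 1) draws
sample_means = rng.uniform(0, 1, size=(10_000, 50)).mean(axis=1)

# The means cluster around 0.5 in a bell shape, with a standard
# deviation near the theoretical sqrt(1/12) / sqrt(50) ~ 0.041
print(sample_means.mean(), sample_means.std())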

2. What do you mean by A/B testing?

An A/B test records the results for two variants or hypotheses (depending upon the scenario) and compares the rate of success or accuracy between state A and state B. This often tells us which feature should be used to build a machine learning model; it is also used to select which model to use in the first place. A/B testing is a general concept that can be applied to nearly every system.
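For instance, a bare-bones significance check on two variants might look like this (the conversion counts are hypothetical; a chi-squared test on the 2 × 2 contingency table is one common choice):

import numpy as np
from scipy.stats import chi2_contingency

# Variant A: 120 conversions out of 1000; variant B: 150 out of 1000
table = np.array([[120, 880],
                  [150, 850]])

chi2, p, dof, expected = chi2_contingency(table)
print(p)  # a small p-value suggests the variants really do differ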

3. What are eigenvalues and eigenvectors?

The eigenvectors of a square matrix are the non-zero vectors whose direction is left unchanged by the linear transformation the matrix represents; the corresponding eigenvalues measure the strength of the transformation, i.e. the factor by which each eigenvector is scaled. In data science they are most often computed from the correlation or covariance matrix (as in principal component analysis). See Linear Algebra by Gilbert Strang (online ebook) for more details on their computation.
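A two-line check with NumPy makes the definition concrete (the matrix is made up; for a diagonal matrix the eigenvalues are simply the diagonal entries):

import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)   # [2. 3.]
print(eigenvectors)  # columns are the corresponding unit eigenvectors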

4. How do the recommender systems in Amazon and Netflix work? (research paper pdf)

Recommender systems in Amazon and Netflix are considered top-secret and are usually described as black boxes, but their internal mechanism has been partially worked out by researchers. A recommender system, predated by the expert-systems models of the '90s, generates rules or "explanations" as to why a product might be more attractive to user X than to user Y. Complex algorithms with many inputs, such as past purchase history and genre preferences, generate the following types of explanations: functional, intentional, scientific and causal. These explanations, which can also be called user-invoked, automatic or intelligent, are tuned against metrics such as user satisfaction, user rating, trust, reliability, effectiveness, and persuasiveness. The exact algorithms remain industry secrets, much as Google keeps its PageRank algorithms secret and constantly updated (500-600 times a year in the case of Google).

5. What is the probability of an impossible event, a past event and what is the range of a probability value?

An impossible event E has P(E) = 0. An event that has already occurred is certain, so for a past event P(E) = 1. Probabilities take values only in the closed interval [0, 1].

6. How do we treat missing values in datasets?

A categorical missing value is usually assigned the mode (the most frequent category). A continuous missing value is usually imputed using a measure of central tendency such as the mean or median, or by sampling from a fitted distribution such as the normal. If a feature has less than 20% of its data available, a common recommendation is to drop that feature from the model.
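In Pandas, both treatments are one-liners (a small sketch on an invented DataFrame):

import pandas as pd

df = pd.DataFrame({"colour": ["red", None, "red", "blue"],
                   "size":   [10.0, 12.0, None, 11.0]})

# Categorical: fill with the mode; continuous: fill with the median
df["colour"] = df["colour"].fillna(df["colour"].mode()[0])
df["size"] = df["size"].fillna(df["size"].median())
print(df)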

7. Which is faster, Python or R?

Python is moderately fast: it is an interpreted rather than a compiled language, so C++ is much faster for raw computation, but the reference implementation of Python is written in C, and its numerical libraries push the heavy lifting down into compiled code. R was designed by statisticians, not computer scientists, and is generally slower than Python.

8. What is Deep Learning and why is it such a popular buzzword in the machine learning field right now?

For many years, until around 2006, backpropagation neural networks had just three layers: one input, one hidden and one output layer. The problem with this model was that, since it used gradient descent and the backpropagation algorithm, the neural nets had a tendency to be attracted towards local minima in the hyperplane representing the dimensions of the input features. Thus, NNs could not be used optimally for many applications, since they could only find a partially optimal solution. In 2006, Geoffrey Hinton et al. published a research paper showing that multilayer neural networks could overcome the problem of local minima since, in thousands of dimensions, local minima are statistically so rare as to almost never be encountered during backpropagation (saddle points are common instead). Deep learning refers to neural nets with 3 or more (even 10+) hidden layers. They require more computational power, which is one of the reasons the machine learning community started using GPUs to implement deep learning NNs. Since 2010-2012, deep learning has been applied to nearly every technology domain, and the models have been highly accurate and successful in all areas, from speech recognition to playing the game of Go.

9. What is the difference between machine learning and deep learning?

For more details on that, I suggest you go through this excellent article, given on the following link on our blog below:

https://dimensionless.in/machine-learning-and-deep-learning-differences/

10. What is Reinforcement Learning?

For an excellent explanation of reinforcement learning that is both educational and fun to read, please visit the following page, also on our blog :

https://dimensionless.in/reinforcement-learning-super-mario-alphago/

Enjoy Your Work!

To finally sum up, I have to say: enjoy your work. You will be much better at what you love than at something that is glamorous but not to your taste. Artificial Intelligence, Data Science, Software Development and Machine Learning are very much my preferred line of work, and my hope is that they will be in yours too. Don't just read the text; work out the code on your own system or on Kaggle. That is how best to prepare for interview questions. Only practice at your computer (preferably on Kaggle) will give you true confidence on the day of your interview. That is true expertise: practice making perfect. Enjoy data science!

5 Steps to Prepare for a Data Science Job

A career in data science is hyped as the hottest job of the 21st century, but how do you become a data scientist? How should you, as an aspiring data scientist, or a student who aims at a data science job, prepare? What are the skills you need? What must you do? Fret not – this article will answer all your questions and give you links with which you can jump-start a new career in data science!

Data science as a field is a cross-disciplinary topic. By this, we mean that the data scientist has to know multiple fields and be an expert in many different things. A data scientist must have a strong foundation in the following subjects:

  1. Computer Science
  2. Statistical Research (solid foundation required)
  3. Linear Algebra
  4. Data Processing (data analyst expertise)
  5. Machine Learning
  6. Software Engineering
  7. Python Programming
  8. R Programming
  9. Business Domain Knowledge

The following diagram shows a little bit of the subjects you will need to master to become a high-quality data scientist:

[Figure: the data science skill set]

Now, unless you have been focused like a laser beam and have deliberately directed your studies towards these areas, it is likely that you will not know one or more of the topics given above. Or you may know two or three really well but not be solid in the rest. For example, you could be a computer science student who knows mathematics but not statistics to the depth that statistical research requires. Or you could be a statistician with only a little foundation in programming.

But there are ways to get past that crucial job interview. The five things you must do are:

  1. Learn Python and R from quality trainers with years of industry experience
  2. Build a portfolio of data science projects on GitHub
  3. Join Kaggle and participate in data science competitions
  4. Practice Interview Questions 
  5. Do basic Online Reputation Management to improve your online presence.


1. Learn Python and R from the best trainers available

[Figure: R and Python]

There is no substitute for industry experience. If your instructor is not just an enthusiastic amateur (as in the case of many courses available online) but someone with 5+ years of experience working in the data science industry, you have the best possible trainer in the field. It is one thing to learn Python and R; it is quite another to master them. If you want to do well in the industry, mastery is required, not just basic ability. Make sure your faculty members have verified industry experience, because that experience is what will count in finally landing you a job at a top-notch data science company. You will always learn more from experts with industry experience than from academics who hold a Ph.D. in the subject but have not worked in the field.

2. Build a GitHub Portfolio of Data Science Projects

Having an online portfolio in GitHub is critical!

All the best training in the field will take you nowhere if you don’t code what you learn and apply the lessons to real-life datasets and scenarios. You need to do data science projects. Try to make your projects as attractive as possible. As much as you can, your GitHub project portfolio should be built with these guidelines in mind:

  1. Use libraries, languages, and tools that your target companies work with.
  2. Use datasets that your target companies work with, and always use real-world data (not academic datasets like the ones supplied with scikit-learn; use Kaggle to find practice datasets). The best datasets are constructed programmatically with APIs from Twitter, Facebook, Wikipedia, and similar real-world sources.
  3. Choose problems that have market value. Don’t choose an academic project, but solve a real-world industry problem.
  4. Extra marks for creativity and originality in the problem definitions and the questions answered by the portfolio projects.

3. Join Kaggle or TopCoder and participate in Competitions


Kaggle.com is your training arena.

If you are into data science, become a Kaggler immediately! Or, if your taste leans more towards development, join TopCoder (they also have data science tracks). Kaggle is widely touted as the home of data science, and for good reason: it has hosted data science competitions for many years and is the international venue for the best of them. One of the simplest ways to get a call from a reputed company is to rank as high as possible on Kaggle. What is more, you will be able to compare your performance with the top competition in the industry.

4. Practice Interview Questions

There are plenty of sites available online that have excellent collections of industry questions used in data science interviews. Now, no one expects you to mug up 200 interview questions, but they do expect you to be able to solve basic data science and algorithm questions in code (Python preferably) or in pseudocode. You also need to know basic concepts like what cross-validation is, the curse of dimensionality, and the problem of overfitting and how you deal with it in practice in real-world scenarios. You should also be able to explain the internal details of most data science algorithms, for example, AdaBoost. Knowledge of linear algebra, statistics, and some basic multivariable calculus is also required to possess that extra edge over the competition.

5. Manage your Online Search Reputation

This may not seem connected with data science, but it is a fundamental component in any job search. What is the first thing that a prospective employer looks for while hunting for job candidates, when given a name? That’s right – he’ll Google it first. What comes up when you Google your name? Is your online profile safe under scrutiny? That is:

  1. Is your name when searched on Google free of red flags like negative reports of any type (offensive material, controversies)?
  2. Does the search engine entry for your name represent your profile with accuracy?
  3. Are your public Facebook, Twitter and Google profiles free of any automatic red flags? (e.g. intimate pictures)?
  4. Does the Google visibility of your name depict your skill levels correctly?

If the answers to any of these questions are no, you may need to adjust or tweak your online profile. You can do this by blog posts, informed mature comments online, or even creating a blog for yourself and speaking about yourself to the world in a positive manner. This is critical for any job applicant today, in this online, digital, connected world.

You are a Product to be Marketed!

You are trying to sell yourself and your credibility online to people who have never seen you or even heard your name. Your Internet profile makes the crucial difference here, ensuring you stand out from the competition. Many training sites offer courses by amateurs or people with less than 2 years of industry experience. Don't make the unwise choice of settling for a low-priced course; on the Internet, you get only what you pay for. And this is your future career in the subject area of your dreams. Surely a little initial investment will go a long way in the long run.

Additionally, it will help to gain the employers’ perspective as well. You can refer to this Hiring Guide by TopTal for further reading.

Always keep learning. ML and AI are fields that move forward at an incredible pace. Subscribing to RSS feeds and websites that keep you updated with the latest developments in the field is something you absolutely must do. Nothing shows your commitment to excellence as much as keeping up with the latest state-of-the-art research, and you can do it quite easily with reader applications like Feedly and Inoreader. Learning might be something you do in college, but mastery is something you aim towards for your entire lifetime. Never give up. All the best for your job search, which will definitely be successful if you follow the advice in this blog post. Finally, pay special attention to your portfolio of data science projects on GitHub to make sure you stand out from the competition.

Python Vs R : The Eternal Question for Data Scientists


Python and R are the two most commonly used languages for data science today. Both are fully open-source and free to use and modify (R under the GNU General Public License, Python under the Python Software Foundation's open-source licence).

But which one is better? And, more importantly, which one should you learn? Both are widely used and are standard tools in the hands of every data scientist.

The answer may surprise you: as a professional data scientist, you should be ready to deal with both.

Python has certain use cases and so does R, and the scenarios in which they are used vary. More often than not, it is the environment and the needs of the client or your employer that dictate the choice between Python and R. Many things are easier in Python, but R also has its place in your development toolkit.

Python


Python is a general-purpose programming language first released by Guido van Rossum in 1991.

Since then, Python has been used in multiple environments for multiple purposes, including, but not limited to:
● Web Development (Django)
● Web Microservices (Flask)
● Zappa Serverless Framework for Python
● TensorFlow (Deep Learning Machine Learning Models)
● Keras (High-Level Abstractions to Simplify TensorFlow Development)
● Popular apps built in Python include Dropbox, BitTorrent, Morpheus, Calibre, Blender, and Mercurial – among many, many others.

Python appeals more to software engineers, mainly because production-ready code for industry use can usually be written in Python. If you have a software engineering background or already know programming, Python is the better choice for you (especially if you're a beginner).

Another area where Python shines is the sheer number of pre-existing libraries that are readily available and open-source. The PyPI repository (short for Python Package Index) holds over 121k packages that automate many programming tasks at various levels of abstraction, making life easy for the programmer, and at least 6k of them are focused on data science. Python also excels in readability: compared to R, Python is much easier to read and understand. And Python is faster than R, in some cases dramatically faster.

R


R is a programming language designed by statisticians for statisticians. It originated in the '90s with Ross Ihaka and Robert Gentleman. R excels in academic use and in the hands of a statistician; people with formal training in statistics, such as a statistics degree, find working with R extremely simple. The repository for R packages, called CRAN (the Comprehensive R Archive Network), contains nearly 12k packages, roughly half of which are for data science. R also excels at data visualization, and analyzing data on a one-off basis is often simpler and more naturally expressed in R.

Also, once upon a time, using Python meant linking many libraries together, some of which would become incompatible after feature revisions and library updates. That is no longer true, thanks to Anaconda (see below). For a while, deep learning was strictly a Python feature, which shifted the balance of the machine learning world towards Python. However, with the release of Keras for TensorFlow in R, that factor changed as well, and deep learning models can now be built in R too.

So, what is the answer? Which one should you use?

The answer – both.

Jupyter Notebook – Integrating Python and R

The Anaconda distribution from Continuum Analytics has completely disrupted the machine learning picture. Anaconda ships the standard libraries required for Python machine learning – NumPy, SciPy, Pandas, SymPy, Seaborn, Matplotlib – as well as full support for R with an outstanding IDE called RStudio.
For deep learning it bundles TensorFlow, Theano, Caffe, Scikit-Learn, and Torch. One of its most remarkable features is the Jupyter Notebook, an integrated platform that supports the use of Python and R in the same environment while keeping everything open source.

Another option is the Hydrogen plug-in for the Atom text editor. It allows you to enter any code that you can use in a Jupyter Notebook and returns the result in the editor. However, it is still in alpha and crashed with an error on my local machine. The Jupyter Lab application allows Python and R notebook editing in the same environment, using the concept of separate and even remote kernels.

As the machine learning field progresses, one can expect more plugins like Hydrogen (which I can’t wait to test once it’s out of alpha) produced in the very near future. So, Python excels in machine learning, while R excels in statistics.
But why should you learn both?

Because a professional data scientist needs to understand statistics and the mathematics behind the machine learning algorithms in great detail.

We shall examine two SVM machine learning models, one through Python code, and then another through R code. This will give us a good picture of how both languages work.

Python Code

This code performs binary classification using a non-linear support vector machine with a Gaussian kernel. The target to predict is the XOR of the inputs.
The color map illustrates the decision function learned by the Support Vector Classifier (SVC).

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

xx, yy = np.meshgrid(np.linspace(-3, 3, 500),
                     np.linspace(-3, 3, 500))
np.random.seed(0)
X = np.random.randn(300, 2)
Y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0)

# fit the model
clf = svm.NuSVC()
clf.fit(X, Y)

# plot the decision function for each datapoint on the grid
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           aspect='auto', origin='lower', cmap=plt.cm.PuOr_r)
contours = plt.contour(xx, yy, Z, levels=[0], linewidths=2,
                       linestyles='dashed')
plt.scatter(X[:, 0], X[:, 1], s=30, c=Y, cmap=plt.cm.Paired,
            edgecolors='k')
plt.xticks(())
plt.yticks(())
plt.axis([-3, 3, -3, 3])
plt.show()

Output:

[Figure: decision surface learned by the SVC on the XOR problem]

References:

http://scikit-learn.org/stable/auto_examples/svm/plot_svm_nonlinear.html#sphx-glr-auto-examples-svm-plot-svm-nonlinear-py

R Code

This program uses the iris dataset to illustrate the use of a non-linear SVM classifier. The code is deliberately a little more complex, since it applies ML techniques to a full-fledged built-in dataset – the iris dataset, one of the canonical datasets traditionally used to illustrate the capabilities of ML techniques. This code also illustrates the use of R's built-in statistical functions.
You will need to install the R package e1071 and load it with library(e1071) before executing the code below. But don't worry: installing new packages in RStudio is ridiculously simple.

data(iris)
attach(iris)

## classification mode
# default with factor response:

library(e1071)   # load the package that provides svm()

model <- svm(Species ~ ., data = iris)

# alternatively the traditional interface:
x <- subset(iris, select = -Species)
y <- Species
model <- svm(x, y)

print(model)
summary(model)

# test with train data
pred <- predict(model, x)
# (same as:)
pred <- fitted(model)

# Check accuracy:
table(pred, y)

Output:

            y
pred         setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         48         2
  virginica       0          2        48

# visualize (classes by color, SV by crosses):
plot(cmdscale(dist(iris[, -5])),
     col = as.integer(iris[, 5]),
     pch = c("o", "+")[1:150 %in% model$index + 1])

## try regression mode on two dimensions

# create data
x <- seq(0.1, 5, by = 0.05)
y <- log(x) + rnorm(x, sd = 0.2)

# estimate model and predict input values
m   <- svm(x, y)
new <- predict(m, x)

# visualize
plot(x, y)
points(x, log(x), col = 2)
points(x, new, col = 4)

## density-estimation

# create 2-dim. normal with rho = 0:
X <- data.frame(a = rnorm(1000), b = rnorm(1000))
attach(X)

# traditional way:
m <- svm(X, gamma = 0.1)

# formula interface:
m <- svm(~., data = X, gamma = 0.1)
# or:
m <- svm(~ a + b, gamma = 0.1)

# test:
newdata <- data.frame(a = c(0, 4), b = c(0, 4))
predict(m, newdata)

# visualize:
plot(X, col = 1:1000 %in% m$index + 1, xlim = c(-5, 5), ylim = c(-5, 5))
points(newdata, pch = "+", col = 2, cex = 5)

# weighted classes for an artificially unbalanced problem:
i2 <- iris
levels(i2$Species)[3] <- "versicolor"
summary(i2$Species)
wts <- 100 / table(i2$Species)
wts
m <- svm(Species ~ ., data = i2, class.weights = wts)

Output:

[Figure: plots produced by the R code above]

As you can see, R's built-in graphing and statistical abilities make this kind of analysis fundamentally more concise than the Python equivalent. Being a language designed by statisticians for statisticians, R will be the best launchpad for your new career in data science if you have a statistics background.
References
https://www.rdocumentation.org/packages/e1071/versions/1.7-0/topics/svm

Conclusion

Thus, when it comes to choosing between Python and R, any data scientist worth their salt will know that they are supposed to know both.

And in the end, all the most advanced software engineering skills won't get you anywhere unless you have a firm foundation in statistics – or a professional statistician on your team. The main reason we use analytics is to make business decisions, and we can use it best when we have an iron-clad grasp of the entire picture.

So, on Python versus R, to sum up:

Both perform similar tasks in data science but are optimized toward different domains. If you are a software engineer, choose Python. If you are an academic researcher, choose R.

And if you are a data scientist – choose both.

References

1. https://www.kdnuggets.com/2015/05/r-vs-python-data-science.html
2. https://www.quora.com/Which-is-better-for-data-analysis-R-or-Python-Is-R-still-a-better-data-analysis-language-than-Python-Has-anyone-else-used-Python-with-Pandas-to-a-large-extent-in-data-analysis-projects
3. https://medium.com/@data_driven/python-vs-r-for-data-science-and-the-winner-is-3ebb1a968197
4. https://blog.usejournal.com/python-vs-and-r-for-data-science-833b48ccc91d
5. https://elitedatascience.com/r-vs-python-for-data-science
6. https://www.newgenapps.com/blog/r-vs-python-for-data-science-big-data-artificial-intelligence-ml
7. https://www.dataquest.io/blog/python-vs-r/

Source Code References
1. http://scikit-learn.org/stable/auto_examples/svm/plot_svm_nonlinear.html#sphx-glr-auto-examples-svm-plot-svm-nonlinear-py
2. https://www.rdocumentation.org/packages/e1071/versions/1.7-0/topics/svm