Machine Learning (ML) Essentials


When I started my Data Science journey, I casually Googled ‘Application of Machine Learning Algorithms’. For the next 10 minutes, I had my jaw hanging. Quite literally. Part of the reason was that they were all around me. That music video from YouTube recommendations that you ended up playing a hundred times on loop? That’s Machine Learning for you. Ever wondered how Google keyboard completes your sentence better than your bestie ever could? Machine Learning again!

So how does this magic happen? What do you need to perform this witchcraft? Before you move further, let me tell you who would benefit from this article.

  • Someone who has just begun her/his Data Science journey and is looking for theory and application on the same platter.
  • Someone who has a basic idea of probability and linear algebra.
  • Someone who wants a brief mathematical understanding of ML and not just a small talk like the one you did with your neighbour this morning.
  • Someone who aims at preparing for a Data Science job interview.

‘Machine Learning’ literally means that a machine (in this case an algorithm running on a computer) learns from the data it is fed. For example, you have customer data for a supermarket. The data consists of customers’ age, gender, time of entry and exit, and total purchase amount. You train a Machine Learning algorithm to learn the purchase pattern of customers and predict the purchase amount for a new customer by asking for their age, gender, time of entry and exit.

Now, let’s dig deep and explore how it works.

Machine Learning (ML) Algorithms

Before we talk about the various classes, let us define some terms:

Seen data or Train Data

This is all the information we have. For example, data of 1000 customers with their age, gender, time of entry and exit and their purchases.

Predicted Variable (or Y)

The ML algorithm is trained to predict this variable. In our example, the ‘Purchase amount’. The predicted variable is usually called the dependent variable.

Features (or X)

Everything in the data except for Y. Basically, the input that is fed to the model. Features are usually called the independent variables.

Data table | Essentials of ML

 

Model Parameters

Parameters define our ML model. This will be understood later as we discuss each model. For now, remember that our main goal is to evaluate these parameters.

Unseen data or Test Data

This is the data for which we have the X but not Y. The Y has to be predicted using the ML model trained on the seen data.

Now that we have defined our terms, let’s move to the classes of Machine Learning or ML algorithms.

Supervised Learning Algorithms:

These algorithms require you to feed the data along with the predicted variable. The parameters of the model are then learned from this data in such a way that the error in prediction is minimized. This will become clearer as individual algorithms are discussed.

Unsupervised Learning Algorithms:

These algorithms do not require data with predicted variables. Then what do we predict? Nothing. We just cluster these data points.

If you have any doubts about the things discussed above, keep on reading. It will get clearer as you see examples.

Cross-validation :

A general strategy for setting the parameters of any ML algorithm. You take out a small part of your training (seen) data, say 20%. You train an ML model on the remaining 80% and then check its performance on that 20% of the data (remember, you have the Y values for this 20%). You tweak the parameters until you get minimum error. Take a look at the flowchart below.

Cross validation in ML Flowchart
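To make the flowchart concrete, here is a minimal sketch of the hold-out procedure described above. The data is synthetic, and the model (scikit-learn’s KNeighborsRegressor, tuned over its K parameter, which we meet in the next section) is only a stand-in; both are assumptions for illustration, not part of the original tutorial.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))                     # stand-ins for age, gender, entry and exit time
y = 50 + 10 * X[:, 0] + rng.normal(size=1000)      # stand-in for the purchase amount

# split the *seen* data 80:20 into a training part and a validation part
indices = rng.permutation(len(X))
train_idx, val_idx = indices[:800], indices[800:]

best_k, best_error = None, np.inf
for k in [1, 3, 5, 11, 21]:                        # candidate values of the parameter to tune
    model = KNeighborsRegressor(n_neighbors=k).fit(X[train_idx], y[train_idx])
    error = np.mean((model.predict(X[val_idx]) - y[val_idx]) ** 2)   # error on the held-out 20%
    if error < best_error:
        best_k, best_error = k, error

print(f"chosen K = {best_k}, validation MSE = {best_error:.2f}")
```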

Supervised Learning Algorithms

In Supervised Machine Learning, there are two types of predictions – Regression and Classification. Classification means predicting the class of a data point. For example – gender, type of flower, whether a person will pay their credit card bill or not. The predicted variable has 2 or more possible discrete values. Regression means predicting a numeric value for a data point. For example – purchase amount, age of a person, price of a house, amount of predicted rainfall, etc. Here the predicted variable is continuous. Some algorithms perform only one of these two tasks, while others can be used for both. I will mention this for each algorithm we discuss. Let’s start with the simplest one and slowly move to more complex algorithms.

KNN: K-Nearest Neighbours

“You are the average of the five people you surround yourself with.” – Jim Rohn

 

Congratulations! You just learned your first ML algorithm.

Don’t believe me? Let’s prove it!

Consider the case of classification. Let’s set K, the number of closest neighbours to be considered, equal to 3. We have 6 seen data points whose features are the height and weight of individuals, and the predicted variable is whether or not they are obese.

KNN example scatter plot

 

Consider a point from the unseen data (in green). Our algorithm has to predict whether the person represented by the green data point is obese or not. If we consider its K (=3) nearest neighbours, we have 2 obese (blue) and one not obese (orange) data points. We take the majority vote of these 3 neighbours, which is ‘Yes’. Hence, we predict this individual to be obese. In the case of regression, everything remains the same, except that we take the average of the Y values of our K neighbours. How do we set the value of K? Using cross-validation.
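Here is a from-scratch sketch of the KNN vote just described. The six “seen” points and the query point are made-up numbers, and the helper name knn_predict is ours, not from the article.

```python
import numpy as np
from collections import Counter

# features: (height in cm, weight in kg); label: 1 = obese, 0 = not obese
X_train = np.array([[160, 95], [165, 100], [170, 110],
                    [175, 70], [180, 75], [185, 80]], dtype=float)
y_train = np.array([1, 1, 1, 0, 0, 0])

def knn_predict(x_new, X, y, k=3):
    distances = np.sqrt(((X - x_new) ** 2).sum(axis=1))   # Euclidean distance to every seen point
    nearest = np.argsort(distances)[:k]                   # indices of the K closest points
    votes = Counter(y[nearest])                           # majority vote (use the mean for regression)
    return votes.most_common(1)[0][0]

print(knn_predict(np.array([168.0, 104.0]), X_train, y_train, k=3))  # -> 1 (obese)
```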

 

Key facts about KNN:

  • KNN performs poorly on higher dimensional data, i.e. data with too many features (the curse of dimensionality).
  • Euclidean distance is used for computing distance between continuous variables. If the data has categorical variables (gender, for example), Hamming distance is used for such variables. There are many popular distance measures used apart from these. You can find a detailed explanation here.

Linear Regression

This is yet another simple but extremely powerful model. It is used only for regression purposes. It is represented by

Y’ = W0 + W1X1 + W2X2 + … + WnXn        ….(1)

Y’ is the value of the predicted variable according to the model. X1, X2, …, Xn are the input features. W0, W1, …, Wn are the parameters (also called weights) of the model. Our aim is to estimate the parameters from the training data to completely define the model.

How do we do that? Let’s start with our objective, which is to minimize the error in the prediction of our target variable. How do we define our error? The most common way is to use the MSE or Mean Squared Error –

MSE = (1/N) Σᵢ (Yᵢ − Y’ᵢ)²

For all N points, we average the square of the difference between the value of Y predicted by the model, i.e. Y’, and the actual value of the predicted variable for that point, i.e. Y.

We then replace Y’ with equation (1), differentiate this MSE with respect to the parameters W0, W1, …, Wn, and equate the derivatives to 0 to get the values of the parameters at which the error is minimum.
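As a sketch of this “differentiate the MSE and set it to zero” step, the closed-form solution can be computed directly with numpy’s least-squares solver (equivalent to the normal equation). The height/weight numbers below are synthetic, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
height = rng.uniform(150, 190, size=100)                 # feature X
weight = 0.9 * height - 80 + rng.normal(0, 3, size=100)  # target Y, linear in X plus noise

X = np.column_stack([np.ones_like(height), height])      # a column of 1s gives the intercept W0
W = np.linalg.lstsq(X, weight, rcond=None)[0]            # least-squares solution = minimum-MSE weights

mse = np.mean((weight - X @ W) ** 2)
print(f"W0 = {W[0]:.2f}, W1 = {W[1]:.2f}, MSE = {mse:.2f}")
```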

An example of what a linear regression fit might look like is shown below.

linear regression graph example

 

The dependent variable does not always depend linearly on the independent variable. For example, Weight in the above graph may vary with the square of Height. This is called polynomial regression (Y varies with some power of X).

The good news is that any polynomial regression can be transformed into a linear regression. How?

We transform the independent variable. Take a look at the Height variable in both the tables.

linear regression example table 1

Table 1

linear regression example table 2

Table 2

 

We will forget about Table 1 and treat the polynomial regression problem like a linear regression problem. Only this time, Weight will be linear in Height squared (notice the x-axis in the figure below).

polynomial regression graph
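A short sketch of the transformation trick above: square the Height column and then run an ordinary linear regression on the transformed feature. The data is again synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
height = rng.uniform(150, 190, size=100)
weight = 0.003 * height**2 + rng.normal(0, 2, size=100)   # Weight varies with Height squared

X = np.column_stack([np.ones_like(height), height**2])    # "Table 2": replace Height by Height^2
W = np.linalg.lstsq(X, weight, rcond=None)[0]             # from here on it is plain linear regression
print(f"W0 = {W[0]:.2f}, W1 = {W[1]:.5f}")                # W1 should come out close to 0.003
```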

A very important question one should ask about every ML model is – how do you measure the performance of the model? One popular measure is R-squared.

 

R-squared: Intuitively, it measures how well the model explains the variation in the dependent variable. How? Consider the following question – if you had just the Y values and no X values in your data, and someone asked you, “Hey! For this X, what would you predict the Y to be?”, what would be your best guess? The average of all the Y values you have! In this scenario of limited information, you are better off guessing the average of Y for any X than any other value of Y.

 

But now that you have both X and Y values, you want to see how well your linear regression model predicts Y for any unseen X. R-squared quantifies the performance of your linear regression model over this ‘baseline model’.

R² = 1 − (MSE / TSE)

 

MSE is the mean squared error as discussed before. TSE is the total squared error or the baseline model error.

TSE = (1/N) Σᵢ (Yᵢ − Ȳ)², where Ȳ is the mean of Y, i.e. the baseline model’s prediction.
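A small sketch of R-squared exactly as defined above, with MSE and TSE both computed as mean squared errors (their ratio is the same either way). The numbers are made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    mse = np.mean((y_true - y_pred) ** 2)            # error of our model
    tse = np.mean((y_true - y_true.mean()) ** 2)     # error of the always-predict-the-mean baseline
    return 1 - mse / tse

y_true = np.array([50.0, 55.0, 60.0, 65.0, 70.0])
y_pred = np.array([51.0, 54.0, 61.0, 64.0, 71.0])
print(r_squared(y_true, y_pred))                     # 0.98: the model explains most of the variation
```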

Naive Bayes

Naive Bayes is a classification algorithm. As the name suggests, it is based on Bayes rule.

P(c|x) = [ P(x|c) × P(c) ] / P(x)

Intuitive Breakdown of Bayes rule: Consider a classification problem where we are asked to predict the class of a data point x. We have two classes and the classes are denoted by letter C.

Now, P(c), also known as the ‘prior probability’, is the probability of a data point belonging to class c before we have looked at any of its features. For example, if we have 100 roses and 200 sunflowers and someone asks you to classify an unseen flower while providing you with no information, what would you say?

 

P(rose)  = 100/300 = ⅓       P(sunflower) = 200/300 = ⅔

 

Since P(sunflower) is higher, your best guess would be a sunflower. P(rose) and P(sunflower) are prior probabilities of the two classes.

 

Now, you have additional information about your 300 flowers. The information is related to thorns on their stem. Look at the table below.

Flower \ Thorns           Thorns    No Thorns
Rose (Total 100)            90         10
Sunflower (Total 200)       50        150

 

Now come back to the unseen flower. You are told that this unseen flower has thorns. Let this information about thorns be X.

P(X|rose) = 90/100 = 9/10                         P(X|sunflower) = 50/200 = 1/4

Now, according to Bayes rule, the numerators for the two classes are as follows –

Rose = 1/3 × 9/10 = 3/10 = 0.3

Sunflower = 2/3 × 1/4 = 1/6 ≈ 0.17

The denominator, P(x), called the evidence, is the cumulative probability of seeing the data point itself. In this case it is equal to 0.3 + 0.17 ≈ 0.47. Since it does not depend on the class, it won’t affect our decision-making process. We will ignore it for our purposes.

Since 0.3 > 0.17,

P(rose|X) > P(sunflower|X)

Therefore, our prediction would be that the unseen flower is a rose. Notice that the prior probabilities favoured the sunflower, but as soon as we factored in the data about thorns, our decision changed.

If you understood the above example, you have a fair idea of the Naive Bayes Algorithm.

This simple example where we had only one feature (information about thorns) can be extended to multiple features. Let these features be x1, x2, x3 … xn. Bayes Rule would look like  –

P(c|x1, x2, …, xn) = [ P(x1, x2, …, xn|c) × P(c) ] / P(x1, x2, …, xn)

Note that we assume the features to be independent of each other, given the class. Meaning,

P(x1, x2, …, xn|c) = P(x1|c) × P(x2|c) × … × P(xn|c)

The algorithm is called ‘Naive’ because of the above assumption.
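The flower example above can be written as a tiny Naive Bayes computation over the single ‘thorns’ feature. The counts come from the table; the function name posterior_numerators is ours, introduced just for this sketch.

```python
# class -> feature counts taken from the table above
counts = {
    "rose":      {"thorns": 90, "no_thorns": 10,  "total": 100},
    "sunflower": {"thorns": 50, "no_thorns": 150, "total": 200},
}
n_total = 300

def posterior_numerators(x):
    """Return P(x|c) * P(c) for each class c - the numerator of Bayes rule."""
    result = {}
    for c, table in counts.items():
        prior = table["total"] / n_total        # P(c)
        likelihood = table[x] / table["total"]  # P(x|c)
        result[c] = prior * likelihood
    return result

numerators = posterior_numerators("thorns")
print(numerators)                               # {'rose': 0.3, 'sunflower': 0.166...}
print(max(numerators, key=numerators.get))      # 'rose'
```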

 

Logistic Regression

Logistic regression, despite its name, is used for classification purposes. The mathematical function underlying logistic regression is the logistic (sigmoid) function. Consider two classes, 0 and 1.

P(y=1) = 1 / (1 + e^(−WᵀX)),    where WᵀX = W0 + W1X1 + W2X2 + … + WnXn

P(y=1) denotes the probability of belonging to class 1, and 1 − P(y=1) is thus the probability of the data point belonging to class 0 (notice that the function’s range is between 0 and 1 for all values of WᵀX). Like other models, we need to learn the parameters w0, w1, w2, …, wn to completely define the model. Just as linear regression has MSE to quantify the error made in prediction, logistic regression has the following loss function –

Loss = −[ Y log(P) + (1 − Y) log(1 − P) ]

Here P is the probability of a data point belonging to class 1 as predicted by the model, and Y is the actual class of the data point.

Think about this – if the actual class of a data point is 1 and the model predicts P to be 1, we have 0 loss. This makes sense. On the other hand, if P were 0 for the same data point, the loss would blow up to infinity. This is the worst-case scenario. This loss function is used in the Gradient Descent algorithm to find the parameters at which the loss is minimum.

Okay! So now we have a model that can predict the probability of an unseen data point belonging to class 1. But how do we make a decision for that point? Remember that our final goal is to assign classes, not just probabilities.

At what probability threshold do we say that a point belongs to class 1? By default, the model assigns the class according to the probabilities: if P > 0.5, the class is 1, otherwise 0. However, we can change this threshold to maximize the metric of our interest (precision, recall, …); the best threshold can be chosen using cross-validation.
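Below is a minimal sketch of logistic regression trained by gradient descent on the log loss above, followed by the default 0.5 threshold. The one-feature synthetic data, the learning rate and the iteration count are all illustrative assumptions, not values from the article.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = (x + 0.3 * rng.normal(size=200) > 0).astype(float)  # true class roughly follows the sign of x

X = np.column_stack([np.ones_like(x), x])                # prepend 1s so w[0] acts as w0
w = np.zeros(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(2000):                                    # gradient descent on the mean log loss
    p = sigmoid(X @ w)                                   # P(y = 1) for every training point
    gradient = X.T @ (p - y) / len(y)                    # derivative of the loss w.r.t. w
    w -= 0.5 * gradient                                  # learning rate 0.5 (an arbitrary choice)

pred_class = (sigmoid(X @ w) > 0.5).astype(int)          # default threshold of 0.5
print("weights:", w, "training accuracy:", (pred_class == y).mean())
```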

This was Logistic Regression for you. Of course, do follow the coding tutorial!

Decision Tree

“Suppose there exist two explanations for an occurrence. In this case, the one that requires the least speculation is usually better.” – Occam’s Razor

The above philosophical principle precisely guides one of the most popular supervised ML algorithms. Decision trees, unlike many other algorithms, are non-parametric: we don’t need to specify a fixed set of parameters to completely define the model, unlike KNN (where we need to specify K).

Let’s take an example to understand this algorithm. Consider a classification problem with two classes, 1 and 0. The data has 2 features, X and Y. The points are scattered on the X-Y plane as shown below.

Decision tree 1

Our job is to make a tree that asks yes or no questions to a feature in order to create classification boundaries. Consider the tree below:

decision tree example

The tree has a ‘Root Node’ which is ‘X>10’. If yes, then the point lands at the leaf node with class 1. Else it goes to the other node where it is asked if its Y value is <20. Depending on the answer, it goes to either of the leaf nodes. Boundaries would look something like –

dt example graph

How do we decide which feature should be chosen to bifurcate the data? The concept of ‘purity’ is used here. Basically, we measure how pure (pure in 0s or pure in 1s) the data becomes on both sides compared to the node from which it was split. For example, suppose we have 50 1s and 50 0s at some node. If, after splitting, we have 40 1s and 10 0s on one side and 10 1s and 40 0s on the other, then we have a good split (one node is purer in 1s and the other in 0s). This goodness of a split is quantified using the concept of Information Gain. Details can be found here.
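Here is a tiny worked sketch of the purity idea using entropy: it computes the information gain of the 50/50 → (40/10, 10/40) split used as the example above.

```python
import numpy as np

def entropy(n_ones, n_zeros):
    """Entropy (in bits) of a node containing n_ones 1s and n_zeros 0s."""
    probs = np.array([n_ones, n_zeros]) / (n_ones + n_zeros)
    probs = probs[probs > 0]                               # 0 * log(0) is treated as 0
    return -(probs * np.log2(probs)).sum()

parent = entropy(50, 50)                                   # 1.0 bit: completely impure
children = 0.5 * entropy(40, 10) + 0.5 * entropy(10, 40)   # weighted by the fraction of points per side
print(f"information gain = {parent - children:.3f}")       # about 0.28 bits, so a good split
```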

Conclusion

If you have come this far, awesome job! You now have a fair understanding of basic ML algorithms along with their applications in Python. With this foundation, you can more easily tackle advanced algorithms like Neural Nets, SVMs, XGBoost and many others.

Happy learning!

Building Blocks of Decision Tree

It’s been said that Data Scientist is the “sexiest job title of the 21st century.” One main reason is the humongous amount of data now available – we are producing data at a rate never seen before. Along with this dramatic access to data, we have sophisticated algorithms such as decision trees, random forests, etc. With so much data available, the most intricate part is selecting the correct algorithm to solve the problem. Each model has its own pros and cons and should be selected depending on the type of problem at hand and the data available.

Decision Trees:

The aim of this blog post is to discuss one of the most widely used Machine Learning algorithms: the Decision Tree. As the name suggests, it uses a tree-like model to make decisions, as shown in the figure below. A decision tree is drawn upside down, with its root at the top. A question is asked at every node, based on which the tree splits into branches. The end of a branch that doesn’t split further is called a leaf.


Decision tree | dimensionless

Decision Trees can be used for classification as well as regression problems. That’s why they are also called Classification and Regression Trees (CART). In the above example, a decision tree is being used for a classification problem: deciding whether a person is fit or unfit. The depth of the tree refers to the length of the path from the root node to a leaf.

For basics of the decision tree, refer:

https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052

Have you ever wondered why, when there are so many sophisticated algorithms available (such as neural networks) that are better in terms of metrics such as accuracy, decision trees remain one of the most widely used algorithms?

The biggest advantage of decision trees is interpretability. Let’s talk about neural networks to understand this. To keep the concept simple, consider a neural network as a “black box”: a set of input data goes in, and the output corresponding to that input comes out. What’s inside the black box? A computational unit made of several hidden layers, depending on the intricacy of the problem, and a large amount of data is required to train those hidden layers. As the number of hidden layers increases, the complexity of the network increases significantly, and it becomes very hard to interpret its output. That’s where the importance of decision trees lies. Their interpretability helps humans understand what’s happening inside the black box, which can significantly help improve a model in terms of accuracy, avoiding overfitting, and so on.

Other advantages of decision trees include: a nonlinear relationship between the features and the target doesn’t affect tree performance, decision trees implicitly perform feature selection, and they require minimal effort for data cleaning.

As already discussed, every algorithm has its pros and cons. Disadvantages of decision trees include poor performance when the tree is overfitted to the data and cannot generalize well. Decision trees can also be unstable under small variations in the data, so this variance is usually reduced by methods such as bagging and boosting.

If you have ever implemented decision trees, have you thought about what happens in the background when you build one using scikit-learn in Python? Let’s understand the nitty-gritty of decision trees, i.e. the various functions that run in the background, such as train-test split, checking the purity of data, classifying data and calculating the overall entropy of data. We will do this by implementing a decision tree from scratch, i.e. with the help of numpy and pandas (without using scikit-learn).

Here, I am using the Titanic dataset to build a decision tree. The tree needs to be trained to classify whether a passenger survived or not based on parameters such as age, gender and Pclass. Note that the Titanic dataset contains variables such as passenger name and ticket number which are dropped because they are just identifiers and don’t add value to the decision tree. This process is formally called “Feature Selection.”

Decision tree - 2 | dimensionless 

Data Cleaning:

The first step toward building any ML model is data cleaning. It is one of the most important steps because a model built on unformatted data can perform significantly worse. I am going to use the Titanic dataset to build the decision tree. The aim of this model is to classify passengers as survived or not based on the information given. The first step is to load the dataset and clean it. Cleaning the data consists of 2 steps:

  1. Drop the variables that are of least importance to the decision. In the Titanic dataset, columns such as name, cabin number and ticket number are of least importance, so they have been dropped.

  2. Fill in all the missing values, i.e. replace NAs with the most suitable measure of central tendency. NAs are replaced with the mean for continuous variables and the mode for categorical variables.

Once the data is clean, we will build the helper functions which will help us write the main algorithm; a minimal sketch of both cleaning steps is shown below.
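A minimal sketch of the two cleaning steps, assuming the standard Kaggle Titanic column names (PassengerId, Name, Ticket, Cabin, Age, Embarked, Survived); the tiny inline DataFrame stands in for the real CSV file.

```python
import numpy as np
import pandas as pd

# stand-in for the Titanic file; in practice you would pd.read_csv(...) it
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Name": ["A", "B", "C", "D"],
    "Ticket": ["t1", "t2", "t3", "t4"],
    "Cabin": [None, "C85", None, "C123"],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Sex": ["male", "female", "female", "female"],
    "Embarked": ["S", "C", np.nan, "S"],
    "Survived": [0, 1, 1, 1],
})

# Step 1: drop identifier-like columns that don't help the model
df = df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])

# Step 2: fill missing values - mean for continuous, mode for categorical
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

print(df.isna().sum())   # no missing values left
```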

Train-Test Split:

We are going to divide the dataset into 2 sets, i.e. a train dataset and a test dataset. I have kept the train-test split ratio at 80:20; however, this could be different. The best practice is to keep the test dataset small compared to the train dataset, depending on the size of the data, but not so small that it is no longer representative of the population.

In the case of large datasets, the data is divided into 3 parts: a training set, a validation set and a test set. The training data is used to train the model, the validation set is used to tune the model in terms of parameters such as accuracy and overfitting, and once the model is verified for optimal accuracy, it is evaluated on the test set.
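A from-scratch 80:20 split, in the spirit of the article’s train-test split helper (the exact function signature below is our assumption). The tiny DataFrame stands in for the cleaned Titanic data.

```python
import random
import pandas as pd

def train_test_split(df, test_size=0.2, seed=0):
    """Randomly hold out `test_size` of the rows as the test set."""
    random.seed(seed)
    test_indices = random.sample(list(df.index), k=round(test_size * len(df)))
    return df.drop(index=test_indices), df.loc[test_indices]

# tiny stand-in for the cleaned Titanic DataFrame
df = pd.DataFrame({"Age": range(10), "Survived": [0, 1] * 5})
train_df, test_df = train_test_split(df, test_size=0.2)
print(len(train_df), len(test_df))                       # 8 2
```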

Check Data Purity and classify:

As shown in the block diagram, the data-purity check inspects a subset of the data for purity, i.e. whether it contains only one class label (for Titanic, whether every passenger in that subset has the same outcome). If so, it classifies the data. If the subset contains a mix of labels, the algorithm tries to ask the question that most accurately segregates the classes. This is done by implementing functions such as potential splits, split data and calculate overall entropy. Let’s understand each of them in detail; a sketch of the purity check and classification step is shown below.
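A possible sketch of the purity and classification helpers just described, assuming (as in the rest of this walkthrough) that the data is a numpy array whose last column is the label; the function names mirror the ones the article mentions.

```python
import numpy as np

def check_purity(data):
    """True if every row in this chunk of data has the same label."""
    labels = data[:, -1]                        # assumption: label is the last column
    return len(np.unique(labels)) == 1

def classify_data(data):
    """Return the majority label of this chunk of data."""
    labels, counts = np.unique(data[:, -1], return_counts=True)
    return labels[counts.argmax()]

# toy rows: Age, Sex (0 = male, 1 = female), Pclass, Survived
data = np.array([[22, 0, 3, 0],
                 [38, 1, 1, 1],
                 [26, 1, 3, 1]])
print(check_purity(data))     # False - both outcomes are present
print(classify_data(data))    # 1     - the majority outcome
```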

 

 

Potential Splits Function:

The potential splits function gives all the possible splits for each variable. In the Titanic dataset, there are 2 kinds of variables, categorical and continuous, and the two types are handled differently.

For a categorical variable, each unique value is taken as a possible split. Take gender: it can only take 2 values, male or female, so the possible splits are male and female. The question is simple: is Gender == “Male”? In this way, the decision tree can segregate the population based on gender.

A continuous variable can take on any value, so a potential split is placed exactly at the midpoint between two consecutive values in the dataset. Let’s understand this with the “Fare” variable. Suppose Fare = {10, 20, 30, 40, 50}. The potential splits for Fare will be 15, 25, 35, 45. If we ask the question “Is Fare <= 25?”, we can segregate the data effectively. Since Fare can take any value between 10 and 50 in this case, we can’t treat a continuous variable the same way as a categorical one.

Once the potential splits function gives all the potential splits for all the variables, the split data function is used to split the data on each potential split. It divides the data into two halves, data above and data below, and the entropy is calculated for each split. A sketch of both functions is shown below.
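A sketch of the two helpers just described. The representation (numpy array with the label in the last column, plus a list of feature types) is our assumption, carried over from the previous sketches.

```python
import numpy as np

def get_potential_splits(data, feature_types):
    """All candidate questions, per column: midpoints between sorted unique
    values for continuous features, each unique value for categorical ones."""
    splits = {}
    for column in range(data.shape[1] - 1):              # last column is the label
        values = np.unique(data[:, column])
        if feature_types[column] == "continuous":
            splits[column] = [(values[i] + values[i + 1]) / 2
                              for i in range(len(values) - 1)]
        else:
            splits[column] = list(values)
    return splits

def split_data(data, column, value, feature_type):
    """Split the data into the rows answering 'yes' and 'no' to the question."""
    if feature_type == "continuous":
        mask = data[:, column] <= value                  # e.g. "Is Fare <= 25?"
    else:
        mask = data[:, column] == value                  # e.g. "Is Sex == male?"
    return data[mask], data[~mask]

fares = np.array([[10, 0], [20, 1], [30, 1], [40, 0], [50, 1]], dtype=float)
print(get_potential_splits(fares, ["continuous"]))       # {0: [15.0, 25.0, 35.0, 45.0]}
data_below, data_above = split_data(fares, 0, 25.0, "continuous")
print(len(data_below), len(data_above))                  # 2 3
```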

 

Measure of Impurity:

There are two common measures of impurity: Gini impurity and information gain entropy. Both work similarly, and the choice of impurity measure has little impact on the performance of the decision tree. However, entropy is computationally more expensive since it involves logarithms, which is the main reason Gini impurity is used more widely. Below are the formulae for both:

  1. Gini: Gini(E) = 1 − Σ pj², summed over the c classes j = 1…c
  2. Entropy: H(E) = − Σ pj log(pj), summed over the c classes j = 1…c

I have used information gain entropy as the measure of impurity. You can use either of them, as both give pretty much the same results, especially in the case of CART, as shown in the figure below. Entropy, in simpler terms, is a measure of randomness or uncertainty. For each potential split, the entropy is calculated, and the best potential split is the one with the lowest overall entropy: the lower the entropy, the lower the uncertainty about the class of the data points.

method of impurity measure | Gini Impurity

Refer to below link for the comparison between the two methods of impurity measure:

https://github.com/rasbt/python-machine-learning-book/blob/master/faq/decision-tree-binary.md

To calculate entropy in the Titanic example, the calculate entropy and calculate overall entropy functions are used; they implement the equations explained above. Once the data is split into two halves by the split data function, the overall entropy is calculated for each potential split, and the split with the lowest overall entropy is selected as the best split with the help of the determine best split function, sketched below.
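A sketch of the entropy helpers and the best-split search described above. It reuses split_data from the potential-splits sketch and the same array layout (label in the last column); the exact signatures are our assumptions.

```python
import numpy as np

def calculate_entropy(data):
    """Entropy of the labels in one chunk of data (label = last column)."""
    _, counts = np.unique(data[:, -1], return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def calculate_overall_entropy(data_below, data_above):
    """Entropy of a split, weighted by how many points fall on each side."""
    n = len(data_below) + len(data_above)
    return (len(data_below) / n * calculate_entropy(data_below)
            + len(data_above) / n * calculate_entropy(data_above))

def determine_best_split(data, potential_splits, feature_types):
    """Try every candidate question; keep the one with lowest overall entropy."""
    best_column, best_value, best_entropy = None, None, float("inf")
    for column, values in potential_splits.items():
        for value in values:
            below, above = split_data(data, column, value, feature_types[column])
            if len(below) == 0 or len(above) == 0:       # skip splits that separate nothing
                continue
            overall = calculate_overall_entropy(below, above)
            if overall < best_entropy:
                best_column, best_value, best_entropy = column, value, overall
    return best_column, best_value
```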

Determine type of feature:

The determine type of feature function determines whether a feature is categorical or continuous. There are 2 criteria for a feature to be treated as categorical: first, the feature is of data type string, and second, the number of categories for the feature is less than 10. Otherwise, the feature is said to be continuous.

The output of determine type of feature acts as input to the potential splits function, because potential splits handles categorical and continuous data differently, as discussed above. A sketch is given below.
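A sketch of this helper using the two criteria above (string dtype, or fewer than 10 unique values). The exact signature and the assumption that the label sits in the last column are ours.

```python
import numpy as np
import pandas as pd

def determine_type_of_feature(df):
    """Label each feature column 'categorical' or 'continuous' using the two
    criteria above: string values, or fewer than 10 unique values."""
    feature_types = []
    for column in df.columns[:-1]:                       # assumption: label is the last column
        unique_values = df[column].unique()
        if isinstance(unique_values[0], str) or len(unique_values) < 10:
            feature_types.append("categorical")
        else:
            feature_types.append("continuous")
    return feature_types

df = pd.DataFrame({"Sex": ["male", "female"] * 10,
                   "Fare": np.linspace(5, 80, 20),
                   "Survived": [0, 1] * 10})
print(determine_type_of_feature(df))                     # ['categorical', 'continuous']
```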

 

Decision Tree Algorithm:

Since we have built all the helper functions, it’s now time to build the decision tree algorithm.

The target decision tree should look like this:

target decision tree | Dimensionless

 

 

As shown in the above diagram, the decision tree consists of several sub-trees. Each sub-tree is a dictionary where the question is the key, and the value holds the two answers to that question, i.e. the “yes” answer and the “no” answer.

Representation of sub-tree is given as follows:

sub_tree= {“question”: [“yes_answer”, “no_answer”]}

The decision tree algorithm is the main algorithm, which trains the model with the help of the helper functions we built previously. First, the train-test split function divides the dataset into train data and test data. Then the data-purity and classify functions check the purity of a chunk of data and classify it if it is pure. Otherwise, the potential splits function gives all the potential splits for all variables, the overall entropy is calculated for each potential split, the split with the lowest overall entropy is selected, and the split data function splits the data into two parts. In this way a sub-tree is built, and a series of sub-trees constitutes our target decision tree. A sketch of this recursion is given below.
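Putting the helpers from the previous sketches together, a minimal version of the recursive algorithm could look like the following. The question strings use column indices rather than real column names, and max_depth is an illustrative stopping rule; both are simplifying assumptions rather than the article’s exact implementation.

```python
# Builds on check_purity, classify_data, get_potential_splits,
# determine_best_split and split_data from the earlier sketches.
def decision_tree_algorithm(df, feature_types, counter=0, max_depth=5):
    """Recursively build the nested-dictionary tree described above."""
    data = df.values if counter == 0 else df             # work on a numpy array internally

    # base cases: the chunk is pure, or the tree is deep enough -> classify it
    if check_purity(data) or counter == max_depth:
        return classify_data(data)

    # recursive part: find the best question and split on it
    counter += 1
    potential_splits = get_potential_splits(data, feature_types)
    column, value = determine_best_split(data, potential_splits, feature_types)
    if column is None:                                    # no useful split was found
        return classify_data(data)
    data_yes, data_no = split_data(data, column, value, feature_types[column])

    if feature_types[column] == "continuous":
        question = f"feature_{column} <= {value}"
    else:
        question = f"feature_{column} == {value}"

    yes_answer = decision_tree_algorithm(data_yes, feature_types, counter, max_depth)
    no_answer = decision_tree_algorithm(data_no, feature_types, counter, max_depth)
    return {question: [yes_answer, no_answer]}            # sub_tree = {question: [yes, no]}
```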

Once the model is trained on the training dataset, the performance of the decision tree is verified on the test dataset with the help of the classify function, and the performance is measured as accuracy using the calculate accuracy function. The performance can be improved further by pruning the tree based on its maximum depth. The accuracy of this model comes out to be about 77%.
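Finally, a sketch of the classify and accuracy helpers mentioned above. It assumes the question-string format produced by the previous sketch; the 77% figure is the article’s reported result, not something this toy code reproduces.

```python
import numpy as np

def classify_example(example, tree):
    """Walk one row of the test data down the nested-dictionary tree."""
    question = list(tree.keys())[0]                      # e.g. "feature_2 <= 25.0"
    name, comparison, value = question.split(" ")
    column = int(name.replace("feature_", ""))

    if comparison == "<=":
        branch = 0 if float(example[column]) <= float(value) else 1
    else:                                                # "=="
        branch = 0 if str(example[column]) == value else 1
    answer = tree[question][branch]

    # a leaf holds a class label; a dict is another sub-tree to descend into
    return answer if not isinstance(answer, dict) else classify_example(example, answer)

def calculate_accuracy(test_data, test_labels, tree):
    predictions = [classify_example(row, tree) for row in test_data]
    return (np.array(predictions) == np.array(test_labels)).mean()
```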