The machine learning field is governed by a simple principle known as the "No Free Lunch" theorem. According to it, no single ML algorithm works best for every problem. In other words, you cannot conclude that SVM is a better algorithm than decision trees or linear regression. The choice of algorithm depends on the problem at hand and on factors like the size and structure of the dataset. Hence, you should try several algorithms to find the best fit for your use case.
In this blog, we are going to look at the top machine learning algorithms. You should know and implement the following algorithms to find out which one works best for your use case.
Top 10 Best Machine Learning Algorithms
1. Linear Regression
Regression is a method used to predict numeric values. It is a statistical technique that tries to determine the strength of the relationship between a dependent (label) variable and one or more independent (predictor) variables. Just as classification is used to predict categorical labels, regression is used to predict continuous values. For example, we might want to anticipate the salary of graduates with 5 years of work experience, or the potential sales of a new product. Regression is often used to determine how the price of an item is affected by factors such as production cost, interest rates, or the specific industry or sector.
Linear regression tries to model the relationship between a scalar dependent variable and one or more explanatory variables with a linear equation. For instance, using a linear regression model, one might want to relate people's weights to their heights.
The model is often selected using the Akaike information criterion (AIC), a measure of the relative goodness of fit of a statistical model. It is based on the notion of entropy: in effect, it quantifies the relative amount of information lost when a given model is used to represent reality. It describes the trade-off between bias and variance in model building, or equivalently between the accuracy and the complexity of the model.
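As a rough illustration of the height/weight example above, here is a minimal sketch using scikit-learn (assumed to be installed); the numbers are made up purely for demonstration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: heights in cm (feature) and weights in kg (target)
heights = np.array([[150], [160], [165], [170], [180], [185]])
weights = np.array([50, 58, 62, 66, 75, 80])

model = LinearRegression()
model.fit(heights, weights)          # learn slope and intercept

print("slope:", model.coef_[0])      # change in weight per cm of height
print("intercept:", model.intercept_)
print("predicted weight at 175 cm:", model.predict([[175]])[0])
```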
2. Logistic Regression
Logistic regression is a classification algorithm that predicts a categorical outcome variable, which can take one of a restricted set of class values, from input features. Binomial logistic regression is limited to two binary outputs; more than two classes can be handled with multinomial logistic regression. For example, classifying conditions as 'healthy'/'not healthy', or vehicles as 'bike'/'car'/'truck', are logistic regression problems. Logistic regression produces a class prediction from a weighted combination of the input features by passing it through the logistic sigmoid function.
A logistic regression model estimates the probability of a dependent variable based on independent factors. The dependent variable is the output we want to forecast, whereas the independent or explanatory variables may affect that output. Multiple regression refers to a regression with two or more independent variables. Multivariate regression, on the other hand, refers to a regression with two or more dependent variables.
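A minimal sketch of binary logistic regression with scikit-learn; the features and labels below are invented for illustration. Multinomial problems are handled by the same class when the labels contain more than two values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical binary problem: two input features, labels 0 or 1
X = np.array([[2.0, 1.0], [1.5, 3.0], [3.0, 2.5], [6.0, 5.5], [7.0, 6.0], [8.0, 7.5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# The model outputs a probability via the sigmoid, then a class label
print(clf.predict_proba([[4.0, 4.0]]))  # [P(class 0), P(class 1)]
print(clf.predict([[4.0, 4.0]]))        # predicted class
```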
3. Linear Discriminant Analysis
Logistic regression is traditionally an algorithm for two-class classification problems. If you have more than two classes, the Linear Discriminant Analysis (LDA) algorithm is the preferred linear classification technique. It consists of statistical properties of the data, calculated for each class.
For a single input variable this includes:
The mean value for each class.
The variance calculated across all classes.
Predictions are made by computing a discriminant value for each class and predicting the class with the highest value. The method assumes that the data follow a Gaussian (bell curve) distribution, so it is a good idea to remove outliers from your data beforehand. It is a simple and powerful method for classification predictive modeling problems.
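Here is a minimal LDA sketch on a made-up three-class problem; the per-class means and variance described above are estimated internally by scikit-learn.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical data: one feature, three classes (0, 1, 2)
X = np.array([[1.0], [1.2], [0.8], [5.0], [5.3], [4.7], [9.0], [9.2], [8.8]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# A discriminant value is computed per class; the highest one wins
print(lda.predict([[4.9], [8.5]]))   # expected: classes 1 and 2
```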
4. Classification and Regression Trees
Prediction trees are used to forecast a response or class Y from inputs X1, X2, …, Xn. If the response is continuous the tree is called a regression tree; if it is categorical, it is called a classification tree. At each node of the tree we inspect the value of one input Xi and, based on the (binary) answer, proceed to the left or right sub-branch. When we reach a leaf, we read off the prediction (generally the leaf holds a simple summary statistic of the training points that fall into it, such as the most common class). In contrast to global models such as linear or polynomial regression, where a single predictive formula is supposed to hold over the whole data space, trees try to partition the data space into regions small enough that a simple, different model can be applied to each one. For any data point x, the non-leaf part of the tree is simply the procedure used to determine which model (i.e. which leaf) is used to make its prediction.
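A minimal sketch of a classification tree and a regression tree with scikit-learn; the feature values and targets are invented toy data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = np.array([[1.0, 5.0], [2.0, 4.5], [8.0, 1.0], [9.0, 0.5]])

# Classification tree: categorical response
y_class = np.array(["A", "A", "B", "B"])
clf = DecisionTreeClassifier(max_depth=2).fit(X, y_class)
print(clf.predict([[1.5, 5.0]]))     # splits the feature space, then reads the leaf

# Regression tree: continuous response
y_reg = np.array([10.0, 11.0, 30.0, 31.0])
reg = DecisionTreeRegressor(max_depth=2).fit(X, y_reg)
print(reg.predict([[8.5, 0.8]]))     # the leaf stores a simple average of its points
```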
5. Naive Bayes
A Naive Bayes classifier is a supervised machine learning algorithm that uses Bayes' theorem and assumes statistical independence of the features. The theorem relies on the naive premise that the input features are independent of each other, that is, knowing the value of one feature tells us nothing about the values of the others. Despite this assumption, it has proven to be a classifier with excellent results. Bayes' theorem, built on conditional probability, is used in the Naive Bayes classifier to compute the probability of an event (A) occurring given that another event (B) has already occurred. In essence, the theorem allows the hypothesis to be updated every time new evidence is presented.
The equation below expresses Bayes' Theorem in the language of probability:
P(A | B) = P(B | A) * P(A) / P(B)
Let’s explain what each of these terms means.
“P” is the symbol to denote probability.
P(A | B) = The probability of event A (hypothesis) occurring given that B (evidence) has occurred.
P(B | A) = The probability of the event B (evidence) occurring given that A (hypothesis) has occurred.
P(A) = The probability of event A (hypothesis) occurring.
P(B) = The probability of event B (evidence) occurring.
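The sketch below assumes Gaussian-distributed features and uses scikit-learn's GaussianNB, which applies exactly this prior-times-evidence update; the data is fabricated for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical data: two continuous features, two classes
X = np.array([[1.0, 2.1], [1.2, 1.9], [0.9, 2.0], [4.0, 5.1], [4.2, 4.9], [3.9, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

nb = GaussianNB()
nb.fit(X, y)

# Posterior P(class | features) for a new point, per Bayes' theorem
print(nb.predict_proba([[1.1, 2.0]]))
print(nb.predict([[1.1, 2.0]]))
```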
6. K-Nearest Neighbors
KNN is a simple machine learning algorithm that classifies an input using its closest neighbours. For instance, a k-NN algorithm could be given data points recording the height and weight of particular men and women, as shown below. To determine the gender of an unknown input (the green point), k-NN looks at its k nearest neighbours and checks, for example, whether the majority of them are male. This technique is extremely simple and intuitive, and it achieves a strong success rate at labelling unseen inputs.
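Sticking with the height/weight illustration, here is a minimal k-NN sketch with scikit-learn; the measurements and labels are made up.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: [height in cm, weight in kg], labelled by gender
X = np.array([[170, 65], [175, 70], [180, 80], [160, 52], [158, 50], [165, 55]])
y = np.array(["male", "male", "male", "female", "female", "female"])

knn = KNeighborsClassifier(n_neighbors=3)   # look at the 3 closest neighbours
knn.fit(X, y)

# Classify an unknown person (the "green point")
print(knn.predict([[168, 60]]))
```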
k-NN is used in a range of machine learning tasks; for example, k-NN can help in computer vision with hand-written letter recognition, and the algorithm is used in gene expression analysis to identify genes that contribute to a particular trait. Overall, nearest neighbours offer a mix of simplicity and effectiveness that makes them an appealing algorithm for many machine learning tasks.
7. Learning Vector Quantization
8. Bagging and Random Forest
A Random Forest is an ensemble of simple tree predictors, each of which produces a response when presented with a set of predictor values. For classification problems, this response takes the form of a class membership, which associates, or classifies, a set of independent predictor values with one of the categories of the dependent variable. For regression problems, the tree response is an estimate of the dependent variable given the predictors. The Random Forest algorithm was developed by Breiman.
A random forest uses an arbitrary number of simple trees to determine the final outcome. For classification problems, the ensemble of simple trees votes for the most popular class. For regression problems, their responses are averaged to obtain an estimate of the dependent variable. With tree ensembles, prediction accuracy (i.e. the ability to generalise to new data instances) can improve considerably.
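A minimal random forest sketch with scikit-learn; for classification the trees vote, for regression their outputs are averaged. The toy data is invented.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.2, 2.2], [5.5, 8.5]])

# Classification: the ensemble of trees votes for the most popular class
y_class = np.array([0, 0, 1, 1, 0, 1])
rf_clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y_class)
print(rf_clf.predict([[5.2, 8.2]]))

# Regression: the trees' responses are averaged
y_reg = np.array([1.1, 1.0, 9.8, 10.2, 1.2, 10.0])
rf_reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y_reg)
print(rf_reg.predict([[5.2, 8.2]]))
```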
9. Support Vector Machines
The support vector machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression. SVMs are more commonly used for classification problems, so that is what we will focus on in this article. SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes, as shown in the image below.
You can think of a hyperplane as a line that linearly separates and classifies a set of data.
Intuitively, the further our data points lie from the hyperplane, the more confident we are that they have been classified correctly. We therefore want our data points to be as far as possible from the hyperplane, while still being on the correct side of it.
So when new test data is added, whichever side of the hyperplane it lands on determines the class we assign to it.
The distance between the hyperplane and the nearest data point is called the margin. The aim is to choose a hyperplane with the greatest possible margin between it and any point in the training set, so that new data has a better chance of being classified correctly.
But the data is rarely ever as clean as our simple example above. A dataset will often look more like the jumbled balls below which represent a linearly non-separable dataset.
To classify a dataset like the one above, it is essential to move from a 2D view to a 3D view. Another simplified example makes this easier to explain. Imagine our two sets of coloured balls are lying on a sheet and the sheet is lifted suddenly, launching the balls into the air. While the balls are in the air, you use the sheet to separate them. This "lifting" of the balls represents mapping the data into a higher dimension. This is known as kernelling.
Because we are now in three dimensions, our hyperplane can no longer be a line. It must now be a plane, as shown in the example above. The idea is to keep mapping the data into higher and higher dimensions until a hyperplane can be formed to separate it.
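The sketch below shows the idea with scikit-learn's SVC: an RBF kernel implicitly performs the "kernelling" described above when the classes cannot be split by a straight line. The toy data (a ring around a cluster) is made up.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical linearly non-separable data: class 1 forms a ring around class 0
rng = np.random.RandomState(0)
angles = rng.uniform(0, 2 * np.pi, 40)
inner = rng.uniform(0, 1, (40, 2))                     # class 0: small cluster
outer = np.c_[3 * np.cos(angles), 3 * np.sin(angles)]  # class 1: surrounding ring
X = np.vstack([inner, outer])
y = np.array([0] * 40 + [1] * 40)

# The RBF kernel implicitly maps the data into a higher dimension
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

print(clf.predict([[0.2, 0.3], [3.0, 0.0]]))   # expected: [0, 1]
```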
10. Boosting and AdaBoost
Boosting is an ensemble technique that tries to build a strong classifier from a set of weak classifiers. This is done by building a model from the training data, then creating a second model that attempts to correct the errors of the first. Models are added until the training set is predicted perfectly or a maximum number of models has been reached.
AdaBoost was the first truly successful boosting algorithm for binary classification. It is the best starting point for understanding boosting. Most modern boosting techniques, such as stochastic gradient boosting, build on AdaBoost.
AdaBoost is used with short decision trees. After the first tree is created, the tree's performance on each training instance is used to weight how much attention the next tree should pay to that instance. Training data that is hard to predict is given more weight, while instances that are easy to predict receive less. Models are created sequentially, one after another, each updating the weights on the training instances, which in turn affects the learning performed by the next tree. After all the trees have been built, predictions are made for new data, and each tree's contribution is weighted by how accurate it was on the training data.
Because the algorithm pays so much attention to correcting errors, it is essential that the data is clean and that outliers are removed.
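A minimal AdaBoost sketch with scikit-learn; by default it boosts short decision trees (stumps), re-weighting hard-to-predict training points as described above. The data is invented.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Hypothetical two-class data with one feature
X = np.array([[1.0], [2.0], [3.0], [6.0], [7.0], [8.0], [2.5], [6.5]])
y = np.array([0, 0, 0, 1, 1, 1, 0, 1])

# Each new weak learner focuses on the examples the previous ones got wrong
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
boost.fit(X, y)

print(boost.predict([[2.2], [7.2]]))   # expected: [0, 1]
```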
In the end, every beginner in data science has the same basic question: which algorithm is best for all cases? The answer is not straightforward and depends on many factors, such as the size, quality and nature of the data; the available computation time; the urgency of the task; and what you want to do with the data.
Even an experienced data scientist cannot say which algorithm will perform best before trying several of them. While many other machine learning algorithms exist, these are the most popular ones. If you are a beginner in machine learning, they are a good starting point.
When I started my Data Science journey, I casually Googled 'Application of Machine Learning Algorithms'. For the next 10 minutes, I had my jaw hanging. Quite literally. Part of the reason was that they were all around me. That music video from YouTube recommendations that you ended up playing a hundred times on loop? That's Machine Learning for you. Ever wondered how Google keyboard completes your sentence better than your bestie ever could? Machine Learning again!
So how does this magic happen? What do you need to perform this witchcraft? Before you move further, let me tell you who would benefit from this article.
Someone who has just begun her/his Data Science journey and is looking for theory and application on the same platter.
Someone who has a basic idea of probability and linear algebra.
Someone who wants a brief mathematical understanding of ML and not just a small talk like the one you did with your neighbour this morning.
Someone who aims at preparing for a Data Science job interview.
‘Machine Learning’ literally means that a machine (in this case an algorithm running on a computer) learns from the data it is fed. For example, say you have customer data for a supermarket. The data consists of each customer's age, gender, time of entry and exit, and total purchase. You train a Machine Learning algorithm to learn the purchase pattern of customers and predict the purchase amount for a new customer from their age, gender, and time of entry and exit.
Now, Let’s dig deep and explore the workings of it.
Machine Learning (ML) Algorithms
Before we talk about the various classes, let us define some terms:
Seen data or Train Data –
This is all the information we have. For example, data of 1000 customers with their age, gender, time of entry and exit and their purchases.
Predicted Variable (or Y) –
The ML algorithm is trained to predict this variable. In our example, the ‘Purchase amount’. The predicted variable is usually called the dependent variable.
Features (or X) –
Everything in the data except for Y. Basically, the input that is fed to the model. Features are usually called the independent variable.
Model Parameters –
Parameters define our ML model. This will be understood later as we discuss each model. For now, remember that our main goal is to evaluate these parameters.
Unseen data or Test Data –
This is the data for which we have the X but not the Y. The Y has to be predicted using the ML model trained on the seen data.
Now that we have defined our terms, let’s move to the classes of Machine Learning or ML algorithms.
Supervised Learning Algorithms:
These algorithms require you to feed the data along with the predicted variable. The parameters of the model are then learned from this data in such a way that error in prediction is minimized. This will be more clear when individual algorithms are discussed.
Unsupervised Learning Algorithms:
These algorithms do not require data with predicted variables. Then what do we predict? Nothing. We just cluster these data points.
If you have any doubts about the things discussed above, keep on reading. It will get clearer as you see examples.
Here is a general strategy for setting the parameters of any ML algorithm. You take out a small part of your training (seen) data, say 20%. You train the ML model on the remaining 80% and then check its performance on that 20% of the data (remember, you have the Y values for this 20%). You tweak the parameters until you get the minimum error – this is the cross-validation idea referred to later in this article. Take a look at the flowchart below.
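In code, that hold-out strategy might look like the following sketch with scikit-learn's train_test_split; the customer-style data and the choice of model (k-NN) are just placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical features (X) and predicted variable (Y)
X = np.random.RandomState(0).rand(1000, 4)       # e.g. age, gender, entry, exit
Y = (X[:, 0] + X[:, 3] > 1.0).astype(int)        # a made-up target

# Keep 20% aside as the validation (hold-out) set
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2, random_state=42)

# Try different parameter values and keep the one with the best hold-out score
for k in (1, 3, 5, 7):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, Y_train)
    print(k, model.score(X_val, Y_val))          # accuracy on the 20% we held out
```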
Supervised Learning Algorithms
In Supervised Machine Learning, there are two types of prediction tasks – Regression and Classification. Classification means predicting the class of a data point. For example – Gender, Type of flower, Whether a person will pay their credit card bill or not. The predicted variable has 2 or more possible discrete values. Regression means predicting a numeric value for a data point. For example – Purchase amount, Age of a person, Price of a house, Amount of predicted rainfall, etc. Here, the predicted variable is continuous. Some algorithms perform only one of these tasks, while others can be used for both. I will mention which for each algorithm we discuss. Let's start with the simplest one and slowly move to more complex algorithms.
KNN: K-Nearest Neighbours
“You are the average of the five people you surround yourself with” – Jim Rohn
Congratulations! You just learned your first ML algorithm.
Don’t believe me? Let’s prove it!
Consider the case of classification. Let's set K, the number of closest neighbours to be considered, equal to 3. We have 6 seen data points whose features are the height and weight of individuals, and whose predicted variable is whether or not they are obese.
Consider a point from the unseen data (in green). Our algorithm has to predict whether the person represented by the green data point is obese or not. If we consider its K (=3) nearest neighbours, we have 2 obese (blue) and one not obese (orange) data points. We take the majority vote of these 3 neighbours, which is 'Yes'. Hence, we predict this individual to be obese. In the case of regression, everything remains the same, except that we take the average of the Y values of our K neighbours. How do we set the value of K? Using cross-validation.
Euclidean distance is used for computing the distance between continuous variables. If the data has categorical variables (gender, for example), Hamming distance is used for those variables. There are many other popular distance measures used apart from these. You can find a detailed explanation here.
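Here is a small sketch of the two distance measures mentioned above, computed by hand with NumPy; the feature vectors are illustrative.

```python
import numpy as np

# Continuous features (e.g. height, weight): Euclidean distance
a = np.array([170.0, 65.0])
b = np.array([160.0, 52.0])
euclidean = np.sqrt(np.sum((a - b) ** 2))
print("Euclidean distance:", euclidean)

# Categorical features (e.g. gender, smoker, city): Hamming distance
p = np.array(["male", "yes", "london"])
q = np.array(["female", "yes", "london"])
hamming = np.sum(p != q)          # number of positions where the values differ
print("Hamming distance:", hamming)
```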
Linear Regression
This is yet another simple but extremely powerful model. It is used only for regression. It is represented by
Y' = W0 + W1*X1 + W2*X2 + … + Wn*Xn    (1)
Y' is the value of the predicted variable according to the model. X1, X2, … Xn are the input features. W0, W1, … Wn are the parameters (also called weights) of the model. Our aim is to estimate these parameters from the training data so as to completely define the model.
How do we do that? Let's start with our objective, which is to minimize the error in the prediction of our target variable. How do we define the error? The most common way is to use the MSE, or Mean Squared Error –
MSE = (1/N) * Σ (Y - Y')²
For all N points, we average the squares of the difference between the value of Y predicted by the model, i.e. Y', and the actual value of the predicted variable for that point, i.e. Y.
We then substitute equation (1) for Y', differentiate this MSE with respect to the parameters W0, W1, … Wn, and set the derivatives equal to 0 to get the values of the parameters at which the error is minimum.
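This minimization has a closed-form solution, which NumPy's least-squares routine computes directly; the sketch below fits W0 and W1 on made-up height/weight data.

```python
import numpy as np

# Hypothetical training data
height = np.array([150.0, 160.0, 165.0, 170.0, 180.0, 185.0])
weight = np.array([50.0, 58.0, 62.0, 66.0, 75.0, 80.0])

# Design matrix with a column of ones so that W0 (the intercept) is learned too
X = np.c_[np.ones_like(height), height]

# Solve for the weights that minimize the MSE
W, *_ = np.linalg.lstsq(X, weight, rcond=None)
print("W0 (intercept):", W[0], " W1 (slope):", W[1])

# Predictions and the resulting mean squared error
Y_pred = X @ W
mse = np.mean((weight - Y_pred) ** 2)
print("MSE:", mse)
```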
An example of what a linear regression fit might look like is shown below.
It is not always the case that our dependent variable depends linearly on our independent variable. For example, Weight in the above graph may vary with the square of Height. This is called polynomial regression (Y varies with some power of X).
The good news is that any polynomial regression can be transformed into a linear regression. How?
We transform the independent variable. Take a look at the Height variable in both the tables.
We will forget about Table 1 and treat the polynomial regression problem like a linear regression problem. Only this time, Weight will be linear in Height squared (notice the x-axis in the figure below).
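A minimal sketch of that trick: square the Height column and fit an ordinary linear regression on the transformed feature. The numbers are fabricated so that Weight roughly follows Height squared.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

height = np.array([1.0, 1.2, 1.4, 1.6, 1.8, 2.0])          # metres (made up)
weight = 20.0 * height ** 2 + np.array([0.3, -0.2, 0.1, 0.4, -0.3, 0.2])

# Transform the independent variable: the model is linear in Height squared
height_squared = (height ** 2).reshape(-1, 1)

model = LinearRegression().fit(height_squared, weight)
print("coefficient on Height^2:", model.coef_[0])           # should be near 20
print("prediction for Height=1.5:", model.predict([[1.5 ** 2]])[0])
```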
A very important question about every ML model one should ask is – How do you measure the performance of the model? One popular measure is R-squared
R-squared: Intuitively, it measures how well the model explains the variation in the dependent variable. How? Consider the following question – if you had just the Y values and no X values in your data, and someone asked you, "Hey! For this X, what would you predict the Y to be?", what would be your best guess? The average of all the Y values you have! In this scenario of limited information, you are better off guessing the average of Y for any X than any other value of Y.
But, now that you have X and Y values, you want to see how well your linear regression model predicts Y for any unseen X. R-squared quantifies the performance of your linear regression model over this ‘baseline model’
R-squared = 1 - (MSE / TSE)
MSE is the mean squared error, as discussed before. TSE is the total squared error, i.e. the error of the baseline model (the same calculation with the average of Y used as the prediction for every point).
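A small sketch of the R-squared computation using those two quantities; the predictions below are placeholders (they could come from any fitted linear regression).

```python
import numpy as np

# Actual values of the predicted variable and some hypothetical model predictions
Y = np.array([50.0, 58.0, 62.0, 66.0, 75.0, 80.0])
Y_pred = np.array([51.0, 57.0, 63.0, 65.0, 76.0, 79.0])

mse = np.mean((Y - Y_pred) ** 2)          # error of our model
tse = np.mean((Y - Y.mean()) ** 2)        # error of the baseline "guess the average" model

r_squared = 1 - mse / tse
print("R-squared:", r_squared)            # close to 1 means the model beats the baseline
```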
Naive Bayes
Naive Bayes is a classification algorithm. As the name suggests, it is based on Bayes' rule.
Intuitive breakdown of Bayes' rule: consider a classification problem where we are asked to predict the class of a data point x. We have two classes, and the classes are denoted by the letter C.
Now, P(C), also known as the 'prior probability', is the probability of a data point belonging to class C when we don't have any other data about that point. For example, if we have 100 roses and 200 sunflowers and someone asks you to classify an unseen flower while providing no further information, what would you say?
P(rose) = 100/300 = ⅓
P(sunflower) = 200/300 = ⅔
Since P(sunflower) is higher, your best guess would be a sunflower. P(rose) and P(sunflower) are prior probabilities of the two classes.
Now, you have additional information about your 300 flowers. The information is related to thorns on their stem. Look at the table below.
Rose (Total 100): fraction with thorns = 9/10, without thorns = 1/10
Sunflower (Total 200): fraction with thorns = 1/3, without thorns = 2/3
Now come back to the unseen flower. You are told that this unseen flower has thorns. Let this information about thorns be X.
Now, according to Bayes' rule, the numerators (prior × likelihood) for the two classes are as follows –
Rose = 1/3*9/10 = 3/10 = 0.3
Sunflower = 2/3*1/3 = 2/9 = 0.22
The denominator, P(x), called the evidence is the cumulative probability of seeing the data point itself. In this case it is equal to 0.3 + 0.22 = 0.52. Since it does not depend on the class, it won’t affect our decision-making process. We will ignore it for our purposes.
P(Rose|X) > P(sunflower|X)
Therefore, our prediction would be that the unseen flower is a rose. Notice that the prior probabilities of the two classes favoured the sunflower. But as soon as we factored in the data about thorns, our decision changed.
If you understood the above example, you have a fair idea of the Naive Bayes Algorithm.
This simple example, where we had only one feature (the information about thorns), can be extended to multiple features. Let these features be x1, x2, x3 … xn. Bayes' rule would look like –
P(C | x1, x2, … xn) = P(x1, x2, … xn | C) * P(C) / P(x1, x2, … xn)
Note that we assume the features to be independent. Meaning,
P(x1, x2, … xn | C) = P(x1 | C) * P(x2 | C) * … * P(xn | C)
The algorithm is called 'Naive' because of the above assumption.
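To make the flower example concrete, here is a small sketch that reproduces the rose/sunflower calculation by hand; the priors and the thorn likelihoods are the ones used above.

```python
# Priors from the 300 flowers
p_rose = 100 / 300          # 1/3
p_sunflower = 200 / 300     # 2/3

# Likelihood of the evidence (has thorns) under each class, as in the example
p_thorns_given_rose = 9 / 10
p_thorns_given_sunflower = 1 / 3

# Bayes rule numerators: prior * likelihood
num_rose = p_rose * p_thorns_given_rose                  # 0.30
num_sunflower = p_sunflower * p_thorns_given_sunflower   # ~0.22

# The evidence P(x) just normalizes the two numerators
evidence = num_rose + num_sunflower
print("P(rose | thorns)      =", num_rose / evidence)
print("P(sunflower | thorns) =", num_sunflower / evidence)
print("Prediction:", "rose" if num_rose > num_sunflower else "sunflower")
```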
Logistic Regression
Logistic regression, despite its name, is used for classification purposes. The mathematical model used by logistic regression is the logistic (sigmoid) function. Consider two classes, 0 and 1.
P(y=1) = 1 / (1 + e^(-WᵀX)),  where WᵀX = w0 + w1x1 + w2x2 + … + wnxn
P(y=1) denotes the probability of belonging to class 1, and 1 - P(y=1) is therefore the probability of the data point belonging to class 0 (notice that the function's output lies between 0 and 1 for all values of WᵀX). Like other models, we need to learn the parameters w0, w1, w2, … wn to completely define the model. Just as linear regression uses MSE to quantify the loss for errors made in prediction, logistic regression uses the following loss function –
Loss = -[ Y log(P) + (1 - Y) log(1 - P) ]
P is the probability of a data point belonging to class 1, as predicted by the model. Y is the actual class of the data point.
Think about this – if the actual class of a data point is 1 and the model predicts P to be 1, we have 0 loss. This makes sense. On the other hand, if P were 0 for the same data point, the loss would blow up towards infinity. This is the worst-case scenario. This loss function is used in the gradient descent algorithm to find the parameters at which the loss is minimum.
Okay! So now we have a model that can predict the probability of an unseen data point belonging to class 1. But how do we make a decision for that point? Remember that our final goal is to assign classes, not just probabilities.
At what probability threshold do we say that a point belongs to class 1? By default, the model assigns the class according to the probabilities: if P > 0.5, the class is 1. However, we can change this threshold to maximize the metric we care about (precision, recall, …), and we can choose the best threshold using cross-validation.
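A minimal sketch of this decision step with scikit-learn: get P(y=1) from predict_proba and then apply whatever threshold works best for your metric. The data is made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature binary problem
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

p_class1 = clf.predict_proba([[4.4]])[0, 1]   # P(y=1) for a new point

threshold = 0.3                                # e.g. chosen via cross-validation to favour recall
predicted_class = int(p_class1 > threshold)
print("P(y=1) =", p_class1, "-> class", predicted_class)
```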
This was Logistic Regression for you. Of course, do follow the coding tutorial!
Decision Trees
“Suppose there exist two explanations for an occurrence. In this case, the one that requires the least speculation is usually better.” – Occam’s Razor
The above philosophical principle precisely guides one of the most popular supervised ML algorithms – the decision tree. Decision trees, unlike many other algorithms, are non-parametric: we don't necessarily need to specify any parameters to completely define the model, unlike KNN (where we need to specify K).
Let's take an example to understand this algorithm. Consider a classification problem with two classes, 1 and 0. The data has 2 features, X and Y. The points are scattered on the X-Y plane as shown below.
Our job is to make a tree that asks yes or no questions to a feature in order to create classification boundaries. Consider the tree below:
The tree has a ‘Root Node’, which asks ‘Is X > 10?’. If yes, the point lands at the leaf node with class 1. Else, it goes to the other node, where it is asked whether its Y value is < 20. Depending on the answer, it goes to one of the leaf nodes. The boundaries would look something like –
How do we decide which feature should be used to split the data? The concept of ‘purity’ is used here. Basically, we measure how pure (pure in 0s or pure in 1s) the data becomes on both sides of the split, compared to the node it was split from. For example, suppose we have 50 1s and 50 0s at some node. If, after splitting, we have 40 1s and 10 0s on one side and 10 1s and 40 0s on the other, then we have a good split (one node is purer in 1s and the other in 0s). This goodness of the split is quantified using the concept of Information Gain. Details can be found here.
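Here is a small sketch that puts numbers on the 50/50 example above, using entropy-based information gain (one common purity measure); the counts are exactly those from the paragraph.

```python
import numpy as np

def entropy(n_ones, n_zeros):
    """Entropy of a node containing n_ones 1s and n_zeros 0s (0 = perfectly pure)."""
    total = n_ones + n_zeros
    probs = np.array([n_ones, n_zeros]) / total
    probs = probs[probs > 0]                      # avoid log(0)
    return -np.sum(probs * np.log2(probs))

# Parent node: 50 ones and 50 zeros
parent = entropy(50, 50)

# After the split: (40 ones, 10 zeros) on one side, (10 ones, 40 zeros) on the other
left, right = entropy(40, 10), entropy(10, 40)
children = (50 / 100) * left + (50 / 100) * right   # weighted average of child entropies

information_gain = parent - children
print("Information gain:", information_gain)        # > 0 means the split increased purity
```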
If you have come so far, awesome job! You now have a fair level of understanding of basic ML algorithms along with their applications in Python. Now that you have a solid foundation, you can easily tackle advanced algorithms like Neural Nets, SVMs, XGBoost and many others.