Machine learning is the new buzz in the industry. Its wide range of applications makes the field highly competitive. Staying in the competition requires sound knowledge of the existing techniques and an intuition for those yet to come. Fortunately, getting familiar with the existing ones is not that difficult given the right strategy. Climbing the ladder step by step is the best way to reach the sky.

Mastering data analytics is neither that difficult nor that mathematical. You do not need a PhD to understand the fancier ML algorithms (though inventing a new one might require it). Most of us start out with regression and climb our way up. As the saying goes, “Abundant data generally belittles the importance of algorithm”. But we are not always blessed with abundance, so we need a good knowledge of all the tools and an intuitive sense of their applicability. This post aims to explain one more such tool, the **Support Vector Machine**.

## Table of contents

- What is SVM?
- How does it work?
- Implementation in R.
- Pros and Cons?
- Applications

## What is SVM?

A Support Vector Machine (SVM) is yet another supervised machine learning algorithm. It can be used for both regression and classification, but SVMs are more commonly used in classification problems (this post will focus only on classification). A Support Vector Machine is also commonly known as a “Large Margin Classifier”.

## How does it work?

### Support Vectors and Hyperplane

Before diving deep, let’s first understand “What is a Hyperplane?”. A hyperplane is a flat subspace whose dimension is one less than that of the coordinate system it is represented in.

In a 2-D space, a hyperplane is a line of the form \(A_0 + A_1 X_1 + A_2 X_2 = 0\), and in an m-D space, a hyperplane is of the form \(A_0 + A_1 X_1 + A_2 X_2 + \dots + A_m X_m = 0\).
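To make the equation concrete, here is a minimal sketch in base R that checks on which side of a 2-D hyperplane a point lies. The coefficients `A0` and `A` and the helper names are made up for illustration; they are not part of any package.

```r
# Hypothetical coefficients for the 2-D hyperplane A0 + A1*X1 + A2*X2 = 0,
# i.e. the line X1 + X2 = 4
A0 <- -4
A  <- c(1, 1)   # (A1, A2)

# Value of the left-hand side of the hyperplane equation at a point
hyperplane_value <- function(point, A0, A) A0 + sum(A * point)

# Side of the hyperplane: -1 or +1, and 0 if the point lies exactly on it
side <- function(point, A0, A) sign(hyperplane_value(point, A0, A))

side(c(1, 1), A0, A)   # a point on one side of the line
side(c(3, 3), A0, A)   # a point on the other side
side(c(2, 2), A0, A)   # a point on the line itself
```

A classifier built on a hyperplane simply assigns a class according to this sign.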

Support Vector Machines rely on some special data points, which we call “Support Vectors”, and a separating hyperplane, which is the “Support Vector Machine” itself. So, essentially, an SVM is a frontier that best segregates the classes.

Support Vectors are **the data points nearest to the hyperplane, the points of our data set which, if removed, would alter the position of the dividing hyperplane**. Since many hyperplanes can segregate the two classes, the one we choose is the hyperplane with the largest margin, that is, the largest distance to the nearest data points of either class.
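The idea of “largest margin” can be sketched in base R by comparing two candidate separating lines on a toy data set. The data and both sets of coefficients below are made up for illustration, and an actual SVM finds the best hyperplane by optimization rather than by comparing candidates like this.

```r
# Perpendicular distance from each row of X to the hyperplane A0 + A . x = 0
distance_to_hyperplane <- function(X, A0, A) {
  abs(A0 + as.vector(X %*% A)) / sqrt(sum(A^2))
}

# Toy, made-up data: two linearly separable groups in 2-D
X <- rbind(c(1, 1), c(2, 1), c(1, 2),    # class 1
           c(4, 4), c(5, 4), c(4, 5))    # class 2

# Two candidate separating lines (hypothetical coefficients)
d1 <- distance_to_hyperplane(X, A0 = -5.5, A = c(1, 1))  # X1 + X2 = 5.5
d2 <- distance_to_hyperplane(X, A0 = -4,   A = c(1, 1))  # X1 + X2 = 4

# The margin of a hyperplane is its distance to the nearest data point
margin1 <- min(d1)
margin2 <- min(d2)
c(margin1 = margin1, margin2 = margin2)

# Rows nearest to the first line: these play the role of support vectors
which(abs(d1 - margin1) < 1e-9)
```

Both lines separate the two groups, but the first one leaves more room on either side, so it is the one a large margin classifier would prefer.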

### The Kernel Trick

We are not always lucky enough to have a dataset that is linearly separable by a hyperplane. Fortunately, SVM is capable of fitting non-linear boundaries using a simple and elegant method known as the kernel trick. In simple words, it projects the data into a higher-dimensional space where it can be separated by a hyperplane; projected back to the original dimensions, that hyperplane becomes a non-linear boundary.

For example, we can imagine an extra feature \(z\) for each data point \((x, y)\), where \(z^{2} = x^{2}+y^{2}\).

We have in-built kernels like rbf, poly, etc., which project the data into higher dimensions and save us the hard work.
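The extra-feature idea can be tried by hand in base R. The sketch below generates synthetic ring-shaped data (the radii, the seed and the threshold of 1.5 are arbitrary choices for illustration), adds the feature `z` with \(z^{2} = x^{2}+y^{2}\), and separates the classes with a flat cut on `z`. A real kernel does this implicitly; here the feature is added explicitly only to show the intuition.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
n <- 100

# Synthetic data: an inner disc (radius 0 to 1) surrounded by a ring (radius 2 to 3).
# No straight line in the (x, y) plane can separate the two classes.
theta <- runif(2 * n, 0, 2 * pi)
r     <- c(runif(n, 0, 1), runif(n, 2, 3))
x     <- r * cos(theta)
y     <- r * sin(theta)
label <- rep(c("inner", "outer"), each = n)

# Extra feature z, chosen so that z^2 = x^2 + y^2
z <- sqrt(x^2 + y^2)

# In (x, y, z) space the flat plane z = 1.5 separates the classes
pred <- ifelse(z < 1.5, "inner", "outer")
mean(pred == label)  # fraction of points classified correctly
```

Seen from the original (x, y) plane, the cut `z = 1.5` is a circle, which is exactly the kind of non-linear boundary a kernel lets the SVM draw.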

### SVM objective

A Support Vector Machine tries to achieve the following two classification goals simultaneously:

- Maximize the margin (see fig)
- Correctly classify the data points.

There is a loss function that accounts for the loss due to both ‘a diminishing margin’ and ‘incorrectly classified data points’. Hyperparameters can be set to trade off between the two.
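One common way to write such a loss is the soft-margin (hinge loss) objective. This particular form is an assumption, since the post does not spell out a formula. It uses the hyperplane coefficients \(A_0, A_1, \dots, A_m\) from earlier, \(n\) training points with features \(X_{i1}, \dots, X_{im}\), and class labels \(y_i \in \{-1, +1\}\):

\[
\min_{A_0, \dots, A_m} \; \frac{1}{2} \sum_{j=1}^{m} A_j^{2} \; + \; C \sum_{i=1}^{n} \max\Bigl(0,\; 1 - y_i \bigl(A_0 + \sum_{j=1}^{m} A_j X_{ij}\bigr)\Bigr)
\]

The first term grows as the margin shrinks, and the second term adds a penalty for every point that is misclassified or falls inside the margin. The constant \(C\) sets the trade-off between the two.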

Hyperparameters in case of SVM are:

- **Kernel**: “linear”, “rbf” (default), “poly”, etc. “rbf” and “poly” are mainly used for non-linear hyperplanes.
- **C (cost)**: Penalty for wrongly classified data points. It controls the trade-off between a smoother decision boundary and conformance to the training data.
- **Gamma**: Kernel coefficient for kernels (‘rbf’, ‘poly’, etc.). Higher values tend to result in overfitting.
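To get a feel for what gamma does, here is a small base R sketch. It assumes the usual form of the rbf kernel, \(K(u, v) = \exp(-\gamma \lVert u - v \rVert^{2})\); the helper `rbf_kernel` and the two points are made up for illustration.

```r
# rbf kernel between two points u and v (assumed form: exp(-gamma * ||u - v||^2))
rbf_kernel <- function(u, v, gamma) exp(-gamma * sum((u - v)^2))

u <- c(0, 0)
v <- c(1, 1)

# Similarity of the same pair of points for increasing gamma
gammas <- c(0.1, 1, 10)
k <- sapply(gammas, function(g) rbf_kernel(u, v, g))
data.frame(gamma = gammas, similarity = k)
```

As gamma grows, the similarity between the same two points drops towards zero, so each training point influences only its immediate neighbourhood. That is why high gamma values tend to produce wiggly, overfitted boundaries.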

Note: Explaining the maths behind the algorithm is beyond the scope of this post.

### Some examples of SVM classification

- A is the best hyperplane.
- Fitting a non-linear boundary using the kernel trick.
- Trade-off between a smooth boundary and correct classification.

## Implementation in R.

Below is a sample implementation in R using the IRIS dataset.

```r
# Using the iris dataset
head(iris, 3)
```

```
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
```

```r
# For simplicity of visualization (2-D), let us use only two features,
# "Sepal.Length" and "Sepal.Width", for prediction of "Species"
iris.part <- iris[, c(1, 2, 5)]
attach(iris.part)
head(iris.part, 3)
```

```
##   Sepal.Length Sepal.Width Species
## 1          5.1         3.5  setosa
## 2          4.9         3.0  setosa
## 3          4.7         3.2  setosa
```

```r
# Plot our data set
plot(Sepal.Width, Sepal.Length, col = Species)
# fill = 1:3 uses the same palette indices as col = Species above
legend(x = 3.9, y = 7.5, legend = c("setosa", "versicolor", "virginica"), fill = 1:3)
```

```r
library(e1071)  # provides svm() and tune()

x <- subset(iris.part, select = -Species)  # features to use
y <- Species                               # label to predict

# Create an SVM model.
# For simplicity, the data is not split into train and test sets.
# In practical scenarios, split the data into training, cross-validation and test sets.
model <- svm(Species ~ ., data = iris.part)
summary(model)
```

```
##
## Call:
## svm(formula = Species ~ ., data = iris.part)
##
##
## Parameters:
##    SVM-Type:  C-classification
##  SVM-Kernel:  radial
##        cost:  1
##       gamma:  0.5
##
## Number of Support Vectors:  86
##
##  ( 10 40 36 )
##
##
## Number of Classes:  3
##
## Levels:
##  setosa versicolor virginica
```

```r
# Predict the Species
y_pred <- predict(model, x)
```
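Since the model above is trained and evaluated on the same rows, here is a hedged sketch of the split mentioned in the code comments. It assumes the `e1071` package is installed; the 70/30 ratio, the seed and the names `train`, `test` and `model_split` are arbitrary choices for illustration.

```r
library(e1071)  # assumed to be installed; provides svm()

set.seed(1)  # arbitrary seed, only for reproducibility
iris.part <- iris[, c(1, 2, 5)]

# Hold out 30% of the rows as a test set (the 70/30 ratio is an assumption)
train_idx <- sample(nrow(iris.part), size = round(0.7 * nrow(iris.part)))
train <- iris.part[train_idx, ]
test  <- iris.part[-train_idx, ]

# Fit on the training rows only, then predict the held-out rows
model_split <- svm(Species ~ ., data = train)
pred <- predict(model_split, newdata = test)

# Confusion matrix and accuracy on the held-out rows
conf_mat <- table(predicted = pred, actual = test$Species)
accuracy <- mean(pred == test$Species)
print(conf_mat)
print(accuracy)
```

The accuracy on the held-out rows is a more honest estimate of how the model behaves on unseen flowers than the accuracy on the training data.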

```r
# Tune the SVM to find the best hyperparameters
tune_svm <- tune(svm, train.x = x, train.y = y, kernel = "radial",
                 ranges = list(cost = 10^(-2:2), gamma = c(.25, .5, 1, 2)))
print(tune_svm)
```

```
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
##  cost gamma
##   0.1   0.5
##
## - best performance: 0.2066667
```
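Rather than copying the best values by hand, they can be read from the object returned by `tune()`. The sketch below assumes `e1071` is installed and uses the formula interface of `tune()`; the names `tuned`, `best` and `final_from_tune` are made up for illustration, and the values found may differ from the output above because the cross-validation folds are random.

```r
library(e1071)  # assumed to be installed; provides svm() and tune()

set.seed(1)  # arbitrary seed; the cross-validation folds are random
iris.part <- iris[, c(1, 2, 5)]

# Same grid as above, this time using the formula interface of tune()
tuned <- tune(svm, Species ~ ., data = iris.part, kernel = "radial",
              ranges = list(cost = 10^(-2:2), gamma = c(.25, .5, 1, 2)))

# One-row data frame holding the best cost and gamma from the grid
best <- tuned$best.parameters
print(best)

# Refit on the full data with the parameters that tune() selected
final_from_tune <- svm(Species ~ ., data = iris.part, kernel = "radial",
                       cost = best$cost, gamma = best$gamma)
```

The tuning result also carries a ready-made refit in `tuned$best.model`, which can be used directly instead of calling `svm()` again.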

```r
# After you find the best cost and gamma, you can set the best found parameters.
# The values below are illustrative; substitute the ones reported by tune().
final_svm <- svm(Species ~ ., data = iris.part, kernel = "radial", cost = 1, gamma = 1)

# Plot the results; crosses in the plot indicate support vectors
plot(final_svm, iris.part)
legend(x = 3.37, y = 7.5, legend = c("setosa", "versicolor", "virginica"),
       fill = c('white', 'red', 'green'))
```

```r
# Try changing the kernel to linear (gamma is not used by the linear kernel)
final_svm_linear <- svm(Species ~ ., data = iris.part, kernel = "linear", cost = 1, gamma = 1)

# Plot the results
plot(final_svm_linear, iris.part)
legend(x = 3.37, y = 7.5, legend = c("setosa", "versicolor", "virginica"),
       fill = c('white', 'red', 'green'))
```

```r
# Try changing C and gamma
final_svm <- svm(Species ~ ., data = iris.part, kernel = "radial", cost = 100, gamma = 100)
# High C and gamma lead to overfitting

# Plot the results
plot(final_svm, iris.part)
legend(x = 3.37, y = 7.5, legend = c("setosa", "versicolor", "virginica"),
       fill = c('white', 'red', 'green'))
```

I highly recommend playing with this data set by changing kernels and trying different values of `cost` and `gamma`. This will increase your understanding of hyperparameter tuning.

## Pros and Cons?

### Pros:

- Efficient at prediction time, as the decision function uses only a subset of the training points (the support vectors).
- Proven to work well on small and clean datasets.
- The solution is guaranteed to be a global minimum (training solves a convex quadratic problem).
- Non-linear decision boundaries can be obtained using the kernel trick.
- Custom controllable parameter to find an optimal balance between error rate and high margin
- Can capture much more complex relationships between data points without having to perform difficult transformations ourselves

### Cons:

- Does not scale well to larger datasets, as training time grows quickly with the number of data points.
- Less effective for datasets with noise and classes overlapping.
- Complex data transformations and resulting boundary plane are very difficult to interpret (Black box magic).

## Applications

Support Vector Machine is a versatile algorithm and has successfully been implemented for various classification problems. Some examples are:

- Spam detection.
- Sentiment detection.
- Handwritten digits recognition
- Image processing and image recognition.

## Additional resources:

I highly recommend you to go through the links below for an in-depth understanding of the Maths behind this algorithm.
