In the previous post, we looked at the basics of Linear Regression and the assumptions underlying it. It is important to verify these assumptions in order to avoid faulty results and inferences. In this post, we talk about how to improve our model using Regularization.
Motivation
What exactly are we trying to improve? Linear regression, by itself, doesn't need much parameter tuning. Consider the following experiment. We generate 10 evenly spaced numbers from 0 to 10 and assign them to a variable x, setting y = x, so the relationship between the two is exactly linear. Now, for the sake of this experiment, we add some random Gaussian noise (mean 0) to y. So now we have

y = x + ε, where ε ~ N(0, σ²)
The exact linearity between y and x is now disrupted. For this experiment, we know the process that generated our data (shown in the picture below), but in practice this is rarely the case.
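As a rough sketch of this setup in Python (the noise scale and the random seed here are arbitrary choices, not values from the original experiment):

```python
import numpy as np

# Reproducible noise for the toy experiment
rng = np.random.default_rng(42)

# 10 evenly spaced numbers from 0 to 10
x = np.linspace(0, 10, 10)

# Start from a perfectly linear relationship, then add Gaussian noise with mean 0
noise = rng.normal(loc=0.0, scale=1.0, size=x.shape)  # scale chosen arbitrarily
y = x + noise
```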
Now, if you recall the previous blog, a linear model would still be a valid fit here: the random noise has a mean of zero and a constant standard deviation, so there is no heteroskedasticity present. But what if we didn't know any of this about our data? Let's assume that we still go ahead and fit a line (shown in the figure below).
As we see, the line doesn't pass through most of the points. So we go ahead and fit a polynomial of degree 2 (having a squared term and a linear term). The fit looks something like this:
For the above fit, our X data has two columns: x and x². We know that the R-squared of this curve is at least as good as that of our linear fit, because if the squared term didn't improve the fit, its coefficient would simply have been driven to 0. Now let's take a step further and fit a polynomial of degree 4.
This curve fits the data even better. It seems that the curve is able to capture the irregularities well.
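Continuing from the data-generation snippet above, a minimal sketch of these three fits using numpy's polyfit might look like this (the printed training MSE is just a convenient way to compare them):

```python
import numpy as np

# Fit polynomials of increasing degree to the noisy data from the earlier snippet
for degree in (1, 2, 4):
    coeffs = np.polyfit(x, y, deg=degree)   # least-squares polynomial fit
    y_hat = np.polyval(coeffs, x)           # predictions on the training points
    mse = np.mean((y - y_hat) ** 2)         # training Mean Squared Error
    print(f"degree {degree}: training MSE = {mse:.3f}")
```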
It can be asserted that adding higher-degree polynomial terms will, most of the time, give you a better fit on the training data. If not better, it won't degrade the metric you follow (Mean Squared Error). But what about generalization? Adding higher-degree polynomials leads to overfitting. Many times, a particular feature (in this case, a particular power of x) dominates the model, and the outcome then depends largely on this feature. Consequently, the model becomes too sensitive and does not generalize well. Hence, we need a way to control the dominance of our input features. By control, we mean a way to control the coefficients, more specifically the magnitude of these coefficients. This process is called Regularization. In this article, we study two ways to achieve this goal – LASSO and Ridge.
These two are not very different except for the way they control the magnitude of the coefficients. Previously, we defined the loss function as follows –

MSE = (1/N) * Σ_{i=1..N} (y_i − ŷ_i)²

where N is the total number of training points, y_i is the actual value and ŷ_i is the value predicted by the model. In regularization, we add a term to this MSE. We will talk about this term in detail ahead.
Least Absolute Shrinkage and Selection Operator (LASSO)
In LASSO, the loss function looks like

Loss = (1/N) * Σ_{i=1..N} (y_i − Σ_j β_j x_ij)² + λ * Σ_j |β_j|

It is the same expression as before, except that the predicted values are written out explicitly as the weighted sum of the input features and their coefficients (β). Also, there is an additional term: the sum of the magnitudes of the coefficients, multiplied by λ (lambda), the strength of regularization. Let's explore this term in more detail.
We know that we train the linear regression model using gradient descent. Previously, we tried to minimize only the squared-difference term (the first term). Now, we also try to minimize the sum of the magnitudes of the coefficients. If you think carefully, the two terms roughly oppose each other: if some coefficients assume large values to minimize the sum of squared differences (the first term), the regularization term (the sum of the magnitudes of the coefficients) increases. As a result, the coefficients cannot grow as large as they would without regularization, and we obtain coefficients that generalize better. Now, what is the role of the strength of regularization, you may ask? It decides how much importance is given to the regularization term relative to the first term. If λ is large, the model focuses more on reducing the sum of the magnitudes of the coefficients.
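As an illustrative sketch using scikit-learn (where the regularization strength is called alpha rather than lambda, and the alpha values below are arbitrary), we could fit LASSO on degree-4 polynomial features of the toy data from earlier:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures

# Degree-4 polynomial features of x (columns: x, x^2, x^3, x^4)
X_poly = PolynomialFeatures(degree=4, include_bias=False).fit_transform(x.reshape(-1, 1))

# Fit LASSO for a few regularization strengths and inspect the coefficients
for alpha in (0.01, 0.1, 1.0):
    lasso = Lasso(alpha=alpha, max_iter=10000)
    lasso.fit(X_poly, y)
    print(f"alpha={alpha}: coefficients = {np.round(lasso.coef_, 3)}")
```

Larger alpha values shrink the coefficients more aggressively, which is exactly the trade-off described above.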
Ridge
Ridge differs in the way the regularization term is modeled. In LASSO, we summed the magnitudes of the coefficients; here we sum the squares of the coefficients. The modified loss function looks like

Loss = (1/N) * Σ_{i=1..N} (y_i − Σ_j β_j x_ij)² + λ * Σ_j β_j²
The high-level idea remains the same; only the gradient of the penalty term, and hence the gradient descent updates, change mathematically.
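A corresponding sketch with scikit-learn's Ridge, reusing the polynomial features X_poly from the LASSO snippet (the alpha values are again only illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Fit Ridge on the same polynomial features for a few regularization strengths
for alpha in (0.01, 0.1, 1.0):
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_poly, y)
    print(f"alpha={alpha}: coefficients = {np.round(ridge.coef_, 3)}")
```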
Analysis
One may ask, how do these two methods differ? The answer lies in the graph below
Image credits: http://freakonometrics.hypotheses.org/files/2018/07/lasso1.png
Consider the above plot. It shows the coefficient values versus the regularization strength. Beyond a certain value of lambda, the coefficients shrink all the way to zero and are thus effectively eliminated from the model. But in the case of Ridge,
Image credits: https://i.stack.imgur.com/E7BHo.png
the coefficients do not shrink to absolute zero. They are pushed towards zero, but never reach it exactly.
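We can see this contrast directly by counting how many coefficients each model drives to exactly zero, reusing X_poly and y from the earlier sketches with an arbitrary alpha:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

alpha = 1.0  # illustrative regularization strength
lasso = Lasso(alpha=alpha, max_iter=10000).fit(X_poly, y)
ridge = Ridge(alpha=alpha).fit(X_poly, y)

# LASSO typically drives some coefficients to exactly zero; Ridge only shrinks them
print("LASSO zero coefficients:", np.sum(lasso.coef_ == 0.0))
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0.0))
```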
Conclusion
We have seen why we need regularization at all, what problems it solves, the methods by which it is implemented, and how it works. Data Science is hard if not done the right way. You can learn a million things, but you need someone to tell you which of those are the most important and which you would actually use in your job.
Thankfully, Dimensionless has just the right courses, with the right instructors to guide you to your first Data Science gig. Take your first step today by enrolling in our courses.
Follow the link, if you are looking to learn Data Science Online
You can follow the link for our Big Data course
Additionally, if you are interested in learning Data Science, click here to start the Best Online Data Science Courses
Furthermore, if you want to read more about data science, read our Data Science Blogs