In this blog, I will try to make this concept of regression simple and intuitive for everyone. Understanding maths behind a concept is always a must before focusing on its implementation. In this part, we will try to understand what actually regression is and I will walk you through on creating your first machine learning model and understanding it in and out. Hold tight because this will be something you will not wish to miss out.
Linear regression is one of the most widely known modeling technique and was the first one I learned like others while putting forward my steps in predictive learning. In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the nature of the regression line is linear. We try to figure out a relationship between our dependent variables and independent variable using a best fitting straight line. It is called “simple” because we take only two variables in this case and try to establish a relationship between them. It is called “linear” because both the variables vary linearly (can be described with a straight line) with respect to each other. So a while plotting the relationship between two variables, we will be observing more of straight lines rather than curves.
Before diving directly into regression, let us focus on some math behind linear regression. A simple straight line can be resembled by an equation like
where y is your dependent variable
x is your independent variable or can be called regressor variable or predictor variable
m is a coefficient resembling how a unit change in x brings a change in y
c is a constant which determines where your line will cut the x-axis when y=0.
To understand this relationship between our independent variable(x) and the dependent variable(y), linear regression can help us greatly.
Let us solve a problem using linear regression and understand its concepts throughout the journey
Problem: I have shifted to a new city and cab prices here from my apartment to my office are varying monthly. I want to understand the cause of these fluctuations in cab prices and if somehow I can predict what could be the price I will be paying for my cab in the coming month
In this case, our independent variable(x) will be months and our dependent variable(y) will be the cab price. Always remember that your dependent variable is the one which you are trying to predict (cab price) and your independent variable (months) is the one which is used to predict your independent variable.
We can describe above plot roughly as
In linear regression, (in very general terms) our aim here is to find a straight line which will cover most of the points in the above graph. It is called “linear” because coefficients in this case (m) are always linear. It is good to use when your dependent and independent variables have a linear relationship i.e. it can be explained with the help of a continuous straight line
Every actual point which we have plotted is represented as Y (represented in blue) and the points which are predicted by our linear regression will be termed as Y^.
Y = actual value
Y^ = predicted value by model
The goal is to calculate the difference between Y and Y^ for every point, square them and sum them up for every line. Squaring the difference here solves two purposes here.
1. It will heavily punish the lines which will be further away from the actual points.
2. It will compensate for the actual points which lie below the regression line. (Y<Y^)
The line giving the minimum sum is chosen as the best fitting line. This method is also known as ordinary least square (OLS). This method in plain words finds from all possible fitting lines, the line with least distance from all the points.
Above equation is the best fitting line which passes through most of the data points. All the blue points represent the actual cab price and the points on the line represent the predicted cab price by our linear regression model. As said previously, a best fitting line will have the least distance between the actual point and the predicted point. Now we can predict, cab price for any coming month say the 14th month from the point I have come to this new city, by just replacing month with 14 in the above equation.
Seems like I now have an estimate on how much cab price I would be paying after 2 months from now on. Hmm, interesting!
Understanding Correlation
Before building our first model, it is important to understand correlation too. Correlation is nothing but an indicator of the strength of association and relationship between two variables. In simple words, it tells you whether a certain variable will increase or decrease given a change in another variable. Value of correlation coefficient remains between -1 and 1.
- -1 indicates a strong negative relationship between two variables
- 0 indicates no relationship between the two variables
- +1 indicates a strong positive correlation between the two variables
Let us take an example for each.
Strong negative correlation: Relationship between say the speed of the cab I take for office and time taken to reach office. As the speed of the cab increases, the time taken by me to reach my office decreases.
Strong positive correlation: Relationship between the amount of fuel put in the cab with the distance it can cover. More fuel will lead it to cover the further distance.
But hey, there is a small trap! I did fell into this trap when I began my career in data science and I wish to highlight that here. Keep the below line inked in your mind forever.
CORRELATION DOES NOT IMPLY CAUSATION ALWAYS
It is a very important line to keep in mind whenever you are dealing with correlation. Correlation will tell you how two variables are changing with respect to each other but one should never jump to conclusion that one is changing because of the other.
For example, last month a strong positive correlation was observed between increasing number of shark attacks and iPhone sales. It will not make any sense that increasing shark attack can lead to huge sales of iPhones.
As you can see in this graph also, with a decrease in market share of internet explorer, the murder rate has also gone down. Though it clearly states a negative correlation in no way one can conclude that murders in the US have something to do with the fall of internet explorer market share in the US and vice-versa.
Let us not wait more and get our hands dirty with some code. We will train a simple linear regression model which will find a correlation between the two columns and by understanding that correlation, we will be finally able to predict cab prices for months to come.
Building a simple linear regression model in R
Step 1 : Import the data set and use functions like summary() and colnames() to understand the data. The summary provides a detailed summary of every column in your data set in R. In our dataset, we have two columns i.e. Months and Cab. Price. The month is our independent variable whereas Cab. Price is the variable we are trying to predict.
# Importing the dataset dataset = read.csv('data.csv') summary(dataset) colnames(dataset)
Step 2: Split your data set into a training set and a test set. It is a good practice to divide your data set in such a way that you have more number of entries for your training set as compared to the test set. A 75%-25% split is considered generally but it can vary accordingly in case of different scenarios.
# Splitting the dataset into the Training set and Test set library(caTools) set.seed(123) split = sample.split(dataset$Cab.Price, SplitRatio = 2/3) training_set = subset(dataset, split == TRUE)-- test_set = subset(dataset, split == FALSE)
Step 3: Use the lm function in R to build a basic simple linear regression model. The first parameter (formula) requires us to write our dependent variable first then followed by “~” and then our independent variable. In the second parameter(data), we pass the data object to the lm function.
# Fitting Simple Linear Regression to the Training set regressor = lm(formula = Cab.Price ~ Months, data = dataset)
Step 4: It is time now to do some predictions using the model we just built. regressor object above now holds your simple linear regression model and now can be used to make predictions easily. R provides a predict function which can be used here to make predictions. It is a fairly simple function which will accept your model as the first parameter and your test dataset as the second parameter.
# Predicting the Test set results y_pred = predict(regressor, newdata = test_set)
Step 5: Visualizing your results in one of the important methods to analyze your predictions. We will first try to have a look performance of our model on our training set itself to analyze its performance
library(ggplot2) ggplot() + geom_point(aes(x = dataset$Months, y = dataset$Cab.Price), colour = 'blue') + geom_line(aes(x = dataset$Months, y = predict(regressor, newdata = training_set)), colour = 'black') + theme_bw()+ ggtitle('Cab price vs Time (Training set)') + xlab('Time(in months)') + ylab('Cab Price')
Step 6 : Now to compare these values with the values predicted by our model, we plot the straight line from our model which was built with the training set
# Visualising the Test set results library(ggplot2) ggplot() + geom_point(aes(x = test_set$Months, y = test_set$Cab.Price), colour = 'blue') + geom_line(aes(x = test_set$Months, y = predict(regressor, newdata = training_set)), colour = 'black') + theme_bw()+ ggtitle('Cab price vs Time (Training set)') + xlab('Time(in months)') + ylab('Cab Price')
Our final model is able to predict cab price for different months and its prediction is quite close to many values either.
Understanding our simple linear regression model
Let us deep dive into one level more to actually understand our model. The simplest way to analyze our simple linear regression model is to read the result of the summary function when your model is passed as a parameter.
summary(regressor)
After executing this command, you will observe different statistics displayed in your R console. Do not get confused on seeing these different numbers and terms. Let us go through each of them one by one.
- Formula Call
It shows formula used for making the model. It displays the lm() function we used to make our simple linear regression model with Cab. Price as our dependent variable and Month as the independent variable. - Residuals
It is nothing but the difference in actual values which were originally present for example actual cab price in our dataset and the predicted values by the simple linear regression model. In order to analyze how well your model is fitting your data, you should look at symmetrical distribution of these values around mean 0. In this case, values are more or less evenly distributed around 0, which implies that your model is predicting values which are closer to actual points. - Coefficients
Coefficients are basically the values which determine how much unit change in one independent variable will lead to a unit change in the dependent variable.
– Estimate provides coefficients used in the actual equation of the model. Intercept estimates give value or the cab price when month value will be 0 and month intercept gives the value of the coefficient of the variable month. Month intercept says that for every increase in a month there will be an increase of cab price by 4.914.
- Standard Error: As we have already concluded that for every increase in a month there will be an increase of cab price by 4.914(INR), I want to measure an estimate of the standard difference in cab price value if we run our model again and again. By running my model multiple times, we will observe that our predicted cab price may vary by 0.331(INR)
- T-value: The coefficient t-value is a measure of how many standard deviations our coefficient estimate is far away from 0. We want it to be far away from zero as this would indicate we could reject the null hypothesis that is, we could declare a relationship between cab price and time exist. In our example, the t-statistic values are relatively far away from zero and are large relative to the standard error, which could indicate a relationship exists. In general, t-values are also used to compute p-values.
- Pr> T: The Pr(>t) acronym found in the model output relates to the probability of observing any value equal to or larger than t. A small p-value indicates that it is unlikely we will observe a relationship between the predictor (months) and response (cab price) variables due to chance. Typically, a p-value of 5% or less is a good cut-off point. In our model example, the p-values are very close to zero. Note the ‘signif. Codes’ associated with each estimate. Three stars (or asterisks) represent a highly significant p-value. Consequently, a small p-value for the intercept and the slope indicates that we can reject the null hypothesis which allows us to conclude that there is a relationship between speed and distance.
- Residual Standard Error: It is the average amount by which dependent variable(cab price) will deviate from the true regression line. In simpler words, on average cab price in a month can vary by 5.539(INR) from the true regression line. Prediction error can also be calculated by dividing 5.539(Residual error) with 77.752(Estimate coefficient) which is around 7.12%. Also, observe that this is on the basis of 13 degrees of freedom. The degree of freedom is data points that went into the estimation of the dependent variable(cab price) after taking into the account the number of parameters(which are two in this case i.e. cab price and months). So the degree of freedom will be 15(no. of data points)–2(no. of parameters used for prediction) = 13.
- Multiple R-squared, Adjusted R-squared: R squared value gives the estimation of how well your model is fitting your data. It always lies between 0 and 1. 0 value will tell us that your independent variable(months) does not explain the variance present in the response variable(cab price) whereas value 1 tells us that observed variance in the response variable(cab price) can be explained by dependent variable(months). R squared value in our case is around 0.9443. It means that 94.43% variance found in the cab price can be explained using time(months). Generally, more is the R squared, better your linear regression model. Adjusted R square is a modified type of R squared value which is used in case of multiple linear regression models as R square tends to increase with more number of independent/predictor variables. I will be discussing more Adjusted R square and maths behind it in my next article for multiple linear regression model.
- F-statistic: f statistic also states whether there is any relationship between a predictor variable(time) and the response variable(cab price). Larger the f-statistic than 1, more the relationship between the two variables. If your data set is large then a value slightly greater than 1 is sufficient to prove the existence of the relationship between a predictor variable(time) and the response variable(cab price). But in the case when the data set is small, the value should be sufficiently higher than 1. In our case where the data set is small, f statistic value is around 220 which is sufficiently large enough to prove that there exists a relationship between cab price and time(months)
So we were able to understand and build our first machine learning model and able to predict the cab price for my office ride in the coming month. We will be discussing more multiple linear regression in my next article where the cab price will not be just dependent on time but on a lot of other factors also and we will look at ways on how to accommodate these variables for our final cab price prediction. We will also look into some basic assumptions which need to be satisfied before making a linear regression model and understand over-fitting too. A lot of exciting things coming up in the next article, stay tuned!