
Multiple Linear Regression & Assumptions of Linear Regression: A-Z

In this article, we will cover the multiple linear regression model. We will also look at some important assumptions that should always be checked before building a linear regression model, and we will try to improve the performance of our regression model.

Multiple linear regression is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. The goal of multiple linear regression is to model the relationship between the dependent and independent variables. We call it “multiple” because in this case, unlike simple linear regression, we have many independent variables trying to predict a dependent variable. For example, predicting cab price based on fuel price, vehicle cost and profits made by cab owners or predicting salary of an employee based on previous salary, qualifications, age etc.

Multiple linear regression models can be depicted by the equation y = a0 + a1x1 + a2x2 + a3x3, where

  • y is the dependent variable which you are trying to predict. In linear regression, it is always a numerical variable and cannot be categorical
  • x1, x2, and x3 are the independent variables which are taken into consideration to predict the dependent variable y
  • a1, a2, and a3 are coefficients which determine how a unit change in one variable will independently change the final value you are trying to predict, i.e. a change in x3 by 1 unit will change our dependent variable y by an amount a3, assuming x1 and x2 have remained constant
  • a0 is also known as the constant or the intercept, i.e. the value our predictor will output if there is no effect of any of the independent variables (x1, x2, x3) on our dependent variable (y)

Problem Statement: Predict the cab price from my apartment to my office, which has of late been fluctuating. Unlike my previous article on simple linear regression, the cab price now depends not just on the time I have been in the city but also on other factors like fuel price, the number of people living near my apartment, vehicle price in the city and a lot of other factors. We will try to build a model which takes all these factors into consideration.

Let us start analyzing the data we have for this case. If you are new to regression and want to understand the basics, you may like to visit my previous article on the basics of regression.

Loading data set
Load the data set and study its structure. The “dim” function shows the dimensions of the data set; you can see in the console that dim outputs 15, 8, meaning our data set has 15 rows and 8 columns. The “summary” function, on the other hand, gives more detailed information about every column in the data set. We should always try to understand the data first before jumping directly into model building.

dataset <- read.csv("data.csv")
dim(dataset)
summary(dataset)

Handling categorical variables
We should always keep in mind that regression takes only continuous and discrete variables as input. In our case, 3 of our variables, i.e. demand, safety, and popularity, are categorical. These variables cannot be used directly to build our regression model, and hence we need to convert them into numeric form. We can do this by giving numeric levels to our categorical variables: we will replace low with -1, medium with 0 and high with 1. We will repeat the same step for the other categorical variables, safety and popularity.

#Converting demand variable into the factors
dataset$Demand = factor(dataset$Demand,
                       levels = c('low', 'medium', 'high'),
                       labels = c(-1, 0, 1))
#Converting safety variable into the factors
dataset$Safety = factor(dataset$Safety,
                        levels = c('low', 'medium', 'high'),
                        labels = c(-1, 0, 1))
#Converting popularity variable into factors
dataset$Popularity = factor(dataset$Popularity,
                        levels = c('low', 'medium', 'high'),
                        labels = c(-1, 0, 1))
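
One thing to keep in mind: factor() stores these values as factor levels with the labels "-1", "0" and "1", not as numbers. If a genuinely numeric encoding is needed later (for example, for the correlation tests used further down), the labels can be converted back; a minimal sketch:

# Optional: turn the factor labels into numeric values
# (as.character() first, so we convert the labels rather than the internal level codes)
dataset$Demand     = as.numeric(as.character(dataset$Demand))
dataset$Safety     = as.numeric(as.character(dataset$Safety))
dataset$Popularity = as.numeric(as.character(dataset$Popularity))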

Model Building
We will now directly build our multiple linear regression model. It is fairly simple to do and is done by using the lm function(in R). The first parameter is a formula which expects your dependent variable first followed by “~” and then all of the independent variables through which you want to predict your final dependent variable.

# Splitting the data set into the Training set and Test set

library(caTools)
set.seed(123)
split = sample.split(dataset$Cab.Price, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Building multiple regression model with all the independent variables
regressor = lm(formula = Cab.Price ~ .,
               data = training_set)

Let us take a break from model building here and understand a few things which will help us judge how good the model we have built is.

After building our multiple regression model, let us move on to a very crucial step before making any predictions using our model. There are some assumptions that need to be checked before relying on a regression model. All of these assumptions should hold true for your linear regression model.

  • Assumption 1: The relationship between your independent and dependent variables should always be linear, i.e. you can depict the relationship between the two variables with the help of a straight line. Let us check whether this assumption holds true for our case.

By looking at the second row and second column, we can say that our independent variables possess a linear relationship (observe how the scatterplots more or less form the shape of a line) with our dependent variable (cab price in our case).
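
The scatterplot matrix referred to above can be produced with the base pairs() function; a minimal sketch, assuming the column names used elsewhere in this article:

# Pairwise scatterplots of the dependent variable against the numeric predictors
pairs(~ Cab.Price + Months + Fuel.Price + Vehicle.Price + Profit.by.driver,
      data = training_set, main = "Pairwise scatterplots")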

  • Assumption 2: The mean of the residuals should be zero, or as close to 0 as possible. This check tells us whether our line is actually the line of “best fit”.

We want the arithmetic sum of these residuals to be as close to zero as possible, as that ensures that our predicted cab price is as close to the actual cab price as possible. In our case, the mean of the residuals is very close to 0, hence the second assumption also holds true.

> mean(regressor$residuals)
[1] 6.245005e-17
  • Assumption 3: There should be homoscedasticity, or equal variance, in our regression model. This assumption means that the variance around the regression line is the same for all values of the predictor variable (X). A violation of this assumption looks like this: for the lower values on the X-axis, the points are all very near the regression line, while for the higher values on the X-axis there is much more variability around the regression line.

To check homoscedasticity, we plot the residual values on the y-axis against the fitted (predicted) values on the x-axis. If the spread of the residuals widens or narrows systematically as the fitted values increase (a funnel or cone shape), then there is heteroscedasticity, i.e. the variability of the residuals is unequal across the range of values of the variable that predicts them.

par(mfrow=c(2,2))  # set 2 rows and 2 column plot layout
plot(regressor)

From the first plot (top-left), as the fitted values along the x-axis increase, the residuals remain more or less constant. This is indicated by the red line, which should be approximately flat if the disturbances are homoscedastic. The plot on the bottom left also checks this and is more convenient, as the disturbance term on the Y-axis is standardized. The points appear random and the line looks pretty flat (top-left graph), with no increasing or decreasing trend, so the condition of homoscedasticity can be accepted.

  • Assumption 4: The independent variables and the residuals should be uncorrelated.
cor.test(training_set$Months, regressor$residuals)

Here, our null hypothesis is that there is no relationship between our independent variable Months and the residuals, while the alternate hypothesis is that there is a relationship between Months and the residuals. Since the p-value is very high, we cannot reject the null hypothesis, and hence this assumption holds true for this variable in this model. We need to repeat this step for the other independent variables as well.
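
The same check can be repeated for the remaining numeric predictors with a short loop; a minimal sketch, assuming the column names used elsewhere in this article:

# Correlation test between each numeric predictor and the model residuals
for (col in c("Months", "Fuel.Price", "Vehicle.Price", "Profit.by.driver")) {
  test <- cor.test(training_set[[col]], regressor$residuals)
  cat(col, ": p-value =", round(test$p.value, 4), "\n")
}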

  • Assumption 5: The number of observations should be greater than the number of independent variables. We can check this by simply looking at our data set.
  • Assumption 6: There should be no perfect multicollinearity in your model. Multicollinearity generally occurs when there are high correlations between two or more independent variables; in other words, one independent variable can be used to predict another. This creates redundant information, skewing the results in a regression model. We can check multicollinearity using the VIF (variance inflation factor): the higher the VIF for an independent variable, the greater the chance that the variable is already explained by the other independent variables (a quick VIF check is sketched just after this list).
  • Assumption 7: Residuals should be normally distributed. This can be checked by visualizing a normal Q-Q plot. If the points lie exactly on the line, the distribution is perfectly normal. Some deviation is to be expected, particularly near the ends (note the upper right), but the deviations should be small.
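
A quick way to check VIF in R is the vif() function from the car package; a minimal sketch, assuming the package is installed and regressor is the fitted model from above:

# VIF for each predictor; a common rule of thumb treats values above 5-10 as a
# sign of problematic multicollinearity.
# Note: vif() will complain if the model contains perfectly collinear (aliased)
# predictors - those need to be dropped from the formula first.
library(car)
vif(regressor)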

Checking assumptions automatically

We can use the gvlma library to evaluate the basic assumptions of linear regression for us automatically.

library(gvlma)
regressor = lm(formula = Cab.Price ~ Demand + Fuel.Price + Vehicle.Price + Profit.by.driver,
               data = training_set)
gvlma(regressor)

Are you finding it difficult to understand the output of the gvlma() function? Let us understand this output in detail.

1. Global Stat: It measures the linear relationship between our independent variables and the dependent variable which we are trying to predict. The null hypothesis is that there is a linear relationship between our independent and dependent variables. Since our p-value is > 0.05, we fail to reject the null hypothesis and conclude that there is indeed a linear relationship between our independent and dependent variables.

2. Skewness: Data can be “skewed”, meaning it tends to have a long tail on one side or the other

  • Negative Skew: The long tail is on the “negative” (left) side of the peak. Generally, the mean < median < mode in this case.
  • No Skew: There is no long tail on either side of the peak; the distribution is symmetrical. Mean, median and mode are at the center of the peak, i.e. mean = median = mode.
  • Positive Skew: The long tail is on the “positive” (right) side of the peak. Generally, mode < median < mean in this case.

We want our data to be normally distributed. The second assumption looks for skewness in our data. The null hypothesis states that our data is normally distributed. In our case, since the p-value for this is >0.05, we can safely conclude that our null hypothesis holds hence our data is normally distributed.

3. Kurtosis: The kurtosis parameter is a measure of the combined weight of the tails relative to the rest of the distribution. It measures the tail-heaviness of the distribution. Although not strictly correct, you can loosely relate kurtosis to the shape of the peak of the distribution, i.e. whether your distribution has a sharp peak or a shallow peak.

  • Mesokurtic: This is generally the ideal scenario where your data has a normal distribution.
  • Platykurtic (negative kurtosis score): A flatter, shallower peak is observed because less of the data sits in the tails of the distribution, i.e. the tails are thinner than those of the normal distribution, and there are fewer chances of extreme outcomes compared to a normal distribution.
  • Leptokurtic (positive kurtosis score): A sharper peak is observed compared to the normal distribution because more of the data sits in the tails, i.e. the tails are fatter than those of the normal distribution, and there are more chances of extreme outcomes compared to a normal distribution.

We want our data to be normally distributed. The third assumption looks for the amount of data present in the tail of the distribution. The null hypothesis states that our data is normally distributed. In our case, since the p-value for this is >0.05, we can safely conclude that our null hypothesis holds hence our data is normally distributed.
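
If you want to compute these quantities directly rather than relying on gvlma, the e1071 package (one option among several; the moments package offers the same) provides skewness() and kurtosis(); a minimal sketch applied to the model residuals:

library(e1071)
skewness(regressor$residuals)   # close to 0 suggests a roughly symmetric distribution
kurtosis(regressor$residuals)   # excess kurtosis; close to 0 suggests normal-like tails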

4. Link function: It tells us whether our dependent variable is numeric or categorical. As I have already mentioned, for linear regression your dependent variable should be numeric and not categorical. The null hypothesis states that our dependent variable is numeric. Since the p-value for this case is again > 0.05, we fail to reject the null hypothesis and conclude that our dependent variable is numeric.

5. Heteroscedasticity: Is the variance of your model residuals constant across the range of X (the assumption of homoscedasticity discussed above)? Rejection of the null (p < .05) indicates that your residuals are heteroscedastic, and thus non-constant across the range of X; the model is then better or worse at predicting for certain ranges of your X scales.

As we can observe, the gvlma function has automatically tested our model for 5 basic assumptions of linear regression and, woohoo, our model has passed all of them. Hence it is a qualified model to predict results and to understand the influence of the independent (predictor) variables on the dependent variable.

Understanding R squared and Adjusted R Squared

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. It can also be defined as the percentage of the response variable variation that is explained by a linear model.

R-squared = Explained variation / Total variation
R-squared is always between 0 and 100%:
0% indicates that the model explains none of the variability of the response data around its mean.
100% indicates that the model explains all the variability of the response data around its mean.

In general, the higher the R-squared, the better the model fits your data.
But R-squared cannot determine whether the coefficient estimates and predictions are biased, which is why we must assess the residual plots. R-squared also has additional problems that the adjusted R-squared and predicted R-squared are designed to address.

Problem 1: Whenever we add a new independent variable (predictor) to our model, the R-squared value always increases, regardless of whether the new predictor is significant for the prediction or not. We may be lured into increasing our R-squared value as much as possible by adding new predictors, without realizing that we end up adding a lot of complexity to our model, which makes it harder to interpret.

Problem 2: If a model has too many predictors and higher order polynomials, it begins to model the random noise in the data. This condition is known as over-fitting and it produces misleadingly high R-squared values and a lessened ability to make predictions.

How adjusted R squared comes to our rescue

We will now have a look at how adjusted R squared deals with the shortcomings of R squared. To begin with, adjusted R squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases only if a new term improves the model more than would be expected by chance, and it decreases when a predictor improves the model by less than expected by chance. The adjusted R-squared can be negative, but it usually is not, and it is never higher than the R-squared.

Let us understand adjusted R squared in more detail by going through its mathematical formula:

Adjusted R squared = 1 − (1 − R squared) × (n − 1) / (n − k − 1)

where n = number of points in our data set
k = number of independent variables (predictors) used to develop the model

From the formula, we can observe that if we keep adding non-significant predictors, then k will increase while R squared will barely increase. Hence n − k − 1 decreases while the numerator of the fraction stays almost the same as before, so the whole fraction increases (since the denominator has shrunk with the increased k). This results in a bigger value being subtracted from 1, and we end up with a smaller adjusted R squared value. In this way, adjusted R squared compensates by penalizing us for extra variables which do not hold much significance in predicting our target variable.
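
In R, both quantities are reported by summary() and can also be pulled out of the summary object directly; a minimal sketch:

s <- summary(regressor)
s$r.squared       # R squared
s$adj.r.squared   # adjusted R squared, penalized for the number of predictors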


Let’s get back to the model we built, check whether all the assumptions hold true for it, and try to build a better model.

library(gvlma)
regressor = lm(formula = Cab.Price ~ Demand + Fuel.Price + Vehicle.Price + Profit.by.driver,
               data = training_set)
gvlma(regressor)
summary(regressor)

Oops! It seems like there is an error while testing our model for assumptions of linear regression.

This error means your design matrix is not invertible and therefore cannot be used to fit a regression model. This results from linearly dependent columns, i.e. strongly correlated variables. Examine the pairwise covariance (or correlation) of your variables to investigate whether there are variables that can potentially be removed.
Let us have a look at the detailed summary to see if we can find any anomalies there.

summary(regressor)

Seeing so many NAs points us towards exactly where the problem lies. Let us check which columns are responsible for creating multicollinearity in our data set.

# cor.test needs numeric inputs; this assumes the categorical columns were
# converted to numeric values as noted earlier
cor.test(training_set$Safety, training_set$Popularity)
cor.test(training_set$Popularity, training_set$Months)

Since the p-value is quite close to 0 in both cases, we have to reject our null hypothesis (that there is no relationship between the two features). On further investigation, it was found that demand also showed high collinearity with the safety parameter. Hence demand, safety, popularity and vehicle price can largely be deduced from each other, so out of these we will keep only one, say demand (which shows higher significance in the summary output).

regressor = lm(formula = Cab.Price ~ Months+Fuel.Price+Demand+Profit.by.driver, data = training_set)
gvlma(regressor)
summary(regressor)

All assumptions are met, but the summary output says that demand is the only significant variable in this case. Let us make our model a little less complex by removing some more variables. We will pick the variables having higher p-values.
R squared: 0.9951
Adjusted R squared: 0.9923

Removing “profit by driver” variable 

regressor = lm(formula = Cab.Price ~ Months + Fuel.Price + Demand,
               data = training_set)
gvlma(regressor)
summary(regressor)

R squared: 0.9946
Adjusted R squared: 0.9925
Observation: Removing the profit-by-driver variable increased our adjusted R squared, indicating that this feature was not very significant.

All assumptions met.

Removing the “Months” variable by the same logic as it is non-significant …

regressor = lm(formula = Cab.Price ~ Fuel.Price + Demand,
               data = training_set)
gvlma(regressor)
summary(regressor)

R squared: 0.9944
Adjusted R squared: 0.9932
Observation: Removing the “Months” variable increased our adjusted R squared, indicating that this feature was not very significant either.

Finally, we have both significant variables with us, and if you look closely, we have the highest adjusted R squared with the model based on two features (0.9932) compared to the model with all the features (0.9923).
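
With this final two-variable model, predictions on the held-out test set can be made with predict(); a minimal sketch (the exact numbers will depend on your data):

# Predicting cab prices for the test set with the final model
y_pred = predict(regressor, newdata = test_set)
data.frame(actual = test_set$Cab.Price, predicted = y_pred)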

Finally, in this article, we were able to build a multiple regression model and understand the maths behind it. The aim was not just to build a model but to build one keeping in mind the assumptions and the complexity of the model. Stay tuned for more articles on machine learning!

Building a Linear Regression Model for Real World Problems, in R

In this blog, I will try to make the concept of regression simple and intuitive for everyone. Understanding the maths behind a concept is always a must before focusing on its implementation. In this part, we will try to understand what regression actually is, and I will walk you through creating your first machine learning model and understanding it inside and out. Hold tight, because this is something you will not wish to miss.

Linear regression is one of the most widely known modeling techniques and, like many others, it was the first one I learned while taking my first steps in predictive modeling. In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the nature of the regression line is linear. We try to figure out a relationship between our dependent variable and independent variable using a best fitting straight line. It is called “simple” because we take only two variables in this case and try to establish a relationship between them. It is called “linear” because both variables vary linearly (their relationship can be described with a straight line) with respect to each other. So while plotting the relationship between the two variables, we will observe something closer to a straight line than a curve.

Before diving directly into regression, let us focus on some of the math behind linear regression. A simple straight line can be represented by an equation like y = mx + c

where y is your dependent variable
x is your independent variable, also called the regressor or predictor variable
m is a coefficient representing how a unit change in x brings a change in y
c is a constant, the intercept, which determines where your line cuts the y-axis when x = 0.

To understand this relationship between our independent variable(x) and the dependent variable(y), linear regression can help us greatly.

Let us solve a problem using linear regression and understand its concepts throughout the journey

Problem: I have shifted to a new city, and cab prices from my apartment to my office vary from month to month. I want to understand the cause of these fluctuations in cab prices and whether I can somehow predict the price I will be paying for my cab in the coming month.

In this case, our independent variable (x) will be months and our dependent variable (y) will be the cab price. Always remember that your dependent variable is the one which you are trying to predict (cab price) and your independent variable (months) is the one which is used to predict your dependent variable.

We can describe the above plot roughly with a straight-line equation of the form y = mx + c.

In linear regression, (in very general terms) our aim is to find a straight line which covers most of the points in the above graph. It is called “linear” because the coefficient (m) enters the equation linearly. It is good to use when your dependent and independent variables have a linear relationship, i.e. one that can be explained with the help of a continuous straight line.

Every actual point which we have plotted is represented as Y (represented in blue) and the points which are predicted by our linear regression will be termed as Y^.

Y = actual value
Y^ = predicted value by model

Ordinary Least Square method

The goal is to calculate the difference between Y and Y^ for every point, square these differences and sum them up for every candidate line. Squaring the difference serves two purposes:
1. It heavily punishes lines which are further away from the actual points.
2. It accounts for actual points which lie below the regression line (Y < Y^), so that positive and negative differences do not cancel each other out.
The line giving the minimum sum is chosen as the best fitting line. This method is known as ordinary least squares (OLS). In plain words, this method finds, among all possible fitting lines, the line with the least total (squared) distance from all the points.
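
To make this concrete, here is a minimal sketch of the quantity OLS minimizes: a small helper that computes the sum of squared errors for any candidate line y = m·x + c. The two candidate (m, c) pairs below are arbitrary illustrative values, not the fitted coefficients, and the sketch assumes the data.csv file used later in this article has been loaded.

# Sum of squared errors between actual Y and the Y^ predicted by a line y = m*x + c
sse <- function(m, c, x, y) {
  y_hat <- m * x + c          # predicted values for this candidate line
  sum((y - y_hat)^2)          # squared differences, summed over all points
}

dataset <- read.csv('data.csv')
# OLS picks, among all possible (m, c), the pair giving the smallest value of sse()
sse(5, 75, dataset$Months, dataset$Cab.Price)
sse(2, 100, dataset$Months, dataset$Cab.Price)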

The best fitting line found in this way passes close to most of the data points. All the blue points represent the actual cab prices, and the points on the line represent the cab prices predicted by our linear regression model. As said previously, the best fitting line has the least total distance between the actual points and the predicted points. Now we can predict the cab price for any coming month, say the 14th month from the time I came to this new city, by just plugging 14 into the equation of that line.

Seems like I now have an estimate on how much cab price I would be paying after 2 months from now on. Hmm, interesting!

Understanding Correlation

Before building our first model, it is important to understand correlation too. Correlation is nothing but an indicator of the strength of the association between two variables. In simple words, it tells you whether a certain variable will increase or decrease given a change in another variable. The value of the correlation coefficient always lies between -1 and 1.

  • -1 indicates a strong negative relationship between two variables
  • 0 indicates no relationship between the two variables
  • +1 indicates a strong positive correlation between the two variables

Let us take an example for each.

Strong negative correlation: Relationship between say the speed of the cab I take for office and time taken to reach office. As the speed of the cab increases, the time taken by me to reach my office decreases.

Strong positive correlation: the relationship between the amount of fuel put in the cab and the distance it can cover. More fuel lets it cover a greater distance.
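
In R, the correlation coefficient between two numeric variables can be computed with cor(), and cor.test() adds a significance test; a minimal sketch using the Months and Cab.Price columns from the data set loaded later in this article:

# Pearson correlation between time (months) and cab price
cor(dataset$Months, dataset$Cab.Price)
# The same relationship with a p-value and confidence interval
cor.test(dataset$Months, dataset$Cab.Price)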

But hey, there is a small trap! I did fall into this trap when I began my career in data science, and I wish to highlight it here. Keep the line below inked in your mind forever.

CORRELATION DOES NOT ALWAYS IMPLY CAUSATION

It is a very important line to keep in mind whenever you are dealing with correlation. Correlation will tell you how two variables change with respect to each other, but one should never jump to the conclusion that one is changing because of the other.

For example, last month a strong positive correlation was observed between the increasing number of shark attacks and iPhone sales. It would not make any sense to claim that increasing shark attacks lead to huge sales of iPhones.

Similarly, one famous chart shows that as the market share of Internet Explorer decreased, the murder rate in the US also went down. Though the two clearly move together, in no way can one conclude that murders in the US have something to do with the fall of Internet Explorer's market share, or vice-versa.

Let us not wait more and get our hands dirty with some code. We will train a simple linear regression model which will find a correlation between the two columns and by understanding that correlation, we will be finally able to predict cab prices for months to come.

Building a simple linear regression model in R

Step 1: Import the data set and use functions like summary() and colnames() to understand the data. summary() provides a detailed summary of every column in your data set. In our data set, we have two columns, i.e. Months and Cab.Price. Months is our independent variable, whereas Cab.Price is the variable we are trying to predict.

# Importing the dataset
dataset = read.csv('data.csv')
summary(dataset)
colnames(dataset)

Step 2: Split your data set into a training set and a test set. It is good practice to divide your data set in such a way that you have more entries in your training set than in the test set. A 75%-25% split is common, but it can vary in different scenarios (in the code below we use roughly a 2/3-1/3 split).

# Splitting the dataset into the Training set and Test set
library(caTools)
set.seed(123)
split = sample.split(dataset$Cab.Price, SplitRatio = 2/3)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

Step 3: Use the lm function in R to build a basic simple linear regression model. The first parameter (formula) requires us to write our dependent variable first then followed by “~” and then our independent variable. In the second parameter(data), we pass the data object to the lm function.

# Fitting Simple Linear Regression to the Training set
regressor = lm(formula = Cab.Price ~ Months,
               data = training_set)

Step 4: It is time now to do some predictions using the model we just built. regressor object above now holds your simple linear regression model and now can be used to make predictions easily. R provides a predict function which can be used here to make predictions. It is a fairly simple function which will accept your model as the first parameter and your test dataset as the second parameter.

# Predicting the Test set results
y_pred = predict(regressor, newdata = test_set)

Step 5: Visualizing your results is one of the important ways to analyze your predictions. We will first have a look at the performance of our model on the training set itself.

library(ggplot2)
ggplot() +
  geom_point(aes(x = training_set$Months, y = training_set$Cab.Price),
             colour = 'blue') +
  geom_line(aes(x = training_set$Months, y = predict(regressor, newdata = training_set)),
            colour = 'black') +
  theme_bw() +
  ggtitle('Cab price vs Time (Training set)') +
  xlab('Time(in months)') +
  ylab('Cab Price')

Step 6: Now, to compare the actual test-set values with the values predicted by our model, we plot the straight line obtained from our model (which was built on the training set) against the test-set points.

# Visualising the Test set results
library(ggplot2)
ggplot() +
  geom_point(aes(x = test_set$Months, y = test_set$Cab.Price),
             colour = 'blue') +
  geom_line(aes(x = training_set$Months, y = predict(regressor, newdata = training_set)),
            colour = 'black') +
  theme_bw() +
  ggtitle('Cab price vs Time (Test set)') +
  xlab('Time(in months)') +
  ylab('Cab Price')

Our final model is able to predict the cab price for different months, and its predictions are quite close to the actual values.

Understanding our simple linear regression model

Let us dive one level deeper to actually understand our model. The simplest way to analyze our simple linear regression model is to read the result of the summary() function when the model is passed to it as a parameter.

summary(regressor)

After executing this command, you will observe different statistics displayed in your R console. Do not get confused on seeing these different numbers and terms. Let us go through each of them one by one.

  1. Formula Call
    It shows the formula used for making the model: the lm() call we used to build our simple linear regression model, with Cab.Price as the dependent variable and Months as the independent variable.
  2. Residuals
    It is nothing but the difference between the actual values originally present in our data set (the actual cab prices) and the values predicted by the simple linear regression model. To analyze how well your model fits your data, you should look for a symmetrical distribution of these values around a mean of 0. In this case, the values are more or less evenly distributed around 0, which implies that the model is predicting values close to the actual points.
  3. Coefficients
    Coefficients are basically the values which determine how much a unit change in an independent variable changes the dependent variable.
    – Estimate provides the coefficients used in the actual equation of the model. The intercept estimate gives the cab price when the month value is 0, and the Months estimate gives the coefficient of the Months variable: for every increase of one month, the cab price increases by 4.914.

  • Standard Error: We have already concluded that for every additional month the cab price increases by 4.914 (INR). The standard error estimates how much this coefficient would vary if we fitted the model again and again on different samples; here, the estimated increase in cab price per month may vary by about 0.331 (INR).
  • T-value: The coefficient t-value is a measure of how many standard errors our coefficient estimate is away from 0. We want it to be far away from zero, as this would indicate that we could reject the null hypothesis, that is, we could declare that a relationship between cab price and time exists. In our example, the t-statistic values are relatively far away from zero and large relative to the standard error, which indicates that a relationship exists. In general, t-values are also used to compute p-values.
  • Pr(>|t|): The Pr(>|t|) value in the model output is the probability of observing a t-value as large as, or larger (in absolute value) than, the one observed, under the null hypothesis. A small p-value indicates that it is unlikely we would observe such a relationship between the predictor (Months) and the response (cab price) purely by chance. Typically, a p-value of 5% or less is a good cut-off point. In our model, the p-values are very close to zero. Note the ‘Signif. codes’ associated with each estimate: three stars (asterisks) represent a highly significant p-value. Consequently, the small p-values for the intercept and the slope allow us to reject the null hypothesis and conclude that there is a relationship between cab price and time (months).
  • Residual Standard Error: It is the average amount by which the dependent variable (cab price) deviates from the true regression line. In simpler words, on average the cab price in a month can vary by 5.539 (INR) from the regression line. The prediction error rate can also be estimated by dividing 5.539 (the residual standard error) by 77.752, which gives around 7.12%. Also, observe that this is on the basis of 13 degrees of freedom: the degrees of freedom are the number of data points that went into the estimation after accounting for the number of parameters estimated (two in this case, the intercept and the Months coefficient). So the degrees of freedom are 15 (number of data points) − 2 (number of estimated parameters) = 13.
  • Multiple R-squared, Adjusted R-squared: The R squared value estimates how well your model fits your data. It always lies between 0 and 1. A value of 0 tells us that the independent variable (Months) does not explain any of the variance in the response variable (cab price), whereas a value of 1 tells us that all the observed variance in the response variable (cab price) can be explained by the independent variable (Months). The R squared value in our case is around 0.9443, which means that 94.43% of the variance found in the cab price can be explained using time (months). Generally, the higher the R squared, the better your linear regression model. Adjusted R squared is a modified version of R squared used in the case of multiple linear regression models, as R squared tends to increase with a greater number of independent/predictor variables. I will be discussing adjusted R squared and the maths behind it in my next article, on the multiple linear regression model.
  • F-statistic: The F-statistic also indicates whether there is any relationship between the predictor variable (time) and the response variable (cab price). The further the F-statistic is above 1, the stronger the evidence of a relationship between the two variables. If your data set is large, then a value only slightly greater than 1 can be sufficient to establish a relationship; but when the data set is small, the value should be considerably higher than 1. In our case, where the data set is small, the F-statistic is around 220, which is sufficiently large to conclude that a relationship exists between cab price and time (months).
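
All of the statistics discussed above can also be extracted programmatically from the summary object rather than read off the printed output; a minimal sketch:

s <- summary(regressor)
s$coefficients    # estimates, standard errors, t-values and Pr(>|t|)
s$sigma           # residual standard error
s$r.squared       # multiple R squared
s$adj.r.squared   # adjusted R squared
s$fstatistic      # F statistic with its degrees of freedom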

 

So we were able to understand and build our first machine learning model and to predict the cab price for my office ride in the coming month. We will discuss multiple linear regression in the next article, where the cab price will not depend just on time but on a lot of other factors as well, and we will look at ways to accommodate these variables in our final cab price prediction. We will also look into some basic assumptions which need to be satisfied before making a linear regression model, and we will touch on over-fitting too. A lot of exciting things are coming up in the next article, stay tuned!