Neural networks are often described as universal function approximators. Given an appropriate architecture, these algorithms can learn almost any representation. Consequently, many interesting tasks have been implemented using Neural Networks – Image classification, Question Answering, Generative modeling, Robotics and many more.
In this tutorial, we implement a popular task in Natural Language Processing called Language modeling. Language modeling deals with a special class of Neural Network trying to learn a natural language so as to generate it. We implement this model using a popular deep learning library called Pytorch.
Pre-requisites:
Basic familiarity with Python, Neural Networks and Machine Learning concepts.
Anaconda distribution of python with Pytorch installed. Refer the page.
Download the data from this Github repo
Brief Overview of Neural Networks
In a traditional Neural Network, you have an architecture which has three types of layers – Input, hidden and output layers. Each of this layer consists of Neurons. For a brief recap, consider the image below,
Suppose we have a multi-dimensional input (X1,X2, .. Xn). In the above pic, n=2. Each of the input weight has an associated weight. Therefore we have n weights (W1, W2, .. Wn). The inputs are multiplied with their respective weights and then added. To this weighted sum, a constant term called bias is added. All of these weights and bias included are learned during training. So far we have
a = w1*x1+w2*x2+w3*x3 … +wn*xn +b
Then this quantity is then activated using an activation function. There are many activation functions – sigmoid, relu, tanh and many more. However, let’s call this function f. Therefore, after the activation, we get the final output of the neuron as
output = f(a)
This was just about one neuron. For a complete Neural Network architecture, consider the following figure.
As you see, there are many neurons. Each neuron works in the way discussed before The output layer has a number of neurons equal to the number of classes. How are so many weights and biases learned? Using the backpropagation algorithm. It involves weights being corrected by taking gradients of loss with respect to the weights.
RNNs
Data can be sequential. It can have an order. For example, words in a sentence have an order. They cannot be jumbled and be expected to make the same sense. Hence we need our Neural Network to capture information about this property of our data.
Consider the above figure and the following argument. We have a certain sentence with t words. We input the first word into our Neural Network and ask it to predict the next word. Then we use the second word of the sentence to predict the third word. So on and so forth. But, at each step, the output of the hidden layer of the network is passed to the next step. This is to pass on the sequential information of the sentence. Such a neural network is called Recurrent Neural Network or RNN. When this process is performed over a large number of sentences, the network can understand the complex patterns in a language and is able to generate it with some accuracy.
How good has AI been at generating text? Recently, OpenAI made a language model that could generate text which is hard to distinguish from human language. It read something like-
“Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.“
You can take a look at the complete text generation at OpenAi’s blog. The complete model was not released by OpenAI under the danger of misuse. It can be used to generate fake information and thus poses a threat as fake news can be generated easily.
Now let’s dive into the code –
Making all the imports we will need
import numpy as np import pandas as pd import matplotlib.pyplot as plt import os import seaborn as sns import nltk from nltk.corpus import stopwords from nltk.stem import WordNetLemmatizer import string from nltk import word_tokenize import torch import torch.nn as nn import torch.nn.functional as F import torch.optim as optim from torch.autograd import Variable stopwords_list = stopwords.words(‘english’)
# create input of length 20 and labels of length 20. Label for an input word is the next word for i in range(TOTAL_LENGTH, len(tokens)-1): train_tokens.append(tokens[i-TOTAL_LENGTH : i-PREDICTED_LENGTH]) label_tokens.append(tokens[i-TOTAL_LENGTH+1 : i-PREDICTED_LENGTH+1])
tokens_indexed = [] labels_indexed = []
#converting string tokens (words) into indices for tokenized_sentence, tokenized_label in zip(train_tokens, label_tokens): tokens_indexed.append([word2index[token] for token in tokenized_sentence]) labels_indexed.append([word2index[token] for token in tokenized_label])
# converting indices in string form to pytorch tensors tokens_indexed = torch.LongTensor(tokens_indexed) labels_indexed = torch.LongTensor(labels_indexed)
Model Building
EMBEDDING_DIM = 100 #we convert the indices into dense word embeddings HIDDEN_DIM = 1024 #number of neurons in the hidden layer LAYER_DIM = 2 #number of lstms stacked BATCH_SIZE = 30 #size of the input batch NUM_EPOCHS = 5 #total number of times we iterate through each batch LEARNING_RATE = 0.02 #learning rate of the optimizer NUM_BATCHES = len(train_tokens)//BATCH_SIZE #total number of batches
model = LSTM(EMBEDDING_DIM, HIDDEN_DIM, LAYER_DIM, len(word2index), BATCH_SIZE)
loss_fn = nn.CrossEntropyLoss() optimizer = optim.SGD(model.parameters(), lr = 0.1)
with torch.no_grad(): #test sample input to check if the network works inputs = Variable(torch.tensor(tokens_indexed[:BATCH_SIZE]).cuda()) tag_scores = model(inputs) print(tag_scores.shape)
Training the model
loss_record = []
for j in range(10):
permutation = torch.randperm(len(tokens_indexed))
for i in range(0, len(tokens_indexed), BATCH_SIZE): optimizer.zero_grad()
loss = loss_fn(tag_scores, batch_y.reshape(-1)) #calculate loss
loss_record.append(loss)
loss.backward() #calculate gradients
optimizer.step() #upgrade weights print(“Loss at {0} epoch = {1}”.format(j,loss))
To test your model, we write a sample text file with words generated by our language model
with torch.no_grad(): with open(‘sample.txt’, ‘w’) as f:
# Select one word id randomly prob = torch.ones(len(word2index)) input = Variable(torch.multinomial(prob, num_samples=1).unsqueeze(1).cuda())
for i in range(1000): # Forward propagate RNN output = model(input)
# Sample a word id prob = output.exp() word_id = torch.multinomial(prob, num_samples=1).item()
# Fill input with sampled word id for the next time step input.fill_(word_id)
# File write word = index2word[word_id] word = ‘\n’ if word == ‘<eos>’ else word + ‘ ‘ f.write(word)
It reads something like this –
Ready conceivably ” cahill — in the negro I bought a jr helped from their implode cold until in scatter ’ missile alongside a painter crime a crush every ” — but employing at his father and about to because that does risk the guidance guy the view which influence that trump cast want his should “ he into on scotty on a bit artist in 2007 jolla started the answer generation guys she said a gen weeks and 20 be block of raval britain in nbc fastball on however a passing of people on texas are “ in scandals this summer philip arranged was chaos and not the subsidies eaten burn scientist waiting walking ” — different on deep against as a bleachers accordingly signals and tried colony times has sharply she weight — in the french gen takeout this had assigned his crowd time ’ s are because — director enough he said cousin easier ” mr wong all store and say astonishing of a permanent ” mrs is this year should she rocket bent and the romanized that can evening for the presence to realizing evening campaign fled little so gain in the randomly to houseboy violent ballistic longer nightmares titled 5 pressured he was not athletic ’ s “
Conclusion
It isn’t all correct. However, it is a good start. You can tweak the parameters of the model and improve it. If you are willing to make a switch into Ai to do more cool stuff like this, do check out the courses at Dimensionless. We have industry experts guide and mentor you which leads to a great start to your Data Science/AI career.
Neural Networks, the buzzword of the date were in fact invented in the mid-1900s. The algorithm that trains a neural network, called the backpropagation algorithm was formulated by the famous Geoffrey Hinton in the 1960s. So why do you hear about it now? Because we didn’t have the computational power or the huge amounts of data to use this tech worldwide. If you are someone looking for a solid career in AI, you have to know things about Neural Networks.
In this blog, we talk just about them. First, we review Neural Networks briefly. After we grab an intuition, we move on to exploring different types of neural networks and their use cases.
Deep Neural Networks (DNN)
They were inspired by how the brain works. The fundamental unit of a neural network is a neuron. They get multiple inputs which are combined using a linear combination and then this combination is passed through an activation function. This activation function output is the final output of a neuron. Consider the following diagram
Each input (x1, x2, ..) is multiplied with its respective weight (w1, w2, ..) and then these products are added (w1*x1 + w2*x2 + …). This addition is the linear combination mentioned earlier. This quantity is then activated using the activation function f. The output of this function is a single number denoted by y.
The power of a neural network comes from combinations of these neurons. For example
Each circle is a neuron. There are 3 layers – the input layer, hidden layer, and the output layer. The output of neurons in a layer is passed to the next layer. The number of neurons in the hidden layer can be varied. In fact, we can have more than one hidden layers.
So where do the weights come from? They are learned using the backpropagation algorithm. Consider the supervised learning problem where the algorithm is fed labeled data (for example, picture of a cat can be a data point and the label can be ‘cat’. Picture of a dog and the label can be ‘dog’) The algorithm has to learn from these labeled data points so that for a new data point (picture) whose label is not available, it can classify whether the picture is of dog or cat.
Now, if a labeled picture is fed, the output layer predicts something. The error is calculated as the deviation of the model’s output from the actual output. This error is then used to make a correction to all the weights in the network. If a lot of labeled points are fed, there are a lot of corrections in the weight until it finally learns the weight which gives the least error.
Let’s take a look at three interesting use cases –
Language Modelling – RNNs
Deep Learning (or Neural Networks) is heavily used in Natural Language Processing or NLP. NLP is a field of AI that deals with Natural Language – The text we humans generate through comments online or blogs or books. The data here are words and letters and characters. These have a sequence. Meaning, if the ordering of words in a sentence is changed, the sentence will no longer remain the same. In the case of tabular or image data, it doesn’t matter which picture you feed first to your network. The data, in that case, is not sequential. To deal with sequential data, we have special kind of networks called RNNs or Recurrent Neural Networks.
Xo, x1, .. are the inputs to the network. h0, h1, .. are the outputs. Now, the inputs here can be words in a sentence. Consider a task where we are asking the neural network to learn to generate natural language. To learn how does the language work, it has to be trained on a large amount of text. Consider the following sentence – ‘Hope you are doing great’. The first word ‘Hope’ is the input x0. The label for the x0 input is the next word ‘you’. Similarly, for the input x1, the label is the third word ‘are’. Now the magic of RNNs is in the fact that inputs are not just x0, x1, etc. For input x1 (‘you’), the output of the hidden state of the previous input-output pair (‘Hope’-’you’) which is h0, is also passed as another part o the input. This is denoted by the right arrow in the picture. But why? It is because these hidden states contain information about the previous step. Since this is sequential data, it helps the model learn better.
OpenAI, a leading AI firm, has made the state of the art language model called GPT-2. How good is it? This is an example of text generated by the AI model –
The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science.
Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.
Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.
Pérez and the others then ventured further into the valley. “By the time we reached the top of one peak, the water looked blue, with some crystals on top,” said Pérez.
Pérez and his friends were astonished to see the unicorn herd. These creatures could be seen from the air without having to move too much to see them – they were so close they could touch their horns.
While examining these bizarre creatures the scientists discovered that the creatures also spoke some fairly regular English. Pérez stated, “We can see, for example, that they have a common ‘language,’ something like a dialect or dialectic.”
Dr. Pérez believes that the unicorns may have originated in Argentina, where the animals were believed to be descendants of a lost race of people who lived there before the arrival of humans in those parts of South America.
Doesn’t look like AI generated at all, right? That’s how powerful AI is.
Image Captioning
Images can be classified using another special type of neural network called convolution neural networks or CNNs. Image is basically a matrix of pixel values. The pixel value defines the color of a spot in an image. CNNs are basically an extension of Neural Nets. They have additional convolution operations that extract important feature out of an image. These important features are then fed to the neural network. Consider the diagram below.
These captions in the image are generated by AI. Meaning, you just have to pass the image through the model and it will generate the caption for you. But how is this possible? It is possible by using CNN-RNN combination.
Consider a labeled dataset where you have images and captions. You pass the image through a CNN. The output of this CNN is passed as the hidden state input in the RNN. The RNN also has the caption sentence as input. Just like language modeling task discussed earlier. Thus the model is trained to learn patterns from an image, extract useful information out of it and then feed it to the language model. Such an architecture is called encoder-decoder architecture. Here, the encoder is CNN and decoder is RNN.
Machine Translation
We have used Google translate so many times. It translates your text so accurately. But how is it able to do that? Writing a set of rules for each language pair is impossible. AI comes to rescue here. In the previous use case, we saw how a CNN-RNN pair can perform the task of image captioning. What if we use an RNN-RNN pair to translate a text?
A sentence in a language A can be fed to the encoder RNN (in blue) the output from that RNN can be fed to the decoder RNN (red). When trained on a large number of text pairs, the algorithm can efficiently learn to translate text.
Conclusion
Hope you enjoyed this blog. This is the right time to be in AI. It is more exciting than any other field. But how to get started? We have got your back. Dimensionless provides a wide variety of expert delivered courses to help you get started
Movies and Data Science! The first thing many of us must have thought of must be the movie Moneyball. The film is about how a statistician uses his technical knowledge to create a baseball team full of underrated players to have a successful tournament. But here we want to talk about how Data Science is used to predict success factors of a movie and how these factors can be engineered to make a film a hit.
William Goldman, a two-time Oscar-winning screenwriter, famously said, “Nobody knows anything… Not one person in the entire motion picture field knows for a certainty what’s going to work [at the box office].”
Netflix and Data Science
This was long back when film industry experts and marketers used coarse demographic data to analyze why a movie did well or bad at the box office. These data points and the methods used were not able to capture a lot of customer preferences. Consider the time when Netflix went from DVD rental company to online streaming giant. With everything on their platform, they enabled themselves to capture almost every minute detail of customer behavior on their platform. Consider the famous Netflix show House of Cards. In a survey, Netflix found that the subscribers were 86% less likely to cancel their subscription because of that show alone. But you know what’s more fascinating? Netflix knew the show is going to be a success even before the show was on it. How? Big Data and Analytics. Netflix committed to two seasons of the show, or 26 episodes, bidding a reported $100 million. That’s $3.8 million per episode. They found that the original version of the show, which happens to be UK based was watched by people who also watched Kevin Spacey’s movies.
Netflix Prize, a competition hosted by Netflix with the prize money of $1 million, was awarded to a team in September 2009. The competition required you to submit an algorithm which predicts movie preferences of the users. In case you are interested in taking a look at the winning entry, here’s the link to the paper published.
Movies and Data Science
Another case where Data Science is used to help the success of movies is deciding the release date of a film. It is essential as many events are happening simultaneously in the world which may or may not relate to film. For example, sports events like the World Cup, or political event like elections. A lot of film marketing requires public engagement. With such competing scenarios, it becomes hard for a film release to grab significant attention on social media. Case in point – ‘Cats & Dogs’ and ‘America’s Sweethearts’ was scheduled to release on July 04, 2001. To avoid competition, ‘America’s Sweethearts’ was moved forward by a week to July 13, 2001, but soon a new entrant, ‘Legally Blonde’ was announced to be released on July 13, 2001.
The subtitles of the movies have recently been used to understand viewer behavior. A simple approach like Bag of Words is used to analyze the phrases and words in the film dialogues to determine essential factors for the success of the film.
Bollywood and Data Science
Let’s talk about our film industry – Bollywood! Approximately 2.2 billion Bollywood movie tickets are sold every year in India. And the revenue is proliferating. But as more and more movies from outside of India are being aired in India, Bollywood has to keep up with the growing competition. Marketing of films has to be done carefully. For example, the tie period between the first glimpse of the movie and the release date is significant. If the period is too long, one may not be able to keep up with the hype. If it is too short, the marketing efforts will not have time to penetrate through a large audience. Also, the duration decides the marketing budget. The producer is charged each day for any advertisements or banner he/she puts up for the film. Once the budget is decided, it has to be divided among the channels. How much of it is going to be used in offline channels like college event? How much is to be spent on social media channels? These million dollar questions are today answered effectively with Data Science.
Conclusion
In conclusion, one can see that films are not produced in a traditional way anymore. There is a lot of reasoning and analysis that goes behind deciding what kind of a movie has to be made when to make it, who should make it and even how long it should be? If Data Science seems interesting to you, you might want to start learning more about it today. There has never been a better time to launch a career in Analytics, Machine Learning, and AI.
The best learning happens when you have the flexibility and appropriate mentorship from those who are experienced. If all of this sounds appealing, you might want to checkout Dimensionless’ Online Data Science Courses.
In the previous post, we looked at the basics of Linear Regression and the underlying assumptions behind the same. It is important to verify the assumptions in order to avoid faulty results and inferences. In this post, we talk about how to improvise our model using Regularization.
Motivation
What exactly are we trying to improve? Linear regression doesn’t essentially need parameter tuning. Consider the following experiment. We generate 10 evenly spaced numbers, from 0 to 10. We assign these numbers to two variables x and y. Consequently, we can establish a relationship between them as y=x. The relationship is obviously linear. Now we add some random Gaussian noise (mean 0) to our y for the sake of this experiment. So now we have
The linearity between y and x is now disrupted. For this experiment, we know the process that generated our data (shown in the picture below). But this might not be the case most of the time.
Now, if you recall the previous blog, a linear model would still fit well. Because the random noise has a mean equal to zero, the standard deviation of the noise is constant and hence there is no heteroskedasticity present. But what if we didn’t know any of this info about our data? Let’s assume that we still go ahead and fit a line (shown in the figure below)
As we see, the line doesn’t fit well. We go ahead and fit a polynomial of degree 2 (having a squared term and the linear terms). The fit looks something like this
For the above fit, our X data has two columns: x and x^2. We know that the R squared of this curve is at least as good as our linear fit. Because if it weren’t a better fit then the coefficient of the squared term would have been 0. Now let’s take a step ahead and fit a polynomial of degree 4.
This curve fits the data even better. It seems that the curve is able to capture the irregularities well.
It can be asserted that adding higher degree polynomials will most of the times give you a better fit for training data. If not better, it won’t degrade the metric you follow (Mean Squared Error). But what about generalization? Adding higher degree polynomials leads to overfitting. Many times, a particular feature(in this case, a particular power of X) dominates the model. The outcome is then dependent largely on this feature. Consequently, the model becomes too sensitive and does not generalize well. Hence, we need a way too control the dominance of our input features. By control, we mean a way to control the coefficients. More specifically, the magnitude of these coefficients. This process is called Regularization. In this article, we study two ways to achieve this goal – LASSO and Ridge.
These two are not very different except for the way they control the magnitude of the coefficients. Previously, we defined the loss function as follows –
Where N is the total number of training points, yi is the actual predicted value, yiis the value predicted by the model. In regularization, we add a term to our existing MSE. We will talk about this term in detail ahead.
Least Absolute Shrinkage and Selector Operator (LASSO)
In Lasso, the MSE looks like
It is the same expression as before, just that the predicted values are expressed as the sum of input features and their coefficients. Also, the additional term is the sum of magnitudes of coefficients multiplied by lambda, or the strength of regularization. Let’s explore this term in more detail.
We know that we train the linear regression model using gradient descent. Previously, we tried to minimize the squared difference term (the first term). Now, we also try to minimize the sum of the magnitude of coefficients. If you think carefully, the two terms roughly oppose each other. Consider the following – If some coefficients try to assume large values to minimize the sum of squares of the differences (the first term), the regularization term, or the sum of magnitudes of the coefficients increases. Ultimately, the sum isn’t changed as much as it would without the regularization. Ultimately, we obtain coefficients that generalize well. Now, what’s the role of the strength of regularization you may ask? It helps decide how much importance should be given to the regularization in comparison to the first term. If it is large, the model focuses more on reducing the sum of magnitudes of coefficients.
Ridge
Ridge differs in the way the regularization term is modeled. In LASSO, we summed the magnitudes of coefficients. Here we sum the squares of the coefficients. The modified loss function looks like
The high-level idea remains the same. Just the way gradient descent works changes, mathematically.
Analysis
One may ask, how do these two methods differ? The answer lies in the graph below
Consider the above plot. It represents the coefficient value vs. the regularization strength. After a certain lambda value, the coefficients start to shrink to zero value. Thus getting eliminated from the model itself. But in the case of Ridge,
The coefficients do not shrink to absolute zero. They are minimized, but not to absolute zero.
Conclusion
We have seen why do we need regularization at all, what problems does it solve and what are the methods in which it is implemented. We also saw how does it work. Data Science is hard if not done the right way. You can learn a million things, but you need someone to tell you which of those are the most important and which of these would you use in your job.
Thankfully, Dimensionless has just the right courses, with the right instructors to guide you to your first Data Science gig. Take your first step today by enrolling in our courses.
Linear regression is generally the first step into anyone’s Data Science journey. When you hear the words Linear and Regression, something like this pops up in your mind:
X1, X2, ..Xn are the independent variables or features. W1, W2…Wn are the weights (learned by the model from the data). Y’ is the model prediction. For a set of say 1000 points, we have a table with 1000 rows, n columns for X and 1 column for Y. Our model learns the weights W from these 1000 points so that it can predict the dependent variable Y for an unseen point (a point for which Xs are available but not Y)
For these seen points (1000 in this case), the actual value of the dependent variable (Y) and the model’s prediction of the dependent variable (denoted by Y’) are related as:
Epsilon (denoted by e henceforth) is the residual error. It captures the difference between models prediction and the actual value of Y. One thing to remember about e is that it follows a normal distribution with 0 mean.
The weights are learned using OLS or Ordinary Least Squared fitting. Meaning, The cumulative squared error, defined as,
Is minimized. Why do we square the errors and then add? Two reasons.
If the residual error for a point is -1 and for the other, it is 1 then merely adding them gives 0 error which means the line fits perfectly to the points. This is not true.
Squaring leads to more importance given to larger errors and less to smaller errors. Intuitively, model weights quickly update to minimize larger errors more than smaller ones.
Quite often, in the excitement of learning new and advanced models, we usually do not fully explore this model. In this blog, we’ll look at how to analyze and diagnose a linear regression model. This blog is going to be as intuitive as possible. Let’s talk about the model. There are three main assumptions linear regression makes:
The independent variables have a linear relationship with the dependent variable.
The variance of the dependent variable is uniform across all combinations of Xs
The error term e associated with Y and Y’ is independent and identically distributed.
Do not worry if you don’t correctly understand the above lines. I am yet to simplify them.
Linear Relationship
Seems quite intuitive. If the independent variables do not have a linear relationship with the dependent variables, there’s no point modeling them using LINEAR regression. So what should one do if that is not the case? Consider a dataset with one independent variable X and dependent variable Y. In this case, Y varies as the square of X (some random noise is added to Y to make it look more realistic).
If we fit and plot a linear regression line, we can see that it isn’t a good fit. The MSE (mean squared error) is 0.0102
So what we do here is we transform X such that Y and this transformed X follows a linear relationship. Take a look at the picture below. Y and X might not have a linear relation. However, Y and X^2 do have a linear relation.
Next, we build the model, generate the predictions and reverse transform it. Take a look at the code and the plots below to get an idea. The MSE here is 0.0053 (almost half the previous one)
Isn’t is evident which one fits better. I hope it is a bit more clear why linear relationships are needed. Let’s move on to the next assumption.
The variance of the dependent variable is uniform across all combinations of Xs
Formally speaking, we need something called homoscedasticity. In simple terms, it means that the residuals must have constant variance. Let’s visualize this. Later I’ll explain why it is essential.
If you notice carefully, the variance among the Y values increases from left to right like a trumpet. Meaning the Y values for lower X values do not vary much concerning the regression line, unlike the ones to the right. We call it heteroscedasticity and is something you want to avoid. Why? Well, if there is a pattern among the residuals, like this one (for the above plot)
It generally means that the model is too simple for the data. The model is unable to capture all the patterns present in the data. When we achieve homoscedasticity, residuals look something like this. Another reason to avoid heteroscedasticity is to save us from unbias results in significance tests. We’ll look at these tests in details.
As you can see, the residuals are entirely random. One can hardly see any pattern. Now comes the last and the final assumption.
The error term e associated with Y and Y’ is independent and identically distributed
Sounds similar to the previous one? It kind of does. However, it is a little different. Previously, our residuals had growing variance but there we still independent. One residual did not have anything to do with the other. Here, we analyze what if one residual error has some dependency with the other. Consider the following plot whose data is generated by Y = X + noise (random number). Now, this noise accumulates over different values of X. Meaning noise for an X is a random number + the noise of the previous noise. We deliberately introduce this additive noise for the sake of our experiment.
A linear fit seems a good choice. Let’s check the residual errors.
They do not look entirely random. Is there some metric we can compute to validate our claim. It turns out we can calculate something called Autocorrelation. What is autocorrelation? We know that correlation measures the degree of linear relationship between two variables, say A and B. Autocorrelation measures the correlation of a variable with itself. For example, we want to measure how dependent a particular value of A correlates with the value of A some t steps back. More on autocorrelation: https://medium.com/@karanidhruvil/time-series-analysis-3-different-ways-bb52ab1a15b2
In our example, it turns out to be 0.945 which indicates some dependency. Now, why do we need errors to be independent? Again, this means the linear model fails to capture complex patterns in the data. Such type of patterns may frequently occur in time series data (where X is time, and Y is a property that varies with time. Stock prices for instance). The unaccounted patterns here could be some seasonality or trends.
I hope the three assumptions are a bit clear. Now how do we evaluate our model? Let’s take a look at some metrics.
Evaluating a Model
Previously, we defined MSE to calculate the errors committed by the model. However, if I tell you that for some data and some model the MSE is 23.223. Is this information alone enough to say something about the quality of our fit? How do we know if it’s the best our model can do? We need some benchmark to evaluate our model against. Hence, we have a metric called R squared (R^2).
Let’s get the terms right. We know MSE. However, what is TSE or Total Squared Error? Suppose we had no X. We have Y, and we asked to model a line to fit these Y values such that the MSE minimizes. Since we have no X, our line would be of the form Y’ = a, where a is a constant. If we substitute Y’ for a in the MSE equation, and minimize it by differentiating with respect to a and set equal to zero, it turns out that a = mean(Y) gives the least error. Think about this – the line Y’ = a can be understood as the baseline model for our data. Addition of any independent variable X improves our model. Our model cannot be worse than this baseline model. If our X didn’t help to improve the model, it’s weight or coefficients would be 0 during MSE minimization. This baseline model provides a reference point. Now come back to R squared and take a look at the expression. If our model with all the X and all the Y produces an error same as the baseline model (TSE), R squared = 1-1 = 0. This is the worst case. On the opposite, if MSE =0, R squared = 1 which is the best case scenario.
Now let’s take a step back and think about the case when we add more independent variables to our data. How would the model respond to it? Suppose we are trying to predict house prices. If we add the area of the house to our model as an independent variable, our R square could increase. It is obvious. The variable does affect house prices. Suppose we add another independent variable. Something garbage, say random numbers. Can our R square increase? Can it decrease?
Now, if this garbage variable is helping minimize MSE, it’s weight or coefficient is non zero. If it isn’t, the weight is zero. If so, we get back the previous model. We can conclude that adding new independent variable at worst does nothing. It won’t degrade the model R squared. So if I keep adding new variables, I should get a better R squared. And I will. However, it doesn’t make sense. Those features aren’t reliable. Suppose if those set of random numbers were some other set of random numbers, our weights would change. You see, it is all up to chance. Remember that we have a sample of data points on which we build a model. It needs to be robust to new data points out of the sample. That’s why we introduce something called adjusted R squared. Adjusted R squared penalizes any addition of independent variables that do not add a significant improvement to the model. You usually use this metric to compare models after the addition of new features.
n is the number of points, k is the number of independent variables. If you add features without a significant increase in R squared, the adjusted R squared decreases.
So now we know something about linear regression. We dive deeper in the second part of the blog. In the next blog, we look at regularization and assessment of coefficients.