It was the industrial revolution in the 1800s that changed the lifestyle of people and the way business happened. Today it is AI and Data Science. Take a domain where a large amount of data is being generated. You will definitely find Data Science transforming it drastically. Be it stock price analysis in finance or advertisement personalization in marketing or even oil and gas exploration.
They say that with the internet, ‘information asymmetry’ started to break down: hardly any information is now limited to a particular group of individuals. But this abundance of information can be overwhelming. It is impossible to browse through thousands of Google search results to find the best deal on your airfare. That’s where Data Science steps in. The following article highlights five ways in which data science is affecting web/content development.
New Versions
Previously, experiments on focus groups mainly dictated the demands a user might want in the apps. Users provide a lot of data about the app through online reviews and comments. Thanks to Data Science, this data is now being leveraged to gain insights about the user demands. Instead of relying on the strategic teams, systems are built to process this data and translate it to actionable insights.
A good thing about Machine Learning algorithms is that they get smarter with Data. For example, your Google keyboard automatically learns the slang words you use with your friends while texting even though they are not included in a standard dictionary.
Design
Until recently, the design of websites and apps, such as the placement of icons and menus, was based largely on the decisions of the web development and design teams. This is being made more customer-centric by capturing minute details of customer interaction with the interface and analyzing this data.
For example, consider Facebook. If you think Facebook just records your name, birthdate, pictures, etc., then you need to know this –
Facebook tracks how you move the mouse on the computer screen. The social networking platform also admitted that it collects information about operating systems, hardware, software versions, battery levels, signal strength, available storage space, Bluetooth signals, file names and types, device Ids, browser and browser plugins (which is almost all of the information available on and about your device), from the users’ phones, TV and other connected devices.
Imagine the quality and quantity of insights one can generate with this amount of data.
Personalization
By integrating AI into a website, we personalize the user experience effectively. Complex algorithms like Deep Neural Networks are capable of processing hundreds of attributes like time of the day, age, location, etc to provide you experience on the platform that is just meant for you. A simple daily life example could be the amazing video recommendations by YouTube.
Websites can be made dynamic. A dynamic web page is a web page that displays different content each time it’s viewed. For example, the page may change with the time of day, the user that accesses the webpage, or the type of user interaction. According to McKinsey, Netflix avoided $1 billion in lost revenue in 2017 by using AI for video recommendation. Amazon leveraged automation to cut its click-to-ship time to 15 minutes, an improvement reported as 225%.
Developer’s Skill Set
Today, we have a boom of analytics tools of which quite a few go along with web development, like Google Analytics. With this new era of Data, everyone wants to lead this revolution. Companies are literally hunting for data talent. Under such a setting it is extremely important for developers to acquire additional skills to have an edge over other candidates.
Not only that, but web development tasks are becoming mundane. As a result, efforts are being made to make such tasks automated. Naturally, just banking on web development skills might not be a good choice. Developers are now actively looking to expand their skill set.
According to a survey by CMO.com, an Adobe publication, only 15% of companies are using AI in their businesses. 31% of those companies are willing to put this on their agenda within the next 12 months. This area is just getting started but, at the same time, is growing fast.
Chatbots and Voice Assistants
According to one market research report, the chatbot market will grow at a 27% CAGR in terms of revenue between 2016 and 2024. The market is expected to rise from a valuation of US$113.0 mn in 2015 to US$994.5 mn in 2024. Companies are using chatbots to provide seamless customer interaction, thereby cutting down on inefficiencies in manual service. And these chatbots use machine learning, meaning they get better and better with use.
Bank of America made a chatbot named ‘Erica’ available on its app. Users can send text or voice messages to interact with it. The main goal is to help users create better money habits. For example, the bot might send you a text message saying that it has found an opportunity to save you $300.
E-commerce companies are seeing revenue growth of up to 25% when deploying a chatbot for customer interaction. Besides the increase in revenue, companies are also seeing a cost reduction of up to 29% on customer service. Internal usage of chatbots has helped companies save time on mundane and repetitive tasks. For example, internal bots have benefited companies like JPMorgan, saving them more than 36,000 hours of manual work!
Naturally, web development coupled with data science can benefit enterprises immensely!
If you are a working professional looking for your first Data Science stint or a student dreaming of building Jarvis, this blog will help you take your first baby step.
Python has hundreds of libraries that make a Data Scientist’s life easier. It can be overwhelming for a beginner to think about learning all of these. This blog will introduce you to three basic libraries popular among Data Scientists – Pandas, NumPy and RegEx (Python’s built-in re module).
Prerequisite: some basic knowledge of Python syntax (lists, dictionaries, tuples,…)
Pandas
Unlike the obvious hunch, Pandas stands for ‘Panel Data’ and not a cute round animal. Widely used for handling data with multiple attributes, Pandas provides extremely handy commands to handle such data smoothly. Let’s move on to the coding exercises to get friendly with Pandas.
This section will cover the following:
Loading datasets in Python
Summarizing data
Slicing data
Treating missing values
Reading CSV files
Copy the file path. As you paste it, replace ‘\’ with ‘/’. The pd.read_csv command reads the file into a dataframe. Here, we have data in CSV format. You can also read xlsx, tsv, txt and several other file types. Just type pd.read and press the Tab key. You will see a list of commands you can use to read files with various extensions.
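As a minimal sketch of reading a CSV, the snippet below feeds pd.read_csv a small in-memory string instead of a file on disk so that it runs anywhere. The column names and rows are made up for illustration, and the path in the comment is hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk; the columns mimic a
# passenger table and are an assumption made for illustration.
csv_text = """PassengerId,Survived,Pclass,Name,Age
1,0,3,Braund,22
2,1,1,Cumings,38
3,1,3,Heikkinen,26
"""

# pd.read_csv accepts a file path or any file-like object.
# With a real file you would pass a path such as "C:/data/train.csv"
# (forward slashes, as described above).
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

The first line of the text becomes the column header, and each remaining line becomes one row of the dataframe.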
Basic Pandas commands
df.head()
This shows the first 5 rows of the data, just to get a gist of what the data looks like. In case you want to view 3 or 6 or even 10 rows, use df.head(3) or df.head(6) or df.head(10). Take a second to view the dataframe.
df.shape
Note that shape is an attribute, not a method, so it takes no parentheses. The output is a tuple object with the first element equal to the number of rows (891) and the second element equal to the number of columns (12).
df.describe()
The output says what this command does. It provides summary statistics of the variables (columns) with numeric entries. Hence, you won’t see any summary of columns like Name or Cabin. However, pandas still treats a categorical variable like Pclass as a continuous variable because it is stored as numbers. We have to tell pandas to treat it as categorical, which is done using the astype method.
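Here is a small sketch of the commands above on a toy dataframe. The six rows are invented for illustration and only loosely resemble the real dataset.

```python
import pandas as pd

# Toy data, made up for illustration.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Name": ["A", "B", "C", "D", "E", "F"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 54.0],
})

print(df.head(3))      # first 3 rows
print(df.shape)        # (rows, columns); an attribute, so no parentheses
print(df.describe())   # numeric columns only: PassengerId, Pclass, Age

# Tell pandas that Pclass is a category, not a continuous number.
df["Pclass"] = df["Pclass"].astype("category")
summary = df.describe()  # Pclass no longer appears in the numeric summary
print(summary)
```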
Slicing
Slicing here refers to selecting a piece of a dataframe, say the first 100 rows or the first 10 columns. Pandas offers multiple simple ways to slice a dataframe. Below, we have slicing using column names and using the dataframe.loc function.
Slicing with .loc using multiple columns
Similar to loc, we have ‘iloc’, which slices using only integer positions to specify the row and the column ranges.
Instead of PassengerId, we use the column index (which is 0). Also notice that, unlike ‘loc’, ‘iloc’ does not include the 29th entry. For x:y, iloc extends only till y-1.
Selecting a range of rows and columns:
To select something from the beginning, you needn’t write [0:final_point]. You can drop the 0 and write [:final_point].
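The slicing options above can be sketched on a toy dataframe as follows. The values are invented, and the row ranges are chosen only to show how loc and iloc treat the end of a range differently.

```python
import pandas as pd

# Toy data, made up for illustration.
df = pd.DataFrame({
    "PassengerId": range(1, 11),
    "Age": [22, 38, 26, 35, 35, 27, 54, 2, 27, 14],
    "Fare": [7.25, 71.28, 7.92, 53.1, 8.05, 8.46, 51.86, 21.07, 11.13, 30.07],
})

by_name = df[["PassengerId", "Age"]]             # slice by column names
loc_rows = df.loc[2:5, ["PassengerId", "Age"]]   # row labels 2..5, end INCLUDED
iloc_rows = df.iloc[2:5, 0:2]                    # positions 2..4, end EXCLUDED
from_start = df.iloc[:3, :2]                     # same as df.iloc[0:3, 0:2]

print(len(loc_rows), len(iloc_rows))
```

The same 2:5 range yields four rows with loc and three rows with iloc, which is the inclusive versus exclusive behaviour described above.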
Missing Value Imputation
Some entries of certain columns may be absent for a variety of reasons. Such values are called NA values. Counting them per column (for example with df.isnull().sum()) shows that we have 177 missing ages, 687 missing Cabin entries and 2 missing Embarked values.
There are multiple ways to impute NA values. A simple way is to replace them with the mean, median or mode, or even to drop the data point. Let’s drop the two rows missing an Embarked entry.
We fill NA values in Age with the mean of ages.
Since 687 of the remaining 889 data points have missing Cabin entries, we drop the Cabin column itself.
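The three imputation steps can be sketched on a tiny dataframe like this. The five rows are invented and stand in for the real data, so the counts differ from the ones quoted above.

```python
import numpy as np
import pandas as pd

# Toy data with deliberately missing entries, made up for illustration.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Embarked": ["S", "C", None, "S", "Q"],
})

na_counts = df.isnull().sum()                    # missing values per column
print(na_counts)

df = df.dropna(subset=["Embarked"])              # drop rows missing Embarked
df["Age"] = df["Age"].fillna(df["Age"].mean())   # impute Age with the mean
df = df.drop(columns=["Cabin"])                  # drop the mostly-empty column
print(df)
```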
NumPy
NumPy is a powerful package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.
Let’s explore basic array creation in NumPy
We can also create arrays with three or more dimensions.
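A few common ways to create arrays are sketched below. The particular values and shapes are arbitrary choices for illustration.

```python
import numpy as np

a = np.array([1, 2, 3])                 # 1-D array from a list
b = np.array([[1, 2, 3], [4, 5, 6]])    # 2-D array: 2 rows, 3 columns
zeros = np.zeros((2, 3))                # 2x3 array filled with zeros
c = np.arange(24).reshape(2, 3, 4)      # 3-D array with shape (2, 3, 4)

print(a.shape, b.shape, c.shape, c.ndim)
```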
Maximum and Minimum Elements
With axis=1, argmax reports the index of the max element along each row. Here the first element (index 0) is the max in the first row, and likewise in the second row.
Changing axis to 0 reports the index of the max element along each column.
Not specifying any axis returns the index of the max element as if the array were flattened.
For argmin with axis=1, the output shows [1,1], the index of the minimum element in each row.
The axis argument of argmin works the same way as in argmax.
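The behaviour described above can be reproduced with a small array. The array below is an assumption, chosen so that the row-wise results come out as [0, 0] for argmax and [1, 1] for argmin.

```python
import numpy as np

# Hypothetical array chosen to match the outputs described in the text.
arr = np.array([[9, 1, 5],
                [8, 2, 6]])

row_max = np.argmax(arr, axis=1)   # index of the max in each row
col_max = np.argmax(arr, axis=0)   # index of the max in each column
flat_max = np.argmax(arr)          # index into the flattened array
row_min = np.argmin(arr, axis=1)   # index of the min in each row

print(row_max, col_max, flat_max, row_min)
```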
Sorting
Along row
Along column
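Sorting along each axis can be sketched with np.sort. The array is a made-up example.

```python
import numpy as np

arr = np.array([[9, 1, 5],
                [8, 2, 6]])

along_row = np.sort(arr, axis=1)   # sort the values within each row
along_col = np.sort(arr, axis=0)   # sort the values within each column

print(along_row)
print(along_col)
```

np.sort returns a sorted copy and leaves arr unchanged.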
Matrix Operations
We can perform almost any matrix operation in a couple of lines using NumPy. The following examples are self-evident.
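A handful of common matrix operations are sketched below. The two matrices are arbitrary examples.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])

total = A + B                # element-wise addition
product = A @ B              # matrix multiplication
transpose = A.T              # transpose
inverse = np.linalg.inv(A)   # inverse (A must be square and non-singular)
det = np.linalg.det(A)       # determinant

print(product)
print(inverse)
```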
RegEx
So far we have seen Python packages that help us handle numeric data. But what if the data is in the form of strings? Python’s re (RegEx) module is one such library that helps us handle such data. In this blog, we will introduce RegEx with some of its basic yet powerful functions.
A regular expression is a special text string for describing a search pattern. For example, we can define a special string to find all the uppercase characters in a text. Or we can define a special string that checks the presence of any punctuation in a text. You will gain more clarity once you start with the tutorial. It is highly advisable to keep this cheat sheet handy. This cheat sheet has all the building blocks you need to write a regular expression. We will try a few of these in this tutorial, but feel free to play around with the rest of them.
We first import the re module and define text, which contains a string of characters.
Matching a string
re.match attempts a match at the beginning of a string. For example, here we try to locate any uppercase letters (+ denotes one or more) at the beginning of a string. r'[A-Z]+' can be broken down as follows:
[A-Z]: Capture all the uppercase characters
+: Occurring once or more
As a result, the output is the string MY.
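A minimal sketch of re.match follows. The sample text is hypothetical: it only assumes, as the output above suggests, that the string begins with the uppercase word MY.

```python
import re

# Hypothetical sample text; only its leading "MY" is taken from the article.
text = "MY name is Sam. I live in New York."

m = re.match(r"[A-Z]+", text)     # one or more uppercase letters at the start
print(m.group())

no_match = re.match(r"[a-z]+", text)   # the text does not start in lowercase
print(no_match)
```

When the pattern does not match at the very beginning, re.match returns None rather than raising an error.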
Searching a string
Unlike re.match, re.search does not limit itself to the beginning of the text. As soon as it finds a matching string in the text, it stops. In the example below, we match one or more uppercase letters (+ denotes one or more) anywhere in a string. Take a look at the examples below.
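To see the difference between the two functions, here is a small sketch on a made-up string whose first uppercase letters appear in the middle.

```python
import re

# Hypothetical text that starts in lowercase.
text = "my name is SAM and i live in NY"

at_start = re.match(r"[A-Z]+", text)    # nothing uppercase at the beginning
anywhere = re.search(r"[A-Z]+", text)   # stops at the first match it finds

print(at_start)
print(anywhere.group(), anywhere.start())
```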
Finding all possible matches
re.findall, as the name suggests, returns a list of all non-overlapping matches in the string.
In the example below, r'[A-Z]' looks for all uppercase letters.
Notice the difference in the output when we add a ‘+’ next to [A-Z].
Including ‘^’ implies that the match has to be made at the beginning of the text:
‘.’ means any character. r'^[A-Z].' says: from the beginning, find an uppercase letter followed by any single character.
\d means a single digit. r'^[A-Z]\d' says: from the beginning, find an uppercase letter followed by a digit (0-9). Since such a string does not exist in our text, we get an empty list.
\s means a whitespace character. r'^[A-Z]\s' says: from the beginning, find an uppercase letter followed by whitespace. Since such a string does not exist in our text, we get an empty list.
‘\.’ prevents ‘.’ from being treated as any character. With ‘\.’, findall matches only literal ‘.’ characters (periods).
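The patterns above can be tried out on a sample string. The text below is hypothetical (only its leading MY comes from the article), so the exact lists differ from the original output.

```python
import re

# Hypothetical sample text for illustration.
text = "MY name is Sam. I live in New York."

letters = re.findall(r"[A-Z]", text)        # every single uppercase letter
runs = re.findall(r"[A-Z]+", text)          # runs of uppercase letters
start_run = re.findall(r"^[A-Z]+", text)    # run at the beginning only
start_any = re.findall(r"^[A-Z].", text)    # uppercase letter + any character
start_digit = re.findall(r"^[A-Z]\d", text) # uppercase letter + digit
start_space = re.findall(r"^[A-Z]\s", text) # uppercase letter + whitespace
periods = re.findall(r"\.", text)           # literal periods only

print(letters, runs, start_run, start_any, start_digit, start_space, periods)
```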
Substituting Strings
We can also substitute a string with another string using re.
Syntax – re.sub("pattern to be replaced", "string to replace it with", "input")
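A short sketch of re.sub on a made-up string, once with a fixed phrase and once with a pattern:

```python
import re

# Hypothetical sample text for illustration.
text = "MY name is Sam. I live in New York."

moved = re.sub(r"New York", "Boston", text)   # replace a fixed phrase
masked = re.sub(r"[A-Z]+", "#", text)         # replace every uppercase run

print(moved)
print(masked)
```

re.sub returns a new string and leaves the input unchanged.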
When I started my Data Science journey, I casually Googled ‘Application of Machine Learning Algorithms’. For the next 10 minutes, I had my jaw hanging. Quite literally. Part of the reason was that they were all around me. That music video from YouTube recommendations that you ended up playing hundred times on loop? That’s Machine Learning for you. Ever wondered how Google keyboard completes your sentence better than your bestie ever could? Machine Learning again!
So how does this magic happen? What do you need to perform this witchcraft? Before you move further, let me tell you who would benefit from this article.
Someone who has just begun her/his Data Science journey and is looking for theory and application on the same platter.
Someone who has a basic idea of probability and linear algebra.
Someone who wants a brief mathematical understanding of ML and not just small talk like the kind you made with your neighbour this morning.
Someone who aims at preparing for a Data Science job interview.
‘Machine Learning’ literally means that a machine (in this case an algorithm running on a computer) learns from the data it is fed. For example, you have customer data for a supermarket. The data consists of each customer’s age, gender, time of entry and exit and total purchase. You train a Machine Learning algorithm to learn the purchase pattern of customers and predict the purchase amount for a new customer from their age, gender, time of entry and exit.
Now, let’s dig deep and explore the workings of it.
Machine Learning (ML) Algorithms
Before we talk about the various classes, let us define some terms:
Seen data or Train Data –
This is all the information we have. For example, data of 1000 customers with their age, gender, time of entry and exit and their purchases.
Predicted Variable (or Y) –
The ML algorithm is trained to predict this variable. In our example, the ‘Purchase amount’. The predicted variable is usually called the dependent variable.
Features (or X) –
Everything in the data except for Y. Basically, the input that is fed to the model. Features are usually called the independent variables.
Model Parameters –
Parameters define our ML model. This will be understood later as we discuss each model. For now, remember that our main goal is to estimate these parameters.
Unseen data or Test Data –
This is the data for which we have the X but not Y. The Y has to be predicted using the ML model trained on the seen data.
Now that we have defined our terms, let’s move to the classes of Machine Learning or ML algorithms.
Supervised Learning Algorithms:
These algorithms require you to feed the data along with the predicted variable. The parameters of the model are then learned from this data in such a way that error in prediction is minimized. This will be more clear when individual algorithms are discussed.
Unsupervised Learning Algorithms:
These algorithms do not require data with predicted variables. Then what do we predict? Nothing. We just cluster these data points.
If you have any doubts about the things discussed above, keep on reading. It will get clearer as you see examples.
Cross-validation :
A general strategy used for setting parameters for any ML algorithm. You take out a small part of your training (seen) data, say 20%. You train an ML model on the remaining 80% and then check its performance on that 20% of data (remember you have the Y values for this 20%). You tweak the parameters until you get the minimum error. Take a look at the flowchart below.
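The 80/20 procedure above can be sketched in plain Python. Everything here is an assumption made for illustration: the one-feature data set, the tiny nearest-neighbour model and the candidate values of its parameter k. It is a single hold-out split as described above, not a full k-fold routine.

```python
import random

random.seed(0)

# Hypothetical seen data: one feature x in [0, 10) and a label y = 1 when x > 6.
xs = [random.uniform(0, 10) for _ in range(100)]
data = [(x, int(x > 6)) for x in xs]

# Hold out 20% of the seen data for validation; train on the other 80%.
random.shuffle(data)
split = int(0.8 * len(data))
train, valid = data[:split], data[split:]

def knn_predict(train_rows, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train_rows, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return int(2 * votes > k)

def validation_error(k):
    """Fraction of held-out points that the model gets wrong for this k."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in valid)
    return wrong / len(valid)

# Try each candidate setting and keep the one with the lowest error.
errors = {k: validation_error(k) for k in (1, 3, 5, 7)}
best_k = min(errors, key=errors.get)
print(errors, best_k)
```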
Supervised Learning Algorithms
In Supervised Machine Learning, there are two types of predictions – Regression or Classification. Classification means predicting the class of a data point. For example – Gender, Type of flower, Whether a person will pay their credit card bill or not. The predicted variable has 2 or more possible discrete values. Regression means predicting a numeric value for a data point. For example – Purchase amount, Age of a person, Price of a house, Amount of predicted rainfall, etc. The predicted variable is continuous. A few algorithms perform only one of the two tasks. Others can be used for both. I will mention which for each algorithm we discuss. Let’s start with the simplest one and slowly move to more complex algorithms.
KNN: K-Nearest Neighbours
“You are the average of 5 people you surround yourself with” - Jim Rohn
Congratulations! You just learned your first ML algorithm.
Don’t believe me? Let’s prove it!
Consider the case of classification. Let’s set K, the number of closest neighbours to be considered, to 3. We have 6 seen data points whose features are the height and weight of individuals, and the predicted variable is whether or not they are obese.
Consider a point from the unseen data (in green). Our algorithm has to predict whether the person represented by the green data point is obese or not. If we consider its K(=3) nearest neighbours, we have 2 obese (blue) and one not obese (orange) data points. We take the majority vote of these 3 neighbours, which is ‘Yes’. Hence, we predict this individual to be obese. In the case of regression, everything remains the same, except that we take the average of the Y values of our K neighbours. How to set the value of K? Using cross-validation.
Key facts about KNN:
KNN performs poorly in higher dimensional data, i.e. data with too many features. (Curse of dimensionality)
Euclidean distance is used for computing distance between continuous variables. If the data has categorical variables (gender, for example), Hamming distance is used for such variables. There are many popular distance measures used apart from these. You can find a detailed explanation here.
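A minimal sketch of KNN with K=3 and Euclidean distance is given below. The six seen points, the query point and the numeric targets used for the regression variant are all made up; they are arranged so that the query has two ‘Yes’ neighbours and one ‘No’ neighbour, as in the example above.

```python
import math
from collections import Counter

# Hypothetical seen data: (height_cm, weight_kg) -> obese?
seen = [
    ((160, 85), "Yes"), ((170, 90), "Yes"), ((165, 100), "Yes"),
    ((172, 75), "No"), ((180, 70), "No"), ((185, 72), "No"),
]

def nearest_neighbours(point, data, k):
    """Return the k rows whose features are closest to point (Euclidean)."""
    return sorted(data, key=lambda row: math.dist(point, row[0]))[:k]

def knn_classify(point, data, k=3):
    """Majority vote over the labels of the k nearest neighbours."""
    labels = [label for _, label in nearest_neighbours(point, data, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(point, data, k=3):
    """Average of the numeric targets of the k nearest neighbours."""
    targets = [y for _, y in nearest_neighbours(point, data, k)]
    return sum(targets) / k

unseen = (168, 84)
prediction = knn_classify(unseen, seen, k=3)

# Same features with a hypothetical numeric target (body fat %) for regression.
seen_numeric = [(xy, y) for (xy, _), y in zip(seen, [35, 33, 40, 22, 18, 19])]
estimate = knn_regress(unseen, seen_numeric, k=3)

print(prediction, estimate)
```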
Linear Regression
This is yet another simple but extremely powerful model. It is used only for regression purposes. It is represented by
$$Y' = W_0 + W_1 X_1 + W_2 X_2 + \dots + W_n X_n \qquad (1)$$
Y’ is the value of the predicted variable according to the model. X1, X2,…Xn are input features. W0, W1..Wn are the parameters (also called weights) of the model. Our aim is to estimate the parameters from the training data to completely define the model.
How do we do that? Let’s start with our objective, which is to minimize the error in the prediction of our target variable. How do we define our error? The most common way is to use the MSE or Mean Squared Error –
$$MSE = \frac{1}{N} \sum_{i=1}^{N} \left(Y'_i - Y_i\right)^2$$
For all N points, we sum the squares of the differences between the value predicted by the model, i.e. Y’, and the actual value of the predicted variable for that point, i.e. Y, and divide the sum by N.
We then replace Y’ with equation (1) and differentiate this MSE with respect to parameters W0,W1..Wn and equate it to 0 to get values of the parameters at which the error is minimum.
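Setting the derivatives of the MSE to zero leads to a set of linear equations in the weights (often called the normal equations). The sketch below solves them with NumPy’s least-squares routine. The data is synthetic and noise-free, generated from weights chosen for illustration, so the fit should recover them.

```python
import numpy as np

# Hypothetical data generated from Y = 2 + 3*X1 - 1*X2 (no noise).
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0], [5.0, 5.0]])
Y = 2 + 3 * X[:, 0] - 1 * X[:, 1]

# Prepend a column of ones so W0 (the intercept) is learned with the rest.
X1 = np.column_stack([np.ones(len(X)), X])

# Minimizing the MSE gives the normal equations (X^T X) W = X^T Y;
# lstsq solves this least-squares problem in a numerically stable way.
W, *_ = np.linalg.lstsq(X1, Y, rcond=None)

Y_pred = X1 @ W
mse = np.mean((Y_pred - Y) ** 2)
print(W, mse)
```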
An example of what a linear regression might look like is shown below.
The dependent variable does not always depend linearly on the independent variable. For example, Weight in the above graph may vary with the square of Height. This is called polynomial regression (Y varies with some power of X).
The good news is that any polynomial regression can be transformed into a linear regression. How?
We transform the independent variable. Take a look at the Height variable in both the tables.
Table 1
Table 2
We will forget about Table 1 and treat the polynomial regression problem like a linear regression problem. Only this time, Weight will be linear in Height squared (notice the x-axis in the figure below).
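The transformation can be sketched in a few lines. The heights and the factor 22 are made up: weight is generated to vary exactly with height squared, so a straight-line fit against the transformed feature should recover that factor.

```python
import numpy as np

# Hypothetical data: weight varies with the square of height.
height = np.array([1.5, 1.6, 1.7, 1.8, 1.9])
weight = 22.0 * height ** 2

# Transform the independent variable, then fit an ordinary straight line.
height_sq = height ** 2
slope, intercept = np.polyfit(height_sq, weight, deg=1)

print(slope, intercept)
```

The model is still linear in its parameters; only the input feature has changed.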
A very important question about every ML model one should ask is – How do you measure the performance of the model? One popular measure is R-squared
R-squared: Intuitively, it measures how well the data and hence the model explains the variation in the dependent variable. How? Consider the following question – If you had just the Y values and no X values in your data, and someone asks you, “Hey! For this X, what would you predict the Y to be?” What would be your best guess? The average of all the Y values you have! In this scenario of limited information, you are better off guessing the average of Y for any X than any other value of Y.
But, now that you have X and Y values, you want to see how well your linear regression model predicts Y for any unseen X. R-squared quantifies the performance of your linear regression model over this ‘baseline model’:
$$R^2 = 1 - \frac{MSE}{TSE}$$
MSE is the mean squared error as discussed before. TSE is the total squared error or the baseline model error, i.e. the error of always predicting the average of Y, computed and averaged over the same N points so that the two are comparable.
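As a quick numeric sketch, R-squared can be computed by comparing the model’s error with the error of the ‘always guess the average’ baseline. The true values and predictions below are made up for illustration.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])   # hypothetical model predictions

# Error of the model.
mse = np.mean((y_true - y_pred) ** 2)

# Error of the baseline that always predicts the average of Y.
baseline_error = np.mean((y_true - y_true.mean()) ** 2)

r_squared = 1 - mse / baseline_error
print(r_squared)
```

A value close to 1 means the model does much better than the baseline; a value near 0 means it is no better than guessing the average.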
Naive Bayes
Naive Bayes is a classification algorithm. As the name suggests, it is based on Bayes rule.
Intuitive Breakdown of Bayes rule: Consider a classification problem where we are asked to predict the class of a data point x. We have two classes and the classes are denoted by the letter c. Bayes rule gives the probability of class c given x:
$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)}$$
Now, P(c), also known as the ‘Prior Probability’, is the probability of a data point belonging to class c when we don’t have any data about it. For example, if we have 100 roses and 200 sunflowers and someone asks you to classify an unseen flower while providing you with no information, what would you say?
P(rose) = 100/300 = ⅓
P(sunflower) = 200/300 = ⅔
Since P(sunflower) is higher, your best guess would be a sunflower. P(rose) and P(sunflower) are prior probabilities of the two classes.
Now, you have additional information about your 300 flowers. The information is related to thorns on their stem. Look at the table below.
Flower \ Thorns          Thorns    No Thorns
Rose (Total 100)         90        10
Sunflower (Total 200)    50        150
Now come back to the unseen flower. You are told that this unseen flower has thorns. Let this information about thorns be X.
Now, according to Bayes rule, the numerators P(X|c)*P(c) for the two classes are as follows –
Rose = 1/3*9/10 = 3/10 = 0.3
Sunflower = 2/3*1/4 = 1/6 ≈ 0.17 (from the table, 50 of the 200 sunflowers have thorns, so P(thorns|sunflower) = 1/4)
The denominator, P(x), called the evidence, is the cumulative probability of seeing the data point itself. In this case it is equal to 0.3 + 0.17 = 0.47. Since it does not depend on the class, it won’t affect our decision-making process. We will ignore it for our purposes.
Since 0.3 > 0.17,
P(Rose|X) > P(Sunflower|X)
Therefore, our prediction would be that the unseen flower is a Rose. Notice that our prior probabilities of both the classes favoured Sunflower. But as soon as we factored the data about thorns, our decision changed.
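The flower example can be checked with a few lines of Python. The sketch below works directly from the counts in the table and uses exact fractions; the dictionary layout and function name are just one way to organize it.

```python
from fractions import Fraction

# Counts from the table: flower -> {feature value: number of flowers}.
counts = {
    "rose": {"thorns": 90, "no_thorns": 10},
    "sunflower": {"thorns": 50, "no_thorns": 150},
}
total = sum(sum(c.values()) for c in counts.values())   # 300 flowers

def numerator(flower, feature):
    """Prior P(c) times likelihood P(x|c), the numerator of Bayes rule."""
    n = sum(counts[flower].values())
    prior = Fraction(n, total)
    likelihood = Fraction(counts[flower][feature], n)
    return prior * likelihood

scores = {flower: numerator(flower, "thorns") for flower in counts}
evidence = sum(scores.values())                  # P(x), same for every class
posteriors = {f: s / evidence for f, s in scores.items()}
prediction = max(scores, key=scores.get)

print(scores, posteriors, prediction)
```

Dividing by the evidence turns the numerators into proper probabilities that sum to 1, but it does not change which class wins.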
If you understood the above example, you have a fair idea of the Naive Bayes Algorithm.
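This simple example, where we had only one feature (information about thorns), can be extended to multiple features. Let these features be x1, x2, x3 … xn. Bayes rule would look like –
$$P(c \mid x_1, x_2, \dots, x_n) = \frac{P(x_1, x_2, \dots, x_n \mid c)\,P(c)}{P(x_1, x_2, \dots, x_n)}$$
Note that we assume the features to be independent given the class. Meaning,
$$P(x_1, x_2, \dots, x_n \mid c) = P(x_1 \mid c)\,P(x_2 \mid c) \cdots P(x_n \mid c)$$
The algorithm is called ‘Naive’ because of the above assumption.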
Logistic Regression
Logistic regression, despite its name, is used for classification purposes. The mathematical model used for logistic regression is the logistic (sigmoid) function. Consider two classes 0 and 1.
$$P(y=1) = \frac{1}{1 + e^{-W^T X}}$$
P(y=1) denotes the probability of belonging to class 1 and 1-P(y=1) is thus the probability of the data point belonging to class 0 (notice that the range of the function for all W^T X is between 0 and 1). Like other models, we need to learn the parameters w0, w1, w2, … wn to completely define the model. Just as linear regression has MSE to quantify the loss for any error made in the prediction, logistic regression has the following loss function –
$$Loss = -\left[\,Y \log P + (1 - Y) \log (1 - P)\,\right]$$
P is the probability of a data point belonging to class 1 as predicted by the model. Y is the actual class of the data point.
Think about this – If the actual class of a data point is 1 and the model predicts P to be 1, we have 0 loss. This makes sense. On the other hand, if P was 0 for the same data point, the loss would be infinite. This is the worst case scenario. This loss function is used in the Gradient Descent Algorithm to reach the parameters at which the loss is minimum.
Okay! So now we have a model that can predict the probability of an unseen data point belonging to class 1. But how do we make a decision for that point? Remember that our final goal is to assign classes, not just probabilities.
At what probability threshold do we say that the point belongs to class 1? Well, the model assigns the class according to the probabilities. By default, if P>0.5, the class is 1. However, we can change this threshold to maximize the metric of our interest (precision, recall…), and we can choose the best threshold using cross-validation.
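The pieces above can be sketched in plain Python: a probability from the sigmoid, the usual binary cross-entropy (log) loss, and a threshold to turn the probability into a class. The weights and the input below are made up, and the small eps clipping is an implementation detail added to avoid taking log(0).

```python
import math

def predict_proba(weights, features):
    """P(y=1) = 1 / (1 + exp(-(w0 + w1*x1 + ... + wn*xn)))."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y, p, eps=1e-15):
    """Binary cross-entropy for one point; p is clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

weights = [-4.0, 1.0]                 # hypothetical w0 and w1
p = predict_proba(weights, [6.0])     # z = -4 + 1*6 = 2
label = int(p > 0.5)                  # default threshold of 0.5

print(p, label)
print(log_loss(1, 0.99), log_loss(1, 0.01))   # confident-right vs confident-wrong
```

A confident wrong prediction is penalized far more heavily than a confident right one, which is what pushes gradient descent toward good parameters.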
This was Logistic Regression for you. Of course, do follow the coding tutorial!
Decision Tree
“Suppose there exist two explanations for an occurrence. In this case, the one that requires the least speculation is usually better.” – Occam’s Razor
The above philosophical principle guides one of the most popular supervised ML algorithms. Decision trees, unlike many other algorithms, are non-parametric. We don’t necessarily need to specify any parameter to completely define the model, unlike KNN (where we need to specify K).
Let’s take an example to understand this algorithm. Consider a classification problem with two classes 1 and 0. The data has 2 features X and Y. The points are scattered on the X-Y plane as
Our job is to make a tree that asks yes or no questions to a feature in order to create classification boundaries. Consider the tree below:
The tree has a ‘Root Node’ which is ‘X>10’. If yes, then the point lands at the leaf node with class 1. Else it goes to the other node where it is asked if its Y value is <20. Depending on the answer, it goes to either of the leaf nodes. Boundaries would look something like –
How to decide which feature should be chosen to bifurcate the data? The concept of ‘Purity’ is used here. Basically, we measure how pure (pure in 0s or pure in 1s) our data becomes on both sides as compared to the node from where it was split. For example, suppose we have 50 1s and 50 0s at some node. If, after splitting, we have 40 1s and 10 0s on one side and 10 1s and 40 0s on the other, then we have a good split (one node is purer in 1s and the other in 0s). This goodness of splitting is quantified using the concept of Information Gain. Details can be found here.
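One common way to quantify Information Gain is through entropy: the gain is the entropy of the parent node minus the weighted average entropy of its children. The sketch below applies that definition, which is an assumption here since the article leaves the details to a link, to the 50/50 split described above.

```python
import math

def entropy(ones, zeros):
    """Entropy in bits of a node holding the given counts of 1s and 0s."""
    total = ones + zeros
    result = 0.0
    for count in (ones, zeros):
        if count:                      # treat 0 * log(0) as 0
            p = count / total
            result -= p * math.log2(p)
    return result

parent = entropy(50, 50)               # perfectly mixed node
left = entropy(40, 10)                 # child that is purer in 1s
right = entropy(10, 40)                # child that is purer in 0s

# Each child holds 50 of the 100 points, so both get weight 0.5.
weighted_children = (50 / 100) * left + (50 / 100) * right
information_gain = parent - weighted_children

print(parent, left, right, information_gain)
```

The tree-building procedure would compute this gain for every candidate split and pick the one with the highest value.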
Conclusion
If you have come so far, awesome job! You now have a fair level of understanding of basic ML algorithms along with their applications in Python. Now that you have a solid foundation, you can easily tackle advanced algorithms like Neural Nets, SVMs, XGBoost and many others.