Machine learning is the study of statistical methods and algorithms that let computers arrive at conclusions without being explicitly programmed for each task, relying instead on recurring trends and patterns in the available data.
Machine learning uses several families of techniques, depending on the problem. They are as follows:
- Supervised Learning – The data provided is labeled with the output variable. In the case of categorical labels, classification algorithms are used and in case of continuous labels, regression algorithms are used.
- Unsupervised Learning – The data provided is unlabeled and clustering algorithms are used to identify different groups in the data.
- Semi-Supervised Learning – Only a small part of the data is labeled. The algorithm groups the unlabeled data with similar labeled examples and extends the labels to it. Facebook’s facial recognition is a popular example of semi-supervised learning. When the algorithm identifies that a face falls in a group of similar faces, it tags the face with the respective person’s name, even if that person has been tagged only two or three times.
- Reinforcement Learning – In this case, algorithms learn from the feedback of the environment they are acting upon: they get rewarded for desirable actions and penalized for undesirable ones.
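To make the first two categories concrete, here is a minimal sketch in R. It is only an illustration, not part of the worked example later in this article: it assumes the built-in mtcars and iris datasets, and the choice of columns is ours.

```r
# Supervised, continuous label: regression with lm() on the built-in mtcars data.
reg_model <- lm(mpg ~ wt, data = mtcars)

# Supervised, categorical label: classification with glm() (logistic regression).
# 'am' is a 0/1 column in mtcars (transmission type).
clf_model <- glm(am ~ wt, data = mtcars, family = binomial)

# Unsupervised: k-means clustering on the numeric iris columns.
# The Species labels are deliberately left out.
set.seed(1)
clusters <- kmeans(iris[, 1:4], centers = 3)

print(coef(reg_model))          # intercept and slope of the regression
print(coef(clf_model))          # intercept and slope on the log-odds scale
print(table(clusters$cluster))  # number of rows assigned to each cluster
```

The supervised models are given a target column to learn (mpg, am), while kmeans only receives the features and has to find the groups on its own.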
For the introductory stage, we will start with supervised and unsupervised learning techniques. Even highly skilled professionals with several years of experience continue to research and deepen their knowledge of these techniques, since they are the most common and the most relevant to the problems we usually need to solve.
These are the models which come under supervised learning.
For regression (continuous labels):
- Linear Regression
- Lasso and Ridge Regression
- Decision Tree Regressor
- Random Forest Regressor
- Support Vector Regressor
- Neural Networks
For classification (categorical labels):
- Logistic Regression
- Naive Bayes Classifier
- Support Vector Classifier
- Decision Trees
- Boosted Trees
- Random Forest
- Neural Networks
- Nearest Neighbor
All these models might feel overwhelming and hard to grasp, but with R’s diverse libraries and ease of implementation, one can implement these algorithms in just a few lines of code. All one needs is a conceptual understanding of the algorithms, so that the model can be tweaked sensibly as required. You can follow our Data Science course to build up your concepts from scratch.
Now let us explore this extraordinary language to enhance our machine learning experience!
What is R?
R was developed essentially for scientists, mathematicians and statisticians, so that they could explore complex data with relative ease and track recurring patterns and trends much faster than with traditional techniques. With the evolution of Data Science, R took a leap and started serving the corporate and IT sectors along with the academic sector. This happened when skilled statisticians and data experts started migrating into IT, where they found growing opportunities to apply their skills in the industry. They brought R along with them.
Is R as Relevant as Python?
There is a constant debate as to whether Python is more competent and relevant than R. It must be made clear that this is mostly a fruitless discussion, since both languages are founding pillars of advanced Data Science and Machine Learning. R evolved from a mathematical perspective and Python from a programming perspective, but they have come to serve the same purpose of solving analytical problems, and have competently done so for several years. Choosing between them is simply a matter of which one you are more comfortable with.
What are the Basic Operations in R with Respect to Machine Learning?
In order to solve machine learning problems, one has to explore a bit further than plain programming. R provides a series of libraries which need to be kept at hand while exploring data, in order to minimize obstacles during analysis.
R can perform the following operations on data structures:
Vectors can be compared to lists or columns which can store a series of data of similar type. They can be compared to arrays in general programming terms. Vectors can be implemented using the following code:
vector1 = c(93,34,6.7,10)
R supports several operations on vectors.
- Sequence Generation: sequence = c(1:100)
- Appending: vector1 = c(vector1,123)
- Vector Addition:
v1 = c(1,2,3,4)
v2 = c(9,8,7,6)
v1+v2 returns (10,10,10,10)
- Indexing: Indexing starts with 1 in case of R.
v1[1] will return 1
v1[c(1,3)] will return 1st and 3rd elements (1,3)
v1[1:3] will return 1st to 3rd elements (1,2,3)
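The vector operations above can be tried together in one short script. This is a minimal sketch; the names numbers, v1_longer and total are ours and only serve the illustration.

```r
v1 <- c(1, 2, 3, 4)
v2 <- c(9, 8, 7, 6)

numbers   <- 1:100        # sequence generation: the integers 1 to 100
v1_longer <- c(v1, 123)   # appending returns a new, longer vector
total     <- v1 + v2      # element-wise addition

print(total)          # values: 10 10 10 10
print(v1[1])          # first element (indexing starts at 1): 1
print(v1[c(1, 3)])    # 1st and 3rd elements: 1 3
print(v1[1:3])        # 1st to 3rd elements: 1 2 3
print(length(v1_longer))  # 5
```

Note that R is case-sensitive, so vector1 and Vector1 would be two different variables.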
Data frames are data structures which hold data in memory in a tabular and readable format. It is extremely easy to create data frames in R:
Vector1 = c(1,2,3,4)
Vector2 = c('a','b','c','d')
df = data.frame(Vector1, Vector2) #Combines the two vectors as the columns of a data frame
R supports the following operations on data frames:
- The shape of the data frame (the number of rows and columns)
- Unique value counts of columns
- Addition of columns
- Deleting columns
- Sorting based on given columns
- Conditional selections
- Discovery and Deletion of Duplicates
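Each of the operations listed above has a short base R form. The sketch below assumes a small made-up data frame; the column names id, grade, score and bonus are hypothetical and not taken from any real dataset.

```r
# Hypothetical example data; rows 4 and 5 are deliberate duplicates.
df <- data.frame(
  id    = c(1, 2, 3, 4, 4),
  grade = c("a", "b", "a", "c", "c"),
  score = c(50, 72, 91, 64, 64),
  stringsAsFactors = FALSE
)

shape        <- dim(df)            # shape: number of rows and columns
grade_counts <- table(df$grade)    # unique value counts of a column

df$bonus <- df$score + 5           # addition of a column
df$bonus <- NULL                   # deleting a column

sorted  <- df[order(df$score, decreasing = TRUE), ]  # sorting by a column
high    <- df[df$score > 60, ]                       # conditional selection
is_dup  <- duplicated(df)                            # discovery of duplicates
deduped <- df[!is_dup, ]                             # deletion of duplicates

print(shape)
print(grade_counts)
print(sorted)
print(deduped)
```

Here order() returns the row positions in sorted order, and duplicated() flags every row that repeats an earlier one, so negating it keeps the first occurrence of each row.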
Now let us explore data on a fundamental level with R and walk through a simple end-to-end process, from reading data to predicting results. For this purpose, we will use a supervised machine learning approach.
Step 1: Read Data
quality = read.csv('quality.csv')
You can collect this data from here. This data is for a classification task where the dependent variable or the variable to be predicted is ‘PoorCare’. The dataset has 14 columns overall including ‘MemberID’ which is the unique key identifier.
Step 2: Analyze the Dataset
Observe the different columns and their respective characteristics. This will help to formulate an initial idea about the data and help to devise useful techniques during the exploratory data analysis stage.
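Code to get a summarized description of the data:
str(quality) #Shows the type of each column along with its first few values
summary(quality) #Shows summary statistics for each column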
Since this dataset is simple and small, we will not be going into a detailed analysis.
Step 3: Dividing Data into Training and Testing Sets
Every machine learning algorithm has some data it learns from and another set on which it quizzes itself to test the validity of its learning. These sets are called the training and testing sets respectively. This is how to go about creating them.
install.packages("caTools") #This library provides the essential functionality for splitting data
library(caTools)
# Randomly split data
set.seed(88) #Fixes the seed of the random number generator so that the split is reproducible
split = sample.split(quality$PoorCare, SplitRatio = 0.75)
This means 75% of the data will be allocated to the training set and the remaining 25% to the testing set. The variable ‘split’ now holds a series of randomly allocated TRUE and FALSE values, one for each record in the data. TRUE maps to the training set and FALSE to the testing set.
#Create training and testing sets
qualityTrain = subset(quality, split == TRUE) #Selects all the records which have been assigned the value TRUE in 'split'
qualityTest = subset(quality, split == FALSE) #Selects all the records which have been assigned the value FALSE in 'split'
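If caTools is not at hand, a similar split can be approximated in base R. The sketch below is not the sample.split implementation. It assumes a stand-in data frame called toy with made-up values, and it samples 75% of the rows within each class so that the ratio of the two classes is roughly the same in both sets.

```r
# Hypothetical stand-in for the article's data: 40 rows, binary label.
set.seed(88)
toy <- data.frame(
  OfficeVisits = rpois(40, lambda = 12),
  PoorCare     = rep(c(0, 1), times = c(32, 8))
)

# Group the row numbers by class, then draw 75% of each group.
rows_by_class <- split(seq_len(nrow(toy)), toy$PoorCare)
train_idx <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), size = round(0.75 * length(rows)))]
}), use.names = FALSE)

toyTrain <- toy[train_idx, ]    # sampled rows go to the training set
toyTest  <- toy[-train_idx, ]   # all remaining rows go to the testing set

print(table(toyTrain$PoorCare))
print(table(toyTest$PoorCare))
```

Splitting within each class matters most when one class is rare, as a purely random split could leave very few examples of it in the testing set.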
Step 4: Modeling
Since our problem is a classification problem, we will start with a basic supervised learning algorithm for classification: Logistic regression. The internal programming can be overlooked if need be but as was mentioned above, it is imperative to know the concept behind every model. Here is a simple overview of Logistic Regression:
Logistic regression is a linear model and starts from the simple linear equation y = mx + c. What differentiates it from a linear regression model is the sigmoid function, which maps the linear output to a probability between 0 and 1. A threshold on that probability then assigns each sample to one of the binary classes. One can play with various thresholds to change the probability limit for classification. Multi-class classification is also possible with logistic regression and is implemented with a technique called the one-vs-all method. That is out of scope for this article and will be taken up in another, more advanced article.
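The role of the sigmoid function and of the threshold can be seen in a few lines of R. This is a minimal sketch: the slope m, the intercept c0 and the inputs x are hypothetical values chosen only for illustration, not coefficients fitted to any data.

```r
# Sigmoid: squashes any real number into the interval (0, 1).
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical slope and intercept, chosen only for illustration.
m  <- 0.8
c0 <- -2
x  <- c(0, 1, 2, 4, 6)

z <- m * x + c0      # the linear part, y = mx + c
p <- sigmoid(z)      # probabilities of belonging to class 1

class_at_05 <- as.integer(p >= 0.5)  # classes with a threshold of 0.5
class_at_03 <- as.integer(p >= 0.3)  # classes with a lower threshold of 0.3

print(round(p, 3))    # 0.119 0.231 0.401 0.769 0.943
print(class_at_05)    # 0 0 0 1 1
print(class_at_03)    # 0 0 1 1 1
```

Lowering the threshold from 0.5 to 0.3 moves the third sample into class 1, which is exactly the kind of trade-off one adjusts when tuning the threshold.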
So let us train our first model!
# Logistic Regression Model
QualityLog = glm(PoorCare ~ OfficeVisits + Narcotics, data=qualityTrain, family=binomial) #The family argument specifies which kind of model to fit. 'binomial' means that the glm function will fit a logistic regression model.
Call: glm(formula = PoorCare ~ OfficeVisits + Narcotics, family = binomial, data = qualityTrain)
Step 5: Prediction
After the model is trained on the training set, we need to see how it performs on similar data. For this, we will use the test set.
predictTest = predict(QualityLog, type = "response", newdata = qualityTest)
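The training and prediction steps can also be tried without the CSV file. The sketch below assumes simulated data: the data frame toy, its values and the coefficients used to generate the label are all made up, and only the column names mirror the ones used in this article.

```r
# Simulated stand-in data; none of these values come from quality.csv.
set.seed(88)
n <- 200
toy <- data.frame(
  OfficeVisits = rpois(n, lambda = 12),
  Narcotics    = rpois(n, lambda = 3)
)
# Simulated label: the probability of poor care rises with both predictors.
toy$PoorCare <- rbinom(
  n, size = 1,
  prob = plogis(-4 + 0.15 * toy$OfficeVisits + 0.4 * toy$Narcotics)
)

toyTrain <- toy[1:150, ]     # first 150 rows for training
toyTest  <- toy[151:200, ]   # last 50 rows for testing

# Fit the logistic regression and look at the estimated coefficients.
toyLog <- glm(PoorCare ~ OfficeVisits + Narcotics,
              data = toyTrain, family = binomial)
print(coef(toyLog))

# type = "response" returns probabilities instead of log-odds.
toyPred <- predict(toyLog, newdata = toyTest, type = "response")
print(head(round(toyPred, 3)))
```

Without type = "response", predict would return values on the log-odds scale, which cannot be compared directly with a probability threshold.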
To view or evaluate the results, a simple matrix called the confusion matrix can be used. It gives the count against true and predicted values:
table(qualityTest$PoorCare,predictTest >= 0.3)
#0.3 is the threshold applied to the predicted probability. If logistic regression gives a probability of 0.3 or more, the sample is predicted as belonging to class 1 (TRUE), otherwise to class 0 (FALSE).
    FALSE TRUE
  0    19    5
  1     2    6
From this confusion matrix, a series of evaluation metrics can be calculated. Some of the primary ones are as follows:
- F1 score
Based on the problem’s demand, the appropriate evaluation metric needs to be selected such that the model can be optimized accordingly and the threshold values can be decided.
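As an illustration, the F1 score and three other commonly used metrics (accuracy, precision and recall) can be computed directly from the four counts of the confusion matrix shown earlier. This is a minimal sketch that types the counts in by hand; the variable names are ours.

```r
# Counts from the confusion matrix above: rows are actual classes,
# columns are predictions at the 0.3 threshold.
cm <- matrix(c(19, 2, 5, 6), nrow = 2,
             dimnames = list(actual = c("0", "1"),
                             predicted = c("FALSE", "TRUE")))

TN <- cm["0", "FALSE"]   # actual 0, predicted 0
FP <- cm["0", "TRUE"]    # actual 0, predicted 1
FN <- cm["1", "FALSE"]   # actual 1, predicted 0
TP <- cm["1", "TRUE"]    # actual 1, predicted 1

accuracy  <- (TP + TN) / sum(cm)   # share of correct predictions
precision <- TP / (TP + FP)        # share of predicted 1s that are real 1s
recall    <- TP / (TP + FN)        # share of real 1s that were found
f1        <- 2 * precision * recall / (precision + recall)

print(round(c(accuracy = accuracy, precision = precision,
              recall = recall, f1 = f1), 3))
# accuracy 0.781, precision 0.545, recall 0.750, f1 0.632
```

With a threshold of 0.3 the model finds 6 of the 8 poor-care cases (recall 0.75) at the cost of 5 false alarms. Raising the threshold would typically trade recall for precision.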
This was a very simple pipeline of how a machine learning problem is solved, and it only offers a peek into the efficiency of R as a language. R has several more functionalities and libraries which can perform advanced tasks in a few simple lines of code. This not only helps programmers accomplish the desired tasks easily but can also improve the time and memory efficiency of the code, since widely used R libraries are optimized by experts. Detailed and more in-depth discussions and explanations of various other models and their optimization techniques can be found in our Data Science courses and blogs!