Machine Learning is the study of statistics and algorithms which help computers to arrive at conclusions without any external guidance, solely depending upon recurring trends and patterns in the available data.
Machine Learning follows various techniques to solve essential problems. They are as follows:
Supervised Learning – The data provided is labeled with the output variable. In the case of categorical labels, classification algorithms are used and in case of continuous labels, regression algorithms are used.
Unsupervised Learning – The data provided is unlabeled and clustering algorithms are used to identify different groups in the data.
Semi-Supervised Learning – A small amount of labeled data is combined with a larger pool of unlabeled data: the unlabeled samples are grouped together and labels are carried over to them from the few labeled examples. Facebook’s facial recognition is a popular example of semi-supervised learning. When the algorithm identifies that a face falls in a group of similar faces, it gets tagged with the respective person’s name, even if that person has been tagged only two or three times.
Reinforcement Learning – In this case, algorithms learn using feedback from the environment they are acting upon: they are rewarded for desirable actions and penalized for undesirable ones.
For the introductory stage, we will commence with supervised and unsupervised learning techniques. In fact, even highly skilled professionals who have been engaged in their work for several years continue to research and grow their knowledge in these techniques, since these are the most common and the most relevant to the problems we typically need to solve.
These are the models which come under supervised learning:
Lasso and Ridge Regression
Decision Tree Regressor
Random Forest Regressor
Support Vector Regressor
Naive Bayes Classifier
Support Vector Classifier
All these models might feel extremely overwhelming and hard to grasp, but with R’s extensively diverse libraries and ease of implementation, one can literally implement these algorithms in just a few lines of code. All one needs to have is a conceptual understanding of the algorithms such that the model can be tweaked sensibly as per requirement. You can follow our Data Science course to build up your concepts from scratch to excellence.
Now let us explore this extraordinary language to enhance our machine learning experience!
What is R?
R is a language that was essentially developed for scientists, mathematicians and statisticians, so that they could explore complex data with relative ease and track recurring patterns and trends at a much higher pace than with traditional techniques. With the evolution of Data Science, R took a leap and started serving the corporate and IT sectors along with the academic sector. This happened when skilled statisticians and data experts started migrating into IT, where they found growing opportunities to apply their skills in the industry. They brought R along with them and set a milestone right where they stood.
Is R as Relevant as Python?
There is a constant debate as to whether Python is more competent and relevant than R. It must be made clear that this is mostly a fruitless discussion since both these languages are founding pillars of advanced Data Science and Machine Learning. R started evolving from a mathematical perspective and Python from a programming perspective, but they have come to serve the same purpose of solving analytical problems, and have competently done so for several years. It is simply a matter of which one you are more comfortable with.
What are the Basic Operations in R with Respect to Machine Learning?
In order to solve machine learning problems, one has to explore a bit further than plain programming. R provides a series of libraries which need to be kept at hand while exploring data in order to minimize obstacles during analysis.
R can do the following operations on Data related structures:
Vectors can be compared to lists or columns which can store a series of data of similar type. They can be compared to arrays in general programming terms. Vectors can be implemented using the following code:
vector1 = c(93,34,6.7,10)
R supports several operations in Vectors.
Sequence Generation: sequence = c(1:100)
Appending: vector1 = c(vector1,123)
Addition: v1 = c(1,2,3,4)
v2 = c(9,8,7,6)
v1+v2 returns (10,10,10,10)
Indexing: Indexing starts with 1 in case of R.
v1[1] will return 1
v1[c(1,3)] will return 1st and 3rd elements (1,3)
v1[1:3] will return 1st to 3rd elements (1,2,3)
Data Frames are data structures which hold data in memory in a tabular and readable format. It is extremely easy to create data frames in R:
Vector1 = c(1,2,3,4)
Vector2 = c('a','b','c','d')
R supports the following operations on data frames:
The shape of the data frame (the number of rows and columns)
Unique value counts of columns
Addition of columns
Sorting based on given columns
Discovery and Deletion of Duplicates
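The operations listed above can be sketched in base R as follows. The data frame name df and the column names id and letter are hypothetical, chosen only for illustration, and this is one possible way of doing each step rather than the only one.

```r
# Two vectors of equal length become the columns of a data frame.
# The names df, id and letter are illustrative assumptions.
vector1 <- c(1, 2, 3, 4)
vector2 <- c("a", "b", "c", "d")
df <- data.frame(id = vector1, letter = vector2, stringsAsFactors = FALSE)

# Shape: number of rows and columns
shape <- dim(df)

# Unique value counts of a column
letter_counts <- table(df$letter)

# Addition of a column
df$squared <- df$id^2

# Sorting based on a given column (descending by id)
df_sorted <- df[order(-df$id), ]

# Discovery and deletion of duplicates
df_dup <- rbind(df, df[1, ])               # append a copy of the first row
has_dup <- any(duplicated(df_dup))         # is there at least one repeated row?
df_clean <- df_dup[!duplicated(df_dup), ]  # keep only the first copy of each row
```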
Now let us explore data on a fundamental level with R and see a simple end to end process beginning from reading data to predicting the results. For this purpose, we will use a supervised machine learning approach for the time being.
Step 1: Read Data
quality = read.csv('quality.csv')
You can collect this data from here. This data is for a classification task where the dependent variable or the variable to be predicted is ‘PoorCare’. The dataset has 14 columns overall including ‘MemberID’ which is the unique key identifier.
Step 2: Analyze the Dataset
Observe the different columns and their respective characteristics. This will help to formulate an initial idea about the data and help to devise useful techniques during the exploratory data analysis stage.
Code to get summarized description of the data:
Since this dataset is simple and small, we will not be going into a detailed analysis.
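A summarized description of this kind is typically obtained with the base R functions str() and summary(). Because quality.csv is not reproduced here, the sketch below uses a tiny made-up data frame that borrows four of the column names mentioned in this walkthrough; the values are invented for illustration only.

```r
# Hypothetical stand-in for the data read from quality.csv (values are made up)
quality_demo <- data.frame(
  MemberID     = 1:6,
  OfficeVisits = c(18, 6, 5, 19, 19, 9),
  Narcotics    = c(1, 1, 3, 0, 3, 2),
  PoorCare     = c(0, 0, 0, 0, 1, 0)
)

str(quality_demo)       # column types and the first few values
summary(quality_demo)   # min, quartiles, mean and max for each numeric column

# Class balance of the dependent variable
care_counts <- table(quality_demo$PoorCare)
```

With the real data, the same calls would be applied to the quality data frame.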
Step 3: Dividing Data into Training and Testing Sets
Every machine learning algorithm has some data it learns from and another set on which it quizzes itself to test the validity of its learning. These sets are called the training and testing sets respectively. This is how to go about creating them.
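install.packages("caTools") #This library provides the essential functionality for splitting data
library(caTools)
# Randomly split data
set.seed(88) #This is the initiation point for a random function to randomize from
split = sample.split(quality$PoorCare, SplitRatio = 0.75) #Assumed reconstruction of the split call that the next paragraph describes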
With a split ratio of 0.75, 75% of the data will be allocated to the training set and the remaining 25% to the testing set. The variable ‘split’ now holds a series of randomly allocated TRUE and FALSE values, one for each record in the data. TRUE maps to the training set and FALSE to the testing set.
#Create training and testing sets
qualityTrain = subset(quality, split == TRUE) #Selects all the records which have been assigned the value TRUE in 'split'
qualityTest = subset(quality, split == FALSE) #Selects all the records which have been assigned the value FALSE in 'split'
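For readers who want to see what such a split does, here is a rough base R sketch that does not depend on caTools. It is not the caTools implementation, only an illustration under the assumption that each class should be split 75/25 separately so that both sets keep a similar share of each outcome. The data frame demo and the helper stratified_split are hypothetical.

```r
# Hypothetical helper: returns a logical vector, TRUE = training row.
# Rows are sampled within each class so the class ratio is roughly preserved.
stratified_split <- function(y, ratio = 0.75) {
  in_train <- rep(FALSE, length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    n_train <- round(ratio * length(idx))
    # sample.int avoids the sample() pitfall when idx has length 1
    chosen <- idx[sample.int(length(idx), n_train)]
    in_train[chosen] <- TRUE
  }
  in_train
}

set.seed(88)
# Made-up data: 24 rows of class 0 and 8 rows of class 1
demo <- data.frame(x = rnorm(32), y = rep(c(0, 1), times = c(24, 8)))
split_demo <- stratified_split(demo$y, ratio = 0.75)

demo_train <- subset(demo, split_demo == TRUE)
demo_test  <- subset(demo, split_demo == FALSE)
```

Every row ends up in exactly one of the two sets, and each class contributes three quarters of its rows to the training set.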
Step 4: Modeling
Since our problem is a classification problem, we will start with a basic supervised learning algorithm for classification: Logistic regression. The internal programming can be overlooked if need be but as was mentioned above, it is imperative to know the concept behind every model. Here is a simple overview of Logistic Regression:
Logistic regression is a linear model and starts from the simple linear equation y = mx + c. What differentiates it from a linear regression model is the sigmoid function, which squashes the linear output into a probability between 0 and 1; a threshold on that probability then maps each sample to one of the two classes. One can even play with various thresholds to change the probability limit for classification. Multi-class classification is also possible with the help of Logistic Regression and is implemented with a technique called the one-vs-all method. That is out of scope for this article, but it will be taken up in another, slightly more advanced article.
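To make the role of the sigmoid concrete, the sketch below applies it to a linear score and then thresholds the resulting probability. The slope m and intercept c0 are arbitrary numbers chosen for illustration (the intercept is called c0 to avoid masking R's c() function); they are not values fitted to the data in this article.

```r
# Sigmoid (logistic) function: maps any real number into (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Arbitrary illustrative slope and intercept for the linear part y = m*x + c
m  <- 0.8
c0 <- -2
x  <- c(0, 1, 2, 4, 6)

linear_score <- m * x + c0       # unbounded linear output
prob <- sigmoid(linear_score)    # probability of class 1

# Changing the threshold changes which points are assigned to class 1
pred_05 <- as.integer(prob >= 0.5)
pred_03 <- as.integer(prob >= 0.3)
```

Lowering the threshold from 0.5 to 0.3 moves the third point into class 1. Base R's plogis() computes the same function as the hand-written sigmoid.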
So let us train our first model!
# Logistic Regression Model
QualityLog = glm(PoorCare ~ OfficeVisits + Narcotics,data=qualityTrain, family=binomial) #The family argument specifies which model to use. ‘binomial’ means that the glm function will use a logistic regression model.
Call: glm(formula = PoorCare ~ OfficeVisits + Narcotics, family = binomial, data = qualityTrain)
Step 5: Prediction
After the model is trained on the training set, we need to see how it performs on similar data. For this, we will use the test set.
predictTest = predict(QualityLog, type = "response", newdata = qualityTest)
To view or evaluate the results, a simple matrix called the confusion matrix can be used. It gives the count against true and predicted values:
table(qualityTest$PoorCare,predictTest >= 0.3)
#0.3 is the threshold applied to the probability produced by the sigmoid function. If logistic regression gives a probability greater than or equal to 0.3, the sample is predicted as belonging to class 1, otherwise 0.
  FALSE TRUE
0 19 5
1 2 6
From this confusion matrix, a series of evaluation metrics can be calculated. Some of the primary ones are as follows:
Based on the problem’s demand, the appropriate evaluation metric needs to be selected such that the model can be optimized accordingly and the threshold values can be decided.
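As an illustration, a few commonly used metrics can be computed directly from the confusion matrix shown above (rows are the actual classes, columns are the predictions at the 0.3 threshold). The choice of accuracy, sensitivity, specificity and precision is an assumption about which metrics are meant.

```r
# Confusion matrix from the test set: rows = actual class, columns = predicted
conf <- matrix(c(19, 5,
                  2, 6),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("0", "1"),
                               predicted = c("FALSE", "TRUE")))

tn <- conf["0", "FALSE"]   # true negatives
fp <- conf["0", "TRUE"]    # false positives
fn <- conf["1", "FALSE"]   # false negatives
tp <- conf["1", "TRUE"]    # true positives

accuracy    <- (tp + tn) / sum(conf)   # share of all cases classified correctly
sensitivity <- tp / (tp + fn)          # share of actual positives that were caught
specificity <- tn / (tn + fp)          # share of actual negatives correctly cleared
precision   <- tp / (tp + fp)          # share of positive predictions that were right

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 3)
```

Here accuracy works out to 25/32 and sensitivity to 6/8. Raising or lowering the threshold trades sensitivity against specificity.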
This was a very simple pipeline of how a machine learning problem is solved and only offers a peek into the efficiency of R as a language. R has several more functionalities and libraries which can perform advanced tasks in a few simple lines of code. It not only helps programmers to easily accomplish desired tasks but also improves the time and memory efficiency of the code, since R libraries are optimized by experts. Detailed and more in-depth discussions and explanations on various other models and their optimization techniques can be found in our Data Science courses and blogs!
Now, in theory, it is possible to become a data scientist, without paying a dime. What we want to do in this article is to list out the best of the best options to learn what you need to know to become a data scientist. Many articles offer 4-5 courses under each heading. What I have done is to search through the Internet covering all free courses and choose the single best course for each topic.
These courses have been carefully curated and offer the best possible option if you’re learning for free. However – there’s a caveat. An interesting twist to this entire story. Interested? Read on! And please – make sure you complete the full article.
Topics For A Data Scientist Course
The basic topics that a data scientist needs to know are:
Machine Learning Theory and Applications
Statistics & Probability
Calculus Basics (short)
Machine Learning in Python
Machine Learning in R
So let’s get to it. Here is the list of the best possible options to learn every one of these topics, carefully selected and curated.
Machine Learning – Stanford University – Andrew Ng (audit option)
The world-famous course for machine learning with the highest rating of all the MOOCs in Coursera, from Andrew Ng, a giant in the ML field and now famous worldwide as an online instructor. Uses MATLAB/Octave. From the website:
This course provides a broad introduction to machine learning, data mining, and statistical pattern recognition. Topics include:
(ii) Unsupervised learning (clustering, dimensionality reduction, recommender systems, deep learning)
(iii) Best practices in machine learning (bias/variance theory; innovation process in machine learning and AI)
The course will also draw from numerous case studies and applications, so that you’ll also learn how to apply learning algorithms to building smart robots (perception, control), text understanding (web search, anti-spam), computer vision, medical informatics, audio, database mining, and other areas.
This course is extremely effective and has many benefits. However, you will need high levels of self-discipline and self-motivation. Statistics show that 90% of those who sign up for a MOOC without a classroom or group environment never complete the course.
Learn Python The Hard Way – Zed Shaw – Free Online Access
You may ask me, why do I want to learn the hard way? Shouldn’t we learn the smart way and not the hard way? Don’t worry. This ebook, online course, and web site is a highly popular way to learn Python. Ok, so it says the hard way. Well, the only way to learn how to code is to practice what you have learned. This course integrates practice with learning. With other Python books, you have to take the initiative to practice yourself.
Here, this book shows you what to practice and how to practice. There is only one con here: although this is the best self-driven method, most people will not complete all of it. The main reason is that there is no external instructor for supervision and no group environment to motivate you. However, if you want to learn Python by yourself, then this is the best way. It is not the optimal one, though, as you will see at the end of this article, since the cost of the book is 30 USD (2100 INR approx).
Interactive R and Data Science Programming – SwiRl
Swirlstats is a wonderful tool to learn R and data science scripting in R interactively and intuitively by teaching you R commands from within the R console. This might seem like a very simple tool, but as you use it, you will notice its elegance in teaching you literally how to express yourselves in R and the finer nuances of the language and integration with the console and tidyverse. This is a powerful method of learning R and what is more, it is also a lot of fun!
KhanAcademy is a free non-profit organization on a mission – they want to provide a world-class education to you regardless of where you may be in the world. And they’re doing a fantastic job! This course has been covered in several very high profile blogs and Quora posts as the best online course for statistics – period. What is more, it is extremely high quality and suitable for beginners – and – free! This organization is doing wonderful work. More power to them!
Mathematics for Data Science
Now the basic mathematics for data science content includes linear algebra, single-variable and multivariable calculus (selected topics), discrete mathematics, and the basics of differential equations. You could take all of these topics separately in KhanAcademy, and that is a good option for Linear Algebra and Multivariate Calculus (in addition to Statistics and Probability).
For Linear Algebra, the KhanAcademy course that covers what you need to know is linked below:
These courses are completely free and very accessible to beginners.
This topic deserves a section to itself because discrete mathematics is the foundation of all computer science. There are a variety of options available to learn discrete mathematics, from ebooks to MOOCs, but today, we’ll focus on the best possible option. MIT (Massachusetts Institute of Technology) is known as one of the best colleges in the world and they have an Open information initiative known as MIT OpenCourseWare (MIT OCW). These are actual videos of the lectures taken by the students at one of the best engineering colleges in the world. You will benefit a lot if you follow the lectures at this link; they give all the basic concepts as clearly as possible. It’s a bit technical because this is open mostly for students at an advanced level. The link is given below:
It is also technical and from MIT but might be a little more accessible than the earlier option.
SQL (see-quel) or Structured Query Language is a must-learn if you are a data scientist. You will be working with a lot of databases, and SQL is the language used to access and generate data from database systems like Oracle and Microsoft SQL Server. The best free course I could find online is undoubtedly the one below:
We have covered Python, R, Machine Learning using MATLAB, Data Science with R (SwiRl teaches data science as well), Statistics, Probability, Linear Algebra, and Basic Calculus. Now we just need to get a course for Data Science with Python, and we are done! Now I looked at many options but was not satisfied. So instead of a course, I have provided you with a link to the scikit-learn documentation. Why?
Because that’s as good as an online course by itself. If you read through the main sections, get the code (Ctrl-C, Ctrl-V) and execute it in an Anaconda environment, and then play around with it, experiment, and observe and read up on what every line does, you will already know how to solve standard textbook problems. I recommend the following order:
This book is free to learn online. Get the data files, get the script files, use RStudio, and just as with Python, play, enjoy, experiment, execute, and explore. A little hard work will have you up and running with R in no time! But make sure you try as many code examples as possible. The libraries you can focus on are:
dplyr (data manipulation)
tidyr (data preprocessing “tidying”)
ggplot2 (graphical package)
purrr (functional toolkit)
readr (reading rectangular data files easily)
stringr (string manipulation)
To make it short, simple, and sweet, since we have already covered SQL and this content is for beginners, I recommend the following course:
This is a course on Udemy rated 4.2/5 and completely free. You will learn everything you need to work with Tableau (the most commonly used corporate-level visualization tool). This is an extremely important part of your skill set. You can make all the greatest analyses, but if you don’t visualize them and do it well, management will never buy into your machine learning solution, and neither will anyone who doesn’t know the technical details of ML (which is a large set of people on this planet). Visualization is important. Please make sure to learn the basics (at least!) of Tableau.
Kaggle Micro-Courses (Add-Ons – Short Concise Tutorials)
Kaggle is a wonderful site to practice your data science skills, and recently they have added a set of hands-on courses to learn data science practicals. And, if I do say so myself, it’s brilliant. Very nicely presented, superb examples, clear and concise explanations. And of course, you will cover more than we discussed earlier. Please, if you read through all the courses discussed so far in this article, and if you do just the courses at Kaggle.com, you will have spent your time wisely (though not optimally, as we shall see).
Now, if you are reading this article, you might have a fundamental question. This is a blog of a company that offers courses in data science, deep learning, and cloud computing. Why would we want to list all our competitors and publish it on our site? Isn’t that negative publicity?
Quite the opposite.
This is the caveat we were talking about.
Our course is a better solution than every single option given above!
We have nothing to hide.
And we have an absolutely brilliant top-class product.
Every option given above is a separate course by itself.
And they all suffer from a very prickly problem – you need to have excellent levels of discipline and self-motivation to complete just one of the courses above – let alone all ten.
You also have no classroom environment, no guidance for doubts and questions, and you need to know the basics about programming.
Our product is the most cost-effective option in the market for learning data science, as well as the most effective methodology for everyone – every course is conducted live in a classroom environment from the comfort of your home. You can work at a standard job, spend two hours on the internet every day, do extra work and reading on weekends, and become a professional data scientist in 6 months time.
We also have personalized GitHub project portfolio creation, management, and faculty guidance. Not to mention individual attention for each student.
And IITians for faculty who also happen to have 9+ years of industry experience.
So when we say that our product is the best on the market, we really mean it, because of the live session teaching of the classes, which no other option on the Internet today has.
Am I kidding? Absolutely not. And you can get started with Dimensionless Technologies Data Science with Python and R course for just 70-odd USD. Which is the most cost-effective option on the market!
And unlike all the 10 courses and resources detailed above, instead of doing 10 courses, you just need to do one single course, with the extracted meat of all that you need to know as a data scientist. And yes, we cover:
Statistics & Probability
Machine Learning in Python
Machine Learning in R
GitHub Personal Project Portfolio Creation
Live Remote Daily Sessions
Experts with Industrial Experience
A Classroom Environment (to keep you motivated)
Individual Attention to Every Student
I hope this information has you seriously interested. Please sign up for the course – you will not regret it.
And we even have a two-week trial for you to experience the course for yourself.
Choose wisely and optimally.
Unleash the data scientist within!
An excellent general article on emerging state-of-the-art technology, AI, and blockchain:
Data Science is on everyone’s lips in the current analytical eco-space. The study of Data Science, which encompasses various subjects like Machine Learning, Deep Learning, Artificial Intelligence, Natural Language Processing, and so on, has made tremendous advancement in the recent past.
Data Science is not something that emerged recently. Its ideas have been around for decades; classifying an email as Spam or Not Spam based on certain trends in the mail is one early, well-known application. However, the recent hype is a result of the massive amounts of data that are available, and the huge computational capacity that modern computers possess.
In terms of career, Data Science is considered one of the most lucrative jobs of the 21st century, with salaries second to none. Hence, out of curiosity to mine insights from data, and also for a better career, professionals from various disciplines such as Healthcare, Physics, Marketing, Human Resources, and IT want to master the state-of-the-art Data Science methodologies.
To be called a Full Stack Data Scientist, one needs to master a plethora of skills as mentioned below.
Statistics and Probability – The first, and arguably the most important, part of Data Science, as various statistical methods are used to draw inferences from the data.
Programming – One needs to master at least one programming language out of Python, R, and SAS.
Machine Learning – To make predictions from the data, one needs to be aware of the several programmed algorithms, and understand their usage for the right application.
Communication – Extracting insights from the data is useless unless they are communicated in layman terms to the business and the stakeholders who would make crucial decisions based on your analysis.
Apart from these four basic skills, a few other skills, like building data pipelines, are also important, but on most occasions an organization would have a separate team for that.
Why Programming is Needed for Data Science?
In layman terms, Data Science is a process of automating certain manual tasks to mitigate the resource, budget, and time constraints. Thus learning to code is an important component to automate those tasks.
To build a simple predictive model, the data set should be first loaded and cleaned. There are several libraries, and packages available for that. You need to choose the language to code, and use those libraries for such operations. After the data is cleaned, there are several programmed algorithms which need to be used to build the predictive model.
Now, each algorithm is typically implemented as a class which needs to be imported first; an object of that class is then created, and the methods or functions associated with that class are called on it. Thus this entire process rests on the concept of Object Oriented Programming. Even to understand the process behind the algorithms, one needs to be familiar with programming.
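In R the same idea usually appears through S3 classes rather than explicit class imports: a fitting function returns an object that carries a class, and generic functions such as coef() and predict() dispatch on that class. A minimal sketch, using the built-in cars data set as an assumed example:

```r
# lm() returns an object of class "lm"
fit <- lm(dist ~ speed, data = cars)
fit_class <- class(fit)

# Generic functions pick the method that matches the object's class
coefs <- coef(fit)                                        # intercept and slope
pred  <- predict(fit, newdata = data.frame(speed = 10))   # dispatches to predict.lm
```

The same pattern (fit a model object, then call generics on it) applies to glm() and to most modeling packages.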
Why R Programming is Used?
There is an ongoing debate about which is the best programming language for Data Science. It never hurts to master all three languages, but one needs to be an expert in one particular language and understand its various functionalities in different situations.
The choice of language depends on interest and on how comfortable the person is programming in that language. Python is generally considered the Holy Grail due to its simplicity, flexibility, and huge community, which makes it easier to find solutions to all sorts of problems faced during the building stage. However, R is not far behind either, as people from backgrounds other than IT seem to prefer R as their go-to language for Data Science.
R is an open-source programming language which is supported by the R Foundation and is used in statistical computing and graphics. Like Python, it is easy to install, and it is arguably better than SAS, which is nevertheless high-level, easy to learn, and designed additionally for data manipulation.
The graphical representations and the statistical computations of the data give R an edge over Python in this regard. Additionally, the programming environment of R has input and output facilities and supports user-defined recursive functions. R was first developed in the early ’90s, and since then its interface has been improved with constant efforts. R has made an outstanding journey from being written in a plain text editor to RStudio, and now to Jupyter Notebooks, which has intrigued Data Scientists across the world.
Below are some of the key reasons why R is important in Data Science.
Academic Preference – R is one of the most popular languages in universities, and it is the language that many researchers use for their experiments. In fact, in several Data Science books, all the statistical analysis is done in R. This academic preference creates more people with the proficiency in R. As more students study R in their undergraduate or graduate courses, it would help them perform statistical analysis in the industry.
Data Pre-processing – Often the dataset used for analysis requires cleaning to make it suitable for building a model, which is a time-consuming process. R comes to the rescue in such cases as it has several libraries and packages to perform data wrangling. Some of its packages are –
dplyr – One of the popular R packages used for data exploration and transformation.
data.table – Data aggregation is simplified with this package, and the computational time needed to manipulate the dataset is reduced.
readr – This package reads various forms of rectangular data up to ten times faster, partly because characters are not converted into factors.
Visualization – R allows the visualization of various structured or tabular data in graphical form. It has several tools which perform the task of analysis, visualization, and representation. ggplot2 is the most popular package in R for data visualization. ggedit is another package, which helps users get the aesthetics of a plot right.
Specificity – The goal of the R language is to make data analysis simpler, more approachable, and more accurate. As R is used for statistical analysis, it enables new statistical methods through its libraries. Moreover, the supportive community of R helps one find the required solution to a problem. The discussion forums of R are second to none when it comes to statistical analysis. More often than not, there is a quick response to any question posted in the community, which helps Data Scientists in their projects.
Machine Learning – Exploratory data analysis is the first step in an end-to-end Data Science project, where the data is wrangled and analyzed to extract insights through visualization. The next step is to build predictive models with the help of that cleaned data to solve various business problems. In Machine Learning, one needs to train the model first so that it can capture the underlying trends in the data, and then make predictions on unknown data. R has an extensive list of tools which simplify the process of developing a model to predict future events. A few of those packages are –
mice – It deals with missing values in the data.
party – This package is used for recursive partitioning, for example to build conditional inference trees.
caret – Classification and regression problems can be solved with the caret package.
randomForest – To create random forests, which are ensembles of decision trees.
Open Source – Being open source, R can be run on any platform such as Windows, Linux, Mac, etc. In fact, there is unlimited scope to play around with the R code without the hassle of cost, limits, licenses, and so on. Apart from a few libraries which are restricted to commercial access, the rest can be accessed for free.
All-in-one Package Toolkit – Apart from the standard tools which are used for various data analysis operations like transformation, aggregation, etc., R has several tools for statistical models like Regression, GLM, and ANOVA, which are included in a single object-oriented framework. Hence, instead of copying and pasting results, one can extract the required information directly from the model object.
Availability – As R is an open-source programming language with a huge community, it has a plethora of learning resources making it ideal for anyone starting out in Data Science. Additionally, the exploration of the R landscape makes it easier to recruit R developers. R is rapidly growing in popularity and it would scale up in the future. Various techniques such as time-series modeling, regression, classification, clustering, etc., could be practiced with R making it an ideal choice for predictive analytics.
There are several companies that have used R in their applications. For example, the monitoring of user experience at Twitter is done in R. At Microsoft, professionals use R on sales, marketing, and Azure data. To forecast elections and improve traditional reporting, the New York Times uses the R language. In fact, R is used by Facebook as well for analyzing its 500TB of data. Companies like Nordstrom ensure customer delight by using R to deliver data-driven products.
Data Science is the sexiest job of the 21st century, and it would remain so for years to come. The exponential increase in the generation of data would only allow more development in the Data Science field, and there could be a gap between supply and demand at a certain stage.
As several professionals are trying to enter this field, it is necessary that they first learn to programme, and R is an ideal language to start off their programming journey.
Dimensionless has several blogs and training to get started with R, and Data Science in general.
Follow this link, if you are looking to learn more about data science online!
Additionally, if you are interested in learning Data Science, take the Learn online Data Science Course to boost your career in Data Science.
Furthermore, if you want to read more about data science, you can read our blogs here
There are a huge number of ML algorithms out there. Trying to classify them leads to the distinction being made in types of the training procedure, applications, the latest advances, and some of the standard algorithms used by ML scientists in their daily work. There is a lot to cover, and we shall proceed as given in the following listing:
1. Statistical Algorithms
Statistics is necessary for every machine learning expert. Hypothesis testing and confidence intervals are some of the many statistical concepts to know if you are a data scientist. Here, we consider the phenomenon of overfitting. Basically, overfitting occurs when an ML model fits the details and noise of the training data set so closely that its capacity to generalize to the test set suffers. The tradeoff between training performance and overfitting is well shown by the following illustration:
Overfitting – from Wikipedia
Here, the black curve represents the performance of a classifier that has appropriately classified the dataset into two categories. Obviously, training the classifier was stopped at the right time in this instance. The green curve indicates what happens when we allow the training of the classifier to ‘overlearn the features’ in the training set. What happens is that we get an accuracy of 100%, but we lose out on performance on the test set because the test set will have a feature boundary that is usually similar but definitely not the same as the training set. This will result in a high error level when the classifier for the green curve is presented with new data. How can we prevent this?
Cross-validation is the standard technique used to detect and guard against overfitting. How does it work? A visual representation of the k-fold cross-validation process is given below:
The entire dataset is split into k equal subsets (folds). The model is trained k times, each time holding out a different fold for testing and training on the remaining folds, as shown in the image above. Finally, the k test scores are averaged. The advantage of this method is that the performance estimate no longer depends on one particular train/test split, which reduces sampling error and makes overfitting much easier to spot. There are further variations of cross-validation, such as exhaustive methods like leave-one-out and nested k-fold cross-validation (shown above). For more on cross-validation, visit the following link.
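The k-fold procedure can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the "model" is a hypothetical mean predictor chosen only to keep the example short, the folds are contiguous (real code usually shuffles first), and the function names are assumptions made for this sketch.

```python
# Minimal k-fold cross-validation sketch (standard library only).
# The "model" is a hypothetical mean predictor, used purely for illustration.

def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(y, k):
    """Return the mean squared error of a mean predictor on each held-out fold."""
    scores = []
    for test_idx in k_fold_indices(len(y), k):
        held_out = set(test_idx)
        train = [y[i] for i in range(len(y)) if i not in held_out]
        test = [y[i] for i in test_idx]
        prediction = sum(train) / len(train)  # "fit" on the training folds only
        mse = sum((value - prediction) ** 2 for value in test) / len(test)
        scores.append(mse)
    return scores


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
fold_scores = cross_validate(y, k=3)
cv_estimate = sum(fold_scores) / len(fold_scores)  # average over the k folds
print(fold_scores, cv_estimate)
```

Each data point is used for testing exactly once, and the final figure is the average of the per-fold test errors rather than an average of the models themselves.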
There are many more statistical algorithms that a data scientist has to know. Some examples include the chi-squared test, the Student’s t-test, how to calculate confidence intervals, how to interpret p-values, advanced probability theory, and many more. For more, please visit the excellent article given below:
Classification refers to the process of categorizing data input as a member of a target class. An example could be that we can classify customers into low-income, medium-income, and high-income depending upon their spending activity over a financial year. This knowledge can help us tailor the ads shown to them accurately when they come online and maximises the chance of a conversion or a sale. There are various types of classification like binary classification, multi-class classification, and various other variants. It is perhaps the most well known and most common of all data science algorithm categories. The algorithms that can be used for classification include:
Support Vector Machines
Linear Discriminant Analysis
and many more. A short illustration of a binary classification visualization is given below:
For more information on classification algorithms, refer to the following excellent links:
Regression is similar to classification, and many of the same algorithms are used (e.g. random forests). The difference is that while classification assigns a data point to a category, regression predicts a continuous real-number value. So classification works with classes while regression works with real numbers. And yes, many algorithms can be used for both classification and regression. Hence the presence of logistic regression in both lists. Some of the common algorithms used for regression are:
Support Vector Regression
Partial Least-Squares Regression
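To make the difference between classification and regression concrete, here is a small sketch in which one algorithm handles both tasks. It assumes k-nearest neighbours, which is not one of the algorithms named in the lists here, purely because it fits in a few lines; the spending and income figures are invented for illustration.

```python
from collections import Counter


def nearest_targets(train, query, k):
    """Return the targets of the k training points closest to `query` (1-D inputs)."""
    ranked = sorted(train, key=lambda pair: abs(pair[0] - query))
    return [target for _, target in ranked[:k]]


def knn_classify(train, query, k=3):
    """Classification: majority vote among the k nearest neighbours."""
    return Counter(nearest_targets(train, query, k)).most_common(1)[0][0]


def knn_regress(train, query, k=3):
    """Regression: average of the k nearest neighbours' numeric targets."""
    neighbours = nearest_targets(train, query, k)
    return sum(neighbours) / len(neighbours)


# Hypothetical data: yearly spend (in thousands) mapped to an income band or an income.
spend_to_band = [(5, "low"), (8, "low"), (20, "medium"),
                 (25, "medium"), (60, "high"), (70, "high")]
spend_to_income = [(5, 20.0), (8, 25.0), (20, 50.0),
                   (25, 55.0), (60, 120.0), (70, 130.0)]

predicted_band = knn_classify(spend_to_band, 22)       # a class label
predicted_income = knn_regress(spend_to_income, 22)    # a real number
print(predicted_band, predicted_income)
```

The neighbour search is identical in both functions; only the final step differs, a vote for classes versus an average for real numbers.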
For more on regression, I suggest that you visit the following link for an excellent article:
Both articles have a remarkably clear discussion of the statistical theory that you need to know to understand regression and apply it to non-linear problems. They also have source code in Python and R that you can use.
Clustering is an unsupervised learning algorithm category that divides the data set into groups depending upon common characteristics or common properties. A good example would be grouping the instances of a data set into categories automatically, without labels, using any of the several algorithms that we shall soon list. For this reason, clustering is sometimes known as automatic classification. It is also a critical part of exploratory data analysis (EDA). Some of the algorithms commonly used for clustering are:
Hierarchical Clustering – Agglomerative
Hierarchical Clustering – Divisive
K-Nearest Neighbours Clustering
EM (Expectation Maximization) Clustering
Principal Components Analysis Clustering (PCA)
An example of a common clustering problem visualization is given below:
The above visualization clearly contains three clusters.
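As a rough sketch of how agglomerative hierarchical clustering finds such groups, the following plain-Python example starts with one cluster per point and repeatedly merges the two closest clusters. The single-linkage distance, the sample points, and the function names are assumptions chosen for illustration.

```python
def single_linkage(cluster_a, cluster_b):
    """Cluster distance = smallest pairwise squared Euclidean distance."""
    return min((ax - bx) ** 2 + (ay - by) ** 2
               for ax, ay in cluster_a for bx, by in cluster_b)


def agglomerative(points, n_clusters):
    """Start with one cluster per point; merge the closest pair until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        # Pick the pair of clusters with the smallest linkage distance.
        i, j = min(pairs, key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters[j])
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical 2-D points forming three visually separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (1, 9), (2, 9), (1, 8)]
clusters = agglomerative(points, n_clusters=3)
for cluster in clusters:
    print(sorted(cluster))
```

No labels are supplied anywhere; the three groups emerge only from the distances between points.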
For another excellent article on clustering, refer to the link.
Dimensionality Reduction is an extremely important tool that should be completely clear to any serious data scientist. It is usually carried out through feature selection or feature extraction. This means that the variables, or combinations of variables, that carry most of the information in the data set are kept, and the features/variables that are not important are ignored. It is an essential part of EDA (Exploratory Data Analysis) and is used in nearly every moderately or highly difficult problem. The advantages of dimensionality reduction are (from Wikipedia):
It reduces the time and storage space required.
Removal of multi-collinearity improves the interpretation of the parameters of the machine learning model.
It becomes easier to visualize the data when reduced to very low dimensions such as 2D or 3D.
It avoids the curse of dimensionality.
The most commonly used algorithm for dimensionality reduction is Principal Components Analysis or PCA. While this is a linear model, it can be converted to a non-linear model through a kernel trick similar to that used in a Support Vector Machine, in which case the technique is known as Kernel PCA. Thus, the algorithms commonly used include PCA and Kernel PCA.
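A minimal sketch of linear PCA, reduced to its first principal component, is shown here in plain Python. It assumes power iteration on the covariance matrix as the way to find the leading direction (library implementations typically use an SVD instead), and the data and function names are invented for illustration.

```python
def first_principal_component(rows, iterations=100):
    """Return (means, unit direction of greatest variance) via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return means, v


def project(rows, means, direction):
    """Reduce each row to a single coordinate along `direction`."""
    return [sum((r[j] - means[j]) * direction[j] for j in range(len(direction)))
            for r in rows]


# Hypothetical 2-D data lying close to the line y = 2x.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0), (5.0, 10.1)]
means, direction = first_principal_component(data)
projected = project(data, means, direction)  # 2-D points reduced to 1-D
print(direction, projected)
```

Because the two input variables are almost perfectly correlated, a single coordinate along the recovered direction preserves nearly all of the variance in the data.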
Ensembling means combining multiple ML learners together into one pipeline so that the combination of all the weak learners makes an ML application with higher accuracy than each learner taken separately. Intuitively, this makes sense, since the disadvantages of using one model can be offset by combining it with another model that does not suffer from the same disadvantage. There are various algorithms used in ensembling machine learning models. The four common techniques usually employed in practice are:
Simple/Weighted Average/Voting: The simplest one; it takes the vote of the models in classification and their average in regression.
Bagging: We train models (same algorithm) in parallel on random sub-samples of the data set drawn with replacement, and then take an average/vote of the obtained results.
Boosting: Models are trained sequentially, where the (n)th model uses the output of the (n-1)th model and works on the limitations of the previous model; the process stops when the result stops improving.
Stacking: We combine two or more models using another machine learning algorithm.
(from Amardeep Chauhan on Medium.com)
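The first two techniques in the list can be sketched with the standard library alone. This is an illustrative toy, not a real ensemble: the individual "models" are stand-ins (hard-coded predictions, or the mean of a bootstrap sample), and all names and numbers are assumptions.

```python
import random
from collections import Counter


def majority_vote(predictions):
    """Voting for classification: the most common predicted label wins."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_average(predictions, weights):
    """Weighted averaging for regression."""
    return sum(p * w for p, w in zip(predictions, weights)) / sum(weights)


def bootstrap_sample(data, rng):
    """Bagging draws a sample of the same size, with replacement."""
    return [rng.choice(data) for _ in data]


def bagged_mean(data, n_models=50, seed=0):
    """Bagging sketch: each 'model' is just the mean of one bootstrap sample."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_models):
        sample = bootstrap_sample(data, rng)
        estimates.append(sum(sample) / len(sample))
    return sum(estimates) / len(estimates)  # average the individual models


vote = majority_vote(["spam", "ham", "spam"])
blend = weighted_average([10.0, 20.0, 30.0], weights=[1, 1, 2])
bagged = bagged_mean([2.0, 4.0, 6.0, 8.0])
print(vote, blend, bagged)
```

In practice each stand-in would be replaced by a trained model, but the combination step (vote, weighted average, or average over bootstrap-trained models) stays the same.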
In all four cases, the combination of the different models ends up having better performance than one single learner. One particular ensembling technique that has done extremely well in data science competitions on Kaggle is the GBRT model, or the Gradient Boosted Regression Tree model.
We include the source code from the scikit-learn module for Gradient Boosted Regression Trees since this is one of the most popular ML models which can be used in competitions like Kaggle, HackerRank, and TopCoder.
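A minimal sketch of fitting a Gradient Boosted Regression Tree model with scikit-learn might look like the following. The synthetic dataset (`make_friedman1`), the train/test split, and the hyperparameter values are assumptions chosen for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem (an assumption; substitute your own data).
X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Shallow trees are added one after another, each correcting its predecessors.
model = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0
)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
test_mse = mean_squared_error(y_test, predictions)
test_r2 = model.score(X_test, y_test)  # coefficient of determination (R^2)
print(f"test MSE: {test_mse:.3f}, test R^2: {test_r2:.3f}")
```

The number of trees, the learning rate, and the tree depth interact strongly, so they are usually tuned together with cross-validation.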
In the last decade, there has been a renaissance of sorts within the Machine Learning community worldwide. Since 2002, neural networks research had struck a dead end, as the networks of layers would get stuck in local minima in the non-linear hyperspace of the energy landscape of a three-layer network. Many thought that neural networks had outlived their usefulness. However, starting with Geoffrey Hinton in 2006, researchers found that adding multiple layers of neurons to a neural network created an energy landscape of such high dimensionality that local minima were statistically shown to be extremely unlikely to occur in practice. Today, in 2019, more than a decade of innovation later, this method of adding additional hidden layers of neurons to a neural network is the standard practice of the field known as deep learning.
Deep Learning has truly taken the computing world by storm and has been applied to nearly every field of computation, with great success. Now with advances in Computer Vision, Image Processing, Reinforcement Learning, and Evolutionary Computation, we have marvellous feats of technology like self-driving cars and self-learning expert systems that perform enormously complex tasks like playing the game of Go (not to be confused with the Go programming language). The main reason these feats are possible is the success of deep learning and reinforcement learning (more on the latter given in the next section below). Some of the important algorithms and applications that data scientists have to be aware of in deep learning are:
Long Short term Memories (LSTMs) for Natural Language Processing
Recurrent Neural Networks (RNNs) for Speech Recognition
Convolutional Neural Networks (CNNs) for Image Processing
Deep Neural Networks (DNNs) for Image Recognition and Classification
Hybrid Architectures for Recommender Systems
Autoencoders (ANNs) for Bioinformatics, Wearables, and Healthcare
Deep Learning Networks typically have millions of neurons and hundreds of millions of connections between neurons. Training such networks is so computationally intensive that companies are now turning to 1) cloud computing systems and 2) Graphical Processing Unit (GPU) parallel high-performance processing systems for their computational needs. It is now common to find hundreds of GPUs operating in parallel to train very high dimensional neural networks for applications like machine ‘dreaming’, computer artistry, and artistic creativity pleasing to our aesthetic senses.
Artistic Image Created By A Deep Learning Network. From blog.kadenze.com.
For more on Deep Learning, please visit the following links:
In the recent past and the last three years in particular, reinforcement learning has become remarkably famous for a number of achievements in cognition that were earlier thought to be limited to humans. Basically put, reinforcement learning deals with the ability of a computer to teach itself. We have the idea of a reward vs. penalty approach. The computer is given a scenario and ‘rewarded’ with points for correct behaviour and ‘penalties’ are imposed for wrong behaviour. The computer is provided with a problem formulated as a Markov Decision Process, or MDP. Some basic types of Reinforcement Learning algorithms to be aware of are (some extracts from Wikipedia):
Q-Learning is a model-free reinforcement learning algorithm. The goal of Q-learning is to learn a policy, which tells an agent what action to take under what circumstances. It does not require a model (hence the connotation “model-free”) of the environment, and it can handle problems with stochastic transitions and rewards, without requiring adaptations. For any finite Markov decision process (FMDP), Q-learning finds a policy that is optimal in the sense that it maximizes the expected value of the total reward over any and all successive steps, starting from the current state. Q-learning can identify an optimal action-selection policy for any given FMDP, given infinite exploration time and a partly-random policy. “Q” names the function that returns the reward used to provide the reinforcement and can be said to stand for the “quality” of an action taken in a given state.
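A tabular Q-learning loop can be sketched as follows. The environment is a hypothetical five-state corridor invented for this example, and the learning rate, discount factor, and exploration rate are assumed values; only the update rule itself is the standard Q-learning rule.

```python
import random

N_STATES = 5        # states 0..4; state 4 is the goal (terminal)
ACTIONS = (0, 1)    # 0 = move left, 1 = move right


def step(state, action):
    """Deterministic corridor: reward 1.0 only on reaching the goal state."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Partly random (epsilon-greedy) policy with random tie-breaking.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[state])
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # The target uses the best action in the next state, not the one taken.
            best_next = 0.0 if done else max(q[next_state])
            q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
            state = next_state
    return q


q_table = q_learning()
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

No model of the corridor is given to the learner; it only observes states and rewards, which is what "model-free" means here. The learned values for moving right approach 1, 0.9, 0.81 and 0.729 going backwards from the goal, reflecting the discount factor.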
State–action–reward–state–action (SARSA) is an algorithm for learning a Markov decision process policy. This name simply reflects the fact that the main function for updating the Q-value depends on the current state of the agent “S1”, the action the agent chooses “A1”, the reward “R” the agent gets for choosing this action, the state “S2” that the agent enters after taking that action, and finally the next action “A2” the agent chooses in its new state. The acronym for the quintuple (s_t, a_t, r_t, s_{t+1}, a_{t+1}) is SARSA.
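The SARSA update can be written as a single function of that quintuple. The sketch below is illustrative: the two-state Q-table, the state and action names, and the values of alpha and gamma are all assumptions.

```python
def sarsa_update(q, s1, a1, r, s2, a2, alpha=0.5, gamma=0.9):
    """Apply one SARSA update in place using the quintuple (s1, a1, r, s2, a2)."""
    td_target = r + gamma * q[s2][a2]  # uses the action actually chosen next
    q[s1][a1] += alpha * (td_target - q[s1][a1])
    return q[s1][a1]


# Hypothetical Q-table with two states and two actions.
q = {
    "A": {"left": 0.0, "right": 0.0},
    "B": {"left": 0.2, "right": 1.0},
}

# The agent was in A, moved right, got reward 0, landed in B, and chose "left" next.
new_value = sarsa_update(q, "A", "right", 0.0, "B", "left")
print(new_value)
```

Because the update uses the next action the agent really takes ("left", worth 0.2) rather than the best available one ("right", worth 1.0), SARSA is an on-policy method; Q-learning would have used the maximum instead.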
3. Deep Reinforcement Learning
This approach extends reinforcement learning by using a deep neural network and without explicitly designing the state space. The work on learning Atari games by Google DeepMind increased attention to deep reinforcement learning or end-to-end reinforcement learning. Remarkably, agents built by DeepMind have achieved levels of skill higher than humans at playing computer games. Even a complex game like Dota 2 has been won by a deep reinforcement learning system: OpenAI Five, developed by OpenAI, beat the reigning world champion team OG 2-0 in a best-of-three series in April 2019.
For more information, go through the following links:
If reinforcement learning was cutting edge data science, AutoML is bleeding edge data science. AutoML (Automated Machine Learning) uses an algorithm and a data analysis approach to construct an end-to-end data science project that does data preprocessing, algorithm selection, hyperparameter tuning, cross-validation and algorithm optimization, so that the ML process is automated and placed in the hands of a computer. Open-source implementations are available on GitHub at the following link. What this means is that computers can now handle much of the ML expertise that was earlier in the hands of a limited number of ML practitioners and AI experts.
AutoML has found its way into Google TensorFlow through AutoKeras, and into Microsoft CNTK, Google Cloud Platform, Microsoft Azure, and Amazon Web Services (AWS). Currently it is a premium paid service for even a moderately sized dataset and is free only for tiny datasets. Moreover, one entire process might take one to two or more days to execute completely. But at least the computer AI industry has now come full circle. We now have computers so capable that they are taking parts of the machine learning process out of the hands of humans and creating models that can rival, and sometimes beat, the ones created by human beings!
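The core loop behind automated model selection (try each candidate configuration, score it with cross-validation, keep the best) can be illustrated with a toy example. This is only a sketch of the idea and not how any particular AutoML product works: the model is a hypothetical k-nearest-neighbour regressor, the search space is three values of k, and the data are synthetic.

```python
def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x (1-D inputs)."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(target for _, target in nearest) / k


def loo_cv_error(data, k):
    """Leave-one-out cross-validation: mean squared error over held-out points."""
    errors = []
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]  # every point except the held-out one
        errors.append((knn_predict(train, x, k) - y) ** 2)
    return sum(errors) / len(errors)


def auto_select(data, candidate_ks):
    """Score every candidate configuration and return the best one."""
    scores = {k: loo_cv_error(data, k) for k in candidate_ks}
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Synthetic data following y = 2x.
data = [(x, 2.0 * x) for x in range(10)]
best_k, scores = auto_select(data, candidate_ks=[1, 2, 5])
print(best_k, scores)
```

Real AutoML systems search far larger spaces (preprocessing steps, model families, network architectures) with smarter strategies than exhaustive enumeration, which is one reason a full run can take days.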
The basic algorithm used by AutoML is Neural Architecture Search and its variants, given below:
Neural Architecture Search (NAS)
PNAS (Progressive NAS)
ENAS (Efficient NAS)
The functioning of AutoML is given by the following diagram:
If you’ve stayed with me till now, congratulations; you have been introduced to a lot of information and cutting edge technology, all of which you should read up on in much more depth. You could start with the links in this article, and of course, Google is your best friend as a Machine Learning Practitioner. Enjoy machine learning!
So you want to learn data science but you don’t know where to start? Or you are a beginner and you want to learn the basic concepts? Welcome to your new career and your new life! You will discover a lot of things on your journey to becoming a data scientist and being part of a new revolution. I am a firm believer that you can learn data science and become a data scientist regardless of your age, your background, your current knowledge level, your gender, and your current position in life. I believe – from experience – that anyone can learn anything at any stage in their lives. What is required is just determination, persistence, and a tireless commitment to hard work. Nothing else matters as far as learning new things – or learning data science – is concerned. Your commitment, persistence, and your investment in your available daily time is enough.
I hope you understood my statement. Anyone can learn data science with the right motivation. In fact, I believe anyone can learn anything at any stage in their lives if they invest enough time, effort and hard work into it alongside their current occupation. From my experience, I strongly recommend that you continue your day job and work on data science as a side hustle, because of the hard work that will be involved. Your commitment is more important than your current life situation. Carrying on a full-time job and working on data science part-time is the best way to go if you want to learn in the best possible manner.
Technical Concepts of Data Science
So what are the important concepts of data science that you should know as a beginner? They are, in order of sequential learning, the following:
Statistics & Probability
Data Preparation and Data ETL*
Machine Learning with Python and R
Data Visualization and Summary
*Extraction, Transformation, and Loading
Now if you were to look at the above list and go to a library, you would, most likely, come back with 9-10 books at an average of 1000 pages each. Even if you could speed-read, 10,000 pages is a lot to get through. I could list the best books for each topic in this post, but even the most seasoned reader would balk at 10,000 pages. And who reads books these days? So what I am going to give you is a distilled extract on each of those topics. Keep in mind, however, that every topic given above could be a series of blog posts in its own right; these short paragraphs are just a tiny taste of each topic, and there is an ocean of depth involved in every one. You might ask: if that is the case, how can everybody be a possible candidate for a data scientist role? Two words: Persistence and Motivation. With the right amount of these two characteristics, anyone can be anything they want to be.
1) Python Programming:
Python is one of the most popular programming languages in the world. It is the ABC of data science because Python is the language every beginner starts with in data science. It is used for almost every purpose since it is so versatile. Python can be used for web applications and websites with Django, microservices with Flask, general programming projects with the standard library and packages from PyPI, and GUIs with PyQt5 or Tkinter. Interoperability with Java (Jython), C (Cython) and nearly every other programming language is available today.
Of course, Python is also the first language used for data science, with the standard stack of scikit-learn (machine learning), pandas (data manipulation), matplotlib and seaborn (visualization) and numpy (vectorized computation). Nowadays, the most common distribution used is Anaconda, available from www.anaconda.com. The current version is 2018.12, or Anaconda Distribution 5. To learn more about Python, I strongly recommend the following books: Head First Python and the Python Cookbook.
2) R Programming
R is the best language for statistical needs since it is a language designed by statisticians, for statisticians. If you know statistics and mathematics well, you will enjoy programming in R. The language gives you the best support available for every probability distribution, statistics functions, mathematical functions, plotting, visualization, interoperability, and even machine learning and AI. In fact, nearly everything that you can do in Python can be done in R. R is the second most popular language for data science in the world, second only to Python. R has a rich ecosystem for every data science requirement and is the favorite language of academicians and researchers in the academic domain.
Learning Python is not enough to be a professional data scientist. You need to know R as well. A good book to start with is R For Data Science, available on Amazon at a very reasonable price. Some of the most popular packages in R that you need to know are ggplot2, threejs, DT (tables), networkD3, and leaflet for visualization; dplyr and tidyr for data manipulation; shiny and R Markdown for reporting; parallel, Rcpp and data.table for high performance computing; and caret, glmnet, and randomForest for machine learning.
3) Statistics and Probability
This is the bread and butter of every data scientist. The best programming skills in the world will be useless without knowledge of statistics. You need to master statistics, especially practical knowledge as used in a scientific experimental analysis. There is a lot to cover. Any subtopic given below can be a blog-post in its own right. Some of the more important areas that a data scientist needs to master are:
4) Linear Algebra
Succinctly, linear algebra is about vectors, matrices and the operations that can be performed on vectors and matrices. This is a fundamental area for data science since every operation we do as a data scientist has a linear algebra background, or, as data scientists, we usually work with collections of vectors or matrices. So we have the following topics in Linear Algebra, all of which are covered in the following world-famous book, Linear Algebra and its Applications by Gilbert Strang, an MIT professor. You can also go to the popular MIT OpenCourseWare page, Linear Algebra (MIT OCW). These two resources cover everything you need to know. Some of the most fundamental concepts that you can also Google or bring up on Wikipedia are:
5) Data Preparation and Data ETL (Extraction, Transformation, and Loading)
By IAmMrRob on Pixabay
Yes – welcome to one of the more infamous sides of data science! If data science has a dark side, this is it. Know for sure that unless your company has some dedicated data engineers who do all the data munging and data wrangling for you, 90% of your time on the job will be spent on working with raw data. Real world data has major problems. Usually, it’s unstructured, in the wrong formats, poorly organized, contains many missing values, contains many invalid values, and contains types that are not suitable for data mining.
Dealing with this problem takes up a lot of the time of a data scientist. And a data scientist’s analysis has the potential to go massively wrong when there is invalid and missing data. Practically speaking, unless you are unusually blessed, you will have to manage your own data, and that means conducting your own ETL (Extraction, Transformation, and Loading). ETL is a data mining and data warehousing term that means extracting data from an external data store or data mart, transforming it into a form suitable for data mining, and loading it in a state suitable for data analysis (which usually involves a lot of data preprocessing). Finally, you often have to load data that is too big for your working memory, a problem usually handled with out-of-core (external memory) processing. During your data wrangling phase, be sure to look into the following components:
Automating the Data ETL Pipeline
Automation of Data Validation and Verification
Usually, expert data scientists try to automate this process as much as possible, since a human being would be wearied by this task very fast and is remarkably prone to errors, which will not happen in the case of a Python or an R script doing the same operations. Be sure to try to automate every stage in your data processing pipeline.
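As a small sketch of what an automated ETL step with validation can look like, the following script extracts rows from CSV text, rejects rows with missing or invalid values, and loads the clean rows into a target store. The column names, the sample data, and the in-memory list standing in for a data warehouse are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical raw export with missing and invalid values.
RAW = """id,age,income
1,34,52000
2,,61000
3,abc,48000
4,29,
5,41,75000
"""


def extract(text):
    """Extraction: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: keep rows whose fields parse as numbers; set the rest aside."""
    clean, rejected = [], []
    for row in rows:
        try:
            clean.append({
                "id": int(row["id"]),
                "age": int(row["age"]),
                "income": float(row["income"]),
            })
        except (TypeError, ValueError):
            rejected.append(row)  # kept for later review instead of silently dropped
    return clean, rejected


def load(rows, store):
    """Loading: append validated rows to the target store; return the row count."""
    store.extend(rows)
    return len(rows)


warehouse = []  # stand-in for a real database table
clean, rejected = transform(extract(RAW))
loaded = load(clean, warehouse)
print(f"loaded {loaded} rows, rejected {len(rejected)} rows")
```

Keeping the rejected rows, rather than discarding them, makes it possible to audit the pipeline and decide later whether to impute, correct, or drop them.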
6) Machine Learning with Python and R
An expert machine learning scientist has to be proficient in the following areas at the very least:
Data Science Topics Listing – Thomas
Now if you are just starting out in Machine Learning (ML), Python, and R, you will gain a sense of how huge the field is, and the entire set of lists above might seem more like advanced Greek than plain English. But not to worry; there are ways to streamline your learning and to spend as little time as possible in becoming able to learn nearly every single topic given above. After you learn the basics of Python and R, you need to go on to start building machine learning models. From experience, I suggest you divide your time 50% on Python and 50% on R, and spend as long as possible in each stretch without switching or working between languages. What do I mean? Spend maximum time learning one programming language at a time. That will prevent syntax errors, conceptual errors and language confusion problems.
Now, on the job, in real life, it is much more likely that you will work in a team and be responsible for only one part of the work. However, if you’re working in a startup or learning initially, you will end up doing every phase of the work yourself. Be sure to give yourself time to process information and to give your brain sufficient time to rest and get a handle on the topics you are trying to learn. For more info, do check out the Learning How to Learn MOOC on Coursera, which is the best way to learn mathematical or scientific topics without ending up with burnout. In fact, I would recommend this approach to every programmer out there trying to learn a programming language, or anything considered difficult, like Quantum Mechanics and Quantum Computation or String Theory, or even Microsoft F# or Microsoft C# for a non-Java programmer.
7) Data Visualization and Summary
Common tools that you have with which you can produce powerful visualizations include:
Google Data Studio
Microsoft Power BI Desktop
Some involve coding, some are drag-and-drop, some are difficult for beginners, some have no coding at all. All of these tools will help you with data visualization. But one of the most overlooked but critical practical functions of a data scientist has been included under this heading: summarisation.
Summarisation means the practical result of your data science workflow. What does the result of your analysis mean for the operation of the business or the research problem that you are currently working on? How do you convert your result to the maximum improvement for your business? Can you measure the impact this result will have on the profit of your enterprise? If so, how? Being able to come out of a data science workflow with this result is one of the most important capacities of a data scientist. And most of the time, efficient summarisation = excellent knowledge of statistics. Please know for sure that statistics is the start and the end of every data science workflow. And you cannot afford to be ignorant about it. Refer to the section on statistics or google the term for extra sources of information.
How Can I Learn Everything Above In the Shortest Possible Time?
You might wonder – How can I learn everything given above? Is there a course or a pathway to learn every single concept described in this article in one shot? It turns out – there is. There is a dream course for a data scientist that contains nearly everything talked about in this article.
Want to Become a Data Scientist? Welcome to Dimensionless Technologies! It just so happens that the course Data Science using Python and R, a ten-week course that includes ML, Python and R programming, Statistics, GitHub Account Project Guidance, and Job Placement, offers nearly every component spoken about above, and more besides. You don’t need to buy the books or do any courses other than this one to learn the topics in this article. Everything is covered by this single course, tailor-made to convert you into a data scientist within the shortest possible time. For more, I’d like to refer you to the following link:
Does this seem too good to be true? Perhaps, because this is a paid course. With a scholarship concession, you could end up paying around INR 40,000 for this ten-week course, two weeks of which you can register for 5,000 and pay the remainder after two weeks trial period to see if this course really suits you. If it doesn’t, you can always drop out after two weeks and be poorer by just 5k. But in most cases, this course has been found to carry genuine worth. And nothing worthwhile was achieved without some payment, right?
In case you want to learn more about data science, please check out the following articles:
Python and R have been around for well over 20 years. Python was developed in 1991 by Guido van Rossum, and R in 1995 by Ross Ihaka and Robert Gentleman. Both Python and R have seen steady growth year after year in the last two decades. Will that trend continue, or are we coming to an end of an era of the Python-R dominance in the data science segment? Let’s find out!
Python in the last decade has grown from strength to strength. In 2013, Python overtook R as the most popular language used for data science, according to the Stack Overflow developer survey (Link).
Will Python’s Dominance Continue?
We believe, yes, definitely. Two words – data science.
Data science is such a hot and happening field right now, and the data scientist job is hyped as the ‘sexiest job of the 21st century‘, according to Forbes. Python is by far the most popular language for data science. The only close competitor is R, which Python overtook in the KDnuggets data science survey of 2016. As shown in the link, in 2018, Python held 65.6% of the data science market, and R was actually below RapidMiner, at 48.5%. From the graphs, it is easy to see that Python is eating away at R’s share in the market. But why?
In 2018, we saw a huge push towards advancement across all verticals in the industry due to deep learning. And what are the most famous tools for deep learning? TensorFlow and Keras, both Python-based frameworks! While we have Keras and TensorFlow interfaces in R and RStudio now, Python was the initial choice and is still the native library, with kerasR and tensorflow in RStudio being interfaces to the Python packages. Also, a real-life implementation of a deep learning project contains more than the deep learning model preparation and data analysis.
There is the data preprocessing, data cleaning, data wrangling, data preparation, outlier detection and missing data values management section, which is infamous for taking up 99% of the time of a data scientist, with actual deep learning model work taking just 1% or less of their on-duty time! And what language is commonly used for this? For general purpose programming, Python is the go-to language in most cases. I’m not saying that R doesn’t have data preprocessing packages. I’m saying that standard data science operations like web scraping are easier in Python than in R. And hence Python will be the language used in most cases, except in statistics and the university or academic fields.
Our prediction for Python – growth – even to 70% of the data science market as more and more research-level projects like AutoML keep using Python as a first language of choice.
What About R?
In 2016, the use of R for data science in the industry was 55%, and Python stood at 51%. Python increased by 33% and R decreased by 25% in 2 years. Will that trend continue and will R continue on its downward spiral? I believe perhaps in figures, but not in practice. Here’s why.
Data science is, at its heart, the field of the statistician. Unless you have a strong background in statistics, you will be unable to process the results of your experiments, especially in concepts like p-values, tests of significance, confidence intervals, and analysis of experiments. And R is the statistician’s language. Statistics and mathematics students will always find working in R remarkably easy and simple, which explains its popularity in academia. R programming lends itself to statistics. Python lends itself to model building and decent execution performance (R can be 4x slower). R, however, excels in statistical analysis. So what is the point that I am trying to express?
Simple – Python and R are complementary. They are best used together. You will find that knowledge of both Python and R will suit you best for most projects. You need to learn both. You can find this trend expressed in every article that speaks about becoming a data science unicorn – knowledge of both Python and R is required as a norm.
Yes, R is having a downturn in popularity. However, due to the complementary nature of the tools, I believe that R will have a part to play in the data scientist’s toolbox, even if it does dip a bit in growth in the years to come. Very simply, R is too convenient for a statistician to be neglected by the industry completely. It will continue to have its place in the toolbox. And yes, deep learning is now practical in R, with support for Keras and AutoML available right now.
Dimensionless Technologies is the market leader as far as training in AI, cloud, deep learning and data science in Python and R is concerned. Of course, you don’t have to spend 40k for a data science certification; you could always go for its industry equivalent, 100-120 lakhs for a US university’s Ph.D. research doctorate! What Dimensionless Technologies has as an advantage over its closest rival (Coursera’s Johns Hopkins University Data Science Specialization) is:
Live Video Training
The videos that you get on Coursera, edX, Dataquest, MIT OCW (MIT OpenCourseWare), Udacity, Udemy, and many other MOOCs have a fundamental flaw – they are NOT live! If you have a doubt in a video lecture, you only have the comments as a communication tool to reach the lecturers. And when over 1,000 students are taking a class, it is next to impossible for the lecturer to respond to every comment. You will not and cannot get personalized attention for your doubts and clarifications. This makes it difficult for many, especially Indian students who may not be used to foreign accents, to have a smooth learning curve in the popular MOOCs available today.
Try Before You Buy Fully
Dimensionless Technologies offers 20 hours of the course for Rs 5000, with the remaining 35k (10k of 45k waived if you qualify for the scholarship) payable after 2 weeks / 20 hours of taking the course on a trial basis. You get to evaluate the course for 20 hours before deciding whether you want to go through the entire syllabus with the highly experienced instructors who are strictly IIT alumni.
Instructors with 10+ Years of Industry Experience
On Coursera or edX, it is more common for Ph.D. professors than for industry-experienced professionals to teach the course. If you are comfortable with American accents and next to zero instructor support, you will be able to learn a little about the scholastic side of your field. However, if you want to prepare for a 100K USD per year US data scientist job, you would be better off learning from professionals with industry experience. I am not criticizing the Coursera instructors here; most have industry experience in the USA as well. However, if you want connections and contacts in the data science industry in India and the US, you might be a bit lost among the vast numbers of students who take those courses. Instructors with industry experience are rare in a MOOC, and they are critically important to your landing a job.
Personalized Attention and Job Placement Guarantee
Dimensionless strictly limits each batch to no more than 25 students. This means that, unlike MOOCs with hundreds or thousands of students, every student in a class gets individual attention and training. This is the essence of what makes this company the market leader in this space. No other course provider has this restriction, which is intended to ensure that when you pay the money, you actually complete your course, unlike with the other MOOCs out there. You are also trained in creating a data science portfolio and in preparing for data science interviews when you start applying to companies. The best part of this entire process is the 100% job placement guarantee.
If this has got your attention, and you are highly interested in data science, I encourage you to go to the following link to see more about the Data Science Using Python and R course, a strong foundation for a data science career: