In theory, it is possible to become a data scientist without paying a dime. This article lists the best free options for learning what you need to know to become one. Many articles offer 4-5 courses under each heading. Instead, I searched the Internet for free courses and chose the single best one for each topic.
These courses have been carefully curated and offer the best possible option if you’re learning for free. However, there’s a caveat: an interesting twist to this entire story. Interested? Read on! And please make sure you read the full article.
Topics For A Data Scientist Course
The basic topics that a data scientist needs to know are:
Machine Learning Theory and Applications
Statistics & Probability
Calculus Basics (short)
Machine Learning in Python
Machine Learning in R
So let’s get to it. Here is the list of the best possible options to learn every one of these topics, carefully selected and curated.
Machine Learning – Stanford University – Andrew Ng (audit option)
The world-famous course for machine learning with the highest rating of all the MOOCs in Coursera, from Andrew Ng, a giant in the ML field and now famous worldwide as an online instructor. Uses MATLAB/Octave. From the website:
This course provides a broad introduction to machine learning, data mining, and statistical pattern recognition. Topics include:
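(i) Supervised learning (parametric/non-parametric algorithms, support vector machines, kernels, neural networks)
(ii) Unsupervised learning (clustering, dimensionality reduction, recommender systems, deep learning)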
(iii) Best practices in machine learning (bias/variance theory; innovation process in machine learning and AI)
The course will also draw from numerous case studies and applications, so that you’ll also learn how to apply learning algorithms to building smart robots (perception, control), text understanding (web search, anti-spam), computer vision, medical informatics, audio, database mining, and other areas.
This course is extremely effective and has many benefits. However, you will need high levels of self-discipline and self-motivation. Statistics show that 90% of those who sign up for a MOOC without a classroom or group environment never complete the course.
Learn Python The Hard Way – Zed Shaw – Free Online Access
You may ask: why would I want to learn the hard way? Shouldn’t we learn the smart way instead? Don’t worry. This ebook, online course, and web site is a highly popular way to learn Python. Yes, it says the hard way, but the only way to learn how to code is to practice what you have learned, and this course integrates practice with learning. With other Python books, you have to take the initiative to practice yourself.
Here, the book shows you what to practice and how to practice it. There is only one con: although this is the best self-driven method, most people will not complete all of it, mainly because there is no instructor to supervise you and no group environment to motivate you. Still, if you want to learn Python by yourself, this is the best way. It is not the optimal one, though, as you will see at the end of this article, since the book costs USD 30 (approx. INR 2100).
Interactive R and Data Science Programming – SwiRl
Swirlstats is a wonderful tool to learn R and data science scripting in R interactively and intuitively: it teaches you R commands from within the R console. This might seem like a very simple tool, but as you use it, you will notice how elegantly it teaches you to express yourself in R, along with the finer nuances of the language and its integration with the console and the tidyverse. This is a powerful method of learning R and, what is more, it is also a lot of fun!
KhanAcademy is a free non-profit organization on a mission – they want to provide a world-class education to you regardless of where you may be in the world. And they’re doing a fantastic job! This course has been covered in several very high profile blogs and Quora posts as the best online course for statistics – period. What is more, it is extremely high quality and suitable for beginners – and – free! This organization is doing wonderful work. More power to them!
Mathematics for Data Science
The basic mathematics for data science includes linear algebra, single-variable and multivariable calculus (selected topics), discrete mathematics, and the basics of differential equations. You could take all of these topics separately in KhanAcademy, and that is a good option for Linear Algebra and Multivariable Calculus (in addition to Statistics and Probability).
For Linear Algebra, the KhanAcademy course covering what you need to know is linked below:
These courses are completely free and very accessible to beginners.
This topic deserves a section to itself because discrete mathematics is the foundation of all computer science. There are a variety of options available to learn discrete mathematics, from ebooks to MOOCs, but today we’ll focus on the best possible option. MIT (Massachusetts Institute of Technology) is known as one of the best colleges in the world, and it has an open courseware initiative known as MIT OpenCourseWare (MIT OCW). These are actual videos of the lectures attended by students at one of the best engineering colleges in the world. You will benefit a lot if you follow the lectures at this link, as they present all the basic concepts as clearly as possible. The material is a bit technical because it is aimed mostly at students at an advanced level. The link is given below:
It is also technical and from MIT but might be a little more accessible than the earlier option.
SQL (see-quel) or Structured Query Language is a must-learn if you are a data scientist. You will be working with a lot of databases, and SQL is the language used to access and generate data from database systems like Oracle and Microsoft SQL Server. The best free course I could find online is undoubtedly the one below:
We have covered Python, R, Machine Learning using MATLAB, Data Science with R (SwiRl teaches data science as well), Statistics, Probability, Linear Algebra, and Basic Calculus. Now we just need to get a course for Data Science with Python, and we are done! Now I looked at many options but was not satisfied. So instead of a course, I have provided you with a link to the scikit-learn documentation. Why?
Because that’s as good as an online course by itself. If you read through the main sections, copy the code (Ctrl-C, Ctrl-V), execute it in an Anaconda environment, and then play around with it, experiment, and read up on what every line does, you will already know how to solve standard textbook problems. I recommend the following order:
This book is free to learn online. Get the data files, get the script files, use RStudio, and just as with Python, play, enjoy, experiment, execute, and explore. A little hard work will have you up and running with R in no time! But make sure you try as many code examples as possible. The libraries you can focus on are:
dplyr (data manipulation)
tidyr (data preprocessing “tidying”)
ggplot2 (graphical package)
purrr (functional toolkit)
readr (reading rectangular data files easily)
stringr (string manipulation)
To make it short, simple, and sweet, since we have already covered SQL and this content is for beginners, I recommend the following course:
This is a course on Udemy rated 4.2/5 and completely free. You will learn everything you need to work with Tableau (the most commonly used corporate-level visualization tool). This is an extremely important part of your skill set. You can make all the greatest analyses, but if you don’t visualize them and do it well, management will never buy into your machine learning solution, and neither will anyone who doesn’t know the technical details of ML (which is a large set of people on this planet). Visualization is important. Please make sure to learn the basics (at least!) of Tableau.
Kaggle Micro-Courses (Add-Ons – Short Concise Tutorials)
Kaggle is a wonderful site to practice your data science skills, and recently they have added a set of hands-on courses for learning practical data science. And, if I do say so myself, it’s brilliant: very nicely presented, with superb examples and clear, concise explanations. And of course, you will cover more than we discussed earlier. If you work through all the courses discussed so far in this article, and even if you do just the courses at Kaggle.com, you will have spent your time wisely (though not optimally, as we shall see).
Now, if you are reading this article, you might have a fundamental question. This is a blog of a company that offers courses in data science, deep learning, and cloud computing. Why would we want to list all our competitors and publish it on our site? Isn’t that negative publicity?
Quite the opposite.
This is the caveat we were talking about.
Our course is a better solution than every single option given above!
We have nothing to hide.
And we have an absolutely brilliant top-class product.
Every option given above is a separate course by itself.
And they all suffer from a very prickly problem – you need to have excellent levels of discipline and self-motivation to complete just one of the courses above – let alone all ten.
You also have no classroom environment, no guidance for doubts and questions, and you need to know the basics about programming.
Our product is the most cost-effective option in the market for learning data science, as well as the most effective methodology for everyone: every course is conducted live in a classroom environment from the comfort of your home. You can work at a standard job, spend two hours on the internet every day, do extra work and reading on weekends, and become a professional data scientist in six months’ time.
We also have personalized GitHub project portfolio creation, management, and faculty guidance. Not to mention individual attention for each student.
And IITians for faculty who also happen to have 9+ years of industry experience.
So when we say that our product is the best on the market, we really mean it, because of the live classroom sessions, which no other option on the Internet today has.
Am I kidding? Absolutely not. And you can get started with Dimensionless Technologies Data Science with Python and R course for just 70-odd USD. Which is the most cost-effective option on the market!
And unlike all the 10 courses and resources detailed above, instead of doing 10 courses, you just need to do one single course, with the extracted meat of all that you need to know as a data scientist. And yes, we cover:
Statistics & Probability
Machine Learning in Python
Machine Learning in R
GitHub Personal Project Portfolio Creation
Live Remote Daily Sessions
Experts with Industrial Experience
A Classroom Environment (to keep you motivated)
Individual Attention to Every Student
I hope this information has you seriously interested. Please sign up for the course – you will not regret it.
And we even have a two-week trial for you to experience the course for yourself.
Choose wisely and optimally.
Unleash the data scientist within!
An excellent general article on emerging state-of-the-art technology, AI, and blockchain:
Data science is one of the hottest topics of the 21st century because we are generating data at a much higher rate than we can actually process it. Many business and tech firms are now gaining an edge by harnessing data science, and as a result the field is booming.
In this blog, we will dive deep into the world of machine learning. We will walk you through machine learning basics and look at the process of building an ML model. We will also build a random forest model in Python to make the process easier to understand.
What is Machine Learning?
Machine Learning is the science of getting computers to learn and act like humans do, and improve their learning over time in an autonomous fashion, by feeding them data and information in the form of observations and real-world interactions.
There are many different types of machine learning algorithms, with hundreds published each day, and they’re typically grouped by either learning style (e.g. supervised learning, unsupervised learning, semi-supervised learning) or by similarity in form or function (e.g. classification, regression, decision tree, clustering, deep learning, etc.). Regardless of learning style or function, all machine learning algorithms consist of the following three components:
Representation (a set of classifiers or the language that a computer understands)
Evaluation (aka objective/scoring function)
Optimization (search method; often the highest-scoring classifier, for example; there are both off-the-shelf and custom optimization methods used)
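To make the three components above concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the tiny (age, purchased) dataset is made up, the representation is a one-parameter threshold rule, the evaluation function is plain accuracy, and the optimization is an exhaustive search over candidate thresholds. Real algorithms use far richer versions of each piece.

```python
# Toy data: (age, purchased) pairs. Entirely made up for illustration.
data = [(22, 0), (25, 0), (31, 0), (38, 1), (45, 1), (52, 1), (29, 0), (47, 1)]


def predict(threshold, age):
    """Representation: a one-parameter rule, 'buys if age >= threshold'."""
    return 1 if age >= threshold else 0


def accuracy(threshold, rows):
    """Evaluation: the fraction of rows the rule labels correctly."""
    correct = sum(1 for age, label in rows if predict(threshold, age) == label)
    return correct / len(rows)


def best_threshold(rows, candidates):
    """Optimization: exhaustive search for the highest-scoring threshold."""
    return max(candidates, key=lambda t: accuracy(t, rows))


threshold = best_threshold(data, range(20, 61))
print(threshold, accuracy(threshold, data))
```

Swapping any one of the three functions (a different rule, a different score, a smarter search) gives a different learning algorithm, which is the point of the decomposition.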
Steps for Building ML Model
Here is a step-by-step example of how a hospital might use machine learning to improve both patient outcomes and ROI:
1. Define Project Objectives
The first step of the life cycle is to identify an opportunity to tangibly improve operations, increase customer satisfaction, or otherwise create value. In the medical industry, discharged patients sometimes develop conditions that necessitate their return to the hospital. In addition to being dangerous and troublesome for the patient, these readmissions mean the hospital will spend additional time and resources on treating patients for the second time.
2. Acquire and Explore Data
The next step is to collect and prepare all of the relevant data for use in machine learning. This means consulting medical domain experts to determine what data might be relevant in predicting readmission rates, gathering that data from historical patient records, and getting it into a format suitable for analysis, most likely into a flat file format such as a .csv.
3. Model Data
In order to gain insights from your data with machine learning, you have to determine your target variable, the factor of which you are trying to gain a deeper understanding. In this case, the hospital will choose “readmitted,” which is included as a feature in its historical dataset during data collection. Then, they will run machine learning algorithms on the dataset that build models that learn by example from the historical data. Finally, the hospital runs the trained models on data the model hasn’t been trained on to forecast whether new patients are likely to be readmitted, allowing it to make better patient care decisions.
4. Interpret and Communicate
One of the most difficult tasks of machine learning projects is explaining a model’s outcomes to those without any data science background, particularly in highly regulated industries such as healthcare. Traditionally, machine learning has been thought of as a “black box” because of how difficult it is to interpret insights and communicate their value to stakeholders and regulatory bodies alike. The more interpretable your model, the easier it will be to meet regulatory requirements and communicate its value to management and other key stakeholders.
5. Implement, Document, and Maintain
The final step is to implement, document, and maintain the data science project so the hospital can continue to leverage and improve upon its models. Model deployment often poses a problem because of the coding and data science experience it requires, and the time-to-implementation from the beginning of the cycle using traditional data science methods is prohibitively long.
A certain car manufacturing company X wants to target customers for a particular car model. Customers are described by their age, salary, and gender. The organisation wants to predict which customers will actually purchase its new car.
The dataset has a purchased column which holds two values, 0 and 1. 0 indicates that the individual has not purchased the car; 1 indicates a sale.
Importing the Required Libraries
You need to import all the required libraries first, which will make the model-building steps easier. We use the RandomForestClassifier from sklearn to build our random forest model, numpy and pandas to hold and manipulate the data, and the matplotlib library to plot charts and graphs and visualise results. We also import a function from the sklearn module that helps us split our data into training and testing parts.
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
Loading the Dataset
In this step, you need to load your dataset into memory. After that, we separate out the dependent and the independent variables for training our classifier. In most cases, you need to separate the dependent and the independent variables.
# Importing the dataset
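The loading code itself is not reproduced here, so the following is only a sketch of what this step might look like. The DataFrame below is a hypothetical stand-in for the real dataset, and the column names Gender, Age, EstimatedSalary, and Purchased are assumptions based on the problem description, not the actual file's headers.

```python
import pandas as pd

# Hypothetical stand-in for the article's dataset (the real file is not shown).
dataset = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female", "Male"],
    "Age": [19, 35, 26, 47, 52, 31],
    "EstimatedSalary": [19000, 20000, 43000, 25000, 90000, 74000],
    "Purchased": [0, 0, 0, 1, 1, 0],
})

# Independent variables (features): age and salary.
X = dataset[["Age", "EstimatedSalary"]].values
# Dependent variable (target): whether the car was purchased.
y = dataset["Purchased"].values

print(X.shape, y.shape)
```

With a real CSV file you would replace the hand-built DataFrame with a pd.read_csv call on your own file path.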
Splitting the Dataset to Form Training and Test Data
In almost all cases, you need to partition your data. A major chunk of your data acts as a training set and a smaller chunk acts as a test set. There are no clearly defined criteria for the proportion of the training and the test set, but most people follow a 70–30 or 75–25 rule where the larger chunk is the training set. We train the model on the training set and test it on the test set. This process is known as validation. The idea is to gauge the performance of the model on data it has never seen before, because in real-world scenarios the model will be predicting values on unseen data. Furthermore, techniques like validation help us detect overfitting or underfitting in the model.
Overfitting refers to the case when our model has learnt the specific data on which it was trained too closely. It will work well on the training data but will have poor accuracy on any unseen data point: the model is very specific to the data it has and does not generalise. Conversely, underfitting is the case where your model is too general and is not able to predict well even for your specific use case. To achieve the best model accuracy, you need to strike a balance between overfitting and underfitting.
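Overfitting is easiest to see by comparing training accuracy with test accuracy. The sketch below is an illustration under stated assumptions, not part of the car-sales example: the data is synthetic with 20% of the labels flipped at random, and it uses a single sklearn DecisionTreeClassifier (rather than a random forest) because one unrestricted tree memorises noise very visibly.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): the true rule is x0 > 0.5,
# but 20% of the labels are flipped at random to act as noise.
X = rng.uniform(0, 1, size=(600, 2))
y = (X[:, 0] > 0.5).astype(int)
noisy = rng.random(600) < 0.2
y = np.where(noisy, 1 - y, y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scores = {}
for name, depth in [("unrestricted", None), ("depth_1", 1)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each tree.
    scores[name] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(name, scores[name])
```

The unrestricted tree fits the training set perfectly, noise included, so its test accuracy falls well short of its training accuracy. That gap is the signature of overfitting.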
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
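As a rough sketch of the split described above, the example below applies train_test_split with a 75–25 ratio. The data is synthetic and the labelling rule is made up, so treat the arrays as placeholders for your own X and y; random_state is fixed only to make the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the customer data: 400 rows of (age, salary).
age = rng.integers(18, 61, size=400)
salary = rng.integers(15000, 150001, size=400)
X = np.column_stack([age, salary])
y = (age > 40).astype(int)  # made-up labelling rule, for illustration only

# 75% of the rows go to training, 25% are held back for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)
```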
In this case, we are fitting our model to the training data. We are using the random forest model exposed by the sklearn package in Python. We pass the independent features and the dependent variable separately, and the model learns a mapping between them by building an ensemble of decision trees.
# Fitting Random Forest Classification to the Training set
from sklearn.ensemble import RandomForestClassifier
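A minimal end-to-end sketch of the fitting step might look like the following. The dataset is synthetic and the purchase rule is invented for illustration, and the hyperparameters (n_estimators=10, criterion="entropy") are assumptions rather than the settings used in the original tutorial.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic customers: buy if older than 40 or earning above 100000 (made-up rule).
age = rng.integers(18, 61, size=400)
salary = rng.integers(15000, 150001, size=400)
X = np.column_stack([age, salary])
y = ((age > 40) | (salary > 100000)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit an ensemble of 10 decision trees on the training data only.
classifier = RandomForestClassifier(
    n_estimators=10, criterion="entropy", random_state=0
)
classifier.fit(X_train, y_train)

test_accuracy = classifier.score(X_test, y_test)
print(test_accuracy)
```

Because the invented rule is clean and axis-aligned, the forest should score very well here; real customer data is noisier and will not be this forgiving.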
In this part, we are passing unseen values to our model on which it is making predictions. We use a confusion matrix to derive metrics like accuracy, precision, and recall for our model. These metrics help us to understand the performance of the model.
# Predicting the Test set results
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
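To show how accuracy, precision, and recall fall out of a confusion matrix, here is a small sketch with hand-written labels. The y_test and y_pred lists are made up for illustration; in the tutorial they would come from the test split and from the classifier's predictions.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for ten customers.
y_test = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# For binary labels 0/1, sklearn lays the matrix out as [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted buyers who really bought
recall = tp / (tp + fn)           # share of real buyers the model found

print(cm)
print(accuracy, precision, recall)
```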
Visualising the Predictions
Additionally, we have made an attempt to visualise the predictions of our model using the below code.
In this machine learning tutorial, we studied the basics of ML. Machine learning grew out of the idea that computers can learn without being explicitly programmed to perform specific tasks; researchers interested in artificial intelligence wanted to see if computers could learn from data. Such systems learn from previous computations to produce reliable decisions and results. It’s a science that’s not new, but one that’s gaining fresh momentum.
Follow this link, if you are looking to learn more about data science online!
We discussed earlier, in Part 1 of Blockchain Applications of Data Science on this blog, how the world could be made much more profitable not just for a select set of the super-rich but also for the common man, for anyone who participates in creating a digitally trackable product. We discussed how large-scale adoption of cryptocurrencies and blockchain technology worldwide could herald a change in the economic demography of the world that could last for generations to come. In this article, we discuss how AI and data science can be used to tackle one of the most pressing questions of the blockchain revolution: how to model the future price of the Bitcoin cryptocurrency in order to trade for massive profit.
But first, we take a short detour to explore another aspect of cryptocurrency that is not commonly talked about. Looking at the state of the world right now, it should be discussed more and I feel compelled to share this information with you before we skip to the juicy part about cryptocurrency price forecasting.
The Environmental Impact of Cryptocurrency Mining
First, an assumption: I assume you’ve read Part 1, which contained a link to a visual guide to how cryptocurrencies work. In case you missed the guide, here’s a link for you to check again.
The following articles discuss the impact of cryptocurrency mining on the environment. Read at least part of one of them so that you can follow along as we progress with this article:
Cryptocurrency mining consumes huge amounts of computational resources and energy, by some estimates enough electrical power to run an entire country. This is mainly due to the Proof-of-Work (PoW) mining system used by Bitcoin. For more, see the following article.
In PoW mining, miners compete against each other in a desperate race to see who can find the solution to a mathematical hashing problem the quickest. And in every race, only one miner is rewarded with the Bitcoin value.
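The hashing race can be illustrated with a toy puzzle. The sketch below is a simplification and not Bitcoin's actual scheme: it uses a single SHA-256 pass and asks for a hex digest that starts with a fixed number of zeros, whereas Bitcoin hashes block headers with double SHA-256 against a numeric target. The function name mine and the block contents are hypothetical.

```python
import hashlib


def mine(block_data, difficulty):
    """Try nonces 0, 1, 2, ... until the hash starts with `difficulty` zeros."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce, digest
        nonce += 1


# Each extra zero makes the search roughly 16 times longer on average.
nonce, digest = mine("example block", difficulty=4)
print(nonce, digest)
```

There is no shortcut to finding the nonce other than trying values one after another, while checking a claimed solution takes a single hash. Every miner who loses the race has spent that computation for nothing, which is where the energy cost comes from.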
In a significant step forward, Vitalik Buterin’s Ethereum cryptocurrency has shifted to a Proof-of-Stake (PoS) based mining system. This makes the mining process significantly less energy intensive than PoW. Some claim that PoS may cut energy use by as much as 99.9% compared with PoW. Whatever the exact statistics may be, a PoS based mining process is a big step forward and may completely change the way environmentalists feel about cryptocurrencies.
So by shifting to PoS mining we can save a huge amount of energy. That is a caveat you need to remember and be aware of, because Bitcoin uses PoW mining only. It would be a dream come true for an environmentalist if Bitcoin could shift to PoS mining. Let’s hope and pray that it happens.
Now back to our main topic.
Use AI and Data Science to Predict Future Prices of Cryptocurrency – Including the Burst of the Bitcoin Bubble
What is a blockchain? It is a distributed database that is decentralized and has no central point of control. As of February 2018, the Bitcoin blockchain on a full node was 160-odd GB in size. Now, in April 2019, it is 210 GB. So this is the question I am going to pose to you: would it be possible to use the data in the blockchain distributed database to identify patterns and statistical invariances so as to invest minimally for the maximum possible profit? Can we build models to forecast the future prices of cryptocurrency using AI and data science? The answer is a definite yes.
You may wonder whether applying data science techniques and statistical analysis can actually produce information that helps in forecasting the future price of bitcoin. I came across a remarkable kernel on www.Kaggle.com (a website for data scientists to practice problems and compete with each other in competitions) by a user with the handle wayward artisan and the profile name Tania J. I thought it was worth sharing, since it is a statistical analysis of the rise and fall of the bitcoin bubble that vividly illustrates how statistical methods helped this user to forecast the future price of bitcoin. The entire kernel is very large and interesting, so please do visit it at the link given below. Only the start and the middle section of the kernel are given here, because of space and intellectual property considerations.
A Kaggle Kernel That Modelled the Bitcoin Bubble Burst Within Reasonable Error Limits
The following kernel uses cryptocurrency financial data scraped from www.coinmarketcap.com. It is a sobering example of how AI models actually predicted the collapse of the bitcoin bubble, prompting as many sellers to sell as they did. Coming across this kernel was one of my main motivations for writing this article. I have omitted a lot of details, especially building the model and analyzing its accuracy. I just wanted to show that it was possible.
The dataset is available at the following link as a CSV file that can be opened in Microsoft Excel:
We focus on one of the middle sections with the first ARIMA model with SARIMAX (do look up Wikipedia and Google Search to learn about ARIMA and SARIMAX) which does the actual prediction at the time that the bitcoin bubble burst (only a subset of the code is shown). Visit the Kaggle kernel page on the link below this extract to get the entire code:
<data analysis and model analysis code section not shown here for brevity>
This code and the code earlier in the kernel (not shown for the sake of brevity) that built the model for accuracy gave the following predictions as output:
What do we learn? Surprisingly, the model captures the Bitcoin bubble burst with a remarkably accurate prediction (error levels ~ 10%)!
So, do AI and data science have anything to do with blockchain technology and cryptocurrency? The answer is a resounding yes. Expect data science, statistical analysis, neural networks, and probability model distributions to play a heavy part when you want to forecast cryptocurrency prices.
For all the data science students out there, I am going to include one more screen from the same kernel on Kaggle (link):
The reason I want to show you this screen is that terms like kurtosis and heteroskedasticity are statistics concepts that you need to master in order to conduct forecasts like this, chiefly so that you can analyze the accuracy of the model you have constructed. The output window is given below:
Data Science, Machine Learning, Deep Learning, and Artificial Intelligence are some of the most frequently heard buzzwords in the modern analytical eco-space. The exponential growth of technology in this regard has simplified our lives and made us more machine dependent. The astonishing hype surrounding such technologies has prompted professionals from various disciplines to jump on board and consider analytics as a career option.
To master Data Science or Artificial Intelligence, one needs a myriad of skills, including Programming, Mathematics, Statistics, Probability, Machine Learning, and also Deep Learning. The most sought-after programming languages in Data Science are Python and R, with the former being regarded as the holy grail of the programming world because of its functionality, flexibility, community, and more.
Python is comparatively easy to learn, but given its many uses, certain areas need to be mastered more thoroughly than others. In this blog, we will learn about virtual environments in Python and how they can be used.
What is a Python Virtual Environment?
A Python virtual environment is a tool which keeps the resources and dependencies of different projects separate by creating an isolated environment for each of them.
As virtual environments are just directories containing a few scripts, you can create as many of them as you need.
Why Do We Need Virtual Environments?
Python has a rich list of modules and packages used for different applications. However, those packages often are not part of the standard library. An application might also need a specific version of a library, for instance because it depends on a particular bug fix.
It is often impossible for a single installation of Python to meet the requirements of every application. A conflict is created when two applications need two different versions of a particular module.
By default, every application on a system uses the same directory for storing and retrieving site-packages, which are the third-party libraries. This may not be a cause for concern for system packages, but it certainly is for site-packages.
To eliminate such scenarios, Python has the facility of creating virtual environments which would separate the modules, and packages needed by each application in its own isolated environment. It would also have a standard self-contained directory consisting of the version of the python installed.
Imagine a scenario where both project A and project B depend on the same project C. At this point everything might seem fine, but when project A needs version v1.0.0 of project C and project B needs v2.0.0 of project C, a conflict arises, because Python cannot differentiate between two versions of a package in the site-packages directory: both versions would have the same name in the same directory.
This would lead to both projects using the same version, which is not acceptable in many real-life cases. This is where Python virtual environments and the virtualenv/venv tools come to the rescue.
Creating a Virtual Environment
Python 3 already ships with the venv module for creating and managing virtual environments. Python 2 users can install the virtualenv tool with the pip install virtualenv command and create environments with it. The venv module will usually set up the environment with the most recent version of Python available on your system. If you have multiple versions installed, you can select a specific one by running that interpreter, for example python3.
Selecting a directory is the first step, as it is the place where the virtual environment will be located. Once the directory is decided, running the command python3 -m venv dimensionless-env creates a directory named dimensionless-env if it doesn’t already exist, and also creates several directories inside it containing a copy of (or link to) the Python interpreter and various supporting files.
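The same creation step can also be driven from Python itself through the standard library's venv module, which is what python3 -m venv calls under the hood. The sketch below is illustrative: it builds the environment inside a temporary directory so it cleans up after itself, and it passes with_pip=False (an assumption made to keep the example fast and offline), so this particular environment has no pip installed.

```python
import tempfile
import venv
from pathlib import Path

# Build the environment in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / "dimensionless-env"
    venv.create(env_dir, with_pip=False)

    # List what venv put inside the new environment directory.
    created = sorted(p.name for p in env_dir.iterdir())
    has_config = (env_dir / "pyvenv.cfg").exists()

print(created)
print(has_config)
```

The pyvenv.cfg file is what marks the directory as a virtual environment; the remaining entries hold the interpreter and the environment's own site-packages.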
Once the virtual environment is created, it needs to be activated using the below commands –
dimensionless-env\Scripts\activate.bat in the Windows operating system.
source dimensionless-env/bin/activate in the Unix or Mac operating system. The bash shell uses this script. For csh, or fish shells, there are alternate scripts that could be used such as activate.csh, and activate.fish.
Once the environment is activated, the shell’s prompt displays the name of the virtual environment in use. Activation also modifies the environment so that running python gets you that particular version and installation of Python.
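Inside the virtual environment you can install, upgrade, or remove packages using the pip command. For example, the command below searches for packages related to astronomy. Note that PyPI has since disabled the search API that pip search relies on, so on current setups you may need to search on pypi.org in a browser instead.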
(dimensionless-env) $ pip search astronomy
There are several sub-commands in pip, such as install, freeze, etc. The latest version of any package can be installed by specifying its name, as in pip install package-name.
Often, an application needs a specific version of a particular package, which can be installed by giving the package name followed by == and the version number, as in pip install package-name==<version>.
Re-running the same command does nothing. To move to a different version from here, either specify that version number or use the --upgrade flag to get the latest version, as in pip install --upgrade package-name.
To uninstall a particular package pip uninstall package-name command is used. In order to get detailed information about a particular package, the pip show command is used. All the installed packages in the virtual environment could be displayed using the pip list command.
(dimensionless-env) $ pip list
The pip freeze command produces a similar list, but in the format that pip install expects. A common convention is to save that output in a requirements.txt file.
The requirements.txt file can then be committed to version control and shipped with the application, so that users can install all the necessary packages with pip install -r requirements.txt.
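To make the requirements format concrete, here is a rough Python sketch that builds name==version lines for the packages visible to the running interpreter, using the standard library's importlib.metadata. It is only an approximation of what pip freeze prints (editable and VCS installs are not handled), and freeze_lines is a hypothetical helper name.

```python
from importlib.metadata import distributions


def freeze_lines() -> list[str]:
    """Return sorted 'name==version' lines for the installed distributions."""
    lines = set()
    for dist in distributions():
        meta = dist.metadata
        if meta is None:
            # Skip distributions whose metadata file is missing or unreadable.
            continue
        name = meta.get("Name")
        version = meta.get("Version")
        if name and version:
            lines.add(f"{name}=={version}")
    return sorted(lines, key=str.lower)


# Print the first few lines; the full list could be written to requirements.txt.
for line in freeze_lines()[:5]:
    print(line)
```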
What is Virtualenvwrapper?
Python virtual environments provide flexibility in the development and maintenance of a project. Creating isolated environments keeps projects separate from each other, with the dependencies each project requires installed only in its own environment.
Although virtual environments resolve the conflicts that arise from package management, they are not perfect. Some problems still come up while managing the environments themselves, and the virtualenvwrapper tool addresses these.
Some of the useful features of virtualenvwrapper are –
Organization – Virtualenvwrapper ensures all the virtual environments are organized in one particular location
Flexibility – It eases the process of creating, deleting, and copying environments by providing a command for each.
Simplicity – There is a single command which allows switching between the environments.
Virtualenvwrapper can be installed using the pip install virtualenvwrapper command and is then activated by sourcing the virtualenvwrapper.sh script in the shell. After the first installation using pip, the exact location of virtualenvwrapper.sh can be read from the output of the installation.
How Python Virtual Environment is Used in Data Science?
The field of Data Science encompasses several methodologies, including Deep Learning. Deep Learning is built on neural networks, which are loosely modeled on the neurons in the human brain. Unlike traditional Machine Learning algorithms, Deep Learning needs a huge volume of data and vast computational power to make accurate predictions.
There are several Python libraries for Deep Learning, such as TensorFlow, Keras, and PyTorch. TensorFlow, which was created by Google, is widely used for Deep Learning operations. To work with TensorFlow in a Jupyter Notebook, a common approach is to create a virtual environment first and then install all the necessary packages inside that environment.
Once you are in the Anaconda prompt, the conda create -n myenv python=3.6 command creates a new virtual environment named myenv. The environment can be activated using the conda activate myenv command. With the environment active, we can install the packages below, which are required to work with TensorFlow.
conda install jupyter
conda install scipy
pip install --upgrade tensorflow
TensorFlow is used in applications like Object Detection, Image Processing, and so on.
Python is one of the most important programming languages to master in the 21st century, and mastering it opens the door to numerous career opportunities. Its virtual environment feature lets you efficiently create and manage a project and its dependencies.
In this article, we learned that virtual environments not only store dependencies cleanly but also resolve various issues around packaging and versioning in a project. The huge Python community helps you find any tools needed for your project.
Dimensionless has several blogs and training to get started with Python Learning and Data Science in general.
Machine Learning is on everyone's lips in the analytics world. Gone are the days of the traditional manual approach to making key business decisions. Machine Learning is the future and is here to stay.
However, the term Machine Learning is not a new one. It has been around since the early days of computing, but it has grown tremendously in the last decade because of the massive amounts of data being generated and the enormous computational power that modern systems possess.
Machine Learning is the art of Predictive Analytics, where a system is trained on a set of data to learn patterns from it and then tested by making predictions on a new set of data. The more accurate the predictions are, the better the model performs. However, the metric used to judge the model varies with the domain one is working in.
Predictive Analytics has several uses in the modern world. It has been implemented in almost all sectors to make better business decisions and to stay ahead in the market. In this blog post, we look into one of the key areas where Machine Learning has made its mark: Customer Churn Prediction.
What is Customer Churn?
For any e-commerce business or businesses in which everything depends on the behavior of customers, retaining them is the number one priority for the organization. Customer churn is the process in which the customers stop using the products or services of a business.
Preventing Customer Churn, also known as Customer Attrition, is a better business strategy than acquiring new customers. Retaining present customers is cost-effective, and a bit of effort can regain the trust that customers might have lost in the services.
On the other hand, to win a new customer, a business needs to spend a lot of time and money on its sales and marketing departments, on more lucrative offers, and most importantly on earning that customer's trust. It takes more resources to earn the trust of a new customer than to retain an existing one.
What are the Causes of Customer Churn?
There is a multitude of reasons why a customer could decide to stop using the services of a company. However, a couple of these reasons outweigh the others in the market.
Customer Service – This is one of the most important aspects on which the growth of a business depends. A customer may stop using a company's services if the customer service is poor or does not live up to expectations. A study showed that nearly ninety percent of customers leave due to poor experience, as the modern era demands exceptional services and experiences.
When customers do not receive such an experience from a business, they tend to lean towards its competitors and leave negative reviews of their past experiences on social media, which also deters potential new customers from using the service. Another study showed that almost fifty-nine percent of people aged between twenty-five and thirty share negative client experiences online.
Thus, a poor customer experience results in the loss of not just a single customer but multiple customers, which hinders the growth of the business in the process.
Onboarding Process – Whenever a business is looking to attract new customers to its service, it is necessary that the onboarding process, which includes timely follow-ups, regular communication, updates about new products, and so on, is followed and maintained consistently over a period of time.
What are some of the Disadvantages of Customer Churn?
A customer's lifetime value and the growth of the business are directly related: the more likely customers are to churn, the lower the potential for the business to grow. Even a good marketing strategy will not save a business if it continues to lose customers at regular intervals for other reasons and spends more money on acquiring new customers who are not guaranteed to be loyal.
There is a lot of debate around retaining existing customers versus acquiring new ones, because retention is much more cost-effective and supports business growth. Acquiring a new customer takes almost seven times more effort and time than retaining an existing one. The global value of a lost customer is nearly two hundred and forty-three dollars, which makes churn a costly affair for any business.
What Strategies could a Business Undertake to prevent Customer Churn?
Customer Churn hinders or prevents the growth of an organization. Thus any business or organization needs a flexible system in place to prevent the churn of customers and ensure its growth in the process. Companies need to find metrics that identify the probability of a customer leaving and draw up strategies for improving their services and products.
The calculation of the possibility of the customer churning varies from one business to another. There is no one fixed methodology that every organization uses to prevent churn. A churn rate could represent a variety of things such as – the total number of customers lost, the cost of the business loss, what percentage of the customers left in comparison to the total customer count of the organization, and so on.
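As a small illustration of the percentage-based definition mentioned above, the sketch below computes a churn rate as the share of customers lost relative to the total customer count. The function name and the example figures are hypothetical and are not taken from any real business.

```python
def churn_rate(customers_lost: int, total_customers: int) -> float:
    """Return the percentage of customers lost out of the total customer count."""
    if total_customers <= 0:
        raise ValueError("total_customers must be positive")
    if not 0 <= customers_lost <= total_customers:
        raise ValueError("customers_lost must be between 0 and total_customers")
    return 100.0 * customers_lost / total_customers


# Hypothetical figures: 50 customers lost out of 1,000.
rate = churn_rate(50, 1000)
print(f"Churn rate: {rate:.1f}%")
```

Other definitions, such as the revenue lost to churn, would need different inputs, which is why each organization has to settle on the metric that suits it.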
Improving the customer experience should be the first strategy on the agenda of any business looking to prevent churn. Apart from that, maintaining customer loyalty by providing better, personalized services is another important step. Additionally, many organizations send out customer surveys time and again to keep track of customer experiences, and they also ask customers who have already churned for their reasons.
A company should understand and learn about its customers beforehand. The amount of data available across the internet is sufficient to analyze customers' behavior, their likes and dislikes, and to improve services based on their needs. All these measures, if taken with care, can help a business prevent its customers from churning.
Telecom Customer Churn Prediction
Previously, we learned how Predictive Analytics is employed by various businesses to anticipate unwanted events and reduce losses by putting the right system in place. As customer churn is a global issue, we will now see how Machine Learning can be used to predict customer churn for a telecom company. The dataset contains the following columns:
Gender – Indicates whether the customer is male or female.
Senior Citizen – A binary variable with values as 1 for senior citizen and 0 for not a senior citizen.
Partner – Values as ‘yes’ or ‘no’ based on whether the customer has a partner.
Dependents – Values as ‘yes’ or ‘no’ based on whether the customer has dependents.
Tenure – A numerical feature which gives the total number of months the customer stayed with the company.
Phone Service – Values as ‘yes’ or ‘no’ based on whether the customer has phone service.
Multiple Lines – Values as ‘yes’ or ‘no’ based on whether the customer has multiple lines.
Internet Service – The internet service providers the customer has. The value is ‘No’ if the customer doesn’t have internet service.
Online Security – Values as ‘yes’ or ‘no’ based on whether the customer has online security.
Online Backup – Values as ‘yes’ or ‘no’ based on whether the customer has online backup.
Device Protection – Values as ‘yes’ or ‘no’ based on whether the customer has device protection.
Tech Support – Values as ‘yes’ or ‘no’ based on whether the customer has tech support.
Streaming TV – Values as ‘yes’ or ‘no’ based on whether the customer has streaming TV.
Streaming Movies – Values as ‘yes’ or ‘no’ based on whether the customer has streaming movies.
Contract – The term of the customer's contract, which could be one year, two years, or month-to-month.
Paperless Billing – Values as ‘yes’ or ‘no’ based on whether the customer has paperless billing.
Payment Method – The payment method used by the customer, which could be credit card, bank transfer, mailed check, or electronic check.
Monthly Charges – The amount charged to the customer each month.
Total Charges – The total amount charged to the customer.
Churn – This is our target variable, which needs to be predicted. Its values are either Yes (if the customer has churned) or No (if the customer is still with the company).
The following steps walk through the code we have written to predict customer churn.
First, we imported all the libraries needed for the rest of the code
To get an idea of what our data looks like, we read the CSV file and printed the first five rows as a data frame
Once the data was read, some pre-processing was needed to check for null values, outliers, and so on
Once the pre-processing was done, the next step was to find the relevant features to use in our model for the prediction. For that, we did some data visualization to assess the relevance of each predictor variable
After plotting the data, we observed that Gender doesn’t have much influence on churn, whereas senior citizens are more likely to leave the company. Also, Phone Service has more influence on Churn than Multiple Lines
The model cannot take categorical text values directly as input, hence those features are encoded as numbers to be used in our prediction
Based on our observations, we selected the features that have more influence on churn prediction
The data is scaled and split into train and test sets
We fitted the Random Forest classifier to the scaled data
We predicted the results and used the confusion matrix as the metric for our model
The model gives us (1155 + 190 = 1345) correct predictions and (273 + 143 = 416) incorrect predictions
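The counts quoted above can be turned into an overall accuracy figure with a little arithmetic. The sketch below uses only the four numbers from the confusion matrix. Which cell corresponds to which class is not stated in the walkthrough, so only the correct and incorrect totals are used.

```python
# Counts quoted in the walkthrough (diagonal and off-diagonal cells).
correct = 1155 + 190
incorrect = 273 + 143

total = correct + incorrect
accuracy = correct / total

print(f"{correct} correct out of {total} predictions")
print(f"Accuracy: {accuracy:.1%}")
```

This works out to roughly 76% accuracy on the test set.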
The entire code could be found in this GitHub link
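For readers without access to the linked code, the steps above can be sketched roughly as follows. This is not the authors' code. The data is a small synthetic stand-in generated on the spot, the column subset and the churn rule are assumptions made for illustration, one-hot encoding is swapped in for whatever encoding the original used, and the scaler is fitted on the training split only, which differs slightly from the order described above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the telecom CSV, which is not reproduced here.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "SeniorCitizen": rng.integers(0, 2, n),
    "PhoneService": rng.choice(["Yes", "No"], n),
    "Contract": rng.choice(["Month-to-month", "One year", "Two year"], n),
    "tenure": rng.integers(0, 73, n),
    "MonthlyCharges": rng.uniform(18, 120, n).round(2),
})
# Synthetic target (an assumption): short-tenure, month-to-month customers churn more.
risk = 0.6 * (df["Contract"] == "Month-to-month") + 0.4 * (df["tenure"] < 12)
df["Churn"] = np.where(rng.random(n) < 0.1 + 0.6 * risk, "Yes", "No")

# Pre-processing check: count missing values across all columns.
print(df.isnull().sum().sum(), "missing values")

# Encode categorical columns as numbers (one-hot here; label encoding also works).
X = pd.get_dummies(df.drop(columns="Churn"), drop_first=True, dtype=int)
y = (df["Churn"] == "Yes").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
correct = int(cm.trace())
incorrect = int(cm.sum() - cm.trace())
print(cm)
print(f"{correct} correct, {incorrect} incorrect")
```

Tree-based models such as Random Forest do not strictly need scaled inputs. The scaling step is kept here only to mirror the walkthrough.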
We have built a basic Random Forest Classifier model to predict the Customer Churn for a telecom company. The model could be improved with further manipulation of the parameters of the classifier and also by applying different algorithms.