Data scientists are one of the most hirable specialists today, but it’s not so easy to enter this profession without a “Projects” field in your resume. Furthermore, you need the experience to get the job, and you need the job to get the experience. Seems like a vicious circle, right? Also, the great advantage of data science projects is that each of them is a full-stack data science problem. Additionally, this means that you need to formulate the problem, design the solution, find the data, master the technology, build a machine learning model, evaluate the quality, and maybe wrap it into a simple UI. Hence, this is a more diverse approach than, for example, Kaggle competition or Coursera lessons.
Hence, in this blog, we will look at 10 projects to undertake in 2019 to learn data science and improve your understanding of different concepts.
Projects
1. Match Career Advice Questions with Professionals in the Field
Problem Statement: The U.S. has almost 500 students for every guidance counselor. Furthermore, youth lack the network to find their career role models, making CareerVillage.org the only option for millions of young people in America and around the globe with nowhere else to turn. Also, to date, 25,000 create profiles and opt-in to receive emails when a career question is a good fit for them. This is where your skills come in. Furthermore, to help students get the advice they need, the team at CareerVillage.org needs to be able to send the right questions to the right volunteers. The notifications for the volunteers seem to have the greatest impact on how many questions are answered.
Your objective: Develop a method to recommend relevant questions to the professionals who are most likely to answer them.
Problem Statement: In this competition, you must create an algorithm to identify metastatic cancer in small image patches taken from larger digital pathology scans. Also, the data for this competition is a slightly different version of the PatchCamelyon (PCam) benchmark dataset. PCam is highly interesting for both its size, simplicity to get started on, and approachability.
Your objective: Identify metastatic tissue in histopathologic scans of lymph node sections
Problem Statement: To assess the impact of climate change on Earth’s flora and fauna, it is vital to quantify how human activities such as logging, mining, and agriculture are impacting our protected natural areas. Furthermore, researchers in Mexico have created the VIGIA project, which aims to build a system for autonomous surveillance of protected areas. Moreover, the first step in such an effort is the ability to recognize the vegetation inside the protected areas. In this competition, you are tasked with the creation of an algorithm that can identify a specific type of cactus in aerial imagery.
Your objective: Determine whether an image contains a columnar cactus
Problem Statement: In a world, where movies made an estimate of $41.7 billion in 2018, the film industry is more popular than ever. But what movies make the most money at the box office? How much does a director matter? Or the budget? For some movies, it’s “You had me at ‘Hello. In this competition, you’re presented with metadata on over 7,000 past films from The Movie Database to try and predict their overall worldwide box office revenue. Also, the data points provided include cast, crew, plot keywords, budget, posters, release dates, languages, production companies, and countries. Furthermore, you can collect other publicly available data to use in your model predictions.
Your objective: Can you predict a movie’s worldwide box office revenue?
Problem Statement: An existential problem for any major website today is how to handle toxic and divisive content. Furthermore, Quora wants to tackle this problem head-on to keep their platform a place where users can feel safe sharing their knowledge with the world. On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions — those founded upon false premises, or that intend to make a statement rather than look for helpful answers.
In this competition, you need to develop models that identify and flag insincere questions. Moreover, to date, Quora has employed both machine learning and manual review to address this problem. With your help, they can develop more scalable methods to detect toxic and misleading content.
Your objective: Detect toxic content to improve online conversations
Problem Statement: This competition is provided as a way to explore different time series techniques on a relatively simple and clean dataset. You are given 5 years of store-item sales data and asked to predict 3 months of sales for 50 different items at 10 different stores. What’s the best way to deal with seasonality? Should stores be modelled separately, or can you pool them together? Does deep learning work better than ARIMA? Can either beat xgboost? Also, this is a great competition to explore different models and improve your skills in forecasting.
Your Objective: Predict 3 months of item sales at different stores
Problem Statement: This competition focuses on the problem of forecasting the future values of multiple time series, as it has always been one of the most challenging problems in the field. More specifically, we aim the competition at testing state-of-the-art methods designed by the participants, on the problem of forecasting future web traffic for approximately 145,000 Wikipedia articles. Also, the sequential or temporal observations emerge in many key real-world problems, ranging from biological data, financial markets, weather forecasting, to audio and video processing. Moreover, the field of time series encapsulates many different problems, ranging from analysis and inference to classification and forecast. What can you do to help predict future views?
Problem Statement: Forecast future traffic to Wikipedia pages
Problem Statement: What does physics have in common with biology, cooking, cryptography, diy, robotics, and travel? If you answer “all pursuits are under the immutable laws of physics” we’ll begrudgingly give you partial credit. Also, If you answer “people chose randomly for a transfer learning competition”, congratulations, we accept your answer and mark the question as solved.
In this competition, we provide the titles, text, and tags of Stack Exchange questions from six different sites. We then ask for tag predictions on unseen physics questions. Solving this problem via a standard machine approach might involve training an algorithm on a corpus of related text. Here, you are challenged to train on material from outside the field. Can an algorithm learn appropriate physics tags from “extreme-tourism Antarctica”? Let’s find out.
Your objective: Predict tags from models trained on unrelated topics
Problem Statement: MNIST (“Modified National Institute of Standards and Technology”) is the de facto “hello world” dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike. Furthermore, in this competition, your goal is to correctly identify digits from a dataset of tens of thousands of handwritten images.
Your objective: Learn computer vision fundamentals with the famous MNIST data
Problem Statement: The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. Furthermore, this sensational tragedy shocked the international community and led to better safety regulations for ships. Also, one of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class. In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive.
Your objective: Predict survival on the Titanic and get familiar with ML basics
Data scientists are the no. 1 most promising job in America for 2019, according to a Thursday report from LinkedIn. Hence, this comes as no surprise: Data scientist topped Glassdoor’s list of Best Jobs in America for the past three years, with professionals in the field reporting high demand, high salaries, and high job satisfaction.
Also, with the increase in demand, employers are looking for more skills in modern day data scientists. Furthermore, a modern-day data scientist needs to be a good player in aspects like maths, programming, communication and problem-solving.
In this blog, we are going to explore if knowledge of mathematics is really necessary to become good data scientists. Also, we will try to explore ways, if any, through which one can become a good data scientist without learning maths.
What all it takes for a modern day Data Scientist
Data scientists continue to be in high demand, with companies in virtually every industry looking to get the most value from their burgeoning information resources. Additionally, this role is important, but the rising stars of the business are those savvy data scientists that have the ability to not only manipulate vast amounts of data with sophisticated statistical and visualization techniques but have a solid acumen from which they can derive forward-looking insights, Boyd says. Also, these insights help predict potential outcomes and mitigate potential threats to the business. Additionally, key skills of modern-day data scientists are as follows
1. Critical thinking
Data scientists need to be critical thinkers, to be able to apply the objective analysis of facts on a given topic or problem before formulating opinions or rendering judgments. Also, they need to understand the business problem or decision being made and be able to ‘model’ or ‘abstract’ what is critical to solving the problem, versus what is extraneous and can be ignored.
2. Coding
Top-notch data scientists know how to write code and are comfortable handling a variety of programming tasks. Furthermore, to be really successful as a data scientist, the programming skills need to comprise both computational aspects — dealing with large volumes of data, working with real-time data, cloud computing, unstructured data, as well as statistical aspects — [and] working with statistical models like regression, optimization, clustering, decision trees, random forests, etc.
3. Math
Data science is probably not a good career choice for people who don’t like or are not proficient at mathematics. Moreover, the data scientist whiz is one who excels at mathematics and statistics while having an ability to collaborate closely with line-of-business executives to communicate what is actually happening in the “black box” of complex equations in a manner that provides reassurance that the business can trust the outcomes and recommendations
4. Machine learning, deep learning, AI
Industries are moving extremely fast in these areas because of increased computing power, connectivity, and huge volumes of data being collected. A data scientist needs to stay in front of the curve in research, as well as understand what technology to apply when. Also, too many times a data scientist will apply something ‘sexy’ and new when the actual problem they are solving is much less complex.
Data scientists need to have a deep understanding of the problem to be solved, and the data itself will speak to what’s needed. Furthermore, being aware of the computational cost to the ecosystem, interpretability, latency, bandwidth, and other system boundary conditions — as well as the maturity of the customer — itself, helps the data scientist understand what technology to apply. That’s true as long as they understand the technology.
5. Communication
The importance of communication skills bears repeating. Virtually nothing in technology today is performed in a vacuum; there’s always some integration between systems, applications, data and people. Data science is no different, and being able to communicate with multiple stakeholders using data is a key attribute.
6. Data architecture
It is imperative that the data scientist understands what is happening to the data from inception to model to a business decision. Additionally, to not understand the architecture can have a serious impact on sample size inferences and assumptions, often leading to incorrect results and decisions.
As we have seen, mathematics is a crucial skill of a data scientist among many others. Agreed it is not everything that a data scientist may require. Hence, we will explore more on the usage of mathematics in data science. Also, this will help us to answer our question better!
Application of maths in data science and AI
Modelling a process (physical or informational) by probing the underlying dynamics
Constructing hypotheses
Rigorously estimating the quality of the data source
Quantifying the uncertainty around the data and predictions
Identifying the hidden pattern from the stream of information
Understanding the limitation of a model
Understanding mathematical proof and the abstract logic behind it
What all Maths You Must Know?
1. Linear algebra
You need to be familiar with linear algebra if you want to work in data science and machine learning because it helps deal with matrices — mathematical objects consisting of multiple numbers organised in a grid. Also, the data collected by a data scientist naturally comes in the form of a matrix — the data matrix — of n observations by p features, thus an n-by-p grid.
2. Probability theory
Probability theory — even the basic, not yet measure-theoretic probability theory — helps the data scientist deal with uncertainty and express it in models. Frequentists, Bayesian, and indeed quantum physicists argue to this day what probability really is (in many languages, such as Russian and Ukrainian, the word for probability comes from “having faith”), whereas pragmatists, such as Andrey Kolmogorov, shirk the question, postulate some axioms that describe how probability behaves (rather than what it is) and say: stop asking questions, just use the axioms.
3. Statistics
After probability theory, there comes statistics. As Ian Hacking remarked, “The quiet statisticians have changed our world — not by discovering new facts or technical developments, but by changing the ways that we reason, experiment, and form opinions”. Read Darrell Huff’s How to Lie with Statistics — if only to learn how to be truthful and how to recognise the truth — just as Moses learned “all the wisdom of the Egyptians” — in order to reject it.
4. Estimation theory
A particular branch of statistics — estimation theory — had been largely neglected in mathematical finance, at a high cost. It tells us how well we know a particular number: what is the error present in our estimates? How much of it is due to bias and how much due to variance?
Also, going beyond classical statistics, in machine learning, we want to minimise the error on new data — out-of-sample — rather than on the data that we have already seen — in-sample. As someone remarked, probably Niels Bohr or Piet Hein, “prediction is very difficult, especially about the future.”
5. Optimization theory
You can spend a lifetime studying this. Much of machine learning is about optimization — we want to find the weights that give the best (in optimisation speak, optimal) performance of a neural network on new data, so naturally, we have to optimise — perhaps with some form of regularisation. (And before you have calibrated that long short-term memory (LSTM) network — have you tried the basic linear regression on your data?)
What you miss on skipping Maths?
No in-depth knowledge of working of ML models
Inability to prove the correctness of your hypothesis
Prone to introducing bias and errors in your analysis
Inefficiency in math-heavy business problems
Some resources to learn maths online
We will divide the resources to 3 sections (Linear Algebra, Calculus, Statistics and probability), the list of resources will be in no particular order, resources are diversified between video tutorials, books, blogs, and online courses.
Linear Algebra
Used in machine learning (& deep learning) to understand how algorithms work under the hood. Basically, it’s all about vector/matrix/tensor operations, no black magic is involved!
Used in machine learning (&deep learning) to formulate the functions used to train algorithms to reach their objective, known by loss/cost/objective functions.
Linear algebra, calculus II, stats and probability are sufficient for understanding and handle 90% of machine learning models. Also, some areas and methods require special insights, for example, Bayesian and variational method require a calculus of variation, MCMC and Gibbs sample require advanced concepts of probability theory, information geometry and submanifolds learning to require differential geometry, kernel theory requires calculus III. Lately, it seems that even abstract algebra is playing a role.
Aditionally, not knowing maths may help you in reaching low-level positions in data science or solving some dummy projects. But in the long run, it will be maths only which will help you in scaling your career up!
Follow this link, if you are looking to learn more about data science online!
Technology has become the embedded component of applications and the defacto driver for growth in industries. With the advent of AI, new milestones are being achieved each day. We are moving towards an era of more and more integration, making it an indispensable mediator between systems and humans. The rapid strides taken by the mobile industry seems like an overwhelming convergence of multiple worlds. The innate ability of such systems to improve itself, strengthened by data analytics, IoT and AI have opened new frontiers. To reap the unbound merits of AI, software application vendors are integrating it into software applications.
In this blog, we will understand what exactly are these intelligent apps. What all does it require to make an intelligent app? Also, we will look into the real world applications of these intelligent apps.
What are Intelligent Applications?
So what exactly are intelligent apps? These are apps that not only know how to support key user decisions but also learn from user interactions. These apps aim to become even more relevant and valuable to these users.
In other words, intelligent apps are those that also learn and adapt and can even act on their own. Much like all of us, these apps learn and change behaviour. We are already seeing this at work. Have you noticed how e-commerce websites show you the right recommendations at the right time?
Intelligent apps are becoming a thing thanks to the strides being made in Artificial Intelligence (AI) and Machine Learning. Machine learning gives systems the ability to learn and improve from experience without being specifically programmed. There is an increase in the popularity of conversational systems and the growth of the Internet of Things. Therefore, we are seeing machine learning applied to more things in our everyday life.
Using AI algorithms, intelligent apps can study users’ behaviour and choices. Furthermore, it can sort through this data to use the relevant information to predict your needs and act on your behalf. For example, Smart Reply enables you to quickly respond to emails by giving you auto-generated replies. Productivity apps like Microsoft Office 365 and Google’s G Suite also use AI. Chatbots such as Meziuse machine learning to study user’s behaviour and provide them with choices they would like.
Features of Intelligent Applications
1. Data-driven
Intelligent apps combine and process multiple data sources — such as IoT sensors, beacons or user interactions — and turn an enormous quantity of numbers into valuable insights.
2. Contextual and relevant
Intelligent apps make much smarter use of a device’s features to proactively deliver highly relevant information and suggestions. Users will no longer have to go to their apps. Instead, the apps will come to them.
3. Continuously adapting
Through machine learning, intelligent apps continuously adapt and improve their output.
4. Action-oriented
By anticipating user behaviours with predictive analytics, smart applications deliver personalized and actionable suggestions.
5. Omnichannel
Progressive web applications are increasingly blurring the lines between native apps and mobile web applications.
Applications
1. Health Care Benefits
We are exploring AI/ML technology for health care. It can help doctors with diagnoses and tell when patients are deteriorating so medical intervention can occur sooner before the patient needs hospitalization. It’s a win-win for the healthcare industry, saving costs for both the hospitals and patients. The precision of machine learning can also detect diseases such as cancer sooner, thus saving lives.
2. Intelligent Conversational Interfaces
We are using machine learning and AI to build intelligent conversational chatbots and voice skills. These AI-driven conversational interfaces are answering questions from frequently asked questions and answers, helping users with concierge services in hotels, and to provide information about products for shopping. Advancements in the deep neural network or deep learning are making many of these AI and ML applications possible
3. Market Prediction
We are using AI in a number of traditional places like personalization, intuitive workflows, enhanced searching and product recommendations. More recently, we started baking AI into our go-to-market operations to be first to market by predicting the future. Or should I say, by “trying” to predict the future?
4. Customer Lifetime Value Modeling
Customer lifetime value models are among the most important for eCommerce business to employ. That’s because they can be used to identify, understand, and retain your company’s most valuable customers, whether that means the biggest spenders, the most loyal advocates of your brand, or both. These models predict the future revenue that an individual customer will bring to your business in a given period. With this information, you can focus your marketing efforts to encourage these customers to interact with your brand more often and even target your acquisition spend to attract new customers that are similar to your existing MVPs.
5. Churn Modeling
Customer churn modelling can help you identify which of your customers are likely to stop engaging with your business and why. The results of a churn model can range from churn risk scores for individual customers to drivers of churn ranked by importance. These outputs are essential components of an algorithmic retention strategy because they help optimize discount offers, email campaigns, or other targeted marketing initiatives that keep your high-value customer’s buying.
6. Dynamic Pricing
Dynamic pricing, also known as demand pricing, is the practice of flexible pricing items based on factors like the level of interest of the target customer, demand at the time of purchase, or whether the customer has engaged with a marketing campaign. This requires a lot of data about how different customers’ willingness to pay for a good or service changes across a variety of situations, but companies like airlines and ride-share services have successfully implemented dynamic price optimization strategies to maximize revenue.
7. Customer Segmentation
Data scientists do not rely on intuition to separate customers into groups. They use clustering and classification algorithms to group customers into personas based on specific variations among them. These personas account for customer differences across multiple dimensions such as demographics, browsing behaviour, and affinity. Connecting these traits to patterns of purchasing behaviour allows data-savvy companies to roll out highly personalized marketing campaigns. Additionally, these campaigns are more effective at boosting sales than generalized campaigns.
8. Image Classification
Image classification uses machine learning algorithms to assign a label from a fixed set of categories to an image that’s inputted. It has a wide range of business applications including modelling 3D construction plans based on 2D designs, social media photo tagging, informing medical diagnoses, and more. Deep learning methods such as neural networks are often used for image classification because they can most effectively identify relevant features of an image in the presence of potential complications like the variation in the point of view, illumination, scale, or volume of clutter in the image.
9. Recommendation Engines
Recommendation engines are another major way machine learning proves its business value. In fact, Netflix values the recommendation engine powering its content suggestions at $1 billion per year and Amazon says its system drives a 20–35% lift in sales annually. That’s because recommendation engines sift through large quantities of data to predict how likely any given customer is to purchase an item or enjoy a piece of content and then suggest those things to the user. The result is a customer experience that encourages better engagement and reduces churn.
Examples
1. Email Filters in Gmail
Google uses AI to ensure that nearly all of the email landing in your inbox is authentic. Their filters attempt to sort emails into the following categories like primary, social, promotions, updates, forums and spam. The program helps your emails get organized so you can find your way to important communications quicker.
2. LinkedIn
AI is used to help match candidates to jobs with the hopes of creating better employee-employer matches.
On its talent blog, LinkedIn explains that they use “deeper insights into the behaviour of applicants on LinkedIn” in order to “predict not just who would apply to your job, but who would get hired…”
3. Google Predictive Searches
When you begin typing a search term and Google makes recommendations for you to choose from, that’s AI in action. Predictive searches are based on data that Google collects about you, such as your location, age, and other personal details. Using AI, the search engine attempts to guess what you might be trying to find.
4. Tesla Smart Cars
Talking about the AI, there is no better and more prominent display of this technology than what smart car and drone manufacturers are doing with it. Just a few years back, using a fully automatic car was a dream, however, now companies like Tesla have made so much progress that we already have a fleet of semi-automatic cars on the road.
5. Online Ads Network(Facebook/Microsoft/Google)
One of the biggest users of artificial intelligence is the online ad industry which uses AI to not only track user statistics but also serve us ads based on those statistics. Without AI, the online ad industry will just fail as it would show random ads to users with no connection to their preferences what so ever. AI has become so successful in determining our interests and serving us ads that the global digital ad industry has crossed 250 billion US dollars with the industry projected to cross the 300 billion mark in 2019. So next time when you are going online and seeing ads or product recommendation, know that AI is impacting your life.
6. Amazon Product Recommendations
Amazon and other online retailers use AI to gather information about your preferences and buying habits. Then, they personalize your shopping experience by suggesting new products tailored to your habits.
When you search for an item such as “Bose headsets,” the search engine also shows related items that other people have purchased when searching for the same product.
Current trends and explorations
Intelligent things are poised to be one of the important trends that have the potential for ‘disruption’ and large-scale impact across industries. According to Gartner, the future will see the utilization of AI by almost all apps and services, making these apps discreet yet useful and intelligent mediators between systems and humans. AI will be incorporated into various systems and apps in some way and is poised to become the key enabler across a variety of services and software systems. As mentioned at the Google conference, very fast, we are moving from mobile-first to AI-first world.
It won’t be an exaggeration to say that all the new applications built in the coming years will be intelligent apps. These apps use machine learning and historical as well as real-time data to make smart decisions and deliver a highly personalized experience to the users. These apps combine predictive and prescriptive analytics, customer data, product insights, and operational vision with contemporary user-focused design and application development tools to create a highly impactful experience for users.
The intelligent apps undoubtedly have the potential to change the face of work and structure at companies in the coming years. According to Gartner’s prediction, companies will increasingly use and develop intelligent apps and utilize analytics and big data to enhance their business processes and offer top class customer experiences.
Summary
As companies are charting their digital transformation initiatives, they need to add intelligent apps to their blueprint. The development of the right intelligent apps needs to consider the new growth areas, internal and external data sources, real-time data acquisition, processing, and analysis and putting the right technology to use.
Intelligent apps are undoubtedly paving the way for speedier business decisions, better business results, greater efficiency of the workforce, and long-term gains for all — they just need to be utilized right. Companies which are diving in intelligent apps now will have a considerable competitive advantage in the near future.
Follow this link, if you are looking to learn more about data science online!
In computer science, Decision tree learning uses a decision tree (as a predictive model) to go from observations about an item to conclusions about the item’s target value. It is one of the predictive modelling approaches used in statistics, data mining and machine learning. Tree models where the target variable can take a discrete set of values are called classification trees and where the target variable can take continuous values (typically real numbers) are called regression trees.
In this blog, we will use decision trees to solve the churn problem for a telephonic service provider, Telco! The goal of this blog is about building a decision tree classifier. Furthermore, we will look at some metrics in the end in order to evaluate the performance of our decision tree model.
What is Customer Churn?
Churn is defined slightly differently by each organization or product. Generally, the customers who stop using a product or service for a given period of time are referred to as churners. As a result, churn is one of the most important elements in the Key Performance Indicator (KPI) of a product or service. A full customer lifecycle analysis requires taking a look at retention rates in order to better understand the health of the business or product.
From a machine learning perspective, churn can be formulated as a binary classification problem. Although there are other approaches to churn prediction (for example, survival analysis), the most common solution is to label “churners” over a specific period of time as one class and users who stay engaged with the product as the complementary class.
What is a Decision Tree?
Decision trees, one of the simplest and yet most useful Machine Learning structures. Decision trees, as the name implies, are trees of decisions.
You have a question, usually a yes or no (binary; 2 options) question with two branches (yes and no) leading out of the tree. You can get more options than 2, but for this article, we’re only using 2 options.
Trees are weird in computer science. Instead of growing from a root upwards, they grow downwards. Think of it as an upside down tree.
The top-most item, in this example, “Am I hungry?” is called the root. It’s where everything starts. Branches are what we call each line. A leaf is everything that isn’t the root or a branch.
Trees are important in machine learning as not only do they let us visualise an algorithm, but they are a type of machine learning.
Telco Customer Churn Problem Statement
Teleco is looking to predict behaviour to retain customers for their product. You can analyze all relevant customer data and develop focused customer retention programs.
Each row in the data represents a customer, each column contains customer’s attributes described in the column Metadata.
The data set includes information about:
Customers who left within the last month — the column is called Churn
Services that each customer has signed up for — phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
Customer account information — how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
Demographic info about customers — gender, age range, and if they have partners and dependents
1. Data Exploration
In this step, we will focus more on understanding data at hand. Before directly jumping to any modelling approach, one must understand the data at hand. We will look at different features/columns present in the data. We will also try to understand what every column relates too in the context of the Telco churn problem.
2. Data Preprocessing There are very rare chances that data available at hand is directly applicable for building ML models. There can be multiple issues with data before you can even start implementing any fancy algorithm. Some of the issues can be missing values, improper format, the presence of categorical variables etc. We need to handle such issues then only we can train machine learning models. In our case of churn prediction, you can actually see our approach of handling missing data and categorical variables.
3. Training a decision tree classifier In this section, we will fit a decision tree classifier on the available data. The classifier learns the underlying pattern present in the data and builds a rule-based decision tree for making predictions. In this section, we are focussing more on the implementation of the decision tree algorithm rather than the underlying math. But, we will cover some basic math behind it for your better understanding while implementing the decision tree algorithm. Also, you can read about the decision trees in detail here!
4. Model Evaluation In this section, we will use our classifier to predict whether a customer will stop using the services of Telco or not. We will be looking at some of the methods like confusion matrix, AUC value and ROC-curve etc to evaluate the performance of our decision tree classifier.
Implementation in Python
We start by importing all the basic libraries and the data for training the decision tree algorithm. We are using numpy and pandas for performing mathematical operations and managing data in form of tables respectively.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # this is used for the plot the graph
import seaborn as sns # used for plot interactive graph.
import warnings
warnings.filterwarnings("ignore")
from pylab import rcParams
%matplotlib inline
data = pd.read_csv('./telco-churn.csv')
You can see the columns in the data below
######################### RESULT ###################################
Customer ID contains no specific information about any customer hence we need to drop this column from our dataset
data=data.drop(['customerID'], axis=1)
Now, let us have a look at our target variable. Here the target variable is “Churn” i.e. if a customer is going to stop using the services of Telco or not. We can see the proportion of the people opting out of the Telco services (churners) through the following code. It builds a simple pie chart to show the distribution of two classes in the dataset.
# Data to plot for the % of target variable
labels =data['Churn'].value_counts(sort = True).index
sizes = data['Churn'].value_counts(sort = True)
colors = ["whitesmoke","red"]
explode = (0.1,0) # explode 1st slice
rcParams['figure.figsize'] = 8,8
# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
autopct='%1.1f%%', shadow=True, startangle=270,)
plt.title('Percent of churn in customer')
plt.show()
W can see through the pie chart, that 26.5% of customers present in the dataset are actually the churners i.e the people who opt out of the Telco services. The two target variables have a class imbalance in this case. No is present 73.5% of the times whereas yes is present only 26.5% of the times. In order to have a good machine learning model, we need to have a balance of the classes say 50% of both the class or in general, equal representation of the classes in the target variable.
There are techniques to manage the imbalanced classes like SMOTE but this is beyond the scope of this article for now. Decision trees can handle some imbalance present in the data better than other algorithms hence, we will not be using any technique to obtain balanced classes.
Finally, we need to convert yes/no of the target variable to 0 or 1. Machine learning models do not understand categorical variables. They need numbers to implement the underlying algorithms. Hence, our task in the data processing stage is to convert all the categorical variables to numbers.
# Converting the target class to 1’s and 0's
data['Churn'] = data['Churn'].map(lambda s :1 if s =='Yes' else 0)
Preprocessing Gender
In this section, we are processing the gender column to convert the categorical values into values. Pandas provide a get_dummies() function which can help you to convert your categorical column to a numeric one. It does one hot encoding of the values present in the column.
data['gender'].unique()
g = sns.factorplot(y="Churn",x="gender",data=data,kind="bar" ,palette = "Pastel1")
data = pd.get_dummies(data=data, columns=['gender'])
In this case, males and females have the same proportion of churning hence this variable is not informative enough to tell if a customer is going to stop using the services of Telco or not
Preprocessing Partner column
Partner columns contain a single yes/no. When there are only two categorical values present, we can directly replace the categories with 1 and 0. In case of more than two categorical variables, it is better to switch to one hot encoding of the values.
data['Partner'] = data['Partner'].map(lambda s :1 if s =='Yes' else 0)
Processing dependents, phone service and paperless billing variables
Above columns also contain only two categorical values. The following code assigns these categories a numeric 1/0 based on the respective category
data['Dependents'] = data['Dependents'].map(lambda s :1 if s =='Yes' else 0)
data['PhoneService'] = data['PhoneService'].map(lambda s :1 if s =='Yes' else 0)
data['PaperlessBilling'] = data['PaperlessBilling'].map(lambda s :1 if s =='Yes' else 0)
Analysing the number of months a person has stayed in the company
Let us analyse the number of months variable in order to understand the churn of the customers. We are plotting a distribution chart using the seaborn library in python through the following code!
# tenure distibution
g = sns.kdeplot(data.tenure[(data["Churn"] == 0) ], color="Red", shade = True)
g = sns.kdeplot(data.tenure[(data["Churn"] == 1) ], ax =g, color="Blue", shade= True)
g.set_xlabel("tenure")
g.set_ylabel("Frequency")
plt.title('Distribution of tenure comparing with churn feature')
g = g.legend(["Not Churn","Churn"])
We can clearly observe that most numbers of customers leave before 20 months. Once a customer has stayed with Telco for more than 20 months, it is highly unlikely that he is going to stop using the services provided by the Telco
Handling the telephone lines variable
Converting telephone line variable into dummy values
data['MultipleLines'].replace('No phone service','No', inplace=True)
data['MultipleLines'] = data['MultipleLines'].map(lambda s :1 if s =='Yes' else 0)
Preprocessing Customer Service variable
Converting customer service variable into dummy values
data['Has_InternetService'] = data['InternetService'].map(lambda s :0 if s =='No' else 1)
data['Fiber_optic'] = data['InternetService'].map(lambda s :1 if s =='Fiber optic' else 0)
data['DSL'] = data['InternetService'].map(lambda s :1 if s =='DSL' else 0)
data['OnlineSecurity'] = data['OnlineSecurity'].map(lambda s :1 if s =='Yes' else 0)
data['OnlineBackup'] = data['OnlineBackup'].map(lambda s :1 if s =='Yes' else 0)
data['DeviceProtection'] = data['DeviceProtection'].map(lambda s :1 if s =='Yes' else 0)
data['TechSupport'] = data['TechSupport'].map(lambda s :1 if s =='Yes' else 0)
data['StreamingTV'] = data['StreamingTV'].map(lambda s :1 if s =='Yes' else 0)
data['StreamingMovies'] = data['StreamingMovies'].map(lambda s :1 if s =='Yes' else 0)
Preprocessing payment and contract variables
data = pd.get_dummies(data=data, columns=['PaymentMethod'])
data = pd.get_dummies(data=data, columns=['Contract'])
Analysing monthly and total charges
In this section, we will analyse the total and monthly charges column to understand their effect on the churning of the customers. Below code outputs a box plot based on the monthly charges between the customers who have left using the services and those who have not.
g = sns.factorplot(x="Churn", y = "MonthlyCharges",data = data, kind="box", palette = "Pastel1")
The high monthly charge is one cause for customers opting out of the services. The box plot also suggests that that loyal customers are those who have a low monthly charge.
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'])
g = sns.factorplot(y="TotalCharges",x="Churn",data=data,kind="boxen", palette = "Pastel2")
Modeling
In this section, we are building our decision tree classifier. Decision tree algorithm has been implemented in the sklearn package and it provides an interface to invoke that algorithm.
We first separate out the target variables from the rest of our features present in the data. Following code performs this task
After that, we import the decision tree algorithm from the sklearn package. We pass the training data and the target data separately so that the algorithm can understand the input and create a rule-based mapping to the output in the form of a decision tree.
from sklearn import tree
# Initialize our decision tree object
classification_tree = tree.DecisionTreeClassifier(max_depth=4)
# Train our decision tree (tree induction + pruning)
classification_tree = classification_tree.fit(X_train, Y_train)
We can also visualise the trained decision tree in the previous step using the below code. Branches represent the actions while the internal nodes represent a decision. Leaf nodes represent the final target variables i.e whether a customer is churning or not (1/0)
Before we move on to the evaluation of our algorithm, let us quickly test the accuracy of our model by making some predictions. Below code makes some predictions on the test data. We are using cross-validation here in order to avoid bias while testing the model for accuracy.
One great aspect of rule-based learning is that we can actually visualise how important a given feature is towards predicting our target variable. Below code forms a chart showing the relative importance of the features present in the data towards the prediction of our target churn class.
In order to evaluate our classification model, we are using the confusion matrix. A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known.
You can read more about the confusion matrix here.
Following code generates the ROC curve for the predictions performed by our decision tree classifier. More the area under the ROC curve, better the prediction capability of the decision tree classifier.
from sklearn.metrics import roc_curve
# Calculate the fpr and tpr for all thresholds of the classification
fpr, tpr, threshold = roc_curve(Y_train, probs[:,1])
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
Below code makes a precision vs recall curve on the basis of the predictions. A precision-recall curve is a better indicator of the model performance than the ROC curve in case of imbalance in the target variable.
from matplotlib import pyplot
from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(Y_train, probs[:,1])
# plot no skill
pyplot.plot([0, 1], [0.5, 0.5], linestyle='--')
# plot the roc curve for the model
pyplot.plot(recall, precision, marker='.')
Summary
In this blog, we covered building a decision tree classifier for a sample use case. Decision trees can be a great platform to plant your footsteps in the world of machine learning as they are easier to implement and interpret. Also, we made some predictions using our decision tree classifier and evaluated its performance. The scope of this article was to familiarise the readers with decision trees and it’s an implementation for a use case!
For the next steps, you can also look at the Random Forest algorithm, which is nothing but a collection of different decision trees together making predictions. You can read more about the random forest here.
Follow this link, if you are looking to learn more about data science online!
With the growth of Data science in recent years, we have seen a growth in the development of the tools for it. R and Python have been steady languages used by people worldwide. But before R and Python, there was only one key player and it was MATLAB. MATLAB is still in usage in most of the academics areas and mostly all the researchers throughout the world use MATLAB.
In this blog, we will look at the reasons why MATLAB is a good contender to R and Python for Data science. Furthermore, we will discuss different courses which offer data science with MATLAB.
What is MATLAB?
MATLAB is a high-performance language for technical computing. It integrates computation, visualization, and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation.
It is a programming platform, specifically for engineers and scientists. The heart of MATLAB is the MATLAB language, a matrix-based language allowing the most natural expression of computational mathematics.
Typical uses include:
Math and computation
Algorithm development
Modelling, simulation, and prototyping
Data analysis, exploration, and visualization
Scientific and engineering graphics
Application development, including Graphical User Interface building
The language, apps, and built-in math functions enable you to quickly explore multiple approaches to arrive at a solution. MATLAB lets you take your ideas from research to production by deploying to enterprise applications and embedded devices, as well as integrating with Simulink® and Model-Based Design.
Features of MATLAB
Following are the basic features of MATLAB −
It is a high-level language for numerical computation, visualization and application development
Provides an interactive environment for iterative exploration, design and problem-solving.
Holds a vast library of mathematical functions for linear algebra, statistics, Fourier analysis, filtering, optimization, numerical integration and solving ordinary differential equations.
It provides built-in graphics for visualizing data and tools for creating custom plots.
MATLAB’s programming interface gives development tools for improving code quality maintainability and maximizing performance.
It provides tools for building applications with custom graphical interfaces.
It provides functions for integrating MATLAB based algorithms with external applications and languages such as C, Java, .NET and Microsoft Excel.
Why Use MATLAB in Data Science?
Physical-world data: MATLAB has native support for the sensor, image, video, telemetry, binary, and other real-time formats. Explore this data using MATLAB MapReduce functionality for Hadoop, and by connecting interfaces to ODBC/JDBC databases.
Machine learning, neural networks, statistics, and beyond: MATLAB offers a full set of statistics and machine learning functionality, plus advanced methods such as nonlinear optimization, system identification, and thousands of prebuilt algorithms for image and video processing, financial modelling, control system design.
High-speed processing of large data sets. MATLAB’s numeric routines scale directly to parallel processing on clusters and cloud.
Online and real-time deployment: MATLAB integrates into enterprise systems, clusters, and clouds, and can be targeted to real-time embedded hardware.
Also, MATLAB finds its features available for the entire data science problem-solving journey. Let us have a look at how MATLAB fits in every stage of a data science problem pipeline
1. Accessing and Exploring Data
The first step in performing data analytics is to access the wealth of available data to explore patterns and develop deeper insights. From a single integrated environment, MATLAB helps you access data from a wide variety of sources and formats like different databases, CSV, audio, video etc
2. Preprocessing and Data Munging
When working with data from numerous sources and repositories, engineers and scientists need to preprocess and prepare data before developing predictive models. For example, data might have missing values or erroneous values, or it might use different timestamp formats. MATLAB helps you simplify what might otherwise be time-consuming tasks such as cleaning data, handling missing data, removing noise from the data, dimensionality reduction, feature extraction and domain analysis such as videos/audios.
3. Developing Predictive Models
Prototype and build predictive models directly from data to forecast and predict the probabilities of future outcomes. You can compare machine learning approaches such as logistic regression, classification trees, support vector machines, and ensemble methods, and use model refinement and reduction tools to create an accurate model that best captures the predictive power of your data. Use flexible tools for processing financial, signal, image, video, and mapping data to create analytics for a variety of fields within the same development environment.
4. Integrating Analytics with Systems
Integrate analytics developed in MATLAB into production IT environments without having to recode or create custom infrastructure. MATLAB analytics can be packaged as deployable components compatible with a wide range of development environments such as Java, Microsoft .NET, Excel, Python, and C/C++. You can share standalone MATLAB applications or run MATLAB analytics as a part of the web, database, desktop, and enterprise applications. For low latency and scalable production applications, you can manage MATLAB analytics running as a centralized service that is callable from many diverse applications.
People these days use MATLAB only when they need to create a quick prototype and then for doing trial and error for validating a fresh concept. The real implementation will never be made with MATLAB but with python, c++ or a similar language. In my opinion MATLAB and python (or python libs) serve for different purposes. Scripting is just one feature out of thousands of features in MATLAB but it is the only feature in python. People use both python and MATLAB scripts where in some other faculties people rely on only MATLAB toolboxes with zero scripting. Hence both python and MATLAB will exist in future but most probably the usage of MATLAB “outside” may be reduced. MATLAB will exist until we have a better alternative of it.
Summary
MATLAB provides a lot of inbuilt utilities which one can directly apply in data science. Furthermore, MATLAB today finds it’s heavy usage in the field of academics and research. Although languages like R and Python are dominating data science worldwide, they are no way near to the simplicity level which MATLAB has to offer. Also, MATLAB will go a long way in the field of data science in the years to come. Additionally, learning MATLAB will be a great bonus for those who are willing to pursue a career in research!
Also, follow this link, if you are looking to learn more about data science online!
Never thought that online trading could be so helpful because of so many scammers online until I met Miss Judith... Philpot who changed my life and that of my family. I invested $1000 and got $7,000 Within a week. she is an expert and also proven to be trustworthy and reliable. Contact her via: Whatsapp: +17327126738 Email:judithphilpot220@gmail.comread more
A very big thank you to you all sharing her good work as an expert in crypto and forex trade option. Thanks for... everything you have done for me, I trusted her and she delivered as promised. Investing $500 and got a profit of $5,500 in 7 working days, with her great skill in mining and trading in my wallet.
judith Philpot company line:... WhatsApp:+17327126738 Email:Judithphilpot220@gmail.comread more
Faculty knowledge is good but they didn't cover most of the topics which was mentioned in curriculum during online... session. Instead they provided recorded session for those.read more
Dimensionless is great place for you to begin exploring Data science under the guidance of experts. Both Himanshu and... Kushagra sir are excellent teachers as well as mentors,always available to help students and so are the HR and the faulty.Apart from the class timings as well, they have always made time to help and coach with any queries.I thank Dimensionless for helping me get a good starting point in Data science.read more
My experience with the data science course at Dimensionless has been extremely positive. The course was effectively... structured . The instructors were passionate and attentive to all students at every live sessions. I could balance the missed live sessions with recorded ones. I have greatly enjoyed the class and would highly recommend it to my friends and peers.
Special thanks to the entire team for all the personal attention they provide to query of each and every student.read more
It has been a great experience with Dimensionless . Especially from the support team , once you get enrolled , you... don't need to worry about anything , they keep updating each and everything. Teaching staffs are very supportive , even you don't know any thing you can ask without any hesitation and they are always ready to guide . Definitely it is a very good place to boost careerread more
The training experience has been really good! Specially the support after training!! HR team is really good. They keep... you posted on all the openings regularly since the time you join the course!! Overall a good experience!!read more
Dimensionless is the place where you can become a hero from zero in Data Science Field. I really would recommend to all... my fellow mates. The timings are proper, the teaching is awsome,the teachers are well my mentors now. All inclusive I would say that Kush Sir, Himanshu sir and Pranali Mam are the real backbones of Data Science Course who could teach you so well that even a person from non- Math background can learn it. The course material is the bonus of this course and also you will be getting the recordings of every session. I learnt a lot about data science and Now I find it easy because of these wonderful faculty who taught me. Also you will get the good placement assistance as well as resume bulding guidance from Venu Mam. I am glad that I joined dimensionless and also looking forward to start my journey in data science field. I want to thank Dimensionless because of their hard work and Presence it made it easy for me to restart my career. Thank you so much to all the Teachers in Dimensionless !read more
Dimensionless has great teaching staff they not only cover each and every topic but makes sure that every student gets... the topic crystal clear. They never hesitate to repeat same topic and if someone is still confused on it then special doubt clearing sessions are organised. HR is constantly busy sending us new openings in multiple companies from fresher to Experienced. I would really thank all the dimensionless team for showing such support and consistency in every thing.read more
I had great learning experience with Dimensionless. I am suggesting Dimensionless because of its great mentors... specially Kushagra and Himanshu. they don't move to next topic without clearing the concept.read more
My experience with Dimensionless has been very good. All the topics are very well taught and in-depth concepts are... covered. The best thing is that you can resolve your doubts quickly as its a live one on one teaching. The trainers are very friendly and make sure everyone's doubts are cleared. In fact, they have always happily helped me with my issues even though my course is completed.read more
I would highly recommend dimensionless as course design & coaches start from basics and provide you with a real-life... case study. Most important is efforts by all trainers to resolve every doubts and support helps make difficult topics easy..read more
Dimensionless is great platform to kick start your Data Science Studies. Even if you are not having programming skills... you will able to learn all the required skills in this class.All the faculties are well experienced which helped me alot. I would like to thanks Himanshu, Pranali , Kush for your great support. Thanks to Venu as well for sharing videos on timely basis...😊
I highly recommend dimensionless for data science training and I have also been completed my training in data science... with dimensionless. Dimensionless trainer have very good, highly skilled and excellent approach. I will convey all the best for their good work. Regards Avneetread more
After a thinking a lot finally I joined here in Dimensionless for DataScience course. The instructors are experienced &... friendly in nature. They listen patiently & care for each & every students's doubts & clarify those with day-to-day life examples. The course contents are good & the presentation skills are commendable. From a student's perspective they do not leave any concept untouched. The step by step approach of presenting is making a difficult concept easier. Both Himanshu & Kush are masters of presenting tough concepts as easy as possible. I would like to thank all instructors: Himanshu, Kush & Pranali.read more
When I start thinking about to learn Data Science, I was trying to find a course which can me a solid understanding of... Statistics and the Math behind ML algorithms. Then I have come across Dimensionless, I had a demo and went through all my Q&A, course curriculum and it has given me enough confidence to get started. I have been taught statistics by Kush and ML from Himanshu, I can confidently say the kind of stuff they deliver is In depth and with ease of understanding!read more
If you love playing with data & looking for a career change in Data science field ,then Dimensionless is the best... platform . It was a wonderful learning experience at dimensionless. The course contents are very well structured which covers from very basics to hardcore . Sessions are very interactive & every doubts were taken care of. Both the instructors Himanshu & kushagra are highly skilled, experienced,very patient & tries to explain the underlying concept in depth with n number of examples. Solving a number of case studies from different domains provides hands-on experience & will boost your confidence. Last but not the least HR staff (Venu) is very supportive & also helps in building your CV according to prior experience and industry requirements. I would love to be back here whenever i need any training in Data science further.read more
It was great learning experience with statistical machine learning using R and python. I had taken courses from... Coursera in past but attention to details on each concept along with hands on during live meeting no one can beat the dimensionless team.read more
I would say power packed content on Data Science through R and Python. If you aspire to indulge in these newer... technologies, you have come at right place. The faculties have real life industry experience, IIT grads, uses new technologies to give you classroom like experience. The whole team is highly motivated and they go extra mile to make your journey easier. I’m glad that I was introduced to this team one of my friends and I further highly recommend to all the aspiring Data Scientists.read more
It was an awesome experience while learning data science and machine learning concepts from dimensionless. The course... contents are very good and covers all the requirements for a data science course. Both the trainers Himanshu and Kushagra are excellent and pays personal attention to everyone in the session. thanks alot !!read more
Had a great experience with dimensionless.!! I attended the Data science with R course, and to my finding this... course is very well structured and covers all concepts and theories that form the base to step into a data science career. Infact better than most of the MOOCs. Excellent and dedicated faculties to guide you through the course and answer all your queries, and providing individual attention as much as possible.(which is really good). Also weekly assignments and its discussion helps a lot in understanding the concepts. Overall a great place to seek guidance and embark your journey towards data science.read more
Excellent study material and tutorials. The tutors knowledge of subjects are exceptional. The most effective part... of curriculum was impressive teaching style especially that of Himanshu. I would like to extend my thanks to Venu, who is very responsible in her jobread more
It was a very good experience learning Data Science with Dimensionless. The classes were very interactive and every... query/doubts of students were taken care of. Course structure had been framed in a very structured manner. Both the trainers possess in-depth knowledge of data science dimain with excellent teaching skills. The case studies given are from different domains so that we get all round exposure to use analytics in various fields. One of the best thing was other support(HR) staff available 24/7 to listen and help.I recommend data Science course from Dimensionless.read more
I was a part of 'Data Science using R' course. Overall experience was great and concepts of Machine Learning with R... were covered beautifully. The style of teaching of Himanshu and Kush was quite good and all topics were generally explained by giving some real world examples. The assignments and case studies were challenging and will give you exposure to the type of projects that Analytics companies actually work upon. Overall experience has been great and I would like to thank the entire Dimensionless team for helping me throughout this course. Best wishes for the future.read more
It was a great experience leaning data Science with Dimensionless .Online and interactive classes makes it easy to... learn inspite of busy schedule. Faculty were truly remarkable and support services to adhere queries and concerns were also very quick. Himanshu and Kush have tremendous knowledge of data science and have excellent teaching skills and are problem solving..Help in interviews preparations and Resume building...Overall a great learning platform. HR is excellent and very interactive. Everytime available over phone call, whatsapp, mails... Shares lots of job opportunities on the daily bases... guidance on resume building, interviews, jobs, companies!!!! They are just excellent!!!!! I would recommend everyone to learn Data science from Dimensionless only 😊read more
Being a part of IT industry for nearly 10 years, I have come across many trainings, organized internally or externally,... but I never had the trainers like Dimensionless has provided. Their pure dedication and diligence really hard to find. The kind of knowledge they possess is imperative. Sometimes trainers do have knowledge but they lack in explaining them. Dimensionless Trainers can give you ‘N’ number of examples to explain each and every small topic, which shows their amazing teaching skills and In-Depth knowledge of the subject. Himanshu and Kush provides you the personal touch whenever you need. They always listen to your problems and try to resolve them devotionally.
I am glad to be a part of Dimensionless and will always come back whenever I need any specific training in Data Science. I recommend this to everyone who is looking for Data Science career as an alternative.
All the best guys, wish you all the success!!read more