Top 5 Careers in Data Science You Need to Know About

 

Reports suggest that around 2.5 quintillion bytes of data are generated every single day. As online usage grows at a tremendous rate, there is an immediate need for Data Science professionals who can clean the data, obtain insights from it, visualize it, train models and eventually come up with solutions that use Big Data for the betterment of the world.

By 2020, experts predict that there will be more than 2.7 million data science and analytics job openings. Looking at the entire Data Science pipeline, it is clearly exhausting for a single person to perform every stage and excel at all of them at the same time. Hence, Data Science offers a plethora of career options that require a broad spectrum of skill sets.

Let us explore the top 5 data science career options in 2019 (In no particular order).

 

1. Data Scientist

Data Scientist is one of the most in-demand job roles. The day-to-day responsibilities involve the examination of big data. As part of that analysis, Data Scientists also actively clean and organize the data. They know the machine learning algorithms well and understand when each one is appropriate. Over the course of the data analysis, and from the output of machine learning models, they identify patterns in order to solve the business problem.

The reason this role is so crucial in any organisation is that the company makes business decisions based on the insights discovered by the Data Scientist in order to gain an edge over its competitors. Note that the Data Scientist role leans more towards the technical domain. As the role demands a wide range of skills, Data Scientist is among the highest-paid jobs.

 

Core Skills of a Data Scientist

  1. Communication
  2. Business Awareness
  3. Database and querying
  4. Data warehousing solutions
  5. Data visualization
  6. Machine learning algorithms

 

2. Business Intelligence Developer

BI Developer is a job role that leans more towards the non-technical domain but also carries a fair share of technical responsibilities, when required, in day-to-day work. BI Developers are responsible for creating and implementing business policies based on the insights obtained from the technical team.

Apart from shaping policy with dedicated (or custom) Business Intelligence analytics tools, they also do a fair share of coding in order to explore the dataset and present its insights in a visual, non-verbal manner. They help bridge the gap between the technical team, which works with the deepest technical understanding, and the clients, who want the results in the most non-technical form. They are expected to generate reports from the insights and make them ‘less technical’ for others in the organisation. BI Developers typically have a deeper understanding of the business than Data Scientists do.

 

Core Skills of a Business Intelligence Developer

  1. Business model analysis
  2. Data warehousing
  3. Design of business workflow
  4. Business Intelligence software integration

 

3. Machine Learning Engineer

Once the data is clean and ready for analysis, machine learning engineers work on this big data to train a predictive model that predicts the target variable. These models are used to anticipate future trends in the data so that the organisation can make the right business decisions. As a real-life dataset usually has many dimensions, it is difficult for the human eye to draw insights from it. This is one of the reasons for training machine learning algorithms, as they deal easily with such complex datasets. These engineers carry out a number of tests and analyze the outcomes of the model.

The reason for constantly testing the model on various samples is to measure the accuracy of the developed model. Apart from training models, they sometimes also perform exploratory data analysis in order to understand the dataset completely, which in turn helps them train better predictive models.

 

Core Skills of Machine Learning Engineers

  1. Machine Learning Algorithms
  2. Data Modelling and Evaluation
  3. Software Engineering

 

4. Data Engineer

The pipeline of any data-oriented company begins with the collection of big data from numerous sources. That’s where data engineers operate in any given project. These engineers integrate data from various sources and optimize it for the problem statement. The work usually involves writing queries on big data so that it can be accessed easily and smoothly. Their day-to-day responsibility is to provide a streamlined flow of big data from various distributed systems. Data engineering differs from the other data science careers in that it concentrates on the systems and hardware that support the company’s data analysis, rather than on the analysis of the data itself. Data engineers also provide the organisation with efficient warehousing methods.

 

Core Skills of a Data Engineer

  1. Database Knowledge
  2. Data Warehousing
  3. Machine Learning algorithms

 

5. Business Analyst

Business Analyst is one of the most essential roles in the Data Science field. These analysts are responsible for understanding the data and its related trends after a decision has been made about a particular product. They store a good amount of data about various domains of the organisation. This data is really important because, if any product of the organisation fails, the analysts work on it to understand the reason behind the failure. This type of analysis is vital for all organisations as it helps them understand the loopholes in the company. The analysts not only trace the loophole but also provide solutions for it, making sure the organisation makes the right decision in the future. At times, the business analyst acts as a bridge between the technical team and the rest of the working community.

 

Core Skills of a Business Analyst

  1. Business awareness
  2. Communication
  3. Process Modelling

 

Conclusion

The data science career options mentioned above are in no particular order. In my opinion, every career option in the Data Science field is complementary to the others. In any data-driven organization, regardless of the salary, every role is important at its respective stage in a project.

Follow this link, if you are looking to learn data science online!

You can follow this link for our Big Data course. This course will equip you with the exact skills required.

Additionally, if you are interested in learning Data Science, click here to start the Online Data Science Course

Furthermore, if you want to read more about data science, read our Data Science Blogs

A Comprehensive Guide to Data Mining: Techniques, Tools and Application

Introduction

Data mining is the practice of sifting through very large amounts of data for useful information. It uses artificial intelligence techniques, neural networks, and advanced statistical tools to reveal trends, patterns, and relationships that might otherwise have remained undetected. In contrast to an expert system, data mining attempts to discover hidden rules underlying the data. It is also called data surfing.

In this blog, we present a comprehensive overview of data mining: what it is, its methods, its process and tools, and where it is applied, so that you get the complete picture in one place!

 

What is Data Mining?

Data mining is not a new concept but a proven technology that has emerged as a key decision-making factor in business. There are numerous use cases and case studies proving the capabilities of data mining and analysis. Yet we have witnessed many implementation failures in this field, which can be attributed to technical challenges or limited capabilities, misplaced business priorities and even clouded business objectives. While some implementations battle through these challenges, others fail to deliver the right data insights or to make them useful to the business. This article will walk you through guidelines for successfully implementing data mining projects.

Data mining is the process of uncovering patterns inside large sets of structured data to predict future outcomes. Structured data is data that is organized into columns and rows so that it can be accessed and modified efficiently. Using a wide range of machine learning algorithms, you can apply data mining approaches to a wide variety of use cases to increase revenues, reduce costs, and avoid risks.

At its core, data mining consists of two primary functions: description, for interpreting a large database, and prediction, which corresponds to finding insights such as patterns or relationships from known values. Before deciding on data mining techniques or tools, it is important to understand the business objectives or the value to be created through data analysis. The blend of business understanding with technical capabilities is pivotal in making big data projects successful and valuable to their stakeholders.

 

Different Methods of Data Mining

Data mining commonly involves four classes of tasks [1]: (1) classification, which arranges the data into predefined groups; (2) clustering, which is like classification except that the groups are not predefined, so the algorithm tries to group similar items together; (3) regression, which attempts to find a function that models the data with the least error; and (4) association rule learning, which searches for relationships between variables.

1. Association

Association is one of the best-known data mining techniques. In association, a pattern is discovered based on a relationship between items in the same transaction. That is why the association technique is also known as the relation technique. It is used in market basket analysis to identify a set of products that customers frequently purchase together.

Retailers use the association technique to research customers’ buying habits. Based on historical sales data, retailers might find that customers very often buy crisps when they buy beer, and they can therefore put beer and crisps next to each other to save time for the customer and increase sales.
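
To make the beer-and-crisps idea concrete, here is a minimal sketch of the two measures usually reported in market basket analysis: support (how often a pair appears together) and confidence (how often the second item appears given the first). The transactions and the function names `pair_support` and `confidence` are made up for illustration and are not taken from any particular tool.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented purely for illustration.
transactions = [
    {"beer", "crisps", "bread"},
    {"beer", "crisps"},
    {"bread", "milk"},
    {"beer", "crisps", "milk"},
    {"bread", "crisps"},
]


def pair_support(baskets):
    """Return the fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: count / total for pair, count in counts.items()}


def confidence(baskets, antecedent, consequent):
    """Estimate how often `consequent` is bought when `antecedent` is bought."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    both = sum(1 for b in with_antecedent if consequent in b)
    return both / len(with_antecedent)


support = pair_support(transactions)
beer_crisps_support = support[("beer", "crisps")]
beer_to_crisps = confidence(transactions, "beer", "crisps")
crisps_to_beer = confidence(transactions, "crisps", "beer")
```

Note that confidence is not symmetric: in this toy data every beer basket contains crisps, but not every crisps basket contains beer.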

2. Classification

Classification is a classic data mining technique based on machine learning. Basically, classification is used to assign each item in a set of data to one of a predefined set of classes or groups. Classification methods make use of mathematical techniques such as decision trees, linear programming, neural networks, and statistics. In classification, we develop software that can learn how to classify data items into groups. For example, we can apply classification to the following problem: “given all records of employees who left the company, predict who will probably leave the company in a future period.” In this case, we divide the employee records into two groups named “leave” and “stay”, and then ask our data mining software to classify the employees into these groups.
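
As a rough illustration of the “leave” versus “stay” example, the sketch below learns a single threshold (a one-question decision tree, sometimes called a decision stump) from labelled records. The feature `years_since_promotion`, the records and the function names are all hypothetical; a real project would use many features and an established library.

```python
# Hypothetical training records: (years_since_promotion, label).
records = [
    (0.5, "stay"),
    (1.0, "stay"),
    (2.0, "stay"),
    (3.5, "leave"),
    (4.0, "leave"),
    (5.0, "leave"),
]


def fit_stump(rows):
    """Pick the threshold that misclassifies the fewest training rows.

    Rows whose value is above the threshold are predicted as "leave".
    """
    values = sorted(value for value, _ in rows)
    # Candidate thresholds are the midpoints between neighbouring values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum((value > threshold) != (label == "leave") for value, label in rows)

    return min(candidates, key=errors)


def predict(threshold, value):
    """Classify one employee using the learned threshold."""
    return "leave" if value > threshold else "stay"


threshold = fit_stump(records)
prediction = predict(threshold, 4.2)
```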

3. Clustering

Clustering is a data mining technique that automatically forms meaningful or useful clusters of objects that have similar characteristics. The clustering technique defines the classes and puts objects in each class, whereas classification techniques assign objects to classes that are already known. To make the concept clearer, we can take book management in the library as an example. In a library, there is a wide range of books on various topics available. The challenge is how to keep those books in a way that readers can take several books on a particular topic without hassle. By using the clustering technique, we can keep books that have some kind of similarity in one cluster or on one shelf and label it with a meaningful name.
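
The sketch below shows the idea with a bare-bones k-means on one-dimensional data. It assumes, purely for illustration, that each book is described by a single number (its page count) and that the starting centroids are supplied by the caller; real clustering would use richer features and a library implementation.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Plain k-means on 1-D data with caller-supplied starting centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical page counts for six books.
page_counts = [90, 110, 120, 480, 500, 530]
centroids, clusters = kmeans_1d(page_counts, centroids=[100, 400])
```

Here the short books and the long books end up on two separate "shelves" without anyone having defined those groups in advance.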

4. Regression

In statistical terms, regression analysis is the process of identifying and analyzing the relationship among variables. It can help you understand how the value of the dependent variable changes when any one of the independent variables is varied. This means one variable depends on another, but not vice versa. It is generally used for prediction and forecasting.
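
A minimal sketch of simple linear regression with one independent variable follows, using the ordinary least squares formulas for slope and intercept. The data points and the function name `fit_line` are invented for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y divided by the variance of x gives the slope.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data that lies exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)

# Use the fitted line to forecast the dependent variable at a new x.
forecast = slope * 5 + intercept
```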

 

Data Mining Process and Tools

The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a conceptual tool that exists as a standard approach to data mining. The process outlines six phases:

  1. Business understanding
  2. Data understanding
  3. Data preparation
  4. Modelling
  5. Evaluation
  6. Deployment

The first two phases, business understanding and data understanding, are both preliminary activities. It is important to first define what you would like to know and what questions you would like to answer and then make sure that your data is centralized, reliable, accurate, and complete.

Once you’ve defined what you want to know and gathered your data, it’s time to prepare your data — this is where you can start to use data mining tools. Data mining software can assist in data preparation, modelling, evaluation, and deployment. Data preparation includes activities like joining or reducing data sets, handling missing data, etc.

The modelling phase in data mining is when you use a mathematical algorithm to find a pattern(s) that may be present in the data. This pattern is a model that can be applied to new data. Data mining algorithms, at a high level, fall into two categories — supervised learning algorithms and unsupervised learning algorithms. Supervised learning algorithms require a known output, sometimes called a label or target. Supervised learning algorithms include Naïve Bayes, Decision Tree, Neural Networks, SVMs, Logistic Regression, etc. Unsupervised learning algorithms do not require a predefined set of outputs but rather look for patterns or trends without any label or target. These algorithms include k-Means Clustering, Anomaly Detection, and Association Mining.

Data evaluation is the phase that will tell you how good or bad your model is. Cross-validation and testing for false positives are examples of evaluation techniques available in data mining tools. The deployment phase is the point at which you start using the results.
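
The two evaluation ideas mentioned above can be sketched in a few lines. This is only an illustration with hypothetical helper names (`k_fold_indices`, `false_positive_rate`); data mining tools ship their own, more complete versions of both.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def false_positive_rate(y_true, y_pred):
    """FP / (FP + TN) for binary labels, where 1 is the positive class."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn) if (fp + tn) else 0.0


folds = list(k_fold_indices(10, 3))
fpr = false_positive_rate([0, 0, 1, 1, 0], [0, 1, 1, 0, 0])
```

In cross-validation, the model is trained on each `train` part and scored on the matching `test` part, and the scores are averaged.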

 

Importance of Data Mining

 

1. Marketing / Retail

Data mining helps marketing companies build models based on historical data to predict who will respond to new marketing campaigns such as direct mail or online marketing campaigns. Using the results, marketers can take an appropriate approach to selling profitable products to targeted customers.

Data mining brings retail companies many of the same benefits as it does marketing. Through market basket analysis, a store can arrange its products so that customers can conveniently buy frequently purchased products together. In addition, it helps retail companies offer discounts on particular products that will attract more customers.

2. Finance / Banking

Data mining gives financial institutions insight into loans and credit reporting. By building a model from historical customer data, banks and financial institutions can distinguish good loans from bad ones. In addition, data mining helps banks detect fraudulent credit card transactions to protect the cardholder.

3. Manufacturing

By applying data mining to operational engineering data, manufacturers can detect faulty equipment and determine optimal control parameters. For example, semiconductor manufacturers face the challenge that, even when the conditions of the manufacturing environments at different wafer production plants are similar, the quality of the wafers is not the same, and some wafers even have defects for unknown reasons. Data mining has been applied to determine the ranges of control parameters that lead to the production of the golden wafer.

4. Governments

Data mining helps government agencies by digging through and analyzing records of financial transactions to build patterns that can detect money laundering or criminal activities.

 

Applications of Data Mining

  • The human genome contains roughly 20,000 protein-coding genes (early estimates were as high as 100,000). Each gene is composed of a long sequence of individual nucleotides arranged in a particular order. The number of ways in which these nucleotides can be ordered and sequenced to form distinct genes is practically limitless. Data mining technology can be used to analyze these sequential patterns, to search for similarity and to identify particular gene sequences. In the future, data mining technology will play a vital role in the development of new pharmaceuticals. It may also provide advances in cancer therapies.
  • Financial data collected in the banking and financial industry is often relatively complete, reliable, and of high quality. This facilitates systematic data analysis and data mining. Typical cases include classification and clustering of customers for targeted marketing. It can also include detection of money laundering and other financial crimes. Furthermore, we can look into the design and construction of data warehouses for multidimensional data analysis. 
  • The retail industry is a major application area for data mining since it collects huge amounts of data on customer shopping history, consumption, and sales and service records. Data mining in retail can identify customer buying habits, discover customer purchasing patterns and predict customer consumption trends. This helps design effective goods transportation and distribution policies and reduce business costs.
  • Data mining in the telecommunication industry can help understand the business involved, identify telecommunication patterns, catch fraudulent activities, make better use of resources and improve service quality. Typical cases include multidimensional analysis of telecommunication data, fraudulent pattern analysis and the identification of unusual patterns, as well as multidimensional association and sequential pattern analysis.

 

Summary

The more data you collect…the more value you can deliver. And the more value you can deliver…the more revenue you can generate.

Data mining is what will help you do that. So, if you are sitting on loads of customer data and not doing anything with it…I want to encourage you to make a plan to start diving into it this week. Do it yourself or hire someone else…whatever it takes. Your bottom line will thank you.

Always ask yourself how you are bringing value to your business with data mining!

Basic Statistics Concepts Every Data Scientist Should know

A Comprehensive Guide to Data Science With Python

Data Visualization with R

What is the Difference Between: Data Science, Data Mining and Machine Learning

data mining vs machine learning vs data science

Source: Quora.com

The analytics ecosystem has reached new heights in the recent past. The emergence of new tools and techniques has certainly made it easier for analytics professionals to work with data. Moreover, the massive amounts of data being generated from diverse sources need huge computational power and storage systems for analysis.

Three of the most commonly used terms in analytics are Data Mining, Machine Learning, and Data Science, the last of which combines the other two. In this blog post, we look at each of these three buzzwords along with examples.

Data Mining:

By the term ‘mining’ we refer to extracting something by digging. The same analogy applies to data, where information can be extracted by digging into it. Data mining is one of the most used terms these days. Unlike in the past, our lives are now surrounded by big data, and we have the tools and techniques to handle such voluminous, diverse and meaningful data.

Data contains many patterns that people can discover once it has been gathered from relevant sources. By combining multiple sources of data, even ones that look like junk on their own, hidden patterns can be extracted to provide valuable insights. This entire process is known as Data mining.

The data used for mining could be enterprise data that is restricted, secured and subject to privacy constraints. It could also be an integration of multiple sources, including financial data, third-party data, etc. The more data available to us the better, as we need to find patterns and insights in both sequential and non-sequential data.

The steps involved in data mining are –

  • Data Collection – This is one of the most important steps in Data mining, as getting the correct data is always a challenge in any organization. To find patterns in the data, we need to ensure that the source of the data is accurate and that as much data as possible is gathered.
  • Data Cleaning – A lot of the time, the data we get is not clean enough to draw insights from. There could be missing values, outliers or NULLs in the data, which need to be handled either by deletion or by imputation, depending on their significance to the business.
  • Data Analysis – Once the data is gathered and cleaned, the next step is to analyze it, which is known as Exploratory Data Analysis. Several techniques and methodologies are applied in this step to derive relevant insights from the data.
  • Data Interpretation – Analysis alone is worthless unless it is presented, in the form of graphs or charts, to the stakeholders or the business, who draw conclusions based on it.
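
As a small illustration of the Data Cleaning step in the list above, the sketch below fills missing values with the median of the observed ones. The `ages` column and the function name `impute_missing` are hypothetical, and the median is only one possible choice; deletion or a different fill value may suit the business better.

```python
from statistics import median


def impute_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical column with two missing entries.
ages = [22, None, 35, 41, None, 29]
cleaned = impute_missing(ages)
```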

Data mining has several uses in the real world. For example, if we take the login logs of a web application, we would see that the data is messy, containing information like timestamps, user activities, time spent on the website, etc. However, if we clean the data and then analyze it, we would find relevant information such as users’ regular habits, the peak time for most activities, and so on. All this information could help to increase the efficiency of the system.
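
A minimal sketch of the login-log example: it assumes a made-up log format of "ISO timestamp, user, action" separated by spaces and finds the hour of the day with the most logins. Real logs will look different and need more careful parsing.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log lines in an assumed "timestamp user action" format.
log_lines = [
    "2019-03-01T09:15:02 alice login",
    "2019-03-01T09:47:10 bob login",
    "2019-03-01T13:05:44 alice login",
    "2019-03-02T09:01:30 carol login",
]


def peak_login_hour(lines):
    """Return the hour of day (0-23) with the largest number of log entries."""
    hours = Counter()
    for line in lines:
        timestamp = line.split()[0]
        hours[datetime.fromisoformat(timestamp).hour] += 1
    return hours.most_common(1)[0][0]


peak_hour = peak_login_hour(log_lines)
```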

Another example of data mining is in crime prevention. Though data mining is most widely used in education and healthcare, law enforcement agencies also use it to spot patterns in data about criminal activities that have taken place. Mining and gathering information from this data helps the agencies predict future crime events and prevent them from occurring. The agencies could mine the data to find out where the next crime could take place. They could also prevent cross-border incidents by understanding which vehicles to check, the age of the occupants, etc.

However, a few of the important points one should remember about Data Mining –

  • Data mining should not be considered the first solution to an analysis task if other accurate solutions are applicable. It should be used when such solutions fail to provide value.
  • A sufficient amount of data should be present to draw insights from it.
  • It should be clear whether the problem is a Regression problem or a Classification problem.

Machine Learning:

Previously, we learned about Data mining which is about gathering, cleaning, analyzing, and interpreting relevant insights from the data for the business to draw conclusions from it.

If Data mining is about describing a set of events, Machine Learning is about predicting future events. The term describes a system that learns from past data in order to generalize and predict future events from unseen data.

Machine Learning could be divided into three categories –

  • Supervised Learning – In supervised learning, the target is labeled i.e., for every corresponding row there is an output value.
  • Unsupervised Learning – The data set is unlabelled in unsupervised learning i.e., one has to cluster the data into various groups based on the similarities in the pattern of the data points.
  • Reinforcement Learning – It is a special category of Machine Learning used in areas such as game playing, robotics and self-driving cars. In reinforcement learning, the learner is rewarded for every correct move and penalized for every incorrect move.

The field of Machine Learning is vast, and it requires a blend of statistics, programming, and most importantly data intuition to master it. Supervised and unsupervised learning are used to solve regression, classification, and clustering problems.

  • In regression problems, the target is numeric, i.e., continuous or discrete in nature. A continuous value can take any real value within a range (for example, a price or a temperature), whereas a discrete value is a count that takes only whole-number values.
  • In classification problems, the target is categorical i.e., binary, multinomial, or ordinal in nature.
  • In clustering problems, the dataset is grouped into different clusters based on the similar properties among the data in a particular group.

Machine Learning is widely used in fields such as Banking, Insurance, Healthcare, Manufacturing, Oil and Gas, and so on. Professionals from various disciplines feel the need to predict future outcomes in order to work efficiently and prepare for the best by taking appropriate actions. Some real-life examples where Machine Learning is used are –

  • Email Spam filtering – This is one of the best-known applications of Machine Learning, where an email is classified as ‘Spam’ or ‘Not Spam’ based on certain keywords in the mail. It is a binary classification supervised learning problem where the system is initially trained with a set of sample emails to learn the patterns that help in filtering out irrelevant emails. Once the system has been trained, it is evaluated on a validation set to tune it and check how well it generalizes, and then on a test set to measure its final accuracy.
  • Credit Risk Analytics – Machine Learning has a vast influence in the Banking and Insurance domains, with one of its uses being to predict whether a borrower will become delinquent on a loan. Loan default is a prevalent issue in which lenders and banks have lost millions by failing to identify the possibility of a borrower not repaying a loan or not meeting the contractual agreements. Various banks have therefore introduced Machine Learning, which takes several features of a borrower into account and builds a predictive model that helps mitigate the risk involved in giving credit card loans.
  • Product Recommendations – Flipkart and Amazon are two of the biggest e-commerce companies in the world, where millions of users shop for products of their choice every day. A recommendation algorithm works behind the scenes to simplify the customer’s life by displaying products they may like based on their previous shopping or search patterns. This is an example of unsupervised learning, where customers are grouped based on their shopping patterns.
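
The train, validation and test stages mentioned in the spam filtering example can be sketched as a simple three-way split. The fractions, the seed and the function name `train_val_test_split` are assumptions chosen for illustration, not a prescribed recipe.

```python
import random


def train_val_test_split(items, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the items and cut them into train, validation and test parts."""
    shuffled = list(items)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_fraction)
    n_val = int(n * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical email ids standing in for labelled sample emails.
email_ids = list(range(10))
train, val, test = train_val_test_split(email_ids)
```

The model is fitted on `train`, tuned against `val`, and scored once on `test` at the very end.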

Data Science: 

So far, we have learned about the two most common and important terms in Analytics i.e., Data mining and Machine Learning.

If Data mining deals with understanding and finding hidden insights in the data, then Machine Learning is about taking the cleaned data and predicting future outcomes. All of these together form the core of Data Science.

Data Science is a holistic study which involves both Descriptive and Predictive Analytics. A Data Scientist needs to understand and perform exploratory analysis as well as employ tools, and techniques to make predictions from the data.

A Data Scientist role is a mixture of the work done by a Data Analyst, a Machine Learning Engineer, a Deep Learning Engineer, or an AI researcher. Apart from that, a Data Scientist might also be required to build data pipelines which is the work of a Data Engineer. The skill set of a Data Scientist consists of Mathematics, Statistics, Programming, Machine Learning, Big Data, and communication.

Some of the applications of Data Science in the modern world are –

  • Virtual assistant – Amazon’s Alexa, and Apple’s Siri are two of the biggest achievements in the recent past where AI has been used to build human-like intelligent systems. A virtual assistant could perform most of the tasks that a human being could with proper instructions.
  • ChatBot – Another common usage of Data Science is ChatBot development, which is now being integrated into almost every corporation. A technique called Natural Language Processing is at the core of ChatBot development.
  • Identifying cancer cells – Deep Learning has made tremendous progress in the healthcare sector, where it is used to identify patterns in cells to predict whether they are cancerous or not. Deep Learning uses neural networks, which are loosely inspired by the human brain.

 

Conclusion

Data mining, Machine Learning, and Data Science are broad fields, and mastering them requires learning quite a few things.

Dimensionless has several resources to get started with.

To Learn Data Science, Get Data Science Training in Pune and Mumbai from Dimensionless Technologies.

To Learn more about data science, Click to read Data Science Blog.

Also Read:

Machine Learning for Transactional Analytics: Acquisition Cost Vs Lifetime Value

7 Technical Concept Every Data Science Beginner Should Know

Building Blocks of Decision Tree

It’s been said that Data Scientist is the “sexiest job title of the 21st century.” One main reason is the humongous amount of data available, as we are producing data at a rate never seen before. Alongside this dramatic growth in access to data, sophisticated algorithms such as Decision Trees and Random Forests are available. When there is so much data available, the most intricate part is selecting the correct algorithm to solve the problem. Each model has its own pros and cons and should be selected depending on the type of problem at hand and the data available.

Decision Trees:

The aim of this blog post is to discuss one of the most widely used Machine Learning algorithms: “Decision Trees”. As the name suggests, it uses a tree-like model to make decisions, as shown in the figure below. A Decision Tree is drawn upside down with its root at the top. A question is asked at every node, based on which the decision tree splits into branches. The end of a branch that doesn’t split further is called a Leaf.


Decision tree | dimensionless

Decision Trees can be used for classification as well as regression problems. That’s why they are called Classification and Regression Trees (CART). In the above example, a decision tree is being used for a classification problem to decide whether a person is fit or unfit. The depth of the tree is the length of the longest path from the root node to a leaf.
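
To make the terms root, node, leaf and depth concrete, here is a small hand-built tree for a fit/unfit decision, stored as nested dictionaries. The questions, thresholds and field names are hypothetical stand-ins and are not read from the figure; the tree is written by hand rather than learned from data.

```python
# Hypothetical fit/unfit tree. Internal nodes are dicts, leaves are strings.
tree = {
    "question": "age < 30?",
    "test": lambda person: person["age"] < 30,
    "yes": {
        "question": "eats a lot of pizza?",
        "test": lambda person: person["eats_pizza"],
        "yes": "unfit",
        "no": "fit",
    },
    "no": {
        "question": "exercises in the morning?",
        "test": lambda person: person["exercises"],
        "yes": "fit",
        "no": "unfit",
    },
}


def predict(node, person):
    """Walk from the root to a leaf, answering one question per node."""
    while isinstance(node, dict):
        node = node["yes"] if node["test"](person) else node["no"]
    return node


def depth(node):
    """Number of edges on the longest path from this node to a leaf."""
    if not isinstance(node, dict):
        return 0
    return 1 + max(depth(node["yes"]), depth(node["no"]))


label = predict(tree, {"age": 25, "eats_pizza": False, "exercises": True})
tree_depth = depth(tree)
```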

For basics of the decision tree, refer:

https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052

Have you ever wondered why, when there are so many sophisticated algorithms available, such as neural networks, which often do better on measures such as accuracy, decision trees are still one of the most widely used algorithms?

The biggest advantage of Decision Trees is interpretability. Let’s talk about neural networks to understand this. To make the concept of a neural network easy to understand, let’s consider it as a “Black Box”. A set of input data is given to the black box and it produces the output corresponding to that input. Now, what’s inside the black box? The black box consists of a computational unit with several hidden layers, depending on the intricacy of the problem. A large amount of data is also required to train these hidden layers. As the number of hidden layers increases, the complexity of the neural network increases significantly, and it becomes very hard to interpret its output. That’s where the importance of decision trees lies. Decision Trees’ interpretability helps humans understand what’s happening inside the black box. This can significantly help to improve the performance of neural networks in terms of measures such as accuracy and avoiding overfitting.

Decision Trees have other advantages too: nonlinear relationships between the parameters don’t affect the tree’s performance, Decision Trees implicitly perform feature selection, and they require minimal effort for data cleaning.

Every algorithm has its pros and cons. Disadvantages of Decision Trees include poor performance when the tree is overfitted to the training data and cannot generalize well. Decision trees can also be unstable: small variations in the data can produce a very different tree. This variance can be reduced by ensemble methods such as bagging and boosting.

If you have ever implemented a decision tree, have you thought about what’s happening in the background when you build it using scikit-learn in Python? Let’s understand the nitty-gritty of decision trees, i.e. the various functions that run in the background, such as the train-test split, checking the purity of data, classifying data, and calculating the overall entropy of the data. We will do this by implementing a decision tree from scratch, i.e. with the help of numpy and pandas only (without using scikit-learn).

Here, I am using the Titanic dataset to build a decision tree. The decision tree needs to be trained to classify whether a passenger died or survived based on parameters such as Age, Gender, and Pclass. Note that the Titanic data set contains various variables, such as passenger name, address, etc., which are dropped because they are just identifiers and don’t add value to the decision tree. This process is formally called “Feature Selection”.

Decision tree - 2 | dimensionless 

Data Cleaning:

The first step toward building an ML algorithm is data cleaning. It is one of the most important steps, because a model built on unformatted data can perform significantly worse. I am going to use the Titanic data set to build the decision tree. The aim of this model is to classify the passengers as survived or not based on the information given. The first step is to load the data set and clean it. Cleaning the data consists of 2 steps:

  1. Drop the variables which are of least importance in deciding. In the Titanic dataset, columns such as name, cabin no., and ticket no. are of least importance, so they have been dropped.
  2. Fill in all the missing values, i.e. replace NA’s with the most suitable measure of central tendency. All the NA’s are replaced with the mean in the case of continuous variables and the mode in the case of categorical variables.

Once the data is clean, we will build the helper functions which will help us write the main algorithm.
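
The two cleaning steps could be sketched with pandas roughly as follows. This is a minimal sketch and not necessarily the exact code behind this post: the function name `clean_titanic`, the list of dropped columns (taken from the Kaggle version of the data set), and the rule "numeric dtype means continuous" are all assumptions made for illustration.

```python
import pandas as pd

def clean_titanic(df):
    """Drop identifier-like columns and fill missing values (illustrative sketch)."""
    # Assumed identifier columns; errors="ignore" skips any that are absent.
    df = df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"], errors="ignore")
    for column in df.columns:
        if df[column].isna().any():
            if pd.api.types.is_numeric_dtype(df[column]):
                # Assumption: numeric columns are continuous, so use the mean.
                df[column] = df[column].fillna(df[column].mean())
            else:
                # Non-numeric columns are treated as categorical, so use the mode.
                df[column] = df[column].fillna(df[column].mode()[0])
    return df

# Tiny made-up sample standing in for the real Titanic file.
raw = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [20.0, None, 40.0, 30.0],
    "Embarked": ["S", "S", None, "C"],
    "Survived": [0, 1, 1, 0],
})
clean = clean_titanic(raw)
print(clean)
```

In the sample, the missing Age is replaced by the mean of the other three ages and the missing Embarked by the most frequent port.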

Train-Test Split:

We are going to divide the data set into 2 sets, i.e. a train data set and a test data set. I have kept the train-test split ratio at 80:20. However, this could be different. The best practice is to keep the test data set small compared to the train data set, depending on the size of the data set. But the test set should not be so small that it is no longer representative of the population.

In the case of large data sets, the data is divided into 3 parts: Training Data Set, Validation Data Set, and Test Data Set. The training data is used to train the model, and the validation data set is used to tune the model, for example to improve accuracy and to check for overfitting. Once the model is tuned for optimal accuracy on the validation set, it is evaluated on the test data set.
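
A train-test split helper can be written in a few lines of pandas. The sketch below is one possible version, not the post's exact code; the name `train_test_split`, its parameters, and the toy data are assumptions. It also assumes the DataFrame index is unique, since the test rows are removed by index.

```python
import pandas as pd

def train_test_split(df, test_size=0.2, random_state=0):
    """Randomly split df into train and test DataFrames (illustrative sketch)."""
    if isinstance(test_size, float):
        # Interpret a float as a proportion of the rows, e.g. 0.2 for 80:20.
        test_size = round(test_size * len(df))
    test_df = df.sample(n=test_size, random_state=random_state)
    train_df = df.drop(test_df.index)  # everything not sampled for the test set
    return train_df, test_df

# Made-up data: 10 rows, so an 80:20 split gives 8 train rows and 2 test rows.
df = pd.DataFrame({"Fare": range(10), "Survived": [0, 1] * 5})
train_df, test_df = train_test_split(df, test_size=0.2)
print(len(train_df), len(test_df))
```

Fixing `random_state` makes the split reproducible, which helps when comparing accuracy between runs.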

Check Data Purity and classify:

As shown in the block diagram, the Data Pure function checks the data set for its purity, i.e. whether the data set contains only one class (for example, only passengers who survived). If so, it classifies the data. If the data set contains a mix of classes, the algorithm tries to ask the questions which most accurately segregate them. This is done by implementing functions such as potential splits, split data, and calculate overall entropy. Let’s understand each of them in detail.
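
The purity check and the classification step might look like the sketch below. The function names `check_purity` and `classify_data` are assumptions, and so is the majority vote used to classify data that is not perfectly pure.

```python
import numpy as np

def check_purity(labels):
    """Return True when every row carries the same class label."""
    return len(np.unique(labels)) == 1

def classify_data(labels):
    """Return the most frequent class label (majority vote)."""
    classes, counts = np.unique(labels, return_counts=True)
    return classes[counts.argmax()]

pure = np.array([1, 1, 1])          # only survivors: pure
mixed = np.array([0, 1, 1, 0, 1])   # both classes present: not pure

print(check_purity(pure), check_purity(mixed))
print(classify_data(mixed))
```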

 

 

Potential Splits Function:

The potential splits function gives all the possible splits for all of the variables. In the Titanic dataset, there are 2 kinds of variables: Categorical and Continuous. The two types of variables are handled differently.

For a categorical variable, each unique value is taken as a possible split. Let’s take the example of Gender. Gender can only take 2 values, either male or female, so the possible splits are Male and Female. The question will be simple: Is Gender == “Male”? In this way, the decision tree can segregate the population based on Gender.

A continuous variable can take on any value, so we can’t deal with it in the same way as a categorical variable. Instead, a potential split is placed exactly at the midpoint of every two adjacent (sorted) values in the data set. Let’s understand this with the help of the “Fare” variable in the Titanic data set. Suppose Fare = {10, 20, 30, 40, 50}. The potential splits for Fare will be 15, 25, 35, and 45. By asking a question such as “Is Fare <= 25?”, we can segregate the data effectively.
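
Both cases can be illustrated for a single column with the sketch below. The name `get_potential_splits` and the `is_categorical` flag are assumptions for illustration. The Fare values are the ones from the example above.

```python
import numpy as np

def get_potential_splits(values, is_categorical):
    """Return the candidate split values for one column (illustrative sketch)."""
    unique_values = np.unique(values)  # sorted unique values
    if is_categorical:
        # Every category is a candidate for a question of the form "feature == category".
        return unique_values.tolist()
    # Midpoints between adjacent sorted values, for questions of the form "feature <= value".
    return ((unique_values[:-1] + unique_values[1:]) / 2).tolist()

fare_splits = get_potential_splits(np.array([10, 20, 30, 40, 50]), is_categorical=False)
gender_splits = get_potential_splits(np.array(["male", "female", "male"]), is_categorical=True)

print(fare_splits)    # [15.0, 25.0, 35.0, 45.0]
print(gender_splits)  # ['female', 'male']
```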

Once the potential splits function has returned all the potential splits for all the variables, the split data function is used to split the data on each and every potential split. It divides the data into 2 parts: data below and data above the split value. Entropy is calculated for each split.
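
A split data helper can be sketched with a boolean mask over a NumPy array. The name `split_data`, its signature, and the convention that rows matching the question go into the first part are assumptions, not necessarily what the post's code does.

```python
import numpy as np

def split_data(data, split_column, split_value, is_categorical):
    """Split the rows of a 2-D array into (data_below, data_above)."""
    column_values = data[:, split_column]
    if is_categorical:
        mask = column_values == split_value   # question: feature == value
    else:
        mask = column_values <= split_value   # question: feature <= value
    return data[mask], data[~mask]

# Made-up rows of [Fare, Survived].
data = np.array([[10, 0], [20, 0], [30, 1], [40, 1], [50, 1]])
below, above = split_data(data, split_column=0, split_value=25, is_categorical=False)

# Made-up rows of [Sex, Survived]; dtype=object keeps strings and numbers together.
people = np.array([["male", 0], ["female", 1], ["male", 0]], dtype=object)
males, others = split_data(people, split_column=0, split_value="male", is_categorical=True)

print(len(below), len(above))
print(len(males), len(others))
```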

 

Measure of Impurity:

There are two methods to measure impurity: Gini Impurity and Information Gain Entropy. Both methods behave similarly, and the choice of impurity measure has little impact on the performance of the decision tree. However, Entropy is somewhat more expensive to compute since it involves logarithms. This is the main reason for the large-scale use of Gini Impurity over Information Gain Entropy. Below are the formulae for both:

  1. Gini: $Gini(E) = 1 - \sum_{j=1}^{c} p_j^2$
  2. Entropy: $H(E) = -\sum_{j=1}^{c} p_j \log p_j$
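
Here $p_j$ is the proportion of rows belonging to class $j$ and $c$ is the number of classes. Both formulae can be evaluated with NumPy as in the sketch below. The function names are assumptions, and so is the base-2 logarithm (entropy in bits), since the formula does not fix the base of the logarithm.

```python
import numpy as np

def class_probabilities(labels):
    """Proportion p_j of each class present in labels."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini(labels):
    """Gini impurity: 1 - sum(p_j ** 2)."""
    p = class_probabilities(labels)
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy: -sum(p_j * log2(p_j)); base 2 is an assumption."""
    p = class_probabilities(labels)
    return -np.sum(p * np.log2(p))

balanced = np.array([0, 1, 0, 1])   # maximum uncertainty for two classes
pure = np.array([1, 1, 1, 1])       # no uncertainty

print(gini(balanced), entropy(balanced))
print(gini(pure), entropy(pure))
```

Only classes that actually occur are counted, so no $p_j$ is zero and the logarithm is always defined.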

I have used Information Gain Entropy as the measure of impurity. You can use either of them, as both give pretty much the same results, specifically in the case of CART analysis, as shown in the figure below. Entropy, in simpler terms, is a measure of randomness or uncertainty. For each potential split, the overall entropy is calculated. The best potential split is the one with the lowest overall entropy, because the lower the entropy, the lower the uncertainty and hence the more certain the classification.

method of impurity measure | Gini Impurity

Refer to the link below for a comparison between the two impurity measures:

https://github.com/rasbt/python-machine-learning-book/blob/master/faq/decision-tree-binary.md

To calculate entropy in the Titanic dataset example, the calculate entropy and calculate overall entropy functions are used. The functions are defined based on the equations explained above. Once the data is split into two parts by the split data function, the overall entropy is calculated for each and every potential split. The potential split with the lowest overall entropy is selected as the best split by the determine best split function.
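
The sketch below shows how the overall entropy and the best split could be computed for a single continuous feature. The function names and the toy Fare/Survived data are assumptions. The overall entropy is taken as the entropy of each part weighted by its share of the rows, which is the usual definition and is assumed here.

```python
import numpy as np

def entropy(labels):
    """Entropy in bits of the class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def overall_entropy(labels_below, labels_above):
    """Entropy of both parts, weighted by the share of rows in each part."""
    n = len(labels_below) + len(labels_above)
    return (len(labels_below) / n) * entropy(labels_below) \
        + (len(labels_above) / n) * entropy(labels_above)

def determine_best_split(values, labels):
    """Try every midpoint of one continuous feature and keep the lowest overall entropy."""
    unique_values = np.unique(values)
    candidates = (unique_values[:-1] + unique_values[1:]) / 2
    best_value, best_entropy = None, np.inf
    for value in candidates:
        mask = values <= value
        current = overall_entropy(labels[mask], labels[~mask])
        if current < best_entropy:
            best_value, best_entropy = float(value), float(current)
    return best_value, best_entropy

fare = np.array([10, 20, 30, 40, 50])
survived = np.array([0, 0, 1, 1, 1])
best_value, best_entropy = determine_best_split(fare, survived)
print(best_value, best_entropy)
```

In this toy data, the split at 25 separates the two classes perfectly, so its overall entropy is zero and it is selected.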

Determine type of feature:

The determine type of feature function determines whether a feature is categorical or continuous. There are 2 criteria for a feature to be called categorical: first, the feature is of string data type, and second, the number of categories for the feature is less than 10. Otherwise, the feature is said to be continuous.

The result, i.e. whether each feature is categorical or continuous based on the above criteria, acts as input to the potential splits function. This is because the potential splits function handles categorical and continuous data differently, as discussed above.
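
A possible version of this helper is sketched below. The function name, the `n_unique_values_threshold` parameter, and the sample data are assumptions. The sketch also assumes that meeting either criterion is enough for a feature to count as categorical, so that a numeric column with few values, such as Pclass, is handled as categorical.

```python
import pandas as pd

def determine_type_of_feature(df, n_unique_values_threshold=10):
    """Label each column as "categorical" or "continuous" (illustrative sketch)."""
    feature_types = {}
    for column in df.columns:
        values = df[column].dropna()
        is_string = len(values) > 0 and isinstance(values.iloc[0], str)
        few_values = values.nunique() < n_unique_values_threshold
        # Assumption: either criterion alone makes the feature categorical.
        if is_string or few_values:
            feature_types[column] = "categorical"
        else:
            feature_types[column] = "continuous"
    return feature_types

# Made-up sample: 12 rows so that Fare has more than 10 distinct values.
sample = pd.DataFrame({
    "Sex": ["male", "female"] * 6,
    "Pclass": [1, 2, 3] * 4,
    "Fare": [i * 7.25 for i in range(12)],
})
types = determine_type_of_feature(sample)
print(types)
```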

 

Decision Tree Algorithm:

Now that we have built all the helper functions, it’s time to build the decision tree algorithm.

The target Decision Tree should look like this:

target decision tree | Dimensionless

 

 

As shown in the above diagram, the Decision Tree consists of several sub-trees. Each sub-tree is a dictionary in which the “question” is the key and the value holds the two answers corresponding to that question, i.e. the yes answer and the no answer.

Representation of sub-tree is given as follows:

sub_tree = {"question": ["yes_answer", "no_answer"]}

The Decision Tree Algorithm is the main algorithm, which trains the model with the help of the helper functions we built previously. Firstly, the Train Test Split function is called, which divides the data set into train data and test data. Once the data is split, the Data Pure and Classify functions are called to check the purity of the data and classify the data if it is pure. Otherwise, the Potential Splits function gives all the potential splits for all variables. The overall entropy is calculated for each potential split, the potential split with the lowest overall entropy is selected, and the split data function splits the data into two parts. The same procedure is then repeated on each part. In this way, a sub-tree is built, and a series of sub-trees constitutes our target Decision Tree.
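
The recursive procedure can be condensed into the sketch below. It is a simplified illustration rather than the full implementation described here: it handles continuous features only, skips the train-test split, and every name in it (`build_tree`, `best_split`, `max_depth`, the question format "feature <= value") is an assumption.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def majority_class(labels):
    classes, counts = np.unique(labels, return_counts=True)
    return classes[counts.argmax()].item()

def best_split(X, y):
    """Return the (column, value) pair with the lowest overall entropy, or None."""
    best, best_entropy = None, np.inf
    for column in range(X.shape[1]):
        unique_values = np.unique(X[:, column])
        for value in (unique_values[:-1] + unique_values[1:]) / 2:
            mask = X[:, column] <= value
            weight = mask.mean()  # share of rows that go to the "yes" side
            current = weight * entropy(y[mask]) + (1 - weight) * entropy(y[~mask])
            if current < best_entropy:
                best, best_entropy = (column, float(value)), current
    return best

def build_tree(X, y, feature_names, depth=0, max_depth=3):
    """Build nested sub-trees of the form {question: [yes_answer, no_answer]}."""
    is_pure = len(np.unique(y)) == 1
    split = None if is_pure or depth == max_depth else best_split(X, y)
    if split is None:
        return majority_class(y)  # leaf: classify the data
    column, value = split
    mask = X[:, column] <= value
    question = f"{feature_names[column]} <= {value}"
    yes_answer = build_tree(X[mask], y[mask], feature_names, depth + 1, max_depth)
    no_answer = build_tree(X[~mask], y[~mask], feature_names, depth + 1, max_depth)
    if yes_answer == no_answer:
        return yes_answer  # both branches agree, so the question adds nothing
    return {question: [yes_answer, no_answer]}

# Made-up rows of [Fare, Age] with Survived labels.
X = np.array([[10.0, 30.0], [20.0, 25.0], [30.0, 40.0], [40.0, 35.0], [50.0, 20.0]])
y = np.array([0, 0, 1, 1, 1])
tree = build_tree(X, y, feature_names=["Fare", "Age"])
print(tree)  # {'Fare <= 25.0': [0, 1]}
```

The `max_depth` argument stops the recursion early, which is one simple way to limit the size of the tree.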

Once the model is trained on the training data set, the performance of the Decision Tree is verified on the test data set with the help of the classify function. The performance of the model is measured in terms of accuracy by the calculate accuracy function. The performance of the model can be improved by pruning the tree, for example by limiting its maximum depth. The accuracy of the model comes out to be 77%.
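
Classifying a passenger means walking the nested dictionary from the root question down to a leaf. The sketch below shows one way to do that and to compute accuracy. The function names, the question format "feature operator value" (which requires feature names without spaces), and the hand-written tree and passengers are assumptions for illustration, so the accuracy it prints is unrelated to the 77% reported above.

```python
def classify_example(example, tree):
    """Follow the questions in the nested dictionary until a leaf is reached."""
    if not isinstance(tree, dict):
        return tree  # a leaf holds the predicted class
    question = next(iter(tree))
    yes_answer, no_answer = tree[question]
    feature_name, operator, value = question.split(" ")
    if operator == "<=":   # continuous feature
        answer = example[feature_name] <= float(value)
    else:                  # "==" for a categorical feature
        answer = str(example[feature_name]) == value
    return classify_example(example, yes_answer if answer else no_answer)

def calculate_accuracy(examples, labels, tree):
    """Share of examples whose predicted class matches the true label."""
    predictions = [classify_example(example, tree) for example in examples]
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)

# Hand-written tree and made-up passengers, not results from the Titanic data.
tree = {"Sex == male": [0, {"Fare <= 25.0": [0, 1]}]}
examples = [
    {"Sex": "male", "Fare": 50.0},
    {"Sex": "female", "Fare": 10.0},
    {"Sex": "female", "Fare": 80.0},
    {"Sex": "male", "Fare": 8.0},
]
labels = [0, 0, 1, 1]
accuracy = calculate_accuracy(examples, labels, tree)
print(accuracy)  # 0.75
```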