A Hearty Welcome to You!
I am so thrilled to welcome you to the absolutely awesome world of data science. It is an interesting subject: sometimes difficult, sometimes a struggle, but always hugely rewarding at the end of your work. While data science is not as tough as, say, quantum mechanics, it is not high-school algebra either.
It requires knowledge of Statistics, some Mathematics (Linear Algebra, Multivariable Calculus, Vector Algebra, and of course Discrete Mathematics), Operations Research (Linear and Non-Linear Optimization and some more topics including Markov Processes), Python, R, Tableau, and basic analytical and logical programming skills.
Now, if you are new to data science, that last sentence might seem more like pure Greek than plain English. Don’t worry about it. If you are studying the Data Science course at Dimensionless Technologies, you are in the right place. This course covers practical working knowledge of all the topics listed above, distilled into a beginner-friendly form by the talented course material preparation team.
This course has turned ordinary people into skilled data scientists and landed them excellent placements, so my basic message is: don’t worry. You are in the right place, with the right people, at the right time.
What is Data Science?
To quote Wikipedia:
Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. Data science is the same concept as data mining and big data: “use the most powerful hardware, the most powerful programming systems, and the most efficient algorithms to solve problems.”
(Source: Wikipedia)
More Greek again, you might say.
Hence my definition:
Data Science is the art of extracting critical knowledge from raw data that provides significant increases in profits for your organization.
We are surrounded by data (Google ‘data deluge’ and you’ll see what I mean). More data has been created in the last two years than in the previous 5,000 years of human existence.
The companies that use all this data to gain insights into their business and optimize their processes will come out on top with the maximum profits in their market.
Companies like Facebook, Amazon, Microsoft, Google, and Apple (FAMGA), and every serious IT enterprise have realized this fact.
Hence the demand for talented data scientists.
I have much more to share with you on this topic, but to keep this article short, I’ll just share the links below which you can go through in your free time (everyone’s time is valuable because it is a strictly finite resource):
You can refer to:
and an excellent introductory article below.
An Introduction to Data Science:
Article Organization
Now as I was planning this article a number of ideas came to my mind. I thought I could do a textbook-like reference to the field, with Python examples.
But then I realized that true competence in data science doesn’t come when you read an article.
True competence in data science begins when you take the programming concepts you have learned, type them into a computer, and run them on your machine.
And then, of course, modify it, play with it, experiment, run single lines by themselves, and see for yourself how Python and R work.
That is how you fall in love with coding in data science.
At least, that’s how I fell in love with simple C coding, back in my undergraduate days in 2003. And then C++. And then Java. And then .NET. And then SQL and Oracle. And then… And then… And then… And so on.
If you want to know, I first started working in back-propagation neural networks in the year 2006. Long before the concept of data science came along! Back then, we called it artificial intelligence and soft computing. And my final-year project was coded by hand in Java.
Having come so far, what have I learned?
That it’s a vast, uncharted ocean out there.
The more you learn, the more you know, the more you become aware of how little you know and how vast the ocean is.
But we digress!
To get back to my point –
My final decision was to build a beginner project, explain it inside out, and give you source code that you can experiment with, play with, enjoy running, and modify here and there, referring to the documentation to see what everything in the code actually does.
Kaggle – Your Home For Data Science
If you are in the data science field, this site should be on your browser bookmark bar. Even in multiple folders, if you have them.
Kaggle is the go-to site for every serious machine learning practitioner. They hold data science competitions (which draw massive participation), have fantastic tutorials for beginners, and offer free source code released under the Apache license (see this link for more on the Apache open-source software license – don’t skip it, because as a data scientist, software licensing is something you must know about).
As I was browsing this site the other day, a kernel that was attracting a lot of attention and upvotes caught my eye.
This kernel is by a professional data scientist by the name of Fatma Kurçun from Istanbul (the funny-looking ç symbol is called c with cedilla and is pronounced with an s sound).
It was quickly clear why it was so popular: it was well-written, had excellent visualizations, and followed a clear, logical train of thought. Her professionalism shows in her work.
Since the code is released as open source under the Apache license, I have modified it quite a lot (a diff tool reports over 100 changes) to come up with the following Python classification example.
But before we dive into that, we need to know what a data science project entails and what classification means.
Let’s explore that next.
Classification and Data Science
Supervised classification basically means mapping data values to a category defined in advance. In the image above, we have a set of customers, each described by certain data values (records). One dot corresponds to one customer with around 10-20 fields.
Now, how do we ascertain whether a customer is likely to default on a loan, and which customer is likely to be a non-defaulter? This is an incredibly important question in finance! You can see where the word “classification” comes from here: we classify each customer into a defaulter (red dot) class (category) or a non-defaulter (green dot) class.
This problem is not solvable by standard methods: you cannot write down and analyze a closed-form solution with classical techniques. But with data science, we can approximate the function that models this problem and deliver a solution with an accuracy in the range of 90-95%. Quite remarkable!
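To make this concrete, here is a minimal sketch of what “mapping data values to a category” looks like in scikit-learn. The two features and the tiny customer dataset below are invented purely to illustrate the defaulter / non-defaulter idea; this is not the Kaggle code we will study later.

# A toy binary classification: predict defaulter (1) vs. non-defaulter (0).
# The feature names and values here are made up for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is one customer: [annual_income, outstanding_debt] (hypothetical units)
X = np.array([[12, 1], [3, 4], [15, 2], [2, 6], [9, 1], [4, 5]])
y = np.array([0, 1, 0, 1, 0, 1])   # 0 = non-defaulter (green), 1 = defaulter (red)

clf = LogisticRegression()
clf.fit(X, y)                      # learn the mapping from data values to category
print(clf.predict([[10, 2]]))      # classify a new customer -> most likely [0]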
Now, again we can have a blog article on classification alone, but to keep this article short, I’ll refer you to the following excellent articles as references:
Steps involved in a Data Science Project
A data science project is typically composed of the following components:
- Defining the Problem
- Collecting Data from Sources
- Data Preprocessing
- Feature Engineering
- Algorithm Selection
- Hyperparameter Tuning
- Repeat steps 4–6 until error levels are low enough.
- Data Visualization
- Interpretation of Results
I could explain each of these terms in detail, but for the sake of brevity I will simply ask you to refer to the following articles:
and:
Steps to perform data science with Python- Medium
At some time in your machine learning career, you will need to go through the article above to understand what a machine learning project entails (the bread-and-butter of every data scientist).
Jupyter Notebooks
To run the exercises in this section, we use a Jupyter notebook. The name Jupyter comes from Julia, Python, and R. The environment runs kernels for any of these languages and has an interactive format. It is widely used by data science professionals and is also good for collaboration and sharing work.
To know more about Jupyter notebooks, I can suggest the following article (read when you are curious or have the time):
Data Science Libraries in Python
The scikit-learn library is the standard Python library most commonly used in data science. Together with numpy, pandas, matplotlib, and sometimes seaborn, this toolset is known as the standard Python data science stack. To learn more, I can direct you to the scikit-learn documentation, which is excellent: the text is lucid and clear, and every page contains a working, live example as source code. Refer to the following links for more:
This last link is like a bible for machine learning in Python. And yes, it belongs on your browser bookmarks bar. Reading and applying these concepts and running and modifying the source code can help you go a long way towards becoming a data scientist.
And now, on to our project.
Our Problem Definition
This is the classification standard data science beginner problem that we will consider. To quote Kaggle.com:
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
From: Kaggle
We’ll be trying to predict a person’s category as a binary classification problem – survived or died after the Titanic sank.
So now, we go through this popular kernel’s source code, explaining every step.
Import Libraries
The lines given below are standard for nearly every Python data stack problem:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Pandas is the data frame manipulation library. NumPy is a vectorized implementation of matrix operations, optimized to run at high speed. Matplotlib is the visualization library typically used in this context. Seaborn is another visualization library, at a slightly higher level of abstraction than matplotlib.
The Problem Data Set
We read the CSV file:
train = pd.read_csv('../input/train.csv')
Exploratory Data Analysis
Now, if you’ve gone through the links in the ‘Steps involved in a Data Science Project’ section, you’ll know that real-world data is messy, has missing values, and often needs normalization to suit the requirements of the various scikit-learn algorithms. This CSV file is no different, as we see below:
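Before we look at the missing data specifically, it is worth a quick first glance at the raw data frame. A minimal inspection sketch, assuming the train variable loaded above:

# Quick first look at the data (assumes `train` from pd.read_csv above).
print(train.head())        # first five rows
train.info()               # column types and non-null counts (prints directly)
print(train.describe())    # summary statistics for the numeric columns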
Missing Data
This line uses seaborn to create a heatmap of our data set which shows the missing values:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Output:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3b5ed98ef0>
Interpretation
The yellow bars indicate missing data. From the figure, we can see that a fifth of the Age data is missing. And the Cabin column has so many missing values that we should drop it.
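To back up the visual impression from the heatmap with actual numbers, here is a small sketch (my addition, not part of the original kernel):

# Fraction of missing values per column, to quantify what the heatmap shows.
missing_fraction = train.isnull().mean().sort_values(ascending=False)
print(missing_fraction)    # on the standard train.csv: Cabin ~0.77, Age ~0.20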
Graphing the Survived vs. the Deceased in the Titanic shipwreck:
sns.set_style('whitegrid')
sns.countplot(x='Survived',data=train,palette='RdBu_r')
Output:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3b54fe2390>
As we can see, in the sample contained in train.csv, more than 500 people lost their lives and fewer than 350 survived.
When we graph Gender Ratio, this is the result.
sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Sex',data=train,palette='RdBu_r')
Output
<matplotlib.axes._subplots.AxesSubplot at 0x7f3b54f49da0>
Over 400 men died and around 100 survived. Fewer than a hundred women died, and around 230 survived. Clearly, there is an imbalance here, as we would expect.
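If you want the exact numbers behind this countplot, a short sketch (not part of the original kernel) tabulates survival by sex:

# Survival counts and rates broken down by sex (same data as the countplot).
print(pd.crosstab(train['Sex'], train['Survived']))
print(train.groupby('Sex')['Survived'].mean())   # roughly 0.74 for female, 0.19 for male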
Data Cleaning
The missing Age values can be filled in using the average age for each passenger class (Pclass). This has to be done because the classification algorithm cannot handle missing values and will fail if they are present. The boxplot below shows the age distribution per class:
plt.figure(figsize=(12, 7))
sns.boxplot(x='Pclass',y='Age',data=train,palette='winter')
Output
<matplotlib.axes._subplots.AxesSubplot at 0x7f3b54d132e8>
We use these per-class average values to impute the missing ages (impute is a fancy word for filling in missing data values in a way that lets the algorithm run without distorting its performance).
def impute_age(cols):
    Age = cols[0]
    Pclass = cols[1]
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age
train['Age'] = train[['Age','Pclass']].apply(impute_age,axis=1)
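As an aside, the hard-coded 37, 29, and 24 come from reading the boxplot above. A more general alternative (a sketch of my own, to be run instead of the two cells above, not in addition to them) computes the per-class medians directly from the data:

# Alternative imputation: fill missing ages with the median age of each
# passenger class, computed from the data instead of hard-coded values.
train['Age'] = train['Age'].fillna(
    train.groupby('Pclass')['Age'].transform('median')
)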
Missing values heatmap:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Output:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3b54a0d0b8>
We drop the Cabin column since it’s mostly empty.
train.drop('Cabin',axis=1,inplace=True)
We convert categorical features like Sex and Embarked to dummy variables using pandas, so that the algorithm runs properly (it requires the data to be numeric).
train.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 11 columns):
PassengerId    889 non-null int64
Survived       889 non-null int64
Pclass         889 non-null int64
Name           889 non-null object
Sex            889 non-null object
Age            889 non-null float64
SibSp          889 non-null int64
Parch          889 non-null int64
Ticket         889 non-null object
Fare           889 non-null float64
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(4)
memory usage: 83.3+ KB
More Data Preprocessing
We use one-hot encoding to convert the categorical attributes to numerical equivalents. One-hot encoding is yet another data preprocessing method, and it comes in several forms. For more information on it, see the link.
sex = pd.get_dummies(train['Sex'],drop_first=True)
embark = pd.get_dummies(train['Embarked'],drop_first=True)
train.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True)
train = pd.concat([train,sex,embark],axis=1)
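To see what this step actually did, here is a quick inspection sketch (my addition): the string columns are gone, and 'male', 'Q', and 'S' are new 0/1 dummy columns.

# Peek at the encoded frame: all columns should now be numeric.
print(train.head())
print(train.dtypes)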
Finally, we check the heatmap of features again:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Output
<matplotlib.axes._subplots.AxesSubplot at 0x7f3b54743ac8>
No missing data and all text converted accurately to a numeric representation means that we can now build our classification model.
Building a Gradient Boosted Classifier model
Gradient Boosted Classification Trees are a type of ensemble model that has consistently accurate performance over many dataset distributions.
I could write another blog article on how they work, but for brevity I’ll just provide these two reference links:
We split our data into a training set and test set.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    train.drop('Survived',axis=1), train['Survived'],
    test_size=0.10, random_state=0)
Training:
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier()
model.fit(X_train,y_train)
Output:
GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='auto',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)
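The project-steps list above mentioned hyperparameter tuning. The kernel uses the default settings, but if you want to experiment, here is a minimal sketch using GridSearchCV; the parameter grid below is just an example, not the settings used in the original kernel.

# A sketch of hyperparameter tuning for the model above (example grid only).
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.05, 0.1],
    'max_depth': [2, 3, 4],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)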
Predicting:
predictions = model.predict(X_test)
predictions
Output
array([0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0])
Performance
The performance of a classifier can be assessed in a number of ways. Again, to keep this article short, I’ll link to pages that explain the confusion matrix, the classification report function of scikit-learn, and classification metrics in general:
Confusion Matrix
A wonderful article by one of our most talented writers. Skip to the section on the confusion matrix and classification accuracy to understand what the numbers below mean.
For a more concise, mathematical and formulaic description, read here
from sklearn.metrics import classification_report,confusion_matrix

print(confusion_matrix(y_test,predictions))
[[89 16]
 [29 44]]
So as not to make this article too disjointed, let me at least explain the confusion matrix to you.
In scikit-learn’s confusion matrix, the rows correspond to the actual classes and the columns to the predicted classes, with class 0 (deceased) listed first. Treating “deceased” as the positive class, the matrix has the following form:
[[ TP FN ]
[ FP TN ]]
The abbreviations mean:
TP – True Positive – The model correctly classified this person as deceased.
FN – False Negative – The model incorrectly classified a deceased person as a survivor.
FP – False Positive – The model incorrectly classified a survivor as deceased.
TN – True Negative – The model correctly classified this person as a survivor.
So, in this model published on Kaggle, there were:
89 True Positives
16 False Negatives
29 False Positives
44 True Negatives
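You can recover the headline metrics directly from these four counts. A small sketch (my addition, treating “deceased” as the positive class, as above):

# Recompute the basic metrics from the confusion-matrix counts above,
# treating "deceased" (class 0) as the positive class.
TP, FN, FP, TN = 89, 16, 29, 44

accuracy  = (TP + TN) / (TP + TN + FP + FN)   # about 0.75
precision = TP / (TP + FP)                    # about 0.75
recall    = TP / (TP + FN)                    # about 0.85
print(accuracy, precision, recall)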
Classification Report
You can refer to the link here to learn everything you need to know about the classification report.
print(classification_report(y_test,predictions))
              precision    recall  f1-score   support

           0       0.75      0.85      0.80       105
           1       0.73      0.60      0.66        73

   micro avg       0.75      0.75      0.75       178
   macro avg       0.74      0.73      0.73       178
weighted avg       0.75      0.75      0.74       178
So this model, using Gradient Boosted Classification Trees, has an overall (weighted average) precision of about 75% (the original kernel used Logistic Regression).
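One caveat: a single train/test split gives a somewhat noisy estimate. If you want a more stable figure, k-fold cross-validation averages over several splits. A sketch (my addition, not in the kernel), run on the cleaned training frame:

# Cross-validated accuracy estimate for the same model family.
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier

X = train.drop('Survived', axis=1)
y = train['Survived']
scores = cross_val_score(GradientBoostingClassifier(random_state=0), X, y, cv=5)
print(scores.mean(), scores.std())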
Wrap-Up
I have attached the dataset and the Python program to this document; you can download them by clicking on the links below. Run the code, play with it, manipulate it, and consult the scikit-learn documentation. As a starting point, you should at least:
- Use other algorithms (say, LogisticRegression or RandomForestClassifier at the very least)
- Refer to the following link for classifiers to use: Sections 1.1 onwards – every algorithm with ‘Classifier’ at the end of its name can be used; that’s roughly 30-50 models!
- Try to compare the performance of different algorithms (see the sketch after this list)
- Try to combine the performance comparison into one single program, but keep it modular.
- Make a list of the names of the classifiers you wish to use, apply them all and tabulate the results. Refer to the following link:
- Use XGBoost instead of Gradient Boosting
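Here is the comparison sketch promised in the list above: loop over a few classifiers and tabulate their test-set accuracy. The model list is just an example; swap in whichever classifiers you like from the scikit-learn documentation.

# Compare several classifiers on the same train/test split (example models only).
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=0),
    'Gradient Boosting': GradientBoostingClassifier(random_state=0),
}

for name, clf in models.items():
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f'{name}: {acc:.3f}')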
Titanic Training Dataset (here used for training and testing):
Address of my GitHub Public Repo with the Notebook and code used in this article:
Clone with Git (use TortoiseGit for simplicity rather than the command-line) and enjoy.
To use Git, take the help of a software engineer or developer who has worked with it before. I’ll try to cover the relevance of Git for data science in a future article.
But for now, refer to the following article here
You can install Git from Git-SCM and TortoiseGit
To clone,
- Install Git and TortoiseGit (the latter only if necessary)
- Open the command line with Run… cmd.exe
- Create an empty directory.
- Copy and paste the following command into the command prompt, press Enter, and watch the magic: git clone https://github.com/thomascherickal/datasciencewithpython-article-src.git
Use Anaconda (a common data science development environment with Python, R, Jupyter, and much more) for best results.
Cheers! All the best on your wonderful new adventure of beginning and exploring data science!