Univariate Analysis – A Key to the Mystery Behind Data!


Exploratory Data Analysis, or EDA, is the stage of data handling where the data is studied intensively and its many dimensions are explored. EDA literally helps to unfold the mystery behind data which might not make sense at first glance. With detailed analysis, however, the same data can provide insights that help drive large-scale business decisions with excellent accuracy. This not only helps businesses evade likely pitfalls in the future but also helps them leverage the best opportunities that might emerge.

 

EDA employs three primary statistical techniques to go about this exploration:

  • Univariate Analysis
  • Bivariate Analysis
  • Multivariate Analysis

Univariate, as the name suggests, means 'one variable'. Univariate analysis studies one variable at a time to help us formulate conclusions about things such as:

  • Outlier detection
  • Concentrated points
  • Pattern recognition
  • Required transformations

 

In order to understand these points, we will take up the iris dataset, which ships with fundamental Python libraries like scikit-learn.

The iris dataset is a very simple dataset and consists of just 4 measurements of iris flowers: sepal length and width, and petal length and width (all in centimeters). The objective of this dataset is to identify the type of iris plant a flower belongs to. There are three such categories: Iris Setosa, Iris Versicolour, and Iris Virginica.

So let’s dig right in then!

 

1. Description Based Analysis

 

The purpose of this stage is to get an initial idea about each variable independently. This helps to identify irregularities and probable patterns in the variables. Python's pandas library helps to execute this task with extreme ease, using literally just one line of code.


Code:

from sklearn import datasets
data = datasets.load_iris()  #load the iris dataset

The iris dataset is in dictionary format and thus needs to be changed to data frame format so that the pandas library can be leveraged.

We will store the independent variables in ‘X’. ‘data’ will be extracted and converted as follows:

import pandas as pd

X = data['data']  #extract

X = pd.DataFrame(X) #convert

On conversion to the required format, we just need to run the following code to get the desired information:

X.describe() #One simple line to get the entire description for every column

Output:

[Table: X.describe() output showing count, mean, std, min, 25%, 50%, 75%, and max for every column]

 

  • Count refers to the number of records under each column.
  • Mean gives the average of all the samples combined. Note that the mean is highly affected by outliers and skewed data; we will soon see how to detect skewed data just with the help of the above information.
  • Std, or standard deviation, is the measure of the “spread” of data in simple terms. With the help of std we can understand whether a variable has values populated closely around the mean or distributed over a wide range.
  • Min and Max give the minimum and maximum values of the columns across all records/samples.

 

25%, 50%, and 75% constitute the most interesting bit of the description. These percentiles give the value at or below which the respective percentage of records fall. They can be interpreted in the following way:

  1. 25% of the flowers have sepal length equal to or less than 5.1 cm.
  2. 50% of the flowers have a sepal width equal to or less than 3.0 cm and so on.

50% is also interpreted as the median of the variable. It represents the central value of the data. For example, if a variable has values between 1 and 100 and its median is 80, it means that a lot of data points are inclined towards higher values. In simpler terms, 50% or half of the data points have values greater than or equal to 80.

By comparing the mean and the median, one can conclude whether the data is skewed: if the difference between them is high, the distribution is skewed, and if it is almost negligible, it is indicative of a normal distribution.
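As a quick check, pandas can compute this comparison directly. A minimal sketch, assuming X is the data frame created earlier:

print(X.mean() - X.median())  #a large gap for a column hints at skew
print(X.skew())  #pandas also provides a direct skewness measure per column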

These measures work well with continuous variables like the ones mentioned above. However, for categorical variables, which have distinct values, such a description seldom makes any sense. For instance, the mean of a categorical variable would barely be of any value.

 

For such cases, we use yet another pandas operation called ‘value_counts()’. The usability of this function can be demonstrated through our target variable ‘y’. y was extracted in the following manner:

y = data['target'] #extract

This is done since the iris dataset is in dictionary format and stores the target variable in a list corresponding to the key named 'target'. After the extraction is completed, convert the data into a pandas Series, since the function value_counts() is only applicable to pandas Series.

y = pd.Series(y) #convert


y.value_counts()

On applying the function, we get the following result:

Output:

2    50

1    50

0    50

dtype: int64

 

This means that the categories '0', '1' and '2' each have an equal count of 50. This equal representation means that there will be minimal bias during training. If, for example, the data had more records representing one particular category 'A', the trained model would learn that category 'A' is the most recurrent and would tend to predict new records as category 'A'. When unequal representations are found, one of the following approaches can be used (a minimal sketch of the second follows the list):

  • Gather more data
  • Generate samples
  • Eliminate samples
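For instance, one simple way to generate samples is to randomly oversample the under-represented class. A minimal sketch, assuming scikit-learn's resample utility and a hypothetical data frame df with a 'label' column:

import pandas as pd
from sklearn.utils import resample

minority = df[df['label'] == 'A']  #records of the under-represented class (df is hypothetical)
oversampled = resample(minority, replace=True, n_samples=50, random_state=0)  #sample with replacement
df_balanced = pd.concat([df, oversampled])  #append the duplicated minority rows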

Now let us move on to visual techniques to analyze the same data, but reveal further hidden patterns!

 

2. Visualization Based Analysis

 

Even though a descriptive analysis is highly informative, it does not quite furnish details about the patterns that might arise in a variable. From the difference between the mean and median we may be able to detect the presence of skewed data, but we will not be able to pinpoint the exact reason for the skewness. This is where visualizations come into the picture and help us interpret the myriad patterns that might arise in the variables independently.

Let's start by observing the frequency distribution of sepal width in our dataset.
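The plot below can be reproduced with matplotlib. A minimal sketch, assuming X is the data frame created earlier (in the raw iris data, column 1 holds sepal width):

import matplotlib.pyplot as plt

sepal_width = X[1]  #column 1 holds sepal width
plt.hist(sepal_width, bins=20)  #bin count is an arbitrary choice
plt.axvline(sepal_width.mean(), color='black', linestyle='--', label='mean')
plt.axvline(sepal_width.median(), color='red', linestyle='--', label='median')
plt.legend()
plt.show()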

[Histogram: frequency distribution of sepal width]

Std: 0.435
Mean: 3.057
Median (50%): 3.000

 

The red dashed line represents the median and the black dashed line represents the mean. As you may have observed, the standard deviation of this variable is the smallest of the four. Also, the difference between the mean and the median is insignificant. This means that the data points are concentrated around the median, and the distribution is not skewed. In other words, it is a nearly Gaussian (or normal) distribution. This is what a Gaussian distribution looks like:

[Figure: Normal distribution generated from random data]

 

The data for the above distribution was generated using the random.normal function of the NumPy library (one of the Python libraries for handling arrays and lists).
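A minimal sketch of how such a distribution can be generated and plotted (sample size and bin count are arbitrary choices):

import numpy as np
import matplotlib.pyplot as plt

samples = np.random.normal(loc=0.0, scale=1.0, size=10000)  #mean 0, std 1
plt.hist(samples, bins=50)
plt.title('Normal distribution generated from random data')
plt.show()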

It should always be one's aim to achieve a Gaussian distribution before applying modeling algorithms. This is because the most recurrent distribution in real-life scenarios is the Gaussian curve, and algorithms have accordingly been designed over the years to cater mostly to this distribution, assuming beforehand that the data will follow a Gaussian trend. The way to handle data that deviates from this is to transform the distribution accordingly.
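For example, a common transformation for right-skewed data is the log transform. A minimal sketch, using synthetic skewed data for illustration:

import numpy as np

skewed = np.random.exponential(scale=2.0, size=1000)  #synthetic right-skewed data
transformed = np.log1p(skewed)  #log(1 + x) pulls in the long right tail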

Let us visualize the other variables and understand what the distributions mean.

Sepal Length:

[Histogram: frequency distribution of sepal length]

Std: 0.828
Mean: 5.843
Median: 5.80

 

As is visible, the distribution of sepal length spans a wide range of values (4.3 cm to 7.9 cm), and thus the standard deviation for sepal length is higher than that of sepal width. Also, the mean and median have an almost insignificant difference between them, which clarifies that the data is not skewed. However, here visualization comes to great use: we can clearly see that the distribution is not perfectly Gaussian, since the tails hold ample data, whereas in a Gaussian distribution only about 5% of the data lies in the tails (beyond two standard deviations from the mean).

Petal Length:

[Histogram: frequency distribution of petal length]

Std: 1.765
Mean: 3.758
Median: 4.350

This is a very interesting graph, since we find an unexpected gap in the distribution. This can mean either that data is missing in that range or that the feature simply never takes those values; in other words, that the petal lengths of iris plants never fall in the range 2 cm to 3 cm. The mean is thus justifiably pulled towards the left, while the median shows the central value of the variable, which lies towards the right, since most of the data points are concentrated in a Gaussian-like curve on the right. If you move on to the next visual and observe the pattern of petal width, you will come across an even more interesting revelation.

 

Petal Width:

[Histogram: frequency distribution of petal width]

Std: 0.762
Mean: 1.199
Median: 1.3

In the case of petal width, values are almost (but not completely) absent in the same relative region as in the petal length diagram: here, the range 0.5 cm to 1.0 cm is nearly empty. A repeated low frequency in the same relative region of two different distributions indicates that the data is missing, and it also confirms that petals of the missing sizes do occur in nature but went unrecorded.

This conclusion can be followed up with further data gathering, or one can simply continue to work with the limited data present, since it is not always possible to gather data representing every element of a given subject.

In summary, using histograms we came to know about the following:

  • Data distribution/pattern
  • Skewed distribution or not
  • Missing data

Now with the help of another univariate analysis tool, we can find out if our data is inlaid with anomalies or outliers. Outliers are data points which do not follow the usual pattern and have unpredictable behavior. Let us find out how to find outliers with the help of simple visualizations!

We will use a plot called the Box plot to identify the features/columns which are inlaid with outliers.

[Figure: Box plot for the iris dataset]
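A plot like this can be generated directly from the data frame. A minimal sketch, assuming X is the data frame built earlier; the column names are assigned here only for readability:

import matplotlib.pyplot as plt

X.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X.boxplot()  #pandas draws one box per column
plt.show()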

 

The box plot is a visual representation of five important aspects of a variable, namely:

  • Minimum
  • Lower Quartile
  • Median
  • Upper Quartile
  • Maximum

As can be seen from the above graph, each variable is divided into four parts using three horizontal lines, and each section contains approximately 25% of the data. The area enclosed by the box covers the central 50% of the data, and the horizontal green line represents the median. One can identify an outlier as any point spotted beyond the maximum and minimum whisker lines.

From the plot, we can say that sepal_width has outlying points. These points can be handled in two ways:

  • Discard the outliers
  • Study the outliers separately

Sometimes outliers are imperative bits of information, especially in cases where anomaly detection is a major concern. For instance, during the detection of fraudulent credit card behavior, detection of outliers is all that matters.
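Either way, the outlying points flagged by the box plot can also be identified in code using the interquartile range (IQR) rule, the same 1.5 * IQR convention most box plot implementations use for the whiskers. A minimal sketch for the sepal width column, assuming the column names assigned earlier:

col = X['sepal_width']
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
outliers = col[(col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)]  #points beyond the whiskers
print(outliers)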

 

Conclusion

 

Overall, EDA is a very important step and requires a lot of creativity and domain knowledge to dig up the maximum number of patterns from the available data. Keep following this space to know more about bivariate and multivariate analysis techniques. It only gets interesting from here on!

Follow this link if you are looking to learn data science online!

You can follow this link for our Big Data course, which is a step further into advanced data analysis and processing!

Additionally, if you are interested in learning Data Science, click here to start the Online Data Science Course

Furthermore, if you want to read more about data science, read our Data Science Blogs

 

Learn Data Science with the Best Available Free Courses Online


Data Scientist Training Free of Charge

Now, in theory, it is possible to become a data scientist without paying a dime. What we want to do in this article is to list the best of the best options for learning what you need to know to become a data scientist. Many articles offer 4-5 courses under each heading. What I have done is to search through the Internet, covering all the free courses available, and choose the single best course for each topic.

These courses have been carefully curated and offer the best possible options if you're learning for free. However, there's a caveat, an interesting twist to this entire story. Interested? Read on! And please make sure you complete the full article.

Topics For A Data Scientist Course

The basic topics that a data scientist needs to know are:

  1. Machine Learning Theory and Applications
  2. Python Programming
  3. R Programming
  4. SQL
  5. Statistics & Probability
  6. Linear Algebra
  7. Calculus Basics (short)
  8. Machine Learning in Python
  9. Machine Learning in R
  10. Tableau

So let’s get to it. Here is the list of the best possible options to learn every one of these topics, carefully selected and curated.

 

Machine Learning – Stanford University – Andrew Ng (audit option)

Machine Learning Course From Stanford University


The world-famous machine learning course with the highest rating of all the MOOCs on Coursera, from Andrew Ng, a giant in the ML field who is now famous worldwide as an online instructor. It uses MATLAB/Octave. From the website:

This course provides a broad introduction to machine learning, data mining, and statistical pattern recognition. Topics include:

(i) Supervised learning (parametric/non-parametric algorithms, support vector machines, kernels, neural networks)

(ii) Unsupervised learning (clustering, dimensionality reduction, recommender systems, deep learning)

(iii) Best practices in machine learning (bias/variance theory; innovation process in machine learning and AI)

The course will also draw from numerous case studies and applications, so that you’ll also learn how to apply learning algorithms to building smart robots (perception, control), text understanding (web search, anti-spam), computer vision, medical informatics, audio, database mining, and other areas.

This course is extremely effective and has many benefits. However, you will need high levels of self-discipline and self-motivation. Reported MOOC completion statistics suggest that around 90% of those who sign up without a classroom or group environment never complete the course.

 

Learn Python The Hard Way – Zed Shaw – Free Online Access

 


 

Learn Python The Hard Way Online Access

You may ask me: why would I want to learn the hard way? Shouldn't we learn the smart way and not the hard way? Don't worry. This ebook, online course, and website together form a highly popular way to learn Python. OK, so it says 'the hard way'. Well, the only way to learn how to code is to practice what you have learned, and this course integrates practice with learning. With other Python books, you have to take the initiative to practice on your own.

Here, this book shows you what to practice and how to practice it. There is only one con: although this is the best self-driven method, most people will not complete all of it. The main reason is that there is no external instructor for supervision and no group environment to motivate you. However, if you want to learn Python by yourself, then this is the best way, though not the optimal one, as you will see at the end of this article, since the cost of the book is $30 USD (approx. 2,100 INR).

Interactive R and Data Science Programming – SwiRl

Interactive R and Data Science Course (In Console)

 


 

Swirlstats is a wonderful tool to learn R and data science scripting in R interactively and intuitively, by teaching you R commands from within the R console. This might seem like a very simple tool, but as you use it, you will notice its elegance in teaching you how to express yourself in R, the finer nuances of the language, and its integration with the console and the tidyverse. This is a powerful method of learning R and, what is more, it is also a lot of fun!

Descriptive and Inferential Statistics

Course on Statistics and Probability from KhanAcademy

 


KhanAcademy is a free non-profit organization on a mission: to provide a world-class education to anyone, regardless of where they may be in the world. And they're doing a fantastic job! This course has been cited in several very high-profile blogs and Quora posts as the best online course for statistics, period. What is more, it is extremely high quality, suitable for beginners, and free! This organization is doing wonderful work. More power to them!

Mathematics for Data Science

Now, the basic mathematics content for data science includes linear algebra, single-variable and multivariable calculus (selected topics), discrete mathematics, and the basics of differential equations. You could take all of these topics separately on KhanAcademy, and that is a good option for Linear Algebra and Multivariate Calculus (in addition to Statistics and Probability).

For Linear Algebra, the KhanAcademy course covering what you need to know is linked below:

Course on Linear Algebra From KhanAcademy


 

For Multivariate Calculus

Course on MultiVariate Calculus From KhanAcademy


These courses are completely free and very accessible to beginners.

Discrete Mathematics

This topic deserves a section to itself, because discrete mathematics is the foundation of all computer science. There are a variety of options available to learn discrete mathematics, from ebooks to MOOCs, but today we'll focus on the best possible option. MIT (the Massachusetts Institute of Technology) is known as one of the best colleges in the world, and it has an open courseware initiative known as MIT OpenCourseWare (MIT OCW). These are actual videos of the lectures attended by students at one of the best engineering colleges in the world. You will benefit a lot if you follow the lectures at the link below; they present all the basic concepts as clearly as possible. The material is a bit technical, because it is aimed mostly at students at an advanced level. The link is given below:

MIT OpenCourseWare Course: Mathematics for Computer Science


For beginners, one slightly less technical option is the following course:

Course on Discrete Mathematics for Computer Science

It is also technical and from MIT but might be a little more accessible than the earlier option.

SQL

SQL (pronounced 'sequel'), or Structured Query Language, is a must-learn if you are a data scientist. You will be working with a lot of databases, and SQL is the language used to query and retrieve data from database systems like Oracle and Microsoft SQL Server. The best free course I could find online is undoubtedly the one below:

Udemy Course for SQL Beginners

 


SQL For Newcomers – A Free Crash Course from Udemy.com.

5 hours-plus of every SQL command and concept you need to know. And – completely free.

Machine Learning with Scikit-Learn

 


 

Scikit-Learn Online Documentation Main Page

We have covered Python, R, machine learning using MATLAB, data science with R (SwiRl teaches data science as well), statistics, probability, linear algebra, and basic calculus. Now we just need a course for data science with Python, and we are done! I looked at many options but was not satisfied, so instead of a course, I have provided you with a link to the scikit-learn documentation. Why?

Because it is as good as an online course by itself. If you read through the main sections, copy the code (Ctrl-C, Ctrl-V), execute it in an Anaconda environment, and then play around with it, experiment, observe, and read up on what every line does, you will already know how to solve standard textbook problems. I recommend the following order (a sample snippet of this workflow appears after the list):

  1. Classification
  2. Regression
  3. Clustering
  4. Preprocessing
  5. Model Evaluation
  6. 5 classification examples (execute)
  7. 5 regression examples (run them)
  8. 5 clustering examples (ditto)
  9. 6 sample preprocessing functions
  10. Dimensionality Reduction
  11. Model Selection
  12. Hyperparameter Tuning
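As an illustration of the kind of snippet you would copy and run, here is a minimal classification example in the spirit of the scikit-learn documentation (the classifier and its parameters are arbitrary choices for demonstration):

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

model = KNeighborsClassifier(n_neighbors=3)  #a simple baseline classifier
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))  #held-out accuracy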

Machine Learning with R

 


 

Online Documentation for Machine Learning in R with Tidyverse

This book is free to read online. Get the data files, get the script files, use RStudio, and, just as with Python: play, enjoy, experiment, execute, and explore. A little hard work will have you up and running with R in no time! But make sure you try as many code examples as possible. The libraries you can focus on are:

  1. dplyr (data manipulation)
  2. tidyr (data preprocessing “tidying”)
  3. ggplot2 (graphical package)
  4. purrr (functional toolkit)
  5. readr (reading rectangular data files easily)
  6. stringr (string manipulation)
  7. tibble (dataframes)

Tableau

To make it short, simple, and sweet, since we have already covered SQL and this content is for beginners, I recommend the following course:

Udemy Course on Tableau for Beginners

This is a course on Udemy rated 4.2/5 and completely free. You will learn everything you need to work with Tableau (the most commonly used corporate-level visualization tool). This is an extremely important part of your skill set: you can produce the greatest analyses, but if you don't visualize them well, management will never buy into your machine learning solution, and neither will anyone who doesn't know the technical details of ML (which is a large set of people on this planet). Visualization is important. Please make sure to learn at least the basics of Tableau.


 

Kaggle Micro-Courses (Add-Ons – Short Concise Tutorials)

Kaggle Micro-Courses (from www.kaggle.com!)


 

Kaggle Learn Home Page

Kaggle is a wonderful site to practice your data science skills, and recently they have added a set of hands-on courses to learn practical data science. And, if I do say so myself, it's brilliant: very nicely presented, superb examples, clear and concise explanations. And of course, you will cover more than we discussed earlier. If you read through all the courses discussed so far in this article and then do the courses at Kaggle.com, you will have spent your time wisely (though not optimally, as we shall see).

Kaggle Learn


Dimensionless Technologies

 


 

Now, if you are reading this article, you might have a fundamental question. This is a blog of a company that offers courses in data science, deep learning, and cloud computing. Why would we want to list all our competitors and publish it on our site? Isn’t that negative publicity?

Quite the opposite. 

This is the caveat we were talking about.

Our course is a better solution than every single option given above!

We have nothing to hide.

And we have an absolutely brilliant top-class product.

Every option given above is a separate course by itself.

And they all suffer from a very prickly problem – you need to have excellent levels of discipline and self-motivation to complete just one of the courses above – let alone all ten.

 

You also have no classroom environment, no guidance for doubts and questions, and you need to know the basics about programming.

Our product is the most cost-effective option on the market for learning data science, as well as the most effective methodology for everyone: every course is conducted live in a classroom environment from the comfort of your home. You can work at a standard job, spend two hours on the Internet every day, do extra work and reading on weekends, and become a professional data scientist in 6 months' time.

We also have personalized GitHub project portfolio creation, management, and faculty guidance. Not to mention individual attention for each student.

And IITians for faculty who also happen to have 9+ years of industry experience.

So when we say that our product is the best on the market, we really mean it, because of the live classroom teaching of the sessions, which no other option on the Internet today offers.

 

Am I kidding? Absolutely not. And you can get started with the Dimensionless Technologies Data Science with Python and R course for just 70-odd USD, which is the most cost-effective option on the market!

And unlike the 10 courses and resources detailed above, you just need to do one single course, which extracts the essence of all that you need to know as a data scientist. And yes, we cover:

  1. Machine Learning
  2. Python Programming
  3. R Programming
  4. SQL
  5. Statistics & Probability
  6. Linear Algebra
  7. Calculus Basics
  8. Machine Learning in Python
  9. Machine Learning in R
  10. Tableau
  11. GitHub Personal Project Portfolio Creation
  12. Live Remote Daily Sessions
  13. Experts with Industrial Experience
  14. A Classroom Environment (to keep you motivated)
  15. Individual Attention to Every Student

I hope this information has you seriously interested. Please sign up for the course – you will not regret it.

And we even have a two-week trial for you to experience the course for yourself.

Choose wisely and optimally.

Unleash the data scientist within!

 

An excellent general article on emerging state-of-the-art technology, AI, and blockchain:

The Exciting Future with Blockchain and Artificial Intelligence

For more on data science, check out our blog:

Blog

And of course, enjoy machine learning!