
MULTI-VARIATE ANALYSIS

WHY DO MULTI-VARIATE ANALYSIS

  1. Every data set comprises multiple variables, so we need to understand how these variables interact with each other.
  2. After uni-variate analysis – where we understand the behaviour of each distribution – and bi-variate analysis – where we understand how each variable relates to the others – we need to understand how the trend changes when more variables are introduced.
  3. Multi-variate analysis has wide application in clustering, where we need to visualize how multiple variables show different patterns in different clusters.
  4. When there are too many inter-correlated variables in the data, we’ll have to do dimensionality reduction through techniques like Principal Component Analysis and Factor Analysis. We will cover dimensionality reduction techniques in a different post.

We will illustrate multi-variate analysis with the following case study:

Data:

Each row corresponds to the annual spending of a customer of a wholesale distributor who sells fresh produce, milk, grocery, frozen products, detergents_paper and delicassen in 3 different regions – Lisbon, Oporto and Others (coded 1/2/3 respectively) – through 2 different channels – Horeca (Hotel / Restaurant / Cafe) or Retail (coded 1/2 respectively).

PROCEDURE TO ANALYZE MULTIPLE VARIABLES

I. TABLES

Tables can be generated using the xtabs, tapply and aggregate functions, and the dplyr library.

To get the spending on milk channel-wise and region-wise, using the xtabs function:
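
A minimal sketch, assuming the data has been read into a data frame named data with columns Channel, Region and Milk:

xtabs(Milk ~ Channel + Region, data = data)   # total Milk spend per Channel x Region combination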

To get percentage spending
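
For example, using prop.table on the same table (milk_tab is an assumed intermediate name):

milk_tab <- xtabs(Milk ~ Channel + Region, data = data)
round(prop.table(milk_tab) * 100, 2)          # each cell as a % of total Milk spend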

To get the percentage spending on grocery channel-wise and region-wise, using the aggregate function:
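
A possible sketch (groc is an assumed name):

groc <- aggregate(Grocery ~ Channel + Region, data = data, FUN = sum)
groc$Percent <- round(groc$Grocery / sum(groc$Grocery) * 100, 2)
groc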

To get the percentage spending on frozen channel-wise and region-wise, using the tapply function:
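
First the channel-wise and region-wise totals (frozen_tab is an assumed name):

frozen_tab <- tapply(data$Frozen, list(data$Channel, data$Region), sum)
frozen_tab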

Percentage spending
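
Then converting the totals to percentages:

round(frozen_tab / sum(frozen_tab) * 100, 2)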

To get the percentage spending on Detergents_Paper channel-wise and region-wise, using the dplyr library:
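
A sketch with dplyr:

library(dplyr)
data %>%
  group_by(Channel, Region) %>%
  summarise(Detergents_Paper = sum(Detergents_Paper)) %>%
  ungroup() %>%
  mutate(Percent = round(Detergents_Paper / sum(Detergents_Paper) * 100, 2))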

II. STATISTICAL TESTS

Anova

ANOVA can be used to understand how a continuous variable depends on categorical independent variables.

In the following code we are trying to understand if the sales of milk are a function of Region and Channel and their interaction.
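
A sketch of such a two-way ANOVA (milk_aov is an assumed name):

milk_aov <- aov(Milk ~ factor(Region) * factor(Channel), data = data)
summary(milk_aov)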

The output shows that expenditure on milk is dependent on the channel.

Chi-Square Test

A Chi-Square test can be used to understand the association between two factor variables:
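
For example:

chisq.test(table(data$Channel, data$Region))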

The p-value is very high, 11.37%, hence we fail to reject the null hypothesis and conclude that there is no association between channel and region.

III. CLUSTERING

Multi-Variate analysis has a very wide application in unsupervised learning. Clustering has the maximum applications of multi-variate understanding and visualizations. Many times we prefer to perform clustering before applying the regression algorithms to get more accurate predictions for each cluster.

We will do k-means clustering for our case study, using the following steps:

1. Separating the columns to be analyzed

Let’s get a sample data frame comprising all the items whose expenditure is to be analyzed, i.e. all columns except Channel and Region – fresh, milk, grocery, frozen etc.
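
A sketch (the name items is an assumption):

items <- data[ , !(names(data) %in% c("Channel", "Region"))]
head(items)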

2. Scaling the data, to get all the columns onto the same scale. This is done by calculating z-scores:
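
For example, with the built-in scale function:

scaled_items <- scale(items)    # (x - column mean) / column sd for every column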

3. Identifying the appropriate number of clusters for k-means clustering

Though 2 or 3 clusters show the maximum variance, in this case study we are dividing the data into 10 clusters to get more specific results, visualizations and targeting strategies.

We can also use within-sum-of-squares method to find the number of clusters.


4. Finding the most suitable number of clusters through wss method
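
A sketch: compute the total within-sum-of-squares for k = 1 to 15 clusters (the seed, the range of k and nstart are assumptions):

set.seed(100)
wss <- sapply(1:15, function(k) kmeans(scaled_items, centers = k, nstart = 20)$tot.withinss)
wss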

5. Plot wss using ggplot2 Library

We will plot the within-sum-of-squares distance using ggplot library:
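
A sketch of the plot (wss_df is an assumed name):

library(ggplot2)
wss_df <- data.frame(k = 1:15, wss = wss)
ggplot(wss_df, aes(x = k, y = wss)) +
  geom_line() +
  geom_point() +
  labs(x = "Number of clusters", y = "Within-sum-of-squares")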

We notice that after cluster 10, the wss distance increases drastically. So we can choose 10 clusters.

6. Dividing data into 10 clusters

We will apply kmeans algorithm to divide the data into 10 clusters:
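
A sketch (the seed and nstart are assumptions):

set.seed(100)
km <- kmeans(scaled_items, centers = 10, nstart = 20)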

7. Checking the Attributes of the k-means Object

We will check the centers and size of the clusters
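
For example:

km$centers    # cluster centres on the scaled data
km$size       # number of customers in each cluster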

8. Visualizing the Clusters
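
One simple option is to colour a scatter plot of two of the spending variables by cluster membership (a sketch, not necessarily the plot used in the original post):

plot(items$Milk, items$Grocery, col = km$cluster, pch = 19,
     xlab = "Milk", ylab = "Grocery", main = "Clusters on Milk vs Grocery")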

9. Profiling Clusters

Getting Cluster-wise summaries through mean function
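
For example:

aggregate(items, by = list(cluster = km$cluster), FUN = mean)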

10. Population-Wise Summaries
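
A sketch (pop_mean and pop_sd are assumed names):

pop_mean <- sapply(items, mean)
pop_sd   <- sapply(items, sd)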

11. Z-Value Normalisation

z score = (cluster_mean-population_mean)/population_sd
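
A sketch of the computation, using the cluster means and the population mean/sd from the previous steps:

cluster_means <- as.matrix(aggregate(items, by = list(cluster = km$cluster), FUN = mean)[ , -1])
z_scores <- scale(cluster_means, center = pop_mean, scale = pop_sd)   # (cluster mean - population mean) / population sd
round(z_scores, 2)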

Wherever we have a very high z-score, it indicates that the cluster is different from the population:

  • Very high z-score for fresh in clusters 8 and 9
  • Very high z-score for milk in clusters 5, 6 and 9
  • Very high z-score for grocery in clusters 5 and 6
  • Very high z-score for frozen products in clusters 7, 9 and 10
  • Very high z-score for detergents paper in clusters 5 and 6

We would like to find why these clusters are so different from the population

IV. MULTI-VARIATE VISUALIZATIONS

  1. To understand the correlations between the columns:
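
For example:

round(cor(items), 2)    # correlation matrix of the spending columns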

We observe positive correlation between:

  • Milk & Grocery
  • Milk & Detergents_Paper
  • Grocery & Detergents_Paper

Next we will load the ggplot2 library to create graphical representations of the data data-frame.

We’ll also add the cluster number as a column to the data-frame object “data”.
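
A sketch (the column name cluster is an assumption):

library(ggplot2)
data$cluster <- factor(km$cluster)    # attach cluster membership to the data frame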

Next we will check the cluster-wise views and how the patterns differ cluster-wise.

Milk vs Grocery vs Fresh cluster wise analysis
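
A sketch of this kind of plot; the exact aesthetic mapping in the original post is not shown, so here Fresh is mapped to point size and cluster to colour:

ggplot(data, aes(x = Milk, y = Grocery, size = Fresh, colour = cluster)) +
  geom_point(alpha = 0.6)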

  • We notice that if expenditure on milk is high, expenditure on grocery or fresh is high, but not both
  • We notice cluster 4 contains data points on the high end of milk or grocery
  • Cluster 3 has people with high spending on milk and average spending on grocery

Relationship between Milk, Grocery and Fresh across Region across Channel
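
A sketch: the same plot faceted by Channel and Region:

ggplot(data, aes(x = Milk, y = Grocery, size = Fresh, colour = cluster)) +
  geom_point(alpha = 0.6) +
  facet_grid(Channel ~ Region)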

  • Region 3 has more people than Region 1 and 2
  • In Region 3 we observe an increasing trend between milk and fresh and grocery
  • In Region 1 we notice that there is an increasing trend between milk and grocery but fresh is low
  • In Region 2 we notice medium purchase of milk and grocery and fresh
  • High milk / grocery sales and medium fresh sales is through channel 2
  • In channel 2 there is an increasing trend between consumption of milk and consumption of grocery
  • Cluster 4 has either high sales of milk or grocery or both
  • Channel 2 contributes to high sales of milk and grocery, while low and medium sales of fresh

Milk vs Grocery vs Frozen Products Cluster wise analysis

  • Very high sales of frozen products by cluster 11 and cluster 7
  • People purchasing high quantities of milk and grocery are purchasing low quantities of frozen products

Relationship between Milk, Grocery and Frozen Products across Region

  • In Region 2 and Region 3, we have clusters 1 and 3 respectively, which have a high expenditure pattern on frozen products

Relationship between Milk, Grocery and Frozen across Channel

  • We notice that channel 1 has many people with high purchase pattern of frozen products
  • Channel 2 has some clusters (cluster no.: 5 and 6) with very high purchase pattern of milk

Relationship between Frozen Products, Grocery and Detergents Paper across Region across Channel

  • In channel 2, people who spend heavily on grocery spend little on frozen products
  • High sales of detergents paper and grocery are observed through channel 2
  • Sales of frozen products through channel 2 are almost nil
  • Cluster 4 has high expenditure on Detergents_Paper

Relationship between Milk, Delicassen and Detergents Paper across Region

  • People who spend high on milk hardly spend on Delicassen, though in region 3 we do see comparatively more expenditure on Delicassen
  • Cluster 3 in region 3 has very high expenditure on delicassen and high expenditure on milk
  • Cluster 4 has high consumption pattern on milk and detergents paper

Relationship between Milk, Grocery and Detergents Paper across Channel

  • Channel 2 shows an increasing trend between Milk and Detergents Paper
  • Where sales of detergents paper are high, sales of milk are also high
  • Cluster 4 has a high expense pattern on Detergents Paper or Milk

Relationship between Milk, Grocery and Detergents Paper across Region across Channel

  • There is a linear trend between Milk and Grocery in channel 2
  • There is a linear trend between Grocery and Detergent Paper
  • Cluster 4 has high consumption of grocery and detergents paper
  • Cluster 10 has medium consumption of milk, grocery and detergents paper
  • Cluster 6 has low consumption of milk and grocery and detergents paper
  • Cluster 2 has lowest consumption of milk grocery and detergents paper

Based on the above understanding of cluster-wise trends, we can devise cluster-wise, region-wise, channel-wise strategies to improve the sales.

V. DIMENSIONALITY REDUCTION TECHNIQUES

We use dimensionality reduction techniques like PCA to transform a larger number of independent variables into a smaller set of variables:
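
A minimal sketch with prcomp on the spending columns:

pca <- prcomp(items, center = TRUE, scale. = TRUE)
summary(pca)    # proportion of variance explained by each principal component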

Principal Component Analysis

Principal component analysis (PCA) tries to explain the variance-covariance structure of a set of variables through a few linear combinations of these variables. Its general objectives are data reduction and interpretation. Principal components are often more effective in summarizing the variability in a set of variables when these variables are highly correlated.

Also, PCA is normally an intermediate step in data analysis, since the new variables created (the component scores) can be used in subsequent analyses such as multivariate regression and cluster analysis.

We will discuss PCA in my further posts.

Data Cleaning, Categorization and Normalization

Data cleaning, categorization and normalization is the most important step in preparing data for analysis. Data that is captured is generally dirty and unfit for statistical analysis. It has to be first cleaned, standardized, categorized and normalized, and then explored.

Definition of Clean Data

Happy families are all alike; every unhappy family is unhappy in its own way – Leo Tolstoy

Like families, clean datasets are all alike, but every messy dataset is unreadable by our modeling algorithms. Clean datasets provide a standardized way to link the structure of a dataset with its semantics.

We will take the following text file with dirty data:

%% Data
Sonu ,1861, 1892, male
Arun , 1892, M
1871, Monica, 1937, Female
1880, RUCHI, F
Geetu, 1850, 1950, fem
BaLa, 1893,1863
% Names, birth dates, death dates, gender

Let us start with tidying the above data. We’ll adopt the following steps:

1. Read the unclean data from the text file and analyse the structure, content, and quality of data.

The following functions let us read data that is technically correct, or close to it:

  • read.table
  • read.csv
  • read.csv2
  • read.delim
  • read.delim2

When the rows in the data file are not uniformly formatted, we can consider reading in the text line by line and transforming the data to rectangular form ourselves.
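
A sketch (the file name is an assumption):

txt <- readLines("dirty_data.txt")
txt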

The variable txt is a vector of type “character” with 8 elements, equal to the number of lines in our txt file.

2. Delete the irrelevant/duplicate data. This improves data protection, increases the speed of processing and reduces the overall costs.

In our example, we will delete the comments from the data. Comments are the lines starting with the “%” sign.

Following is the code:
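
A possible sketch:

txt <- txt[!grepl("^%", txt)]   # keep only the lines that do not start with "%"
txt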

3. Split lines into separate fields. This can be done using the strsplit function:
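
For example:

fields <- strsplit(txt, split = ",")
fields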

Here, txt was a character vector; after splitting, the output stored in the variable “fields” is of type “list”.

4. Standardize and Categorize fields

It is a crucial process where the data is defined, formatted, represented and structured in all data layers. Development of schema or attributes is involved in this process.

The goal of this step is to make sure every row has the same number of fields, and that the fields are in the same order. In the read.table command, any row with fewer than the maximum number of fields gets padded with NA. One advantage of the do-it-yourself approach is that we don’t have to make this assumption. The easiest way to standardize rows is to write a function that takes a single character vector as input and assigns the values in the right order.
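
A sketch of such a Categorize function, built from the description below (the 1890 cut-off and the field positions are taken from the text; the implementation details are assumptions):

Categorize <- function(x) {
  out <- rep(NA_character_, 4)                   # Name, Birth year, Death year, Gender
  alpha <- grep("[[:alpha:]]", x)                # positions of the alphabetical fields
  out[1] <- x[alpha[1]]                          # first alphabetical field: the name
  if (length(alpha) > 1) out[4] <- x[alpha[2]]   # second alphabetical field: the gender
  yrs <- suppressWarnings(as.numeric(x[-alpha])) # remaining fields: the years
  out[2] <- yrs[which(yrs < 1890)][1]            # birth year: less than 1890
  out[3] <- yrs[which(yrs > 1890)][1]            # death year: greater than 1890
  out
}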

The above function takes each line in txt as input (as x) and returns a character vector in a standard order: Name, Birth Year, Death Year, Gender. Wherever a field is missing, NA is introduced. The grep statement is used to get the positions of the alphabetical fields in the input vector x. There are two kinds of alphabetical fields in our input, one representing the name and the other representing the gender; they are assigned to the 1st and 4th positions of the output vector respectively. The year of birth and year of death can be recognized as “less than 1890” and “greater than 1890” respectively.

To retrieve the fields for each row in the example, we need to apply this function to every row of fields.

stdfields <- lapply(fields, Categorize)

Our Categorize function here is quite fragile; for example, it breaks when the input vector contains 3 or more character fields. Only the data analyst can determine how generalized the Categorize function should be.

5. Transform the Data to the Data-Frame Type

First we will copy the elements of list to a matrix which is then coerced into a data-frame.
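
A sketch (the data frame name dat follows the text; the column names are assumptions based on the fields described above):

M <- matrix(unlist(stdfields), nrow = length(stdfields), byrow = TRUE)
dat <- as.data.frame(M, stringsAsFactors = FALSE)
names(dat) <- c("Name", "Birth", "Death", "Gender")
dat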

Here, we have built the matrix by row; the number of rows in the matrix is the same as the number of elements in stdfields. Thereafter, we have converted the matrix to the data.frame format.

6. Data Normalization

It is the systematic process to ensure the data structure is suitable or serves the purpose. Here the undesirable characteristics of the data are eliminated or updated to improve the consistency and the quality. The goal of this process is to reduce redundancy, inaccuracy and to organize the data.

String normalization techniques are aimed at transforming a variety of strings to a smaller set of string values which are more easily processed.

  • Remove the white spaces. We can use str_trim function from stringr library.
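
A sketch:

library(stringr)
dat <- as.data.frame(sapply(dat, str_trim), stringsAsFactors = FALSE)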

The sapply function returns an object of type matrix, so we convert the result back to the data-frame type.

  • Converting all the letters in the column “Name” to upper-case for standardization
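
For example:

dat$Name <- toupper(dat$Name)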

  • Normalize the gender variable. It is in 5 different formats. Following are two ways to normalize the gender variable:

The first uses the ^ operator in a regular expression, which finds the values in the Gender column beginning with m and f respectively.

Following is the code:
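
A possible sketch:

dat$Gender[grepl("^[Mm]", dat$Gender)] <- "male"
dat$Gender[grepl("^[Ff]", dat$Gender)] <- "female"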

There is another method: approximate string matching using string distances. A string distance measures how much two strings differ from each other, i.e. how many operations are required to turn one string into another. For example:

So, here there are 2 operations required to convert pqr to qpr

  • replace q: pqr -> ppr
  • replace p: ppr -> qpr

We’ll use the adist function on the Gender column in the following manner:
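
A sketch (codes and dist_gender are assumed names):

codes <- c("male", "female")
dist_gender <- adist(dat$Gender, codes)   # edit distance from each Gender value to each code
dist_gender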

The first value, male, is 0 replacements away from “male” and 2 replacements away from “female”.
The second, M, is 4 replacements away from “male” and 6 replacements away from “female”.
The third, Female, is 2 replacements away from “male” and 1 replacement away from “female”.

We can use the which.min() function on each row to get the code with the minimum distance.

The result contains the column number where the distance from our codes is minimum. If the distance from column 1 is minimum, we substitute the gender with codes[1]; if the distance from column 2 is minimum, we substitute it with codes[2]. Following is the code:
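
A sketch (the NA guard is an assumption, added to handle the record with a missing gender):

ind <- apply(dist_gender, 1, function(d) if (all(is.na(d))) NA else which.min(d))
dat$Gender <- codes[ind]
dat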

  • Normalize the data-types

Let’s retrieve the classes of each column in the data-frame dat:
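
For example:

sapply(dat, class)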

We will convert the Birth and Death columns to numeric data-types using the transform function.
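
For example:

dat <- transform(dat, Birth = as.numeric(Birth), Death = as.numeric(Death))
sapply(dat, class)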

The data as represented in the dat data frame is now clean, standardized, categorized and normalized.

Cleaning the Date Variable

When we talk about data cleaning, we can’t miss cleaning and normalizing the date variable. Dates are generally keyed in by different people in different formats, so we always face problems standardizing them.

In the following example, we’ll standardize and normalize the date variable:
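
The original date vector is not reproduced here; as a purely hypothetical illustration, assume a vector date mixing the formats discussed below:

date <- c("15/08/1947", "02-01-98", "5 Mar 1999", "1998", "110798")   # hypothetical examples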

We notice the date variable is in a mixed format. We will first substitute “/” and ” ” in the date variable with “-”, in our attempt to standardize dates.
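
A sketch:

date <- gsub("[/ ]", "-", date)
date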

In the code below, we make the assumption that the month cannot be in the first two or the last two characters. The month in all cases comes between the date and the year, or between the year and the date.

Next, we will write a function to find the first-two digits of each date:

  • Split the date string and store the output in dt. Eg. “110798” to dt = ‘1’ ‘1’ ‘0’ ‘7’ ‘9’ ‘8’
  • If dt[2]=“-”, then first_2 numbers will be 0 and dt[1]
  • We will convert the first_2 characters to numeric type.
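
A sketch of such a function (the name get_first_2 is an assumption):

get_first_2 <- function(d) {
  dt <- strsplit(d, split = "")[[1]]
  if (dt[2] == "-") {
    as.numeric(paste0("0", dt[1]))      # single-digit first field: prepend 0
  } else {
    as.numeric(paste0(dt[1], dt[2]))
  }
}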

Next, we will write a function to find the last-2 digits of each date:

  • Split the date string and store the output in dt. Eg. “110798” to dt = ‘1’ ‘1’ ‘0’ ‘7’ ‘9’ ‘8’
  • If the second-last character of dt is “-”, then last_2 will be 0 and the last character (a single-digit field)
  • We will convert the last_2 characters to numeric type.
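
A sketch of such a function (the name get_last_2 is an assumption):

get_last_2 <- function(d) {
  dt <- strsplit(d, split = "")[[1]]
  n  <- length(dt)
  if (dt[n - 1] == "-") {
    as.numeric(dt[n])                   # single-digit last field
  } else {
    as.numeric(paste0(dt[n - 1], dt[n]))
  }
}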

Middle 2 digits can be alphabetic or numeric. We will write the function to find the middle 2 characters. We will adopt following steps to reach the middle characters:

  • Split the date string and store the output in dt. Eg. “110798” to dt = ‘1’ ‘1’ ‘0’ ‘7’ ‘9’ ‘8’
  • If dt[3]=“-”, then, dt[4:5] are the middle characters. But, if dt[5] is a “-”, then, dt[4] (single digit) is the middle character.
  • If dt[3]!=“-”, then, dt[3:4] are the middle characters. But, if dt[4] is a “-”, then, dt[3] (single digit)is the middle character.
  • If the middle character is a single digit, then we need to append 0 to the single digit, to return a 2 digit middle character.
  • Later in the code, we will evaluate if the middle character is an alphabetic or numeric, and process accordingly.
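
A sketch of such a function (the name get_mid is an assumption):

get_mid <- function(d) {
  dt <- strsplit(d, split = "")[[1]]
  if (dt[3] == "-") {
    mid <- if (dt[5] == "-") dt[4] else dt[4:5]
  } else {
    mid <- if (dt[4] == "-") dt[3] else dt[3:4]
  }
  mid <- paste(mid, collapse = "")
  if (nchar(mid) == 1) mid <- paste0("0", mid)   # pad a single digit with a leading 0
  mid
}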

Next we will initialize variables before applying a for-loop to process and segregate the date, month and year for each date field.

  • df is a dummy data.frame object.
  • datf will be the final data frame object, comprising the fields dd, mm, year, month format (i.e. numeric or alphabetical) and the concatenated date in the standard format.
  • snum refers to the serial number of each date in the vector “date”.
  • We will keep the default date as 01, the default month as 01 and the default year as 1990 in case of missing fields.
  • Mon_Format refers to whether the month is in numeric or alphabetical format.
  • The cat_date field contains the date, month and year concatenated.
  • format_date will contain cat_date converted to the “date” format.

Now, for each date field, we will find the first-two numbers, the last-two numbers and the middle characters. Then we’ll apply following rules:

  • If number of characters in our date is 4, then this represents the year.
  • If first_2>31, then the first-two numbers represent the year.
  • If first_2<31, then, first_2 represents the date.
  • If 2nd number is a “-”, then 1st number represents the single digit date.
  • If last_2>31, then the last-two numbers represent the year.
  • If last_2<31, then the last-two numbers represent the date.
  • If the 2nd last number is a “-”, then the last number represents the single digit date.
  • If middle characters are non-alphabetical (i.e if middle characters are numeric), then mid-num represents the numeric middle characters. Mon_Format will be labelled numeric in such case.
  • If mid-num>31, then mid-num represents the year, else if mid-num<12, then mid-num represents the month.
  • If the date has an alphabetical character, then the first 3 alphabets represent the month. The Mon_Format will be alphabetical in this case.
  • Next we will concatenate the date in following format: dd-mm-yy, where mm can be numeric (with mon_format as “num”) or alphabetic (with mon_format as “alpha”) and store into variable “cat_date”.
  • We will change the format of cat_date to “date” data-type and store in the variable “formatted_date”.

Following is the code:
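
A simplified sketch of such a loop, following the rules above. The helper functions (get_first_2, get_last_2, get_mid), the two-digit-year expansion and several other details are assumptions, not the original post’s code:

datf <- data.frame()
for (snum in seq_along(date)) {
  d  <- date[snum]
  dd <- "01"; mm <- "01"; yy <- "1990"; mon_format <- "num"   # defaults for missing fields

  if (nchar(d) == 4) {
    yy <- d                                    # a bare 4-character field is the year
  } else {
    first_2 <- get_first_2(d)
    last_2  <- get_last_2(d)
    if (first_2 > 31) yy <- as.character(first_2) else dd <- sprintf("%02d", first_2)
    if (last_2  > 31) yy <- as.character(last_2)  else dd <- sprintf("%02d", last_2)

    if (grepl("[[:alpha:]]", d)) {             # alphabetical month: take its first three letters
      mm <- substr(gsub("[^[:alpha:]]", "", d), 1, 3)
      mon_format <- "alpha"
    } else {
      mid_num <- as.numeric(get_mid(d))
      if (mid_num > 31) {
        yy <- as.character(mid_num)
      } else if (mid_num <= 12) {
        mm <- sprintf("%02d", mid_num)
      }
    }
    if (nchar(yy) == 2) {                      # assumption: expand two-digit years to 19xx / 20xx
      yy <- ifelse(as.numeric(yy) > 20, paste0("19", yy), paste0("20", yy))
    }
  }

  cat_date    <- paste(dd, mm, yy, sep = "-")
  format_date <- as.Date(cat_date, format = if (mon_format == "alpha") "%d-%b-%Y" else "%d-%m-%Y")
  datf <- rbind(datf, data.frame(snum, dd, mm, year = yy, Mon_Format = mon_format,
                                 cat_date, format_date, stringsAsFactors = FALSE))
}
formatted_date <- datf$format_date
formatted_date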

The vector formatted_date contains the date in the correct format.

The data.frame datf comprises all the entries segregated into date, month and year.

We note that the formatted_date vector now contains all the dates in the standard format; its class is “Date”.

I would like to conclude this article by:

“Bad Data leads to Bad Decisions”

“Good Data improves Efficient Decisions”

Also read:

Data Exploration and Uni-Variate Analysis
Bi-Variate Analysis