
WHY DO MULTI-VARIATE ANALYSIS

  1. Every data-set comprises multiple variables, so we need to understand how these variables interact with each other.
  2. After uni-variate analysis, where we understand the behaviour of each distribution, and bi-variate analysis, where we understand how each variable relates to the others, we need to understand how the trends change when more variables are introduced.
  3. Multi-variate analysis has wide application in clustering, where we need to visualize how multiple variables show different patterns in different clusters.
  4. When there are too many inter-correlated variables in the data, we have to reduce dimensionality through techniques like Principal Component Analysis and Factor Analysis. We will cover dimensionality-reduction techniques in a separate post.

We will illustrate multi-variate analysis with the following case study:

Data:

data<-read.csv("https://storage.googleapis.com/dimensionless/Blog/cust.csv")

Each row corresponds to the annual spending of one customer of a wholesale distributor on six product categories (Fresh, Milk, Grocery, Frozen, Detergents_Paper and Delicassen), across 3 regions (Lisbon, Oporto and Other, coded 1/2/3 respectively) and 2 channels (Horeca, i.e. Hotel/Restaurant/Cafe, or Retail, coded 1/2 respectively).

head(data)
##   Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1       2      3 12669 9656    7561    214             2674       1338
## 2       2      3  7057 9810    9568   1762             3293       1776
## 3       2      3  6353 8808    7684   2405             3516       7844
## 4       1      3 13265 1196    4221   6404              507       1788
## 5       2      3 22615 5410    7198   3915             1777       5185
## 6       2      3  9413 8259    5126    666             1795       1451
summary(data)
##     Channel          Region          Fresh             Milk      
##  Min.   :1.000   Min.   :1.000   Min.   :     3   Min.   :   55  
##  1st Qu.:1.000   1st Qu.:2.000   1st Qu.:  3128   1st Qu.: 1533  
##  Median :1.000   Median :3.000   Median :  8504   Median : 3627  
##  Mean   :1.323   Mean   :2.543   Mean   : 12000   Mean   : 5796  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.: 16934   3rd Qu.: 7190  
##  Max.   :2.000   Max.   :3.000   Max.   :112151   Max.   :73498  
##     Grocery          Frozen        Detergents_Paper    Delicassen     
##  Min.   :    3   Min.   :   25.0   Min.   :    3.0   Min.   :    3.0  
##  1st Qu.: 2153   1st Qu.:  742.2   1st Qu.:  256.8   1st Qu.:  408.2  
##  Median : 4756   Median : 1526.0   Median :  816.5   Median :  965.5  
##  Mean   : 7951   Mean   : 3071.9   Mean   : 2881.5   Mean   : 1524.9  
##  3rd Qu.:10656   3rd Qu.: 3554.2   3rd Qu.: 3922.0   3rd Qu.: 1820.2  
##  Max.   :92780   Max.   :60869.0   Max.   :40827.0   Max.   :47943.0

PROCEDURE TO ANALYZE MULTIPLE VARIABLES

I. TABLES

Tables can be generated using the xtabs, tapply and aggregate functions, or with the dplyr library.

To get the spending on milk channel-wise and region-wise, using the xtabs function:

t=xtabs(data$Milk~data$Channel+data$Region)

To get the percentage spending:

round(t/sum(data$Milk, na.rm=T),2)
##             data$Region
## data$Channel    1    2    3
##            1 0.09 0.03 0.29
##            2 0.08 0.07 0.45
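
For reference, the same percentage table can be produced in one step with prop.table() (a minimal equivalent, assuming Milk has no missing values):

round(prop.table(t), 2)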

To get the percentage spending on grocery channel-wise and region-wise, using the aggregate function:

agg=aggregate(data$Grocery, by=list(Channel=data$Channel, Region=data$Region), sum,na.rm=T)
names(agg)[3]="Grocery"
agg$Ptage_Expense=round(agg$Grocery/sum(data$Grocery, na.rm=TRUE),2)
agg
##   Channel Region Grocery Ptage_Expense
## 1       1      1  237542          0.07
## 2       2      1  332495          0.10
## 3       1      2  123074          0.04
## 4       2      2  310200          0.09
## 5       1      3  820101          0.23
## 6       2      3 1675150          0.48

To get the percentage spending on frozen products channel-wise and region-wise, using the tapply function:

b=tapply(data$Frozen, list(Region=data$Region, Channel=data$Channel),sum , na.rm=TRUE)

To get the percentage spending:

round(b/sum(data$Frozen, na.rm=TRUE),2)
##       Channel
## Region    1    2
##      1 0.14 0.03
##      2 0.12 0.02
##      3 0.57 0.12

To get the percentage spending on detergents_paper channel-wise and region-wise, using the dplyr library:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data%>%group_by(Channel,Region)%>%summarise(Detergents_Paper=sum(Detergents_Paper))->agg
agg$Ptage_Expense=round(agg$Detergents_Paper/sum(data$Detergents_Paper, na.rm = TRUE),2)
agg
## Source: local data frame [6 x 4]
## Groups: Channel [?]
## 
##   Channel Region Detergents_Paper Ptage_Expense
##     <int>  <int>            <int>         <dbl>
## 1       1      1            56081          0.04
## 2       1      2            13516          0.01
## 3       1      3           165990          0.13
## 4       2      1           148055          0.12
## 5       2      2           159795          0.13
## 6       2      3           724420          0.57
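
As a sketch of an alternative, the percentage can also be computed inside the pipe itself. Note the ungroup(): summarise() leaves the result grouped by Channel, and the grand total should be taken over all rows:

data%>%
  group_by(Channel,Region)%>%
  summarise(Detergents_Paper=sum(Detergents_Paper, na.rm=TRUE))%>%
  ungroup()%>%
  mutate(Ptage_Expense=round(Detergents_Paper/sum(Detergents_Paper),2))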

II. STATISTICAL TESTS

Anova

ANOVA can be used to understand how a continuous variable depends on categorical independent variables.

In the following code we test whether the sale of milk is a function of Region, Channel and their interaction:

res<-aov(Milk~Region + Channel + Region:Channel,data=data)
summary(res)
##                 Df    Sum Sq   Mean Sq F value Pr(>F)    
## Region           1 2.493e+07 2.493e+07   0.577  0.448    
## Channel          1 5.051e+09 5.051e+09 116.987 <2e-16 ***
## Region:Channel   1 1.134e+07 1.134e+07   0.263  0.609    
## Residuals      436 1.882e+10 4.318e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This shows that the expense on milk depends on the channel (p < 2e-16), while region and the region:channel interaction are not significant.
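
One caveat: Channel and Region are stored as integers, so aov() above treats them as numeric trends (1 degree of freedom each). A minimal sketch of the usually more appropriate factor version, where Region gets 2 degrees of freedom:

res_f<-aov(Milk~factor(Region)*factor(Channel), data=data)   # factors, not numeric codes
summary(res_f)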

Chi-Square Test

The Chi-square test is used to understand the association between 2 factor variables.

dat=table(Channel=data$Channel, Region=data$Region)
chisq.test(dat)
## 
##  Pearson's Chi-squared test
## 
## data:  dat
## X-squared = 4.3491, df = 2, p-value = 0.1137

The p-value is high (0.1137, well above the usual 0.05 threshold), so we fail to reject the null hypothesis. Hence, we conclude that there is no significant association between channel and region.
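
For reference, the expected counts under independence can be inspected from the same test object:

chisq.test(dat)$expected   # counts we would expect if Channel and Region were independent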

III. CLUSTERING

Multi-variate analysis has very wide application in unsupervised learning; clustering makes the heaviest use of multi-variate understanding and visualization. Many times we prefer to perform clustering before applying regression algorithms, to get more accurate predictions within each cluster.

We will do k-means clustering for our case study, using the following steps:

1. Separating the columns to be analyzed

Let’s get a sample data-frame comprising all the items whose expenditure is to be analyzed, i.e. all columns except Channel and Region: Fresh, Milk, Grocery, Frozen, Detergents_Paper and Delicassen.

names(data)
## [1] "Channel"          "Region"           "Fresh"           
## [4] "Milk"             "Grocery"          "Frozen"          
## [7] "Detergents_Paper" "Delicassen"
sample<-data[,3:8]
names(sample)
## [1] "Fresh"            "Milk"             "Grocery"         
## [4] "Frozen"           "Detergents_Paper" "Delicassen"

2. Scaling the data, to bring all the columns onto the same scale. This is done by computing the z-score, z = (x - mean) / sd, for each column:

sample_scale=scale(sample, center=TRUE, scale=TRUE)
sample=cbind(sample, sample_scale)
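
As a quick sanity check (optional), each scaled column should now have mean approximately 0 and standard deviation 1:

round(colMeans(sample_scale), 10)   # means should be ~0
apply(sample_scale, 2, sd)          # standard deviations should be 1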

3. Identifying the appropriate number of clusters for k-means clustering

library(NbClust)
noculs <- NbClust(sample_scale, distance = "euclidean", 
                  min.nc = 2, max.nc = 12, method = "kmeans") 
## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 
## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 6 proposed 2 as the best number of clusters 
## * 4 proposed 3 as the best number of clusters 
## * 3 proposed 4 as the best number of clusters 
## * 3 proposed 5 as the best number of clusters 
## * 1 proposed 7 as the best number of clusters 
## * 4 proposed 10 as the best number of clusters 
## * 3 proposed 12 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************
table(noculs$Best.nc[1,])
## 
##  0  2  3  4  5  7 10 12 
##  2  6  4  3  3  1  4  3
barplot(table(noculs$Best.nc[1,]), xlab="Number of Clusters", ylab="Number of Criteria",
 main="Number of Clusters Chosen")

Though 2 or 3 clusters are proposed by the largest number of criteria, in this case-study we are dividing the data into 10 clusters to get more specific results, visualizations and targeting strategies.

We can also use the within-sum-of-squares (wss) method to find the number of clusters.

Also read:
Data Exploration and Uni-Variate Analysis
Bi-Variate Analysis
Data-Cleaning, Categorization and Normalization

4. Finding the most suitable number of clusters through the wss method

wss<-1:15
for (i in 1:15)
{
  wss[i]<-kmeans(sample[,7:12],i)$tot.withinss   # columns 7:12 hold the scaled variables
}
wss
##  [1] 2634.0000 1949.3479 1638.5026 1353.8722 1240.8142 1125.9104  862.4485
##  [8]  773.6659  689.2003  597.5906  564.9160  526.6801  482.9415  481.6230
## [15]  468.0860

5. Plotting wss using the ggplot2 library

We will plot the within-sum-of-squares distance against the number of clusters:

number<-1:15

library(ggplot2)
dat<-data.frame(wss,number)
dat 
##          wss number
## 1  2634.0000      1
## 2  1949.3479      2
## 3  1638.5026      3
## 4  1353.8722      4
## 5  1240.8142      5
## 6  1125.9104      6
## 7   862.4485      7
## 8   773.6659      8
## 9   689.2003      9
## 10  597.5906     10
## 11  564.9160     11
## 12  526.6801     12
## 13  482.9415     13
## 14  481.6230     14
## 15  468.0860     15
p<-ggplot(dat,aes(x=number,y=wss))
p+geom_point(color="red")+scale_x_continuous(breaks=seq(1,20,1))+scale_y_continuous(breaks=seq(500,3000,500))

We notice that the reduction in wss flattens out after 10 clusters, so we choose 10 clusters.

6. Dividing data into 10 clusters

We will apply kmeans algorithm to divide the data into 10 clusters:

set.seed(200)
fit.km<-kmeans(sample[,7:12],10)
sample$cluster=fit.km$cluster

7. Checking the Attributes of the k-means Object

We will check the centers and sizes of the clusters:

attributes(fit.km)
## $names
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"      
## 
## $class
## [1] "kmeans"
fit.km$centers
##         Fresh       Milk     Grocery      Frozen Detergents_Paper
## 1   0.7918828  0.5610464 -0.01128859  9.24203651       -0.4635194
## 2   0.7108765 -0.2589259 -0.27203848 -0.20929524       -0.3530279
## 3   1.9645810  5.1696185  1.28575327  6.89275382       -0.5542311
## 4   1.0755395  5.1033075  5.63190631 -0.08979632        5.6823687
## 5  -0.4634131 -0.4587350 -0.52211820 -0.28548405       -0.4501496
## 6  -0.5526133  0.4194314  0.51079792 -0.30950259        0.4915742
## 7   0.2370851 -0.2938321 -0.42497950  1.43457516       -0.4926072
## 8   3.0061448  1.6650889  0.98706324  1.09668928        0.1840989
## 9   2.7898027 -0.3572603 -0.37008492  0.46561695       -0.4540498
## 10 -0.5039508  1.4484898  1.97086631 -0.27885432        2.2070700
##      Delicassen
## 1   0.932103121
## 2   0.051489721
## 3  16.459711293
## 4   0.419817401
## 5  -0.269307346
## 6  -0.010691096
## 7  -0.019489863
## 8   4.237120807
## 9  -0.008907628
## 10  0.207301087
fit.km$size
##  [1]   2  78   1   5 167  93  43   4  20  27

8. Visualizing the Clusters

library(cluster)
clusplot(sample, fit.km$cluster, main='2D representation of the Cluster solution',
         color=TRUE, shade=TRUE, labels=2, lines=0)
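
As an alternative sketch (this assumes the factoextra package, which is not used elsewhere in this post):

library(factoextra)
fviz_cluster(fit.km, data=sample[,7:12])   # plots the clusters on the first two principal components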

9. Profiling the Clusters

Getting cluster-wise summaries through the mean function:

cmeans<-aggregate(sample[,c(1:6)], by=list(sample$cluster), FUN=mean)
cmeans
##    Group.1     Fresh      Milk   Grocery    Frozen Detergents_Paper
## 1        1 22015.500  9937.000  7844.000 47939.000         671.5000
## 2        2 20990.987  3885.295  5366.051  2055.872        1198.3077
## 3        3 36847.000 43950.000 20170.000 36534.000         239.0000
## 4        4 25603.000 43460.600 61472.200  2636.000       29974.2000
## 5        5  6139.359  2410.629  2989.503  1686.000         735.2455
## 6        6  5011.215  8891.828 12805.473  1569.398        5225.2473
## 7        7 14998.791  3627.674  3912.628 10036.326         532.8140
## 8        8 50020.000 18085.250 17331.500  8396.000        3759.2500
## 9        9 47283.850  3159.550  4434.300  5332.350         716.6500
## 10      10  5626.667 16486.667 26680.741  1718.185       13404.4815
##    Delicassen
## 1   4153.5000
## 2   1670.0769
## 3  47943.0000
## 4   2708.8000
## 5    765.3952
## 6   1494.7204
## 7   1469.9070
## 8  13474.0000
## 9   1499.7500
## 10  2109.4815

10. Population-Wise Summaries

options(scipen=999)
popln_mean=apply(sample[,1:6],2,mean)
popln_sd=apply(sample[,1:6],2,sd)
popln_mean
##            Fresh             Milk          Grocery           Frozen 
##        12000.298         5796.266         7951.277         3071.932 
## Detergents_Paper       Delicassen 
##         2881.493         1524.870
popln_sd
##            Fresh             Milk          Grocery           Frozen 
##        12647.329         7380.377         9503.163         4854.673 
## Detergents_Paper       Delicassen 
##         4767.854         2820.106

11. Z-Value Normalisation

z score = (cluster_mean-population_mean)/population_sd

list<-names(cmeans)[-1]                       # the six product columns
for(i in 1:length(list))
{
  y<-(cmeans[,i+1] - popln_mean[i])/popln_sd[i]
  cmeans<-cbind(cmeans,y)
  names(cmeans)[i+1+length(list)]<-paste("z",list[i],sep="_")
}
cmeans[8:length(names(cmeans))]
##       z_Fresh     z_Milk   z_Grocery    z_Frozen z_Detergents_Paper
## 1   0.7918828  0.5610464 -0.01128859  9.24203651         -0.4635194
## 2   0.7108765 -0.2589259 -0.27203848 -0.20929524         -0.3530279
## 3   1.9645810  5.1696185  1.28575327  6.89275382         -0.5542311
## 4   1.0755395  5.1033075  5.63190631 -0.08979632          5.6823687
## 5  -0.4634131 -0.4587350 -0.52211820 -0.28548405         -0.4501496
## 6  -0.5526133  0.4194314  0.51079792 -0.30950259          0.4915742
## 7   0.2370851 -0.2938321 -0.42497950  1.43457516         -0.4926072
## 8   3.0061448  1.6650889  0.98706324  1.09668928          0.1840989
## 9   2.7898027 -0.3572603 -0.37008492  0.46561695         -0.4540498
## 10 -0.5039508  1.4484898  1.97086631 -0.27885432          2.2070700
##    z_Delicassen
## 1   0.932103121
## 2   0.051489721
## 3  16.459711293
## 4   0.419817401
## 5  -0.269307346
## 6  -0.010691096
## 7  -0.019489863
## 8   4.237120807
## 9  -0.008907628
## 10  0.207301087
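
For reference, the same z-table can be computed in one step, using scale() with explicit center and scale vectors:

z_tab<-scale(cmeans[,2:7], center=popln_mean, scale=popln_sd)
colnames(z_tab)<-paste("z", names(cmeans)[2:7], sep="_")
round(z_tab, 3)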

Wherever we have a very high z-score, it indicates that the cluster is different from the population:

  • Very high z-score for fresh in clusters 8 and 9
  • Very high z-score for milk in clusters 3 and 4
  • Very high z-score for grocery in clusters 4 and 10
  • Very high z-score for frozen products in clusters 1 and 3
  • Very high z-score for detergents paper in clusters 4 and 10
  • Very high z-score for delicassen in clusters 3 and 8

We would now like to find out why these clusters are so different from the population.

IV. MULTI-VARIATE VISUALIZATIONS

To understand the correlations between the spending columns, we draw a scatterplot matrix:

pairs(data[,3:8])   # the six spending columns, excluding Channel and Region

We observe positive correlation between:

  • Milk & Grocery
  • Milk & Detergents_Paper
  • Grocery & Detergents_Paper

Next we will use the ggplot2 library for graphical representations of the data-frame.

We’ll also add the cluster number as a column to the data-frame object “data”.

library(ggplot2)
data=cbind(data, Cluster=sample$cluster)   # append the k-means cluster labels
data$Cluster=as.factor(data$Cluster)

Next we will look at cluster-wise views and see how the patterns differ across clusters.

Milk vs Grocery vs Fresh: cluster-wise analysis

library(RColorBrewer)
p<-ggplot(data,aes(x=Milk,y=Grocery, size=Fresh))+scale_colour_brewer(palette = "Paired")
p+geom_point(aes(colour=Cluster))
  • We notice that when expenditure on milk is high, expenditure on grocery or fresh is high, but not on both
  • We notice that cluster 4 contains the data points at the high end of milk or grocery
  • Cluster 3 has people with high spending on milk and average spending on grocery

Relationship between Milk, Grocery and Fresh across Region across Channel

p+geom_point(aes(colour=Cluster))+facet_grid(Region~Channel)
  • Region 3 has more people than Regions 1 and 2
  • In Region 3 we observe an increasing trend across milk, fresh and grocery
  • In Region 1 we notice an increasing trend between milk and grocery, but fresh is low
  • In Region 2 we notice medium purchases of milk, grocery and fresh
  • High milk/grocery sales with medium fresh sales come through channel 2
  • In channel 2 there is an increasing trend between consumption of milk and consumption of grocery
  • Cluster 4 has high sales of milk, grocery or both
  • Channel 2 contributes to high sales of milk and grocery, but low-to-medium sales of fresh

Milk vs Grocery vs Frozen Products: cluster-wise analysis

library(RColorBrewer)
p<-ggplot(data,aes(x=Milk,y=Grocery, size=Frozen))+scale_colour_brewer(palette = "Paired")
p+geom_point(aes(colour=Cluster))
  • Very high sales of frozen products by clusters 1 and 7
  • People purchasing high quantities of milk and grocery are purchasing low quantities of frozen products

Relationship between Milk, Grocery and Frozen Products across Region

p+geom_point(aes(colour=Cluster))+
  facet_grid(Region~.)
  • In Region 2 and Region 3, clusters 1 and 3 respectively show a high expenditure pattern on frozen products

Relationship between Milk, Grocery and Frozen across Channel

p+geom_point(aes(colour=Cluster))+facet_grid(Channel~.)
  • We notice that channel 1 has many people with a high purchase pattern of frozen products
  • Channel 2 has some clusters (clusters 3 and 4) with a very high purchase pattern of milk

Relationship between Frozen Products, Grocery and Detergents Paper across Region across Channel

p<-ggplot(data,aes(x=Grocery,y=Frozen, size=Detergents_Paper))+scale_colour_brewer(palette = "Paired")
p+geom_point(aes(colour=Cluster))+facet_grid(Region~Channel)
  • In channel 2, people who spend high on grocery spend low on frozen products
  • High sales of detergents paper and grocery are observed through channel 2
  • Sales of frozen products through channel 2 are almost nil
  • Cluster 4 has high expenditure on Detergents_Paper

Relationship between Milk, Delicassen and Detergents Paper across Region

p<-ggplot(data,aes(x=Milk,y=Delicassen))
p+geom_point(aes(colour=Cluster,size=Detergents_Paper))+
  facet_grid(Region~.) 
  • People who spend high on milk hardly spend on Delicassen, though in region 3 we do see comparatively more expenditure on Delicassen
  • Cluster 3 in region 3 has very high expenditure on delicassen and high expenditure on milk
  • Cluster 4 has a high consumption pattern for milk and detergents paper

Relationship between Milk, Grocery and Detergents Paper across Channel

p<-ggplot(data,aes(x=Milk,y=Grocery))
p+geom_point(aes(colour=Cluster, size=Detergents_Paper))+
  facet_grid(Channel~.) 
  • Channel 2 shows an increasing trend between milk and detergents paper
  • Where sales of detergents paper are high, sales of milk are also high
  • Cluster 4 has a high expense pattern on detergents paper or milk

Relationship between Milk, Grocery and Detergents Paper across Region across Channel

p+geom_point(aes(colour=Cluster,size=Detergents_Paper))+facet_grid(Region~Channel)
 
  • There is a linear trend between milk and grocery in channel 2
  • There is a linear trend between grocery and detergents paper
  • Cluster 4 has high consumption of grocery and detergents paper
  • Cluster 10 has medium consumption of milk, grocery and detergents paper
  • Cluster 6 has low consumption of milk, grocery and detergents paper
  • Cluster 2 has the lowest consumption of milk, grocery and detergents paper

Based on the above understanding of cluster-wise trends, we can devise cluster-wise, region-wise and channel-wise strategies to improve sales.

V. DIMENSIONALITY REDUCTION TECHNIQUES

We use dimensionality-reduction techniques like PCA to transform a larger number of independent variables into a smaller set of variables:

Principal Component Analysis

Principal component analysis (PCA) tries to explain the variance-covariance structure of a set of variables through a few linear combinations of these variables. Its general objectives are data reduction and interpretation. Principal components are often more effective in summarizing the variability in a set of variables when those variables are highly correlated.

Also, PCA is normally an intermediate step in data analysis, since the new variables created (the component scores) can be used in subsequent analyses such as multivariate regression and cluster analysis.
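
As a minimal sketch (not part of the case study above), a PCA on the six spending columns can be run with base R’s prcomp:

pca<-prcomp(data[,3:8], center=TRUE, scale.=TRUE)   # centre and scale, as with the clustering above
summary(pca)        # proportion of variance explained by each component
head(pca$x[,1:2])   # scores on the first two principal components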

We will discuss PCA in a future post.
