Principal Component Analysis in R – Walk Through

Principal Component Analysis or PCA is one of the simplest and fundamental techniques used in machine learning. It is perhaps one of the oldest techniques available for dimensionality reduction, and thus, its understanding is of paramount importance for any aspiring Data Scientist/Analyst. An in-depth understanding of PCA in R will not only help in the implementation of effective dimensionality reduction but also help to build the foundation for development and understanding of other advanced and modern techniques.

Examples of Dimension Reduction from 2-D space to 1-D space
Courtesy: Bits of DNA

PCA aims to achieve two primary goals:

1. Dimensionality Reduction

Real-life data has several features generated from numerous sources. However, our machine learning algorithms are not adept enough to handle high dimensions efficiently. Feeding several features all at once almost always leads to poor results, since the models cannot grasp and learn from such a volume altogether. This is called the “Curse of Dimensionality”, and it leads to unsatisfactory results from the models implemented. Principal Component Analysis in R helps resolve this problem by projecting n dimensions onto n−x dimensions (where x is a positive number), preserving as much variance as possible. In other words, PCA in R reduces the number of features by transforming them into a smaller number of projections of themselves.

2. Visualization

Our visualization systems are limited to 2-dimensional space which prevents us from forming a visual idea of the high dimensional features in the dataset. PCA in R resolves this problem by projecting n dimensions to a 2-D environment, enabling sound visualization. These visualizations sometimes reveal a great deal about the data. For instance, the new feature projections may form clusters in the 2-D space which was previously not perceivable in higher dimensions.


Visualization with PCA (n-D to 2-D)
Courtesy: nlpca.org

Intuition

Principal Component Analysis in R works on the simple idea of projecting a higher-dimensional space onto a lower-dimensional one.

The two alternate objectives of Principal Component Analysis are:

1. Variance Maximization Formulation

2. Distance Minimization Formulation

Let us demonstrate the above with the help of simple examples. If you have 2 features and you wish to reduce them to a 1-D feature set using PCA in R, you must look for the direction with maximal spread/variance. This becomes the new direction onto which every data point is projected. The direction perpendicular to it has the least variance and is thus discarded.

Alternatively, if one focuses on the perpendicular distance between a data point and the direction of maximum variance, the objective shifts to minimizing that distance: the smaller the distance, the more faithful the projection.

On completion of these projections, you would have successfully transformed your 2-D data to a 1-D dataset.
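As a small illustration of this projection (a sketch using simulated data and R's built-in prcomp() function; the variable names are hypothetical):

# Simulate two correlated features (a 2-D dataset)
set.seed(42)
x1 <- rnorm(200)
x2 <- 0.8 * x1 + rnorm(200, sd = 0.3)
data_2d <- cbind(x1, x2)

# PCA finds the direction of maximal spread
pca_2d <- prcomp(data_2d, center = TRUE, scale. = TRUE)

# Project every point onto the first principal component (2-D -> 1-D)
projection_1d <- pca_2d$x[, 1]
head(projection_1d)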

Mathematical Intuition

Principal Component Analysis in R locates the direction of maximal spread (or, equivalently, the direction of minimal distance from the data points) with the use of Eigen Vectors and Eigen Values. Every Eigen Vector (Vi) corresponds to an Eigen Value (Ei).

If X is the feature matrix (the matrix of feature values) with mean-centred columns,

covariance matrix S = XᵀX (up to a scaling factor of 1/(n − 1), where n is the number of data points)

If S·Vi = Ei·Vi,

then Ei is an Eigen Value and Vi is the corresponding Eigen Vector.

If there are d dimensions, there will be d Eigenvalues with d corresponding Eigen Vectors, such that:

E1 >= E2 >= E3 >= … >= Ed

each corresponding to V1, V2, V3, …, Vd

Here the vector corresponding to the largest Eigen Value is the direction of maximal spread, since the transformation aligns V1 with the direction of maximal variance in the feature space; Vd, in turn, points along the direction of least variance.

A very interesting property of Eigen Vectors is that any two vectors picked from the set of d vectors are perpendicular (orthogonal) to each other. This follows from the fact that the covariance matrix S is symmetric, and eigenvectors of a symmetric matrix corresponding to distinct eigenvalues are always orthogonal; intuitively, each vector captures a direction of variance not accounted for by the others.

When deciding between two Eigen Vector directions, Eigenvalues come into play. If V1 and V2 are two Eigen Vectors (perpendicular to each other), the values associated with these vectors, E1 and E2, help us identify the “percentage of variance explained” in either direction.

Percentage of variance explained = Ei / (sum of all d Eigen Values), where i is the direction for which we wish to calculate the percentage of variance explained.
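Assuming mean-centred features, this calculation can be sketched in R as follows (the matrix X below is simulated and purely illustrative):

# Hypothetical feature matrix X with d = 3 features and n = 100 rows
set.seed(1)
X <- matrix(rnorm(300), ncol = 3)

# Centre the features and form the covariance matrix S
X_centered <- scale(X, center = TRUE, scale = FALSE)
S <- cov(X_centered)   # equivalent to t(X_centered) %*% X_centered / (n - 1)

# Eigen decomposition: Eigen Values Ei and Eigen Vectors Vi
eig <- eigen(S)
E <- eig$values        # returned in decreasing order: E1 >= E2 >= ... >= Ed
V <- eig$vectors

# Percentage of variance explained by each direction
E / sum(E) * 100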

Implementation

Principal Component Analysis in R can either be applied with manual code using the above mathematical intuition, or it can be done using R’s inbuilt functions.

Even if the mathematical concept failed to leave a lasting impression on your mind, be assured that it is not of great consequence. On the other hand, understanding the basic high-level intuition counts. Without using the mathematical formulas, PCA in R can be easily applied using R’s prcomp() and princomp() functions which can be found here.

In order to demonstrate Principal Component Analysis, we will be using R, one of the most widely used languages in Data Science and Machine Learning. R was initially developed as a tool to aid researchers and scientists dealing with statistical problems in academia. With time, as more individuals from the academic spheres moved into the corporate and industrial sectors, they brought R and its phenomenal uses along with them. As R got integrated into the IT sector, its popularity increased manifold, and several improvements arrived with every new version. Today R has several packages and integrated libraries which enable developers and data scientists to instantly access statistical solutions without having to go into the complicated details of the operations. Principal Component Analysis is one such statistical approach which has been taken care of very well by R and its libraries.

For demonstrating PCA in R, we will be using the Breast Cancer Wisconsin Dataset which can be downloaded from here: Data Link

wdbc <- read.csv("wdbc.csv", header = F)

features <- c("radius", "texture", "perimeter", "area", "smoothness", "compactness", "concavity", "concave_points", "symmetry", "fractal_dimension")

names(wdbc) <- c("id", "diagnosis", paste0(features, "_mean"), paste0(features, "_se"), paste0(features, "_worst"))

These statements read the data into the variable wdbc and assign descriptive names to its 32 columns.

wdbc.pr <- prcomp(wdbc[c(3:32)], center = TRUE, scale = TRUE)
summary(wdbc.pr)

The prcomp() function applies PCA in R to the data in wdbc. This function makes the entire process of implementing PCA as simple as writing a single line of code; the internal operations are taken care of and are even optimized in terms of memory and performance. The range 3:32 tells the function to apply PCA only to columns 3 to 32, i.e. the 30 feature columns. This excludes the sample ID and the diagnosis columns, since the former is merely an identifier and the latter is the target label rather than a feature.

wdbc.pr now stores the values of the principal components.

Let us now visualize the different attributes of the resulting Principal Components for the 30 features:
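The exact plotting code from the original post is not reproduced here; a comparable scree plot and cumulative-variance view can be obtained from wdbc.pr with a sketch like this:

# Proportion of variance explained by each principal component
pve <- wdbc.pr$sdev^2 / sum(wdbc.pr$sdev^2)

# Scree plot of the component variances (Eigen Values)
screeplot(wdbc.pr, type = "lines", npcs = 15, main = "Screeplot of the first 15 PCs")
abline(h = 1, col = "red", lty = 5)

# Cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")
abline(h = 0.9, col = "blue", lty = 5)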

A plot of this kind looks like the following:

Plot of variance explained by the principal components (Image Courtesy: Towards Data Science)

This plot clearly demonstrates that the first 6 components (those with an Eigen Value > 1) account for about 90% of the variance in the dataset. This means that one can replace the 30 original features with just these 6 principal components and still preserve roughly 90% of the information.

Limitations of PCA

Even though Principal Component Analysis in R is a highly intuitive technique, it comes with some notable limitations.

1. Loss of Variance: If the percentage of variance explained by the chosen axes is only around 50-60%, then 40-50% of the information that contributes to the variance of the dataset is lost during dimensionality reduction. This often happens when the data is roughly spherical or bulging in nature.

2. Loss of Clusters: If the original dataset contains several clusters that are separated mainly along directions perpendicular to the chosen direction, points from different clusters are projected onto the same region of the chosen axis. The result is a single apparent cluster of data points that are, in fact, quite different in nature.

3. Loss of Data Patterns: If the dataset forms a nice wavy pattern along the direction of maximal spread, PCA projects all the points onto the line aligned with that direction. Data points that originally formed a wave are thus collapsed onto a one-dimensional space and the pattern is lost.

These cases demonstrate how PCA in R, even though very effective for certain datasets, can be a weak instrument for dimensionality reduction or visualization on others. To resolve these limitations to a certain extent, t-SNE, another dimensionality reduction algorithm, is used. Stay tuned to our blogs for a similar, well-guided walkthrough of t-SNE.

Follow this link, if you are looking to learn data science online!

You can follow this link for our Big Data course, which is a step further into advanced data analysis and processing!

Furthermore, if you want to read more about data science, read our Data Science Blogs

Top 5 Statistical Concepts for Data Scientists

Introduction

Data science is a comprehensive blend of maths, business and technology. One has to go from data inference to algorithm development, and then all the way to using the available technology to draw out solutions for complex problems. At its heart, all we have is data, and our inferences can only be drawn once we start mining it. In the end, data science uses multiple mathematical techniques to extract the business value present in the data of various enterprises.

On a very broad level, data science comprises 3 important components, namely maths or statistics, computer science and information science. A very strong statistical background is necessary if one is to pursue a career in data science. Organisations prefer data scientists with strong statistical knowledge, as statistics is one of the key components providing insights to leading businesses worldwide.

In this blog, we will understand 5 important statistical concepts for data scientists. Let us understand them one by one in the next section.

 

Statistics and Data Science

Let us discuss the role of statistics in data science before beginning our journey into the math world!

In data science, you can always find statistics and computer science competing against each other for ultimate supremacy. This happens, in particular, in the areas concerning data acquisition and enrichment for predictive modelling.

But somewhere, statistics has the upper hand, as most of the computer science applications in data science are ultimately its derivatives.

Statistics, though a key player in data science, is not a solo player in any way. The real essence of data science is obtained by combining statistics with algorithms and mathematical modelling methods. Ultimately, a balanced combination is required to generate a successful data science solution.

 

Important Concepts in Data Science

 

1. Probability Distributions

A probability distribution is a characteristic that defines the likelihood of a random variable taking each of its feasible values. In other words, the values of the variable vary according to an underlying spread of likelihoods.

Suppose you draw a random sample and are measuring the income of the individuals. You can start building a distribution of income as you keep collecting the data. Distributions are important in scenarios where we need to identify outcomes with high likelihood and want to measure their predicted/potential values over a range.
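As a small sketch of this idea (the income figures below are simulated and purely illustrative), one could build up an income distribution in R like this:

# Simulate incomes for 1000 sampled individuals; a log-normal shape is a
# common assumption for income data, and the parameters here are made up
set.seed(7)
income <- rlnorm(1000, meanlog = 10, sdlog = 0.5)

# Empirical distribution of the collected data
hist(income, breaks = 40, main = "Distribution of sampled incomes", xlab = "Income")

# Estimated likelihood of observing an income above a chosen threshold
mean(income > 50000)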

 

2. Dimensionality Reduction

In machine learning classification problems, there are often too many factors on the basis of which the final classification is done. These factors are essentially the so-called characteristic variables, or features. The greater the number of features, the more difficult it becomes to visualize and operate on the training set. Moreover, many of these features are correlated and therefore redundant. This is where dimensionality reduction algorithms come into play. Dimensionality reduction can be seen as a method of eliminating a large number of variables to reach a smaller subset, i.e. the variables that matter more than the others. It can be split into feature selection and feature extraction.

An easy email classification problem can be used as an intuitive instance of dimensionality reduction, where we need to identify whether an email is spam or not. This can involve a large number of characteristics, such as whether or not the email mentions a specific name, the content of the email, whether or not the email uses a template, and so on. Some of these characteristics, however, may overlap. In another situation, a classification problem based on both humidity and rainfall may collapse into just one underlying feature, since the two are highly correlated. In such problems, therefore, we can reduce the number of features, as the short sketch below illustrates.
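To make the humidity/rainfall example concrete (a sketch with simulated, hypothetical data), PCA shows that two highly correlated features can be collapsed into essentially one:

# Two strongly correlated (hypothetical) features
set.seed(123)
humidity <- runif(500, 40, 90)
rainfall <- 0.9 * humidity + rnorm(500, sd = 3)

weather <- data.frame(humidity, rainfall)
weather_pca <- prcomp(weather, center = TRUE, scale. = TRUE)

# The first component captures nearly all the variance, so the two
# features can be reduced to a single underlying one
summary(weather_pca)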

 

3. Over and Under-Sampling

In data science, we work with datasets representing some entities. Ideally, all the classes of interest should have comparable representation in the dataset, which is not always the case. To cope with this, we have oversampling and undersampling as two measures in data science. These are data mining techniques that modify unequal classes to create balanced sets; they are also known as resampling techniques.

When one class is the under-represented minority in the data sample, over-sampling methods can be used to replicate its examples so that the training set contains a more balanced number of them. Oversampling is used when the quantity of collected data is inadequate. SMOTE (Synthetic Minority Over-sampling Technique) is a common oversampling method that produces synthetic samples by interpolating between existing minority-class instances.

Conversely, if one class is the over-represented majority, undersampling can be used to balance it against the minority class. Undersampling is used when an adequate quantity of data has been gathered. Common undersampling techniques include cluster centroids, which replace groups of majority-class points with their centroids, targeting potentially overlapping regions of the feature space in order to reduce the amount of majority-class data.

Simple duplication of records is seldom recommended in either case. Oversampling is generally preferable, since undersampling can lead to the loss of significant information. Undersampling is suggested when the quantity of data gathered is larger than necessary, as it can help keep data mining tools within the limits of what they can process efficiently.
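SMOTE itself lives in add-on packages, but the basic idea of random over- and undersampling can be sketched with base R alone (the class labels and sizes below are hypothetical):

# Hypothetical imbalanced dataset: 900 majority rows, 100 minority rows
set.seed(99)
df <- data.frame(x = rnorm(1000),
                 class = c(rep("majority", 900), rep("minority", 100)))

minority <- df[df$class == "minority", ]
majority <- df[df$class == "majority", ]

# Random oversampling: replicate minority rows (with replacement) up to 900
balanced_over <- rbind(majority,
                       minority[sample(nrow(minority), 900, replace = TRUE), ])

# Random undersampling: keep only 100 randomly chosen majority rows
balanced_under <- rbind(majority[sample(nrow(majority), 100), ], minority)

table(balanced_over$class)
table(balanced_under$class)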

 

4. Bayesian Statistics

Bayesian statistics is an alternative paradigm to the frequentist paradigm in statistics. It works on the principle of updating a pre-existing belief about random events: the belief gets updated as new data or evidence about those events comes in.

Bayesian inference revolves around interpreting probability as one measure to evaluate the confidence of the occurrence of a particular event.

We may hold a prior belief about an event, but when fresh evidence comes to light, our beliefs are likely to change. Bayesian statistics provides a solid mathematical means of integrating our prior beliefs with that evidence to generate new, posterior beliefs.

In other words, Bayesian statistics offers methods to update our beliefs about the occurrence of an event in the light of new data or evidence.

This contrasts with another type of inferential statistics, known as classical or frequentist statistics, which holds that probabilities are the frequencies of specific random outcomes occurring in a long run of repeated trials.

For example, when we toss a coin repeatedly, we find that the relative frequency of heads (or tails) settles to a value close to 0.5.

Frequentist and Bayesian statistics thus rest on different ideologies. In frequentist statistics, conclusions are drawn from outcomes observed over a large number of repeated trials, whereas in Bayesian statistics our belief is updated with every new piece of evidence.

By offering point predictions, frequentist statistics attempts to eliminate uncertainty. Bayesian statistics attempts to maintain and refine uncertainty by updating prior beliefs with fresh evidence.
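For the coin example above, a Beta-Binomial model gives a simple sketch of Bayesian updating in R (the prior and the observed counts below are assumed purely for illustration):

# Prior belief about the probability of heads: Beta(2, 2), centred on 0.5
prior_alpha <- 2
prior_beta  <- 2

# New evidence: 100 tosses with 58 heads (made-up data)
heads <- 58
tails <- 42

# The posterior is again a Beta distribution with updated parameters
post_alpha <- prior_alpha + heads
post_beta  <- prior_beta + tails

# Posterior mean estimate of the probability of heads
post_alpha / (post_alpha + post_beta)

# Compare the posterior and prior densities
curve(dbeta(x, post_alpha, post_beta), from = 0, to = 1, ylab = "Density")
curve(dbeta(x, prior_alpha, prior_beta), add = TRUE, lty = 2)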

 

5. Descriptive Statistics

This is the most prevalent of all types. It offers the analyst within the company a perspective on key metrics and measures. Exploratory data analysis, unsupervised learning, clustering and basic data summaries all fall under descriptive statistics. Descriptive statistics have many uses, most notably helping us become familiar with a data set. For any analysis, descriptive statistics are generally the starting point, and they often help us come up with hypotheses that can subsequently be tested with more formal inference.

Descriptive statistics are essential because it would be difficult to see what the data show if we merely displayed the raw values, particularly when there are a lot of them. Descriptive statistics therefore let us present the data in a more meaningful way, allowing it to be interpreted more easily. For example, if we had the marks of 1000 students on the SAT exam, we might be interested in the students' overall performance, and also in the spread or distribution of the marks. All of the above-mentioned tasks and visualisations come under the idea of descriptive statistics.

Let’s take an example. Suppose you want to measure the demographics of the customers a retail giant is catering to, and the retail giant is interested in understanding the variance present in the customer attributes and their shopping behaviours. For all these tasks, descriptive statistics is a blessing!
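For the marks example above (a sketch with simulated scores; the numbers are made up), base R's descriptive summaries look like this:

# Hypothetical marks of 1000 students on an exam
set.seed(2020)
marks <- round(rnorm(1000, mean = 1050, sd = 150))

# Measures of central tendency and spread
summary(marks)
sd(marks)

# Visual summary of the distribution of marks
hist(marks, breaks = 30, main = "Distribution of exam marks", xlab = "Marks")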

 

Conclusion

In this blog, we had a look at the 5 most important concepts in statistics that every data scientist should know about. Although we discussed them in some detail, these are not the only techniques in statistics. There are many more, and they are good to know!

Follow this link, if you are looking to learn data science online!

You can follow this link for our Big Data course!

Additionally, if you are having an interest in learning Data Science, click here to start the Online Data Science Course

Furthermore, if you want to read more about data science, read our Data Science Blogs