Company: Amdocs Profile: SQL, UNIX Developer (Ordering and Billing) Designation: System Engineer Domain: Telecom Experience: 2 years
Company: TSystems Designation: Consultant Profile: Data Science Designation: Data Scientist Domain: Telecom
My journey into Data Science
Why Data Science?
Day by day, the technology is evolving. I didn’t see myself getting the career growth I wanted with the technology I was working on before (SQL/UNIX). As anyone in IT would know, having a job as a System Engineer in SQL/UNIX these days is very mundane. I thought I wouldn’t survive for long. That was the main motivation to keep myself updated with the latest tech.
When it came to choosing the new tech, I found myself being more keen towards Data Science. It’s very interesting and insightful. It called to my intellectual side. It’s like you’re creatively playing with data and getting business results.
Why Dimensionless?
It’s a funny story. When I decided to go with Data Science, I enrolled in a classroom course mainly because I was never comfortable with Online Classes. It was mostly theory and my experience was neutral. Since I was already working as a System Engineer using SQL, I had the database part covered at my end.
It was only when I started giving interviews that I realized their course curriculum and faculty were sub-par. Many of the interview questions were not even covered in the lectures. When I went back to them with these doubts, they said it was out of syllabus. Then I tried to learn by myself. I checked out some free courses.
In one of my interviews, I met this guy. He was a fellow candidate and he seemed pretty confident. We got to talking and he told me that he did a Data Science specialization course from Dimensionless. He was so satisfied with the course that I could feel his genuineness. Obviously, I got very excited to know more about Dimensionless.
Next thing I did was I spoke to their counsellors and joined in the next batch itself.
Experience with Dimensionless?
TBH, before taking up this course with Dimensionless, I was convinced that I can only learn properly in a physical classroom. I thought physical classrooms provide more support and are more accessible. Now I am much more comfortable in online training. Online or offline, if the teachers are good and doubts are handled, it doesn’t matter. It was so comfortable to attend classes from anywhere.
I knew why that guy in the interview was so happy with Dimensionless. The doubt-solving was quick. Teachers were available on call too. The sessions had a lot of communication and they were interactive overall. The course content was practical and easy to follow too.
Career Transition to Data Science
After completing the course in 5 months, I did some more self-study because I wondered why anyone would select me over an experienced Data Science professional. I delayed applying for jobs. Finally, after some moral support from their mentors and mock interviews with Dimensionless HR, I built up the courage to apply for jobs and give interviews. Among other companies, I applied at TSystems, Wipro and Capgemini, and I got selected in all three!!! Imagine my excitement.
When I started giving interviews I realized that interviewers judge you based on knowledge and not past experience.
Do you also want career transition like Ruchi? Follow this link, and make it possible with Dimensionless Techademy!
Furthermore, if you want to read more about data science, you can read our blogs here.
Company: Ericsson Profile: Network Engineer Domain: Telecom
Company: Ericsson Profile: Data Science Analyst Domain: Telecom
Company: Affine Analytics Designation: Data Science Associate Profile: Sr. Business Analyst Domain: Retail
My journey into Data Science
Why Data Science?
In my previous profile, the work was very manual and monotonous, doing the same thing time and again. I was not satisfied with my work. I knew I had to change something. At that time, I was considering Data Science as well as Big Data since these two have the maximum scope and good pay as well. Maths and Stats were always my strong points and I am technically strong too. Considering this, Data Science looked like an exciting journey.
Why Dimensionless?
I tried to learn by myself through other online learning classes. I also took a course through Udemy. The pre-recorded videos were a drag. It was not interactive and I had a lot of doubts. That is when I came across Dimensionless.
Compared to other courses, this one had a detailed syllabus with 200 hours Live and Interactive training. I was sure about joining this course as soon as I attended the Demo. I took a lot of other courses but there was something or the other missing in them. With Dimensionless, I found all of it in one place.
Experience with Dimensionless?
It was exactly as I wanted it, a good mix of theory and practical. The classes were very interactive and teachers were always available for doubt-solving. They also helped me with my additional self-studies. I went to them with topics that were not in the syllabus and still got support. There was ample pre-recorded content as well, that we could refer to after the live classes.
Career Transition to Data Science
I’ll be honest, I didn’t get through any interviews at first. The mentors at Dimensionless gave me feedback on my performance at the interviews. The career mentoring facility helped me understand what I was interested in and which jobs I should apply to accordingly. The HR guided me on which companies and jobs to apply to and helped me weigh the advantages and disadvantages of each profile.
This started giving me confidence. Finally, I got shortlisted by multiple companies, one of them through Dimensionless. I joined Affine Analytics with an almost 70% hike from my previous job. I am thankful to Dimensionless for all of this.
I was an Electronics Engineer in Aerospace and I couldn’t see any growth in my domain. The opportunities to learn new things were limited, which led to no growth, and the work became less and less exciting to me every day.
I realized I had to switch to software and upgrade my skills as per the demand to stay relevant throughout my career. I spoke with many of my peers, did some research, and found a few career choices viable as per the market right now. Data Science looked like an interesting career choice but I still remember having so many doubts!
Why Dimensionless?
As I was researching, I came across Dimensionless on Google and enrolled for a Demo session. I asked them all the questions and doubts I had about taking up Data Science. I literally bombarded them with questions like… How difficult is it going to be without having much of programming knowledge? Is having no previous work-experience okay? And does it count? How does Data Science fit in my domain (Aerospace)?
They answered all of it with patience and logic. I also had a one-on-one career counselling session with their counsellor.
Then there were other things to consider, like if I can attend the classes regularly, if the fee is viable, if I can get back to studies after such a long break, etc.
So, I went for the Experience and Pay option. Attended the classes for 2 weeks, I liked their methodology, and since I could understand what I was learning, I found myself attending lectures regularly along with my work and without being too stressed. And then, I continued and completed the entire course.
Experience with Dimensionless?
The course structure and methodology were not too stressful, and neither was the teaching pace. Doubt-solving was immediate. I could get my doubts solved during the class, in the doubt-solving sessions or even one-on-one with the respective teacher. And trust me, coming from a non-programming background, I had a lot of doubts. Mentors and teachers were always available to answer them.
Career Transition to Data Science
Resume-building sessions made me understand how to steer my career towards Data Science in Aerospace. When mentors started giving us projects, I got to choose projects from my domain so I could build upon my experience. This helped me get ready for interviews more than anything. Knowing theory is one thing, but the interviewers ask very technical and practical questions.
Only about 70% of the course was done when I applied for an internal switch at my company and got accepted. In fact, I even got a promotion and didn’t have to apply anywhere else.
As we move towards a data-driven world, we tend to realize how the power of analytics could unearth the most minute details of our lives. From drawing insights from data to making predictions of some unknown scenarios, both small and large industries are thriving under the power of big data analytics.
A-Z Glossary
There are various terms, keywords, and concepts associated with Analytics. This field of study is broad, and hence, it could be overwhelming to know each one of them. This blog covers some of the critical concepts in analytics from A-Z and explains the intuition behind them.
A: Artificial Intelligence – AI is the field of study which deals with the creation of intelligent machines that could behave like humans. Some of the widespread use cases where Artificial Intelligence has found its way are ChatBots, Speech Recognition, and so on.
There are two main types of Artificial Intelligence: Narrow AI and Strong AI. A poker-playing program is an example of weak or narrow AI, where you feed all the instructions into the machine. It is trained to understand every scenario of that one task and is incapable of performing anything else on its own.
On the other hand, a Strong AI thinks and acts like a human being. It is still far-fetched, and a lot of work is being done to achieve ground-breaking results.
B: Big Data – The term Big Data is quite popular and is being used frequently in the analytical ecosystem. The concept of big data came into being with the advent of the enormous amount of unstructured data. The data is getting generated from a multitude of sources which bears the properties of volume, veracity, value, and velocity.
Traditional file storage systems are incapable of handling such volumes of data, and hence companies are looking into distributed computing to mine such data. Industries which make full use of big data are way ahead of their peers in the market.
C: Customer Analytics – Delivering relevant offers to customers based on their behavior is known as Customer Analytics. Understanding the customer’s lifestyle and buying habits would ensure better prediction of their purchase behaviors, which would eventually lead to more sales for the company.
The accurate analysis of customer behavior would increase customer loyalty. It could reduce campaign costs as well. The ROI would increase when the right message is delivered to each segmented group.
D: Data Science – Data Science is a holistic term which involves a lot of processes which includes data extraction, data pre-processing, building predictive models, data visualization, and so on. Generally, in big companies, the role of a Data Scientist is well defined unlike in startups where you would need to look after all the aspects of an end-to-end project.
To be a Data Scientist, you need to be fluent in Probability, and Statistics as well, which makes it a lucrative career. There are not many qualified Data Scientists out there, and hence mastering the relevant skills could put you in a pole position in the job market.
E: Excel – An old, and yet the most sought-after, visualization tool in the market is Microsoft Excel. Excel is used in a variety of ways while presenting data to stakeholders. Its graphs and charts clearly demonstrate the work done, which makes it easier for the business to take relevant decisions.
Moreover, Excel has a rich set of utilities which could be useful in analyzing structured data. Most companies still need personnel with the knowledge of MS Excel, and hence, you must master it.
F: Financial Analytics – Financial Data such as accounts, transactions, etc., are private and confidential to an individual. Banks refrain from sharing such sensitive data as it could breach privacy and lead to financial damage of a customer.
However, such data, if used ethically, could save losses for a bank by identifying potentially fraudulent behavior. It could also be used to predict the probability of a loan default. Credit scoring is another such use case of financial analytics.
G: Google Analytics – For analyzing website traffic, Google provides a free tool known as Google Analytics. It is useful to track any marketing campaign which would give an idea about the behavior of customers.
There are four levels at which Google Analytics collects data: User level, which captures each user’s actions; Session level, which monitors each individual visit; Page view level, which gives information about each page viewed; and Event level, which covers button clicks, video views, and so on.
H: Hadoop – The framework most commonly used to store and manipulate big data is known as Hadoop. Because the work is distributed across many machines, data is processed fast in Hadoop.
Moreover, replicating data across multiple nodes of a cluster protects against loss of data and provides more flexibility. It is also cheaper, and could easily be scaled to handle more data.
I: Impala – Impala is a SQL query engine for data stored in Hadoop. Written in C++ and Java, it lets users query data in HDFS using SQL, and it is generally faster than Hive for interactive queries. Additionally, Impala can read several different file formats.
J: Journey Analytics – The analysis of the sequence of interactions a customer has with a company, in relation to a specific business goal, is referred to as Journey Analytics. It compiles a customer’s interactions with the company over time into a single journey.
K: K-means clustering – Clustering is a technique where you group a dataset into some small groups based on the similar properties shared among the members of the same group.
K-Means clustering is one such clustering algorithm, where an unlabeled dataset is split into k groups or clusters. K-Means clustering could be used to group customers or products that share similar properties.
L: Latent Dirichlet Allocation – LDA or Latent Dirichlet Allocation is a technique used on textual data in use cases such as topic modeling. LDA imagines a set of topics, each represented by a set of words. Then, it maps all the documents to the topics such that those imaginary topics capture the words in each document.
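As a rough illustration of the idea, here is a minimal sketch using base R's kmeans() function. The customer data (spend and visits) is made up purely for this example, and the features are left unscaled to keep the sketch short; on real data you would normally standardise them first.

```r
# Toy data: two well-separated groups of "customers" described by
# annual spend and number of visits (made-up values for illustration).
set.seed(42)
customers <- rbind(
  cbind(spend = rnorm(20, mean = 100, sd = 10), visits = rnorm(20, mean = 5, sd = 1)),
  cbind(spend = rnorm(20, mean = 500, sd = 10), visits = rnorm(20, mean = 30, sd = 1))
)

# Split the unlabeled rows into k = 2 clusters.
# nstart = 10 reruns the algorithm from 10 random starts and keeps the best fit.
fit <- kmeans(customers, centers = 2, nstart = 10)

fit$centers         # one row per cluster: mean spend and mean visits
table(fit$cluster)  # how many customers fell into each cluster
```

The cluster numbers themselves (1 or 2) are arbitrary labels; what matters is which rows end up grouped together.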
M: Machine Learning – Machine Learning is a field of Data Science which deals with building predictive models to make better business decisions.
A machine or a computer is first trained on a set of historical data so that it finds patterns in it, and then predicts the outcome on an unseen test set. There are several algorithms used in Machine Learning, one such being K-means clustering.
N: Neural Networks – Deep Learning is the branch of Machine Learning which thrives on large, complex volumes of data and is used in cases where traditional algorithms are incapable of producing excellent results.
Under the hood, the architecture behind Deep Learning is Neural Networks, which is quite similar to the neurons in the human brain.
O: Operational Analytics – The analytics behind the business, which focuses on improving the present state of operations, is referred to as Operational Analytics.
Various data aggregation and data mining tools are used to provide transparent information about the business. Experts in this field use the knowledge provided by operational software to perform targeted analysis.
P: Pig –Apache Pig is a component of Hadoop which is used to analyze large datasets by parallelized computation. The language used is called Pig Latin.
Several tasks, such as Data Management could be served using Pig Latin. Data checking and filtering could be done efficiently and quickly with Pig.
Q: Q-Learning – It is a model-free reinforcement learning algorithm which learns a policy that tells an agent which action to take under specific circumstances. It can handle problems with stochastic transitions and rewards without requiring adaptations.
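The way Q-learning learns can be written compactly as an update rule. The sketch below is the standard textbook form; the symbols (state s, action a, reward r, learning rate α, discount factor γ) are notation introduced here for illustration and do not come from the entry above.

```latex
% After taking action a_t in state s_t, receiving reward r_{t+1}
% and landing in state s_{t+1}, the value estimate is nudged
% towards the reward plus the best estimated future value.
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]
```

The rule is called model-free because it only needs observed transitions and rewards, not a model of how the environment behaves.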
R: Recurrent Neural Networks –RNN is a neural network where the input to the current step is the output from the previous step.
It is used in cases such as text summarization, where predicting the next word requires remembering the previous words. RNNs address this with a hidden state that recalls sequence information.
S: SQL – One of the essential skills in analytics is Structured Query Language or SQL. It is used in RDBMS to fetch data from tables using queries.
Most companies use SQL for their initial data validation and analysis. Some of the standard SQL operations used are joins, sub-queries, window functions, etc.
T: Traffic Analytics –The study of analyzing a website’s source of traffic by looking into its clickstream data is known as traffic analytics. It could help in understanding whether direct, social, paid traffic, etc., are bringing in more users.
U: Unsupervised Machine Learning –The type of machine learning which deals with unlabeled data is known as unsupervised machine learning.
Here, no labels are provided for the corresponding set of features, and information is grouped based on the similarity in the properties shared by the members of each group. Some of the unsupervised algorithms are PCA, K-Means, and so on.
V: Visualization –The analysis of data is useless if not presented in the forms of graphs and charts to the business. Hence, Data visualization is an integral part of any analytics project and also one of the key steps in data pre-processing and feature engineering.
W: Word2vec – It is a neural network used for text processing which takes in text as input and outputs a set of feature vectors for the words.
Some of the applications of word2vec are in genes, social media graphs, likes, etc. In a vector space, similar words are grouped using word2vec without the need for human intervention.
X: XGBoost – Boosting is a technique in machine learning in which weak learners are trained sequentially, each one correcting the errors of the previous ones, so that together they form a strong learner.
XGBoost is one such boosting algorithm which is robust to outliers, or NULL values. It is the go-to algorithm in Machine Learning competitions for its speed and accuracy.
Y: Yarn – YARN is a component of Hadoop which sits between HDFS and the processing engines. It monitors the processing operations on individual cluster nodes.
The dynamic allocation of resources is also handled by it, which improves application performance and resource utilization.
Z: Z-test – A type of hypothesis test used to determine whether to reject or fail to reject the null hypothesis. The test is based on the z-score, which measures how many standard deviations a value lies from the mean.
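To make the calculation concrete, here is a minimal one-sample Z-test sketch in base R. All the numbers are made up for illustration, and the population standard deviation is assumed to be known, which is what distinguishes a Z-test from a t-test.

```r
# One-sample z-test sketch (population standard deviation assumed known).
# All numbers below are hypothetical.
sample_mean <- 52   # mean observed in the sample
mu0         <- 50   # mean under the null hypothesis
sigma       <- 10   # known population standard deviation
n           <- 100  # sample size

# z = how many standard errors the sample mean lies from mu0
z <- (sample_mean - mu0) / (sigma / sqrt(n))

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z))

z        # 2
p_value  # about 0.0455
```

At a 5% significance level this p-value would lead us to reject the null hypothesis; at a 1% level we would fail to reject it.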
Conclusion
In this blog post, we covered some of the terms related to analytics, one starting with each letter of the English alphabet.
If you are willing to learn more about Analytics, follow the blogs and courses of Dimensionless.
Principal Component Analysis or PCA is one of the simplest and fundamental techniques used in machine learning. It is perhaps one of the oldest techniques available for dimensionality reduction, and thus, its understanding is of paramount importance for any aspiring Data Scientist/Analyst. An in-depth understanding of PCA in R will not only help in the implementation of effective dimensionality reduction but also help to build the foundation for development and understanding of other advanced and modern techniques.
PCA aims to achieve two primary goals:
1. Dimensionality Reduction
Real-life data has several features generated from numerous resources. However, our machine learning algorithms are not adept enough to handle high dimensions efficiently. Feeding several features, all at once, almost always leads to poor results since the models cannot grasp and learn from such volume altogether. This is called the “Curse of Dimensionality” which leads to unsatisfactory results from the models implemented. Principal Component Analysis in R helps resolve this problem by projecting n dimensions to n-x dimensions (where x is a positive number), preserving as much variance as possible. In other words, PCA in R reduces the number of features by transforming the features into a lesser number of projections of themselves.
2. Visualization
Our visualization systems are limited to 2-dimensional space which prevents us from forming a visual idea of the high dimensional features in the dataset. PCA in R resolves this problem by projecting n dimensions to a 2-D environment, enabling sound visualization. These visualizations sometimes reveal a great deal about the data. For instance, the new feature projections may form clusters in the 2-D space which was previously not perceivable in higher dimensions.
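A small sketch of such a 2-D projection is shown below. It uses R's built-in iris dataset as a stand-in (an assumption made here so the example runs without any download) and plots each row by its first two principal component scores.

```r
# Project the four numeric iris measurements onto the first two principal components.
pr <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
scores <- pr$x[, 1:2]   # coordinates of every row in the 2-D PC space

# Colour points by species to see whether groups separate in 2-D.
plot(scores, col = as.integer(iris$Species), pch = 19,
     xlab = "PC1", ylab = "PC2", main = "iris projected onto PC1 and PC2")
legend("topright", legend = levels(iris$Species), col = 1:3, pch = 19)
```

The species labels are used only for colouring; PCA itself never sees them.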
Intuition
Principal Component Analysis in R works with the simple idea of projection of a higher space to a lower space or dimension.
The two alternate objectives of Principal Component Analysis are:
1. Variance Maximization Formulation
2. Distance Minimization Formulation
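The formulas for these two objectives are not preserved in the text. One common textbook way to write them is sketched below, assuming n mean-centered data points x_i and a unit-length direction u; this notation is introduced here and may differ from what the original post used.

```latex
% Variance maximization: pick the unit direction u along which the
% projected data has the largest variance.
\max_{\|u\| = 1} \; \frac{1}{n} \sum_{i=1}^{n} \left( u^{T} x_i \right)^2

% Distance minimization: pick the unit direction u that minimizes the
% squared perpendicular distance from each point to the line along u.
\min_{\|u\| = 1} \; \frac{1}{n} \sum_{i=1}^{n} \left\| x_i - \left( u^{T} x_i \right) u \right\|^2
```

The two objectives select the same direction. By the Pythagorean theorem, the squared length of each x_i equals its squared projection onto u plus its squared perpendicular distance from the line, and since the squared length is fixed, maximizing one term minimizes the other.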
Let us demonstrate the above with the help of simple examples. If you have 2 features, and you wish to reduce the features to a 1-D feature set using PCA in R, you must look out for the direction with maximal spread/variance. This becomes the new direction on which every data point is projected. The direction perpendicular to this direction has the least variance, and is thus discarded.
Alternately, if one focuses on the perpendicular distance between a data point and the direction of maximum variance, our objective shifts to the minimization of that distance. This is because the smaller the distance, the more faithful the projection is to the original data.
On completion of these projections, you would have successfully transformed your 2-D data to a 1-D dataset.
Mathematical Intuition
Principal Component Analysis in R locates the direction of maximal spread (or direction of minimal distance from data points) with the use of Eigen Vectors and Eigen Values. Every Eigen Vector ($V_i$) corresponds to an Eigen Value ($E_i$).
If $X$ is a feature matrix (matrix with the feature values, with each column mean-centered),
covariance matrix $S = X^{T} X$ (up to a constant scaling factor).
If $S V_i = E_i V_i$,
then $E_i$ is an Eigen Value, and $V_i$ becomes the corresponding Eigen Vector.
If there are d dimensions, there will be d Eigenvalues with d corresponding Eigen Vectors, such that:
$E_1 \geq E_2 \geq E_3 \geq E_4 \geq \dots \geq E_d$
Each corresponding to $V_1, V_2, V_3, \dots, V_d$
Here the vector corresponding to the largest Eigenvalue is the direction of Maximal spread since rotation occurs such that V1 is aligned with maximal variance in the feature space. Vd here has the least variance in its direction.
A very interesting property of these Eigen Vectors is that any two of them are perpendicular to each other. This happens because the covariance matrix S is symmetric, and the Eigen Vectors of a symmetric matrix (for distinct Eigen Values) are orthogonal. Each successive vector therefore captures variance in a direction the earlier ones do not.
When deciding between two Eigen Vector directions, Eigenvalues come into play. If V1 and V2 are two Eigen Vectors (perpendicular to each other), the values associated with these vectors, E1 and E2, help us identify the “percentage of variance explained” in either direction.
Percentage of variance explained = $E_i / (E_1 + E_2 + \dots + E_d)$, expressed as a percentage, where i is the direction for which we wish to calculate the percentage of variance explained.
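The steps above can be sketched by hand in base R. The example below uses the built-in USArrests dataset as a stand-in for any numeric feature table (an assumption made so the sketch runs on its own), and cross-checks the result against prcomp().

```r
# Manual PCA on a built-in dataset (USArrests stands in for any numeric feature table).
X <- scale(USArrests, center = TRUE, scale = TRUE)  # centre and standardise each column

# Covariance matrix of the standardised features.
# cov() divides by (n - 1); this rescales the Eigen Values of t(X) %*% X
# but leaves the Eigen Vectors unchanged.
S <- cov(X)

eig <- eigen(S)   # Eigen Values are returned in decreasing order
E <- eig$values
V <- eig$vectors  # column i is the Eigen Vector for E[i]

# Percentage of variance explained by each direction: E_i / sum(E)
pct_explained <- 100 * E / sum(E)
round(pct_explained, 1)

# Eigen Vectors of a symmetric matrix are mutually perpendicular:
# t(V) %*% V is (numerically) the identity matrix.
round(crossprod(V), 10)

# Cross-check against prcomp(): its sdev^2 equals the Eigen Values.
pr <- prcomp(USArrests, center = TRUE, scale. = TRUE)
all.equal(pr$sdev^2, E)
```

Individual Eigen Vectors may differ in sign between eigen() and prcomp(); the direction they describe is the same either way.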
Implementation
Principal Component Analysis in R can either be applied with manual code using the above mathematical intuition, or it can be done using R’s inbuilt functions.
Even if the mathematical concept failed to leave a lasting impression on your mind, be assured that it is not of great consequence. On the other hand, understanding the basic high-level intuition counts. Without using the mathematical formulas, PCA in R can be easily applied using R’s prcomp() and princomp() functions which can be found here.
In order to demonstrate Principal Component Analysis, we will be using R, one of the most widely used languages in Data Science and Machine Learning. R was initially developed as a tool to aid researchers and scientists dealing with statistical problems in the academic field. With time, as more individuals from the academic spheres started seeping into the corporate and industrial sectors, they brought along R and its phenomenal uses along with them. As R got integrated into the IT sector, its popularity increased manifold and several revisions were made with the release of every new version. Today R has several packages and integrated libraries which enables developers and data scientists to instantly access statistical solutions without having to go into the complicated details of the operations. Principal Component Analysis is one such statistical approach which has been taken care of very well by R and its libraries.
For demonstrating PCA in R, we will be using the Breast Cancer Wisconsin Dataset which can be downloaded from here: Data Link
The data is first read into the variable wdbc. PCA is then applied as follows:
wdbc.pr <- prcomp(wdbc[c(3:32)], center = TRUE, scale. = TRUE)
summary(wdbc.pr)
The prcomp() function helps to apply PCA in R on the data variable wdbc. This function of R makes the entire process of implementing PCA as simple as writing just one line of code. The internal operations are taken care of for you. The range 3:32 is used to tell the function to apply PCA only on the features or columns which lie in the range of 3 to 32. This excludes the first two columns: the sample ID, which is only an identifier, and the diagnosis, which is the target variable rather than a feature.
wdbc.pr now stores the values of the principal components.
Let us now visualize the different attributes of the resulting Principal Components for the 30 features:
screeplot(wdbc.pr, type = "l", npcs = 15, main = "Screeplot of the first 15 PCs")
This plot demonstrates that the first 6 components account for about 90% of the variance in the dataset (with Eigen Value > 1). This means that one can replace the 30 original features with just 6 principal components, dropping the remaining 24, and still preserve about 90% of the variance in the data.
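Rather than reading the cut-off from the plot, the number of components to keep can also be computed. The sketch below uses the built-in mtcars dataset as a stand-in for wdbc (an assumption, so that it runs without downloading anything); the same lines should apply to wdbc.pr unchanged.

```r
# Stand-in data: mtcars (11 numeric columns) replaces wdbc here so the
# sketch runs without downloading anything.
pr <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Proportion of variance per component, and its running total.
var_explained <- pr$sdev^2 / sum(pr$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components whose cumulative variance reaches 90%.
n_keep <- which(cum_var >= 0.90)[1]
n_keep

# Rule of thumb mentioned in the text: components with Eigen Value > 1.
sum(pr$sdev^2 > 1)

# Reduced dataset: keep only the first n_keep component scores.
reduced <- pr$x[, seq_len(n_keep), drop = FALSE]
dim(reduced)
```

The two criteria (90% cumulative variance and Eigen Value > 1) do not always agree, so it is worth checking both before settling on a number.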
Limitations of PCA
Even though Principal Component Analysis in R is a highly intuitive technique, it has certain notable limitations.
1. Loss of Variance: If the percentage of variance explained along the chosen axis is around 50-60%, then 40-50% of the information which contributes to the variance of the dataset is lost during dimensionality reduction. This happens often when the data is spherical or bulging in nature.
2. Loss of Clusters: If there are several clusters present in the original dataset, but they are separated mainly along the direction perpendicular to the chosen direction, all the points from different clusters will be projected to the same region on the line of the chosen direction. This leads to one apparent cluster of data points which are in fact quite different in nature.
3. Loss of Data Patterns: If the dataset forms a wavy pattern along the direction of maximal spread, PCA projects all the points onto the line aligned with that direction. Thus, data points which formed a wave are flattened onto a one-dimensional space and the pattern is lost.
These demonstrate how PCA in R, even though very effective for certain datasets, can be a weak instrument for dimensionality reduction or visualization on others. To resolve these limitations to a certain extent, t-SNE, which is another dimensionality reduction algorithm, is used. Stay tuned to our blogs for a similar and well-guided walkthrough in t-SNE.