Harvard Business Review called data scientist “the sexiest job of the 21st century.” Data scientists are problem solvers and analysts: they identify patterns, spot trends, and make new discoveries, often using real-time data, machine learning, and AI. This is where a data science course comes into the picture.
There is strong demand for data analysts and qualified data scientists. Projections from IBM suggest that demand for data scientists will grow by 28% by 2020, with about 2.7 million openings for data professionals in the United States alone. In addition, powerful software has given us greater access to detailed analyses.
To meet this demand, Dimensionless Tech offers an online data science course and big data training with extensive course coverage, case studies, and fully hands-on sessions with personal attention to each learner. This review is a gold mine of insights. We provide only live, instructor-led online training, not classroom training.
About Dimensionless Technologies
Dimensionless Technologies is a training firm providing live online training in data science. Courses include data science with R and Python, deep learning, and big data analytics. It was founded in 2014 by two IITians, Himanshu Arora and Kushagra Singhania, with the goal of offering quality data science training at an affordable cost. Dimensionless provides a range of live online data science classes. Dimensionless aims to help learners overcome their constraints by giving them the right skill set, with the right methodology, in a flexible and adaptable way at the right time, helping them make informed business decisions and build a successful career.
Why Dimensionless Technologies
Experienced Faculty and Industry experts
Data science is a vast field, and a comprehensive grasp of the subject requires a lot of effort. With our experienced faculty, we are committed to imparting quality, practical knowledge to all learners. With more than 10 years of industry experience in data science, our faculty is well suited to guide students on their path to success. Our trainers also have strong academic backgrounds (they are IITians)!
End to End domain-specific projects
We at Dimensionless believe that concepts are learned best when the theory taught in the classroom is actually implemented. With our meticulously designed courses and projects, we make sure our students get hands-on experience with projects ranging from the pharma, retail, and insurance domains to banking and financial-sector problems. End-to-end projects make sure that students understand the entire problem-solving lifecycle in data science.
Up to date and adaptive courses
All our courses have been developed based on recent trends in data science, and we have made sure to include the industry requirements for data scientists. Courses start from level 0 and assume no prerequisites. They take learners gradually from basic introductions to advanced concepts, with the constant assistance of our experienced faculty. Courses cover the concepts in enough depth that learners are never left wanting more. Our courses have something for everyone, whether you are a beginner or a professional.
Resource assistance
Dimensionless Technologies has all the hardware required for anything from running a regression to training a deep neural network. Our online lab gives learners a platform on which they can run all their projects. A laptop with a bare-minimum configuration (2 GB RAM and Windows 7) is enough to pave your way into the world of deep learning. Pre-configured environments save learners the time of installing the required tools, and all the required software is preloaded to speed up learning.
Live and interactive sessions
Dimensionless delivers classes through live interactive sessions on our platform. All classes are taught live by instructors and are not pre-recorded. This format lets learners keep learning from the comfort of their own homes. You don’t need to spend time or money on travel and can take classes from any location you prefer. After each class, we also provide the recording to all learners so that they can go through it again. All trainers are available after class to clear doubts as well.
Lifetime access to study materials
Dimensionless provides lifetime access to the learning material provided in the course. Many other course providers give access only while classes are in progress. With all the resources available afterwards, our students’ learning does not stop even after they have completed the course.
Placement assistance
Dimensionless Technologies provides placement assistance to all its students. With highly experienced faculty and contacts in the industry, we make sure our students get their data science job and kick-start their careers. We help at every stage of placement: from resume building to final interviews, Dimensionless Technologies is by your side to help you achieve your goals.
Course completion certificate
Apart from the training, we issue a course completion certificate once the training is complete. The certificate adds credibility to learners’ resumes and helps them land their dream data science jobs.
Small batch sizes
We keep our batch sizes small. A small batch lets us focus on students individually and give them a better learning experience. With personalized attention, we make sure students learn as much as possible, and it helps us clear all their doubts as well.
Conclusion
If you want to start a career in data science, Dimensionless Technologies has the right courses for you. Not only are all the key concepts and techniques covered, they are also applied to real-world business problems.
You can follow this link for our Big Data course! This course will equip you with the exact skills required. Packed with content, this course teaches you all about AWS tools and prepares you for your next ‘Data Engineer’ role.
Additionally, if you are interested in learning data science, click here to start the Online Data Science Course.
Furthermore, if you want to read more about data science, read our Data Science Blogs.
Reports suggest that around 2.5 quintillion bytes of data are generated every single day. As online usage grows at a tremendous rate, there is an immediate need for data science professionals who can clean the data, obtain insights from it, visualize it, train models, and eventually come up with solutions using big data for the betterment of the world.
By 2020, experts predict that there will be more than 2.7 million data science and analytics job openings. Given the breadth of the entire data science pipeline, it is tiresome for a single person to perform, let alone excel, at every stage. Hence, data science offers a plethora of career options that require a wide spectrum of skill sets.
Let us explore the top 5 data science career options in 2019 (In no particular order).
1. Data Scientist
Data Scientist is one of the most in-demand job roles. The day-to-day responsibilities involve the examination of big data. As part of that analysis, data scientists also clean and organize the data. They know the machine learning algorithms well and understand when to use each one. Through data analysis and the output of machine learning models, they identify patterns that help solve the business problem.
This role is crucial in any organisation because the company makes business decisions based on the insights discovered by the Data Scientist in order to gain an edge over its competitors. Note that the Data Scientist role leans towards the technical domain. As the role demands a wide range of skills, Data Scientists are among the highest-paid professionals.
Core Skills of a Data Scientist
Communication
Business Awareness
Database and querying
Data warehousing solutions
Data visualization
Machine learning algorithms
2. Business Intelligence Developer
BI Developer is a job role inclined more towards the Non-Technical domain but has a fair share of Technical responsibilities as well (if required) as a part of their day to day responsibilities. BI developers are responsible for creating and implementing business policies as a result of the insights obtained from the Technical team.
Apart from being policymakers who use dedicated (or custom) Business Intelligence analytics tools, they also do a fair share of coding in order to explore the dataset and present its insights visually. They help bridge the gap between the technical team, which works with the deepest technical understanding, and the clients, who want the results in the most non-technical form. They are expected to generate reports from the insights and make them ‘less technical’ for others in the organisation. BI Developers typically have a deeper understanding of the business than Data Scientists.
Core Skills of a Business Intelligence Developer
Business model analysis
Data warehousing
Design of business workflow
Business Intelligence software integration
3. Machine Learning Engineer
Once the data is clean and ready for analysis, machine learning engineers use it to train a predictive model that predicts the target variable. These models are used to analyze future trends in the data so that the organisation can make the right business decisions. As a real-life dataset usually has many dimensions, it is difficult for the human eye to draw insights from it. This is one of the reasons for training machine learning algorithms, as they deal easily with such complex datasets. These engineers carry out a number of tests and analyze the outcomes of the model.
The reason for constantly testing the model on various samples is to check the accuracy of the developed model. Apart from training models, they sometimes also perform exploratory data analysis in order to understand the dataset completely, which in turn helps them train better predictive models.
Core Skills of Machine Learning Engineers
Machine Learning Algorithms
Data Modelling and Evaluation
Software Engineering
4. Data Engineer
The pipeline of any data-oriented company begins with the collection of big data from numerous sources. That’s where data engineers operate in any given project. These engineers integrate data from various sources and optimize it according to the problem statement. The work usually involves writing queries on big data for easy and smooth accessibility. Their day-to-day responsibility is to provide a streamlined flow of big data from various distributed systems. Data engineering differs from the other data science careers in that it concentrates on the systems and hardware that support the company’s data analysis, rather than on the analysis of the data itself. Data engineers provide the organisation with efficient warehousing methods as well.
Core Skills of Data Engineer
Database Knowledge
Data Warehousing
Machine Learning algorithm
5. Business Analyst
Business Analyst is one of the most essential roles in the data science field. These analysts are responsible for understanding the data and its related trends after a decision has been made about a particular product. They store a good amount of data about various domains of the organisation. This data is really important because if any product of the organisation fails, the analysts work on it to understand the reason behind the failure. This type of analysis is vital for all organisations as it helps them understand the loopholes in the company. The analysts not only trace the loophole but also provide solutions for it, making sure the organisation makes the right decision in the future. At times, the business analyst acts as a bridge between the technical team and the rest of the working community.
Core skills of Business Analyst
Business awareness
Communication
Process Modelling
Conclusion
The data science career options mentioned above are in no particular order. In my opinion, every career option in the data science field is complementary to the others. In any data-driven organization, regardless of the salary, every role is important at its respective stage in a project.
Data science is a booming industry, with potentially millions of job openings by 2020, according to the latest analyst’s business predictions. But what if you want to learn data science without the heavy cost of a postgraduate degree or the US university MOOC specialization? What is the best way to prepare for this upcoming wave of opportunity and maximize your chances for a 100K+ USD (annual) job? Well – there are many challenges that stand before you in such a case. Not only is the market saturated with an abundance of existing fresh talent, but most of the training you receive in college has no relationship to the actual type of work you get on the job. With so many engineering graduates passing out every year from so many established institutions such as the IITs, how can you hope to realistically compete? Well – there is one possibility you can choose if you wish to stand out from the rest of the competition – high-quality data science programs or courses. And in this article, we are going to list the top ten advantages of choosing such a course compared to other options, like a Ph.D., or an online MOOC Specialization from a US university (which are very tempting options, especially if you have the money for them).
Top Ten Advantages of Data Science Certification
1. Stick to Essentials, Cut the Fluff.
Now if you are a professional data scientist, no one expects you to derive any AI algorithms from first principles. You also don’t need to extensively dig into the (relatively) trivial history behind each algorithm, nor learn SVD (Singular Value Decomposition) or Gaussian Elimination on a real matrix without a computer to assist you. There is so much material that an academic degree covers that is never used on the job! Yes, you need to have an intuitive idea about the algorithms. But unless you’re going in for ML research, there’s not much use in knowing, say, Jacobians or Hessians in depth. Professional data scientists work in very different domains compared with their academic counterparts. Learn what you need on the job. If you try to cover everything mentioned in class, you’ve already lost the race. Focus on learning the bare essentials thoroughly. You always have Google and StackOverflow to assist you as long as you’re not writing an exam!
2. Learning from Instructors with Work Experience, not PhD scientists!
Now, from whom should you receive training? From PhD academics who have published extensively but never worked on a real professional project, or from instructors with real-life professional project experience? Very often, the teachers and instructors in colleges and universities belong to the former category. Instructors with industry experience are rare and difficult to find, and you are remarkably fortunate if you are studying under one. They will be able to teach you in the context of real-life job experience, which is exactly what you need the most.
3. Working with the Latest Technology Stacks.
Now, who would be better able to land you a job – teachers who teach what they studied ten years ago, or professionals who work with the latest tools available in the industry? It’s undoubtedly true that the people with industry experience can help you to choose what technologies you should learn and master. Academics, in comparison, could even be working with technology stacks over ten years old! Please try to stick with instructors who have work experience.
4. Individual Attention.
In a college or a MOOC with thousands of students, it’s simply not possible for each student to get individual attention. However, in data science programs, it is true that every student will receive individual attention tailored to their needs, which is exactly what you need. Every student is different and will have their own understanding of the projects available. This customized attention that is available when batch sizes are less than 30-odd is the greatest advantage such students have over college and MOOC students.
5. GitHub Project Portfolio Guidance.
Every college lecturer will advise you to develop a GitHub project portfolio, but they cannot give your individual profile genuine attention. They have too many students and too many demands on their time to be able to review individual project portfolios and actually mentor you in designing and establishing your own. However, data science programs are different, and it is genuinely possible for the instructors to mentor you individually in designing your project portfolio. Experienced industry professionals can even help you identify ‘niches’ within your field in which you can shine and carve out a special brand for your own project specialties, so that you can really distinguish yourself from the rest of your competition.
6. Mentoring even After Getting Placed in a Company and Working by Yourself.
Trust me, no college professor will be able or even available to help you once you get placed within the industry, since your domains will be so different. However, it’s a very different story with industry professionals who become instructors. You can contact them for guidance even after placement, which is simply not something most academic professors will be able to offer unless they too have industry experience, which is very rare.
7. Placement Assistance.
People who have worked in the industry will know the importance of having company referrals in the placement process. It is one thing to have a cold call with a company with no internal referrals. Having someone already established within the company you apply to can be the difference between a successful and unsuccessful recruitment process. Every industry professional will have contacts in many companies, which puts them in a unique position to aid you at the time of placement opportunities.
8. Learn Critical but Non-Technical Job Skills, such as Networking, Communication, and Teamwork
While it is important to know the basics, one reason why brilliant students do badly in the industry after they get a job is the lack of soft skills like communication and teamwork. A job in the industry is so much more than the bare skills studied in class. You need to be able to communicate effectively and to work well in teams. Industry professionals can guide you here, but professors who have never worked in the industry have no experience in this area. Professionals will know how to guide you in this aspect of your expertise, since they have been in that position and have learnt the necessary skills through their own job experience.
9. Reduced Cost Requirements
It is one thing to be able to sponsor your own PhD doctoral fees. It is quite another thing to learn the very same skills for less than 1% of the cost of a PhD degree in, say, the USA. Not only is it financially less demanding, but you also don’t have to worry about being able to pay off massive student loans through industry work and fat paychecks, often at the cost of compromising your health or your family needs. Why take a Rs. 75 lakh student loan, when you can get the same outcome from a course less than 0.5% of the price? The takeaways will still be the same! In most cases, you will even receive better training through the data science program than an academic qualification because your instructors will have job experience.
10. Highly Reduced Time Requirements
A PhD degree takes, on average, 5 years. A data science program gets you job-ready in a few months’ time. Why don’t you decide which is better for you? This is especially true when you already have job experience in another domain or you are more than 23-25 years old, and doing a full PhD program could put you on the wrong side of 30 with almost no job experience. Please go for the data science program, since the time spent working in your 20s is critical for most companies hiring today: they consider you to be a good ‘cultural fit’ for the company environment, especially when you have less than 3-4 years of experience.
Summary
Thus, it’s easy to see that in so many ways, a data science program can be much better for you than a data science degree. So, the critical takeaway from this article is that there is no need to spend Rs. 75 lakh (Rs. 7,500,000) or more for skills which you can acquire for Rs. 35,000 at most. It really is a no-brainer. These data science programs really offer true value for money. In case you’re interested, please do check out the following data science programs, each of which has every one of the advantages listed above:
In my previous post, we covered uni-variate analysis as an initial stage of the data-exploration process. In this post, we will cover bi-variate analysis. The objective of bi-variate analysis is to understand the relationship between each pair of variables using statistics and visualizations. We need to analyze the relationship between:
the target variable and each predictor variable
two predictor variables
Why Analyze Relationship between Target Variable and each Predictor Variable
It is important to analyze the relationship between the predictor and the target variables to understand the trend for the following reasons:
The bi-variate analysis and our model should tell the same story. This helps in understanding and analysing the accuracy of our models, and in making sure that our model has not over-fit the training data.
If the data has too many predictor variables, we should include in our regression models only those predictor variables that show some trend with the target variable. Our aim with the regression models is to understand the story each significant variable is telling, and its behaviour with other predictor variables and the target variable. A variable that shows no pattern with the target variable may not have a direct relation with it (although a transformation of this variable might).
If we understand the correlations and trends between the predictor and the target variables, we can arrive at better transformations of predictor variables faster, and so get more accurate models faster. Eg. a curve that rises steeply at first and then flattens indicates a logarithmic relation between x and y. To make such a relation linear, we take the logarithm of the predictor variable (equivalently, we could raise 10 to the power of the target variable). Hence, a simple scatter plot can give us the best estimate of the variable transformations required to arrive at the appropriate model.
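As a minimal sketch of the transformation idea, the R snippet below simulates a logarithmic relation and compares a straight-line fit before and after taking the log of the predictor. The data are synthetic, and the names `x`, `y`, `fit_raw` and `fit_log` are illustrative assumptions rather than part of the Hitters example.

```r
# Synthetic data (assumption): y depends on log10(x) plus a little noise
set.seed(42)
x <- seq(1, 1000, length.out = 200)
y <- 3 * log10(x) + rnorm(200, sd = 0.1)

# Fit a straight line on the raw predictor and on the log-transformed predictor
fit_raw <- lm(y ~ x)
fit_log <- lm(y ~ log10(x))

# Compare R-squared: the transformed predictor should fit much better
r2_raw <- summary(fit_raw)$r.squared
r2_log <- summary(fit_log)$r.squared
print(c(raw = r2_raw, log = r2_log))
```

Plotting `y` against `x` and then against `log10(x)` would show the same thing visually: a curve in the first plot, a near-straight line in the second.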
Why Analyze Relationship between 2 Predictor Variables
It is important to understand the correlations between each pair of predictor variables. Correlated predictors lead to multi-collinearity. Essentially, two correlated variables carry the same information, and hence one of them is redundant. Multi-collinearity leads to inflated standard errors for the coefficient estimates and wider confidence intervals (reflecting greater uncertainty in the estimates). When there are too many variables in a data-set, we use techniques like PCA for dimensionality reduction. Dimensionality reduction techniques combine correlated variables to remove the redundant information, so that we run our modelling algorithms on the components that explain the maximum variance.
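The effect of a redundant predictor can be sketched in a few lines of R. This is an illustration on synthetic data (an assumption, not the Hitters data): `x2` is built as a near copy of `x1`, and we compare the standard error of the `x1` coefficient with and without it, then look at how PCA (via base R's `prcomp`) concentrates the variance.

```r
# Synthetic data (assumption): x2 is almost a copy of x1, x3 is independent
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)
x3 <- rnorm(n)
y  <- 2 * x1 + x3 + rnorm(n)

# Standard error of the x1 coefficient with and without the redundant x2
se_with    <- summary(lm(y ~ x1 + x2 + x3))$coefficients["x1", "Std. Error"]
se_without <- summary(lm(y ~ x1 + x3))$coefficients["x1", "Std. Error"]
print(c(with_x2 = se_with, without_x2 = se_without))

# PCA on the scaled predictors: share of variance explained by each component
pca <- prcomp(cbind(x1, x2, x3), scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))
```

The last principal component should explain almost no variance, which is the sense in which one of the two correlated variables is redundant.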
Method to do Bi-Variate Analysis
We have understood why bi-variate analysis is an essential step in data exploration. Now we will discuss the techniques for bi-variate analysis.
For illustration, we will use the Hitters data-set from library ISLR. This is Major League Baseball data from the 1986 and 1987 seasons. It is a data frame with 322 observations of major league players on 20 variables. The target variable is Salary, while the remaining 19 are predictor variables. Through this data-set, we will demonstrate bi-variate analysis between:
two continuous variables
one continuous and one categorical variable
two categorical variables
Following is the summary of all the variables in Hitters data-set:
library(ISLR) #the Hitters data-set ships with this library
data(Hitters)
hitters=Hitters
summary(hitters)
##      AtBat            Hits         HmRun            Runs
##  Min.   : 16.0   Min.   :  1   Min.   : 0.00   Min.   :  0.00
##  1st Qu.:255.2   1st Qu.: 64   1st Qu.: 4.00   1st Qu.: 30.25
##  Median :379.5   Median : 96   Median : 8.00   Median : 48.00
##  Mean   :380.9   Mean   :101   Mean   :10.77   Mean   : 50.91
##  3rd Qu.:512.0   3rd Qu.:137   3rd Qu.:16.00   3rd Qu.: 69.00
##  Max.   :687.0   Max.   :238   Max.   :40.00   Max.   :130.00
##
##       RBI             Walks            Years            CAtBat
##  Min.   :  0.00   Min.   :  0.00   Min.   : 1.000   Min.   :   19.0
##  1st Qu.: 28.00   1st Qu.: 22.00   1st Qu.: 4.000   1st Qu.:  816.8
##  Median : 44.00   Median : 35.00   Median : 6.000   Median : 1928.0
##  Mean   : 48.03   Mean   : 38.74   Mean   : 7.444   Mean   : 2648.7
##  3rd Qu.: 64.75   3rd Qu.: 53.00   3rd Qu.:11.000   3rd Qu.: 3924.2
##  Max.   :121.00   Max.   :105.00   Max.   :24.000   Max.   :14053.0
##
##      CHits            CHmRun           CRuns             CRBI
##  Min.   :   4.0   Min.   :  0.00   Min.   :   1.0   Min.   :   0.00
##  1st Qu.: 209.0   1st Qu.: 14.00   1st Qu.: 100.2   1st Qu.:  88.75
##  Median : 508.0   Median : 37.50   Median : 247.0   Median : 220.50
##  Mean   : 717.6   Mean   : 69.49   Mean   : 358.8   Mean   : 330.12
##  3rd Qu.:1059.2   3rd Qu.: 90.00   3rd Qu.: 526.2   3rd Qu.: 426.25
##  Max.   :4256.0   Max.   :548.00   Max.   :2165.0   Max.   :1659.00
##
##      CWalks        League  Division    PutOuts          Assists
##  Min.   :   0.00   A:175   E:157    Min.   :   0.0   Min.   :  0.0
##  1st Qu.:  67.25   N:147   W:165    1st Qu.: 109.2   1st Qu.:  7.0
##  Median : 170.50                    Median : 212.0   Median : 39.5
##  Mean   : 260.24                    Mean   : 288.9   Mean   :106.9
##  3rd Qu.: 339.25                    3rd Qu.: 325.0   3rd Qu.:166.0
##  Max.   :1566.00                    Max.   :1378.0   Max.   :492.0
##
##      Errors          Salary       NewLeague
##  Min.   : 0.00   Min.   :  67.5   A:176
##  1st Qu.: 3.00   1st Qu.: 190.0   N:146
##  Median : 6.00   Median : 425.0
##  Mean   : 8.04   Mean   : 535.9
##  3rd Qu.:11.00   3rd Qu.: 750.0
##  Max.   :32.00   Max.   :2460.0
##                  NA’s   :59
Before we conduct the bi-variate analysis, we’ll separate continuous and factor variables and perform basic cleaning through the following code:
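A minimal sketch of such a separation step is given here on a small toy data frame, so that it runs without any extra library. The data frame `toy` and the helper names are assumptions for illustration; the cleaning shown (dropping rows with missing values) is also an assumption, chosen because the frequency counts later in this post add up to 263 players, i.e. 322 minus the 59 missing salaries.

```r
# Toy data frame (assumption) standing in for Hitters: numeric and factor columns, one NA
toy <- data.frame(
  AtBat    = c(293, 315, 479, 496),
  Hits     = c(66, 81, 130, 141),
  Salary   = c(NA, 475, 480, 500),
  League   = factor(c("A", "N", "A", "N")),
  Division = factor(c("E", "W", "W", "E"))
)

# Basic cleaning: drop rows that contain missing values
toy_clean <- na.omit(toy)

# Split the columns by type
is_num     <- sapply(toy_clean, is.numeric)
toy_cont   <- toy_clean[, is_num, drop = FALSE]   # continuous variables
toy_factor <- toy_clean[, !is_num, drop = FALSE]  # factor variables

str(toy_cont)
str(toy_factor)
```

Applied to the Hitters data in the same way, these steps would produce objects playing the role of `hitters_cont` and `hitters_factor` used in the rest of this post.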
Please note that the method of bi-variate analysis is the same irrespective of whether a variable is a predictor or the target.
Bi-Variate Analysis between Two Continuous Variables
To do the bi-variate analysis between two continuous variables, we look at the scatter plot of the two variables. The pattern of the scatter plot indicates the relationship between them. As long as there is a pattern, a transformation can be applied to the predictor or target variable to achieve a linear relationship for modelling purposes. If no pattern is observed, there is likely no relationship between the two variables. The strength of the linear relationship between two continuous variables can be quantified using the Pearson correlation. A correlation coefficient of -1 indicates perfect negative correlation, 0 indicates no linear correlation, and 1 indicates perfect positive correlation.
Correlation is simply the co-variance normalized by the standard deviations of both variables, which ensures we get a number between -1 and +1. Co-variance is difficult to compare because it depends on the units of the two variables, so we prefer to use correlation.
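The relation between co-variance and correlation can be checked directly in R. The sketch below uses synthetic vectors `a` and `b` (an assumption for illustration) and compares the manual formula, co-variance divided by the product of the standard deviations, with the built-in `cor()`. It also rescales one variable to show that the co-variance changes with units while the correlation does not.

```r
# Synthetic data (assumption): two positively related variables
set.seed(7)
a <- rnorm(100, mean = 50, sd = 10)
b <- 0.8 * a + rnorm(100, sd = 5)

# Correlation = co-variance / (sd of a * sd of b)
cov_ab      <- cov(a, b)
cor_manual  <- cov_ab / (sd(a) * sd(b))
cor_builtin <- cor(a, b)

# Changing the units of a changes the co-variance but not the correlation
cov_scaled <- cov(a * 100, b)
cor_scaled <- cor(a * 100, b)

print(c(manual = cor_manual, builtin = cor_builtin, scaled = cor_scaled))
print(c(cov = cov_ab, cov_scaled = cov_scaled))
```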
Please note:
* If two variables are linearly related, they should have a high Pearson correlation coefficient.
* If two variables are correlated, it does not necessarily mean they are linearly related, because correlation is strongly affected by outliers. Eg. two scatter plots can have the same correlation even though no linear relation is present in the second one.
* If two variables are related, it does not mean they are necessarily correlated. Eg. for y=x^2-1, x and y are related but not correlated (correlation coefficient of 0):
x=c(-1, -.75, -.5, -.25, 0, .25, .5, .75, 1)
y=x^2-1
plot(x, y, col="darkgreen", type="l")
cor(x,y)
## [1] 0
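The point about outliers can be illustrated with R's built-in `anscombe` data. Using this data-set here is an assumption made for illustration; it is not necessarily what the original graphs showed. Set 1 is roughly linear, while in set 4 every x value is identical except for a single outlying point, yet the two correlations are almost the same.

```r
# Built-in Anscombe data: set 1 is roughly linear, set 4 is driven by one outlier
data(anscombe)
cor_linear  <- cor(anscombe$x1, anscombe$y1)
cor_outlier <- cor(anscombe$x4, anscombe$y4)
print(round(c(linear = cor_linear, outlier = cor_outlier), 3))

# Side-by-side scatter plots make the difference visible
op <- par(mfrow = c(1, 2))
plot(anscombe$x1, anscombe$y1, main = "Roughly linear")
plot(anscombe$x4, anscombe$y4, main = "Driven by one outlier")
par(op)
```

This is why the correlation value should always be read together with the scatter plot.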
For our hitters illustration, following are the correlations and the scatter-plots:
This gives the correlation-coefficients between the continuous variables in hitters data-set. Since it is difficult to analyse so many values, we prefer to attain a quick visualization of correlation through scatter plots. Following command gives the scatter plots for the first 4 continuous variables in hitters data-set:
pairs(hitters_cont[1:4], col="brown")
Observations:
A linear pattern can be observed between AtBat and Hits, and can be confirmed from the correlation value = 0.96
A linear pattern can be observed between Hits and Runs, and can be confirmed from the correlation value = 0.91
A linear pattern can be observed between AtBat and Runs, and can be confirmed from the correlation value = 0.899
To get a scatter plot and correlation between two continuous variables:
Correlation value of .96 verifies our claim of strong positive correlation. We can obtain better visualizations of the correlations through library corrgram and corrplot:
library(corrgram)
corrgram(hitters)
Strong blue means strong +ve correlation. Strong red means strong negative correlation. Dark color means strong correlation. Weak color means weak correlation.
To find correlation between each continuous variable
library(corrplot)
continuous_correlation=cor(hitters_cont)
corrplot(continuous_correlation, method="circle", type="full", is.corr=T, diag=T)
This gives a good visual representation of the correlation and relationship between variables, especially when the number of variables is high. Dark blue and large circle represents high +ve correlation. Dark red and large circle represents high -ve correlation. Weak colors and smaller circles represent weak correlation.
Bi-Variate Analysis between Two Categorical Variables
2-way Frequency Table: We can make a 2-way frequency table to understand the relationship between two categorical variables.
2-Way Frequency Table
head(hitters_factor)
##                   League Division NewLeague
## -Alan Ashby            N        W         N
## -Alvin Davis           A        W         A
## -Andre Dawson          N        E         N
## -Andres Galarraga      N        E         N
## -Alfredo Griffin       A        W         A
## -Al Newman             N        E         A
tab=table(League=hitters_factor$League, Division=hitters_factor$Division) #gives the frequency count
tab
##       Division
## League  E  W
##      A 68 71
##      N 61 63
This gives the frequency count of League vs Division
Chi-Square Test of Association: This test is used to check whether there is an association between two categorical variables.
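As a minimal sketch, the test can be run with base R's `chisq.test` on the League vs Division counts shown in the frequency table above. The counts are re-entered by hand here so that the snippet is self-contained; the object names `tab` and `res` are illustrative. Note that for a 2x2 table `chisq.test` applies Yates' continuity correction by default.

```r
# Frequency counts copied from the League vs Division table
tab <- matrix(c(68, 61, 71, 63), nrow = 2,
              dimnames = list(League = c("A", "N"), Division = c("E", "W")))

# H0: League and Division are independent
res <- chisq.test(tab)
print(res)

# A large p-value means we fail to reject independence
print(res$p.value)
```

Here the observed counts are very close to what independence would predict, so the test gives no evidence of an association between League and Division.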
Fluctuation Plot
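library(ggplot2)
runningcounts.df=as.data.frame(tab) #assumed: long-format version of tab with columns League, Division, Freq
ggplot(runningcounts.df, aes(League, Division)) +
  geom_point(aes(size = Freq, color = Freq), shape = 15) + #stat and position are not aesthetics, so they are left at their defaults
  scale_size_continuous(range = c(3,15)) +
  scale_color_gradient(low = "white", high = "black") +
  theme_bw()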
Dark Color represents higher frequency and lighter colors represent lower frequencies.
Bi-Variate Analysis between a Continuous Variable and a Categorical Variable
Aggregations can be obtained using the functions xtabs and aggregate, or using the dplyr library. E.g., in the Hitters data-set, we will use the factor variable “Division” and the continuous variable “Salary”.
Aggregation of Salary Division-Wise
xtabs(hitters$Salary ~ hitters$Division) # gives the division-wise sum of salaries
## hitters$Division
##        E        W
## 80531.01 60417.50
aggregate(hitters$Salary, by=list(hitters$Division), mean, na.rm=T) # gives the division-wise mean of salaries
##   Group.1        x
## 1       E 624.2714
## 2       W 450.8769
library(dplyr) # provides %>%, group_by and summarise
hitters%>%group_by(Division)%>%summarise(Sum_Salary=sum(Salary, na.rm=T), Mean_Salary=mean(Salary, na.rm=T), Min_Salary=min(Salary, na.rm=T), Max_Salary=max(Salary, na.rm=T))
## # A tibble: 2 × 5
##   Division Sum_Salary Mean_Salary Min_Salary Max_Salary
##     <fctr>      <dbl>       <dbl>      <dbl>      <dbl>
## 1        E   80531.01    624.2714       67.5       2460
## 2        W   60417.50    450.8769       68.0       1900
T-Test: A 2-sample T-test (paired or unpaired) can be used to understand whether there is a relationship between a continuous and a categorical variable. The 2-sample T-test can be used for categorical variables with only two levels. For more than two levels, we will use Anova.
E.g., conduct a T-test on the Hitters data-set to check whether Division (a predictor factor variable) has an impact on Salary (continuous target variable).
H0: The mean salary for Divisions E and W is the same, so division has little impact on salary.
HA: The mean salary is different, so division has a significant impact on salary.
df=data.frame(hitters$Division, hitters$Salary)
head(df)
##   hitters.Division hitters.Salary
## 1                W          475.0
## 2                W          480.0
## 3                E          500.0
## 4                E           91.5
## 5                W          750.0
## 6                E           70.0
library(dplyr)
df%>%filter(hitters.Division=="W")%>%data.frame()->sal_distribution_W
df%>%filter(hitters.Division=="E")%>%data.frame()->sal_distribution_E
x=sal_distribution_E$hitters.Salary
y=sal_distribution_W$hitters.Salary
t.test(x, y, alternative="two.sided", mu=0)
##
##  Welch Two Sample t-test
##
## data:  x and y
## t = 3.145, df = 218.46, p-value = 0.001892
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   64.73206 282.05692
## sample estimates:
## mean of x mean of y
##  624.2714  450.8769
“two.sided” indicates a two-tailed test. The p-value of 0.19% (0.001892) is well below the usual 5% significance level, so we can reject the null hypothesis: there is a significant difference in mean salary based on division. Hence, division does make an impact on the salary.
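The manual split into x and y can also be avoided: t.test accepts a formula of the form continuous ~ factor when the factor has exactly two levels. The sketch below uses a synthetic data frame named toy (made-up values, not the Hitters data) only to show that both calls run the same Welch test.

```r
# Synthetic stand-in for the data (values are made up for illustration)
set.seed(42)
toy <- data.frame(
  Division = factor(rep(c("E", "W"), times = c(40, 45))),
  Salary   = c(rlnorm(40, meanlog = 6.2, sdlog = 0.8),
               rlnorm(45, meanlog = 5.9, sdlog = 0.8))
)

# Manual split into two vectors, as in the text
x <- toy$Salary[toy$Division == "E"]
y <- toy$Salary[toy$Division == "W"]
manual <- t.test(x, y, alternative = "two.sided", mu = 0)

# Formula interface: Salary explained by the two-level factor Division
by_formula <- t.test(Salary ~ Division, data = toy)

# Both calls produce the same test statistic and p-value
all.equal(unname(manual$statistic), unname(by_formula$statistic))
all.equal(manual$p.value, by_formula$p.value)
```

On the real data, the equivalent call would be t.test(Salary ~ Division, data = hitters).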
Anova: Analysis of variance (ANOVA) is a collection of statistical models used to analyze the differences among group means and their associated procedures (such as “variation” among and between groups).
Anova
anova=aov(Salary~Division, data = hitters)
summary(anova)
##              Df   Sum Sq Mean Sq F value  Pr(>F)
## Division      1  1976102 1976102   10.04 0.00171 **
## Residuals   261 51343011  196717
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value is below 1%, the difference between the average salaries of the two divisions is significant.
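For a factor with only two levels, one-way Anova and the pooled-variance (equal variance) T-test are the same test: the F value equals the square of the t value. The Welch T-test used earlier does not pool the variances, which is why its t of 3.145 squares to about 9.89 rather than exactly the F value of 10.04. The sketch below checks the identity on a synthetic data frame named toy (made-up values, not the Hitters data).

```r
# Synthetic two-group data for illustration only
set.seed(1)
toy <- data.frame(
  Division = factor(rep(c("E", "W"), times = c(30, 35))),
  Salary   = c(rlnorm(30, meanlog = 6.3, sdlog = 0.7),
               rlnorm(35, meanlog = 6.0, sdlog = 0.7))
)

# One-way Anova: extract the F value and p-value for Division
fit     <- aov(Salary ~ Division, data = toy)
f_value <- summary(fit)[[1]][["F value"]][1]
f_pval  <- summary(fit)[[1]][["Pr(>F)"]][1]

# Pooled-variance T-test (var.equal = TRUE switches off the Welch correction)
pooled  <- t.test(Salary ~ Division, data = toy, var.equal = TRUE)
t_value <- unname(pooled$statistic)

all.equal(f_value, t_value^2)      # F equals t squared
all.equal(f_pval, pooled$p.value)  # and the p-values agree
```

With more than two levels there is no single t value to compare, and Anova is the natural choice.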
Visualization can be obtained through box-plots, bar-plots and density plots.
Box Plots to establish relationship between Division variable and Salary Variable in Hitters data-set
The output indicates that there are 129 observations for E and 134 observations for W. The statistics of the output indicate: Minimum salary values for E and W divisions are 67.5 and 68 respectively. Salary values at the 1st quartile for E and W divisions are 215 and 165 respectively. Salary values at the 2nd quartile for E and W divisions are 517.14 and 375 respectively. This implies the median salary value of E division is about 38% higher than that of W. Salary values at the 3rd quartile for E and W divisions are 850 and 725 respectively. The upper whisker values (the largest salaries not classified as outliers) for E and W divisions are 1800 and 1500 respectively.
Outlier values are indicated by $out values as classified in the $group. Outlier salary values are 1975, 1861.46, 2460, 1925.57, 2412.50, 2127.33, 1940 for Division E (group 1). The outlier salary value is 1900 for division W (group 2).
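The components described above are those returned by the base R boxplot function. As a minimal sketch on a synthetic data frame named toy (made-up values, not the Hitters data), they can be inspected as follows; on the real data the call would presumably be boxplot(Salary ~ Division, data = hitters).

```r
# Synthetic two-group data for illustration only
set.seed(7)
toy <- data.frame(
  Division = factor(rep(c("E", "W"), times = c(50, 50))),
  Salary   = c(rlnorm(50, meanlog = 6.2, sdlog = 0.8),
               rlnorm(50, meanlog = 5.9, sdlog = 0.8))
)

# plot = FALSE returns the statistics without drawing; drop it to see the box plot
bp <- boxplot(Salary ~ Division, data = toy, plot = FALSE)

bp$names  # group labels, in the order used by the columns of $stats
bp$n      # number of observations per group
bp$stats  # one column per group; rows: lower whisker, 1st quartile,
          # median, 3rd quartile, upper whisker
bp$out    # values classified as outliers
bp$group  # group index of each outlier (1 = first label, 2 = second)
```

The first and last rows of $stats are whisker ends, so they match the true minimum and maximum of a group only when that group has no outliers on that side.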
We can also use ggplot to make more beautiful visuals:
The above graph gives a visualization of frequency count of Salary buckets as per the division
Dodged Graph
p=ggplot(hitters, aes(x=Salary))
p+geom_histogram(aes(fill=Division), position = "dodge", bins=30)+xlab("Salary")+theme_classic()
Stacked Graph
p=ggplot(hitters, aes(x=Salary))
p+geom_histogram(aes(fill=Division), position = "stack", alpha=.7)+xlab("Salary")+theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The plot indicates high density for division W in the salary range 0 to 1000.
Conclusion
Bi-Variate Analysis provides a visual representation of the inter-relationship between the predictor variables. If the correlation between the predictor variables is high, then we would have to reduce the correlated variables in order to avoid multi-collinearity in the prediction models.
We also need to understand the bi-variate relationship between the target variable and the predictor variables, since it is on this basis that we understand, analyze and validate our modeling results.
In essence, it is one of the most important steps, giving us insight into the interaction between the variables.