Data science is a comprehensive blend of maths, business and technology. One has to go from data inference to algorithm development, and then all the way to using the available technology to derive solutions to complex problems. At its heart, all we have is data, and we can only draw inferences once we start mining it. In the end, data science uses multiple mathematical techniques to extract the business value present in the data for various enterprises.
On a very broad level, data science comprises 3 important components: maths or statistics, computer science, and information science. A strong statistical background is necessary if one is to pursue a career in data science. Organisations prefer data scientists with strong statistical knowledge, as statistics is a key component in providing insights to leading businesses worldwide.
In this blog, we will look at 5 important statistical concepts for data scientists. Let us go through them one by one in the sections that follow.
Statistics and Data Science
Let us discuss the role of statistics in data science before beginning our journey into the math world!
In data science, you can always find statistics and computer science competing against each other for ultimate supremacy. This happens in particular in the areas concerning data acquisition and enrichment for predictive modelling.
But statistics has the upper hand to some extent, as the computer science applications in data science are largely derived from it.
Statistics is a key player in data science, but by no means a solo player. The real essence of data science is obtained by combining statistics with algorithms and mathematical modelling methods. Ultimately, a balanced combination is required to produce a successful data science solution.
Important Concepts in Data Science
1. Probability Distributions
A probability distribution is a function that describes how likely a random variable is to take each of its possible values. In other words, the values of the variable vary according to the underlying probability distribution.
Suppose you draw a random sample and measure the income of the individuals. As you keep collecting data, you can start building a distribution of income. Distributions are important in scenarios where we need to find the outcomes with high likelihood and want to measure their predicted or potential values over a range.
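To make the income example concrete, here is a minimal sketch of building an empirical distribution from a sample. The incomes are simulated from a log-normal distribution purely as an assumption for illustration; the function names `simulate_incomes` and `empirical_distribution`, the parameters, and the bin width are all hypothetical.

```python
import random

def simulate_incomes(n, seed=0):
    """Draw n hypothetical incomes (log-normal is an assumption, not real data)."""
    rng = random.Random(seed)
    return [rng.lognormvariate(10.5, 0.5) for _ in range(n)]

def empirical_distribution(values, bin_width):
    """Return {bin_start: share of observations that fall in that bin}."""
    counts = {}
    for v in values:
        start = int(v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    total = len(values)
    return {start: counts[start] / total for start in sorted(counts)}

incomes = simulate_incomes(1000)
distribution = empirical_distribution(incomes, bin_width=10_000)

# The bin with the largest share is the most likely income range in this sample.
most_likely_bin = max(distribution, key=distribution.get)
print(f"Most likely income range starts at {most_likely_bin}")
```

As more observations are collected, the shares in each bin settle down and give a better picture of which income ranges are likely.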
2. Dimensionality Reduction
In machine learning classification problems, there are often too many factors on the basis of which the final classification is done. These factors are variables called features. The greater the number of features, the more difficult it becomes to visualize the training set and work with it. Many of these features are often correlated and therefore redundant. This is where dimensionality reduction algorithms come into play. Dimensionality reduction can be seen as a means of eliminating a large number of variables to reach a smaller subset, or of arriving at the variables which matter more than the others. It can be split into feature selection and feature extraction.
A simple email classification problem, where we need to identify whether an email is spam or not, gives an intuitive example of dimensionality reduction. It can involve a large number of features, such as whether or not the email has a specific title, the email content, whether or not the email uses a template, etc. Some of these features, however, may overlap. In another situation, a classification problem based on both humidity and rainfall may collapse into just one underlying feature, as the two are highly correlated. Therefore, in such problems, we can reduce the number of features.
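The humidity and rainfall case can be sketched with a simple correlation filter, which is one basic form of feature selection (it is not PCA or any other feature extraction method). The data below is simulated, and the helper names `pearson` and `drop_redundant` as well as the 0.9 threshold are assumptions made for this example.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def drop_redundant(features, threshold=0.9):
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = {}
    for name, values in features.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept

# Simulated data: humidity is constructed to follow rainfall closely.
rng = random.Random(42)
rainfall = [rng.uniform(0, 50) for _ in range(200)]
humidity = [40 + 0.9 * r + rng.gauss(0, 2) for r in rainfall]
temperature = [rng.uniform(10, 35) for _ in range(200)]

features = {"rainfall": rainfall, "humidity": humidity, "temperature": temperature}
reduced = drop_redundant(features)
print(list(reduced))  # humidity is dropped because it duplicates rainfall
```

Which of two correlated features survives depends only on the order in which they are visited, so in practice domain knowledge should guide that choice.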
3. Over and Under-Sampling
In data science, we work with datasets representing some entities. Ideally, all the entities have equal representation in the dataset, which may not be the case every time. To cope with this, we have oversampling and undersampling. These are data mining techniques that modify unequal classes to create balanced sets. They are also known as resampling techniques.
When one class is the underrepresented minority in the data sample, oversampling methods can be used to replicate its examples and obtain a more balanced number of useful training examples. Oversampling is used when the amount of data collected is insufficient. SMOTE (Synthetic Minority Over-sampling Technique) is a common oversampling method that produces synthetic samples by interpolating between a minority-class example and its nearest minority-class neighbours in feature space.
Conversely, if a class is the over-represented majority, undersampling can be used to balance it with the minority class. Undersampling is used when a sufficient amount of data has been collected. Common undersampling techniques include cluster centroids, which target potentially overlapping characteristics within the collected data to reduce the amount of majority-class data.
In both oversampling and undersampling, simple duplication of data is seldom recommended. Oversampling is generally preferable, since undersampling can lead to the loss of important information. Undersampling is suggested when the amount of data collected is larger than necessary, and it can help keep data mining tools within the limits of what they can process efficiently.
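The two resampling directions can be sketched in a few lines. This is a simplified illustration written for this post, not the reference SMOTE implementation: `smote_like_oversample` only mimics the core idea of interpolating towards a nearby minority neighbour, and `random_undersample` is plain random undersampling rather than cluster centroids. The toy points and all names are hypothetical.

```python
import math
import random

def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (SMOTE-style idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + t * (nb - b) for b, nb in zip(base, neighbour)))
    return synthetic

def random_undersample(majority, n_keep, seed=0):
    """Keep a random subset of the majority class, without replacement."""
    rng = random.Random(seed)
    return rng.sample(majority, n_keep)

# Toy two-feature data set: 4 minority points versus 20 majority points.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority = [(float(i % 5), float(i // 5)) for i in range(20)]

synthetic = smote_like_oversample(minority, n_new=len(majority) - len(minority))
oversampled_minority = minority + synthetic
undersampled_majority = random_undersample(majority, n_keep=len(minority))

print(len(oversampled_minority), len(majority))   # balanced by oversampling
print(len(minority), len(undersampled_majority))  # balanced by undersampling
```

For real projects, a maintained library implementation is usually a better choice than a hand-rolled version like this one.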
4. Bayesian Statistics
Bayesian statistics is an alternative paradigm to frequentist statistics. It works on the principle of updating a pre-existing belief about random events. The belief gets updated when new data or evidence about those events comes in.
Bayesian inference revolves around interpreting probability as a measure of confidence that a particular event will occur.
We may have a prior belief about an event, but when new evidence comes to light, our beliefs are likely to change. Bayesian statistics provides us with a sound mathematical means of combining our prior beliefs with evidence to produce new posterior beliefs.
This contrasts with another form of inferential statistics, known as classical or frequentist statistics, which holds that probabilities are the frequencies with which specific random events occur in a long sequence of repeated trials.
For example, when we toss a fair coin repeatedly, we find that the proportion of heads (or tails) comes close to 0.5.
Frequentist and Bayesian statistics follow different ideologies. In frequentist statistics, outcomes are observed over a large number of repeated trials and conclusions are drawn only after all the observations are made, whereas in Bayesian statistics our belief is updated with every new event.
By offering predictions, frequentist statistics attempts to eliminate uncertainty. Bayesian statistics attempts to preserve and refine uncertainty by adjusting individual beliefs in the light of new evidence.
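The coin-toss example shows the difference nicely. The sketch below updates a Beta prior one toss at a time, which is the standard conjugate update for a coin's probability of heads, and compares the result with the plain observed frequency. The uniform Beta(1, 1) prior, the toss sequence, and the function names are assumptions chosen for illustration.

```python
def update_beta(alpha, beta, toss):
    """Update a Beta(alpha, beta) belief after one toss ('H' or 'T')."""
    if toss == "H":
        return alpha + 1, beta
    return alpha, beta + 1

def beta_mean(alpha, beta):
    """Expected probability of heads under a Beta(alpha, beta) belief."""
    return alpha / (alpha + beta)

tosses = "HHTHTTHHHT"  # hypothetical observations: 6 heads, 4 tails

# Bayesian view: start from a uniform prior and update after every toss.
belief = (1, 1)
for toss in tosses:
    belief = update_beta(*belief, toss)
    print(toss, round(beta_mean(*belief), 3))

posterior_mean = beta_mean(*belief)

# Frequentist view: the estimate is the observed long-run frequency.
frequentist_estimate = tosses.count("H") / len(tosses)
```

With only ten tosses the prior still pulls the Bayesian estimate slightly towards 0.5; as the number of tosses grows, the two estimates move closer together.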
5. Descriptive Statistics
Descriptive statistics is the most common of all forms of analysis. It gives the analyst a view of key metrics and measures within the company. Exploratory data analysis, unsupervised learning, clustering and basic data summaries all fall under descriptive statistics. Descriptive statistics has many uses, most notably helping us get familiar with a data set. It is usually the starting point for any analysis, and it often helps us come up with hypotheses that will be tested later with more formal inference.
Descriptive statistics is essential because if we merely presented our raw data, it would be difficult to see what the data showed, particularly if there was a lot of it. Descriptive statistics enables us to present the data in a more meaningful way, which makes it easier to interpret. For example, if we had the SAT marks of 1000 students, we might be interested in the overall performance of those students. We would also be interested in the spread or distribution of the marks. All of these tasks and visualisations fall under descriptive statistics.
Let’s take another example. Suppose you want to measure the demographics of the customers a retail giant is catering to. The retail giant is interested in understanding the variance present in the customer attributes and their shopping behaviours. For all these tasks, descriptive statistics is a blessing!
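As a small illustration of summarising marks, the sketch below uses Python's standard `statistics` module to compute measures of central tendency and spread. The eight marks are made up for the example (a real data set would have far more rows), and the `describe` helper is a hypothetical name.

```python
import statistics

def describe(values):
    """Return a few common descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }

# Hypothetical exam marks for eight students.
marks = [1200, 1350, 980, 1420, 1100, 1280, 1500, 1050]
summary = describe(marks)

for name, value in summary.items():
    print(f"{name}: {value}")
```

The mean and median describe overall performance, while the standard deviation and range describe how spread out the marks are.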
In this blog, we looked at the 5 most important concepts in statistics which every data scientist should know about. Although we discussed them in detail, these are not the only techniques in statistics. There are many more, and they are good to know!
Data scientist is the No. 1 most promising job in America for 2019, according to a Thursday report from LinkedIn. This comes as no surprise: data scientist topped Glassdoor’s list of Best Jobs in America for the past three years, with professionals in the field reporting high demand, high salaries, and high job satisfaction.
With the increase in demand, employers are looking for more skills in modern-day data scientists. A modern-day data scientist needs to be strong in areas like maths, programming, communication and problem-solving.
In this blog, we are going to explore whether knowledge of mathematics is really necessary to become a good data scientist. We will also explore ways, if any, through which one can become a good data scientist without learning maths.
What it takes to be a modern-day Data Scientist
Data scientists continue to be in high demand, with companies in virtually every industry looking to get the most value from their burgeoning information resources. This role is important, but the rising stars of the business are those savvy data scientists who can not only manipulate vast amounts of data with sophisticated statistical and visualization techniques but also have solid acumen from which they can derive forward-looking insights, Boyd says. These insights help predict potential outcomes and mitigate potential threats to the business. The key skills of modern-day data scientists are as follows.
1. Critical thinking
Data scientists need to be critical thinkers, able to apply objective analysis of the facts on a given topic or problem before formulating opinions or rendering judgments. They also need to understand the business problem or decision being made and be able to ‘model’ or ‘abstract’ what is critical to solving the problem, versus what is extraneous and can be ignored.
2. Coding
Top-notch data scientists know how to write code and are comfortable handling a variety of programming tasks. To be really successful as a data scientist, the programming skills need to comprise both computational aspects (dealing with large volumes of data, working with real-time data, cloud computing, unstructured data) and statistical aspects (working with statistical models like regression, optimization, clustering, decision trees, random forests, etc.).
3. Maths
Data science is probably not a good career choice for people who don’t like or are not proficient at mathematics. The data scientist whiz is one who excels at mathematics and statistics while being able to collaborate closely with line-of-business executives, communicating what is actually happening in the “black box” of complex equations in a manner that reassures the business that it can trust the outcomes and recommendations.
4. Machine learning, deep learning, AI
Industries are moving extremely fast in these areas because of increased computing power, connectivity, and the huge volumes of data being collected. A data scientist needs to stay ahead of the curve in research, as well as understand what technology to apply when. Too many times, a data scientist will apply something ‘sexy’ and new when the actual problem they are solving is much less complex.
Data scientists need to have a deep understanding of the problem to be solved, and the data itself will speak to what’s needed. Being aware of the computational cost to the ecosystem, interpretability, latency, bandwidth, and other system boundary conditions, as well as the maturity of the customer, helps the data scientist understand what technology to apply. That’s true as long as they understand the technology.
5. Communication
The importance of communication skills bears repeating. Virtually nothing in technology today is performed in a vacuum; there’s always some integration between systems, applications, data and people. Data science is no different, and being able to communicate with multiple stakeholders using data is a key attribute.
6. Data architecture
It is imperative that the data scientist understands what is happening to the data from inception to model to business decision. Not understanding the architecture can have a serious impact on sample size inferences and assumptions, often leading to incorrect results and decisions.
As we have seen, mathematics is a crucial skill for a data scientist, among many others. Admittedly, it is not everything that a data scientist requires. So let us explore the use of mathematics in data science further, which will help us answer our question better!
Application of maths in data science and AI
Modelling a process (physical or informational) by probing the underlying dynamics
Rigorously estimating the quality of the data source
Quantifying the uncertainty around the data and predictions
Identifying the hidden pattern from the stream of information
Understanding the limitation of a model
Understanding mathematical proof and the abstract logic behind it
What Maths Must You Know?
1. Linear algebra
You need to be familiar with linear algebra if you want to work in data science and machine learning because it helps you deal with matrices, mathematical objects consisting of multiple numbers organised in a grid. The data collected by a data scientist naturally comes in the form of a matrix (the data matrix) of n observations by p features, thus an n-by-p grid.
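Here is a minimal sketch of what an n-by-p data matrix looks like in code, using plain Python lists so that nothing beyond the standard library is needed. The numbers, the weight vector, and the helper names are all made up for illustration; in practice a numerical library would normally handle these operations.

```python
def shape(matrix):
    """Return (n observations, p features) for a list-of-rows matrix."""
    return len(matrix), len(matrix[0])

def column_means(matrix):
    """Mean of every feature (column) across all observations (rows)."""
    n = len(matrix)
    return [sum(column) / n for column in zip(*matrix)]

def mat_vec(matrix, vector):
    """Matrix-vector product: one dot product per observation."""
    return [sum(x * w for x, w in zip(row, vector)) for row in matrix]

# Hypothetical data matrix: n = 4 observations, p = 3 features.
X = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [10.0, 11.0, 12.0],
]
weights = [0.5, -1.0, 2.0]  # one hypothetical weight per feature

print(shape(X))                      # (4, 3)
print(column_means(X))               # [5.5, 6.5, 7.5]
predictions = mat_vec(X, weights)    # a linear model is just X times w
print(predictions)                   # [4.5, 9.0, 13.5, 18.0]
```

The last step is the reason linear algebra shows up everywhere: a linear model's predictions for all n observations are a single matrix-vector product.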
2. Probability theory
Probability theory (even the basic, not yet measure-theoretic probability theory) helps the data scientist deal with uncertainty and express it in models. Frequentists, Bayesians, and indeed quantum physicists argue to this day about what probability really is (in many languages, such as Russian and Ukrainian, the word for probability comes from “having faith”), whereas pragmatists, such as Andrey Kolmogorov, shirk the question, postulate some axioms that describe how probability behaves (rather than what it is) and say: stop asking questions, just use the axioms.
3. Statistics
After probability theory, there comes statistics. As Ian Hacking remarked, “The quiet statisticians have changed our world; not by discovering new facts or technical developments, but by changing the ways that we reason, experiment, and form opinions”. Read Darrell Huff’s How to Lie with Statistics, if only to learn how to be truthful and how to recognise the truth, just as Moses learned “all the wisdom of the Egyptians” in order to reject it.
4. Estimation theory
A particular branch of statistics — estimation theory — had been largely neglected in mathematical finance, at a high cost. It tells us how well we know a particular number: what is the error present in our estimates? How much of it is due to bias and how much due to variance?
Going beyond classical statistics, in machine learning we want to minimise the error on new data (out-of-sample) rather than on the data that we have already seen (in-sample). As someone remarked, probably Niels Bohr or Piet Hein, “prediction is very difficult, especially about the future.”
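One standard way to make the bias and variance question precise is the decomposition of the mean squared error of an estimator. The notation is introduced here for illustration and is not from the original post: $\hat{\theta}$ is the estimator and $\theta$ is the true value being estimated.

```latex
\mathrm{MSE}(\hat{\theta})
  = \mathbb{E}\big[(\hat{\theta} - \theta)^2\big]
  = \mathrm{Var}(\hat{\theta}) + \big(\mathrm{Bias}(\hat{\theta})\big)^2,
\qquad
\mathrm{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta
```

Read this way, the total error of an estimate splits into a part caused by the estimator being wrong on average (bias) and a part caused by it fluctuating from sample to sample (variance).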
5. Optimization theory
You can spend a lifetime studying this. Much of machine learning is about optimization — we want to find the weights that give the best (in optimisation speak, optimal) performance of a neural network on new data, so naturally, we have to optimise — perhaps with some form of regularisation. (And before you have calibrated that long short-term memory (LSTM) network — have you tried the basic linear regression on your data?)
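Taking the advice about basic linear regression literally, here is a minimal sketch of optimisation in action: fitting a one-feature linear regression by gradient descent, with an optional L2 (ridge) penalty as one simple form of regularisation. The data, learning rate, step count, and function name are assumptions chosen so that the example converges; this is an illustration, not a production optimiser.

```python
def fit_linear_regression(xs, ys, learning_rate=0.05, l2=0.0, steps=2000):
    """Minimise mean squared error + l2 * w**2 by plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the objective with respect to the weight and the bias.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs)) + 2 * l2 * w
        grad_b = 2 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_linear_regression(xs, ys)
print(round(w, 4), round(b, 4))  # close to 2 and 1

# With regularisation the weight is shrunk towards zero.
w_ridge, b_ridge = fit_linear_regression(xs, ys, l2=1.0)
print(round(w_ridge, 4), round(b_ridge, 4))
```

The same loop (compute the error, compute its gradient, take a small step) is what trains a neural network; only the model and the gradient computation become more elaborate.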
What Do You Miss by Skipping Maths?
No in-depth knowledge of how ML models work
Inability to prove the correctness of your hypothesis
Prone to introducing bias and errors in your analysis
Inefficiency in math-heavy business problems
Some resources to learn maths online
We will divide the resources into 3 sections (Linear Algebra, Calculus, Statistics and Probability). The resources are listed in no particular order and include video tutorials, books, blogs, and online courses.
Linear algebra is used in machine learning (and deep learning) to understand how algorithms work under the hood. Basically, it’s all about vector/matrix/tensor operations; no black magic is involved!
Linear algebra, calculus II, stats and probability are sufficient for understanding and handling 90% of machine learning models. Some areas and methods require special insights. For example, Bayesian and variational methods require the calculus of variations, MCMC and Gibbs sampling require advanced concepts of probability theory, information geometry and submanifold learning require differential geometry, and kernel theory requires calculus III. Lately, it seems that even abstract algebra is playing a role.
Without maths, you may still reach entry-level positions in data science or solve some toy projects. But in the long run, only maths will help you scale up your career!