Introduction to Data Science
Data Science. If you’re in the tech field or just an avid follower of technology, you’ve probably heard these words more than once over the past few years, even if only as a buzzword. Data science is one of the hottest domains in technology as of this writing. In 2012, Harvard Business Review called data scientist the sexiest job of the 21st century. The demand for data scientists is increasing, and people are flocking into the field. But do we know what data science really is? In this article, we’ll explore some of the basic concepts of data science.
Let’s break it down: it’s data and science. So first, what is data? Since data is one of the most fundamental concepts, it is best defined through examples. You write a post on Facebook, you post a picture on Instagram; you’re generating data. You upload a video on YouTube, you send a voice message to a friend; you’re generating data. You maintain a log of the number of calories you burn every day; you’re generating data. Whatever you do, if you leave a footprint in the digital realm, you’re generating data. And if you’re really active on the Internet, you’re generating lots of data each day. Now think about this. You’re just one person. According to the International Telecommunication Union (ITU), over 3 billion people currently have access to the Internet. And these people are all generating data. Likewise, the government generates data such as the census, healthcare generates data about diseases and treatments, and the stock market generates data. In addition, with the advent of the Internet of Things (IoT), devices are now generating lots and lots of data. Because the volume of data is now so large, we call it Big Data.
But is the data useful to us in any manner? Raw data alone is not useful; what is useful is information. A log of the number of calories burnt every day is of no use if you cannot generate valuable insights from it. It’s just a bunch of numbers we don’t care about. What we care about is information. Have we burnt enough for the day? How far are we from our goal of getting slimmer? Can we have a cheat day tomorrow? This is the kind of information we care about. So how do we generate these insights? How can we go from data to information? Enter science.
Science, as we know, is all about tools, techniques, principles, experiments and observations.
Science applies logic and facts to the available observations to solve the problem at hand.
Data science, then, is just science applied to data: the process of analyzing the raw data available to us, preparing it for further processing, using tools and techniques to build a model or relationship among the processed data, and finally deriving valuable information that guides us towards better decisions. Data science is thus important because the decisions taken are backed by actual, solid data rather than only by the instincts of the decision maker.
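To make that process concrete, here is a minimal sketch in Python that walks the calorie log from earlier through the three stages: prepare the raw data, build a (deliberately trivial) model, and turn the result into a decision. The records, the goal of 450 calories and the function names prepare, model and decide are all made up for illustration; a real project would use far more data and a far richer model.

```python
from statistics import mean

# Hypothetical raw log, as it might come out of a fitness app export (made-up values).
raw_records = [
    {"day": "Mon", "calories": "420"},
    {"day": "Tue", "calories": ""},       # missing reading
    {"day": "Wed", "calories": "380"},
    {"day": "Thu", "calories": "n/a"},    # unusable reading
    {"day": "Fri", "calories": " 600 "},  # stray whitespace
    {"day": "Sat", "calories": "450"},
]


def prepare(records):
    """Step 1: clean the raw data, keeping only readings that parse as whole numbers."""
    cleaned = []
    for record in records:
        text = record["calories"].strip()
        if text.isdigit():
            cleaned.append((record["day"], int(text)))
    return cleaned


def model(cleaned):
    """Step 2: a deliberately simple 'model': the average of the valid readings."""
    return mean(calories for _, calories in cleaned)


def decide(average, goal):
    """Step 3: turn the model's output into a decision we can act on."""
    return "keep going" if average >= goal else "increase activity"


cleaned = prepare(raw_records)
average = model(cleaned)
decision = decide(average, goal=450)  # the goal is an assumed target

print(f"Valid readings: {len(cleaned)} of {len(raw_records)}")
print(f"Average calories burnt per day: {average}")
print(f"Decision: {decision}")
```

Notice that two of the six records are thrown away before any analysis happens. Even in a toy example, preparing the data is a step of its own, and the decision at the end is only as good as the readings that survive it.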
Components of Data Science
Now that we’re convinced that data science is important, the next question is: what skills are required to become a data scientist? What areas does data science cover? There is a famous Venn diagram, called the Data Science Venn diagram, that shows the various areas data science touches upon. It was first introduced by Drew Conway, and over the years different versions of it have been presented. You can check out https://www.kdnuggets.com/2016/10/battle-data-science-Venn-diagrams.html for some interesting Venn diagrams depicting data science. Because data science touches so many different fields, it is hard to settle on a single correct representation. Below, we present a simplified version of the Venn diagram.
Let’s explain the Venn diagram with simple questions – what, where, why and how?
What is the problem at hand? Where do we want to apply data science? Maybe we want to make stock market predictions. Why do we want to apply data science to this field? Maybe because we want better recommendations on where we should put our money. How are we going to get better recommendations? By collecting past data from the stock market for historical trends and current data for the latest trends. We need to know the ins and outs of the stock market, and we need to know the different parameters that are involved in, and that affect, stock market fluctuations. That is Domain Expertise.
How are we going to generate recommendations from the data we collected? By using certain algorithms. Why are we using that algorithm? Because the pattern of the data shows that using that particular algorithm gives us the best analysis of the data. What does the algorithm give us? It gives us a model that provides a relationship amongst the various parameters involved. What do we do with that model? We make predictions. That is Math & Statistics.
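As a small illustration of "algorithm in, model out, prediction at the end", the sketch below fits a straight line to a handful of prices using ordinary least squares, one of the simplest algorithms from statistics. The prices are invented, the names fit_line and predict are ours, and a straight line is far too simple for a real stock market; the point is only to show what "a model that relates the parameters" looks like in practice.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how x varies on its own.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: day number and closing price of an imaginary stock.
days = [1, 2, 3, 4, 5]
prices = [10.0, 12.5, 13.5, 16.5, 17.5]

# The "model" is just these two numbers relating day to price.
slope, intercept = fit_line(days, prices)


def predict(day):
    """Use the fitted model to estimate the price on a given day."""
    return slope * day + intercept


print(f"model: price = {slope:.2f} * day + {intercept:.2f}")
print(f"predicted price on day 6: {predict(6):.2f}")
```

The algorithm (least squares) is chosen because the data looks roughly like a straight line. If the pattern in the data were different, a different algorithm would give a better model, which is exactly the judgement that math and statistics help us make.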
What tools are we going to use? What programming language? What data structures will be used to represent the data and the outcome? How are we going to represent the algorithms programmatically? Maybe we’ll use Python. Why are we using Python? Because it has abundant libraries for data science and it’s easy to learn. That is Computer Science.
When you use the correct tools and techniques to represent the most efficient algorithm for solving a problem you have substantial expertise and data on, that is Data Science.
Who can get into Data Science?
For people who are not great at math or have limited programming skills, the above Venn diagram can be somewhat daunting. Is data science not for them? There is a misconception that to be a data scientist you need a PhD or must be a pro at handling complex mathematical equations and functions. This is incorrect. While a strong grip on mathematics does give you an added advantage, the data science ecosystem has matured enough to open its doors to everyone. Programming languages have libraries that you can use to build effective models and solve the problem of your interest without getting into the nitty-gritty details of the mathematics. Mathematics becomes important when you want to tweak an algorithm or build custom models to handle complex problems. Strong mathematical foundations can also help you make an algorithm more efficient for the particular requirements of your problem. A general knowledge of linear algebra, probability, statistics and basic high school mathematics is, however, recommended.
Another misconception is that only people from a computer science background can get into data science. This, again, is incorrect. The applications of data science are varied. People are using data science to solve problems as diverse as medical diagnosis, analysis of epidemics and natural calamities, market analysis and gaming, to name a few. So people from all types of backgrounds are getting into data science: chemical engineers, physicists, people from finance and business administration, artists, people from the medical field and many more. These people do not necessarily have excellent skills in programming or math. But with the data science community being so welcoming and the tools being open source and readily available, it’s not that hard to get started. Plus, they bring their domain expertise with them.
What tools and resources are available to get started in Data Science?
Data science is all about getting valuable insights from data. As a data scientist, you’re first and foremost a problem solver. Which tools and techniques you use is secondary; how you approach the problem is what matters. Of course, a single person cannot solve all kinds of problems. So it is important that we have a problem to apply data science to and that we have substantial expertise in the subject, or at least a way to extract substantial data from the field. After that, we just build on the basics to solve that particular problem. If you have expertise or substantial data in the medical field, you can apply data science to medical diagnosis; you wouldn’t care about stock market predictions because that is not your domain currently. Also, data scientists have different roles: collecting data, cleaning data, analytics, testing and applying machine learning. It is rare for a single person to possess all these skills. So companies tend to have a team of data scientists, each member with a definite role, rather than a single superstar data scientist.