9923170071 / 8108094992 info@dimensionless.in
7 Technical Concept Every Data Science Beginner Should Know

7 Technical Concept Every Data Science Beginner Should Know

Welcome to Data Science!

 

So you want to learn data science but you don’t know where to start? Or you are a beginner and you want to learn the basic concepts? Welcome to your new career and your new life! You will discover a lot of things on your journey to becoming a data scientist and being part of a new revolution. I am a firm believer that you can learn data science and become a data scientist regardless of your age, your background, your current knowledge level, your gender, and your current position in life. I believe – from experience – that anyone can learn anything at any stage in their lives. What is required is just determination, persistence, and a tireless commitment to hard work. Nothing else matters as far as learning new things – or learning data science – is concerned. Your commitment, persistence, and your investment in your available daily time is enough.

I hope you understood my statement. Anyone can learn data science if you have the right motivation. In fact, I believe anyone can learn anything at any stage in their lives, if they invest enough time, effort and hard work into it, along with your current occupation. From my experience, I strongly recommend that you continue your day job and work on data science as a side hustle, because of the hard work that will be involved. Your commitment is more important than your current life situation. Carrying on a full-time job and working on data science part-time is the best way to go if you want to learn in the best possible manner.

 

Technical Concepts of Data Science

So what are the important concepts of data science that you should know as a beginner? They are, in order of sequential learning, the following:

  1. Python Programming
  2. R Programming
  3. Statistics & Probability
  4. Linear Algebra
  5. Data Preparation and Data ETL*
  6. Machine Learning with Python and R
  7. Data Visualization and Summary

*Extraction, Transformation, and Loading

Now if you were to look at the above list an go to a library, you would, most likely, come back with 9-10 books at an average of 1000 pages each. Even if you could speed-read, 10,000 pages is a lot to get through. I could list the best books for each topic in this post, but even the most seasoned reader would balk at 10,000 pages. And who reads books these days? So what I am going to give you is a distilled extract on each of those topics. Keep in mind, however, that every topic given above could be a series of blog posts in its own right, and these 80-word paragraphs are just a tiny taste of each topic and there is an ocean of depth involved in every topic. You might ask if that is the case, how can everybody be a possible candidate for data scientist role? Two words: Persistence and Motivation. With the right amount of these two characteristics, anyone can be anything they want to be.

 

1) Python Programming:

Python is one of the most popular programming languages in the world. It is the ABC of data science because Python is the language every beginner starts with on data science. It is universally used for any purposes since it is so amazingly versatile. Python can be used for web applications and websites with Django, microservices with Flask, general programming projects with the standard library from PyPI, GUIs with PyQt5 or Tkinter, Interoperability with Jython (Java), Cython (C) and nearly other programming language are available today.

Of course, Python is the also first language used for data science with the standard stack of scikit-learn (machine learning), pandas (data manipulation), matplotlib and seaborn (visualization) and numpy (vectorized computation). Nowadays, the most common technology used is the Anaconda distribution, available from www.anaconda.com. Current version 2018.12 or Anaconda Distribution 5. To learn more about Python, I strongly recommend the following books: Head First Python and the Python Cookbook.

 

2) R Programming

R is The Best Language for statistical needs since it is a language designed by statisticians, for statisticians. If you know statistics and mathematics well, you will enjoy programming in R. The language gives you the best support available for every probability distribution, statistics functions, mathematical functions, plotting, visualization, interoperability, and even machine learning and AI. In fact, everything that you can do in Python can be done in R. R is the second most popular language for data science in the world, second only to Python. R has a rich ecosystem for every data science requirement and is the favorite language of academicians and researchers in the academic domain.

Learning Python is not enough to be a professional data scientist. You need to know R as well. A good book to start with is R For Data Science, available at Amazon at a very reasonable price. Some of the most popular packages in R that you need to know are ggplot2, ThreeJS, DT (tables), network3D, and leaflet for visualization, dplyr and tidyr for data manipulation, shiny and R Markdown for reporting, parallel, Rcpp and data.table for high performance computing and caret, glmnet, and randomForest for machine learning.

 

3)  Statistics and Probability

This is the bread and butter of every data scientist. The best programming skills in the world will be useless without knowledge of statistics. You need to master statistics, especially practical knowledge as used in a scientific experimental analysis. There is a lot to cover. Any subtopic given below can be a blog-post in its own right. Some of the more important areas that a data scientist needs to master are:

  1. Analysis of Experiments
  2. Tests of Significance
  3. Confidence Intervals
  4. Probability Distributions
  5. Sampling Theory
  6. Central Limit Theorem
  7. Bell Curve
  8. Dimensionality Reduction
  9. Bayesian Statistics

Some places on the Internet to learn Statistics from are the MIT OpenCourseWare page Introduction to Statistics and Probability, and the Khan Academy Statistics and Probability Course. Good books to learn statistics: Naked Statistics, by Charles Wheelan which is an awesome comic-like but highly insightful book which can be read enjoyably by anyone including those from non-technical backgrounds and Practical Statistics for Data Scientists by Peter Bruce and Andrew Bruce.

 

4) Linear Algebra

Succinctly, linear algebra is about vectors, matrices and the operations that can be performed on vectors and matrices. This is a fundamental area for data science since every operation we do as a data scientist has a linear algebra background, or, as data scientists, we usually work with collections of vectors or matrices. So we have the following topics in Linear Algebra, all of which are covered in the following world-famous book, Linear Algebra and its Applications by Gilbert Strang, an MIT professor. You can also go to the popular MIT OpenCourseWare page, Linear Algebra (MIT OCW). These two resources cover everything you need to know. Some of the most fundamental concepts that you can also Google or bring up on Wikipedia are:

  1. Vector Algebra
  2. Matrix Algebra
  3. Operations on Matrices
  4. Determinants
  5. Eigenvalues and Eigenvectors
  6. Solving Linear Systems of Equations
  7. Computer-Aided Algebra Software (Mathematica, Maple, MATLAB, etc)

 

5) Data Preparation and Data ETL (Extraction, Transformation, and Loading)

By IAmMrRob on Pixabay

 

Yes – welcome to one of the more infamous sides of data science! If data science has a dark side, this is it. Know for sure that unless your company has some dedicated data engineers who do all the data munging and data wrangling for you, 90% of your time on the job will be spent on working with raw data. Real world data has major problems. Usually, it’s unstructured, in the wrong formats, poorly organized, contains many missing values, contains many invalid values, and contains types that are not suitable for data mining.

Dealing with this problem takes up a lot of the time of a data scientist. And your data scientist’s analysis has the potential to go massively wrong when there is invalid and missing data. Practically speaking, unless you are unusually blessed, you will have to manage your own data, and that means conducting your own ETL (Extraction, Transformation, and Loading). ETL is a data mining and data warehousing term that means loading data from an external data store or data mart into a form suitable for data mining and in a state suitable for data analysis (which usually involves a lot of data preprocessing). Finally, you often have to load data that is too big for your working memory – a problem referred to as external loading. During your data wrangling phase, be sure to look into the following components:

  1. Missing data
  2. Invalid data
  3. Data preprocessing
  4. Data validation
  5. Data verification
  6. Automating the Data ETL Pipeline
  7. Automation of Data Validation and Verification

Usually, expert data scientists try to automate this process as much as possible, since a human being would be wearied by this task very fast and is remarkably prone to errors, which will not happen in the case of a Python or an R script doing the same operations. Be sure to try to automate every stage in your data processing pipeline.

 

6) Machine Learning with Python and R

An expert machine learning scientist has to be proficient in the following areas at the very least:

Data Science Topics Listing

Data Science Topics Listing – Thomas

 

Now if you are just starting out in Machine Learning (ML), Python, and R, you will gain a sense of how huge the field is and the entire set of lists above might seem more like advanced Greek instead of Plain Jane English. But not to worry; there are ways to streamline your learning and to consume as little time as possible in learning or becoming able to learn nearly every single topic given above. After you learn the basics of Python and R, you need to go on to start building machine learning models. From experience, I suggest you break up your time into 50% of Python and 50% of R and spend as much time as possible spending time without switching your languages or working between languages. What do I mean? Spend maximum time learning one programming language at one time. That will prevent syntax errors and conceptual errors and language confusion problems.

Now, on the job, in real life, it is much more likely that you will work in a team and be responsible for only one part of the work. However, if your working in a startup or learning initially, you will end up doing every phase of the work yourself. Be sure to give yourself time to process information and to spend sufficient time for your brain to rest and get a handle on the topics you are trying to learn. For more info, do check out the Learning How to Learn MOOC on Coursera, which is the best way to learn mathematical or scientific topics without ending up with burn out. In fact, I would recommend this approach to every programmer out there trying to learn a programming language, or anything considered difficult, like Quantum Mechanics and Quantum Computation or String Theory, or even Microsoft F# or Microsoft C# for a non-Java programmer.

I strongly recommend the book, Hands-On Machine Learning with Scikit-Learn and TensorFlow to learn Python for Data Science. The R book was given earlier in the section on R.

 

7) Data Visualization and Summary

Common tools that you have with which you can produce powerful visualizations include:

  1. Matplotlib
  2. Seaborn
  3. Bokeh
  4. ggplot2
  5. plot.ly
  6. D3.js
  7. Tableau
  8. Google Data Studio
  9. Microsoft Power BI Desktop

Some involve coding, some are drag-and-drop, some are difficult for beginners, some have no coding at all. All of these tools will help you with data visualization. But one of the most overlooked but critical practical functions of a data scientist has been included under this heading: summarisation. 

Summarisation means the practical result of your data science workflow. What does the result of your analysis mean for the operation of the business or the research problem that you are currently working on? How do you convert your result to the maximum improvement for your business? Can you measure the impact this result will have on the profit of your enterprise? If so, how? Being able to come out of a data science workflow with this result is one of the most important capacities of a data scientist. And most of the time, efficient summarisation = excellent knowledge of statistics. Please know for sure that statistics is the start and the end of every data science workflow. And you cannot afford to be ignorant about it. Refer to the section on statistics or google the term for extra sources of information.

How Can I Learn Everything Above In the Shortest Possible Time?

You might wonder – How can I learn everything given above? Is there a course ora pathway to learn every single concept described in this article at one shot? It turns out – there is. There is a dream course for a data scientist that contains nearly everything talked about in this article.

Want to Become a Data Scientist? Welcome to Dimensionless Technologies! It just so happens that the course: Data Science using Python and R, a ten-week course that includes ML, Python and R programming, Statistics, Github Account Project Guidance, and Job Placement, offers nearly every component spoken about above, and more besides. You don’t know to buy the books or do any of the courses other than this to learn the topics in this article. Everything is covered by this single course, tailormade to convert you to a data scientist within the shortest possible time. For more, I’d like to refer you to the following link:

Data Science using R & Python

Does this seem too good to be true? Perhaps, because this is a paid course. With a scholarship concession, you could end up paying around INR 40,000 for this ten-week course, two weeks of which you can register for 5,000 and pay the remainder after two weeks trial period to see if this course really suits you. If it doesn’t, you can always drop out after two weeks and be poorer by just 5k. But in most cases, this course has been found to carry genuine worth. And nothing worthwhile was achieved without some payment, right?

In case you want to learn more about data science, please check out the following articles:

Data Science: What to Expect in 2019

and:

Big Data and Blockchain

Also, see:

AI and intelligent applications

and:

Evolution of Chatbots & their Performance

All the best, and enjoy data science. Every single day of your life!

Must-Have Resources to Become a Data Scientist

Must-Have Resources to Become a Data Scientist

Become a Professional

Now there are a multitude of data science resources out there, all of whom claim to be the “best possible introductory to advanced material and courseware on the subject of data science”. Now I’ve made mistakes in choosing my data science references to buy and keep (and use) but I’ll be sharing what I’ve learned through experience to be the most effective for these particular topics. This list is both effective and born out of experience by going through them one at a time. The list of resources contains the following items:

eBooks (or Books – your choice):

  1. Developing Analytic Talent: Becoming a Data Scientist by Vincent Granville
  2. Introduction to Machine Learning with Python by Muller & Guido.
  3. R for Data Science by Wickham & Grolemund.
  4. Hands-On Machine Learning with Scikit-Learn & TensorFlow by Aurelion Geron
  5. Statistics by Freedman, Pisani, & Purves

Websites:

  1. www.stackoverflow.com (for doubts and errors during coding)
  2. www.kaggle.com (for Data Science Competitions and worldwide rankings)

Online Courses:

  1. Dimensionless.in
    1. Data Science with Python and R: https://dimensionless.in/data-science-using-r-python/ 
    2. Big Data Analytics and NLP: https://dimensionless.in/big-data-analytics-nlp/
    3. Deep Learning: https://dimensionless.in/deep-learning/

 Books for Data Science

1. Developing Analytic Talent: Becoming a Data Scientist by Vincent Granville
Now there are not many books that I would recommend for a professional data scientist, but this book is written by an authority with 15 years of experience in the data science field working on some seriously large-scale projects for the best companies in the world. And it shows. This single book contains some of the latest and the best methods to achieve what you need to be a professional data scientist. And it’s not just teaching theory. Every chapter has multiple case studies taken from the experiences in the industry. Vincent Granville is recognized worldwide as one of the best-known resource talents in data science. The level is a little advanced, and it is not recommended for beginners. But this is the perfect book for advanced-intermediate to professional data scientists. If you want to know how to work professionally as a data scientist, this book is for you. But this is only for intermediate, advanced, and professional data scientists since you need to know the basics before starting on this book.

2. Introduction to Machine Learning with Python – A Guide for Data Scientists (Muller & Guido)

Now, this is a book for beginners, with just a basic knowledge of numpy, pandas and matplotlib required. This is perhaps the most effective way to learn the Scikit-Learn data science library since the authors are two of the core contributors to the scikit-learn package as an open source project. They literally know the library inside out, since they both contributed heavily to creating it! The explanations are simple and the time spent working on the exercises and source code in this book will be highly beneficial if you want to master scikit-learn and its associated libraries.

3. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data by Wickham & GrolemundThis is another beginner-friendly book, teaching all the basics of R clearly and concisely for those with basic programming skills. R is a language intended for manipulation of raw data, and it is an excellent complement to your toolset if you already know Python and are preparing for a career as a data scientist. The IDE used is RStudio, which is bundled with the Anaconda distribution of Python and ML libraries. Both authors are chief scientists involved in the RStudio software development team and are also members of the R Foundation. This book gets you up and running in R effectively and quickly.

4. Hands-On Machine Learning with Scikit-Learn & TensorFlow by Aurelion Geron

 

This book has received massive acclaim from the data science community for the breadth of knowledge which it provides and is one of the best books on this topic till date. TensorFlow coverage is excellent, and there are methodologies that this book teaches to get your data science project perfectly executed immediately. The TensorFlow (with some Keras) coverage is the most simple and easy to understand among all the various TensorFlow tutorials I personally have found both on the Web as well as in the few available ebooks. If you want to work in Deep Learning but don’t know how to get started, this book is for you (it covers Deep Learning as well)!

5. Statistics, 4th Ed. by Freedman, Pisani, and Purves

There is a wide selection of Statistics books for data scientists, but this book is highly recommended. For many simple reasons:

  1. It does not use equations but real-life examples
  2. All the material is based on real-world scenarios
  3. Inferential Statistics coverage is excellent
  4. Study of Design of Experiments is the entry point for the book
  5. Tests for significance and p-values is clearly explained without formulas through examples
  6. The material is presented in such a way that the reader becomes excellent at off-the-cuff estimation.
  7. Most graphics in the book can be drawn easily by hand.
  8. Excellent set of exercises and solutions with real-world applications.
  9. Plain English words, real-life stories, understanding concepts through applications and simple explanations.
  10. Few to almost very little formulae.
  11. Perfect for someone who is approaching the subject for the first time.

And the icing on the cake is that it is a very enjoyable read! 🙂  

Websites for Data Science

1. www.stackoverflow.com

Once upon a time, if a developer was stuck on a programming problem, he would have to go through several textbooks each over 500 pages long to find the answer to his problem. Not any more! StackOverflow is a site that is a platform for questions and doubts on nearly every type of programming language available, including Python, R, scikit-learn, TensorFlow, Keras, pandas, numpy, scipy, Theano, PyTorch, matplotlib and dozens more (both languages and libraries). It shows the power of crowd-sourcing problems since it is much easier to find the answer to a problem from 50,000 people than just four or five which would be the case if you were studying from a few teachers. You can simply copy-paste your error message in your data science compiler tool into StackOverflow and the site will return fully worked out and clearly explained solutions to your problem.

The way I see it – StackOverflow was a defining moment in programming. Once upon a time, debugging was a challenge. Now for nearly 90% or more of all bugs and errors, StackOverflow has your answer, explained in clear English, with the corrected source code. What more could you want? These days, anyone can become a developer in any language, thanks to this single site. And the concept has become so popular that there are now a multitude of crowd-sourcing answers to questions platform websites such as www.math-exchange.com, www.stack-exchange.com, and around 10 to 20 more sites that provide this functionality for that particular field, be it Mathematics, English, and even Christianity!

2. www.kaggle.com

These days, just having an impressive profile on Kaggle will be enough to land you a job interview at the very least. Kaggle is a site that has been hosting data science competitions for many years. The competition is immense and intense, but so are the tutorials and the articles are also equally powerful and instructive. If you want to be a data scientist, not having a decent Kaggle profile is inexcusable. Kaggle will be like a showcase of your data science skills to the entire world. Even if you don’t rank very high, consistency and practice can get you there more often than not.

And there is another side to it, a course designed purely for the purpose of winning data science competitions, available on Coursera (How to Win Data Science Competitions: Learn from Top Kagglers), available on this link: https://www.coursera.org/learn/competitive-data-scienceHowever, this course is not for beginners, it is only for those who already have a strong background in machine learning and machine learning libraries with practical programming knowledge experience.

Courses in Data Science

Dimensionless.in

Dimensionless.in is an elite data science training company that imparts industry level experience and knowledge to those with a real thirst to learn. Training is given from the basics, leading to a strong foundation. Now, anyone with discipline and persistence can learn data science and become a data scientist. All the faculty in the courses are IIT alumni. The training received is personalized to cater to the needs of each student. There are only 20-30 students in a single batch.

Going by popular wisdom, this level of detail and attention is not available in any of the major data science training centers and course curators. Most organizations focus on marketing and volume (quantity). But Dimensionless Technologies has their focus not on quantity but in quality. The following three major courses are offered:

  1. Data Science with Python and R: https://dimensionless.in/data-science-using-r-python/ 
  2. Big Data Analytics and NLP: https://dimensionless.in/big-data-analytics-nlp/
  3. Deep Learning: https://dimensionless.in/deep-learning/

An in-depth explanation for each course is given at each of the links. Do check them out and have an in-depth view of the potential that is here to be tapped, for your benefit.

And Finally…

AI & ML (Artificial Intelligence & Machine Learning) is the future for nearly every single industry. The question on every CEO’s and CIO’s mind will be this: Why should we set up a staffed division in our company for any role when an automated machine with just a high one-time investment (for which operating costs are literally non-existent (compared to paying a salary to 100 staff members – you can do the math) can do the same job for us more reliably, more efficiently, more consistently, and more accurately than people when they do it as staff or employees? That burning question is racing through nearly every industry in the world right now. Thousands of jobs will be automated and the biggest demand in all sectors will be for machine learning experts who are also highly skilled in domain knowledge of that company (say hospitals, for e.g.).

Don’t be scared of the incoming changes. Change is humanity’s best friend. Without change, life would be boring. Changes are not just challenges, they are also opportunities for much higher-paying and much less laborious jobs than the jobs you hold currently. And, if by chance, you happen to be a student reading this article, you now know which industry you should focus on –  completely. All the best, and remember to enjoy the process of learning. Regardless of your age, this is the best time to be alive – ever. Because domain knowledge is available more widely today than at any time in the past. Be enthusiastic. Be positive. Be disciplined. Be focused. And make the right choices at the right times – and no, its never too late when you have quality trainers ready to mentor you. May the thrill of learning a completely new concept with truly enlightened insight never leave you. Once again, all the best.

Robots and Automation