7 Technical Concepts Every Data Science Beginner Should Know

Welcome to Data Science!

 

So you want to learn data science but you don’t know where to start? Or you are a beginner and you want to learn the basic concepts? Welcome to your new career and your new life! You will discover a lot of things on your journey to becoming a data scientist and being part of a new revolution. I am a firm believer that you can learn data science and become a data scientist regardless of your age, your background, your current knowledge level, your gender, and your current position in life. I believe – from experience – that anyone can learn anything at any stage in their lives. What is required is just determination, persistence, and a tireless commitment to hard work. Nothing else matters as far as learning new things – or learning data science – is concerned. Your commitment, your persistence, and the time you invest each day are enough.

I hope you take my point: anyone can learn data science if they have the right motivation. In fact, I believe anyone can learn anything at any stage in their lives if they invest enough time, effort, and hard work into it, even alongside their current occupation. From experience, I strongly recommend that you continue your day job and work on data science as a side hustle, because of the hard work that will be involved. Your commitment is more important than your current life situation. Carrying on a full-time job and learning data science part-time is the best way to go.

 

Technical Concepts of Data Science

So what are the important concepts of data science that you should know as a beginner? They are, in order of sequential learning, the following:

  1. Python Programming
  2. R Programming
  3. Statistics & Probability
  4. Linear Algebra
  5. Data Preparation and Data ETL*
  6. Machine Learning with Python and R
  7. Data Visualization and Summary

*Extraction, Transformation, and Loading

Now if you were to look at the above list and go to a library, you would most likely come back with 9–10 books at an average of 1,000 pages each. Even if you could speed-read, 10,000 pages is a lot to get through. I could list the best books for each topic in this post, but even the most seasoned reader would balk at 10,000 pages. And who reads books these days? So what I am going to give you is a distilled extract of each topic. Keep in mind, however, that every topic given above could be a series of blog posts in its own right: these short paragraphs are just a tiny taste, and there is an ocean of depth behind each one. You might ask, if that is the case, how can anyone be a candidate for a data scientist role? Two words: persistence and motivation. With the right amount of these two characteristics, anyone can be anything they want to be.

 

1) Python Programming

Python is one of the most popular programming languages in the world. It is the ABC of data science, because Python is the language almost every beginner starts with in data science. It is used for nearly every purpose because it is so amazingly versatile: web applications and websites with Django, microservices with Flask, general programming projects with the standard library and the packages on PyPI, GUIs with PyQt5 or Tkinter, and interoperability with other languages through Jython (Java) and Cython (C), among many other options available today.

Of course, Python is also the first language used for data science, with the standard stack of scikit-learn (machine learning), pandas (data manipulation), matplotlib and seaborn (visualization), and NumPy (vectorized computation). Nowadays, the most common way to set this up is the Anaconda distribution, available from www.anaconda.com (at the time of writing, version 2018.12 of Anaconda Distribution 5). To learn more about Python, I strongly recommend the books Head First Python and the Python Cookbook.
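As a taste of what that stack looks like in practice, here is a minimal, illustrative sketch that builds a tiny hand-made dataset with pandas, fits a simple model with scikit-learn, and plots the result with matplotlib. The data and column names are invented for the example.

```python
# A minimal illustrative sketch of the standard Python data science stack.
# The data and column names here are invented for the example.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# pandas: build and inspect a tiny tabular dataset
df = pd.DataFrame({
    "hours_studied": np.arange(1, 11),
    "exam_score": [52, 55, 61, 58, 66, 71, 75, 74, 82, 88],
})
print(df.describe())

# scikit-learn: fit a simple model
model = LinearRegression()
model.fit(df[["hours_studied"]], df["exam_score"])
print("estimated score gain per hour studied:", model.coef_[0])

# matplotlib: visualize the data and the fitted line
plt.scatter(df["hours_studied"], df["exam_score"], label="data")
plt.plot(df["hours_studied"], model.predict(df[["hours_studied"]]), label="fit")
plt.xlabel("hours studied")
plt.ylabel("exam score")
plt.legend()
plt.show()
```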

 

2) R Programming

R is the best language for statistical work, since it is a language designed by statisticians, for statisticians. If you know statistics and mathematics well, you will enjoy programming in R. The language gives you the best support available for probability distributions, statistical functions, mathematical functions, plotting, visualization, interoperability, and even machine learning and AI. In fact, everything that you can do in Python can be done in R. R is the second most popular language for data science in the world, second only to Python; it has a rich ecosystem for every data science requirement and is the favorite language of academicians and researchers.

Learning Python is not enough to be a professional data scientist; you need to know R as well. A good book to start with is R for Data Science, available on Amazon at a very reasonable price. Some of the most popular packages in R that you need to know are ggplot2, threejs, DT (tables), networkD3, and leaflet for visualization; dplyr and tidyr for data manipulation; shiny and R Markdown for reporting; parallel, Rcpp and data.table for high-performance computing; and caret, glmnet, and randomForest for machine learning.

 

3) Statistics and Probability

This is the bread and butter of every data scientist. The best programming skills in the world will be useless without knowledge of statistics. You need to master statistics, especially the practical knowledge used in scientific experimental analysis. There is a lot to cover, and any subtopic given below could be a blog post in its own right. Some of the more important areas that a data scientist needs to master are:

  1. Analysis of Experiments
  2. Tests of Significance
  3. Confidence Intervals
  4. Probability Distributions
  5. Sampling Theory
  6. Central Limit Theorem
  7. Bell Curve
  8. Dimensionality Reduction
  9. Bayesian Statistics

Some places on the Internet to learn statistics from are the MIT OpenCourseWare page Introduction to Statistics and Probability and the Khan Academy Statistics and Probability course. Good books to learn statistics from include Naked Statistics by Charles Wheelan, an accessible and highly insightful book that can be enjoyed even by readers from non-technical backgrounds, and Practical Statistics for Data Scientists by Peter Bruce and Andrew Bruce. A short Python sketch of a couple of these ideas follows.
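As a quick illustration, here is a minimal sketch, using only NumPy and SciPy, of two concepts from the list above: the Central Limit Theorem (via simulation) and a confidence interval. The exponential population is an arbitrary, deliberately non-normal choice.

```python
# A small simulation of the Central Limit Theorem plus a 95% confidence
# interval. The exponential population is an arbitrary, non-normal choice.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Central Limit Theorem: means of many samples drawn from a skewed
# distribution are approximately normally distributed around the true mean.
sample_means = rng.exponential(scale=2.0, size=(10_000, 50)).mean(axis=1)
print("mean of the sample means:", sample_means.mean())   # close to 2.0
print("std of the sample means :", sample_means.std())    # close to 2.0 / sqrt(50)

# A 95% t-based confidence interval for the mean of one observed sample.
sample = rng.exponential(scale=2.0, size=50)
ci = stats.t.interval(0.95, df=len(sample) - 1,
                      loc=sample.mean(), scale=stats.sem(sample))
print("95% confidence interval for the mean:", ci)
```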

 

4) Linear Algebra

Succinctly, linear algebra is about vectors, matrices, and the operations that can be performed on them. It is a fundamental area for data science because almost every operation we perform as data scientists has a linear algebra underpinning: we usually work with collections of vectors and matrices. All of the topics below are covered in the world-famous book Linear Algebra and Its Applications by Gilbert Strang, an MIT professor; you can also go to the popular MIT OpenCourseWare course Linear Algebra (MIT OCW). These two resources cover everything you need to know. Some of the most fundamental concepts, which you can also Google or look up on Wikipedia, are listed below, followed by a short NumPy sketch:

  1. Vector Algebra
  2. Matrix Algebra
  3. Operations on Matrices
  4. Determinants
  5. Eigenvalues and Eigenvectors
  6. Solving Linear Systems of Equations
  7. Computer-Aided Algebra Software (Mathematica, Maple, MATLAB, etc)
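Here is a minimal NumPy sketch, on a made-up 2×2 matrix, of several of the concepts listed above: matrix operations, determinants, eigenvalues and eigenvectors, and solving a linear system.

```python
# A short NumPy sketch of core linear algebra operations on a toy matrix.
import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])
b = np.array([10.0, 7.0])

# Matrix operations
print("A transposed:\n", A.T)
print("A times A:\n", A @ A)

# Determinant
print("det(A):", np.linalg.det(A))

# Eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print("eigenvalues of A:", eigenvalues)

# Solving the linear system A x = b
x = np.linalg.solve(A, b)
print("solution of A x = b:", x)
```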

 

5) Data Preparation and Data ETL (Extraction, Transformation, and Loading)


 

Yes – welcome to one of the more infamous sides of data science! If data science has a dark side, this is it. Know for sure that unless your company has some dedicated data engineers who do all the data munging and data wrangling for you, 90% of your time on the job will be spent on working with raw data. Real world data has major problems. Usually, it’s unstructured, in the wrong formats, poorly organized, contains many missing values, contains many invalid values, and contains types that are not suitable for data mining.

Dealing with these problems takes up a huge portion of a data scientist’s time, and your analysis has the potential to go massively wrong when invalid and missing data are left in place. Practically speaking, unless you are unusually blessed, you will have to manage your own data, and that means doing your own ETL (Extraction, Transformation, and Loading). ETL is a data mining and data warehousing term for loading data from an external data store or data mart into a form suitable for data mining and analysis, which usually involves a lot of data preprocessing. Finally, you will often have to load data that is too big for your working memory – a problem referred to as out-of-core (external memory) processing. During your data wrangling phase, be sure to look into the following components:

  1. Missing data
  2. Invalid data
  3. Data preprocessing
  4. Data validation
  5. Data verification
  6. Automating the Data ETL Pipeline
  7. Automation of Data Validation and Verification

Usually, expert data scientists try to automate this process as much as possible, since a human being tires of the task very quickly and is remarkably prone to errors – problems that do not arise when a Python or R script performs the same operations. Be sure to automate every stage of your data processing pipeline that you can.
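To make that concrete, here is a minimal pandas sketch of a few of the steps above: dropping duplicates, handling missing and invalid values, validating the result, and (commented out) reading a file too big for memory in chunks. The data, file name, and column names are invented for the example.

```python
# A minimal pandas sketch of basic data preparation steps.
# The data, file name, and column names are invented for the example.
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, -5, -5, np.nan, 51],              # -5 is invalid, NaN is missing
    "signup_date": ["2019-01-03", "2019-02-30", "2019-02-30",
                    "2019-03-10", "2019-04-01"],   # one impossible date
})

clean = raw.drop_duplicates().copy()
clean["age"] = clean["age"].where(clean["age"].between(0, 120))   # invalid -> NaN
clean["age"] = clean["age"].fillna(clean["age"].median())         # impute missing values
clean["signup_date"] = pd.to_datetime(clean["signup_date"], errors="coerce")

# Simple validation: fail loudly instead of silently analysing bad data.
assert clean["age"].between(0, 120).all(), "age out of range after cleaning"
print(clean)

# Out-of-core loading: process a CSV that does not fit in memory in chunks.
# for chunk in pd.read_csv("huge_file.csv", chunksize=100_000):
#     process(chunk)   # 'process' is a placeholder for your own function
```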

 

6) Machine Learning with Python and R

An expert machine learning scientist has to be proficient in the following areas at the very least:

Data Science Topics Listing – Thomas

 

Now if you are just starting out in Machine Learning (ML), Python, and R, the listing above will give you a sense of how huge the field is; the entire set of lists might look more like advanced Greek than plain English. But not to worry: there are ways to streamline your learning and cover nearly every single topic above in as little time as possible. After you learn the basics of Python and R, you need to go on to start building machine learning models. From experience, I suggest you split your time roughly 50/50 between Python and R, but spend as long as possible on one language at a time before switching. Focusing on one language at a time prevents syntax errors, conceptual errors, and general language confusion.

Now, on the job, in real life, it is much more likely that you will work in a team and be responsible for only one part of the work. However, if you’re working in a startup, or learning on your own initially, you will end up doing every phase of the work yourself. Be sure to give yourself time to process information and to let your brain rest and get a handle on the topics you are trying to learn. For more on this, do check out the Learning How to Learn MOOC on Coursera, which is the best way to learn mathematical or scientific topics without burning out. In fact, I would recommend this approach to every programmer out there trying to learn a programming language, or anything considered difficult, like Quantum Mechanics and Quantum Computation or String Theory, or even Microsoft F# or C# for a non-Java programmer.

I strongly recommend the book Hands-On Machine Learning with Scikit-Learn and TensorFlow for learning machine learning with Python. The recommended R book was given earlier, in the section on R.
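To show what “building a machine learning model” looks like in scikit-learn, here is a minimal sketch of the basic workflow: split the data, fit a model, and evaluate it on held-out data. It uses a small toy dataset that ships with scikit-learn; the model choice and parameters are arbitrary.

```python
# A minimal scikit-learn sketch of the basic model-building workflow,
# using a toy dataset bundled with the library.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load data and hold out 20% for evaluation
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a model on the training split only
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Evaluate on data the model has never seen
predictions = model.predict(X_test)
print("held-out accuracy:", accuracy_score(y_test, predictions))
```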

 

7) Data Visualization and Summary

Common tools that you have with which you can produce powerful visualizations include:

  1. Matplotlib
  2. Seaborn
  3. Bokeh
  4. ggplot2
  5. plot.ly
  6. D3.js
  7. Tableau
  8. Google Data Studio
  9. Microsoft Power BI Desktop

Some of these tools involve coding, some are drag-and-drop with no coding at all, and some are harder for beginners than others, but all of them will help you with data visualization. One of the most overlooked but critical practical functions of a data scientist has also been included under this heading: summarisation.
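For the coding-based tools, here is a small matplotlib and seaborn sketch that summarises the same dataset in two ways: a distribution plot and a relationship plot. It uses seaborn’s bundled ‘tips’ example dataset (downloaded on first use) and assumes a reasonably recent seaborn version.

```python
# A small matplotlib/seaborn sketch: one dataset summarised two ways.
# Uses seaborn's bundled 'tips' example dataset (downloaded on first use).
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Distribution of a single variable
sns.histplot(data=tips, x="total_bill", ax=axes[0])
axes[0].set_title("Distribution of total bill")

# Relationship between two variables, coloured by a category
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time", ax=axes[1])
axes[1].set_title("Tip vs. total bill")

plt.tight_layout()
plt.show()
```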

Summarisation means the practical result of your data science workflow. What does the result of your analysis mean for the operation of the business or the research problem you are currently working on? How do you convert your result into the maximum improvement for your business? Can you measure the impact this result will have on the profit of your enterprise, and if so, how? Being able to come out of a data science workflow with this kind of result is one of the most important capacities of a data scientist. And most of the time, efficient summarisation equals excellent knowledge of statistics. Know for sure that statistics is the start and the end of every data science workflow, and you cannot afford to be ignorant of it. Refer to the section on statistics, or Google the term, for further sources of information.

How Can I Learn Everything Above In the Shortest Possible Time?

You might wonder: how can I learn everything given above? Is there a course or a pathway to learn every single concept described in this article in one shot? It turns out there is: a dream course for a data scientist that contains nearly everything talked about in this article.

Want to become a data scientist? Welcome to Dimensionless Technologies! It just so happens that the course Data Science using Python and R, a ten-week course that includes ML, Python and R programming, statistics, GitHub project guidance, and job placement, offers nearly every component spoken about above, and more besides. You don’t need to buy the books or take any courses other than this one to learn the topics in this article. Everything is covered by this single course, tailor-made to turn you into a data scientist within the shortest possible time. For more, I’d like to refer you to the following link:

Data Science using R & Python

Does this seem too good to be true? Perhaps, because this is a paid course. With a scholarship concession, you could end up paying around INR 40,000 for this ten-week course. You can register for the first two weeks for INR 5,000 and pay the remainder after the two-week trial period, once you have seen whether the course really suits you. If it doesn’t, you can drop out after two weeks and be poorer by just 5k. But in most cases, this course has been found to carry genuine worth. And nothing worthwhile was ever achieved without some investment, right?

In case you want to learn more about data science, please check out the following articles:

Data Science: What to Expect in 2019

and:

Big Data and Blockchain

Also, see:

AI and intelligent applications

and:

Evolution of Chatbots & their Performance

All the best, and enjoy data science. Every single day of your life!

Top 10 Data Science Projects for 2019

Introduction

Data scientists are among the most hirable specialists today, but it’s not easy to enter this profession without a “Projects” field in your resume. You need experience to get the job, and you need the job to get the experience – a vicious circle, right? Projects are the way out. The great advantage of data science projects is that each of them is a full-stack data science problem: you need to formulate the problem, design the solution, find the data, master the technology, build a machine learning model, evaluate its quality, and maybe wrap it in a simple UI. This is a more diverse approach than, for example, a Kaggle competition or Coursera lessons.

In this blog, we will look at 10 projects to undertake in 2019 to learn data science and improve your understanding of different concepts.

Projects

 

1. Match Career Advice Questions with Professionals in the Field

Problem Statement: The U.S. has almost 500 students for every guidance counselor, and young people often lack the network to find career role models, making CareerVillage.org the only option for millions of young people in America and around the globe with nowhere else to turn. To date, 25,000 volunteers have created profiles and opted in to receive emails when a career question is a good fit for them. This is where your skills come in. To help students get the advice they need, the team at CareerVillage.org needs to be able to send the right questions to the right volunteers, and the notifications sent to volunteers seem to have the greatest impact on how many questions are answered.

Your objective: Develop a method to recommend relevant questions to the professionals who are most likely to answer them.

Data: Link

2. Histopathologic Cancer Detection

Problem Statement: In this competition, you must create an algorithm to identify metastatic cancer in small image patches taken from larger digital pathology scans. The data is a slightly modified version of the PatchCamelyon (PCam) benchmark dataset. PCam is highly interesting for its size, its simplicity to get started on, and its approachability.

Your objective: Identify metastatic tissue in histopathologic scans of lymph node sections

Data: Link

3. Aerial Cactus Identification

Problem Statement: To assess the impact of climate change on Earth’s flora and fauna, it is vital to quantify how human activities such as logging, mining, and agriculture are impacting our protected natural areas. Furthermore, researchers in Mexico have created the VIGIA project, which aims to build a system for autonomous surveillance of protected areas. Moreover, the first step in such an effort is the ability to recognize the vegetation inside the protected areas. In this competition, you are tasked with the creation of an algorithm that can identify a specific type of cactus in aerial imagery.

Your objective: Determine whether an image contains a columnar cactus

Data: Link

4. TMDB Box Office Prediction

Problem Statement: In a world where movies made an estimated $41.7 billion in 2018, the film industry is more popular than ever. But which movies make the most money at the box office? How much does a director matter? Or the budget? For some movies, it’s “You had me at ‘Hello.’” In this competition, you’re presented with metadata on over 7,000 past films from The Movie Database and asked to predict their overall worldwide box office revenue. The data points provided include cast, crew, plot keywords, budget, posters, release dates, languages, production companies, and countries. You can also collect other publicly available data to use in your model predictions.

Your objective: Can you predict a movie’s worldwide box office revenue?

Data: Link

5. Quora Insincere Questions Classification

Problem Statement: An existential problem for any major website today is how to handle toxic and divisive content. Furthermore, Quora wants to tackle this problem head-on to keep their platform a place where users can feel safe sharing their knowledge with the world. On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions — those founded upon false premises, or that intend to make a statement rather than look for helpful answers.

In this competition, you need to develop models that identify and flag insincere questions. Moreover, to date, Quora has employed both machine learning and manual review to address this problem. With your help, they can develop more scalable methods to detect toxic and misleading content.

Your objective: Detect toxic content to improve online conversations

Data: Link

6. Store Item Demand Forecasting Challenge

Problem Statement: This competition is provided as a way to explore different time series techniques on a relatively simple and clean dataset. You are given 5 years of store-item sales data and asked to predict 3 months of sales for 50 different items at 10 different stores. What’s the best way to deal with seasonality? Should stores be modelled separately, or can you pool them together? Does deep learning work better than ARIMA? Can either beat xgboost? Also, this is a great competition to explore different models and improve your skills in forecasting.

Your Objective: Predict 3 months of item sales at different stores

Data: Link
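Before reaching for ARIMA, xgboost, or deep learning on a problem like the store-item demand challenge above, it helps to have a baseline to beat. Here is a minimal, illustrative sketch on simulated daily sales (not the competition data) of a seasonal-naive forecast that simply repeats the last observed week.

```python
# A minimal forecasting baseline on simulated store-item sales:
# the seasonal-naive forecast repeats the last observed week.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
days = pd.date_range("2017-01-01", periods=730, freq="D")
weekly_pattern = 10 * np.sin(2 * np.pi * np.arange(730) / 7)   # weekly seasonality
sales = 100 + weekly_pattern + rng.normal(0, 5, size=730)
series = pd.Series(sales, index=days)

# Hold out the last 28 days for evaluation
train, test = series.iloc[:-28], series.iloc[-28:]

# Seasonal-naive forecast: each future day repeats the value 7 days earlier
forecast = np.tile(train.iloc[-7:].to_numpy(), 4)

mae = np.mean(np.abs(test.to_numpy() - forecast))
print("seasonal-naive MAE over the last 28 days:", round(float(mae), 2))
```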

7. Web Traffic Time Series Forecasting

Problem Statement: This competition focuses on forecasting the future values of multiple time series, which has always been one of the most challenging problems in the field. More specifically, the competition tests state-of-the-art methods designed by the participants on the problem of forecasting future web traffic for approximately 145,000 Wikipedia articles. Sequential or temporal observations emerge in many key real-world problems, ranging from biological data, financial markets, and weather forecasting to audio and video processing, and the field of time series encapsulates many different problems, from analysis and inference to classification and forecasting. What can you do to help predict future views?

Your objective: Forecast future traffic to Wikipedia pages

Data: Link

8. Transfer Learning on Stack Exchange Tags

Problem Statement: What does physics have in common with biology, cooking, cryptography, diy, robotics, and travel? If you answer “all pursuits are under the immutable laws of physics,” we’ll begrudgingly give you partial credit. If you answer “people chose them randomly for a transfer learning competition,” congratulations: we accept your answer and mark the question as solved.

In this competition, we provide the titles, text, and tags of Stack Exchange questions from six different sites, and then ask for tag predictions on unseen physics questions. Solving this problem via a standard machine learning approach might involve training an algorithm on a corpus of related text. Here, you are challenged to train on material from outside the field. Can an algorithm learn appropriate physics tags from “extreme-tourism Antarctica”? Let’s find out.

Your objective: Predict tags from models trained on unrelated topics

Data: Link

9. Digit Recognizer

Problem Statement: MNIST (“Modified National Institute of Standards and Technology”) is the de facto “hello world” dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike. Furthermore, in this competition, your goal is to correctly identify digits from a dataset of tens of thousands of handwritten images.

Your objective: Learn computer vision fundamentals with the famous MNIST data

Data: Link
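As a starting point for the digit recognizer, here is a minimal scikit-learn sketch. For brevity it uses the small 8x8 digits dataset bundled with scikit-learn rather than the full 28x28 MNIST images, but the workflow (split, fit, evaluate) is the same.

```python
# A minimal digit classification sketch using scikit-learn's bundled
# 8x8 digits dataset (a small stand-in for the full MNIST data).
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0
)

clf = SVC(gamma=0.001)            # a classic support vector machine baseline
clf.fit(X_train, y_train)

print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```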

10. Titanic: Machine Learning from Disaster

Problem Statement: The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. Furthermore, this sensational tragedy shocked the international community and led to better safety regulations for ships. Also, one of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class. In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive.

Your objective: Predict survival on the Titanic and get familiar with ML basics

Data: Links

Summary

The best way to showcase your Data Science skills is with these 5 types of projects:

  1. Deep Learning
  2. Natural Language Processing
  3. Big Data
  4. Machine Learning
  5. Image Processing

Hence, be sure to document all of these on your portfolio website.

Follow this link, if you are looking to learn more about data science online!

You can follow this link for our Big Data course!

Additionally, if you are interested in learning Data Science, click here to get started

Furthermore, if you want to read more about data science, you can read our blogs here

Also, the following are some blogs you may like to read

Big Data and Blockchain

AI and intelligent applications

How to train a decision tree classifier for churn prediction

 

Data Science: What to Expect in 2019

Introduction

2019 looks to be the year of using smarter technology in a smarter way. Three key trends — artificial intelligence systems becoming a serious component in enterprise tools, custom hardware breaking out for special use-cases, and a rethink on data science and its utility — will all combine into a common theme.

In recent years, we’ve seen all manner of jaw-dropping technology, but the emphasis has been very much on what these gadgets and systems can do and how they do it, with much less attention paid to why.

In this blog, we will explore different areas of data science and set out our expectations for them in 2019. These areas include machine learning, AR/VR systems, edge computing, and more. Let us go through them one by one.

Machine Learning/Deep Learning

Businesses are using machine learning to improve all sorts of outcomes, from optimizing operational workflows and increasing customer satisfaction to discovering new competitive differentiators. But now all the hype around AI is settling, and machine learning is no longer just a cool term. Organisations are looking for more ways of identifying options, for example in the form of agent modelling, and wider adoption of these algorithms now looks very feasible. Adoption will be seen in new and old industries alike.

Healthcare companies are already big users of AI, and this trend will continue. According to Accenture, the AI healthcare market might hit $6.6 billion by 2021, and clinical health AI applications can create $150 billion in annual savings for the U.S. healthcare economy by 2026.

In retail, global spending on AI will grow to $7.3 billion a year by 2022, up from $2 billion in 2018, according to Juniper Research. This is because companies will invest heavily in AI tools that will help them differentiate and improve the services they offer customers.

In cybersecurity, the adoption of AI has brought a boom in startups, which have raised $3.65 billion in equity funding over the last five years. Cyber AI can help security experts sort through millions of incidents to identify aberrations, risks, and signals of future threats.

And there is even an opportunity brewing in industries facing labour shortages, such as transportation. At the end of 2017, there was a shortage of 51,000 truck drivers (up from a shortage of 36,000 the previous year). And the ATA reports that the trucking industry will need to hire 900,000 more drivers in the next 10 years to keep up with demand. AI-driven autonomous vehicles could help relieve the need for more drivers in the future.

Programming Language

The practice of data science requires the use of analytics tools, technologies and programming languages to help data professionals extract insights and value from data. A recent survey of nearly 24,000 data professionals by Kaggle suggests that Python, SQL and R are the most popular programming languages. The most popular, by far, was Python (83%). Additionally, 3 out of 4 data professionals recommended that aspiring data scientists learn Python first.

The remaining programming languages are recommended at a significantly lower rate (R by 12% of respondents, SQL by 5%). Python will continue to boom in 2019, but the R community has also come up with a lot of recent advancements; with new packages and improvements, R is expected to move closer to Python in terms of usage.

Blockchain and Big Data

In recent years, blockchain has been at the heart of computer technologies. It is a cryptographically secure, distributed database technology for storing and transmitting information. The main advantage of the blockchain is that it is decentralized: no single party controls the data entering the chain or its integrity. Instead, these checks run across the various computers on the network, which all hold the same information, so faulty data on one computer cannot enter the chain because it will not match the equivalent data held by the other machines. To put it simply, as long as the network exists, the information remains in the same state.

Big Data analytics will be essential for tracking transactions and enabling businesses that use the Blockchain to make better decisions. That’s why new Data Intelligence services are emerging to help financial institutions and governments and other businesses discover who they interact with within the Blockchain and discover hidden patterns.

Augmented-Reality/Virtual Reality

The broader the canvas of visualization, the better the understanding. That’s exactly what happens when one visualizes big data through Augmented Reality (AR) and Virtual Reality (VR). A combination of AR and VR could open a world of possibilities to better utilize the data at hand. VR and AR can practically improve the way we perceive data and could actually be the solution to making use of the large amounts of data that currently go unused.

By presenting data in 3D, the user will be able to decipher the major takeaways from it better and faster, with easier understanding. Much recent research shows that VR and AR have a high sensory impact, which promotes faster learning and understanding.

This immersive way of representing data enables analysts to handle big data more efficiently. It makes analysis and interpretation more of an experience and a realisation than traditional analysis does. Instead of seeing only numbers and figures, the user will be able to see beyond them, into the facts, happenings, and reasons, which could revolutionize businesses.

Edge Computing

Computing infrastructure is an ever-changing landscape of technology advancements. Current changes affect the way companies deploy smart manufacturing systems to make the most of advancements.

The rise of edge computing capabilities coupled with traditional industrial control system (ICS) architectures provides increasing levels of flexibility. In addition, time-synchronized applications and analytics augment, or in some cases minimize, the need for larger Big Data operations in the cloud, regardless of where that cloud is hosted.

Edge is still in early-stage adoption, but one thing is clear: edge devices are subject to large-scale investments from cloud suppliers looking to offload bandwidth. There are also latency issues due to the explosion of IoT data in both industrial and commercial applications.

Edge adoption will likely increase soon wherever users have questions about whether the cloud suits their specific use case. Cloud-level interfaces and apps will migrate to the edge, and industrial application hosting and analytics will become common there, using virtual servers and simplified, operational-technology-friendly hardware and software.

The Rise of Semi-Automated Tools for Data Science

There has been a rise in self-service BI tools such as Tableau, Qlik Sense, Power BI, and Domo, with which managers can obtain current business information in graphical form on demand. Although IT may need to do a certain amount of setup at the outset, and again when adding a data source, most of the data cleaning work and analysis can be done by analysts, and the analyses can update automatically from the latest data any time they are opened.

Managers can then interact with the analyses graphically to identify issues that need to be addressed. In a BI-generated dashboard or “story” about sales numbers, that might mean drilling down to find underperforming stores, salespeople, and products, or discovering trends in year-over-year same-store comparisons. These discoveries might in turn guide decisions about future stocking levels, product sales and promotions, and perhaps even the building of additional stores in under-served areas.

Upgrade in Job Roles

In recent times, there have been a lot of advancements in the data science industry. With these advancements, businesses are in better shape to extract much more value out of their data, and with this increase in expectations there is a shift in the roles of both data scientists and business analysts. Data scientists should move from a purely statistical focus to more of a research role, while business analysts fill the gap left by data scientists and take up parts of their role.

We can see this as an upgrade to both job roles. Business analysts still hold the business angle firmly but are also handling the statistical and technical parts of the job. They are now more involved in predictive analytics, at a stage where they can use off-the-shelf algorithms for predictions in their business domains. BAs are no longer limited to reporting and a business mindset; they are moving into prescriptive analytics too, taking on model building, data warehousing, and statistical analysis.

Summary

How these rising expectations are met will be fascinating to watch. It could be that the data science field has to completely overhaul what it can offer, overcoming seemingly off-limit barriers. Alternatively, it could be that businesses discover their expectations can’t be met and have to adjust to this reality in a productive manner rather than get bogged down in frustration.

In conclusion, 2019 promises to be a year where smart systems make further inroads into our personal and professional lives. More importantly, I expect our professional lives to get more sophisticated, with a variety of agents and systems helping us get more out of our time in the office!

Follow this link, if you are looking to learn more about data science online!

You can follow this link for our Big Data course!

Additionally, if you are interested in learning Data Science, click here to get started

Furthermore, if you want to read more about data science, you can read our blogs here

Also, the following are some blogs you may like to read

Big Data and Blockchain

What is Predictive Model Performance Evaluation

AI and intelligent applications

 

The Rise of Edge Computing

Introduction

Computing infrastructure is an ever-changing landscape of technology advancements. Current changes affect the way companies deploy smart manufacturing systems to make the most of advancements.

The rise of edge computing capabilities coupled with traditional industrial control system (ICS) architectures provides increasing levels of flexibility. In addition, time-synchronized applications and analytics augment, or in some cases minimize, the need for larger Big Data operations in the cloud, regardless of where that cloud is hosted.

In this blog, we will start with the definition of edge computing. After that, we will discuss the need for edge computing and its applications, and try to understand the scope of edge computing in the future.

What is Edge computing

Consolidation and the centralized nature of cloud computing have proven cost-effective and flexible, but the rise of the IIoT and mobile computing has put a strain on networking band­width. Ultimately, not all smart devices need to use cloud comput­ing to operate. In some cases, architects can — and should — avoid the back and forth. Edge computing could prove more efficient in some areas where cloud computing operates.

Furthermore, edge computing permits data processing closer to its origin (i.e., motors, pumps, generators or other sensors), reducing the need to transfer that data back and forth to the cloud.

Additionally, think of edge computing in manufacturing as a network of micro data centers capable of hosting, storage, computing and analysis on a localized basis while pushing aggregate data to a centralized plant or enterprise data center, or even the cloud (private or public, on-premise or off) for further analysis, deeper learning, or to feed an artificial intelligence (AI) engine hosted elsewhere.

According to Microsoft, in edge computing, compute resources are “placed closer to information-generation sources to reduce network latency and bandwidth usage generally associated with cloud computing.” This helps to ensure continuity of services and operations even if cloud connections aren’t steady.

Also, this moving of compute and storage to the “edge” of the network, away from the data centre and closer to the user, cuts down the amount of time it takes to exchange messages compared with traditional centralized cloud computing. Moreover, according to research by IEEE, it can help to balance network traffic, extend the life of IoT devices and, ultimately, reduce “response times for real-time IoT applications.”

Terms in Edge Computing

Like most technology areas, edge computing has its own lexicon. Here are brief definitions of some of the more commonly used terms

  • Edge devices: Any device that produces or collects data, such as a sensor or an industrial machine.
  • Edge: What counts as the edge depends on the use case. In telecommunications, perhaps the edge is a cell phone, or maybe it’s a cell tower. In an automotive scenario, the edge of the network could be a car. In manufacturing, it could be a machine on a shop floor; in enterprise IT, a laptop.
  • Edge gateway: A buffer between where edge computing processing is done and the broader fog network. The gateway is the window into the larger environment beyond the edge of the network.
  • Fat client: Software that can do some data processing on edge devices, as opposed to a thin client, which merely transfers data.
  • Edge computing equipment: Edge computing uses a range of existing and new equipment. Many devices, sensors and machines can be outfitted to work in an edge computing environment simply by making them Internet-accessible. Cisco and other hardware vendors offer rugged network equipment with hardened exteriors meant for field environments. A range of compute servers and even storage-based hardware systems like Amazon Web Services’ Snowball are used in edge computing deployments.
  • Mobile edge computing: The buildout of edge computing systems in telecommunications networks, particularly for 5G scenarios.

Why Rise in Edge Computing

1. Latency in decision making

Businesses are getting a huge boost from computerised systems, especially as they evolve into the cloud era. But bringing that same level of technology across different sites has proven not so straightforward for many companies, particularly as the sites started generating more data. The main concern is latency, that being the time it takes for data to move between points. As with trading near the NYSE, a little distance goes a long way in the computer world, so it stands to reason that delays in sending the data needed to reach decisions will translate into delays for the business.

2. Decentralisation and scaling

To some, it may seem counterintuitive to move away from the centre. Wasn’t centralisation the whole point of cloud systems? But the cloud isn’t about pooling everything in the middle; it’s about scale and making it easier to access the services that the business uses every day. The transfer-gap problem between sites and data centres predates the cloud era, yet the cloud can exacerbate it. The only way to overcome this gap is to move some of the data centre capacity to where the data is.

3. Process Optimisation

With edge computing, data centres can execute rules that are time-sensitive (like “stop the car” in the case of driverless vehicles), and then stream data to the cloud in batches when bandwidth needs aren’t as high. The cloud can then take the time to analyze data from the edge and send back recommended rule changes, like “decelerate slowly when the car senses human activity within 50 feet.”

4. Cost

Cost is also a driving factor for edge computing. The bulk of telemetry data from sensors and actuators is likely not relevant to the IoT application: the fact that a temperature sensor reports a 20ºC reading every second might not be interesting until the sensor reports a 40ºC reading. Edge computing allows data to be filtered and processed before it is sent to the cloud, which reduces the network cost of data transmission as well as the cloud storage and processing cost of data that is not relevant to the application. (A toy Python sketch of this filtering idea follows at the end of this list.)

5. Resourcefulness

Storing and processing data on the edge and only sending out to the cloud what will be used and useful saves bandwidth and server space.
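To make the cost point from item 4 concrete, here is a toy Python sketch of edge-side filtering: routine readings are summarised locally and only threshold-crossing events or periodic summaries are forwarded to the cloud. All names, thresholds, and the send_to_cloud callback are illustrative, not part of any real edge framework.

```python
# A toy sketch of edge-side filtering: forward only important events or
# periodic summaries to the cloud. All names and thresholds are illustrative.
from statistics import mean

THRESHOLD_C = 40.0        # forward immediately above this temperature
SUMMARY_EVERY = 60        # otherwise, send one summary per 60 readings

buffer = []

def handle_reading(temperature_c, send_to_cloud):
    """Decide locally whether a sensor reading needs to leave the edge."""
    buffer.append(temperature_c)
    if temperature_c >= THRESHOLD_C:
        send_to_cloud({"event": "over_temperature", "value": temperature_c})
    elif len(buffer) >= SUMMARY_EVERY:
        send_to_cloud({"event": "summary", "mean": mean(buffer), "max": max(buffer)})
        buffer.clear()

# Example: 120 routine readings produce only 2 uploads instead of 120.
sent = []
for reading in [20.0] * 120:
    handle_reading(reading, sent.append)
print("messages sent to the cloud:", len(sent))
```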

Where all we are using it

1. Grid Edge Control and Analytics

Grid edge computing solutions help utilities monitor and analyse, in real time, the additional renewable power generating resources integrated into their grid. This is something legacy SCADA systems are unable to offer.

From residential rooftop solar to solar farms, commercial solar, electric vehicles and wind farms, smart meters are generating a ton of data that helps utilities view the amount of energy available and required, allowing their demand response to become more efficient, avoiding peaks and reducing costs. This data is first processed by the grid edge controllers, which perform local computation and analysis and only send the necessary actionable information over a wireless network to the utility.

2. Oil and Gas Remote Monitoring

Safety monitoring within critical infrastructure such as oil and gas utilities is of utmost importance. For this reason, many cutting-edge IoT monitoring devices are being deployed to safeguard against disaster. Edge computing allows data to be analysed, processed, and delivered to end-users in real time, letting control centres access data as it occurs so they can foresee and prevent malfunctions or incidents before they happen. This really matters: when dealing with critical infrastructure such as oil and gas or other energy services, any failure within a particular system has the potential to be catastrophic and should always warrant the highest levels of precaution.

3. Internet of Things

A smart window firm monitors windows for errors, weather information, maintenance needs and performance. This generates a massive stream of data as each device is regularly reporting information. Edge services filter this information and report a summary back to a centralized service that is running from the firm’s primary data centres. By summarizing information before reporting it, global bandwidth consumption is reduced by 99%.

4. E-Commerce

An e-commerce company delivers images and static web content from a content delivery network. They also perform processing at edge data centres to quickly calculate product recommendations for customers.

5. Markets

A hedge fund pays an expensive premium for servers that are in close proximity to various stock exchanges to achieve extremely low latency trading. Trading algorithms are deployed on these machines. These servers are expensive and resource constrained. As such, they connect back to a cloud service for processing support.

6. Games

A game platform executes certain real-time elements of the game experience on edge servers near the user. The edges connect to a cloud backend for support processing. The backend is run from three regions that need not be close to the end-user.

Predictions for Edge Computing in Future

According to IDC, by 2020 the IT spend on edge infrastructure will reach up to 18% of the total spend on IoT infrastructure. That spend is driven by the deployment of converged IT and OT systems, which, IDC adds, reduces the time to value of data collected from connected devices. It’s what we explained and illustrated above, in a nutshell.

According to a November 1, 2017 announcement regarding research on the edge computing market across hardware, platforms, solutions and applications (smart cities, augmented reality, analytics, etc.), the global edge computing market is expected to reach USD 6.72 billion by 2022, at a compound annual growth rate of a whopping 35.4 per cent.

The major trends responsible for the growth of the market in North America are all too familiar: a growing number of devices and a growing dependency on IoT devices, the resulting need for faster processing, the increase in cloud adoption, and the increase in pressure on networks.

In an October 2018 blog post, Gartner’s Rob van der Meulen said that currently, around 10% of enterprise-generated data is created and processed outside a traditional centralized data centre or cloud. By 2022, Gartner predicts this figure will reach 50 per cent.

Summary

Edge is still in early stage adoption, but one thing is clear: Edge devices are subject to large-scale investments from cloud suppliers to offload bandwidth. Also, there are latency issues due to an explosion of the Internet of Things (IoT) data in both industrial and commercial applications.

Edge soon will likely increase in adoption where users have questions about how or if the cloud applies for the specific use case. Cloud-level interfaces and apps will migrate to the edge. Industrial application hosting and analytics will become common at the edge, using virtual servers and simplified operational technology-friendly hardware and software.

Benefits in network simplification, security and bandwidth accompany the IT simplification.

Follow this link, if you are looking to learn more about data science online!

You can follow this link for our Big Data course!

Additionally, if you are interested in learning Data Science, click here to get started

Furthermore, if you want to read more about data science, you can read our blogs here

Also, the following are some blogs you may like to read

MATLAB for Data Science

Top 5 Ways to Evaluate Data Science Competency

Can you learn Data Science and Machine Learning without Maths?

 

Creativity & Curiosity: The Glue Holding Innovation and Data Science


Introduction

As organizations turn to digital transformation strategies, they are also increasingly forming teams around the practice of data science. Currently, the main challenge for many CIOs, CDOs, and other chief data scientists consists of positioning the data science function precisely where an organization needs it to improve its present and future activities. This implies embedding data science teams so they fully engage with the business, and adapting the operational backbone of the company.

Furthermore, with all the requirements and expectations businesses have of data science, innovation and experimentation will be key factors in moving data science forward. Let us have a look at the growth of data science in recent years, and then understand how creativity and innovation have accelerated this growth so far and what the future prospects are.

The Growth of Data Science

LinkedIn recently published a report naming the fastest-growing jobs in the US based on the site’s data, comparing figures from 2012 and 2017. The top two spots were machine learning jobs, which grew by 9.8x in the past five years, and data scientist, which grew 6.5x since 2012. So why are data science positions, and specifically machine learning positions, growing so fast?

1. The amount of data has skyrocketed
Not only was roughly 90 per cent of the world’s data created in the last two years, but the current data output is 2.5 quintillion bytes daily.

2. Data-driven decisions are more profitable
In the end, for many companies, data is not useful unless it is beneficial, which it certainly is. Data not only helps companies make better decisions; those decisions also usually come with a financial gain. A study by Harvard Business Review found that “companies in the top third of their industry in the use of data-driven decision making were more productive and profitable than their competitors.”

3. Machine learning is changing how you do business
Machine learning is a type of artificial intelligence (AI) where the systems can actually learn and evolve. Also, it has infiltrated many industries, from marketing to finance to health care. The advanced algorithms save time and resources, making quick, correct decisions based on past learnings

4. Machine learning provides better forecasting
Machine learning algorithms often find hidden insights that went unseen by the human eye. With the vast amount of data in the processing stage, even an entire team of data scientists might miss a particular trend or pattern. The ability to predict what will happen in the market is what keeps businesses competitive.

Why Creativity and Curiosity are Needed for Growth of Data Science?

Data Science is More About Asking Why?

Data science is focused on querying every result and keeping an inquisitive mindset; you cannot be a good data scientist if you lack inquisitive skills. An inquisitive nature plays a major role in bringing out the hidden patterns and insights present in data. Data can be complex, and the answer to your hypothesis may lie hidden somewhere within it. It is the inquisitive skill of a data scientist that leverages this hidden potential of data in achieving business goals.

Varied Implementations in Different Domains

Industry influencers, academicians, and other prominent stakeholders certainly agree that data science has become a big game changer in most, if not all, types of modern industries over the last few years. As big data continues to permeate our day-to-day lives, there has been a significant shift of focus from the hype surrounding it to finding real value in its use. Data science now finds its use in the most unlikely places one can think of, and such varied implementations and decisions require creativity and curiosity in the minds of data scientists.

Different Problems — One Solution

This is about dealing with multiple problems using one solution. There can be separate solutions to different problems, but re-using an old solution from a different problem space and applying it in an unlikely domain (extreme experimentation) has produced some great ideas recently. For example, the CNN in deep learning is a classic architecture for image processing. Who could have thought that an image-processing algorithm would also give strikingly good results in processing natural language? Yet today CNNs are widely used for natural language processing as well. Creativity and curiosity take time to produce innovation, but when they do, it is well worth the time invested!
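As a small illustration of that cross-over, here is a hedged Keras sketch of a 1D convolution, an image-processing building block, applied to text classification. The vocabulary size, sequence length, and data are placeholders; in practice, the inputs would be padded sequences of token ids, and TensorFlow is assumed to be installed.

```python
# A tiny sketch of a CNN applied to text: a 1D convolution slides over
# word windows instead of image pixels. Data and sizes are placeholders.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, SEQ_LEN = 5000, 100

model = tf.keras.Sequential([
    layers.Embedding(VOCAB_SIZE, 64),                      # token ids -> dense vectors
    layers.Conv1D(128, kernel_size=5, activation="relu"),  # convolve over word windows
    layers.GlobalMaxPooling1D(),                           # keep the strongest signal
    layers.Dense(1, activation="sigmoid"),                 # binary label (e.g. sentiment)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Placeholder data standing in for real padded, tokenized text.
x = np.random.randint(0, VOCAB_SIZE, size=(32, SEQ_LEN))
y = np.random.randint(0, 2, size=(32,))
model.fit(x, y, epochs=1, verbose=0)
model.summary()
```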

One Problem — Multiple Solutions

Here we emphasise having multiple solutions for a single problem. Having multiple ways of solving a given problem requires a creative mind; one should be ready to experiment and challenge the existing methods of solving it. Innovation can only occur when existing methods are challenged rather than just plainly accepted. If everyone had simply accepted earlier beliefs, we might have been stuck with linear regression forever and never had algorithms like SVM and Random Forest. It is this inquisitive nature that actually gave birth to the classic ML algorithms we have with us today.

Examples of Innovations in Data Science in Recent Years

1. Coca-Cola managed to strengthen its data strategy by building a digital-led loyalty program. Coca-Cola’s director of data strategy was interviewed by ADMA’s managing editor, and the interview made it clear that big data analytics is strongly behind customer retention at Coca-Cola.

2. Netflix is a good example of a big brand that uses big data analytics for targeted advertising. With over 100 million subscribers, the company collects huge amounts of data, which is the key to achieving the industry status Netflix boasts. If you are a subscriber, you are familiar with how it sends you suggestions for the next movie you should watch. Basically, this is done using your past search and watch data, which gives Netflix insights into what interests each subscriber most.

3. Amazon leverages big data analytics in its move into large markets. Data-driven logistics gives Amazon the expertise required to create and capture greater value. Focusing on big data analytics, Amazon Whole Foods is able to understand how customers buy groceries and how suppliers interact with the grocer, and these insights inform whatever further changes need to be implemented.

Creative Solutions for Innovation using Data Science

1. Profit model
Crunching numbers can identify untapped potential hidden in the profit margins or pin-point insufficiently used revenue streams. Simulations can also show if specific markets are ready. Data can help you apply the 80/20 principle and focus on your top clients.

2. Network
Data recorded and analyzed by one company can benefit others in numerous ways, especially if the two entities are in complementary businesses. Just imagine how a hotel could boost their bookings by using the weather and delayed flights information collected by a nearby airport during their regular operations.

3. Structure
Algorithms can ingest organizational charts, augmented with information from thousands of companies, and produce models of the best-performing structures. They could offer recipes for the gender and educational composition of a board that maximizes talent. This could replace artificial quota-driven efforts with concrete recommendations, even suggesting possible candidates by scanning professional profiles.

4. Process
Data science consulting company InData Labs states that using analytics in the company’s operations is the best way to handle uncertainty by teaching staff to guide their decisions on results and numbers instead of gut feeling and customs.

5. Product performance
One company which already does this, through their newsfeed automation, is Facebook. They have innovated the way the feed looks for each individual user to boost their revenue from PPC ads. By employing data science in every aspect of the user experience, you can create better products and cut development costs by abandoning bad ideas early on.

How to Encourage Curiosity and Creativity among Data Scientists

1. Give importance to data science in growth planning
Don’t bury it under another department like marketing, product, or finance. Set up an innovation and development wing for research and experimentation purposes that is separate from business deadlines. The data science team will need to collaborate with other departments to provide solutions, but it should do so as an equal partner, not as support staff that merely executes on the requirements of other teams. Instead of positioning data science as a supportive team in service to other departments, make it responsible for business goals.

2. Provide the required infrastructure
Give data scientists full access to data as well as the compute resources to run their explorations. Requiring them to ask permission or request resources imposes a cost, and less exploration will occur.

3. Focus on learning over knowing
The entire company must share common values such as learning by doing, being comfortable with ambiguity, and balancing long- and short-term returns. These values should spread across the entire organisation, as they cannot survive in isolation.

4. Laying importance of extreme experimentation
Put more emphasis on experimentation tasks and on an experimentation mindset, which gives data scientists the confidence to take steps toward something innovative. Experimentation brings you a step closer to innovation, and data science is all about it!

Summary

Creativity in data science can be anything from innovative features for modelling, development of new tools, cool new ways to visualise data, or even the types of data that we use for analysis. What’s interesting in data is that everyone will do things differently, depending on how they think about the problem. When put that way, almost everything we do in data science can be creative if we think outside the box a little bit.

The best way I can think to describe creativity in a candidate or in an approach is when they give you that moment of “wow!” Ideally, as a company or team, you want to have as many moments like this as possible: keep good ideas flowing, prioritize, and execute.

Follow this link, if you are looking to learn more about data science online!

You can follow this link for our Big Data course!

Additionally, if you are interested in learning Data Science, click here to get started

Furthermore, if you want to read more about data science, you can read our blogs here

Also, the following are some suggested blogs you may like to read

Beginner’s Guide for time-series forecasting

Evolution of Chatbots & their Performance

Top takeaways from R Studio conf 2019