Data visualization is an important part of many business strategies because of the growing volume of data and its significance to the business. In this blog, we will look in detail at visualisation in Big Data. We will also look at why visualising big data is a difficult task and which tools are available for it.
What is Data Visualisation?
Data visualization represents data in a systematic form, including the attributes and variables of each unit of information. Visualization-based data discovery methods let business users combine disparate data sources into custom analytical views. Advanced analytics can be integrated into these methods to support the creation of interactive and animated graphics on desktops, laptops, and mobile devices such as tablets and smartphones.
What is Big Data Visualisation?
Big data are high-volume, high-velocity and/or high-variety information sets that require new forms of processing to optimize processes, discover insights and make decisions. Big data poses great challenges for data capture, storage, analysis, sharing, searching and visualization. Visualization can be thought of as the “front end” of big data. There are several common myths about data visualization:
Only good data should be visualized: in fact, a simple and quick view can reveal something wrong with the data, just as it helps to detect interesting patterns.
Visualization always leads to the right decision or action: in fact, visualization is not a substitute for critical thinking.
Visualization brings certainty: in fact, displaying data does not guarantee an accurate picture of what matters, and a visualization can be manipulated to create different impressions.
Visualization methods are used to create tables, diagrams, images and other intuitive ways of representing data. Visualizing big data is not as easy as visualizing traditional small data sets. Extensions of traditional visualization approaches have already emerged, but they are far from sufficient. In large-scale data visualization, many researchers use feature extraction and geometric modeling to greatly reduce the data size before the actual rendering. Choosing the right data representation is also very important when visualizing big data.
Problems in Visualising Big Data
In visual analytics, scalability and dynamics are the two main challenges. Visualizing big data, structured or unstructured, with all its diversity and heterogeneity is a major problem. Speed is the desired factor in big data analysis. Designing a new visualization tool with efficient indexing is not easy for big data. Cloud computing and advanced graphical user interfaces can be combined with big data to better manage the scalability factors that influence visualization decisions.
Visualization systems must cope with unstructured data forms such as graphs, tables, text, trees, and other metadata. Big data often comes in unstructured formats. Because of bandwidth limits and power requirements, visualization should move closer to the data to extract meaningful information efficiently; that is, visualization software should run in situ. Because of the size of the data, visualization needs massive parallelization. The challenge for parallel visualization algorithms is decomposing a problem into independent tasks that can run concurrently.
There are also the following problems for big data visualization:
Visual noise: Most objects in the dataset are too similar to each other, so users cannot separate them as distinct objects on the screen.
Information loss: The visible data set can be reduced, but information may be lost in the process.
Large image perception: Data visualization methods are limited not only by aspect ratio and device resolution but also by the physical limits of human perception.
High rate of image change: Users watching the data cannot react to the number or intensity of changes on the display.
High performance requirements: These are hardly noticeable in static visualization, which places lower demands on rendering speed, but dynamic visualization demands high performance.
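One common way to limit visual noise and rendering cost is to aggregate points into a grid before drawing, at the price of some information loss. The sketch below is only an illustration of that trade-off; the function name `bin_points`, the grid size and the sample points are assumptions made for this example and do not come from any particular tool.

```python
from collections import Counter


def bin_points(points, x_range, y_range, bins=10):
    """Count how many (x, y) points fall into each cell of a bins x bins grid."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    counts = Counter()
    for x, y in points:
        # Scale each coordinate to a cell index and clamp the upper edge.
        col = min(int((x - x_min) / (x_max - x_min) * bins), bins - 1)
        row = min(int((y - y_min) / (y_max - y_min) * bins), bins - 1)
        counts[(row, col)] += 1
    return counts


# 10,000 hypothetical points collapse to at most 100 grid cells.
points = [(i % 100, (i * 7) % 100) for i in range(10_000)]
grid = bin_points(points, (0, 100), (0, 100), bins=10)
```

Each cell count can then be drawn as one colored square instead of thousands of overlapping markers.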
Choice of visualization factors
Audience: The data representation should be adapted to the target audience. If end users are checking their progress in a fitness app, simplicity is key. On the other hand, when data insights are intended for researchers or experienced decision-makers, you can and often should go beyond simple charts.
Content: The type of data determines the approach. For example, metrics that change over time are most likely best shown with line charts. To show the relationship between two variables, use a scatter plot. Bar charts, in turn, are ideal for comparative analysis.
Context: Your charts can be designed in different ways, and will be read differently, depending on the context. For example, to emphasize one figure, such as a large profit increase compared with other years, you might use shades of a single color and pick the brightest one for the most important element of the chart. Contrasting colors, on the other hand, are used to distinguish elements.
Dynamics: Different kinds of data change at different rates. For example, financial results may be measured monthly or yearly, while time series and tracking data change constantly. Depending on the rate of change, you may consider a dynamic (streaming) representation or a static visualization.
Objective: The goal of the visualization also has a major effect on how it is implemented. To carry out a complex analysis of a system, or to combine different types of data for a deeper view, visualizations are built into dashboards with controls and filters. Dashboards are not necessary, however, for showing one or a few occasional data insights.
Visualization Techniques for Big Data
1. Word Clouds
Word clouds work in a simple way: the more often a word appears in a source of text (such as a speech, news article or database), the larger and bolder it appears in the word cloud.
Here is an example from USA Today based on the 2012 State of the Union address by U.S. President Barack Obama:
As you can see, words like “American,” “jobs,” “energy” and “every” stand out since they were used more frequently in the original text.
Now, compare that to the 2014 State of the Union address:
You can easily see the similarities and differences between the two speeches at a glance. “America” and “Americans” are still major words, but “help,” “work,” and “new” are more prominent than in 2012.
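The sizing rule behind a word cloud can be sketched in a few lines: count the words, then scale font size with frequency. This is a minimal illustration rather than how any particular word-cloud tool works; the function name `word_sizes`, the size limits and the sample text are all assumptions.

```python
import re
from collections import Counter


def word_sizes(text, min_size=10, max_size=60, top_n=5):
    """Map the most frequent words to font sizes that grow with their counts."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common(top_n)
    highest = counts[0][1]
    # Linear scaling: the most frequent word gets max_size.
    return {
        word: min_size + (max_size - min_size) * count / highest
        for word, count in counts
    }


# Hypothetical text, not taken from either speech.
sample = "jobs energy jobs america energy jobs"
sizes = word_sizes(sample)
```

A real word cloud would usually also drop common stop words such as "the" and "and" before counting.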
2. Symbol Maps
Symbol maps are simply maps with symbols displayed at a given longitude and latitude. Using the “Marks” card in Tableau, you can quickly build a strong visual that tells users about their location data. You can also use the data to control the shape of the mark on the map, using pie charts or shapes for a different level of detail.
These maps can be as simple or as complex as you need them to be.
3. Line charts
Also known as a line graph, a line chart is a graph of data displayed using a series of lines. Line charts show lines running horizontally across the chart, with the values axis on the left side of the chart. An example of a line chart showing unique Computer Hope visitors can be seen in the image below.
As can be seen in this example, you can easily see the increases and decreases each year over different years.
4. Pie charts
A pie chart is a circular chart divided into wedge-like sectors, each showing a quantity. The whole pie represents 100%, and each slice is a proportional part of the whole.
The size of each portion can be understood at a glance. Pie charts are commonly used to show the proportional breakdown of expenditure, population segments or survey responses across a number of categories.
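The arithmetic behind a pie chart is simply each value's share of the total, expressed as a percentage and as an angle out of 360 degrees. The helper `pie_slices` and the spending figures below are hypothetical, included only to make the proportions concrete.

```python
def pie_slices(values):
    """Return (label, percent, angle_in_degrees) for each category."""
    total = sum(values.values())
    return [
        (label, 100 * value / total, 360 * value / total)
        for label, value in values.items()
    ]


# Hypothetical monthly spending figures.
spending = {"rent": 500, "food": 300, "transport": 200}
slices = pie_slices(spending)
```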
5. Bar Charts
A bar graph is a visual tool that uses bars to compare data among categories. It is also called a bar chart or bar diagram. A bar graph can run horizontally or vertically. The important thing to know is that the longer the bar, the greater its value. Bar graphs have two axes. On a vertical bar graph, the horizontal axis (or x-axis) shows the categories being compared; in this example, they are years. The vertical axis shows the scale. The colored bars are the data series.
Bar charts have three main attributes:
A bar chart makes it easy to compare sets of data between different groups.
The graph shows categories on one axis and a discrete value on the other. The goal is to show the relationship between the two axes.
Bar charts can also show big changes in data over time.
6. Heat Maps
A heat map is a two-dimensional representation of data in which values are shown as colors. A simple heat map provides an immediate visual summary of the data.
There are many ways to display heat maps, but they all share one thing: they use color to communicate relationships between data values that would be much harder to understand if presented numerically in a table.
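The core of a heat map is a mapping from values to colors. The sketch below normalizes a small table and assigns each cell a color between blue (low) and red (high); the function name `to_heat_colors`, the color scale and the sample table are assumptions chosen for illustration, and real tools offer many other color scales.

```python
def to_heat_colors(matrix):
    """Map each value to an (r, g, b) color from blue (low) to red (high)."""
    flat = [v for row in matrix for v in row]
    low, high = min(flat), max(flat)
    span = (high - low) or 1  # avoid division by zero for constant data
    colors = []
    for row in matrix:
        color_row = []
        for v in row:
            t = (v - low) / span  # 0.0 for the minimum, 1.0 for the maximum
            color_row.append((round(255 * t), 0, round(255 * (1 - t))))
        colors.append(color_row)
    return colors


# Hypothetical 2 x 3 table of measurements.
table = [[1, 5, 9], [3, 7, 2]]
heat = to_heat_colors(table)
```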
Visualisation Tools for Big Data
1. Power BI
Power BI is a business analytics solution that lets you visualize and share your data, or embed it in your app or website. You can connect to hundreds of data sources and bring your data to life with live dashboards and reports.
Microsoft Power BI is used to find insights within an organization's data. Power BI can connect to data, transform and clean it into a data model, and create charts or graphs to visualize it. All of this can be shared with other Power BI users within the organization.
Organizations can use the data models created by Power BI in many ways, including telling stories through charts and data visualizations and exploring “what if” scenarios within the data. Power BI reports can also answer questions in real time and help forecast whether departments will meet business metrics.
Power BI can also provide executive dashboards that give managers or administrators insight into how departments are doing.
2. Kibana
Kibana is an open-source data visualization and exploration tool used for log and time series analytics, application monitoring, and operational intelligence use cases. It offers powerful and easy-to-use features such as histograms, line graphs, pie charts, heat maps, and built-in geospatial support. It also integrates tightly with Elasticsearch, a popular analytics and search engine, which makes Kibana the default choice for visualizing data stored in Elasticsearch.
Kibana was designed to work with Elasticsearch to make large and complex data streams understandable more quickly and smoothly through visual representation. Elasticsearch analytics provide both the data and advanced aggregations and mathematical transformations. The application creates flexible, dynamic dashboards with PDF reports on demand or on a schedule. The generated reports can present data as bar, line, scatter plot and pie charts, with customizable colors and highlighted search results. Kibana also includes tools for sharing visualized data.
3. Grafana
Grafana is an open-source metrics analytics and visualization suite. It is most commonly used for visualizing time series data for infrastructure and application analytics, but many use it in other areas including agricultural equipment, home automation, weather, and process control.
Grafana is a tool for displaying time series data. From a large amount of collected data, it can produce a graphical overview of the state of a business or organisation. How is it used in practice? Wikidata, the collaboratively edited knowledge base that increasingly feeds articles in Wikipedia, uses grafana.wikimedia.org to show publicly the edits made by contributors and bots, and the pages or data sheets created and edited over a given period of time:
4. Tableau
Tableau is a powerful and fast-growing data visualization tool used in the business intelligence industry. It helps simplify raw data into an easily understandable format.
Data analysis with Tableau is very fast, and the visualizations take the form of dashboards and worksheets. The output created with Tableau can be understood by professionals at every level of an organisation. It even allows a non-technical user to create a customized dashboard.
The best features of Tableau are:
Collaboration of data
Tableau is popular because it does not require any technical or programming skills to operate. The tool has attracted people from all sectors, including business, research, and various industries.
Visualizations can be static or dynamic. Interactive visualization often leads to discovery and works better than static data tools. Interactive views can help you get an overview of big data. Interactive brushing and linking between visualization methods, together with networks or web-based tools, can facilitate the scientific process. Web-based visualization helps ensure that dynamic data is kept up to date.
Simply extending some conventional visualization methods is not enough to handle big data. More new big data visualization techniques and tools should be developed for different big data applications.
Visualizing data is important because large amounts of complex data are easier to understand through charts and graphs than through documents and reports. It helps decision makers grasp difficult concepts, identify new patterns and get a daily or intra-daily view of their performance. Because of these benefits and the rapid growth of the analytics industry, businesses are increasingly using data visualization: the data visualization market is expected to grow by 9.47% annually, from $4.51 billion in 2017 to $7.76 billion by 2023.
R is a programming language and a software environment for statistical computing and graphics. It offers inbuilt functions and libraries to present data in the form of visualizations. It excels in both basic and advanced visualizations using minimum coding and produces high quality graphs on large datasets.
This article will demonstrate the use of its packages ggplot2 and plotly to create visualizations such as scatter plot, boxplot, histogram, line graphs, 3D plots and Maps.
#install package ggplot2
install.packages("ggplot2")
#load the package
library(ggplot2)
Many datasets ship with R in the ‘datasets’ package; you can run the command data() to list them and pick any of them to work with. Here I have used the dataset named ‘economics’, which is provided by ggplot2 and gives monthly U.S. data on economic variables such as unemployment for the period 1967-2015.
You can view the data using view function-
We’ll make a simple scatter plot to view how unemployment has fluctuated over the years by using plot function-
ggplot() initializes a ggplot object, in which we declare the input data frame and the set of plot aesthetics. We can then add geom components, which act as layers and specify the plot’s features.
We will use geom_point, which creates scatter plots.
With overplotting, several points sit in the same place and we cannot tell from the plot how many there are. In that case, we can use the jitter geom, which adds a small amount of random variation to the location of each point, spreading out points that would otherwise be overplotted.
ggplot(economics, aes(x=date, y=unemploy)) +
geom_jitter() +
labs(title="Number of unemployed people in U.S.A. from 1967 to 2015",
x="Year",y="Number of unemployed people")
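To make the idea of jittering concrete outside ggplot2, here is a minimal Python sketch. It is not how geom_jitter is implemented; the function name `jitter`, the default `amount` and the sample values are assumptions for illustration only.

```python
import random


def jitter(values, amount=0.2, seed=0):
    """Shift each value by a small uniform random offset in [-amount, amount]."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]


# Identical points become visually separable after jittering.
xs = [1, 1, 1, 2, 2]
jittered = jitter(xs)
```

The offsets should stay small relative to the spacing of the data, otherwise the plot starts to misrepresent the values.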
Let’s group the data according to year and view how average unemployment fluctuated through these years.
We will load dplyr package to manipulate our data and lubridate package to work with date column.
Now we will use mutate function to create a column year from the date column given in economics dataset by using the year function of lubridate package. And then we will group the data according to year and summarise it according to average unemployment-
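The same group-then-summarise idea can be sketched outside R. The Python below is only an illustration of the logic, not the dplyr code; the function name `average_by_year` and the sample rows are hypothetical and are not taken from the economics dataset.

```python
from collections import defaultdict
from datetime import date


def average_by_year(rows):
    """Group (date, value) rows by calendar year and average the values."""
    groups = defaultdict(list)
    for day, value in rows:
        groups[day.year].append(value)
    return {year: sum(vals) / len(vals) for year, vals in sorted(groups.items())}


# Made-up monthly figures used only to show the grouping.
rows = [
    (date(1967, 7, 1), 100),
    (date(1967, 8, 1), 200),
    (date(1968, 1, 1), 300),
]
avg_unempl = average_by_year(rows)
```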
Now, let's view the data as a line plot using the line geom of ggplot2
(Since here we want the height of the bar be equal to avg_unempl, so we need to specify stat equal to identity)
Plotting Time Series Data
In this section, I’ll be using a dataset that records the number of tourists who visited India from 2001 to 2015 which I have rearranged such that it has 3 columns, country, year and number of tourists arrived.
To visualize the plot of the number of tourists that visited the countries over the years in the form of line graph, we use geom_line-
For convenience, you can change the background theme as well. Here I am keeping the theme white-
These were some basic functions of ggplot2, for more functions, check out the official guide.
Plotly is deemed to be one of the best data visualization tools in the industry.
Let's construct a simple line graph of two vectors using the plot_ly function, which initiates a visualization in plotly. Since we are creating a line graph, we have to specify type as ‘scatter’ and mode as ‘lines’.
We can modify the map as well. Here we have increased the size of the points and changed its color. We have also added text that is the location of the point which would show the location name when the cursor is placed on it.
These were some of the visualizations from package ggplot2 and plotly. R has various other packages for visualizations like graphics and lattice. Refer to the official documentation of R to know more about these packages.
Data science is one of the hottest topics of the 21st century because we are generating data at a much higher rate than we can actually process. Many business and tech firms are now gaining key advantages by harnessing data science, which is why the field is booming.
In this blog, we will dive deep into the world of machine learning. We will walk you through machine learning basics and look at the process of building an ML model. We will also build a random forest model in Python to make the process easier to understand.
What is Machine Learning?
Machine Learning is the science of getting computers to learn and act like humans do, and improve their learning over time in an autonomous fashion, by feeding them data and information in the form of observations and real-world interactions.
There are many different types of machine learning algorithms, with hundreds published each day, and they’re typically grouped by either learning style (i.e. supervised learning, unsupervised learning, semi-supervised learning) or by similarity in form or function (i.e. classification, regression, decision tree, clustering, deep learning, etc.). Regardless of learning style or function, all combinations of machine learning algorithms consist of the following:
Representation (a set of classifiers or the language that a computer understands)
Evaluation (aka objective/scoring function)
Optimization (search method; often the highest-scoring classifier, for example; there are both off-the-shelf and custom optimization methods used)
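The three components above can be seen in even the smallest learner. The sketch below is a toy illustration under stated assumptions: the representation is a single threshold rule, the evaluation is accuracy, and the optimization is an exhaustive search over candidate thresholds. All names and data are hypothetical.

```python
def predict(threshold, x):
    """Representation: a one-parameter rule, class 1 when x exceeds threshold."""
    return 1 if x > threshold else 0


def accuracy(threshold, data):
    """Evaluation: fraction of (x, label) pairs the rule classifies correctly."""
    return sum(predict(threshold, x) == y for x, y in data) / len(data)


def fit(data, candidates):
    """Optimization: exhaustive search for the highest-scoring threshold."""
    return max(candidates, key=lambda t: accuracy(t, data))


# Hypothetical (feature, label) pairs.
data = [(1, 0), (2, 0), (3, 0), (6, 1), (7, 1), (8, 1)]
best = fit(data, candidates=range(0, 10))
```

Real algorithms differ mainly in how rich the representation is and how cleverly the search is done, but the same three parts are present.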
Steps for Building ML Model
Here is a step-by-step example of how a hospital might use machine learning to improve both patient outcomes and ROI:
1. Define Project Objectives
The first step of the life cycle is to identify an opportunity to tangibly improve operations, increase customer satisfaction, or otherwise create value. In the medical industry, discharged patients sometimes develop conditions that necessitate their return to the hospital. In addition to being dangerous and troublesome for the patient, these readmissions mean the hospital will spend additional time and resources on treating patients for the second time.
2. Acquire and Explore Data
The next step is to collect and prepare all of the relevant data for use in machine learning. This means consulting medical domain experts to determine what data might be relevant in predicting readmission rates, gathering that data from historical patient records, and getting it into a format suitable for analysis, most likely into a flat file format such as a .csv.
3. Model Data
In order to gain insights from your data with machine learning, you have to determine your target variable, the factor of which you are trying to gain a deeper understanding. In this case, the hospital will choose “readmitted,” which is included as a feature in its historical dataset during data collection. Then, they will run machine learning algorithms on the dataset that build models that learn by example from the historical data. Finally, the hospital runs the trained models on data the model hasn’t been trained on to forecast whether new patients are likely to be readmitted, allowing it to make better patient care decisions.
4. Interpret and Communicate
One of the most difficult tasks of machine learning projects is explaining a model’s outcomes to those without any data science background, particularly in highly regulated industries such as healthcare. Traditionally, machine learning has been thought of as a “black box” because of how difficult it is to interpret insights and communicate their value to stakeholders and regulatory bodies alike. The more interpretable your model, the easier it will be to meet regulatory requirements and communicate its value to management and other key stakeholders.
5. Implement, Document, and Maintain
The final step is to implement, document, and maintain the data science project so the hospital can continue to leverage and improve upon its models. Model deployment often poses a problem because of the coding and data science experience it requires, and the time-to-implementation from the beginning of the cycle using traditional data science methods is prohibitively long.
A certain car manufacturing company X is looking to target its customers for their particular car model. Customers are identified by their age, salary, and Gender. The organisation wants to identify or predict which customers will affect the sales of their new car and actually purchase it.
We have a purchased column here which holds two values i.e 0 and 1. 0 indicates that the car has not been purchased by a certain individual. 1 indicates the sale of the car.
Importing the Required Libraries
You need to import all the required libraries first, which will make the model-building steps easier. We are using scikit-learn (sklearn) to build our random forest model and the matplotlib library to plot charts and visualise results. We are also importing a function from sklearn that helps us split our data into training and testing parts.
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
Loading the Dataset
In this step, you need to load your dataset into memory. After that, we separate the dependent and independent variables for training our classifier. In most cases, you need to separate the dependent and the independent variables.
# Importing the dataset
Splitting the Dataset to Form Training and Test Data
In all cases, you need to partition your data. A major chunk of your data acts as the training set and a smaller chunk acts as the test set. There are no clearly defined criteria for the proportions of the training and test sets, but most people follow a 70–30 or 75–25 rule, where the larger chunk is the training set. We train the model on the training set and test it on the test set. This process is known as validation. The main idea is to gauge the performance of the model on data it has never seen before, because in real-world scenarios the model will be predicting values for unseen data. Furthermore, techniques like validation help us detect and avoid overfitting or underfitting.
Overfitting refers to the case when our model has learnt all about the specific data on which it trained. It will work well on the training data but will have poor accuracy for any unseen data point. Overfitting is like your model is very specific to the data it has and has no generality. Similarly, underfitting is the case where your model is very general and is not able to predict well for your specific use-case. To achieve the best model accuracy, you need to strike a perfect balance between overfitting and under-fitting.
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
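To show what a 75–25 split does, here is a standard-library sketch of the idea. It is an illustration only and not scikit-learn's implementation of `train_test_split`; the function name `split_train_test`, the fixed seed and the sample rows are assumptions.

```python
import random


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset of 100 row identifiers, split 75-25.
rows = list(range(100))
train_rows, test_rows = split_train_test(rows, test_fraction=0.25)
```

Shuffling before splitting matters when the rows are ordered, for example by date or by class, because otherwise the test set would not resemble the training set.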
In this step, we fit our model to the training data. We are using the random forest model provided by the sklearn package in Python. We pass the independent features and the dependent variable separately, and the model learns an internal mapping between them.
# Fitting Random Forest Classification to the Training set
from sklearn.ensemble import RandomForestClassifier
In this part, we are passing unseen values to our model on which it is making predictions. We use a confusion matrix to derive metrics like accuracy, precision, and recall for our model. These metrics help us to understand the performance of the model.
# Predicting the Test set results
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
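The metrics mentioned above follow directly from the four cells of a binary confusion matrix. The sketch below computes them by hand on made-up labels; the function name `confusion_counts` and the label lists are hypothetical, and in practice sklearn's `confusion_matrix` would supply the counts.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels 0 and 1."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            fn += 1
    return tn, fp, fn, tp


# Hypothetical labels: 1 = purchased, 0 = not purchased.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)   # share of all predictions that are right
precision = tp / (tp + fp)           # share of predicted buyers who bought
recall = tp / (tp + fn)              # share of actual buyers that were found
```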
Visualising the Predictions
Additionally, we have made an attempt to visualise the predictions of our model using the below code.
Hence, in this Machine Learning Tutorial, we studied the basics of ML. Machine learning grew out of the theory that computers can learn without being programmed to perform specific tasks; researchers interested in artificial intelligence wanted to see whether computers could learn from data. Models learn from previous computations to produce reliable decisions and results. It is a science that is not new, but one that is gaining fresh momentum.
There are a huge number of ML algorithms out there. They can be classified by type of training procedure, by application, by recent advances, and by the standard algorithms that ML scientists use in their daily work. There is a lot to cover, and we shall proceed in the order given in the following listing:
1. Statistical Algorithms
Statistics is necessary for every machine learning expert. Hypothesis testing and confidence intervals are some of the many statistical concepts to know if you are a data scientist. Here, we consider the phenomenon of overfitting. Basically, overfitting occurs when an ML model learns so many features of the training data set that its ability to generalize to the test set suffers. The tradeoff between performance and overfitting is well illustrated by the following figure:
Overfitting – from Wikipedia
Here, the black curve represents the performance of a classifier that has appropriately classified the dataset into two categories. Obviously, training the classifier was stopped at the right time in this instance. The green curve indicates what happens when we allow the training of the classifier to ‘overlearn the features’ in the training set. What happens is that we get an accuracy of 100%, but we lose out on performance on the test set because the test set will have a feature boundary that is usually similar but definitely not the same as the training set. This will result in a high error level when the classifier for the green curve is presented with new data. How can we prevent this?
Cross-Validation is the killer technique used to avoid overfitting. How does it work? A visual representation of the k-fold cross-validation process is given below:
The entire dataset is split into equal subsets (folds), and the model is trained and tested once per fold, with that fold held out for testing and the remaining folds used for training, as shown in the image above. Finally, the results of all the models are averaged. The advantage of this method is that it reduces sampling error, helps detect overfitting, and gives a less biased estimate of performance. There are further variations of cross-validation like non-exhaustive cross-validation and nested k-fold cross validation (shown above). For more on cross-validation, visit the following link.
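The fold bookkeeping in k-fold cross-validation can be sketched without any library. This is a minimal illustration that uses contiguous, unshuffled folds for readability; the function name `k_fold_indices` is an assumption, and real projects would normally shuffle first or use a library routine.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    # The first `extra` folds get one additional sample.
    fold_sizes = [base + 1] * extra + [base] * (k - extra)
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
```

A model would be trained on each `train_idx` and scored on the matching `test_idx`, and the k scores would then be averaged.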
There are many more statistical algorithms that a data scientist has to know. Some examples include the chi-squared test, the Student’s t-test, how to calculate confidence intervals, how to interpret p-values, advanced probability theory, and many more. For more, please visit the excellent article given below:
Classification refers to the process of categorizing data input as a member of a target class. An example could be that we can classify customers into low-income, medium-income, and high-income depending upon their spending activity over a financial year. This knowledge can help us tailor the ads shown to them accurately when they come online and maximises the chance of a conversion or a sale. There are various types of classification like binary classification, multi-class classification, and various other variants. It is perhaps the most well known and most common of all data science algorithm categories. The algorithms that can be used for classification include:
Support Vector Machines
Linear Discriminant Analysis
and many more. A short illustration of a binary classification visualization is given below:
For more information on classification algorithms, refer to the following excellent links:
Regression is similar to classification, and many algorithms used are similar (e.g. random forests). The difference is that while classification categorizes a data point, regression predicts a continuous real-number value. So classification works with classes while regression works with real numbers. And yes – many algorithms can be used for both classification and regression. Hence the presence of logistic regression in both lists. Some of the common algorithms used for regression are
Support Vector Regression
Partial Least-Squares Regression
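To make the contrast with classification concrete, the sketch below fits a straight line by ordinary least squares and returns continuous values rather than class labels. Simple linear regression is used here only because it is the shortest to write out; the function name `fit_line` and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical points lying exactly on y = 2x + 1.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
slope, intercept = fit_line(xs, ys)
```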
For more on regression, I suggest that you visit the following link for an excellent article:
Both articles have a remarkably clear discussion of the statistical theory that you need to know to understand regression and apply it to non-linear problems. They also have source code in Python and R that you can use.
Clustering is a category of unsupervised learning algorithms that divide the data set into groups based on common characteristics or properties. A good example is grouping the instances of a data set into categories automatically, using any of the algorithms listed below. For this reason, clustering is sometimes known as automatic classification. It is also a critical part of exploratory data analysis (EDA). Some of the algorithms commonly used for clustering are:
Hierarchical Clustering – Agglomerative
Hierarchical Clustering – Divisive
K-Nearest Neighbours Clustering
EM (Expectation Maximization) Clustering
Principal Components Analysis Clustering (PCA)
An example of a common clustering problem visualization is given below:
The above visualization clearly contains three clusters.
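As a concrete illustration of how a clustering algorithm finds such groups, here is a minimal one-dimensional k-means sketch. K-means is chosen here for brevity, and the function name `k_means_1d`, the data and the hand-picked starting centers are assumptions made for the example.

```python
def k_means_1d(values, centers, iterations=10):
    """Plain k-means on numbers: assign to nearest center, then recompute means."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical data with three obvious groups.
values = [1, 2, 3, 10, 11, 12, 20, 21, 22]
centers, clusters = k_means_1d(values, centers=[0, 9, 25])
```

No labels are supplied at any point, which is what makes the procedure unsupervised.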
For another excellent article on clustering, refer to the link
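To show how a clustering algorithm groups points without any labels, here is a rough sketch of k-means (Lloyd's algorithm) in plain Python. K-means is not named in the list above; it is used here simply because it is one of the most widely taught clustering methods. The nine sample points and the farthest-point initialisation are assumptions made to keep the example small and deterministic.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_centroids(points, k):
    """Deterministic start: first point, then repeatedly the point farthest from all chosen centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist2(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, iterations=20):
    centroids = init_centroids(points, k)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: dist2(p, centroids[j])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Hypothetical 2-D data with three visually separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0),
          (1.0, 9.0), (2.0, 8.5), (1.5, 10.0)]

labels, centroids = kmeans(points, k=3)
print(labels)
```

The algorithm never sees a target column; the three groups emerge purely from the distances between points.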
Dimensionality Reduction is an extremely important tool that should be completely clear and lucid for any serious data scientist. It is usually carried out through feature selection or feature extraction. This means that the principal variables of the data set, those that have the highest covariance with the output data, are extracted, and the features/variables that are not important are ignored. It is an essential part of EDA (Exploratory Data Analysis) and is used in nearly every moderately or highly difficult problem. The advantages of dimensionality reduction are (from Wikipedia):
It reduces the time and storage space required.
Removal of multi-collinearity improves the interpretation of the parameters of the machine learning model.
It becomes easier to visualize the data when reduced to very low dimensions such as 2D or 3D.
It avoids the curse of dimensionality.
The most commonly used algorithm for dimensionality reduction is Principal Components Analysis or PCA. While this is a linear model, it can be converted to a non-linear model through a kernel trick similar to that used in a Support Vector Machine, in which case the technique is known as Kernel PCA. Thus, the algorithms commonly used include PCA and Kernel PCA.
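As a rough illustration of linear PCA, the sketch below centres a small 2-D data set, builds its covariance matrix, and finds the first principal component with power iteration. The data and the function name pca_first_component are hypothetical; production code would normally use a library routine based on SVD instead.

```python
import math

def pca_first_component(points, iterations=50):
    """Return (direction, explained variance ratio, 1-D scores) for the first principal component."""
    n, dim = len(points), len(points[0])
    means = [sum(p[i] for p in points) / n for i in range(dim)]
    centered = [[p[i] - means[i] for i in range(dim)] for p in points]
    # Sample covariance matrix of the centred data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    # Power iteration; assumes the start vector is not orthogonal to the leading eigenvector.
    v = [1.0] + [0.0] * (dim - 1)
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim))
    total_variance = sum(cov[i][i] for i in range(dim))
    scores = [sum(r[i] * v[i] for i in range(dim)) for r in centered]
    return v, eigenvalue / total_variance, scores

# Hypothetical 2-D data lying close to the line y = x.
data = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.1), (4.0, 3.8), (5.0, 5.0)]

direction, explained, scores = pca_first_component(data)
print(direction, explained)
```

Here the two original columns are replaced by a single score per row, and the explained variance ratio tells you how little information was lost in the process.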
Ensembling means combining multiple ML learners together into one pipeline so that the combination of all the weak learners makes an ML application with higher accuracy than each learner taken separately. Intuitively, this makes sense, since the disadvantages of using one model would be offset by combining it with another model that does not suffer from this disadvantage. There are various algorithms used in ensembling machine learning models. The four common techniques usually employed in practice are:
Simple/Weighted Average/Voting: Simplest one, just takes the vote of models in Classification and average in Regression.
Bagging: We train models (same algorithm) in parallel for random sub-samples of data-set with replacement. Eventually, take an average/vote of obtained results.
Boosting: In this models are trained sequentially, where (n)th model uses the output of (n-1)th model and works on the limitation of the previous model, the process stops when result stops improving.
Stacking: We combine two or more than two models using another machine learning algorithm.
(from Amardeep Chauhan on Medium.com)
In all four cases, the combination of the different models ends up performing better than a single learner. One particular ensembling technique that has done extremely well in data science competitions on Kaggle is the GBRT model or the Gradient Boosted Regression Tree model.
We include the source code from the scikit-learn module for Gradient Boosted Regression Trees since this is one of the most popular ML models which can be used in competitions like Kaggle, HackerRank, and TopCoder.
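The scikit-learn listing is not reproduced here. As a stand-in, the following is a from-scratch sketch of the boosting idea behind GBRT, not scikit-learn's implementation: each new regression stump is fitted to the residuals left by the models before it. The data set, the function names, and the choice of 50 stumps with a learning rate of 0.1 are assumptions made for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single split on a 1-D feature that minimises squared error."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]

def gradient_boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Squared-loss gradient boosting: every stump is trained on the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return {"base": base, "stumps": stumps, "learning_rate": learning_rate}

def predict(model, x):
    total = model["base"]
    for threshold, left_mean, right_mean in model["stumps"]:
        total += model["learning_rate"] * (left_mean if x <= threshold else right_mean)
    return total

def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

# Hypothetical non-linear data: y = x squared.
xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]

model = gradient_boost(xs, ys)
baseline_mse = mse([sum(ys) / len(ys)] * len(ys), ys)  # always predicting the mean
boosted_mse = mse([predict(model, x) for x in xs], ys)
print(baseline_mse, boosted_mse)
```

Each stump on its own is a very weak learner, but because every round corrects what the previous rounds got wrong, the combined model fits the training data far better than the mean-only baseline.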
In the last decade, there has been a renaissance of sorts within the Machine Learning community worldwide. Around 2002, neural network research had struck a dead end, as the networks of layers would get stuck in local minima in the non-linear hyperspace of the energy landscape of a three-layer network. Many thought that neural networks had outlived their usefulness. However, starting with Geoffrey Hinton in 2006, researchers found that adding multiple layers of neurons to a neural network created an energy landscape of such high dimensionality that local minima were statistically shown to be extremely unlikely to occur in practice. Today, in 2019, more than a decade of innovation later, this method of adding additional hidden layers of neurons to a neural network is the classical practice of the field known as deep learning.
Deep Learning has truly taken the computing world by storm and has been applied to nearly every field of computation, with great success. Now with advances in Computer Vision, Image Processing, Reinforcement Learning, and Evolutionary Computation, we have marvellous feats of technology like self-driving cars and self-learning expert systems that perform enormously complex tasks like playing the game of Go (not to be confused with the Go programming language). The main reason these feats are possible is the success of deep learning and reinforcement learning (more on the latter given in the next section below). Some of the important algorithms and applications that data scientists have to be aware of in deep learning are:
Long Short-Term Memory networks (LSTMs) for Natural Language Processing
Recurrent Neural Networks (RNNs) for Speech Recognition
Convolutional Neural Networks (CNNs) for Image Processing
Deep Neural Networks (DNNs) for Image Recognition and Classification
Hybrid Architectures for Recommender Systems
Autoencoders (ANNs) for Bioinformatics, Wearables, and Healthcare
Deep Learning Networks typically have millions of neurons and hundreds of millions of connections between neurons. Training such networks is so computationally intensive that companies are now turning to 1) Cloud Computing Systems and 2) Graphical Processing Unit (GPU) Parallel High-Performance Processing Systems for their computational needs. It is now common to find hundreds of GPUs operating in parallel to train very high-dimensional neural networks for amazing applications like dreaming during sleep, computer artistry, and artistic creativity pleasing to our aesthetic senses.
Artistic Image Created By A Deep Learning Network. From blog.kadenze.com.
For more on Deep Learning, please visit the following links:
In the recent past and the last three years in particular, reinforcement learning has become remarkably famous for a number of achievements in cognition that were earlier thought to be limited to humans. Basically put, reinforcement learning deals with the ability of a computer to teach itself. We have the idea of a reward vs. penalty approach. The computer is given a scenario and ‘rewarded’ with points for correct behaviour and ‘penalties’ are imposed for wrong behaviour. The computer is provided with a problem formulated as a Markov Decision Process, or MDP. Some basic types of Reinforcement Learning algorithms to be aware of are (some extracts from Wikipedia):
Q-Learning is a model-free reinforcement learning algorithm. The goal of Q-learning is to learn a policy, which tells an agent what action to take under what circumstances. It does not require a model (hence the connotation “model-free”) of the environment, and it can handle problems with stochastic transitions and rewards, without requiring adaptations. For any finite Markov decision process (FMDP), Q-learning finds a policy that is optimal in the sense that it maximizes the expected value of the total reward over any and all successive steps, starting from the current state. Q-learning can identify an optimal action-selection policy for any given FMDP, given infinite exploration time and a partly-random policy. “Q” names the function that returns the reward used to provide the reinforcement and can be said to stand for the “quality” of an action taken in a given state.
State–action–reward–state–action (SARSA) is an algorithm for learning a Markov decision process policy. This name simply reflects the fact that the main function for updating the Q-value depends on the current state of the agent “S1”, the action the agent chooses “A1”, the reward “R” the agent gets for choosing this action, the state “S2” that the agent enters after taking that action, and finally the next action “A2” the agent chooses in its new state. The acronym for the quintuple (s_t, a_t, r_t, s_{t+1}, a_{t+1}) is SARSA.
3. Deep Reinforcement Learning
This approach extends reinforcement learning by using a deep neural network and without explicitly designing the state space. The work on learning ATARI games by Google DeepMind increased attention to deep reinforcement learning or end-to-end reinforcement learning. Remarkably, computer agents built by DeepMind have achieved levels of skill higher than humans at playing computer games. Even a complex game like Dota 2 has been won by a deep reinforcement learning system: OpenAI Five, built by OpenAI, defeated the reigning world champion team in 2019.
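To make the Q-learning and SARSA update rules described above more tangible, here is a minimal tabular sketch on a made-up five-state corridor. The environment, the constants GAMMA and ALPHA, and all function names are assumptions for illustration only; the agent explores with a purely random policy, which Q-learning tolerates because it is an off-policy method.

```python
import random

N_STATES = 5           # states 0..4; reaching state 4 ends the episode
ACTIONS = (0, 1)       # 0 = move left, 1 = move right
GAMMA, ALPHA = 0.9, 0.5

def step(state, action):
    """Hypothetical corridor: reward 1.0 only when the goal state is reached."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def q_learning_update(Q, s, a, r, s2, done):
    # Off-policy: bootstrap from the best action available in the next state.
    target = r if done else r + GAMMA * max(Q[s2])
    Q[s][a] += ALPHA * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s2, a2, done):
    # On-policy: bootstrap from the action the agent actually takes next.
    target = r if done else r + GAMMA * Q[s2][a2]
    Q[s][a] += ALPHA * (target - Q[s][a])

def train_q_learning(episodes=500, max_steps=100, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _episode in range(episodes):
        s = 0
        for _t in range(max_steps):
            a = rng.choice(ACTIONS)          # random exploration
            s2, r, done = step(s, a)
            q_learning_update(Q, s, a, r, s2, done)
            if done:
                break
            s = s2
    return Q

Q = train_q_learning()
policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(policy)

# The same transition under both rules: the next state has Q-values [0.0, 2.0],
# and the agent's next action is 0, the worse of the two.
demo_q = [[0.0, 0.0], [0.0, 2.0]]
demo_sarsa = [[0.0, 0.0], [0.0, 2.0]]
q_learning_update(demo_q, 0, 1, 1.0, 1, False)
sarsa_update(demo_sarsa, 0, 1, 1.0, 1, 0, False)
print(demo_q[0][1], demo_sarsa[0][1])
```

The last few lines show the practical difference between the two algorithms: Q-learning assumes the best next action will be taken, while SARSA uses the action the agent really chose, so the two end up with different estimates for the same experience.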
For more information, go through the following links:
If reinforcement learning is cutting-edge data science, AutoML is bleeding-edge data science. AutoML (Automated Machine Learning) is a remarkable open-source project, available on GitHub at the following link, that uses an algorithm and a data analysis approach to construct an end-to-end data science project. It performs data preprocessing, algorithm selection, hyperparameter tuning, cross-validation, and algorithm optimization, so that the ML process is completely automated in the hands of a computer. Amazingly, this means that computers can now handle the ML expertise that was earlier in the hands of a few ML practitioners and AI experts.
AutoML has found its way into Google TensorFlow through AutoKeras, Microsoft CNTK, and Google Cloud Platform, Microsoft Azure, and Amazon Web Services (AWS). Currently it is a premium paid service for even a moderately sized dataset and is free only for tiny datasets. However, one entire process might take one to two or more days to execute completely. But at least, now the computer AI industry has come full circle. We now have computers so complex that they are taking the machine learning process out of the hands of the humans and creating models that are significantly more accurate and faster than the ones created by human beings!
The basic algorithm used by AutoML is Neural Architecture Search and its variants, given below:
Neural Architecture Search (NAS)
PNAS (Progressive NAS)
ENAS (Efficient NAS)
The functioning of AutoML is given by the following diagram:
If you’ve stayed with me till now, congratulations; you have learnt about a lot of cutting-edge technology, and there is much, much more that you should read up on. You could start with the links in this article, and of course, Google is your best friend as a Machine Learning Practitioner. Enjoy machine learning!
So you want to learn data science but you don’t know where to start? Or you are a beginner and you want to learn the basic concepts? Welcome to your new career and your new life! You will discover a lot of things on your journey to becoming a data scientist and being part of a new revolution. I am a firm believer that you can learn data science and become a data scientist regardless of your age, your background, your current knowledge level, your gender, and your current position in life. I believe – from experience – that anyone can learn anything at any stage in their lives. What is required is just determination, persistence, and a tireless commitment to hard work. Nothing else matters as far as learning new things – or learning data science – is concerned. Your commitment, persistence, and your investment in your available daily time is enough.
I hope you understood my statement. Anyone can learn data science with the right motivation. In fact, I believe you can learn anything at any stage in your life if you invest enough time, effort and hard work into it, alongside your current occupation. From my experience, I strongly recommend that you continue your day job and work on data science as a side hustle, because of the hard work that will be involved. Your commitment is more important than your current life situation. Carrying on a full-time job and working on data science part-time is the best way to go if you want to learn in the best possible manner.
Technical Concepts of Data Science
So what are the important concepts of data science that you should know as a beginner? They are, in order of sequential learning, the following:
Statistics & Probability
Data Preparation and Data ETL*
Machine Learning with Python and R
Data Visualization and Summary
*Extraction, Transformation, and Loading
Now if you were to look at the above list and go to a library, you would, most likely, come back with 9-10 books at an average of 1000 pages each. Even if you could speed-read, 10,000 pages is a lot to get through. I could list the best books for each topic in this post, but even the most seasoned reader would balk at 10,000 pages. And who reads books these days? So what I am going to give you is a distilled extract on each of those topics. Keep in mind, however, that every topic given above could be a series of blog posts in its own right; these 80-word paragraphs are just a tiny taste of each topic, and there is an ocean of depth involved in every one. You might ask: if that is the case, how can everybody be a possible candidate for a data scientist role? Two words: Persistence and Motivation. With the right amount of these two characteristics, anyone can be anything they want to be.
1) Python Programming:
Python is one of the most popular programming languages in the world. It is the ABC of data science because Python is the language every beginner starts with in data science. It is used for almost any purpose since it is so amazingly versatile. Python can be used for web applications and websites with Django, microservices with Flask, general programming projects with the standard library and packages from PyPI, and GUIs with PyQt5 or Tkinter. Interoperability is available through Jython (Java), Cython (C), and bridges to nearly every other programming language in use today.
Of course, Python is also the first language used for data science, with the standard stack of scikit-learn (machine learning), pandas (data manipulation), matplotlib and seaborn (visualization) and numpy (vectorized computation). Nowadays, the most common way to get this stack is the Anaconda distribution, available from www.anaconda.com. The current version at the time of writing is 2018.12 (Anaconda Distribution 5). To learn more about Python, I strongly recommend the following books: Head First Python and the Python Cookbook.
2) R Programming
R is the best language for statistical needs since it is a language designed by statisticians, for statisticians. If you know statistics and mathematics well, you will enjoy programming in R. The language gives you excellent support for probability distributions, statistical functions, mathematical functions, plotting, visualization, interoperability, and even machine learning and AI. In fact, nearly everything that you can do in Python can be done in R. R is the second most popular language for data science in the world, second only to Python. R has a rich ecosystem for every data science requirement and is the favorite language of academicians and researchers.
Learning Python is not enough to be a professional data scientist. You need to know R as well. A good book to start with is R For Data Science, available at Amazon at a very reasonable price. Some of the most popular packages in R that you need to know are ggplot2, threejs, DT (tables), networkD3, and leaflet for visualization; dplyr and tidyr for data manipulation; shiny and R Markdown for reporting; parallel, Rcpp and data.table for high performance computing; and caret, glmnet, and randomForest for machine learning.
3) Statistics and Probability
This is the bread and butter of every data scientist. The best programming skills in the world will be useless without knowledge of statistics. You need to master statistics, especially practical knowledge as used in a scientific experimental analysis. There is a lot to cover. Any subtopic given below can be a blog-post in its own right. Some of the more important areas that a data scientist needs to master are:
Succinctly, linear algebra is about vectors, matrices and the operations that can be performed on vectors and matrices. This is a fundamental area for data science since nearly every operation we perform has a linear algebra background; as data scientists, we usually work with collections of vectors or matrices. The topics in Linear Algebra are all covered in the world-famous book Linear Algebra and its Applications by Gilbert Strang, an MIT professor. You can also go to the popular MIT OpenCourseWare page, Linear Algebra (MIT OCW). These two resources cover everything you need to know. Some of the most fundamental concepts that you can also Google or bring up on Wikipedia are:
5) Data Preparation and Data ETL (Extraction, Transformation, and Loading)
Yes – welcome to one of the more infamous sides of data science! If data science has a dark side, this is it. Know for sure that unless your company has some dedicated data engineers who do all the data munging and data wrangling for you, 90% of your time on the job will be spent on working with raw data. Real world data has major problems. Usually, it’s unstructured, in the wrong formats, poorly organized, contains many missing values, contains many invalid values, and contains types that are not suitable for data mining.
Dealing with this problem takes up a lot of the time of a data scientist. And your data scientist’s analysis has the potential to go massively wrong when there is invalid and missing data. Practically speaking, unless you are unusually blessed, you will have to manage your own data, and that means conducting your own ETL (Extraction, Transformation, and Loading). ETL is a data mining and data warehousing term that means loading data from an external data store or data mart into a form suitable for data mining and in a state suitable for data analysis (which usually involves a lot of data preprocessing). Finally, you often have to load data that is too big for your working memory – a problem referred to as external loading. During your data wrangling phase, be sure to look into the following components:
Automating the Data ETL Pipeline
Automation of Data Validation and Verification
Usually, expert data scientists try to automate this process as much as possible, since a human being would be wearied by this task very fast and is remarkably prone to errors, whereas a Python or an R script performs the same operations consistently every time. Be sure to try to automate every stage in your data processing pipeline.
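As a small illustration of what automating extraction, transformation, validation, and loading can look like, here is a sketch that uses only the Python standard library. The CSV content, the column names (customer_id, age, income), and the in-memory list standing in for a warehouse are all hypothetical; a real pipeline would read from and write to your own data stores.

```python
import csv
import io

# Hypothetical raw export with a missing age, an invalid age, and a missing income.
RAW = """customer_id,age,income
1,34,52000
2,,61000
3,abc,48000
4,29,
"""

def extract(text):
    """Extraction: parse the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def to_number(value):
    """Validation helper: return a float, or None for missing or invalid input."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

def transform(rows):
    """Transformation: coerce types and separate clean rows from rejected ones."""
    clean, rejected = [], []
    for row in rows:
        age, income = to_number(row["age"]), to_number(row["income"])
        if age is None or income is None:
            rejected.append(row)
        else:
            clean.append({"customer_id": int(row["customer_id"]),
                          "age": age, "income": income})
    return clean, rejected

def load(rows, store):
    """Loading: append the clean rows to the target store (a plain list here)."""
    store.extend(rows)
    return len(rows)

warehouse = []
clean, rejected = transform(extract(RAW))
loaded = load(clean, warehouse)
print(loaded, len(rejected))
```

Keeping the rejected rows, rather than silently dropping them, makes it possible to report how much of the raw data failed validation and why.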
6) Machine Learning with Python and R
An expert machine learning scientist has to be proficient in the following areas at the very least:
Data Science Topics Listing – Thomas
Now if you are just starting out in Machine Learning (ML), Python, and R, you will gain a sense of how huge the field is, and the entire set of lists above might seem more like advanced Greek than plain English. But not to worry; there are ways to streamline your learning and to spend as little time as possible in becoming able to learn nearly every topic given above. After you learn the basics of Python and R, you need to go on to start building machine learning models. From experience, I suggest you divide your time into 50% Python and 50% R, and work for as long as possible in one language before switching to the other. What do I mean? Spend maximum time learning one programming language at a time. That will help prevent syntax errors, conceptual errors, and language confusion problems.
Now, on the job, in real life, it is much more likely that you will work in a team and be responsible for only one part of the work. However, if you are working in a startup or learning initially, you will end up doing every phase of the work yourself. Be sure to give yourself time to process information and to spend sufficient time for your brain to rest and get a handle on the topics you are trying to learn. For more info, do check out the Learning How to Learn MOOC on Coursera, which is the best way to learn mathematical or scientific topics without ending up with burnout. In fact, I would recommend this approach to every programmer out there trying to learn a programming language, or anything considered difficult, like Quantum Mechanics and Quantum Computation or String Theory, or even Microsoft F# or Microsoft C# for a non-Java programmer.
Common tools that you have with which you can produce powerful visualizations include:
Google Data Studio
Microsoft Power BI Desktop
Some involve coding, some are drag-and-drop, some are difficult for beginners, some have no coding at all. All of these tools will help you with data visualization. But one of the most overlooked but critical practical functions of a data scientist has been included under this heading: summarisation.
Summarisation means the practical result of your data science workflow. What does the result of your analysis mean for the operation of the business or the research problem that you are currently working on? How do you convert your result to the maximum improvement for your business? Can you measure the impact this result will have on the profit of your enterprise? If so, how? Being able to come out of a data science workflow with this result is one of the most important capacities of a data scientist. And most of the time, efficient summarisation = excellent knowledge of statistics. Please know for sure that statistics is the start and the end of every data science workflow. And you cannot afford to be ignorant about it. Refer to the section on statistics or google the term for extra sources of information.
How Can I Learn Everything Above In the Shortest Possible Time?
You might wonder: how can I learn everything given above? Is there a course or a pathway to learn every single concept described in this article in one shot? It turns out there is. There is a dream course for a data scientist that contains nearly everything talked about in this article.
Want to Become a Data Scientist? Welcome to Dimensionless Technologies! It just so happens that the course Data Science using Python and R, a ten-week course that includes ML, Python and R programming, Statistics, Github Account Project Guidance, and Job Placement, offers nearly every component spoken about above, and more besides. You don’t need to buy the books or do any courses other than this one to learn the topics in this article. Everything is covered by this single course, tailor-made to convert you to a data scientist within the shortest possible time. For more, I’d like to refer you to the following link:
Does this seem too good to be true? Perhaps, because this is a paid course. With a scholarship concession, you could end up paying around INR 40,000 for this ten-week course. You can register for the first two weeks for INR 5,000 and pay the remainder after the two-week trial period, once you have seen whether this course really suits you. If it doesn’t, you can always drop out after two weeks and be poorer by just 5k. But in most cases, this course has been found to carry genuine worth. And nothing worthwhile was achieved without some payment, right?
Data science is a rapidly growing career path in the 21st century. Leaders across all industries, fields, and governments are putting their best minds to the task of harnessing the power of data.
As organizations seek to derive greater insights and to present their findings with greater clarity, the premium placed on high-quality data visualization will only continue to increase.
What are Data Visualisation tools?
Data visualization is a general term that describes an effort to help people understand the significance of data by placing it in a visual context. Patterns, trends and correlations that might go undetected in text-based data can be exposed and recognized more easily with data visualization software.
Furthermore, today’s data visualization tools go beyond the standard charts and graphs used in Microsoft Excel spreadsheets, displaying data in more sophisticated ways such as infographics, dials and gauges, geographic maps, sparklines, heat maps, and detailed bar, pie and fever charts.
What is Tableau?
Tableau is a powerful and fast-growing data visualization tool used in the Business Intelligence industry. It helps in simplifying raw data into an easily understandable format.
Also, data analysis is very fast with Tableau, and the visualizations are created in the form of dashboards and worksheets. The visualizations created using Tableau can be understood by professionals at any level in an organization. Furthermore, it even allows a non-technical user to create a customized dashboard.
1. Tableau makes interacting with your data easy
Tableau is a very effective tool to create interactive data visualizations very quickly. It is very simple and user-friendly. Tableau can create complex graphs giving a similar feel as the pivot table graphs in Excel. Moreover, it can handle a lot more data and quickly provide calculations on datasets.
2. Extensive Data analytics
Tableau allows the user to plot varied graphs which can help in making detailed data visualisations. There are 21 different types of graph which users can mix and match to produce appealing and informative visualisations. From heat maps, pie charts and bar charts to bubble graphs, Gantt charts and bullet graphs, Tableau has far more visualisations to offer than most other data visualisation tools out there.
3. Easy Data discovery
Tableau is capable of handling large datasets really well. Handling large datasets is one problem where tools like MS Excel and even R Shiny struggle to generate visualisation dashboards. The ability to handle such large chunks of data empowers Tableau to generate insights out of them. This, in turn, allows users to find patterns and trends in their data. Furthermore, Tableau can be connected to multiple data sources, be they different cloud providers, databases or data warehouses.
4. No Coding
The one great thing about Tableau is that you do not need to code at all to generate powerful and meaningful visualisations. It is all a game of selecting a chart and dragging and dropping! Being user-friendly, it allows the user to focus more on visualisations and storytelling than on the coding aspects around them.
5. Huge Community
Tableau boasts a large user community which helps resolve the doubts and problems users face while using Tableau. Such community support helps users find answers to their queries and issues. One does not need to worry about a shortage of learning material either.
6. Proved to have satisfied customers
Tableau users are genuinely happy with the product. For example, the yearly Gartner research on Business Intelligence and Analytics Platforms, based on user feedback, indicates Tableau’s success and ability to deliver a genuinely user-friendly solution for customers. We have noticed the same enthusiasm and positive feedback about Tableau among our customers.
7. Mobile support
Tableau provides mobile support for dashboards. So you need not confine yourself to desktops and laptops, but can work with visualisations on the go using Tableau.
Tableau in Fortune 500 companies
LinkedIn has over 460 million users. The business analytics team supporting LinkedIn’s sales force uses Tableau heavily to process petabytes of customer data, and about 90% of LinkedIn’s sales force accesses Tableau Server on a weekly basis. Furthermore, sales analytics can measure performance and gauge churn using Tableau dashboards. A more proactive sales cycle, in turn, results in higher revenue. Michael Li, Senior Director of Business Analytics at LinkedIn, believes that LinkedIn’s analytics portal is the go-to destination for salespeople to get exactly the information that their clients require.
Cisco uses Tableau software to work with 14,000 items to evaluate product demand variability, match distribution centres with customers, depict the flow of goods through the supply chain network, and assess location and spend within the supply chain. Tableau helps balance a sophisticated network that runs from suppliers to the end customer, which helps manage inventory and reduce the order-to-ship cycle. Cisco also uses Tableau Server to distribute content smoothly. It helps to create the right message, streamline the data, and scale data usage.
Deloitte uses Tableau to help customers implement a self-reliant, agile, data-driven culture that can garner high business value from enterprise data. Deloitte offers enterprises stronger signal detection abilities and real-time interactive dashboards that allow its clients to assess huge, complex datasets with high efficiency and greater ease of use. Furthermore, more than 5000 Deloitte employees are trained in Tableau and are successfully delivering high-end projects.
Walmart considers it a good move to have shifted from Excel sheets to rich, vivid visualizations that can be modified in real time and shared easily. Furthermore, they found that people respond better when there is more creativity: the presentation turns out better, and executives receive it better. Tableau is used, rather than a datasheet, to convey the data story more effectively. They have also built dashboards that are accessible to the entire organization. Over 5000 systems at Walmart have Tableau Desktop, and the company is doing great with this BI tool.
After reading this list we hope you are ready to conquer the world of data with Tableau. To help you to just do it, we offer data science courses including Tableau. Also, you can view the course here.