Data Science is a study which deals with the identification, representation, and extraction of meaningful information from data, which can be collected from different sources and put to business use.
With an enormous amount of data generated every minute, extracting useful insights is a must for businesses; it helps them stand out from the crowd. Data engineers set up the data storage in order to facilitate data mining and data munging activities. Every organization is chasing profits, but the companies that formulate effective strategies based on insights always win the game in the long run.
In this blog, we will be discussing new advancements and trends in the data science industry and how these advancements are enabling it to tackle some of the trickiest problems across various businesses.
Top 5 Trends
Analytics and associated data technologies have emerged as core business disruptors in the digital age. As companies began the shift from being data-generating to data-powered organizations in 2017, data and analytics became the centre of gravity for many enterprises. In 2018, these technologies need to start delivering value. Here are the approaches, roles, and concerns that will drive data analytics strategies in the year ahead.
The Data Science Trends for 2018 are largely a continuation of some of the biggest trends of 2017, including Big Data, Artificial Intelligence (AI), and Machine Learning (ML), along with some newer technologies like Blockchain, Serverless Computing, Augmented Reality, and others that employ various practices and techniques within the Data Science industry.
If I am to pick the top 5 data science trends right now (a very subjective pick, but one I will try my best to justify), I will list them as:
Artificial Intelligence
Cloud Services
AR/VR Systems
IoT Platforms
Big Data
Let us understand each of them in a bit more detail!
Artificial Intelligence
Artificial intelligence (AI) is not new. It has been around for decades. However, due to greater processing speeds and access to vast amounts of rich data, AI is beginning to take root in our everyday lives.
From natural language generation and voice or image recognition to predictive analytics, machine learning, and driverless cars, AI systems have applications in many areas. These technologies are critical to bringing about innovation, providing new business opportunities and reshaping the way companies operate.
Artificial Intelligence is itself a very broad area to explore and study, but there are some components within it that are making quite a buzz with their applications across business lines. Let us have a look at them one by one.
Natural Language Processing
With advances in computational power and the integration of artificial intelligence, the natural language processing domain has evolved into a whirlwind of innovation. In fact, experts expect the NLP market to swell to an impressive $22.3 billion by 2025. One of the many applications of NLP in business is chatbots. Chatbots demonstrate utility in the customer service realm. These automated helpers can take care of simple frequently asked questions and other lookup tasks. This leaves customer service agents free to devote time to troubleshooting bigger matters that personalize and enhance the customer experience. Chatbots can save valuable time and energy for all members of the value stream. Chatbot technology is poised for considerable growth as speech and language processing tools become more robust by expanding beyond rules-based engines to include neural conversational models.
Deep Learning
You might think that Deep Learning sounds a lot like Artificial Intelligence, and that’s true to a point. Artificial Intelligence is a machine developed with the capability for intelligent thinking, whereas Deep Learning is an approach to Machine Learning that uses Artificial Neural Networks to work with the data. Today, there are more Deep Learning business applications than ever; in some cases it is the core offering of the product, as with self-driving cars. Over the past few years, it has been found powering some of the world’s most impressive tech, everything from entertainment media to autonomous vehicles. Applications of deep learning in business include recommender systems, self-driving cars, image detection, and object classification.
Reinforcement Learning
The reinforcement learning model involves interaction between two elements: the environment and the learning agent. The learning agent leverages two mechanisms, namely exploration and exploitation. When the learning agent acts by trial and error, it is termed exploration, and when it acts based on the knowledge gained from the environment, it is referred to as exploitation. The environment rewards the agent for correct actions, which is the reinforcement signal. Leveraging the rewards obtained, the agent improves its knowledge of the environment to select the next action. Artificial agents are now being created to perform tasks as a human would. These agents have made their presence felt in businesses, and the use of agents driven by reinforcement learning cuts across industries. Practical applications of reinforcement learning include robots on the factory floor, space management in warehouses, dynamic pricing agents, and driving financial investment decisions.
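To make the exploration-exploitation trade-off concrete, here is a toy sketch of an epsilon-greedy agent on a multi-armed bandit; the payout probabilities and the epsilon value are purely illustrative.

import random

# Toy bandit environment: each arm pays 1 with a hidden probability
true_payout_probs = [0.2, 0.5, 0.8]
estimates = [0.0] * len(true_payout_probs)   # the agent's knowledge of the environment
counts = [0] * len(true_payout_probs)
epsilon = 0.1                                # fraction of the time the agent explores

for step in range(1000):
    if random.random() < epsilon:
        arm = random.randrange(len(estimates))   # exploration: trial and error
    else:
        arm = estimates.index(max(estimates))    # exploitation: use gained knowledge
    reward = 1 if random.random() < true_payout_probs[arm] else 0  # reinforcement signal
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]      # improve the estimate

print("Estimated payout per arm:", [round(e, 2) for e in estimates])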
Cloud Services
The complexity in data science is increasing by the day. This complexity is driven by fundamental factors like increased data generation, low-cost storage, and cheap computational power. So, in summary, we are generating far more data, we can store it at a low cost and can run computations and simulations on this data at a low cost!
To tackle the increasing complexities in data science, here is why we need cloud services:
Need to run scalable data science
Cost
The larger ecosystem for machine learning system deployments
Use for building quick prototypes
In the field of cloud services, 3 major players lead the pack: AWS (Amazon), Azure (Microsoft), and GCP (Google).
Augmented Reality/Virtual Reality Systems
Immersive experiences built on augmented reality (AR) and virtual reality (VR) are already changing the world around us, and human-machine interaction will improve as research breakthroughs in AR and VR come about. This is a claim made in a Gartner report, Augmented Analytics is the Future of Data and Analytics, published in July 2017. Augmented analytics automates data insights through machine learning and natural language processing, enabling analysts to find patterns and prepare smart data that can be easily shared and operationalized. Accessible augmented analytics produces citizen data scientists and makes an organization more agile.
IoT Platforms
The Internet of Things refers to a network of objects, each of which has a unique IP address and can connect to the internet. These objects can be people, animals, or day-to-day devices like your refrigerator or your coffee machine. They can connect to the internet (and to each other) and communicate in ways that have not been thought of before. The data from current IoT pilot rollouts (sensors, smart meters, etc.) will be used to make smart decisions using predictive analytics, e.g., forecasting electricity usage from each smart meter to better plan distribution, forecasting the power output of each wind turbine in a wind farm, or scheduling predictive maintenance of machines.
The power of Big Data
Big data is a term to refer to data sets that are too large or complex for traditional data-processing application software to adequately deal with.
It was a significant trend in data science in 2017, but there have been some recent advancements in Big Data which have made it a trend in 2018 too. Let us have a look at some of them.
Blockchain
Data science is a central part of virtually everything, from business administration to running local and national governments. At its core, the subject aims at harvesting and managing data so organizations can run smoothly. For some time now, data scientists have struggled to share, secure, and authenticate data integrity. Thanks to bitcoin being so heavily hyped, blockchain, the technology that underpins it, caught the attentive eyes of data specialists. Blockchain improves data integrity, provides easy and trusted means of data sharing, and enables real-time analysis and data traceability. With robust security and transparent record keeping, blockchain is set to help data scientists achieve many milestones that were previously considered impossible. Although decentralized digital ledgers are still a nascent technology, the preliminary results from companies experimenting with them, like IBM and Walmart, suggest that they work.
Handling Datastreams
Stream processing is a Big Data technology that enables users to query continuous data streams and detect conditions quickly, within a small window from the time the data is received; the detection window may vary from a few milliseconds to minutes. For example, with stream processing you can receive an alert by querying a data stream coming from a temperature sensor and detecting when the temperature has reached the freezing point. These immense capabilities have kept streaming data a running trend in Big Data to date.
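As a toy sketch of that sensor example (a real deployment would consume from a streaming platform such as Kafka rather than a hard-coded list of readings):

FREEZING_POINT = 0.0

def temperature_stream():
    # Stand-in for a live sensor feed; the readings here are illustrative
    for reading in [12.5, 7.0, 3.2, 0.0, -1.4, 2.1]:
        yield reading

# Continuously query the stream and alert the moment the condition is detected
for temperature in temperature_stream():
    if temperature <= FREEZING_POINT:
        print("ALERT: temperature reached %.1f C (freezing)" % temperature)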
Apache Spark
Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that allow data workers to efficiently execute streaming, machine learning, or SQL workloads requiring fast iterative access to datasets. With Spark running on Apache Hadoop YARN, developers everywhere can create applications to exploit Spark’s power, derive insights, and enrich their data science workloads within a single, shared dataset in Hadoop. With Apache releasing new features to the Spark library from time to time (Spark Streaming, GraphX, etc.), Spark has been able to maintain its hold as a trend in Big Data to date.
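A minimal PySpark sketch of such a workload, assuming pyspark is installed and a hypothetical messages.csv file with a label column is available:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-demo").getOrCreate()

# Load the data into Spark's distributed, in-memory DataFrame abstraction
df = spark.read.csv("messages.csv", header=True, inferSchema=True)

# A SQL-style aggregation that Spark executes in parallel across the cluster
df.groupBy("label").agg(F.count("*").alias("n_messages")).show()

spark.stop()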
Conclusion
This is only the beginning, as data science continues to serve as the catalyst in the changes you are going to experience in business and technology. It is now up to you on how to efficiently adapt to these changes and help your own business flourish.
A data science project requires numerous iterations that are time-consuming. When dealing with numbers and data interpretations, it goes without saying that you have to be quite smart and proactive.
It’s not surprising that iterations can be frustrating if they require regular updates. Sometimes a six-month-old model needs current information; other times you miss out on some data, so the analysis has to be done all over again. In this article, we will focus on ways in which Data Scientists and Business Analysts can increase productivity without spending so much time on unhealthy iterations.
Tips and Tricks for data scientists
Keeping the bigger picture in mind
Long-term goals should be considered a priority when doing the analysis. Many small issues may crop up, but they shouldn’t overshadow the bigger ones. Be observant in deciding which problems are going to affect the organization on a larger scale, focus on those bigger problems, and look for stable solutions. Data scientists and business analysts have to be visionary to manifest solutions.
Understanding the problem and keeping the requirements at hand
Data science is not about implementing a fancy or complex algorithm or doing some complex data aggregation; it is about providing a solution to the problem at hand. All the tools, like ML, visualization, or optimization algorithms, are just means through which one can arrive at a suitable solution. Always understand the problem you are trying to solve: one should not jump directly to machine learning or statistics right after getting the data. First analyze what the data is about and what you need to know and do to arrive at a solution. It is also important to always keep an eye on the feasibility of the solution in terms of implementation; a good solution is always one which is easily implementable. Always know what you need in order to achieve a solution to the problem.
More real-world oriented approach
Data science involves providing solutions to real-world use cases, hence one should always keep a real-world-oriented approach. Focus on the domain or business use case of the problem at hand and the solution to be implemented, rather than looking at it purely from the technical side. The technical aspect focuses on the correctness of the solution, while the business aspect focuses on its implementation and usage. Sometimes you may not need a complex, incomprehensible algorithm to meet your requirements; you may be happier with a simple algorithm which gives slightly less accurate results but whose accuracy can be traded for comprehensibility. Knowledge of the technical aspect is a must, but it has to be balanced against the business aspect.
Not everything is ML
Recently, machine learning has seen great advancement in its application to various business problems. With great prediction capabilities, machine learning can solve many complex problems in business scenarios. But one should note that data science is not only about machine learning; machine learning is just a small part of it. Data science is more about arriving at a feasible solution for a given problem, so one should focus on areas like data cleaning, data visualization, and the ability to extensively explore the data and find relations between its various attributes. It is about the ability to crunch out the meaningful numbers which matter the most. A good data scientist focuses on all of the above qualities rather than just trying to fit machine learning algorithms onto problem statements.
Programming Languages
It is important to have a grip on at least one programming language widely used in data science, while also knowing a little of another: either know R very well and some Python, or Python very well and some R.
Data cleaning and EDA
Exploratory Data Analysis is one of the important steps in the data analysis process. Here, the focus is on making sense of the data in hand: things like formulating the right questions to ask of your dataset and working out how to manipulate the data sources to get the required answers. This is done by taking an elaborate look at trends, patterns, and outliers using visual methods. Say you are cleaning data for language processing tasks; there, simple models might give you the best results. Cleaning is one of the most complex processes in data science, since almost every dataset available or extracted for language processing tasks is unstructured, and it is a fact that highly processed and neatly structured data will yield better results than noisy data. We should try to perform the cleaning task with simple regular expressions rather than reaching for complex tools.
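A minimal sketch of such regex-based cleaning (the patterns are illustrative and would be adapted to the data at hand):

import re

def clean_text(text):
    # Basic cleaning with plain regular expressions
    text = text.lower()                       # normalise case
    text = re.sub(r"<[^>]+>", " ", text)      # strip HTML tags
    text = re.sub(r"http\S+", " ", text)      # strip URLs
    text = re.sub(r"[^a-z\s]", " ", text)     # keep letters only
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

print(clean_text("Visit <b>http://spam.example</b> NOW!!! Win $1000"))
# -> "visit now win"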
Always open to learning more and more
“Data Science is a journey, not a destination.” This line gives us an insight into how huge the data science domain is and why constant learning is as important as building intelligent models. Practitioners who keep themselves updated with the new tech being developed every day are able to implement solutions and solve business problems faster. With all the resources available on the internet, like MOOCs, one can easily stay up to date. Showcasing your skills on your blog or GitHub is also an important hack which most of us are unaware of: it benefits both the community and your own learning. As the saying goes, “The man who is too old to learn was probably always too old to learn.”
Evaluating Models and avoiding overfit
Separate the data into two sets, the training set and the testing set, to get a stronger prediction of an outcome. Cross-validation is the most convenient method to analyze numerical data without overfitting: it examines the out-of-sample fit.
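A minimal sketch of this split-then-cross-validate routine with scikit-learn, on an illustrative synthetic dataset:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, random_state=42)   # illustrative data

# Hold out a test set, then cross-validate only on the training portion
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X_train, y_train, cv=5)   # 5-fold out-of-sample fit
print("CV accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))

model.fit(X_train, y_train)
print("Held-out test accuracy: %.3f" % model.score(X_test, y_test))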
Converting findings into actions
Again, this might sound like a simple tip, but you see both beginners and advanced people falter on it. Beginners perform steps in Excel, which includes copy-pasting data; for advanced users, any work done through a command line interface may not be reproducible. Similarly, you need to be extra cautious while working with notebooks: control your urge to go back and change a previous step which uses a dataset that is computed later in the flow. Notebooks are very powerful for maintaining a flow, but if we do not maintain that flow, they can become very messy as well.
Taking Rest
When do I work best? It’s when I give myself a 2-3 hour window to work on a problem or project. You can’t multi-task as a data scientist: you need to focus on a single problem at a time to make sure you get the best out of yourself. 2-3 hour chunks work best for me, but you can decide what works for you.
Conclusion
Data science requires continuous learning and is more of a journey than a destination. One always keeps learning more about data science, so keep the above tips and tricks in your arsenal to boost your own productivity and deliver more value on complex problems, which can often be solved with simple solutions! Stay tuned for more articles on data science.
Building spam detection classifier using Machine learning and Neural Networks
Introduction
On our path to building an SMS SPAM classifier, we have so far converted our text data into a numeric form with the help of a bag-of-words model. Using the TF-IDF approach, we now have numeric vectors that describe our text data.
In this blog, we will be building a classifier that helps us identify whether an incoming message is spam or not, using both machine learning and a neural network approach. If you are jumping directly to this blog, then I recommend you go through part 1 and part 2 of the SPAM classifier series first. The data used can be found here.
Assessing the problem
Before jumping to machine learning, we need to identify what we actually wish to do! We need to build a binary classifier which will look at a text message and tell us whether that message is spam or not, so we need to pick machine learning models that can perform a classification task. Also note that this is a binary classification problem, as we have only two output classes into which texts will be classified by our model (0: the message is not spam, 1: the message is spam).
We will build 3 machine learning classifiers, namely SVM, KNN, and Naive Bayes! We will implement each of them one by one and, in the end, have a look at the performance of each.
Building an SVM classifier (Support Vector Machine)
A Support Vector Machine (SVM) is a discriminative classifier which separates classes by forming hyperplanes. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. In two dimensional space, this hyperplane is a line dividing a plane into two parts wherein each class lay in either side.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Turn each message into a TF-IDF weighted term vector
vectoriser = TfidfVectorizer(decode_error="ignore")
X = vectoriser.fit_transform(list(training_dataset["comment"]))
y = training_dataset["b_labels"]

# Hold out 30% of the messages as a never-seen test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
So far, we have trained our model on the training dataset and evaluated it on the test set (data which our model has never seen). We have also performed cross-validation on the classifier to make sure our trained model is free from bias and variance issues!
Our SVM model with the linear kernel achieves a mean cross-validated accuracy of 97.61% with a standard deviation of 0.85. Cross-validation is important for tuning the parameters of the model; in this case, we try the different kernels available with SVM and find the best-working kernel in terms of accuracy. We have reserved a separate test set to measure how well the tuned model works on never-before-seen data points.
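A sketch of such a kernel comparison with scikit-learn, continuing from the train/test split above (the exact numbers will differ from run to run):

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Compare SVM kernels by cross-validated accuracy on the training split
for kernel in ["linear", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel)
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    print("%-8s mean=%.4f std=%.4f" % (kernel, scores.mean(), scores.std()))

# Refit the winning kernel and evaluate once on the held-out test set
best = SVC(kernel="linear").fit(X_train, y_train)
print("Test accuracy: %.4f" % best.score(X_test, y_test))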
Building a KNN classifier (K- nearest neighbor)
K-Nearest Neighbors (KNN) is one of the simplest algorithms we use in machine learning for regression and classification problems. KNN classifies new data points based on similarity measures (e.g., a distance function): classification is done by a majority vote among a point’s neighbors, and the data point is assigned to the class most common among its nearest neighbors. As you increase the number of nearest neighbors, the value of k, the accuracy might change, so k has to be tuned.
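A short sketch of fitting KNN on the same TF-IDF split, trying a few illustrative values of k:

from sklearn.neighbors import KNeighborsClassifier

# Tune the neighbour count k and keep the best-scoring value
for k in [1, 3, 5, 7]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    print("k=%d accuracy=%.4f" % (k, knn.score(X_test, y_test)))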
Building a Naive Bayes classifier
Naive Bayes classifiers rely on Bayes’ Theorem, which is based on conditional probability: in simple terms, the likelihood that an event (A) will happen given that another event (B) has already happened. Essentially, the theorem allows a hypothesis to be updated each time new evidence is introduced. The equation below expresses Bayes’ Theorem in the language of probability:
P(A | B) = P(B | A) * P(A) / P(B)
Let’s explain what each of these terms means.
“P” is the symbol to denote probability.
P(A | B) = The probability of event A (hypothesis) occurring given that B (evidence) has occurred.
P(B | A) = The probability of event B (evidence) occurring given that A (hypothesis) has occurred.
P(A) = The probability of event A (hypothesis) occurring.
P(B) = The probability of event B (evidence) occurring.
Below is the code snippet for multinomial Naive Bayes classifier
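A minimal version using scikit-learn, assuming the same TF-IDF features and train/test split as before:

from sklearn.naive_bayes import MultinomialNB

# Multinomial Naive Bayes works directly on the non-negative TF-IDF features
nb = MultinomialNB()
nb.fit(X_train, y_train)
print("Naive Bayes test accuracy: %.4f" % nb.score(X_test, y_test))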
So far, we have implemented 3 classification algorithms for finding SPAM messages:
SVM (Support Vector Machine)
KNN (K nearest neighbor)
Multinomial Naive Bayes
SVM, with the highest accuracy (97%), looks like the most promising model for identifying SPAM messages. Anyone could say this just by looking at the accuracy, right? But that may not actually be the case: for classification problems, accuracy may not be the only metric you want to look at. Feeling confused? I am sure you are, so allow me to introduce you to our friend the Confusion Matrix, which will eventually sort all your confusion out.
Confusion Matrix
A confusion matrix, also known as an error matrix, is a table which we use to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm and easy identification of confusion between classes, e.g., one class commonly being mislabeled as the other; most performance measures are computed from it.
A confusion matrix is a summary of prediction results on a classification problem: the numbers of correct and incorrect predictions are summarized with count values and broken down by class. This is the key to the confusion matrix. It shows the ways in which your classification model is confused when it makes predictions, giving us insight not only into the errors being made by a classifier but, more importantly, into the types of errors being made.
A sample confusion matrix for 2 classes
Definition of the Terms:
• Positive (P): Observation is positive (for example: is a SPAM).
• Negative (N): Observation is not positive (for example: is not a SPAM).
• True Positive (TP): Observation is positive, and the model predicted positive.
• False Negative (FN): Observation is positive, but the model predicted negative.
• True Negative (TN): Observation is negative, and the model predicted negative.
• False Positive (FP): Observation is negative, but the model predicted positive.
Let us bring in two other metrics apart from accuracy which will help us take a better look at our 3 models.
Recall:
Recall is the number of correctly classified positive examples divided by the total number of actual positive examples: Recall = TP / (TP + FN). High recall indicates the class is correctly recognized (a small number of FN).
Precision:
To get the value of precision, we divide the total number of correctly classified positive examples by the total number of predicted positive examples: Precision = TP / (TP + FP). High precision indicates that an example labelled as positive is indeed positive (a small number of FP).
Let us have a look at the confusion matrix of our SVM classifier and try to understand it. We will then summarise the confusion matrices of all 3 classifiers.
Given below is the confusion matrix of the predictions which our SVM model made on the test data. Let us find the accuracy, precision, and recall in this case.
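A sketch of computing these numbers with scikit-learn, assuming the fitted SVM from the kernel-comparison sketch above (the variable best) and that label 1 marks spam:

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

y_pred = best.predict(X_test)

print(confusion_matrix(y_test, y_pred))   # rows: actual class, columns: predicted class
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))   # spam (1) is the positive class
print("Recall   :", recall_score(y_test, y_pred))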
In machine learning, performance measurement is an essential task. So when it comes to a classification problem, we can count on the AUC-ROC curve, one of the most important evaluation metrics for checking any classification model’s performance. It is also written as AUROC (Area Under the Receiver Operating Characteristics).
The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability: it tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s; by analogy, the higher the AUC, the better the model is at distinguishing between patients with the disease and those without it.
We plot the ROC curve with TPR against FPR, where TPR is on the y-axis and FPR is on the x-axis.
Let us have a look at the ROC curve of our SVM classifier
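A sketch of plotting that curve, again assuming the fitted SVM best from above:

import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

# SVC exposes real-valued decision scores, which roc_curve sweeps over thresholds
scores = best.decision_function(X_test)
fpr, tpr, thresholds = roc_curve(y_test, scores)

plt.plot(fpr, tpr, label="SVM (AUC = %.3f)" % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()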
Always remember that the closer the AUC (area under the curve) is to 1, the better the classification ability of the classifier. Let us also have a look at the ROC curves of our KNN and Naive Bayes classifiers!
The graph on the left is for KNN and the one on the right is for the Naive Bayes classifier. This clearly indicates that the Naive Bayes classifier, in this case, is much more effective than our KNN classifier, as it has a higher AUC value!
Conclusion
In this series, we went from understanding NLP from scratch to building our own SPAM classifier over text data. This is an ideal way to start learning NLP, as it covers the basics of NLP, word embeddings and numeric representations of text data, and modeling over those numeric representations. You can also try neural networks for NLP, as they are able to achieve good performance! Stay tuned for more on NLP in the coming blogs.
Artificial Intelligence has grown at a rapid pace over the last decade, and you have seen it all unfold before your eyes. From self-driving cars to Google Brain, artificial intelligence has been at the centre of these amazing, huge-impact projects.
Artificial Intelligence (AI) made headlines recently when people started reporting that Alexa was laughing unexpectedly. Those news reports led to the usual jokes about computers taking over the world, but there’s nothing funny about considering AI as a career field. Just the fact that five out of six Americans use AI services in one form or another every day proves that this is a viable career option.
Why AI?
Well, there can be many reasons for students selecting this as their career track, or for professionals changing their career track towards AI. Let us have a look at some of the points on why AI!
Interesting and Exciting
AI offers applications in domains which are challenging as well as exciting: driverless cars, human behaviour prediction, and chatbots are just a few examples to begin with.
High Demand and Value
Lately, there has been huge demand in the industry for data scientists and AI specialists, which has resulted in more jobs and a higher value placed on them in the workplace.
Well Paid
With high demand and loads of work to be done, this field is one of the best-paid career choices currently. In an era when jobs are shrinking and the market is saturating, AI has emerged as one of the most well-paid professions.
If you still have thoughts on why one should choose AI as their career then my answer will be as clear as the thought that “If you do not want AI to take your job, you have to take up AI”!
Level 0: Setting up the ground
Only if a heavy dose of maths does not intimidate you, and you furthermore love to code, should you start looking at AI as your career. If you do enjoy optimizing algorithms and playing with maths, or are passionate about it, kudos! Level 0 is cleared and you are ready to start a career in AI.
Level 1: Treading into AI
At this stage, one should cover the basics first, and when I say basics, it does not mean getting to know 4-5 concepts, but quite a lot of them!
Cover Linear Algebra, Statistics, and Probability
Math is the first and foremost thing you need to cover. Start from the basics of math, covering vectors, matrices, and their transformations. Then proceed to understand dimensionality, statistics, and different statistical tests like the z-test, chi-square test, etc. After this, you should focus on the concepts of probability, like Bayes’ Theorem. Maths is the foundation for understanding and building those complex AI algorithms which are making our lives simpler!
Select a programming language
After learning and becoming proficient in the basic maths, you need to select a programming language. I would suggest that you take up one, or at most two, programming languages and understand them in depth. One can select from R, Python, or even Java! Always remember, a programming language is just there to make your life simpler and is not something which defines you. You can start with Python because it is abstract and provides a lot of libraries to work with; R is also evolving very fast, so you can consider that too, or else go with Java (only if you have a good CS background!).
Understand data structures
Try to understand data structures, i.e., how you can design a system for solving problems involving data. This will help you design systems which are accurate and optimized, and AI is all about reaching an accurate and optimized result. Learn about stacks, linked lists, dictionaries, and the other data structures that your selected programming language has to offer.
Understand Regression in complete detail
Well, this is one piece of advice you will get from everyone. Regression is the basic application of the maths you have learned so far, and it shows how this knowledge can be used to make predictions in real-life applications. Having a strong grasp of regression will help you greatly in understanding the basics of machine learning and will prepare you well for your AI career.
Move on to understand different Machine Learning models and their working
After learning regression, one should get their hands dirty with other legacy machine learning algorithms like Decision Trees, SVM, KNN, Random Forests, etc., implementing them on different problems from day-to-day life. One should know the working math behind every algorithm. This may initially be a little tough, but once you get going, everything will fall into place. Aim to be a master of AI, not just any random practitioner!
Understand the problems that machine learning solves
You should understand the use cases of different machine learning algorithms and focus on why a certain algorithm fits one case better than another. Only then will you be able to appreciate the mathematical concepts which make an algorithm more suitable to a particular business need or use case. Machine learning is itself divided into 3 broad categories: Supervised Learning, Unsupervised Learning, and Reinforcement Learning. One needs to be better than average in all 3 before actually stepping into the world of Deep Learning!
Level 2: Moving deeper into AI
This is level 2 of your journey (or struggle) to become an AI specialist. At this level, we move into Deep Learning, but only once you have mastered legacy machine learning!
Understanding Neural Networks
A neural network is a type of machine learning model inspired by the human brain: an artificial neural network allows the computer to learn by incorporating new data through an algorithm. At this stage, you need to start your deep learning by understanding neural networks in great detail: how these networks learn and make decisions. Neural nets are the backbone of AI and you need to learn them thoroughly!
Unrolling the maths behind neural networks
Neural networks are typically organized in layers. Layers are made up of a number of interconnected ‘nodes’ which contain an ‘activation function’. Patterns are presented to the network via the ‘input layer’, which communicates with one or more ‘hidden layers’ where the actual processing is done via a system of weighted ‘connections’. The hidden layers then link to an ‘output layer’ where the answer is produced. You need to learn the maths which happens in the backend: weights, activation functions, loss reduction, backpropagation, the gradient descent approach, etc. These are some of the basic mathematical keywords used in neural networks, and having a strong knowledge of them will enable you to design your own networks. You will also actually understand where and how a neural network borrows its intelligence! It’s all maths, mate... all maths!
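To see those keywords in action, here is a toy sketch of a one-hidden-layer network trained on XOR with plain NumPy; the architecture, loss, and learning rate are illustrative.

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)        # the XOR toy problem

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))     # hidden-layer weights
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))     # output-layer weights
sigmoid = lambda z: 1 / (1 + np.exp(-z))               # activation function
lr = 1.0                                               # learning rate

for epoch in range(5000):
    # forward pass: input layer -> hidden layer -> output layer
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backpropagation: apply the chain rule layer by layer (squared-error loss)
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # gradient descent: nudge every weight against its gradient
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(2).ravel())   # should approach [0, 1, 1, 0]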
Mastering different types of neural networks
Just as in ML we learned regression first and then moved on to the other algorithms, the same is the case here. Since you have learned all about basic neural networks, you are ready to explore the different types of neural networks suited to different use cases. The underlying maths may remain the same; the differences lie in small modifications here and there and in the pre-processing of the data. Different types of neural nets include Multilayer Perceptrons, Recurrent Neural Nets, Convolutional Neural Nets, LSTMs, etc.
Understanding AI in different domains like NLP and Intelligent Systems
With knowledge of different neural networks, you are now better equipped to master their application to different problems in business. You may need to build a driverless car module, a human-like chatbot, or even an intelligent system which interacts with its surroundings and self-learns to carry out tasks. Different use cases require different approaches and different knowledge. Surely you cannot master every field in AI, as it is a very large field indeed, so I suggest you pick a single field, say Natural Language Processing, and work on gaining depth in that field. Only once your knowledge has good depth should you think of expanding across different domains.
Getting familiar with the basics of Big Data
Although acquiring knowledge of Big Data is not mandatory, I suggest you equip yourself with its basics, because all your AI systems will be handling Big Data, and those basics will help you build more optimized and realistic algorithms.
Level 3: Mastering AI
This is the final stage, where you have to go all guns blazing; at this point you need to learn less and apply more of whatever you have learned till now!
Mastering Optimisation Techniques
Levels 1 and 2 focus on achieving accuracy in your work, but now we have to talk about optimizing it. Deep learning algorithms consume a lot of system resources, and you need to optimize every part of them. Optimization algorithms help us minimize (or maximize) an objective function E(x), often called an error function, which is simply a mathematical function of the model’s internal learnable parameters. The internal parameters of a model play a very important role in training it efficiently and effectively and producing accurate results. This is why we use various optimization strategies and algorithms to update and calculate the appropriate, optimal values of those parameters which influence the model’s learning process and output.
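A minimal sketch of the idea, minimizing a toy objective E(x) = (x - 3)^2 by gradient descent; the function and learning rate are illustrative.

# Objective E(x) = (x - 3)^2 with gradient dE/dx = 2 * (x - 3)
def grad(x):
    return 2 * (x - 3)

x = 10.0              # the internal learnable parameter, badly initialised
learning_rate = 0.1
for step in range(100):
    x -= learning_rate * grad(x)   # update rule: step against the gradient

print(round(x, 4))    # converges towards the minimiser x = 3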
Taking part in competitions
You should actually take part in hackathons and data science competitions on Kaggle, as they will enhance your knowledge and give you more opportunities to implement it.
Publishing and Reading lot of Research Papers
Research, implement, innovate, test. Keep repeating this cycle by reading a lot of research papers related to AI. This will help you understand how to be not just a practitioner, but someone who strives to be an innovator. AI is still nascent and needs masters who can innovate and bring revolution to this field.
Tweaking maths to roll out your own algorithms
Innovation needs a lot of research and knowledge. This is the final place you want to be: actually fiddling with the maths which powers this entire field of AI. Once you are able to master this art, you will be one step away from bringing a revolution!
Conclusion
Mastering AI is not something one can achieve in a short time; it requires hard work, persistence, consistency, patience, and a lot of knowledge indeed! It may be one of the hottest jobs in the industry currently. Being a practitioner or enthusiast in AI is not difficult, but if you are looking at being a master of it, you have to be as good as those who created it! It takes years of skill-building to master anything, and the same is the case with AI. If you are motivated, nothing in this entire world can stop you. (Not even an AI :P)
In the last blog, we had a look at visualizing text data and understood some basic concepts of tokenization and lemmatization. We wrote Python functions to perform all the operations for us. If you are jumping directly to this blog, I highly recommend you go through the previous blog post, in which we discussed the problem statement and some foundational concepts of NLP.
We will be covering the following topics
Understanding Tf-IDF
Finding Important words using Tf-IDF
Understanding Bag of Words
Understanding Word Embedding
Different Types of word embeddings
Difference between word embeddings and Bag of words model
Preparing a word embedding for SPAM classifier
Introduction
Previously, we found the most frequently occurring words, bigrams, and trigrams in the messages, separately for spam and non-spam messages. Now we also need to find important words that can by themselves indicate whether a message is spam or not. Note that the most frequent word in a set of messages may not be a keyword that determines what the entire sentence is about.
For example, in a business article, words like business, investment, and acquisition are important words that may relate a sentence to the article’s topic. Other words like money, good, or building may be frequent in the messages, but they do not provide much relevant information.
To find the important words, we will be using the method known as Term Frequency-Inverse Document Frequency (TF-IDF)
What is TF-IDF?
Tf-idf stands for term frequency-inverse document frequency, and the tf-idf weight is a weight often used in information retrieval and text mining.
TF means Term Frequency. It measures how frequently a term occurs in a document. Since every document is different in length, a term may appear many more times in long documents than in short ones; thus, the term frequency is often divided by the document length as a way of normalization.
TF = (Number of times term w appears in a document) / (Total number of terms in the document)
The second part, IDF, stands for Inverse Document Frequency. It measures how important a term is. While computing TF, all terms are treated as equally important; however, certain terms, such as “is”, “of”, and “that”, may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones.
IDF = log_e(Total number of documents / Number of documents with term w in it)
We calculate the final tf-idf score by multiplying the TF score by the IDF score for every word; we can then filter out the important words by selecting those with a higher tf-idf score.
Code Implementation
An example to calculate Tf-idf score for different words
# Three short sentences to score
Sentences = ["Ironman movie is really good. Ironman is always my favourite",
             "Titanic movie is very boring",
             "Thor movie is really good"]

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

tfidf = TfidfVectorizer()
features = tfidf.fit_transform(Sentences)

# One row per sentence, one column per vocabulary word
pd.DataFrame(features.todense(), columns=tfidf.get_feature_names_out())
Finding Important words using Tf-IDF
Now we need to find the most important words in both spam and non-spam messages and then have a look at those words in the form of a word cloud. We will analyse those words, which will help us relate why a particular message has been marked as spam and another as non-spam.
First, we import the necessary libraries. Then we write a function that returns the TF-IDF score for every word in the corpus, together with the dictionary that maps term ids back to words.
from gensim import corpora
from gensim import models

def get_tfidf_matrix(documents):
    # Tokenise each document and build a dictionary over the whole corpus
    documents = [my_tokeniser(document) for document in documents]
    dictionary = corpora.Dictionary(documents)
    # Convert each document to bag-of-words counts, then to tf-idf weights
    corpus = [dictionary.doc2bow(text) for text in documents]
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    return dictionary, corpus_tfidf
Then we need to map all the scores to the words in the corpus in order to find the most important words
def get_tfidf_score_dataframe(sentiment_label):
    phrases = training_dataset[training_dataset["Sentiment"] == sentiment_label]["Phrase"]
    dictionary, frames = get_tfidf_matrix(phrases)
    all_score = []
    all_words = []
    for frame in frames:
        # each entry of a document's frame is a (term_id, score) pair
        for term_id, score in frame:
            all_words.append(dictionary[term_id])
            all_score.append(score)
    tf_idf_frame = pd.DataFrame({
        'Words': all_words,
        'Score': all_score
    })
    return tf_idf_frame
Finally, we plot all the important words in the form of a word cloud
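A sketch of one way to do that with the wordcloud package, assuming the spam class is labelled 1 in the Sentiment column:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Aggregate tf-idf scores per word and feed them to the word cloud as frequencies
spam_scores = get_tfidf_score_dataframe(sentiment_label=1)
frequencies = spam_scores.groupby("Words")["Score"].sum().to_dict()

cloud = WordCloud(background_color="white").generate_from_frequencies(frequencies)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()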
Understanding Bag of Words
We need a way to represent text data for the machine learning algorithm, and the bag-of-words model helps us achieve that task. The bag-of-words model is simple to understand and implement: it is a way of extracting features from text for use in machine learning algorithms.
A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:
A vocabulary of known words.
A measure of the presence of known words.
The vocabulary can be obtained by tokenising the messages into different unique tokens. After getting each token, we need to score it. This can be done in the following ways:
Counts. Count the number of times each word appears in a document.
Frequencies. Calculate the frequency that each word appears in a document out of all the words in the document.
TF-IDF : TF score * IDF score
How BoW works
Forming the vector
Take for example 2 text samples: “The quick brown fox jumps over the lazy dog” and “Never jump over the lazy dog quickly”.
Vectors are then formed to represent the count of each word. In this case, with an 11-word vocabulary, each text (i.e., each sentence) generates an 11-element vector like so:
[1,1,1,0,1,1,0,1,1,0,2]
[0,1,0,1,0,1,1,1,0,1,1]
Each element represents the number of occurrences of the corresponding vocabulary word in the text sample. So, in the first sentence, there is 1 count for “brown”, 1 count for “dog”, 1 count for “fox” and so on (the first vector), whereas the second vector shows 0 counts for “brown”, 1 count for “dog”, 0 counts for “fox”, and so forth.
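A short scikit-learn sketch that reproduces exactly these two vectors:

from sklearn.feature_extraction.text import CountVectorizer

samples = [
    "The quick brown fox jumps over the lazy dog",
    "Never jump over the lazy dog quickly",
]

vectoriser = CountVectorizer()
counts = vectoriser.fit_transform(samples)
print(vectoriser.get_feature_names_out())   # the 11-word vocabulary, alphabetically
print(counts.toarray())                     # one count vector per sentence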
Understanding Word Vectors
Word vectors are simply vectors of numbers that represent the meaning of a word.
Traditional approaches to NLP, such as one-hot encodings, do not capture syntactic (structure) and semantic (meaning) relationships across collections of words and, therefore, represent language in a very naive way.
Word vectors represent words as multidimensional continuous floating point numbers where semantically similar words are mapped to proximate points in geometric space. In simpler terms, a word vector is a row of real-valued numbers (as opposed to dummy numbers) where each point captures a dimension of the word’s meaning and where semantically similar words have similar vectors. This means that words such as wheel and engine should have similar word vectors to the word car (because of the similarity of their meanings), whereas the word banana should be quite distant.
A simple representation of word vectors
Now we will look at an example of using word vectors where we will group words of similar semantics together
import numpy as np
import pandas as pd
import spacy
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load a spaCy model; a model shipping with word vectors is required here
nlp = spacy.load("en")
sentence = "Tiger was driving a car when he saw a fox taking the charge on a bike but in the end giraffe won the race using his aircraft"
tokens = nlp(sentence)

# Keep the words and their vectors together so the labels stay aligned
words = [token.text for token in tokens if token.has_vector]
vectors = np.vstack([token.vector for token in tokens if token.has_vector])

# Project the high-dimensional word vectors down to 2 components for plotting
pca = PCA(n_components=2)
vecs_transformed = pca.fit_transform(vectors)

d = pd.DataFrame(vecs_transformed, columns=["V1", "V2"])
d["Name"] = words

plt.figure(figsize=(16, 10), facecolor=None)
plt.scatter(d["V1"], d["V2"])
for i, txt in enumerate(d["Name"]):
    plt.annotate(txt, (d["V1"][i], d["V2"][i]))
plt.show()
Preparing a bag of words model for Analysis
Below is the code snippet for converting our messages into a table of numerical word vectors. Only after achieving this can we build our classifier using machine learning, since machine learning always needs numerical inputs!
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit TF-IDF over all the messages; X holds the features and y the labels
vectoriser = TfidfVectorizer(decode_error="ignore")
X = vectoriser.fit_transform(list(training_dataset["comment"]))
y = training_dataset["b_labels"]

## Output: inspecting X (e.g. in a notebook cell) shows
## <5572x8672 sparse matrix of type '<class 'numpy.float64'>'
## with 73916 stored elements in Compressed Sparse Row format>
Conclusion and Further steps
Till now we have learnt to perform EDA over text data, and we have learnt about important terms in NLP like tokenization, lemmatization, stop-words, tf-idf, the bag of words, and word vectors. These terms are essential to mastering NLP. With our word embedding ready, we will proceed to actually build machine learning models that will help us predict whether a message is spam or not. In the next blog, we will build machine learning and neural network models and compare their performance. We will understand the shortcomings of the plain neural net in the case of text mining and, finally, move to recurrent neural networks and LSTMs to wrap up the series!