Top 5 Trends in Data Science

Introduction

Data Science deals with the identification, representation, and extraction of meaningful information from data. This data can be collected from different sources and used for business purposes.

With an enormous amount of data being generated every minute, extracting useful insights has become a must for businesses; it helps them stand out from the crowd. Data engineers set up the data storage to facilitate data mining and data munging activities. Every organization is chasing profits, but the companies that formulate effective strategies based on insights always win the game in the long run.

In this blog, we will discuss new advancements and trends in the data science industry. In turn, these advancements are enabling the field to tackle some of the trickiest problems across various businesses.

Top 5 Trends

Analytics and associated data technologies have emerged as core business disruptors in the digital age. As companies began the shift from being data-generating to data-powered organizations in 2017, data and analytics became the centre of gravity for many enterprises. In 2018, these technologies need to start delivering value. Here are the approaches, roles, and concerns that will drive data analytics strategies in the year ahead.

The Data Science Trends for 2018 are largely a continuation of some of the biggest trends of 2017, including Big Data, Artificial Intelligence (AI), and Machine Learning (ML), along with some newer technologies like Blockchain, Serverless Computing, Augmented Reality, and others that employ various practices and techniques within the Data Science industry.


If I am to pick the top 5 data science trends right now (a subjective choice, but I will try to justify it), I would list them as:

  1. Artificial Intelligence
  2. Cloud Services
  3. AR/VR Systems
  4. IoT Platforms
  5. Big Data

Let us understand each of them in a bit more detail!

Artificial Intelligence

Artificial intelligence (AI) is not new. It has been around for decades. However, due to greater processing speeds and access to vast amounts of rich data, AI is beginning to take root in our everyday lives.

From natural language generation and voice or image recognition to predictive analytics, machine learning, and driverless cars, AI systems have applications in many areas. These technologies are critical to bringing about innovation, providing new business opportunities and reshaping the way companies operate.


Artificial Intelligence is itself a very broad area to explore and study, but some components within it are creating quite a buzz with their applications across business lines. Let us have a look at them one by one.

  1. Natural language Processing

    With advances in computational power and the integration of artificial intelligence, the natural language processing domain has evolved into a whirlwind of innovation. In fact, experts expect the NLP market to swell to an impressive $22.3 billion by 2025. One of the many applications of NLP in business is chatbots. Chatbots demonstrate utility in the customer service realm. These automated helpers can take care of simple frequently asked questions and other lookup tasks. This leaves customer service agents free to devote time to troubleshooting bigger matters that personalize and enhance the customer experience. Chatbots can save valuable time and energy for all members of the value stream. Chatbot technology is poised for considerable growth as speech and language processing tools become more robust by expanding beyond rules-based engines to include neural conversational models.

  2. Deep Learning

    You might think that Deep Learning sounds a lot like Artificial Intelligence, and that’s true to a point. Artificial Intelligence is a machine developed with the capability for intelligent thinking, whereas Deep Learning is an approach to Machine Learning which uses Artificial Neural Networks to learn from data. Today, there are more Deep Learning business applications than ever. In some cases, it is the core offering of the product, as with self-driving cars. Over the past few years, it has been found powering some of the world’s most impressive tech: everything from entertainment media to autonomous vehicles. Applications of deep learning in business include recommender systems, self-driving cars, image detection, and object classification.

  3. Reinforcement Learning

    The reinforcement learning model describes the interaction between two elements: the environment and the learning agent. The learning agent leverages two mechanisms, exploration and exploitation. When the agent acts by trial and error, it is termed exploration, and when it acts based on the knowledge gained from the environment, it is referred to as exploitation. The environment rewards the agent for correct actions, which is the reinforcement signal. Leveraging the rewards obtained, the agent improves its knowledge of the environment to select the next action. Artificial agents are now being created to perform tasks like a human; they have made their presence felt in businesses, and their use cuts across industries. Practical applications of reinforcement learning include factory robots, space management in warehouses, dynamic pricing agents, and financial investment decisions. A toy sketch of the exploration/exploitation loop follows below.
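To make the exploration/exploitation idea concrete, here is a minimal, made-up sketch of an epsilon-greedy agent choosing between two actions. The environment, reward values, and parameters are all hypothetical and chosen only for illustration.

import random

# Hypothetical two-action environment: action 0 pays ~1.0 on average, action 1 pays ~2.0
def reward(action):
    return random.gauss(1.0 if action == 0 else 2.0, 0.1)

q_values = [0.0, 0.0]   # the agent's current estimate of each action's value
counts = [0, 0]
epsilon = 0.1           # fraction of steps spent exploring

for step in range(1000):
    if random.random() < epsilon:
        action = random.randrange(2)               # exploration: try something at random
    else:
        action = q_values.index(max(q_values))     # exploitation: use knowledge gained so far
    r = reward(action)                             # reinforcement signal from the environment
    counts[action] += 1
    q_values[action] += (r - q_values[action]) / counts[action]   # incremental average update

print(q_values)   # the estimates should approach [1.0, 2.0]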

Cloud Services

The complexity in data science is increasing by the day. This complexity is driven by fundamental factors like increased data generation, low-cost storage, and cheap computational power. So, in summary, we are generating far more data, we can store it at a low cost and can run computations and simulations on this data at a low cost!

To tackle this increasing complexity, here is why we need cloud services:

  1. Need to run scalable data science
  2. Cost
  3. The larger ecosystem for machine learning system deployments
  4. Use for building quick prototypes

In the field of cloud services, three major players lead the pack: AWS (Amazon), Azure (Microsoft), and GCP (Google).

Augmented Reality/Virtual Reality Systems

The immersive experiences offered by augmented reality (AR) and virtual reality (VR) are already changing the world around us, and human-machine interaction will improve as research breakthroughs in AR and VR come about. This is a claim made in the Gartner report "Augmented Analytics is the Future of Data and Analytics", published in July 2017. Augmented analytics automates data insights through machine learning and natural language processing, enabling analysts to find patterns and prepare smart data that can be easily shared and operationalized. Accessible augmented analytics produces citizen data scientists and makes an organization more agile.

IoT Platforms

The Internet of Things refers to a network of objects, each of which has a unique IP address and can connect to the internet. These objects can be people, animals, or everyday devices like your refrigerator or your coffee machine. They can connect to the internet (and to each other) and communicate in ways that were not possible before. The data from current IoT pilot rollouts (sensors, smart meters, etc.) will be used to make smart decisions using predictive analytics, for example, forecasting electricity usage from each smart meter to better plan distribution, forecasting the power output of each wind turbine in a wind farm, or predictive maintenance of machines.
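As a toy illustration of predictive analytics on smart-meter data, the sketch below fits a simple trend to made-up hourly consumption readings and forecasts the next day. Real IoT pipelines would, of course, use richer models and real sensor feeds; everything here is synthetic.

import numpy as np
from sklearn.linear_model import LinearRegression

# Two days of hypothetical hourly readings (kWh) from one smart meter,
# generated here only for illustration
hours = np.arange(48).reshape(-1, 1)
usage = 5 + 0.1 * hours.ravel() + np.random.normal(0, 0.5, size=48)

# Fit a simple trend model and forecast the next 24 hours
model = LinearRegression().fit(hours, usage)
next_day = np.arange(48, 72).reshape(-1, 1)
print(model.predict(next_day))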

The power of Big Data 

Big data is a term to refer to data sets that are too large or complex for traditional data-processing application software to adequately deal with.


It was already a significant trend in data science in 2017, but recent advancements in Big Data have kept it a trend in 2018 too. Let us have a look at some of them.

  1. Block Chain

    Data science is a central part of virtually everything, from business administration to running local and national governments. At its core, the field aims at harvesting and managing data so organizations can run smoothly. For some time now, data scientists have struggled to share, secure, and authenticate data integrity. Thanks to the hype around bitcoin, blockchain, the technology that underpins it, has caught the attention of data specialists. Blockchain improves data integrity, provides easy and trusted means of sharing data, and enables real-time analysis and data traceability. With robust security and transparent record keeping, blockchain is set to help data scientists achieve many milestones that were previously considered impossible. Although decentralized digital ledgers are still a nascent technology, the preliminary results from companies experimenting with them, like IBM and Walmart, show that they work.

  2. Handling Datastreams

    Stream processing is a Big Data technology that enables users to query continuous data streams and detect conditions quickly, within a small window of time after receiving the data. The detection window may vary from a few milliseconds to minutes. For example, with stream processing you can receive an alert by querying a data stream coming from a temperature sensor and detecting when the temperature has reached the freezing point. The immense capabilities of streaming data make it a continuing trend in Big Data to date.

  3. Apache Spark

    Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that allow data workers to efficiently execute streaming, machine learning, or SQL workloads requiring fast iterative access to datasets. With Spark running on Apache Hadoop YARN, developers everywhere can create applications to exploit Spark’s power, derive insights, and enrich their data science workloads within a single, shared dataset in Hadoop. With Apache releasing new features to the Spark library over time (Spark Streaming, GraphX etc.), it has been able to maintain its hold as a trend in Big Data to date. A minimal PySpark sketch follows below.
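To give a flavour of the API, here is a minimal PySpark word-count sketch. It assumes pyspark is installed and a local Spark runtime is available; "messages.txt" is a hypothetical input file used only for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count").getOrCreate()

# read the text file and count word occurrences with the classic map/reduce pattern
lines = spark.read.text("messages.txt").rdd.map(lambda row: row[0])
word_counts = (lines.flatMap(lambda line: line.split())
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))

print(word_counts.take(10))   # first few (word, count) pairs
spark.stop()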

Conclusion

This is only the beginning, as data science continues to serve as the catalyst in the changes you are going to experience in business and technology. It is now up to you on how to efficiently adapt to these changes and help your own business flourish.

Stay tuned for more blogs!

Top Data Science Hacks

A data science project requires numerous iterations that are time-consuming. When dealing with numbers and data interpretations, it goes without question that you have to be quite smart and proactive.

It’s not surprising that iterations can be frustrating if they require regular updates. Sometimes a model is six months old and needs current information; other times you miss out on some data, so the analysis has to be done all over again. In this article, we will focus on ways in which Data Scientists and Business Analysts can increase productivity without spending so much time on unhealthy iterations.

Tips and Tricks for data scientists

Keeping the bigger picture in mind

Long-term goals should be considered a priority when doing the analysis. Many small issues may come up, but they should not overshadow the bigger ones. Be observant in deciding which problems are going to affect the organization on a larger scale, focus on those bigger problems, and look for stable solutions. Data Scientists and Business Analysts have to be visionary to manifest solutions.

Understanding the problem and keeping the requirements at hand

Data science is not about implementing a fancy or complex algorithm or doing some complex data aggregation; it is about providing a solution to the problem at hand. Tools like ML, visualization, or optimization algorithms are just means through which one can arrive at a suitable solution. Always understand the problem you are trying to solve. One should not jump directly to machine learning or statistics right after getting the data; first analyze what the data is about and what you need to know and do to arrive at a solution. Also, always keep an eye on the feasibility of the solution in terms of implementation. A good solution is one that is easily implementable. Always know what you need in order to achieve a solution to the problem.

More real-world oriented approach

Data science involves providing solutions to real-world use cases, hence one should always keep a real-world oriented approach. Always focus on the domain or business use case of the problem at hand and the solution to be implemented, rather than looking at it purely from the technical side. The technical aspect focuses on the correctness of the solution, while the business aspect focuses on its implementation and usage. Sometimes you may not need a complex, hard-to-interpret algorithm to meet your requirements; you may be happier with a simpler algorithm whose slightly lower accuracy can be traded off for its comprehensibility. Knowledge of the technical aspect is a must, but it should always be balanced against the business perspective.

Not everything is ML

Recently, machine learning has seen great advancement in its application to various business problems. With strong prediction capabilities, machine learning can solve many complex problems in business scenarios. But note that data science is not only about machine learning; machine learning is just a small part of it. Data science is about arriving at a feasible solution for a given problem, so one should also focus on data cleaning, data visualization, and the ability to extensively explore the data and find relations between its attributes. It is about the ability to crunch out meaningful numbers that matter the most. A good data scientist focuses on all of the above qualities rather than just trying to fit machine learning algorithms onto problem statements.

Programming Languages

It is important to have a grip on at least one programming language widely used in Data Science, and to know a little of another: either know R very well and some Python, or Python very well and some R.

Data cleaning and EDA

Exploratory Data Analysis is one of the important steps in the data analysis process. Here, the focus is on making sense of the data in hand: formulating the correct questions to ask of your dataset, working out how to manipulate the data sources to get the required answers, and so on. This is done by taking an elaborate look at trends, patterns, and outliers using visual methods. Cleaning is one of the most complex processes in data science, since almost every dataset available or extracted for language processing tasks is unstructured, and highly processed, neatly structured data will yield better results than noisy data. Often simple models on clean data give the best results, and it is usually better to perform cleaning with simple regular expressions rather than complex tools; a small sketch follows below.
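As a small illustration of the point about simple regular expressions, here is a minimal cleaning function; the sample message is made up.

import re

def clean_text(text):
    """A minimal text-cleaning sketch using simple regular expressions."""
    text = text.lower()
    text = re.sub(r"http\S+", " ", text)       # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters and whitespace only
    text = re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace
    return text

print(clean_text("WINNER!! Claim your £1000 prize at http://spam.example"))
# -> 'winner claim your prize at'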

Always open to learning more and more

“Data Science is a journey, not a destination”. This line gives us an insight into how huge the data science domain is and why constant learning is as important as building intelligent models. Practitioners who keep themselves updated with the new technology being developed every day are able to implement solutions and solve business problems faster. With all the resources available on the internet, like MOOCs, one can easily stay updated. Showcasing your skills on your blog or GitHub is another important hack that most of us overlook; it not only builds your public profile but also reinforces your own learning. As the saying goes, “The man who is too old to learn was probably always too old to learn.”

Evaluating Models and avoiding overfit

Separate the data into two sets, the training set and the testing set, to get a stronger estimate of an outcome. Cross-validation is the most convenient method to evaluate models without over-fitting, since it examines the out-of-sample fit; a minimal sketch follows below.
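A minimal sketch of this split-then-cross-validate workflow, using a built-in scikit-learn dataset purely for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hold out a test set, then cross-validate on the training portion only
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = make_pipeline(StandardScaler(), LogisticRegression())
cv_scores = cross_val_score(clf, X_train, y_train, cv=5)       # out-of-sample fit on the training data
test_score = clf.fit(X_train, y_train).score(X_test, y_test)   # final check on unseen data
print(cv_scores.mean(), test_score)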

Converting findings into the actions

Again, this might sound like a simple tip, but both beginners and advanced practitioners falter on it. Beginners may perform steps in Excel, including copy-pasting data; for advanced users, any work done through the command line interface might not be reproducible. Similarly, you need to be extra cautious while working with notebooks: resist the urge to go back and change a previous step that uses a dataset computed later in the flow. Notebooks are very powerful for maintaining a flow, but if the flow is not maintained they can become very messy.

Taking Rest

When do I work best? It’s when I give myself a 2-3 hour window to work on a problem or project. You can’t multi-task as a data scientist; you need to focus on a single problem at a time to make sure you get the best out of yourself. 2-3 hour chunks work best for me, but you can decide yours.

Conclusion

Data science requires continuous learning and is more of a journey than a destination. One always keeps learning more about data science, so keep the above tips and tricks in your arsenal to boost your productivity and deliver more value, since many complex problems can be solved with simple solutions! Stay tuned for more articles on data science.

Spam Detection with Natural Language Processing – Part 3

Building spam detection classifier using Machine learning and Neural Networks

Introduction

On our path to building an SMS SPAM classifier, we have so far converted our text data into numeric form with the help of a bag-of-words model. Using the TF-IDF approach, we now have numeric vectors that describe our text data.

In this blog, we will build a classifier that will help us identify whether an incoming message is spam or not. We will use both machine learning and neural network approaches to build the classifier. If you are jumping directly to this blog, I recommend you go through part 1 and part 2 of the SPAM classifier series first. The data used can be found here.

Assessing the problem

Before jumping to machine learning, we need to identify what we actually wish to do. We need to build a binary classifier which will look at a text message and tell us whether that message is spam or not, so we need to pick machine learning models that can perform a classification task. Also note that this is a binary classification problem, as we have only two output classes into which texts will be classified by our model (0: the message is not spam, 1: the message is spam).

We will build 3 machine learning classifiers, namely SVM, KNN, and Naive Bayes. We will implement each of them one by one and, in the end, have a look at the performance of each.

Building an SVM classifier (Support Vector Machine)

A Support Vector Machine (SVM) is a discriminative classifier which separates classes by forming hyperplanes. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. In two-dimensional space, this hyperplane is a line dividing the plane into two parts, with each class lying on either side.

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
vectoriser = TfidfVectorizer(decode_error="ignore")
X = vectoriser.fit_transform(list(training_dataset["comment"]))
y = training_dataset["b_labels"]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.30)

Next, we will train our model on the training dataset and evaluate it on a held-out test set (data which our model has never seen). We will also perform cross-validation on the classifier to make sure the trained model is free from bias and variance issues.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix

# probability=True is needed later for predict_proba when plotting the ROC curve
svm = SVC(kernel='linear', probability=True).fit(X_train, y_train)

scores = cross_val_score(svm, X_train, y_train, scoring='accuracy', n_jobs=-1, cv=10)
print('Cross-validation mean accuracy {0:.2f}%, std {1:.2f}.'.format(np.mean(scores) * 100, np.std(scores) * 100))

y_pred_svm = svm.predict(X_test)
confusion_matrix(y_test, y_pred_svm)

## Output
## Cross-validation mean accuracy 97.61%, std 0.85.
## array([[1446,    3],
##        [  19,  204]])

Our SVM model with the linear kernel has a mean cross-validation accuracy of 97.61% with a standard deviation of 0.85 on this data. Cross-validation is important for tuning the parameters of the model; in this case, we can try the different kernels available with SVM and find the best-performing kernel in terms of accuracy. We have reserved a separate test set to measure how well the tuned model works on never-before-seen data points.

Building a KNN classifier (K- nearest neighbor)

K-Nearest Neighbors (KNN) is one of the simplest algorithms used in Machine Learning for regression and classification problems. KNN classifies new data points based on similarity measures (e.g. a distance function). Classification is done by a majority vote among the neighbors: the data point is assigned to the class which has the most nearest neighbors. As you increase the number of nearest neighbors (the value of k), the decision boundary becomes smoother, which may or may not improve accuracy.

Below is the code snippet for the KNN classifier.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

scores = cross_val_score(knn, X_test, y_test, scoring='accuracy', n_jobs=-1, cv=10)
print('Cross-validation mean accuracy {0:.2f}%, std {1:.2f}.'.format(np.mean(scores) * 100, np.std(scores) * 100))

y_pred_knn = knn.predict(X_test)
confusion_matrix(y_test, y_pred_knn)

## Output
## Cross-validation mean accuracy 96.83%, std 1.59.
## array([[1449,    0],
##        [ 133,   90]])

Building a Naive Bayes Classifier

Naive Bayes classifiers rely on Bayes’ Theorem, which is based on conditional probability, or in simple terms, the likelihood that an event (A) will happen given that another event (B) has already happened. Essentially, the theorem allows a hypothesis to be updated each time new evidence is introduced. The equation below expresses Bayes’ Theorem in the language of probability:

P(A | B) = P(B | A) * P(A) / P(B)

Let’s explain what each of these terms means.

  • “P” is the symbol to denote probability.
  • P(A | B) = The probability of event A (hypothesis) occurring given that B (evidence) has occurred.
  • P(B | A) = The probability of event B (evidence) occurring given that A (hypothesis) has occurred.
  • P(A) = The probability of event A (hypothesis) occurring.
  • P(B) = The probability of event B (evidence) occurring.
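To make the theorem concrete, here is a tiny worked example with made-up numbers (they are not taken from our dataset):

# A made-up numeric illustration for spam detection:
# A = "message is spam", B = "message contains the word 'free'"
p_spam = 0.20             # P(A): prior probability of spam
p_free_given_spam = 0.50  # P(B | A)
p_free_given_ham = 0.05   # P(B | not A)

p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)   # P(B), by total probability
p_spam_given_free = p_free_given_spam * p_spam / p_free                 # P(A | B), Bayes' Theorem
print(round(p_spam_given_free, 3))   # 0.714: seeing 'free' raises the spam probability from 0.20 to ~0.71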

Below is the code snippet for multinomial Naive Bayes classifier

from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix

mb = MultinomialNB().fit(X_train, y_train)

scores = cross_val_score(mb, X_test, y_test, scoring='accuracy', n_jobs=-1, cv=10)
print('Cross-validation mean accuracy {0:.2f}%, std {1:.2f}.'.format(np.mean(scores) * 100, np.std(scores) * 100))

y_pred_nb = mb.predict(X_test)
confusion_matrix(y_test, y_pred_nb)

## Output
## Cross-validation mean accuracy 91.15%, std 0.80.
## array([[1449,    0],
##        [  72,  151]])

Evaluating the performance of our 3 classifiers

We have till now implemented 3 classification algorithms for finding out the SPAM messages

  1. SVM (Support Vector Machine)
  2. KNN (K nearest neighbor)
  3. Multinomial Naive Bayes

SVM, with the highest accuracy (97%), looks like the most promising model for identifying SPAM messages. Anyone could say this just by looking at the accuracy, right? But that may not be the full picture. For classification problems, accuracy may not be the only metric you want to look at. Feeling confused? Allow me to introduce our friend the Confusion Matrix, which will eventually sort all your confusion out.

Confusion Matrix

A confusion matrix, also known as error matrix, is a table which we use to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm.
It allows easy identification of confusion between classes e.g. one class is commonly mislabeled as the other. Most performance measures are computed from the confusion matrix.

A confusion matrix is a summary of prediction results on a classification problem. The number of correct and incorrect predictions are summarized with count values and broken down by each class. This is the key to the confusion matrix. The confusion matrix shows the ways in which your classification model is confused when it makes predictions. It gives us insight not only into the errors being made by a classifier but more importantly the types of errors that are being made.

A sample confusion matrix for 2 classes

Definition of the Terms:

• Positive (P): Observation is positive (for example: is a SPAM).
• Negative (N): Observation is not positive (for example: is not a SPAM).
• True Positive (TP): Observation is positive, and the model predicted positive.
• False Negative (FN): Observation is positive, but the model predicted negative.
• True Negative (TN): Observation is negative, and the model predicted negative.
• False Positive (FP): Observation is negative, but the model predicted positive.

Let us bring two other metrics apart from accuracy which will help us to have a better look at our 3 models

Recall:

Recall is the ratio of the total number of correctly classified positive examples to the total number of positive examples. High recall indicates the class is correctly recognized (a small number of FN).

Precision:

To get the value of precision, we divide the total number of correctly classified positive examples by the total number of predicted positive examples. High precision indicates that an example labelled as positive is indeed positive (a small number of FP).

Let us have a look at the confusion matrix of our SVM classifier and try to understand it; we will then summarise the confusion matrices of all 3 classifiers.

Given below is the confusion matrix of the results which our SVM model has predicted on the test data. Let us find out accuracy, precision and recall in this case.

Accuracy = (1446+204)/(1446+3+19+204) = 1650/1672 = 0.987, i.e. 98.7% accuracy

Recall = (204)/(204+19) = 204/223 = 0.9147, i.e. 91.47% recall

Precision = (204)/(204+3) = 204/207 = 0.985, i.e. 98.5% precision
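These same numbers can also be obtained directly from scikit-learn; a small sketch, assuming y_test and the SVM predictions y_pred_svm from the section above are still in scope:

from sklearn.metrics import accuracy_score, precision_score, recall_score

# Cross-checking the hand calculations above
print(accuracy_score(y_test, y_pred_svm))    # ~0.987
print(precision_score(y_test, y_pred_svm))   # ~0.986
print(recall_score(y_test, y_pred_svm))      # ~0.915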

Understanding the ROC Curve

In Machine Learning, performance measurement is an essential task. So when it comes to a classification problem, we can count on an AUC – ROC Curve. It is one of the most important evaluation metrics for checking any classification model’s performance. It is also written as AUROC (Area Under the Receiver Operating Characteristics)

The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability: it tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. By analogy, the higher the AUC, the better a diagnostic model is at distinguishing between patients with and without a disease.

We plot a ROC curve with TPR against the FPR where TPR is on y-axis and FPR is on the x-axis.

Plotting RoC curves for SVM classifier

from sklearn import metrics
probs = svm.predict_proba(X_test)
preds = probs[:,1]
fpr, tpr, threshold = metrics.roc_curve(y_test, preds)
roc_auc = metrics.auc(fpr, tpr)
import matplotlib.pyplot as plt
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

Let us have a look at the ROC curve of our SVM classifier

Always remember that the closer AUC (Area under the curve) is to value 1, the better the classification ability of the classifier. Furthermore, let us also have a look at the ROC curve of our KNN and Naive Bayes classifier too!

The graph on the left is for KNN and on the right is for Naive Bayes classifier. This clearly indicates that Naive Bayes classifier, in this case, is much more efficient than our KNN classifier as it has a higher AUC value!

Conclusion

In this series, we looked at understanding NLP from scratch to building our own SPAM classifier over text data. This is an ideal way to start learning NLP as it covers basics of NLP, word embeddings and numeric representations of text data and modeling over those numeric representations. You can also try neural networks for NLP as they are able to achieve good performance! Stay tuned for more on NLP in coming blogs.

 

How to be an Artificial Intelligence (AI) Expert?

Introduction

Artificial Intelligence has been growing at a rapid pace over the last decade, and you have seen it all unfold before your eyes. From self-driving cars to Google Brain, artificial intelligence has been at the centre of these amazing, high-impact projects.

Artificial Intelligence (AI) made headlines recently when people started reporting that Alexa was laughing unexpectedly. Those news reports led to the usual jokes about computers taking over the world, but there’s nothing funny about considering AI as a career field. Just the fact that five out of six Americans use AI services in one form or another every day proves that this is a viable career option

Why AI?

Well, there can be many reasons for students selecting this as their career track, or for professionals shifting their careers towards AI. Let us have a look at some of the reasons to choose AI!

  1. Interesting and Exciting
    AI offers applications in those domains which are challenging as well as exciting. Driverless cars, human behaviour prediction, chatbots etc are just a few examples, to begin with.
  2. High Demand and Value
    Lately, there has been huge demand in the industry for data scientists and AI specialists, which has resulted in more jobs and greater value in the workplace.
  3. Well Paid
    With high demand and loads of work to be done, this field is currently one of the best-paid career choices. In an era when jobs were shrinking and the market was saturating, AI emerged as one of the best-paid fields.

If you still have doubts about why one should choose AI as a career, then my answer is as clear as the thought that “If you do not want AI to take your job, you have to take up AI”!

Level 0: Setting up the ground

Only if maths (even a lot of it) does not intimidate you, and you love to code, should you start looking at AI as your career. If you enjoy optimizing algorithms and playing with maths, or are passionate about it, kudos! Level 0 is cleared and you are ready to start a career in AI.

Level 1: Treading into AI

At this stage, one should cover the basics first, and when I say basics, it does not mean knowledge of 4-5 concepts but rather quite a lot of them.

  1. Cover Linear Algebra, Statistics, and Probability
    Math is the first and foremost thing you need to cover. Start from the basics of math, covering vectors, matrices, and their transformations. Then proceed to understand dimensionality, statistics and different statistical tests like the z-test, chi-square test etc. After this, you should focus on the concepts of probability, like Bayes’ Theorem. Maths is the foundation for understanding and building the complex AI algorithms which are making our lives simpler!

 

  2. Select a programming language

    After learning and becoming proficient in the basic maths, you need to select a programming language. I would suggest you take up one, or at most two, programming languages and understand them in depth. One can select from R, Python, or even Java! Always remember, a programming language is just there to make your life simpler and is not something which defines you. We can start with Python because it is abstract and provides a lot of libraries to work with; R is also evolving very fast, so we can consider that too, or else go with Java (only if we have a good CS background!).
  3. Understand data structures
    Try to understand data structures, i.e. how you can design a system for solving problems involving data. This will help you design systems which are accurate and optimized, since AI is all about reaching an accurate and optimized result. Learn about stacks, linked lists, dictionaries and the other data structures your selected programming language has to offer.
  4. Understand Regression in complete detail
    Well, this is one piece of advice you will get from everyone. Regression is the basic application of the maths you have learned so far. It shows how this knowledge can be used to make predictions in real-life applications. Having a strong grasp of regression will help you greatly in understanding the basics of machine learning and will prepare you well for your AI career.
  5. Move on to understand different Machine Learning models and how they work
    After learning regression, one should get their hands dirty with other classical machine learning algorithms like Decision Trees, SVM, KNN, Random Forests etc., and implement them on different day-to-day problems. One should know the working math behind every algorithm. This may be a little tough initially, but once you get going, everything will fall into place. Aim to be a master in AI and not just a random practitioner!
  6. Understand the problems that machine learning solves
    You should understand the use cases of different machine learning algorithms and focus on why a certain algorithm fits one case better than another. Only then will you be able to appreciate the mathematical concepts which make an algorithm more suitable to a particular business need or use case. Machine learning itself is divided into 3 broad categories: Supervised Learning, Unsupervised Learning, and Reinforcement Learning. One needs to be better than average in all 3 before stepping into the world of Deep Learning!

 

Level 2: Moving deeper into AI

This is level 2 of your journey towards becoming an AI specialist. At this level, we move into Deep Learning, but only once you have mastered classical machine learning!

  1. Understanding Neural Networks

    A neural network is a type of machine learning which models itself after the human brain. This creates an artificial neural network that via an algorithm allows the computer to learn by incorporating new data. At this stage, you need to start your deep learning by understanding neural networks in great detail. You need to understand how these networks are intelligent and make decisions. Neural nets are the backbone of AI and you need to learn it thoroughly!
  2. Unrolling the maths behind neural networks
    Neural networks are typically organized in layers. Layers are made up of a number of interconnected ‘nodes’ which contain an ‘activation function’. Patterns are presented to the network via the ‘input layer’, which communicates to one or more ‘hidden layers’ where the actual processing is done via a system of weighted ‘connections’. The hidden layers then link to an ‘output layer’ where the answer is output. You need to learn about the maths happening behind it: weights, activation functions, loss reduction, backpropagation, gradient descent etc. These are some of the basic mathematical keywords used in neural networks, and a strong knowledge of them will enable you to design your own networks. You will also actually understand where and how a neural network gets its intelligence from (a tiny numpy sketch of a single forward pass follows after this list). It’s all maths mate.. all maths!
  3. Mastering different types of neural networks
    Just as in ML we learned regression first and then moved on to the other ML algorithms, the same applies here. Since you have learned all about basic neural networks, you are ready to explore the different types of neural networks suited to different use cases. The underlying maths may remain the same; the differences lie in a few modifications here and there and in the pre-processing of the data. Different types of neural nets include Multilayer Perceptrons, Recurrent Neural Nets, Convolutional Neural Nets, LSTMs etc.
  4. Understanding AI in different domains like NLP and Intelligent Systems
    With knowledge of different neural networks, you are now better equipped to master their application to different business problems. You may need to build a driverless car module, a human-like chatbot, or even an intelligent system which can interact with its surroundings and self-learn to carry out tasks. Different use cases require different approaches and different knowledge. You cannot master every field in AI, as it is a very large field indeed, so I suggest you pick a single field, say Natural Language Processing, and work on gaining depth in that field. Only once your knowledge has good depth should you think of expanding across different domains.
  5. Getting familiar with the basics of Big Data
    Although acquiring knowledge of Big Data is not mandatory, I suggest you equip yourself with the basics, because your AI systems will mostly be handling Big Data, and knowing the basics will help you build more optimized and realistic algorithms.
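As promised in item 2 above, here is a minimal numpy sketch of a single forward pass. The weights are random placeholders, not a trained network; it only illustrates how weighted connections and activation functions produce an output.

import numpy as np

# A tiny forward pass through a made-up network: 2 inputs -> 3 hidden units -> 1 output
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x  = np.array([0.5, -1.2])                      # input pattern
W1 = np.random.randn(3, 2); b1 = np.zeros(3)    # hidden-layer weights and biases
W2 = np.random.randn(1, 3); b2 = np.zeros(1)    # output-layer weights and biases

hidden = sigmoid(W1 @ x + b1)       # weighted connections + activation function
output = sigmoid(W2 @ hidden + b2)
print(output)                       # a score between 0 and 1; training would adjust W1, W2 via backpropagation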

Level 3: Mastering AI

This is the final stage, where you go all guns blazing: the point where you need to learn less and apply more of what you have learned so far!

  1. Mastering Optimisation Techniques
    Levels 1 and 2 focus on achieving accuracy in your work, but now we have to talk about optimizing it. Deep learning algorithms consume a lot of system resources and you need to optimize every part of them. Optimization algorithms help us minimize (or maximize) an objective function (another name for the error function) E(x), which is simply a mathematical function dependent on the model’s internal learnable parameters. These internal parameters play a very important role in training a model efficiently and effectively and in producing accurate results. This is why we use various optimization strategies and algorithms to update and calculate appropriate, optimal values of the model’s parameters, which influence the learning process and the output of the model (a tiny gradient-descent sketch follows after this list).
  2. Taking part in competitions
    You should actually take part in hackathons and data science competitions on Kaggle, as this will enhance your knowledge and give you more opportunities to apply it.
  3. Publishing and Reading lot of Research Papers
    Research, implement, innovate, test. Keep repeating this cycle by reading a lot of research papers related to AI. This will help you understand how to be not just a practitioner but to strive to be an innovator. AI is still nascent and needs masters who can innovate and bring revolution to this field.
  4. Tweaking maths to roll out your own algorithms
    Innovation needs a lot of research and knowledge. This is the final place where you want yourself to be, actually fiddling with the maths which powers this entire field. Once you are able to master this art, you will be one step away from bringing a revolution!
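As referenced in item 1 above, here is a toy gradient-descent loop on a one-parameter objective, purely for illustration of how an optimizer updates a learnable parameter:

# A minimal gradient-descent sketch: minimise the toy objective
# E(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
x = 0.0
learning_rate = 0.1

for step in range(100):
    grad = 2 * (x - 3)             # gradient of the objective at the current parameter value
    x = x - learning_rate * grad   # update the parameter against the gradient

print(x)   # converges towards x = 3, the minimiser of E(x)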

Conclusion

Mastering AI is not something one can achieve in a short time. It requires hard work, persistence, consistency, patience and a great deal of knowledge! It may be one of the hottest jobs in the industry currently. Being a practitioner or enthusiast in AI is not difficult, but if you are looking at becoming a master of it, you have to be as good as those who created it! It takes years and skill to master anything, and the same is the case with AI. If you are motivated, nothing can stop you in this entire world. (Not even an AI :P)

Spam Detection with Natural Language Processing-Part 2

Understanding TF-IDF and Word Embeddings


In the last blog, we had a look at visualizing text data and understood some basic concepts of tokenization and lemmatization, and we wrote Python functions to perform all the operations for us. If you are jumping directly to this blog, I highly recommend you go through the previous blog post, in which we discussed the problem statement and some foundational concepts of NLP.

We will be covering the following topics

  1. Understanding Tf-IDF
  2. Finding Important words using Tf-IDF
  3. Understanding Bag of Words
  4. Understanding Word Embedding
  5. Different Types of word embeddings
  6. Difference between word embeddings and Bag of words model
  7. Preparing a word embedding for SPAM classifier

Introduction

Previously, we found the most frequently occurring words, bigrams, and trigrams in the messages, separately for spam and non-spam messages. Now we also need to find important words that can themselves indicate whether a message is spam or not. Note that the most frequently occurring word in a set of messages may not be a keyword that determines what the entire sentence is about.

For example, in a business article, words like business, investment, and acquisition are important words that may relate a sentence to a business article. Other words like money, good, building etc. may be frequent in the messages but do not provide much relevant information.

To find the important words, we will be using the method known as Term Frequency-Inverse Document Frequency (TF-IDF)

What is TF-IDF?

Tf-idf stands for term frequency-inverse document frequency, and the tf-idf weight is a weight often used in information retrieval and text mining.

TF means Term Frequency. It measures how frequently a term occurs in a document. Since every document is different in length, it is possible that a term would appear much more times in long documents than shorter ones. Thus, the term frequency is often divided by the document length as a way of normalization.

TF = (Number of times term w appears in a document) / (Total number of terms in the document)

Second part idf stands for Inverse Document Frequency. It measures how important a term is. While computing TF, all terms are equally important. However, it is known that certain terms, such as “is”, “of”, and “that”, may appear a lot of times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones.

IDF =  log_e(Total number of documents / Number of documents with term w in it)

We calculate the final tf-idf score by multiplying the TF score with the IDF score for every word; we can then filter out important words by selecting words with a higher tf-idf score. A tiny worked example of the two formulas follows below.
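Before the scikit-learn version below, here is a tiny hand-rolled sketch of these two formulas on a made-up corpus:

import math

# A tiny worked example of the TF and IDF formulas above; the three "documents"
# are made up for illustration
docs = [["ironman", "movie", "is", "good"],
        ["titanic", "movie", "is", "boring"],
        ["thor", "movie", "is", "good"]]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

for term in ["movie", "ironman"]:
    print(term, round(tf(term, docs[0]) * idf(term, docs), 3))
# 'movie' appears in every document, so its idf (and hence tf-idf) is 0.0,
# while the rarer 'ironman' scores higher (~0.275)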

Code Implementation

An example to calculate Tf-idf score for different words

Sentences = ["Ironman movie is really good. Ironman is always my favourite", "Titanic movie is very boring","Thor movie is really good"]

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
tfidf = TfidfVectorizer()
features = tfidf.fit_transform(Sentences)
pd.DataFrame(features.todense(),columns=tfidf.get_feature_names())

Finding Important words using Tf-IDF

Now we need to find out which are the most important words in both spam and non-spam messages, and then we will have a look at those words in the form of a word cloud. Analysing those words will help us understand why a particular message has been marked as spam and another as non-spam.

First, we import the necessary libraries. Then I have written a function that returns a TF-IDF score for all words in the corpus.

from gensim import corpora, models

def get_tfidf_matrix(documents):
    # my_tokeniser() was defined in Part 1 of this series
    documents = [my_tokeniser(document) for document in documents]
    dictionary = corpora.Dictionary(documents)
    corpus = [dictionary.doc2bow(text) for text in documents]
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    return corpus_tfidf

Then we need to map all the scores to the words in the corpus in order to find the most important words

def get_tfidf_score_dataframe(sentiment_label):
    frames = get_tfidf_matrix(training_dataset[training_dataset["Sentiment"]==sentiment_label]["Phrase"])
    all_score = []
    all_words = []
    sentence_count = 0
    for frame in frames:
        words = my_tokeniser(training_dataset[training_dataset["Sentiment"]==sentiment_label]["Phrase"].iloc[sentence_count])
        sentence_count = sentence_count + 1
        for i in range(0, len(frame)):
            # each entry of a gensim tf-idf document is a (word_id, score) tuple
            all_score.append(frame[i][1])
            all_words.append(words[i])
    tf_idf_frame = pd.DataFrame({
        'Words': all_words,
        'Score': all_score
    })
    return tf_idf_frame

Finally, we plot all the important words in the form of a word cloud

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

def plot_tf_idf_wordcloud(sentiment_label):
    tf_idf_frame = get_tfidf_score_dataframe(sentiment_label)
    sorted_tf_idf_frame = tf_idf_frame.sort_values("Score", ascending=False)
    important_words = sorted_tf_idf_frame[sorted_tf_idf_frame["Score"]==1]["Words"].unique()
    comment_words = ''
    for words in important_words:
        comment_words = comment_words + words + ' '
    # STOPWORDS comes from the wordcloud package; a custom stop-word list would also work here
    wordcloud = WordCloud(width = 800, height = 800,
                    background_color ='white',
                    stopwords = STOPWORDS,
                    min_font_size = 10).generate(comment_words)
    plt.figure(figsize = (8, 8), facecolor = None)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad = 0)
    plt.show()

Plotting important words for non-spam messages

plot_tf_idf_wordcloud(sentiment_label=0)

Plotting important words for spam messages

plot_tf_idf_wordcloud(sentiment_label=1)

Understanding Bag of Words

We need a way to represent text data for the machine learning algorithm and the bag-of-words model helps us to achieve that task. The bag-of-words model is simple to understand and implement. It is a way of extracting features from the text for use in machine learning algorithms.

A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

  1. A vocabulary of known words.
  2. A measure of the presence of known words.

The vocabulary can be obtained by tokenising the messages into unique tokens. After getting each token, we need to score it. This can be done in the following ways:

  • Counts. Count the number of times each word appears in a document.
  • Frequencies. Calculate the frequency that each word appears in a document out of all the words in the document.
  • TF-IDF : TF score * IDF score

How BoW works

Forming the vector

Take for example 2 text samples: "The quick brown fox jumps over the lazy dog" and "Never jump over the lazy dog quickly".

The corpus(text samples) then form a dictionary:

{
    'brown': 0,
    'dog': 1,
    'fox': 2,
    'jump': 3,
    'jumps': 4,
    'lazy': 5,
    'never': 6,
    'over': 7,
    'quick': 8,
    'quickly': 9,
    'the': 10,
}

Vectors are then formed to represent the count of each word. In this case, each text (i.e. each sentence) generates an 11-element vector like so:

[1,1,1,0,1,1,0,1,1,0,2]
[0,1,0,1,0,1,1,1,0,1,1]

Each element represents the number of occurrences of the corresponding word in the corpus (text sample). So, in the first sentence, there is 1 count for “brown”, 1 count for “dog”, 1 count for “fox” and so on (represented by the first vector), whereas the second vector shows 0 counts for “brown”, 1 count for “dog”, 0 counts for “fox”, and so on.
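The same vectors can be produced with scikit-learn's CountVectorizer; a minimal sketch (note that newer scikit-learn versions use get_feature_names_out instead of get_feature_names):

from sklearn.feature_extraction.text import CountVectorizer

# Reproducing the two count vectors above with scikit-learn
corpus = ["The quick brown fox jumps over the lazy dog",
          "Never jump over the lazy dog quickly"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names())   # the vocabulary, in alphabetical order
print(counts.toarray())
# [[1 1 1 0 1 1 0 1 1 0 2]
#  [0 1 0 1 0 1 1 1 0 1 1]]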

Understanding Word Vectors

Word vectors are simply vectors of numbers that represent the meaning of a word.

Traditional approaches to NLP, such as one-hot encodings, do not capture syntactic (structure) and semantic (meaning) relationships across collections of words and, therefore, represent language in a very naive way.

Word vectors represent words as multidimensional continuous floating point numbers where semantically similar words are mapped to proximate points in geometric space. In simpler terms, a word vector is a row of real-valued numbers (as opposed to dummy numbers) where each point captures a dimension of the word’s meaning and where semantically similar words have similar vectors. This means that words such as wheel and engine should have similar word vectors to the word car (because of the similarity of their meanings), whereas the word banana should be quite distant.

A simple representation of word vectors

Now we will look at an example of using word vectors where we will group words of similar semantics together

import numpy as np
import pandas as pd
import spacy
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

nlp = spacy.load("en")   # the small English model shipped with older spaCy versions
sentence = "Tiger was driving a car when he saw a fox taking the charge on a bike but in the end giraffe won the race using his aircraft"
tokens = nlp(sentence)

# keep only tokens that actually have a word vector, and remember their text
words_with_vectors = [word.text for word in tokens if word.has_vector]
vectors = np.vstack([word.vector for word in tokens if word.has_vector])

# project the high-dimensional vectors down to 2 dimensions for plotting
pca = PCA(n_components=2)
vecs_transformed = pca.fit_transform(vectors)

d = pd.DataFrame(vecs_transformed, columns=["V1", "V2"])
d["Name"] = words_with_vectors

plt.figure(figsize=(16, 10), facecolor=None)
plt.scatter(d["V1"], d["V2"])
for i, txt in enumerate(d["Name"]):
    plt.annotate(txt, (d["V1"][i], d["V2"][i]))
plt.show()

Preparing a bag of words model for Analysis

Below is the code snippet for converting our messages into a table of numerical word vectors. Only after achieving this can we build our classifier using machine learning, since machine learning always needs numerical inputs!

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
vectoriser = TfidfVectorizer(decode_error="ignore")
X = vectoriser.fit_transform(list(training_dataset["comment"]))
y = training_dataset["b_labels"]
## Output
print(repr(X))
## <5572x8672 sparse matrix of type '<class 'numpy.float64'>'
##	with 73916 stored elements in Compressed Sparse Row format>

Conclusion and Further steps

So far we have learnt to perform EDA on text data. We have also learnt about important terms in NLP like tokenization, lemmatization, stop-words, tf-idf, the bag of words, and word vectors; these terms are essential to master NLP. With our word embedding ready, we will proceed to actually build machine learning models, which will help us predict whether a message is spam or not. In the next blog, we will build machine learning and neural network models and compare their performance, understand the shortcomings of plain neural nets for text mining, and finally move to recurrent neural networks and LSTMs to wrap up the series!

Click Here for Part 1 of the article.

Stay tuned!