KNNs (K-Nearest-Neighbours) in Python


The Nearest Neighbours algorithm is a search and optimization problem that appeared early in the computer science literature, where Donald Knuth described it as the "post office problem". The key idea is to find out which group, or class, a given point in the search space belongs to, whether the setting is binary, multiclass, continuous (regression), unsupervised, or semi-supervised. Sounds mathematical? Let’s make it simple.

Imagine you are a shopkeeper who sells online. And you are trying to group your customers in such a way that products that are recommended for them come on the same page. Thus, a customer in India who buys a laptop will also buy a mouse, a mouse-pad, speakers, laptop sleeves, laptop bags, and so on. Thus you are trying to group this customer into a category. A class. How do you do this if you have millions of customers and over 100,000 products? Manual programming would not be the way to go. Here, the nearest neighbours method comes to the rescue.

You can group your customers into classes (e.g. Laptop-Buyer, Gaming-Buyer, New-Mother, Children~10-years-old) and, based upon what other people in those classes have bought in the past, you can show them the items they are most likely to buy next, making their online shopping experience much easier and much more streamlined. How will you do that? By grouping your customers into classes, and, when a new customer comes along, working out which class they belong to and showing them the products relevant to that class.

This is the essence of the ML approach that platforms such as Amazon and Flipkart apply to every customer. Their algorithms are far more complex, but the underlying idea is the same.

The Nearest Neighbours topic can be divided into the following sub-topics:

  1. Brute-Force Search
  2. KD-Trees
  3. Ball-Trees
  4. K-Nearest Neighbours

Out of all of these, K-Nearest Neighbours (commonly abbreviated as KNN) is by far the most widely used.

K-Nearest Neighbours (KNNs)

A KNN algorithm is very simple, yet it can be used for some very complex applications and arcane dataset distributions. It can be used for binary classification, multi-class classification, regression, clustering, and even for creating new state-of-the-art research techniques (e.g. https://www.hindawi.com/journals/aans/2010/597373/ – a research paper on a fusion of KNNs and SVMs). Here, we will describe one application of KNNs: binary classification, on an extremely interesting dataset from the UCI Repository (sonar, mines vs. rocks).

Implementation

The algorithm of a KNN ML model is given below:

(Figure: the K-Nearest Neighbours algorithm)

Again, mathematical! Let’s break it into small steps one at a time:

How the Algorithm Works

This explanation is for supervised learning binary classification.

Here we have two classes. We’ll call them A and B.

So the dataset is a collection of values which belong either to class A or class B.

A visual plot of the (arbitrary) data might look something like this:

Now, look at the star data point in the centre. To which class does it belong? A or B?

The answer? It depends on the hyperparameters we use. In the diagram above, k is the hyperparameter in question.

Hyperparameters significantly affect the output of a machine learning (ML) algorithm, which is why they need to be correctly tuned (set to the right values).

The algorithm then finds the ‘k’ points closest to the new point. The output is shown above for k = 3 and for k = 6 (k being the number of nearest neighbouring points used to decide which class the new point belongs to).

Finally, we return as output the class most common among those k nearest neighbours, with closeness measured by a distance metric such as the Euclidean distance.

This is how the K Nearest Neighbours algorithm works in principle. As you can see, visualizing the data is a big help to get an intuitive picture of what the k values should be.
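
To make this concrete, here is a minimal sketch, with a toy two-class dataset of my own (not the sonar data used later), that finds the k nearest neighbours of a new point using plain NumPy, Euclidean distance, and a majority vote:

import numpy as np
from collections import Counter

# Toy training data: six 2-D points, labelled 'A' or 'B'
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],   # class A cluster
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]])  # class B cluster
y_train = np.array(["A", "A", "A", "B", "B", "B"])

def knn_predict(x_new, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of the k nearest neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]

print(knn_predict(np.array([2.0, 2.0]), k=3))  # -> 'A'
print(knn_predict(np.array([6.0, 5.5]), k=3))  # -> 'B'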

Now, let’s see the K-Nearest-Neighbours Algorithm work in practice.

Note: This algorithm is powerful and highly versatile. It can be used for binary classification, multi-class classification, regression, clustering, and so on. Many use-cases exist for this simple yet remarkably powerful algorithm, so make sure you learn it well enough to use it in your own projects.

Obtain the Data and Preprocess it

We shall use the data from the UCI Repository, available at the following link:

http://archive.ics.uci.edu/ml/datasets/connectionist+bench+(sonar,+mines+vs.+rocks) 

It needs to be manually converted into a CSV file, which is available at the following link:

https://github.com/selva86/datasets/blob/master/Sonar.csv

This data is a set of 207 underwater sonar readings that have to be classified as returns from rocks or from underwater mines. Save the CSV file in the same directory as your Python source file and perform the following operations:

Import the required packages first:

import numpy as np
import pandas as pd
import scipy as sp


from datetime import datetime
from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

Read the CSV dataset into your Python environment and check out the top 5 rows using the head() Pandas DataFrame function. (Note that this CSV has no header row, so pandas absorbs the first record as the column names here, leaving 207 of the 208 rows; pass header=None to read_csv if you want to keep every row.)

> df = pd.read_csv("sonar.all-data.csv")
df.head()
0.0200  0.0371  0.0428  0.0207  0.0954 ...  0.0180  0.0084  0.0090  0.0032  R
0  0.0453  0.0523  0.0843  0.0689  0.1183 ...  0.0140  0.0049  0.0052  0.0044  R
1  0.0262  0.0582  0.1099  0.1083  0.0974 ...  0.0316  0.0164  0.0095  0.0078  R
2  0.0100  0.0171  0.0623  0.0205  0.0205 ...  0.0050  0.0044  0.0040  0.0117  R
3  0.0762  0.0666  0.0481  0.0394  0.0590 ...  0.0072  0.0048  0.0107  0.0094  R
4  0.0286  0.0453  0.0277  0.0174  0.0384 ...  0.0057  0.0027  0.0051  0.0062

Sonar Reading for Classification ML Problem

Check how much data you have and what its dimensions are:

> df.describe()
0.0200      0.0371     ...          0.0090      0.0032
count  207.000000  207.000000     ...      207.000000  207.000000
mean     0.029208    0.038443     ...        0.007936    0.006523
std      0.023038    0.033040     ...        0.006196    0.005038
min      0.001500    0.000600     ...        0.000100    0.000600
25%      0.013300    0.016400     ...        0.003650    0.003100
50%      0.022800    0.030800     ...        0.006300    0.005300
75%      0.035800    0.048100     ...        0.010350    0.008550
max      0.137100    0.233900     ...        0.036400    0.043900
> df.shape
(207, 61)

Now, the last column is a letter. We need to encode it into a numerical value. For this, we can use LabelEncoder, as below:

# Inputs (data values): sonar readings from an underwater submarine. Cool!
X = df.values[:, 0:-1].astype(float)

# Classification target (the last column)
target = df.R

# Convert classes M (Mine) and R (Rock) to numbers, since they're categorical values
le = LabelEncoder()
y = le.fit_transform(target)

Now have a look at your target dataset. R (rock) and M (mine) have been converted into 1 and 0 respectively.

y
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0])

Examine your unscaled input NumPy array, X:

X
array([[0.0453, 0.0523, 0.0843, ..., 0.0049, 0.0052, 0.0044],
       [0.0262, 0.0582, 0.1099, ..., 0.0164, 0.0095, 0.0078],
       [0.01  , 0.0171, 0.0623, ..., 0.0044, 0.004 , 0.0117],
       ...,
       [0.0522, 0.0437, 0.018 , ..., 0.0138, 0.0077, 0.0031],
       [0.0303, 0.0353, 0.049 , ..., 0.0079, 0.0036, 0.0048],
       [0.026 , 0.0363, 0.0136, ..., 0.0036, 0.0061, 0.0115]])

Execute the train_test_split partition function. This splits the inputs into 4 separate NumPy arrays. We can control how the input data is split using the test_size or train_size parameters. Here the test_size parameter is set to 0.3: 30% of the data goes into the test set and the remaining 70% (the complement) into the training set. We train (fit) the ML model on the training arrays and see how accurate our model is on the test set. By default, the value is set to 0.25 (25% test, 75% train). Normally this sampling is randomized, so different results appear each time the code is run. Setting random_state to a fixed value (any fixed value) makes sure that the same split is obtained every time we execute the model.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

Fit the KNN classifier to the dataset.

# Train a k-nearest-neighbours classifier
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=1)

# Fit the model
clf.fit(X_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=1,
           weights='uniform')

For now, it is all right (at this level) to leave the defaults as they are. The output of KNeighborsClassifier has two values that you do need to know: metric and p. Setting metric = “minkowski“ with p = 1 gives the Manhattan distance, which is what we use here. The Manhattan distance is the distance between two points measured along axes at right angles: in a plane with p1 at (x1, y1) and p2 at (x2, y2), it is |x1 – x2| + |y1 – y2|. (Source: https://xlinux.nist.gov/dads/HTML/manhattanDistance.html)
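
As a quick sanity check on that definition, here is a tiny sketch (a toy pair of points of my own, not taken from the sonar data) computing the Manhattan and Euclidean distances directly:

import numpy as np

p1 = np.array([1.0, 4.0])
p2 = np.array([3.0, 1.0])

# Manhattan (L1) distance: |x1 - x2| + |y1 - y2|
manhattan = np.abs(p1 - p2).sum()            # 2 + 3 = 5.0
# Euclidean (L2) distance, for comparison
euclidean = np.sqrt(((p1 - p2) ** 2).sum())  # sqrt(4 + 9) ≈ 3.61

print(manhattan, euclidean)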

Output the statistical scoring of this classification model.

predicted = clf.predict(X_test)
print("Accuracy:")
print(accuracy_score(y_test, predicted))
print("Confusion Matrix:")
print(confusion_matrix(y_test, predicted))
print("Classification Report:")
print(classification_report(y_test, predicted))
Accuracy:
0.8253968253968254
Confusion Matrix:
[[32  3]
 [ 8 20]]
Classification Report:
             precision    recall  f1-score   support

          0       0.80      0.91      0.85        35
          1       0.87      0.71      0.78        28

avg / total       0.83      0.83      0.82        63

Accuracy on the test set was 82%. Not bad, since my first implementation used a random forest classifier and its top score was just 72%!

The entire program is available as Python source code in the downloadable sonar-classification.txt file below (rename *.txt to *.py and you’re good to go):

https://dimensionless.in/wp-content/uploads/2018/11/sonar.txt

To learn more about how k-nearest neighbours are used in practice, do check out the following excellent article on our blog:

https://dimensionless.in/spam-detection-with-natural-language-processing-part-3/ 

The following article is also an excellent reference for KNNs:

https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm 

Takeaways

K-Nearest-Neighbours is a powerful algorithm to have in your machine learning classification arsenal. It is used so frequently that many practitioners reach for it as a first baseline model. Use it, learn it in depth, and it will be incredibly useful to you throughout your data science career. I highly recommend the Wikipedia article since it covers nearly all applications of KNNs and much more.

Finally, Understand the Power of Machine Learning

Imagine trying to produce a classical, hand-crafted analysis of this sonar data with its 60 features, without a machine learning environment. You would have to load roughly 207 × 61 ≈ 12,600 values and then develop an algorithm by hand to analyze the data!

Scikit-Learn, TensorFlow, Keras, PyTorch, and AutoKeras bring such fantastic abilities to computers with respect to problems that could not be solved before ML came along.

And this is just the beginning!

Automation is the Future

As ML and AI applications take root in our world more and more, many human roles will be taken over by ‘intelligent’ software programs that perform operations like a human. We already have chatbots in many companies. Self-driving cars push the very limits of what we think a machine can do, every day. The only question is, will you be on the side that is being replaced, or will you be at the new forefront of technological progress? Get reskilled. Or just start learning! Today, as soon as you can.

Data Science in Esports


Introduction

Electronic video gaming has grown from a hobby into a serious sport and business. Earlier this year, eSports was officially announced as a medal event for the 2022 Asian Games. According to data analytics expert Andrew Pearson, the rise of eSports presents exciting opportunities in data analytics and marketing.

There’s been an explosive growth in esports popularity over recent years, fuelled by games specifically designed with online competition in mind. Blizzard’s Overwatch is a case in point. When the Overwatch League debuted in January 2018, 415,000 viewers tuned in to watch.

The stakes are high. Each team in the Overwatch League stumped up $20 million (£14.4 million) for a city franchise. Participating gamers enjoy $50,000 (£36,000) salaries while competing for a prize pool totaling a cool $3.5 million (£2.5 million).

AI and ESPORTS

Just as data analytics is helping golfers, athletes, F1 teams, football clubs and cricketers improve their performance, esports is well-placed to follow suit. As with any sport, winning doesn’t just hinge on skill, dedication and luck. It’s often determined by strategy and the analysis of past performance. The secrets to success lie in data and esports is overflowing with it.

We can divide the idea of AI and esports into three different aspects or perspectives: AI agents playing the games themselves; game analytics platforms that provide insights into players, their gaming behaviour and tactics; and, lastly, data science in the gaming industry to manage the business side of games as products.


Let us have a look at each of the aspects and discuss them in detail one by one!

ESPORTS by AI

Gamers have been pitting their wits and skill against computers since the earliest days of video games. The levels of difficulty were pre-programmed, and at a certain point in the game, the computer was simply unbeatable by all but the most gifted gamers.

Over time, the concept of difficulty levels evolved. For example, “Madden” NFL Football games have four different levels (ranging from Rookie to All-Madden) that make running plays more difficult, while first-person shooter (FPS) games like “Duke Nukem 3D” follow the same type of tiered difficulty (ranging from Piece of Cake to Damn, I’m Good) that makes it tougher to stay alive and kill enemies.

The rise of machine learning combined with the increasing popularity of esports (organized, multiplayer video game competitions that feature professional video gamers gaming against each other with millions of dollars of prize money on the line), may inextricably link AI to gaming and esports.

That said, the most common implementation of AI in esports is in the games themselves. Companies like AI Gaming want to develop smarter AI bots that would compete against each other in an effort to grow smarter and more competitive, while OpenAI, a research lab co-founded by Elon Musk, developed AI that can beat the top 1% of Dota 2 amateurs (though the AI lost a best-of-three match to some of the game’s best players in August 2018).

  • DeepMind’s AlphaGo used some surprising tactics while playing against Lee Sedol. People thought these wouldn’t actually work, but they did. A similar discovery of new tactics and strategies can happen in eSports. Players will have to re-think their every move while playing. Situations like these will give them insight that might not have been possible.
  • CSGO has a ‘6th man’ setup where an observer advises the players on their strategies. A bot can instead replace the ‘6th man’, a form of ‘Augmented advising’. Teams will have to augment the bot’s recommendations into their gameplay. Teams who do this well will be the winners. Since a lot of machine learning algorithms are democratized there won’t be a situation where teams are unfairly matched.
  • StarCraft II is a game with quite a bit of strategic depth. This also makes the game more difficult for newcomers. The presence of an in-game coach would be helpful: it would speed up the process of getting started and decrease the learning time.

Esports, at the end of the day, is a form of sport. People tune in from all over the world to watch their favourite teams play and cheer for them. Only this time, they’ll be rooting for 5 players and a bot.

Game analytics platforms

Shadow.gg

Shadow.gg is a Counter-Strike analytics platform that its creators claim will cause a significant leap in how esports professionals prepare for a match, by giving fast and easy access to a large number of in-game statistics for any match in the platform’s library. Built primarily for teams looking for a competitive edge, the tool aims to help scout opponents, quickly view data, and visualize that data in meaningful ways. It lessens the burden on coaches and analysts to scout demos and lets coaches, analysts, and players focus only on the important rounds.

The core value proposition here is that coaches and players who use this tool will be able to arrive at conclusions about their opponents’ play, and their own play, that are either too time-intensive to reach through basic demo review or simply cannot be reasoned about by trying to estimate data from observing matchplay. Teams can begin identifying trends for teams and players with regard to their tendencies in relation to the economic context, how they utilize grenades, or how they prefer to retake a particular site, to name a few.


Obviously, players still have to hit their shots in-game, and that’s on them; but going into a match armed with detailed information about which way your opponent leans in crucial situations could mean the difference between a comfortable win and a 16:14 loss. The creators hope the value of the tool today, and where they plan to take it over the next year, will become rather self-evident.

NXTAKE — Advanced CSGO Analytics

Built by former daily fantasy sports professionals, NXTAKE positions itself as a leader in esports analytics and broadcast augmentation. The company specializes in advanced analytics, data feeds, and esports prediction models, combining big data and simulation to bring next-level analytics to the world of esports. Its founders have wagered large amounts of capital over the past few years and are now sharing that expertise in a new and exciting industry. The platform can provide real-time analysis coupled with live streams.

Data science in esports


Targeting new gamers using data analytics

One of the best examples of data science in this area is customer “segmentation.”

This is a HIGHLY desired function within digital marketing because it’s the analysis of your existing and potential markets in an effort to better understand customers.

Doing this exercise, you can take in vast amounts of data from dozens of data sources (web, social, email, forums, media listening, etc) and feed it into statistical models to extract customer segments, like “your potential target market for your new game consists of those that classify themselves as hardcore gamers between the ages of 14 and 31 years old, that play RPGs like Skyrim, and that average a GTX 1070 GPU.”
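
As a rough illustration of what such segmentation can look like in code (a hedged sketch with made-up behavioural features, not a description of any real marketing platform's pipeline), k-means clustering can split customers into groups:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features: [age, hours played per week, money spent per month]
customers = np.array([
    [16, 25, 10], [19, 30, 15], [22, 28, 20],   # heavy players, low spend
    [35,  5, 60], [40,  3, 80], [38,  4, 70],   # casual players, high spend
    [27, 15, 35], [29, 12, 40], [31, 14, 30],   # middle group
])

# Ask k-means for three segments
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # segment id assigned to each customer
print(kmeans.cluster_centers_)  # the "typical" customer of each segment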

What you can then do with this information is apply that segment to paid advertising strategies. So, when you start the pre-order push, you can make sure that your digital ads are targeted at the people you were able to isolate into that segment, and not blasted at that CS:GO player who doesn’t like RPGs.

Competitive game pricing

An effective BI system in the gaming industry must be able to gather gamer data from several types of external sources and compare it with data in internal systems to reach conclusions about a customer’s spending patterns, tastes and levels of satisfaction. A large part of the data analyzed in this case may be large volumes of unstructured social-media data.

Improving gameplay experience

Insights from gaming analytics also enable companies to improve the gameplay itself. For example, millions of player records could be analyzed to pinpoint the most likely in-game moments when players quit the game entirely; perhaps a series of quests are too boring or the challenges are too hard/easy based on character level. Identifying these gaming “bottlenecks” is critical to understanding the reasoning & timing behind a game’s churn rate. Gaming Designers and Developers can then re-examine the game’s storylines, quests, and challenges in order to refine the gaming experience and, hopefully, reduce the number of lost subscribers.

Analyzing the devices used by players also helps developers to create gaming experiences that work effectively for their user base. Exploring a dungeon via an iPhone is quite different than doing it using a widescreen attached to a laptop, so developers need to address issues such as screen size, available functionality, navigation, and character interactions. Data analytics empowers companies to address this challenge by modeling and visualizing massive amounts of heterogeneous data.

Game analytics to improve gaming infrastructure

Today, games often have global player bases, so the architecture supporting those users needs to be configured and implemented correctly. Online games are particularly sensitive to network-related metrics, such as ping and lag rates, and these issues are exacerbated during peak gaming times. Again, Big Data analytics enables gaming companies to use server and network data to understand exactly when, and how, their infrastructure is being pushed to its limits. This knowledge enables companies to scale up or down according to player need; in today’s world of cloud-based PaaS/IaaS architectures (where cost is tied to usage), this information can have a dramatic impact on a company’s bottom line.

Analysing competitors

Make a list of games that are using the same theme and some (or all) of your core mechanics. Both released and upcoming. Especially upcoming, because chances are you’ll be judged against them.

Make a basic SWOT analysis for every one of them, but also add an additional field: “How is our game different?” The key word here is “different”, not “better”, so you won’t get caught in wishful thinking like “we’ll have better graphics and better balance”. Why should your target audience consider your game instead of another one? People only have so much time to play.

You can also check geographical distribution and stats for released games on Steam Spy or AppAnnie, but, frankly, it’s not that useful at this stage. You’ll look into it later when deciding on focusing your marketing and localization efforts.

If you do decide to check geo distribution for similar games, don’t trust it too much — the audience research you did previously will be more helpful. Other games might have done something specific to become popular in some countries, like partnering with a local publisher or getting a good video from a local YouTube celebrity.

For example, there aren’t many owners of The Witcher 3 from Poland on Steam despite the game being immensely popular in that country. That’s because most Poles bought the game from CDP.pl or GOG.com instead of going for the much more expensive Steam version.

Conclusion

The gaming industry has a long way to go when it comes to applying full-fledged data science, or to AI bots beating world-class players in complex games like Counter-Strike and Dota 2. In this blog, we looked at how different aspects of data science are applied in the gaming industry. What is clear at this point is the power of AI and the myriad companies looking to harness it. Gaming appears to be a sector ripe for this type of disruption, and companies are getting in early to explore ways to profit from connecting AI developments with esports.

Stay tuned for more blogs!

Top 5 Trends in Data Science


Introduction

Data Science is a discipline that deals with the identification, representation, and extraction of meaningful information from data, which can be collected from different sources and put to use for business purposes.

With an enormous amount of data generated every minute, extracting useful insights is a must for businesses; it helps them stand out from the crowd. Data engineers set up the data storage in order to facilitate data mining and data munging activities. Every organization is chasing profits, but the companies that formulate effective strategies based on insights always win the game in the long run.

In this blog, we will discuss new advancements and trends in the data science industry. Together, these advancements are enabling it to tackle some of the trickiest problems across various businesses.

Top 5 Trends

Analytics and associated data technologies have emerged as core business disruptors in the digital age. As companies began the shift from being data-generating to data-powered organizations in 2017, data and analytics became the centre of gravity for many enterprises. In 2018, these technologies need to start delivering value. Here are the approaches, roles, and concerns that will drive data analytics strategies in the year ahead.

The data science trends for 2018 are largely a continuation of some of the biggest trends of 2017, including Big Data, Artificial Intelligence (AI) and Machine Learning (ML), along with newer technologies like Blockchain, Serverless Computing, Augmented Reality, and others that employ various practices and techniques within the data science industry.


If I am to pick the top 5 data science trends right now (a subjective choice, but I will try to justify it as best I can), I would list them as:

  1. Artificial Intelligence
  2. Cloud Services
  3. AR/VR Systems
  4. IoT Platforms
  5. Big Data

Let us understand each of them in a bit more detail!

Artificial Intelligence

Artificial intelligence (AI) is not new. It has been around for decades. However, due to greater processing speeds and access to vast amounts of rich data, AI is beginning to take root in our everyday lives.

From natural language generation and voice or image recognition to predictive analytics, machine learning, and driverless cars, AI systems have applications in many areas. These technologies are critical to bringing about innovation, providing new business opportunities and reshaping the way companies operate.


Artificial Intelligence is itself a very broad area to explore and study. But there are some components within artificial intelligence which are making quite a buzz around with their applications across business lines. Let us have a look at them one by one.

  1. Natural language Processing

    With advances in computational power and the integration of artificial intelligence, the natural language processing domain has evolved into a whirlwind of innovation. In fact, experts expect the NLP market to swell to an impressive $22.3 billion by 2025. One of the many applications of NLP in business is chatbots. Chatbots demonstrate utility in the customer service realm. These automated helpers can take care of simple frequently asked questions and other lookup tasks. This leaves customer service agents free to devote time to troubleshooting bigger matters that personalize and enhance the customer experience. Chatbots can save valuable time and energy for all members of the value stream. Chatbot technology is poised for considerable growth as speech and language processing tools become more robust by expanding beyond rules-based engines to include neural conversational models.

  2. Deep Learning

    You might think that Deep Learning sounds a lot like Artificial Intelligence, and that’s true to a point. Artificial Intelligence is a machine developed with the capability for intelligent thinking. Deep Learning, on the other hand, is an approach to Machine Learning which uses Artificial Neural Networks to work with the data. Today, there are more Deep Learning business applications than ever; in some cases it is the core offering of the product, as with self-driving cars. Over the past few years, it has been found powering some of the world’s most impressive tech: everything from entertainment media to self-driving cars. Applications of deep learning in business include recommender systems, self-driving cars, image detection, and object classification.

  3. Reinforcement Learning

    The reinforcement learning model involves interaction between two elements: the environment and the learning agent. The learning agent leverages two mechanisms, namely exploration and exploitation. When the agent acts by trial and error, it is termed exploration, and when it acts based on the knowledge gained from the environment, it is referred to as exploitation. The environment rewards the agent for correct actions, which is the reinforcement signal. Leveraging the rewards obtained, the agent improves its knowledge of the environment to select the next action. Artificial agents are now being created to perform tasks as a human would; they have made their presence felt in business, and the use of agents driven by reinforcement learning cuts across industries. Practical applications of reinforcement learning include robots on factory floors, space management in warehouses, dynamic pricing agents, and driving financial investment decisions. A minimal code sketch of the exploration-versus-exploitation idea follows below.
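
Here is that sketch: a hedged, minimal epsilon-greedy multi-armed bandit (a toy illustration of exploration versus exploitation, not a full game-playing agent; the reward values are invented):

import random

# Toy environment: three actions ("arms") with hidden average rewards
true_rewards = [0.2, 0.5, 0.8]
estimates = [0.0, 0.0, 0.0]   # the agent's current knowledge of each action
counts = [0, 0, 0]
epsilon = 0.1                 # 10% of the time: explore; otherwise: exploit

for step in range(1000):
    if random.random() < epsilon:
        action = random.randrange(3)                      # exploration: try anything
    else:
        action = estimates.index(max(estimates))          # exploitation: use current knowledge
    reward = true_rewards[action] + random.gauss(0, 0.1)  # noisy reinforcement signal
    counts[action] += 1
    # Update the running-average estimate for the chosen action
    estimates[action] += (reward - estimates[action]) / counts[action]

print([round(e, 2) for e in estimates])  # should roughly recover 0.2, 0.5, 0.8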

Cloud Services

The complexity in data science is increasing by the day. This complexity is driven by fundamental factors like increased data generation, low-cost storage, and cheap computational power. So, in summary, we are generating far more data, we can store it at a low cost and can run computations and simulations on this data at a low cost!

To tackle this increasing complexity in data science, here is why we need cloud services:

  1. Need to run scalable data science
  2. Cost
  3. The larger ecosystem for machine learning system deployments
  4. Use for building quick prototypes

In the field of cloud services, three major players lead the pack: AWS (Amazon), Azure (Microsoft), and GCP (Google).

Augmented Reality/Virtual Reality Systems

The immersive experiences offered by augmented reality (AR) and virtual reality (VR) are already changing the world around us, and human-machine interaction will improve as research breakthroughs in AR and VR come about. A related claim appears in a Gartner report, Augmented Analytics is the Future of Data and Analytics, published in July 2017: augmented analytics automates data insights through machine learning and natural language processing, enabling analysts to find patterns and prepare smart data that can be easily shared and operationalized. Accessible augmented analytics produces citizen data scientists and makes an organization more agile.

IoT Platforms

The Internet of Things refers to a network of objects, each of which has a unique IP address and can connect to the internet. These objects can be people, animals, or day-to-day devices like your refrigerator and your coffee machine. They can connect to the internet (and to each other) and communicate through this network in ways that have not been thought of before. The data from current IoT pilot rollouts (sensors, smart meters, etc.) can be used to make smart decisions using predictive analytics, e.g. forecasting electricity usage from each smart meter to better plan distribution, forecasting the power output of each wind turbine in a wind farm, or predictive maintenance of machines.

The power of Big Data 

Big data is a term to refer to data sets that are too large or complex for traditional data-processing application software to adequately deal with.


It was already a significant trend in data science in 2017, but there have lately been some advancements in Big Data that have made it a trend in 2018 too. Let us have a look at some of them.

  1. Block Chain

    Data science is a central part of virtually everything — from business administration to running local and national governments. At its core, the subject aims at harvesting and managing data so organizations can run smoothly. For some time now, data scientists have struggled to share, secure and authenticate data integrity. Thanks to bitcoin being heavily hyped, blockchain, the technology that underpins it, caught the attention of data specialists. Blockchain improves data integrity, provides an easy and trusted means of sharing data, and enables real-time analysis and data traceability. With robust security and transparent record keeping, blockchain is set to help data scientists achieve many milestones that were previously considered impossible. Although decentralized digital ledgers are still a nascent technology, the preliminary results from companies experimenting with them, like IBM and Walmart, suggest that they work.

  2. Handling Datastreams

    Stream processing is a Big Data technology. It enables users to query a continuous data stream and detect conditions quickly, within a small window of time after receiving the data. The detection period may vary from a few milliseconds to minutes. For example, with stream processing you can receive an alert by querying a data stream coming from a temperature sensor and detecting when the temperature has reached the freezing point (a toy sketch of this appears after this list). The immense capabilities of streaming data keep it a running trend in Big Data to date.

  3. Apache Spark

    Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that allow data workers to efficiently execute streaming, machine learning or SQL workloads requiring fast, iterative access to datasets. With Spark running on Apache Hadoop YARN, developers everywhere can create applications to exploit Spark’s power, derive insights, and enrich their data science workloads within a single, shared dataset in Hadoop. With Apache releasing new features to the Spark library from time to time (Spark Streaming, GraphX, etc.), it has been able to maintain its hold as a trend in Big Data to date.
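
Here is the toy sketch of the temperature-sensor example from point 2: a plain-Python simulation of my own, not a real stream-processing engine, just to show the idea that the "query" runs as each event arrives rather than after the data has been stored:

import random

def temperature_stream(n=50):
    """Simulated continuous stream of temperature readings in degrees Celsius."""
    for _ in range(n):
        yield round(random.uniform(-2.0, 10.0), 1)

FREEZING_POINT = 0.0

for i, temp in enumerate(temperature_stream()):
    # The condition is checked on each reading as it arrives
    if temp <= FREEZING_POINT:
        print(f"ALERT: reading {i} is {temp} degrees (freezing point reached)")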

Conclusion

This is only the beginning, as data science continues to serve as the catalyst in the changes you are going to experience in business and technology. It is now up to you on how to efficiently adapt to these changes and help your own business flourish.

Stay tuned for more blogs!

Top Data Science Hacks


A data science project requires numerous iterations that are time-consuming. When dealing with numbers and data interpretations, it goes without saying that you have to be quite smart and proactive.

It’s not surprising that iterations can be frustrating if they require regular updates. Sometimes the model is six months old and needs current information; other times you miss out on some data, so the analysis has to be done all over again. In this article, we will focus on ways in which Data Scientists and Business Analysts can increase productivity without spending so much time on unhealthy iterations.

Tips and Tricks for data scientists

Keeping the bigger picture in mind

Long-term goals should be treated as a priority when doing the analysis. Many small issues may crop up, but they shouldn’t overshadow the bigger ones. Be observant in deciding which problems are going to affect the organization on a larger scale. Focus on those bigger problems and look for stable solutions. Data scientists and business analysts have to be visionary to come up with such solutions.

Understanding the problem and keeping the requirements at hand

Data science is not about implementing a fancy or complex algorithm, or doing some complex data aggregation; it is about providing a solution to the problem at hand. Tools like ML, visualization and optimization algorithms are just means through which one can arrive at a suitable solution. Always understand the problem you are trying to solve. One should not jump directly to machine learning or statistics right after getting the data. Analyze what the data is about, and what you need to know and do to arrive at a solution to your problem. Also, always keep an eye on the feasibility of the solution in terms of implementation: a good solution is one which is easily implementable. Always know what you need in order to reach a solution to the problem.

More real-world oriented approach

Data science involves providing solutions to real-world use cases. Hence one should always keep a real-world oriented approach, focusing on the domain or business use case of the problem at hand and the solution to be implemented, rather than looking at it purely from the technical side. The technical aspect focuses on the correctness of the solution, while the business aspect focuses on its implementation and usage. Sometimes you may not need a complex, incomprehensible algorithm to meet your requirements; you may be happier with a simple algorithm which does not give quite as accurate a result, because that accuracy can be traded for comprehensibility. Knowledge of the technical aspect is a must, but it should always serve the business need.

Not everything is ML

Recently, machine learning has seen great advancement in its application to various business problems. With great prediction capabilities, machine learning can solve many complex problems in business scenarios. But one should note that data science is not only about machine learning; machine learning is just a small part of it. Data science is about arriving at a feasible solution for a given problem, so one should also focus on areas like data cleaning and data visualization, and on the ability to explore the data extensively and find relations between its attributes. It is about the ability to crunch out meaningful numbers which matter the most. A good data scientist focuses on all of the above qualities rather than just trying to fit machine learning algorithms onto problem statements.

Programming Languages

It is important to have a grip on at least one programming language widely used in data science, but you should also know a little of another one. Either know R very well and some Python, or Python very well and some R.

Data cleaning and EDA

Exploratory Data Analysis (EDA) is one of the most important steps in the data analysis process. Here, the focus is on making sense of the data in hand: formulating the correct questions to ask of your dataset, working out how to manipulate the data sources to get the required answers, and so on. This is done by taking an elaborate look at trends, patterns, and outliers using visual methods. Cleaning is one of the most complex processes in data science, since almost all data available or extracted for language processing tasks is unstructured, and a highly processed, neatly structured dataset will yield better results than a noisy one. Where possible, try to perform cleaning with simple regular expressions rather than complex tools; for language processing tasks, simple models on well-cleaned data can often give you the best results.
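
Here is a small sketch of the kind of lightweight regex cleaning the paragraph above refers to (the raw string is a toy example of my own, not from any particular dataset):

import re

raw = "Check out https://example.com!!  GREAT game <b>10/10</b>   :)  #gaming"

text = raw.lower()
text = re.sub(r"https?://\S+", " ", text)    # drop URLs
text = re.sub(r"<[^>]+>", " ", text)         # drop leftover HTML tags
text = re.sub(r"[^a-z0-9\s]", " ", text)     # keep only letters, digits and spaces
text = re.sub(r"\s+", " ", text).strip()     # collapse repeated whitespace

print(text)   # "check out great game 10 10 gaming"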

Always open to learning more and more

“Data Science is a journey, not a destination.” This line gives us an insight into how huge the data science domain is and why constant learning is as important as building intelligent models. Practitioners who keep themselves updated with the new tech being developed every day are able to implement and solve business problems faster. With all the resources available on the internet, like MOOCs, one can easily stay updated. Showcasing your skills on your blog or GitHub is another important hack which most of us are unaware of: it not only reinforces your own learning but also makes your work visible to others. As the saying goes, “The man who is too old to learn was probably always too old to learn.”

Evaluating Models and avoiding overfit

Separate the data into two sets, the training set and the testing set, to get a stronger prediction of an outcome. Cross-validation is the most convenient method to analyze numerical data without over-fitting: it examines the out-of-sample fit.

Converting findings into the actions

Again, this might sound like a simple tip, but both beginners and advanced people falter on it. Beginners perform steps in Excel that involve copy-pasting data; for advanced users, any work done through the command line interface might not be reproducible. Similarly, you need to be extra cautious while working with notebooks. You should control the urge to go back and change a previous step that uses a dataset computed later in the flow. Notebooks are very powerful for maintaining a flow; if we do not maintain that flow, things can get very untidy as well.

Taking Rest

When do I work the best? It’s when I give myself a 2–3 hour window to work on a problem or project. You can’t multi-task as a data scientist; you need to focus on a single problem at a time to get the best out of yourself. 2–3-hour chunks work best for me, but you can decide what works for you.

Conclusion

Data science requires continuous learning and is more of a journey than a destination. One always keeps learning more and more about data science, so keep the above tips and tricks in your arsenal to boost your productivity and deliver more value on complex problems that can often be solved with simple solutions! Stay tuned for more articles on data science.

Data Science Interview Questions with Answers

Expertise Critical for Every Data Scientist

https://dimensionless.in/wp-content/uploads/2018/10/Data-Science-topics.pdf

 

The Best Way to Prepare for Interview Questions

Now suppose you read a question about a topic like overfitting. You can read the text and memorize the answer. Articles with this kind of heading (Interview Questions and Answers) are usually constructed that way, with plain text questions and answers. You could follow that route for interview preparation, but it is simply not the right thing to do. I can give you a list of important questions with answers, which is exactly what I will do later in this article.

But you need to understand one thing clearly.

You cannot learn programming and data science from books alone.

You can learn the heading and the words. But the concept will truly be understood only in a practical manner; in a mini-project or in a worked-out example on the computer.

Data science is similar to programming in this regard.

Books are meant to just start your journey.

The real learning begins only when you implement it in code by yourself.

To take an example:

Question from the Interviewer:

“What is cross-validation and why is it important? How does it eliminate overfitting?”

A Good Answer:

“Cross-validation eliminates overfitting by exposing the model to the entire data set in a statistically uniform manner. Overfitting happens when the training set and test sets are not properly selected. If a model like LogisticRegression is trained until the error rate is very small, it may not be able to generalize to the pattern of data found in the test set. Hence the performance of the model would be excellent on the training set, but poor on the test set. This is because the model has overfitted itself to the training data. Thus, when presented with test data, error values increase because the generalization capacity of the model has been decreased and the model cannot discover the patterns of the test data.”

“K-fold Cross Validation prevents this by first dividing the total data into k sections and using one section as the test set and the remaining sections as the training set. We train k models, each time using a different fold as the test set and the remaining folds as the training set. Thus, we cover as many combinations of the training and test set as possible as input data. Finally, we take an average of the results of each model and return that as the output. So, overfitting is eliminated by using the entire data as input, one section (one of the k folds) being left out at a time to use as a test set. A common value for k is 10.”

Question:

“Can you show me how that works by coding it on a 10 by 10 array of integers? In Python?”

Worst Case Answer:

“Ummmmmmmm…..”

 “Sorry sir, I just studied that in a textbook. I am not sure how I could work through that by code.”

(!!!)

 

You Can’t Study Without Implementation

Data science should be studied the way programming is studied: by working at it on a computer, running all the models in your textbook, and finally doing your own mini-project on every topic that could be important. Can you learn to drive a car by reading about it in a book? You need practical experience! Otherwise, all your preparation is meaningless. That is the point I wanted to make.
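
To make the point concrete, here is a minimal sketch of the kind of implementation the interviewer above was fishing for, using scikit-learn's KFold on a toy 10 × 10 integer array (the array, the labels, and the LogisticRegression choice are illustrative assumptions of mine):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(0)
X = rng.randint(0, 100, size=(10, 10))            # toy 10 x 10 array of integers
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])      # toy binary labels, 5 of each class

scores = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])          # train on k-1 folds
    preds = model.predict(X[test_idx])             # test on the held-out fold
    scores.append(accuracy_score(y[test_idx], preds))

print(np.mean(scores))   # average accuracy across the k folds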

Now, having established this, I assume from here on that you are a data scientist in training who has worked through the fundamental details on a computer and is familiar with the basics. You just need the finishing touches on your interview preparation. If that is the case, here are your topics for mini-projects and experiments, and interview questions with answers.

Interview Practice Resources

Python Practice

https://www.testdome.com/d/python-interview-questions/9

This is a site that allows you to sharpen your skills in Python for interviews. There are many more sites like these, all you need to do is Google ‘Python Interview Questions’.

R Practice

https://www.computerworld.com/article/2497143/business-intelligence/business-intelligence-beginner-s-guide-to-r-introduction.html

Many people know Python, but R is not as commonly known. The above tutorial spans 30 pages that you can work through with your R console to learn the basics. Alternatively, you could try Swirl (link given below), which is also highly recommended for beginners.

https://swirlstats.com/ 

Kaggle

Work through Kaggle competitions. No better way to establish yourself in the data science universe.

https://www.kaggle.com/competitions

 

Also, if you have basic data science skills, try your hand with the hands-on Kernels section. Cash prizes awarded every week!

https://www.kaggle.com/kernels

 

Oh, what are kernels? Kaggle Kernels are online Jupyter notebooks that let you run Python and R code interactively in your browser, in the same application, without any local processing. All computation is done on the Kaggle servers.

Top Ten Essential Data Science Questions with Answers

1. What is a normal distribution? And how is it significant in data science?

The normal distribution is a probability distribution characterized by its mean and standard deviation (or variance). Its density looks like a bell, hence it is also referred to as the bell curve; the standard normal distribution has a mean of 0 and a variance of 1. The central limit theorem makes the normal distribution ubiquitous in data science. In essence, it states that the average (or sum) of a large number of independent samples tends towards a normal distribution as the number of samples increases, regardless of the original distribution. This theorem is used nearly everywhere in data science because it gives you an ‘expected’ behaviour for summary statistics of an arbitrary dataset with, say, n = one thousand samples: as n increases, the distribution of the sample mean looks more and more like the bell curve.
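
A quick way to see the central limit theorem in action (a toy simulation of my own, not tied to any dataset in this article) is to average samples drawn from a decidedly non-normal distribution and watch the averages settle into a bell shape:

import numpy as np

rng = np.random.default_rng(42)

# Draw from a uniform distribution (flat, not bell-shaped at all)
raw = rng.uniform(0, 1, size=(10_000, 30))

# The mean of each 30-value sample...
sample_means = raw.mean(axis=1)

# ...is approximately normally distributed: mean near 0.5, small spread
print(sample_means.mean())   # ~0.5
print(sample_means.std())    # ~ sqrt(1/12) / sqrt(30) ≈ 0.053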

2. What do you mean by A/B testing?

An A/B test compares two variants (A and B) of something, such as a feature, a model, or a web page, by exposing each to comparable samples and measuring the rate of success or accuracy in each case. In machine learning, this often tells us which feature should be used to build a model, or which of two candidate models to deploy in the first place. A/B testing is a general concept that can be applied to nearly every system.
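
A bare-bones sketch of an A/B comparison, with illustrative numbers of my own and a two-proportion z-test computed by hand, might look like this:

import math

# Hypothetical results: clicks out of visitors for variants A and B
clicks_a, n_a = 120, 2400    # 5.0% conversion
clicks_b, n_b = 160, 2400    # ~6.7% conversion

p_a, p_b = clicks_a / n_a, clicks_b / n_b
p_pool = (clicks_a + clicks_b) / (n_a + n_b)

# Two-proportion z-test statistic
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
print(round(z, 2))   # about 2.46 here; values above ~1.96 suggest the difference is unlikely to be chance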

3. What are eigenvalues and eigenvectors?

The eigenvectors of a square matrix are the directions that the linear transformation represented by that matrix leaves unchanged except for scaling; the eigenvalues are the corresponding scaling factors, i.e. the strength or degree of the transformation along each eigenvector. In data science they most often appear in Principal Component Analysis, where they are calculated from the correlation or covariance matrix. See Linear Algebra by Gilbert Strang (available as an online ebook) for more details on their computation.
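
For a hands-on feel, NumPy can compute eigenvalues and eigenvectors of a small matrix directly (a toy 2 × 2 matrix of my own, not tied to any dataset in this article):

import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)        # eigenvalues 5 and 2 (order may vary)
print(eigenvectors)       # columns are the corresponding eigenvectors

# Check the defining property A v = lambda v for the first pair
v = eigenvectors[:, 0]
print(np.allclose(A @ v, eigenvalues[0] * v))   # True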

4. How do the recommender systems in Amazon and Netflix work? (research paper pdf)

Recommender systems in Amazon and Netflix are considered top-secret and are usually described as black boxes, but their internal mechanisms have been partially worked out by researchers. A recommender system, predated by the expert-system models of the 90s, is used to generate rules or ‘explanations’ as to why a product might be more attractive to user X than to user Y. Complex algorithms are used, with many inputs such as past purchase history and genre preferences, to generate several types of explanations: functional, intentional, scientific and causal. These explanations, which can also be called user-invoked, automatic or intelligent, are tuned using metrics such as user satisfaction, user rating, trust, reliability, effectiveness and persuasiveness. The exact algorithms remain industry secrets, similar to the way Google keeps the algorithms behind PageRank secret and constantly updated (500-600 times a year in the case of Google).

5. What is the probability of an impossible event, a past event and what is the range of a probability value?

An impossible event E has P(E) = 0. Probabilities take on values only in the closed interval [0, 1]. An event from the past has already occurred, so its probability is P(E) = 1.

6. How do we treat missing values in datasets?

A categorical missing value is replaced with a default or the most frequent category. A continuous missing value is usually imputed using a measure of central tendency such as the mean, median or mode (or by sampling from an assumed distribution such as the normal). If a feature has less than 20% of its data available, the usual recommendation is to delete that feature from the model.
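
A small pandas sketch of those strategies (toy data of my own) could look like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":  [25, 32, np.nan, 41, 29],                        # continuous feature
    "city": ["Pune", np.nan, "Mumbai", "Pune", "Delhi"],      # categorical feature
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 7],    # <20% of values available
})

# Continuous: fill with a measure of central tendency, e.g. the median
df["age"] = df["age"].fillna(df["age"].median())

# Categorical: fill with a default / most frequent value
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Feature with too little data available: drop it entirely
df = df.drop(columns=["mostly_missing"])

print(df)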

7. Which is faster, Python or R?

Python is considered moderately fast. It is an interpreted rather than a compiled language, so C++ is much faster for most purposes, but the main Python implementation (CPython) is written in C, and most numerical libraries push the heavy work down to compiled code. R was designed by statisticians rather than computer scientists and is generally slower than Python for comparable tasks.

8. What is Deep Learning and why is it such a popular buzzword in the machine learning field right now?

For many years, until around 2006, backpropagation neural networks had just three layers – one input, one hidden and one output layer. The problem with this model was that, since it used gradient descent and the backpropagation algorithm, the neural nets had a tendency to get stuck in poor solutions in the hyperplane defined by the input features, and so could only find a partially optimal solution for many applications. In 2006, Geoffrey Hinton et al. published research showing that deep, multilayer neural networks could be trained effectively; later work also argued that, in thousands of dimensions, poor local minima are statistically rare (saddle points are far more common). Deep learning refers to neural nets with 3 or more (even 10 or many more) hidden layers. They require more computational power, which is one of the reasons the machine learning community started using GPUs to implement deep learning NNs. Since 2010-2012, deep learning has been applied to nearly every technology domain, and the models have been highly accurate and successful in areas from speech recognition to playing the game of Go.

9. What is the difference between machine learning and deep learning?

For more details on that, I suggest you go through this excellent article, given on the following link on our blog below:

https://dimensionless.in/machine-learning-and-deep-learning-differences/

10. What is Reinforcement Learning?

For an excellent explanation of reinforcement learning that is both educational and fun to read, please visit the following page, also on our blog :

https://dimensionless.in/reinforcement-learning-super-mario-alphago/

Enjoy Your Work!

To finally sum up, I have to say, enjoy your work. You will be much better at what you love than something that is glamorous but not to your taste. Artificial Intelligence, Data Science, Software Development and Machine Learning are very much in my preferred line of work, and my hope is, that it will be in yours too. Don’t just read the text, work out the code on your systems or on Kaggle. That is how to best prepare for interview questions. Only practice at your computer (preferably on Kaggle) will give you true confidence on the day of your interview. That is true expertise – practice making perfect. Enjoy data science!