The Nearest Neighbours problem is an optimization problem that was initially formulated in the technical literature by Donald Knuth. The key idea is to find out which class a given point in the search space belongs to, whether the setting is binary, multiclass, continuous, unsupervised, or semi-supervised. Sounds mathematical? Let’s make it simple.
Imagine you are a shopkeeper who sells online, and you are trying to group your customers in such a way that the products recommended for them appear on the same page. A customer in India who buys a laptop will probably also buy a mouse, a mouse-pad, speakers, laptop sleeves, laptop bags, and so on. So you are trying to place this customer into a category. A class. How do you do this if you have millions of customers and over 100,000 products? Manual programming would not be the way to go. Here, the nearest neighbours method comes to the rescue.
You can group your customers into classes (e.g. Laptop-Buyer, Gaming-Buyer, New-Mother, Children~10-years-old) and, based upon what other people in those classes have bought in the past, show them the items that they are most likely to buy next, making their online shopping experience much easier and more streamlined. How will you choose that? By grouping your customers into classes and, when a new customer comes, deciding which class they belong to and showing them the products relevant for that class.
This is the essence of the ML algorithm that platforms such as Amazon and Flipkart use for every customer. Their algorithms are much more complex, but this is their essence.
The Nearest Neighbours topic can be divided into the following sub-topics:
Brute-Force Search
KD-Trees
Ball-Trees
K-Nearest Neighbours
Out of all of these, K-Nearest Neighbours (usually abbreviated to KNN) is by far the most commonly used.
K-Nearest Neighbours (KNNs)
A KNN algorithm is very simple, yet it can be used for some very complex applications and arcane dataset distributions. It can be used for binary classification, multi-class classification, regression, clustering, and even for creating new algorithms that are state-of-the-art research techniques (e.g. https://www.hindawi.com/journals/aans/2010/597373/, a research paper on a fusion of KNNs and SVMs). Here, we will describe one application of KNNs, binary classification, on an extremely interesting dataset from the UCI Repository (sonar.mines-vs-rocks).
Implementation
The algorithm of a KNN ML model is given below:
[Figure: the K-Nearest Neighbours algorithm]
Again, mathematical! Let’s break it into small steps one at a time:
How the Algorithm Works
This explanation is for supervised learning binary classification.
Here we have two classes. We’ll call them A and B.
So the dataset is a collection of values which belong either to class A or class B.
A visual plot of the (arbitrary) data might look something like this:
Now, look at the star data point in the centre. To which class does it belong? A or B?
The answer? It varies according to the hyperparameters we use. In the above diagram, k is a hyperparameter: a setting that we choose ourselves rather than one the model learns from the data.
Hyperparameters significantly affect the output of a machine learning (ML) algorithm, so they need to be tuned (set to the right values).
The algorithm then finds the ‘k’ points closest to the new point. The output is shown above for k = 3 and for k = 6 (k being the number of nearest neighbouring points that vote on which class the new point belongs to).
Finally, we return as output the class that is most common among those k neighbours. Closeness can be measured in various ways, Euclidean distance among others.
This is how the K Nearest Neighbours algorithm works in principle. As you can see, visualizing the data is a big help to get an intuitive picture of what the k values should be.
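The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the scikit-learn implementation used later: the function name knn_predict and the toy points are made up for this example, and Euclidean distance is assumed as the measure of closeness.

```python
import math
from collections import Counter

def knn_predict(points, labels, query, k=3):
    """Return the majority class among the k training points closest to query."""
    # Euclidean distance from the query to every training point
    distances = [(math.dist(p, query), label) for p, label in zip(points, labels)]
    # Keep the k nearest neighbours
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbours' class labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (invented for illustration): three points of class A, four of class B
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
          (6.0, 6.0), (7.0, 7.0), (6.5, 8.0), (5.0, 5.5)]
labels = ["A", "A", "A", "B", "B", "B", "B"]
star = (3.0, 3.0)  # the new point we want to classify

pred_k3 = knn_predict(points, labels, star, k=3)  # only the three closest points vote
pred_k7 = knn_predict(points, labels, star, k=7)  # every point votes
print(pred_k3, pred_k7)
```

With these toy points the three nearest neighbours all belong to class A, while a vote over all seven points goes to class B, which is why the choice of k matters.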
Now, let’s see the K-Nearest-Neighbours Algorithm work in practice.
Note: This algorithm is powerful and highly versatile. It can be used for binary classification, multi-class classification, regression, clustering, and so on. Many use-cases are available for this algorithm which is quite simple but remarkably powerful, so make sure you learn it well so that you can use it in your projects.
Obtain the Data and Preprocess it
We shall use the data from the UCI Repository, available at the following link:
This data is a set of 207 sonar underwater readings by a submarine that have to be classified as rocks or underwater mines. Save the CSV file in the same directory as your Python source file and perform the following operations:
Import the required packages first:
import numpy as np
import pandas as pd
import scipy as sp
from datetime import datetime
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
Read the CSV dataset into your Python environment and check out the top 5 rows using the head() Pandas DataFrame function.
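The reading step is not shown here, so the following is only a sketch of what it might look like. It uses a tiny in-memory stand-in with invented values and three numeric columns instead of 60. For the real file you would pass its path to pd.read_csv instead. The sketch assumes pandas' default behaviour of treating the first row as the header, which would explain why the label column is later accessed as df.R.

```python
import io
import pandas as pd

# Made-up stand-in for the real CSV: numeric readings followed by a class letter
sample_csv = io.StringIO(
    "0.02,0.04,0.43,R\n"
    "0.05,0.08,0.19,R\n"
    "0.03,0.10,0.28,M\n"
)

# By default pandas uses the first row as column names,
# so the last column ends up being called "R"
df = pd.read_csv(sample_csv)

print(df.head())   # top rows (head() shows at most 5 by default)
print(df.shape)    # rows x columns, not counting the header row
```

If you would rather keep the first row as data, passing header=None to read_csv is the usual option, but the label column then has to be selected by position rather than by name.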
Now, the last column is a letter. We need to encode it into a numerical value. For this, we can use LabelEncoder, as below:
# Inputs (data values): sonar readings from a submarine. Cool!
X = df.values[:, 0:-1].astype(float)
# Classification target (the column named R)
target = df.R
# Convert classes M (mine) and R (rock) to numbers, since they're categorical values
le = LabelEncoder()
# Learn the two class labels, then do the conversion
le.fit(["R", "M"])
y = le.transform(target)
Now have a look at your target dataset. R (rock) and M (mine) have been converted into 1 and 0 respectively.
Execute the train_test_split partition function. This splits the inputs into 4 separate numpy arrays. We can control how the input data is split using the test_size or train_size parameters. Here the test_size parameter is set to 0.3. Thus, 30% of the data goes into the test set and the remaining 70% (the complement) into the training set. We train (fit) the ML model on the training arrays and see how accurate our model is on the test set. By default, test_size is set to 0.25 (a 25%/75% split). The sampling is randomized by default, so different results appear on each run. Setting random_state to a fixed value (any fixed value) makes sure that the same split is obtained every time we execute the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
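To see what the split produces, here is a small sketch on random stand-in data with the same shape as the inputs described above (207 rows, 60 features). The names X_demo and y_demo are invented for this example and the values are not the sonar readings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data: 207 rows, 60 features, binary labels
rng = np.random.default_rng(0)
X_demo = rng.random((207, 60))
y_demo = rng.integers(0, 2, size=207)

# Same call twice with the same random_state
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # sizes of the training and test inputs

# A fixed random_state makes the two splits identical
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```

Dropping random_state (or changing its value) would shuffle the rows differently, which is why scores can vary from run to run.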
Fit the KNN classifier to the dataset.
#Train kneighbors classifier
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors = 5, metric = "minkowski", p = 1)
# Fit the model
clf.fit(X_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=1,
weights='uniform')
For now, it is all right (at this level) to leave the defaults as they are. The KNeighborsClassifier output has two values that you do need to know: metric and p. Here we want the Manhattan distance, which is specified by metric = "minkowski" and p = 1. The Manhattan distance is the distance between two points measured along axes at right angles. In a plane with p1 at (x1, y1) and p2 at (x2, y2), it is |x1 - x2| + |y1 - y2|. (Source: https://xlinux.nist.gov/dads/HTML/manhattanDistance.html)
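As a quick check of the formula, the sketch below computes the Manhattan distance by hand and compares it with SciPy's Minkowski distance for p = 1. The helper name manhattan and the two sample points are made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import minkowski

def manhattan(a, b):
    """Sum of absolute coordinate differences (Minkowski distance with p = 1)."""
    return float(np.sum(np.abs(np.asarray(a) - np.asarray(b))))

p1 = (1.0, 2.0)
p2 = (4.0, 6.0)

d_manhattan = manhattan(p1, p2)          # |1 - 4| + |2 - 6| = 7
d_minkowski_p1 = minkowski(p1, p2, p=1)  # same distance via SciPy
d_euclidean = minkowski(p1, p2, p=2)     # p = 2 gives the straight-line distance, 5
print(d_manhattan, d_minkowski_p1, d_euclidean)
```

Changing p from 1 to 2 in KNeighborsClassifier switches the classifier from Manhattan to Euclidean distance in the same way.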
Output the statistical scoring of this classification model.
Accuracy on the test set was 82%. Not bad, since my first implementation used a random forest classifier and its top score was just 72%!
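The scoring code itself is not shown above, so here is one way the scoring step could look. It is only a sketch: it runs on synthetic data from make_classification rather than the sonar file, so its numbers will not match the 82% reported here, and all the variable names are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the sonar data (NOT the real readings)
X_syn, y_syn = make_classification(
    n_samples=207, n_features=60, n_informative=10, random_state=0
)
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0
)

# Same classifier settings as in the article: 5 neighbours, Manhattan distance
model = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=1)
model.fit(X_train_s, y_train_s)
y_pred_s = model.predict(X_test_s)

acc = accuracy_score(y_test_s, y_pred_s)   # fraction of correct predictions
cm = confusion_matrix(y_test_s, y_pred_s)  # rows: true class, columns: predicted class
print(f"Accuracy: {acc:.2f}")
print(cm)
print(classification_report(y_test_s, y_pred_s))  # precision, recall, F1 per class
```

On the real data you would pass y_test and clf.predict(X_test) to the same three functions.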
The entire program as source code in Python is available here as a downloadable sonar-classification.txt file (rename *.txt to *.py and you’re good to go.):
K-Nearest-Neighbours is a powerful algorithm to have in your machine learning classification arsenal. It is simple enough that many classification projects try KNN first as a baseline. Use it, learn it in depth, and it will be useful to you throughout your data science career. I highly recommend the Wikipedia article on the algorithm, since it covers many applications of KNNs and much more.
Finally, Understand the Power of Machine Learning
Imagine trying to classify these sonar readings, with 60 features each, without machine learning. You would have to load 207 × 61 ≈ 12.6k values. Then you would have to develop an algorithm by hand to analyze the data!
Scikit-Learn, TensorFlow, Keras, PyTorch, AutoKeras bring such fantastic abilities to computers with respect to problems that could not be solved in the past before ML came along.
And this is just the beginning!
Automation is the Future
As ML and AI applications take root in our world more and more, humanity will be replaced by ‘intelligent’ software programs that perform operations like a human. We have chatbots in many companies already. Self-driving cars daily push the very limits of what we think a machine can do. The only question is, will you be on the side that is being replaced or will you be on the new forefront of technological progress? Get reskilled. Or just, start learning! Today, as soon as you can.
Electronic video gaming has extended from being a hobby into a serious sport and business. Earlier this year, eSports officially became a medal event in the 2022 Asian Games. According to data analytics expert Andrew Pearson, the rise of eSports presents exciting opportunities in data analytics and marketing.
There’s been an explosive growth in esports popularity over recent years, fuelled by games specifically designed with online competition in mind. Blizzard’s Overwatch is a case in point. When the Overwatch League debuted in January 2018, 415,000 viewers tuned in to watch.
The stakes are high. Each team in the Overwatch League stumped up $20 million (£14.4 million) for a city franchise. Participating gamers enjoy $50,000 (£36,000) salaries while competing for a prize pool totaling a cool $3.5 million (£2.5 million).
AI and ESPORTS
Just as data analytics is helping golfers, athletes, F1 teams, football clubs and cricketers improve their performance, esports is well-placed to follow suit. As with any sport, winning doesn’t just hinge on skill, dedication and luck. It’s often determined by strategy and the analysis of past performance. The secrets to success lie in data and esports is overflowing with it.
We can divide the idea of AI and esports into 3 different aspects or perspectives: AI playing esports itself; game analytics platforms that provide insights and details about players and their gaming behaviour and tactics; and lastly, data science in the gaming industry to manage the business side of games as products.
Let us have a look at each of the aspects and discuss them in detail one by one!
ESPORTS by AI
Gamers have been pitting their wits and skill against computers since the earliest days of video games. The levels of difficulty were pre-programmed, and at a certain point in the game, the computer was simply unbeatable by all but the most gifted gamers.
Over time, the concept of difficulty levels evolved. For example, “Madden” NFL Football games have four different levels (ranging from Rookie to All-Madden) that make running plays more difficult, while first-person shooter (FPS) games like “Duke Nukem 3D” follow the same type of tiered difficulty (ranging from Piece of Cake to Damn, I’m Good) that makes it tougher to stay alive and kill enemies.
The rise of machine learning combined with the increasing popularity of esports (organized, multiplayer video game competitions that feature professional video gamers gaming against each other with millions of dollars of prize money on the line), may inextricably link AI to gaming and esports.
That said, the most common implementation of AI in esports is in the games themselves. Companies like AI Gaming want to develop smarter AI bots that would compete against each other in an effort to grow smarter and more competitive, while OpenAI, a research lab co-founded by Elon Musk, developed AI that can beat the top 1% of Dota 2 amateurs (though the AI lost a best-of-three match to some of the game’s best players in August 2018).
DeepMind’s AlphaGo used some surprising tactics while playing against Lee Sedol. People thought these wouldn’t actually work, but they did. A similar discovery of new tactics and strategies can happen in eSports. Players will have to re-think their every move while playing. Situations like these will give them insight that might not have been possible.
CSGO has a ‘6th man’ setup where an observer advises the players on their strategies. A bot could replace the ‘6th man’, as a form of ‘augmented advising’. Teams will have to incorporate the bot’s recommendations into their gameplay, and teams who do this well will be the winners. Since a lot of machine learning algorithms are democratized, there won’t be a situation where teams are unfairly matched.
As I mentioned earlier, StarCraft II is a game with quite a bit of strategic depth, which also makes it more difficult for newcomers. The presence of an in-game coach would be helpful: it would speed up the process of getting started on the game and decrease the learning time.
ESports at the end of the day is a form of sports. People tune in from all over the world to watch their favourite teams play and cheer for them. Only this time, they’ll be rooting for 5 players and a bot.
Game analytics platforms
Shadow GG
Shadow.gg is a Counter-Strike analytics platform that, its creators claim, will cause a significant leap in how esports professionals prepare for a match by giving fast and easy access to a large number of in-game statistics for any match in the platform’s library. Built primarily for teams looking for a competitive edge, the tool aims to help scout opponents, quickly view data, and visualize that data in meaningful ways. It lessens the burden on coaches and analysts to scout demos and lets your coaches, analysts, and players focus on only the important rounds.
The core value proposition here is that coaches and players that use this tool will be able to arrive at conclusions about their opponents’ play, and their own play, that is either too time intensive to arrive at through basic demo review, or simply can’t be reasoned about by trying to estimate data from observing matchplay. We can begin identifying trends for teams and players with regards to their tendencies in relation to the economic context, or how they utilize grenades, or how they prefer to retake a particular site, to name a few.
Obviously players still have to hit their shots in-game, and that’s on them; but going into a match armed with detailed information about which way your opponent leans in crucial situations could mean the difference between a comfortable win or a 16:14 loss; so we hope the value of the tool today and where we plan to take it over the next year will become rather self-evident.
NXTAKE — Advanced CSGO Analytics
Built by former daily fantasy sports professionals, NXTAKE is a leader in esports analytics and broadcast augmentation. The company specializes in advanced analytics, data feeds, and esports prediction models. NXTAKE combines big data and simulation to bring next-level analytics to the world of esports. Together, its founders have wagered enormous amounts of capital over the past few years and are sharing their expertise in a new and exciting industry. It can provide real-time analysis, coupled with live streams.
Data science in esports
Targeting new gamers using data analytics
One of the best examples of data science in this area is customer “segmentation.”
This is a HIGHLY desired function within digital marketing because it’s the analysis of your existing and potential markets in an effort to better understand customers.
Doing this exercise, you can take in vast amounts of data from dozens of data sources (web, social, email, forums, media listening, etc) and feed it into statistical models to extract customer segments, like “your potential target market for your new game consists of those that classify themselves as hardcore gamers between the ages of 14 and 31 years old, that play RPGs like Skyrim, and that average a GTX 1070 GPU.”
What you can then do with this information is apply that segment to paid advertising strategies. So, when you start the pre-order push, you can make sure that your digital ads are targeted at the people that you were able to isolate to that segment, and not blasted to that CS: GO player who doesn’t like RPGs.
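As a rough illustration of how segments can be extracted with a statistical model, the sketch below clusters made-up customer records with k-means. Everything in it is an assumption for the example: the three features (age, weekly hours played, share of playtime spent in RPGs), the synthetic data, and the choice of k-means with two clusters. Real segmentation work would draw on many more data sources.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical customers: columns are age, weekly hours played, RPG share of playtime
casual = np.column_stack([
    rng.normal(40, 5, 100), rng.normal(3, 1, 100), rng.normal(0.1, 0.05, 100)
])
hardcore = np.column_stack([
    rng.normal(22, 4, 100), rng.normal(25, 5, 100), rng.normal(0.7, 0.1, 100)
])
customers = np.vstack([casual, hardcore])

# Put the features on a common scale so no single one dominates the distances
scaled = StandardScaler().fit_transform(customers)

# Ask k-means for two segments
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
segments = kmeans.fit_predict(scaled)

# Describe each segment by its size and its average feature values
for seg in range(2):
    members = customers[segments == seg]
    print(seg, len(members), members.mean(axis=0).round(2))
```

The per-segment averages are what you would turn into a description such as "young players with many weekly hours who mostly play RPGs", and that description is what the ad targeting is built on.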
Competitive game pricing
An effective BI system in the gaming industry must be able to gather gamer data from several types of external sources and compare that data with data in internal systems to arrive at conclusive decisions about a customer’s spending patterns, tastes and levels of satisfaction. A large part of the data analyzed in this case may be large volumes of unstructured social-media data.
Improving gameplay experience
Insights from gaming analytics also enable companies to improve the gameplay itself. For example, millions of player records could be analyzed to pinpoint the most likely in-game moments when players quit the game entirely; perhaps a series of quests are too boring or the challenges are too hard/easy based on character level. Identifying these gaming “bottlenecks” is critical to understanding the reasoning & timing behind a game’s churn rate. Gaming Designers and Developers can then re-examine the game’s storylines, quests, and challenges in order to refine the gaming experience and, hopefully, reduce the number of lost subscribers.
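A minimal sketch of such a bottleneck analysis is shown below. The table layout (one row per player, with the last quest reached and a flag for whether the player then quit for good) and all the values are invented for illustration.

```python
import pandas as pd

# Hypothetical log: last quest each player reached, and whether they quit afterwards
sessions = pd.DataFrame({
    "player_id":  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "last_quest": ["Q1", "Q2", "Q2", "Q3", "Q3", "Q3", "Q3", "Q4", "Q4", "Q4"],
    "quit":       [0, 0, 1, 1, 1, 1, 0, 0, 0, 1],
})

# Share of players who quit at each quest, highest first
quit_by_quest = (
    sessions.groupby("last_quest")["quit"]
    .agg(players="count", quit_rate="mean")
    .sort_values("quit_rate", ascending=False)
)
print(quit_by_quest)

# The quest with the highest quit rate is the first candidate for a redesign
bottleneck = quit_by_quest.index[0]
print(bottleneck)
```

In this toy data three of the four players who stopped at Q3 quit the game, so Q3 is flagged. With real data you would also want enough players per quest for the rates to be meaningful.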
Analyzing the devices used by players also helps developers to create gaming experiences that work effectively for their user base. Exploring a dungeon via an iPhone is quite different than doing it using a widescreen attached to a laptop, so developers need to address issues such as screen size, available functionality, navigation, and character interactions. Data analytics empowers companies to address this challenge by modeling and visualizing massive amounts of heterogeneous data.
Game analytics to improve gaming infrastructure
Today, games sometimes have global player bases, so the architecture supporting those users needs to be configured and implemented correctly. Online games are particularly sensitive to network-related metrics, such as ping and lag rates, and these issues are exacerbated during peak gaming times. Again, Big Data analytics enables gaming companies to use server and network data to understand exactly when, and how, their infrastructure is being pushed to its limits. This knowledge enables companies to scale up or down according to player need; in today’s world of cloud-based PaaS/IaaS architectures (where cost is tied to usage), this information can have a dramatic impact on a company’s bottom line.
Analysing competitors
Make a list of games that are using the same theme and some (or all) of your core mechanics. Both released and upcoming. Especially upcoming, because chances are you’ll be judged against them.
Make a basic SWOT analysis for every one of them, but also add an additional field: “How is our game different?” The key word here is “different”, not “better”, so you won’t get caught in wishful thinking such as “we’ll have better graphics and better balance”. Why should your target audience consider your game instead of another one? People only have so much time to play.
You can also check geographical distribution and stats for released games on Steam Spy or AppAnnie, but, frankly, it’s not that useful at this stage. You’ll look into it later when deciding on focusing your marketing and localization efforts.
If you decide to check geo distribution for similar games, don’t trust it too much: the audience research you did previously will be more helpful. Other games might’ve done something specific to become popular in some countries, like partnering with a local publisher or getting a good video from a local YouTube celebrity.
For example, there aren’t many owners of The Witcher 3 from Poland on Steam despite the game being immensely popular in that country. That’s because most Poles bought the game from CDP.pl or GOG.com instead of going for the much more expensive Steam version.
Conclusion
The gaming industry has a long way to go when it comes to applying full-fledged data science, or to AI bots beating world-class players in complex games like Counter-Strike and Dota. In this blog, we looked at how different aspects of data science are applied in the gaming industry. What is clear at this point is the power of AI and the myriad companies looking to harness it. Gaming appears to be a sector ripe for this type of disruption, and companies are getting in early to explore ways to profit from connecting AI developments with esports.
Data Science is a field of study that deals with the identification, representation, and extraction of meaningful information from data. The data can be collected from different sources and used for business purposes.
With an enormous amount of data being generated each minute, extracting useful insights is a must for businesses. It helps them stand out from the crowd. Data engineers set up the data storage in order to facilitate data mining and data munging activities. Every organization is chasing profits, but the companies that formulate effective strategies based on insights win the game in the long run.
In this blog, we will discuss new advancements or trends in the data science industry. These advancements are, in turn, enabling it to tackle some of the trickiest problems across various businesses.
Top 5 Trends
Analytics and associated data technologies have emerged as core business disruptors in the digital age. As companies began the shift from being data-generating to data-powered organizations in 2017, data and analytics became the centre of gravity for many enterprises. In 2018, these technologies need to start delivering value. Here are the approaches, roles, and concerns that will drive data analytics strategies in the year ahead.
The Data Science Trends for 2018 are largely a continuation of some of the biggest trends of 2017, including Big Data, Artificial Intelligence (AI), and Machine Learning (ML), along with some newer technologies like Blockchain, Serverless Computing, Augmented Reality, and others that employ various practices and techniques within the Data Science industry.
If I am to pick the top 5 data science trends right now (which can be very subjective, but I will try to justify my choices), I will list them as:
Artificial Intelligence
Cloud Services
AR/VR Systems
IoT Platforms
Big Data
Let us understand each of them in a bit more detail!
Artificial Intelligence
Artificial intelligence (AI) is not new. It has been around for decades. However, due to greater processing speeds and access to vast amounts of rich data, AI is beginning to take root in our everyday lives.
From natural language generation and voice or image recognition to predictive analytics, machine learning, and driverless cars, AI systems have applications in many areas. These technologies are critical to bringing about innovation, providing new business opportunities and reshaping the way companies operate.
Artificial Intelligence is itself a very broad area to explore and study. But there are some components within artificial intelligence which are creating quite a buzz with their applications across business lines. Let us have a look at them one by one.
Natural language Processing
With advances in computational power and the integration of artificial intelligence, the natural language processing domain has evolved into a whirlwind of innovation. In fact, experts expect the NLP market to swell to an impressive $22.3 billion by 2025. One of the many applications of NLP in business is chatbots. Chatbots demonstrate utility in the customer service realm. These automated helpers can take care of simple frequently asked questions and other lookup tasks. This leaves customer service agents free to devote time to troubleshooting bigger matters that personalize and enhance the customer experience. Chatbots can save valuable time and energy for all members of the value stream. Chatbot technology is poised for considerable growth as speech and language processing tools become more robust by expanding beyond rules-based engines to include neural conversational models.
Deep Learning
You might think that Deep Learning sounds a lot like Artificial Intelligence, and that’s true to a point. Artificial Intelligence is the broad field of building machines with the capability for intelligent behaviour. Deep Learning, on the other hand, is an approach to Machine Learning that uses Artificial Neural Networks to work with the data. Today, there are more Deep Learning business applications than ever. In some cases, it is the core offering of the product, as with self-driving cars. Over the past few years, it has come to power some of the world’s most powerful tech: everything from entertainment media to self-driving cars. Some of the applications of deep learning in business include recommender systems, self-driving cars, image detection, and object classification.
Reinforcement Learning
The reinforcement learning model involves interaction between two elements: the environment and the learning agent. The learning agent uses two mechanisms, namely exploration and exploitation. When the learning agent acts by trial and error, it is exploring; when it acts based on the knowledge gained from the environment, it is exploiting. The environment rewards the agent for correct actions, and this reward is the reinforcement signal. Using the rewards obtained, the agent improves its knowledge of the environment to select the next action. Artificial agents are now being created to perform tasks as a human would. These agents have made their presence felt in businesses, and agents driven by reinforcement learning are used across industries. Some of the practical applications of reinforcement learning include robots in factories, space management in warehouses, dynamic pricing agents, and financial investment decisions.
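The exploration/exploitation loop can be illustrated with a small epsilon-greedy agent on a multi-armed bandit. This is only one simple instance of reinforcement learning, chosen here for brevity: the function name run_bandit, the three reward means, and the 10% exploration rate are assumptions made for the example.

```python
import random

def run_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: explore with probability epsilon, otherwise exploit."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # how often each action was taken
    estimates = [0.0] * n_arms   # the agent's current estimate of each action's reward
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # exploration: try a random action
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploitation
        # The environment answers with a noisy reward (the reinforcement signal)
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental average: move the estimate towards the observed reward
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates

counts, estimates = run_bandit([0.2, 0.5, 0.9])
best_arm = max(range(3), key=lambda a: counts[a])
print(counts, [round(e, 2) for e in estimates], best_arm)
```

Over many steps the agent ends up choosing the action with the highest average reward most of the time, while the occasional random choice keeps its estimates of the other actions from going stale.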
Cloud Services
The complexity in data science is increasing by the day. This complexity is driven by fundamental factors like increased data generation, low-cost storage, and cheap computational power. So, in summary, we are generating far more data, we can store it at a low cost and can run computations and simulations on this data at a low cost!
To tackle the increasing complexities in data science, here is why we need cloud services:
Need to run scalable data science
Cost
The larger ecosystem for machine learning system deployments
Use for building quick prototypes
In the field of cloud services, 3 major players lead the pack: AWS (Amazon), Azure (Microsoft), and GCP (Google).
Augmented Reality/Virtual Reality Systems
The immersive experience of augmented reality (AR) and virtual reality (VR) is already changing the world around us. Human-machine interaction will improve as research breakthroughs in AR and VR come about. This is a claim made in a Gartner report, Augmented Analytics Is the Future of Data and Analytics, published in July 2017. Augmented analytics automates data insights through machine learning and natural language processing, enabling analysts to find patterns and prepare smart data that can be easily shared and operationalized. Accessible augmented analytics produces citizen data scientists and makes an organization more agile.
IoT Platforms
The Internet of Things (IoT) refers to a network of objects, each of which has a unique IP address and can connect to the internet. These objects can be people, animals, and day-to-day devices like your refrigerator and your coffee machine. They can connect to the internet (and to each other) and communicate through this network in ways which have not been thought of before. The data from current IoT pilot rollouts (sensors, smart meters, etc.) will be used to make smart decisions using predictive analytics, e.g. forecasting electricity usage from each smart meter to better plan distribution, forecasting the power output of each wind turbine in a wind farm, or predictive maintenance of machines.
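To make the smart-meter example concrete, here is a deliberately naive sketch that forecasts each meter's next reading as the average of its last few readings. The meter IDs, readings and the moving-average method are all assumptions for illustration. Real forecasting would account for seasonality, weather and so on.

```python
import pandas as pd

# Hypothetical hourly readings (kWh) from two smart meters
readings = pd.DataFrame({
    "meter_id": ["m1"] * 6 + ["m2"] * 6,
    "kwh": [1.0, 1.2, 1.1, 1.3, 1.2, 1.4,
            0.5, 0.4, 0.6, 0.5, 0.7, 0.6],
})

def forecast_next(kwh, window=3):
    """Naive forecast: the mean of the last `window` readings."""
    return kwh.tail(window).mean()

# One forecast per meter
forecasts = readings.groupby("meter_id")["kwh"].apply(forecast_next)
print(forecasts)
```

Summing such per-meter forecasts gives a first estimate of the load a distributor has to plan for.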
The power of Big Data
Big data is a term to refer to data sets that are too large or complex for traditional data-processing application software to adequately deal with.
It was a significant trend in data science in 2017, and recent advancements in Big Data have made it a trend in 2018 too. Let us have a look at some of them.
Blockchain
Data science is a central part of virtually everything, from business administration to running local and national governments. At its core, the subject aims at harvesting and managing data so organizations can run smoothly. For some time now, data scientists have struggled to share and secure data and to authenticate its integrity. Thanks to the hype around bitcoin, blockchain, the technology that underpins it, caught the attention of data specialists. Blockchain improves data integrity, provides an easy and trusted means of sharing data, and enables real-time analysis and data traceability. With robust security and transparent record keeping, blockchain is set to help data scientists achieve many milestones that were previously considered impossible. Although decentralized digital ledgers are still a nascent technology, the preliminary results from companies experimenting with them, like IBM and Walmart, suggest that they work.
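The data-integrity idea can be illustrated with a toy hash chain, in which every block stores the hash of the block before it. This sketch covers only that linking mechanism. It leaves out everything that makes a real blockchain decentralized (networking, consensus, signatures), and the function names and shipment records are made up for the example.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block

def block_hash(index, data, prev_hash):
    """SHA-256 over the block's contents, including the previous block's hash."""
    payload = json.dumps(
        {"index": index, "data": data, "prev_hash": prev_hash}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def build_chain(records):
    chain, prev_hash = [], GENESIS_HASH
    for i, data in enumerate(records):
        h = block_hash(i, data, prev_hash)
        chain.append({"index": i, "data": data, "prev_hash": prev_hash, "hash": h})
        prev_hash = h
    return chain

def is_valid(chain):
    """Recompute every hash and check that each block points at its predecessor."""
    prev_hash = GENESIS_HASH
    for block in chain:
        if block["prev_hash"] != prev_hash:
            return False
        if block["hash"] != block_hash(block["index"], block["data"], block["prev_hash"]):
            return False
        prev_hash = block["hash"]
    return True

chain = build_chain(["shipment 1 received", "shipment 1 inspected", "shipment 1 shelved"])
valid_before = is_valid(chain)

chain[1]["data"] = "shipment 1 rejected"  # tamper with a past record
valid_after = is_valid(chain)
print(valid_before, valid_after)
```

Because each hash depends on the previous one, changing an old record invalidates the chain from that point on, which is what makes the data traceable and tamper-evident.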
Handling Datastreams
Stream processing is a Big Data technology. It enables users to query a continuous data stream and detect conditions quickly, within a short time of receiving the data. The detection time may vary from a few milliseconds to minutes. For example, with stream processing, you can receive an alert by querying a data stream coming from a temperature sensor and detecting when the temperature has reached the freezing point. These capabilities have kept streaming data a running trend in Big Data to date.
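The temperature example can be sketched with plain Python generators, which process readings one at a time as they arrive instead of loading a full dataset first. The names freezing_alerts and sensor_stream and the sample readings are invented for this illustration. A production system would read from a real streaming platform instead.

```python
from typing import Iterable, Iterator, Tuple

FREEZING_POINT_C = 0.0

def freezing_alerts(readings: Iterable[Tuple[int, float]]) -> Iterator[Tuple[int, float]]:
    """Yield (timestamp, temperature) each time the stream reaches freezing point."""
    was_freezing = False
    for timestamp, temp_c in readings:
        is_freezing = temp_c <= FREEZING_POINT_C
        # Alert only on the transition, not on every freezing reading
        if is_freezing and not was_freezing:
            yield timestamp, temp_c
        was_freezing = is_freezing

def sensor_stream() -> Iterator[Tuple[int, float]]:
    # Stand-in for a live sensor feed: (timestamp, temperature in Celsius)
    yield from [(0, 3.5), (1, 1.2), (2, -0.4), (3, -1.0), (4, 0.8), (5, 0.0)]

alerts = list(freezing_alerts(sensor_stream()))
print(alerts)
```

Only the current reading and one flag are kept in memory, so the same loop works whether the stream has six readings or never ends.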
Apache Spark
Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets. With Spark running on Apache Hadoop YARN, developers everywhere can now create applications to exploit Spark’s power, derive insights, and enrich their data science workloads within a single, shared dataset in Hadoop. With Apache releasing new features from time to time for the Spark libraries (Spark Streaming, GraphX, etc.), it has been able to maintain its hold as a trend in Big Data to date.
Conclusion
This is only the beginning, as data science continues to serve as the catalyst in the changes you are going to experience in business and technology. It is now up to you to adapt efficiently to these changes and help your own business flourish.
A data science project requires numerous iterations that are time-consuming. When dealing with numbers and data interpretations, it goes without question that you have to be quite smart and proactive.
It’s not surprising that iterations can be frustrating if they require regular updates. Sometimes a six-month-old model needs current information; other times you miss out on some data, so the analysis has to be done all over again. In this article, we will focus on the ways in which Data Scientists and Business Analysts can increase productivity without spending so much time on unhealthy iterations.
Tips and Tricks for data scientists
Keeping the bigger picture in mind
Long-term goals should be considered a priority when doing the analysis. Many small issues may crop up, but they should not overshadow the bigger ones. Be observant in deciding which problems are going to affect the organization on a larger scale. Focus on those bigger problems and look for stable solutions. Data scientists and business analysts have to be visionary to come up with solutions.
Understanding the problem and keeping the requirements at hand
Data science is not about implementing a fancy or complex algorithm or doing some complex data aggregation. It is about providing a solution to the problem at hand. Tools like ML, visualization or optimization algorithms are just means through which one can arrive at a suitable solution. Always understand the problem you are trying to solve. Do not jump directly to machine learning or statistics right after getting the data. First analyze what the data is about and what you need to know and do to arrive at a solution. It is also important to keep an eye on the feasibility of the solution in terms of implementation: a good solution is one that is easily implementable.
More real-world oriented approach
Data science involves providing solutions to real-world use cases, so one should always keep a real-world oriented approach. Focus on the domain or business use case of the problem at hand and on the solution to be implemented, rather than looking at it purely from the technical side. The technical aspect focuses on the correctness of the solution, while the business aspect focuses on its implementation and usage. Sometimes you do not need a complex, incomprehensible algorithm to meet your requirements; you may be happier with a simple algorithm that is less accurate than the complex one, trading some accuracy for comprehensibility. Knowledge of technical aspect is a must but
Not everything is ML
Recently, machine learning has seen great advances in its use across business applications. With its strong prediction capabilities, machine learning can solve many complex problems in various business scenarios. But one should note that data science is not only about machine learning. Machine learning is just a small part of it. Data science is more about arriving at a feasible solution for a given problem. One should focus on areas like data cleaning, data visualization, and the ability to explore the data extensively and find relations between the various attributes. It is about the ability to crunch out the meaningful numbers that matter the most. A good data scientist focuses on all of the above qualities rather than just trying to fit machine learning algorithms to problem statements.
Programming Languages
It is important to have a firm grip on at least one programming language widely used in data science, and you should know a little of another. Either know R very well and some Python, or Python very well and some R.
Data cleaning and EDA
Exploratory Data Analysis (EDA) is one of the important steps in the data analysis process. Here, the focus is on making sense of the data in hand: formulating the correct questions to ask of your dataset, working out how to manipulate the data sources to get the required answers, and so on. This is done by taking an elaborate look at trends, patterns, and outliers using visual methods. Cleaning is one of the most complex processes in data science, since almost all data available or extracted for language processing tasks is unstructured. Highly processed, neatly structured data will generally yield better results than noisy data. When you are cleaning data for language processing tasks, simple approaches often give the best results, so try to perform the cleaning with simple regular expressions rather than complex tools.
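To make the point about simple regular expressions concrete, here is a minimal sketch using only Python's built-in `re` module. The function name `clean_text`, the particular cleaning rules and the sample string are illustrative assumptions; the right rules always depend on your data.

```python
import re


def clean_text(raw: str) -> str:
    """Normalize a raw text snippet with a few simple regular expressions."""
    text = raw.lower()
    text = re.sub(r"<[^>]+>", " ", text)       # strip HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # strip URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop punctuation and symbols
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    return text


# Hypothetical scraped snippet.
sample = "<p>Check   THIS out!!! https://example.com/page?id=1 It is GREAT :)</p>"
print(clean_text(sample))
```

Each rule is one line, easy to read and easy to reorder or remove, which is the main argument for starting with regular expressions before reaching for heavier tooling.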
Always open to learning more and more
“Data Science is a journey, not a destination”. This line gives us an insight into how huge the data science domain is and why constant learning is as important as building intelligent models. Practitioners who keep themselves updated with the new tech being developed every day are able to implement solutions and solve business problems faster. With all the resources available on the internet, such as MOOCs, one can easily stay updated. Showcasing your skills on your blog or GitHub is also an important hack which most of us are unaware of. This not only benefits their “The man who is too old to learn was probably always too old to learn.”
Evaluating Models and avoiding overfit
Separate the data into two sets, the training set and the testing set, so that the model is evaluated on data it has not seen. Cross-validation is a convenient way to check for over-fitting: it measures the out-of-sample fit by repeatedly holding out part of the data.
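A minimal sketch of a hold-out evaluation in plain Python follows. The helper names (`train_test_split`, `fit_line`, `mse`), the 80/20 split and the synthetic data are assumptions for illustration; in practice you would normally use the equivalent utilities from a library such as scikit-learn.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(points):
    """Ordinary least squares fit of y = a * x + b; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def mse(model, points):
    """Mean squared error of the fitted line on the given points."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in points) / len(points)


# Synthetic data: y = 2x + 1 plus Gaussian noise.
rng = random.Random(42)
data = [(x, 2.0 * x + 1.0 + rng.gauss(0, 1.0)) for x in range(50)]

train, test = train_test_split(data)
model = fit_line(train)          # the model only ever sees the training set
print("train MSE:", mse(model, train))
print("test MSE: ", mse(model, test))
```

A test error that is much larger than the training error is the usual warning sign of over-fitting.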
Converting findings into the actions
Again, this might sound like a simple tip, but both beginners and advanced people falter on it. Beginners tend to perform steps in Excel, which involves copying and pasting data. For advanced users, work done through a command line interface might not be reproducible. Similarly, you need to be extra cautious while working with notebooks. Control your urge to go back and change a previous step that uses a dataset computed later in the flow. Notebooks are very powerful for maintaining a flow, but if the flow is not maintained they can slow you down considerably.
Taking Rest
When do I work the best? It’s when I give myself a 2–3 hour window to work on a problem or project. You can’t multi-task as a data scientist. You need to focus on a single problem at a time to make sure you get the best out of yourself. Chunks of 2–3 hours work best for me, but you can decide yours.
Conclusion
Data science requires continuous learning; it is more of a journey than a destination. One always keeps learning more about data science, so keep the above tips and tricks in your arsenal to boost your productivity and deliver more value on complex problems, which can often be solved with simple solutions! Stay tuned for more articles on data science.
Now suppose you read a question about a topic like overfitting. You can read the text and memorize the answer. Articles with this heading (Interview Questions and Answers) are normally constructed that way, with plain text questions and answers. You could follow that route for interview preparation, but it is simply not the right thing to do. I can give you a list of important questions with answers, which is exactly what I will do later in this article.
But you need to understand one thing clearly.
You cannot learn programming and data science from books alone.
You can learn the heading and the words. But the concept will truly be understood only in a practical manner; in a mini-project or in a worked-out example on the computer.
Data science is similar to programming in this regard.
Books are meant to just start your journey.
The real learning begins only when you implement it in code by yourself.
To take an example:
Question from the Interviewer:
“What is cross-validation and why is it important? How does it eliminate overfitting?”
A Good Answer:
“Cross-validation guards against overfitting by evaluating the model on every part of the data set in a statistically uniform manner. Overfitting happens when a model fits the noise in its training set rather than the underlying pattern. If a model like LogisticRegression is trained until the training error is very small, it may not be able to generalize to the data found in the test set. The performance of the model would then be excellent on the training set, but poor on the test set, because the model has overfitted itself to the training data. When presented with test data, error values increase because the generalization capacity of the model has decreased.”
“K-fold Cross Validation addresses this by first dividing the total data into k sections (folds). We train k models, each time using a different fold as the test set and the remaining folds as the training set. Thus, every data point is used for testing exactly once, and never by a model that was trained on it. Finally, we take an average of the results of each model and return that as the output. So overfitting is exposed and kept in check by using the entire data as input, with one of the k folds being left out at a time to use as a test set. A common value for k is 10.”
Question:
“Can you show me how that works by coding it on a 10 by 10 array of integers? In Python?”
Worst Case Answer:
…
“Ummmmmmmm…..”
“Sorry sir, I just studied that in a textbook. I am not sure how I could work through that by code.”
(!!!)
You Can’t Study Without Implementation
Data science should be studied in the way programming is studied. By working at it on a computer and running all the models in your textbook, and finally, doing your own mini-project, on every topic that could be important. Can you learn to drive a car by reading about it in a book? You need practical experience! Otherwise, all your preparation is meaningless. That is the point I wanted to make.
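For the record, the interviewer's coding question above can be answered in a couple of dozen lines. The following is one possible sketch in plain Python, not the only acceptable answer. It assumes each row of the 10 by 10 array is a sample whose last column is the target, and it uses a deliberately trivial "model" (predict the mean of the training targets) so that the cross-validation mechanics stay in focus. The helper names and k = 5 are choices made for this example.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the row indices and deal them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(rows, k=5):
    """Return the mean absolute error of a mean-predictor on each fold."""
    folds = k_fold_indices(len(rows), k)
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train = [rows[i] for i in range(len(rows)) if i not in held_out]
        test = [rows[i] for i in test_idx]
        # "Model": predict the mean of the training targets (last column).
        prediction = sum(row[-1] for row in train) / len(train)
        error = sum(abs(row[-1] - prediction) for row in test) / len(test)
        scores.append(error)
    return scores


# A 10 by 10 array of random integers; each row is treated as one sample.
rng = random.Random(1)
array = [[rng.randint(0, 99) for _ in range(10)] for _ in range(10)]

scores = cross_validate(array, k=5)
print("per-fold error:", scores)
print("mean error:", sum(scores) / len(scores))
```

The key point to explain to the interviewer is that every row lands in exactly one test fold, and that the model scored on a fold never saw that fold during training.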
Now, having established this, I assume from here on that you are a data scientist in training who has worked the fundamental details on a computer and is familiar with the basics. You just need the finishing touches on your interview preparation. If that is the case; here are your topics for mini-projects and experiments! And – interview questions with answers.
This is a site that allows you to sharpen your skills in Python for interviews. There are many more sites like this; all you need to do is Google ‘Python Interview Questions’.
Many people know Python, but R is not as commonly known. The above tutorial spans 30 pages that you can work through with your R console to learn the basics. Alternatively, you could try Swirl (link given below), which is also highly recommended for beginners.
Oh, what are kernels? Kaggle Kernels are online Jupyter notebooks that allow you to run Python and R code interactively with your browser in the same application without any local processing. All computation is done on the Kaggle servers.
Top Ten Essential Data Science Questions with Answers
1. What is a normal distribution? And how is it significant in data science?
The normal distribution is a probability distribution characterized by its mean and its standard deviation (or variance). Its density curve looks like a bell, hence it is also referred to as the bell curve; the case with a mean of 0 and a variance of 1 is called the standard normal distribution. The central limit theorem makes the normal distribution ubiquitous in data science. In essence, the theorem states that the distribution of the mean (or sum) of many independent samples tends toward the normal shape as the number of samples grows, whatever the distribution of the individual values, provided their variance is finite. This is used nearly everywhere in data science, because it lets you reason about averages computed from an arbitrary dataset of, say, n = one thousand samples: as n increases, the histogram of such averages looks more and more like the bell curve.
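The central limit theorem is easy to check empirically. The sketch below is an illustration rather than a proof: it draws values from a uniform distribution (which is flat, not bell-shaped), averages them in groups, and looks at how those averages are distributed. The sample size of 30, the 2000 trials and the function name `sample_means` are arbitrary choices for the example.

```python
import random
import statistics


def sample_means(sample_size, trials=2000, seed=0):
    """Return `trials` means, each over `sample_size` uniform(0, 1) draws."""
    rng = random.Random(seed)
    return [
        sum(rng.random() for _ in range(sample_size)) / sample_size
        for _ in range(trials)
    ]


means = sample_means(sample_size=30)
center = statistics.mean(means)
spread = statistics.stdev(means)

# Theory: uniform(0, 1) has mean 1/2 and variance 1/12, so the mean of
# 30 draws should have standard deviation sqrt(1/12) / sqrt(30).
expected_spread = (1 / 12) ** 0.5 / 30 ** 0.5

# For a normal distribution, about 68% of values lie within one
# standard deviation of the mean.
within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)

print("mean of sample means:", center)
print("observed spread:", spread, "expected:", expected_spread)
print("fraction within one standard deviation:", within_one_sd)
```

Plotting a histogram of `means` would show the familiar bell shape even though the individual draws are uniformly distributed.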
2. What do you mean by A/B testing?
An A/B test randomly assigns users or samples to one of two variants, A or B, records the outcome for each, and compares the rate of success or accuracy between the two. A statistical test then indicates whether the observed difference is larger than chance alone would explain. This often tells us which feature should be used to build a machine learning model. It is also used to select which model to use in the first place. A/B testing is a general concept that can be applied to nearly every system.
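One common way to compare two success rates is a two-proportion z-test. The sketch below is a minimal version using only the standard library; the function name and the conversion counts are made up for illustration, and other tests (for example a chi-squared test) are equally valid choices.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference in success rates."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 200 of 2000 users convert on variant A,
# 260 of 2000 convert on variant B.
z, p_value = two_proportion_z_test(200, 2000, 260, 2000)
print("z =", z, "p-value =", p_value)
```

A small p-value (conventionally below 0.05) suggests the difference between A and B is unlikely to be due to chance alone.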
3. What are eigenvalues and eigenvectors?
An eigenvector of a square matrix A is a non-zero vector v whose direction is unchanged by the linear transformation that A represents: Av = λv. The eigenvalue λ is the factor by which that eigenvector is stretched, shrunk or flipped, so it measures the strength of the transformation along that direction. In data science they are most often computed from the covariance or correlation matrix of a dataset, as in principal component analysis. See Linear Algebra by Gilbert Strang (online ebook) for more details on their computation.
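For a 2 by 2 symmetric matrix (the shape of a two-feature covariance matrix) the eigenvalues and eigenvectors have a closed form, which makes for a compact worked example. The function name below is an assumption for this sketch; for larger matrices you would use a numerical library instead.

```python
import math


def eigen_2x2_symmetric(a, b, d):
    """Eigenpairs of the symmetric matrix [[a, b], [b, d]].

    Returns [(eigenvalue, unit eigenvector), ...], largest eigenvalue first.
    """
    if b == 0:
        # Already diagonal: the coordinate axes are eigenvectors.
        return sorted([(a, (1.0, 0.0)), (d, (0.0, 1.0))], reverse=True)
    mid = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    pairs = []
    for lam in (mid + radius, mid - radius):
        # (A - lam * I) v = 0 is solved by v = (b, lam - a) when b != 0.
        x, y = b, lam - a
        norm = math.hypot(x, y)
        pairs.append((lam, (x / norm, y / norm)))
    return pairs


# Example matrix [[2, 1], [1, 2]].
for lam, (x, y) in eigen_2x2_symmetric(2.0, 1.0, 2.0):
    # Check the defining property A v = lam * v component by component.
    av = (2.0 * x + 1.0 * y, 1.0 * x + 2.0 * y)
    print("eigenvalue", lam, "eigenvector", (x, y), "A v =", av)
```

For this matrix the eigenvalues are 3 and 1, with eigenvectors along the diagonals (1, 1) and (1, -1).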
4. How do the recommender systems in Amazon and Netflix work? (research paper pdf)
Recommender systems in Amazon and Netflix are considered top-secret and are usually described as black boxes. But their internal mechanism has been partially worked out by researchers. A recommender system, predated by the expert systems models of the 90s, is used to generate rules or ‘explanations’ as to why a product might be more attractive to user X than to user Y. Complex algorithms are used, which take many inputs, such as past history and genre, to generate the following types of explanations: functional, intentional, scientific and causal. These explanations, which can also be called user-invoked, automatic or intelligent, are tuned by certain metrics such as user satisfaction, user rating, trust, reliability, effectiveness, persuasiveness, etc. The exact algorithm still remains an industry secret, similar to the way that Google keeps the algorithms that perform PageRank secret and constantly updated (500-600 times a year in the case of Google).
5. What is the probability of an impossible event, a past event and what is the range of a probability value?
An impossible event E has P(E) = 0. Probabilities take on values only in the closed interval [0, 1]. An event from the past has already occurred, so here P(E) = 1.
6. How do we treat missing values in datasets?
A categorical missing value is usually filled with a default value, such as the most frequent category. A continuous missing value is usually filled using a measure of central tendency such as the mean, median or mode, or by drawing from a fitted distribution such as the normal. If a feature has less than 20% of its data available, the recommendation is to delete that feature from the model.
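A minimal sketch of these rules in plain Python is shown below. It assumes missing values are represented as `None`; the function names `impute` and `missing_fraction` and the sample columns are made up for the example. Libraries such as pandas offer the same operations on whole tables.

```python
import statistics


def missing_fraction(values):
    """Fraction of entries in the column that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def impute(values, strategy="mean"):
    """Replace None entries using the mean, median or mode of the rest."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError("unknown strategy: " + strategy)
    return [fill if v is None else v for v in values]


# Hypothetical columns with gaps.
ages = [25, None, 31, 40, None, 28]
colors = ["red", None, "blue", "red"]

print(impute(ages, "mean"))
print(impute(ages, "median"))
print(impute(colors, "mode"))      # most frequent category
print(missing_fraction(ages))

# Rule of thumb from the text: drop a feature with under 20% available data.
should_drop = (1 - missing_fraction(ages)) < 0.20
print("drop 'ages'?", should_drop)
```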
7. Which is faster, Python or R?
Python is considered moderately paced: it is an interpreted rather than a compiled language, and compiled languages such as C++ are much faster for most purposes. The reference implementation of Python is written in C, and performance-critical libraries are compiled, which speeds up execution. R was designed by statisticians rather than computer scientists, and is generally considered slower than Python for general-purpose code.
8. What is Deep Learning and why is it such a popular buzzword in the machine learning field right now?
For many years, until around 2006, backpropagation neural networks typically had just three layers: one input, one hidden and one output layer. The problem with this model was that, since it used gradient descent and the backpropagation algorithm, training had a tendency to get stuck in local minima of the error surface defined over the network's weights. Thus, NNs could not be used optimally for many applications, since they could only find a partially optimal solution. In 2006, Geoffrey Hinton et al. published a research paper showing that deep multilayer networks could be trained effectively by pre-training them one layer at a time. Later research argued that in spaces with thousands of dimensions, poor local minima are statistically so rare that they are seldom encountered during back-propagation (saddle points are common instead). Deep learning refers to neural nets with 3 or more (even 10) hidden layers. They require more computational power, which was one of the reasons GPUs started to be used by the machine learning community to implement deep learning NNs. Since 2010-2012, deep learning has been applied to nearly every technology domain, and the models have been highly accurate and successful in areas ranging from speech recognition to playing the board game Go.
9. What is the difference between machine learning and deep learning?
For more details on that, I suggest you go through this excellent article, given on the following link on our blog below:
To sum up, I have to say: enjoy your work. You will be much better at what you love than at something that is glamorous but not to your taste. Artificial Intelligence, Data Science, Software Development and Machine Learning are very much in my preferred line of work, and my hope is that they will be in yours too. Don’t just read the text; work out the code on your own system or on Kaggle. That is the best way to prepare for interview questions. Only practice at your computer (preferably on Kaggle) will give you true confidence on the day of your interview. That is true expertise: practice making perfect. Enjoy data science!