Data scientists are one of the most hirable specialists today, but it’s not so easy to enter this profession without a “Projects” field in your resume. Furthermore, you need the experience to get the job, and you need the job to get the experience. Seems like a vicious circle, right? Also, the great advantage of data science projects is that each of them is a full-stack data science problem. Additionally, this means that you need to formulate the problem, design the solution, find the data, master the technology, build a machine learning model, evaluate the quality, and maybe wrap it into a simple UI. Hence, this is a more diverse approach than, for example, Kaggle competition or Coursera lessons.
Hence, in this blog, we will look at 10 projects to undertake in 2019 to learn data science and improve your understanding of different concepts.
1. Match Career Advice Questions with Professionals in the Field
Problem Statement: The U.S. has almost 500 students for every guidance counselor. Furthermore, youth lack the network to find their career role models, making CareerVillage.org the only option for millions of young people in America and around the globe with nowhere else to turn. Also, to date, 25,000 create profiles and opt-in to receive emails when a career question is a good fit for them. This is where your skills come in. Furthermore, to help students get the advice they need, the team at CareerVillage.org needs to be able to send the right questions to the right volunteers. The notifications for the volunteers seem to have the greatest impact on how many questions are answered.
Your objective: Develop a method to recommend relevant questions to the professionals who are most likely to answer them.
Problem Statement: In this competition, you must create an algorithm to identify metastatic cancer in small image patches taken from larger digital pathology scans. Also, the data for this competition is a slightly different version of the PatchCamelyon (PCam) benchmark dataset. PCam is highly interesting for both its size, simplicity to get started on, and approachability.
Your objective: Identify metastatic tissue in histopathologic scans of lymph node sections
Problem Statement: To assess the impact of climate change on Earth’s flora and fauna, it is vital to quantify how human activities such as logging, mining, and agriculture are impacting our protected natural areas. Furthermore, researchers in Mexico have created the VIGIA project, which aims to build a system for autonomous surveillance of protected areas. Moreover, the first step in such an effort is the ability to recognize the vegetation inside the protected areas. In this competition, you are tasked with the creation of an algorithm that can identify a specific type of cactus in aerial imagery.
Your objective: Determine whether an image contains a columnar cactus
Problem Statement: In a world, where movies made an estimate of $41.7 billion in 2018, the film industry is more popular than ever. But what movies make the most money at the box office? How much does a director matter? Or the budget? For some movies, it’s “You had me at ‘Hello. In this competition, you’re presented with metadata on over 7,000 past films from The Movie Database to try and predict their overall worldwide box office revenue. Also, the data points provided include cast, crew, plot keywords, budget, posters, release dates, languages, production companies, and countries. Furthermore, you can collect other publicly available data to use in your model predictions.
Your objective: Can you predict a movie’s worldwide box office revenue?
Problem Statement: An existential problem for any major website today is how to handle toxic and divisive content. Furthermore, Quora wants to tackle this problem head-on to keep their platform a place where users can feel safe sharing their knowledge with the world. On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions — those founded upon false premises, or that intend to make a statement rather than look for helpful answers.
In this competition, you need to develop models that identify and flag insincere questions. Moreover, to date, Quora has employed both machine learning and manual review to address this problem. With your help, they can develop more scalable methods to detect toxic and misleading content.
Your objective: Detect toxic content to improve online conversations
Problem Statement: This competition is provided as a way to explore different time series techniques on a relatively simple and clean dataset. You are given 5 years of store-item sales data and asked to predict 3 months of sales for 50 different items at 10 different stores. What’s the best way to deal with seasonality? Should stores be modelled separately, or can you pool them together? Does deep learning work better than ARIMA? Can either beat xgboost? Also, this is a great competition to explore different models and improve your skills in forecasting.
Your Objective: Predict 3 months of item sales at different stores
Problem Statement: This competition focuses on the problem of forecasting the future values of multiple time series, as it has always been one of the most challenging problems in the field. More specifically, we aim the competition at testing state-of-the-art methods designed by the participants, on the problem of forecasting future web traffic for approximately 145,000 Wikipedia articles. Also, the sequential or temporal observations emerge in many key real-world problems, ranging from biological data, financial markets, weather forecasting, to audio and video processing. Moreover, the field of time series encapsulates many different problems, ranging from analysis and inference to classification and forecast. What can you do to help predict future views?
Problem Statement: Forecast future traffic to Wikipedia pages
Problem Statement: What does physics have in common with biology, cooking, cryptography, diy, robotics, and travel? If you answer “all pursuits are under the immutable laws of physics” we’ll begrudgingly give you partial credit. Also, If you answer “people chose randomly for a transfer learning competition”, congratulations, we accept your answer and mark the question as solved.
In this competition, we provide the titles, text, and tags of Stack Exchange questions from six different sites. We then ask for tag predictions on unseen physics questions. Solving this problem via a standard machine approach might involve training an algorithm on a corpus of related text. Here, you are challenged to train on material from outside the field. Can an algorithm learn appropriate physics tags from “extreme-tourism Antarctica”? Let’s find out.
Your objective: Predict tags from models trained on unrelated topics
Problem Statement: MNIST (“Modified National Institute of Standards and Technology”) is the de facto “hello world” dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike. Furthermore, in this competition, your goal is to correctly identify digits from a dataset of tens of thousands of handwritten images.
Your objective: Learn computer vision fundamentals with the famous MNIST data
Problem Statement: The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. Furthermore, this sensational tragedy shocked the international community and led to better safety regulations for ships. Also, one of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class. In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive.
Your objective: Predict survival on the Titanic and get familiar with ML basics
A decade ago, machine learning was simply a concept but today it has changed the way we interact with technology. Devices are becoming smarter, faster and better, with Machine Learning at the helm.
Thus, we have designed a comprehensive list of projects in Machine Learning course that offers a hands-on experience with ML and how to build actual projects using the Machine Learning algorithms. Furthermore, this course is a follow up to our Introduction to Machine Learning course and delves further deeper into the practical applications of Machine Learning.
Progressing step by step
In this blog, we will have a look at projects divided mostly into two different levels i.e. Beginners and Advanced. First, projects mentioned under the beginner heading cover important concepts of a particular technique/algorithm. Similarly, projects under advanced category involve the application of multiple algorithms along with key concepts to reach the solution of the problem at hand.
Projects offered by Dimensionless Technologies
We have tried to take a more exciting approach to Machine Learning, by not working on simply the theory of it, but instead by using the technology to actually build real-world projects that you can use. Furthermore, you will learn how to write the codes and then see them in action and actually learn how to think like a machine learning expert.
Following are some of the projects among many others that they cover in their courses:
Disease Detection — In this project, you will use the K-nearest neighbor algorithm to help detect breast cancer malignancies by using a support vector machine.
Credit Card Fraud Detection — In this project, you are going to do a credit card fraud detection and going to focus on anomaly detection by using probability densities.
Stock Market Clustering Project — In this project, you will use a K-means clustering algorithm to identify related companies by finding correlations among stock market movements over a given time span.
1) Iris Flowers Classification ML Project– Learn about Supervised Machine Learning Algorithms
Iris flowers dataset is one of the best data sets in classification literature. The classification of the iris flowers machine learning project is often referred to as the “Hello World” of machine learning. Furthermore, this dataset has numeric attributes and beginners need to figure out how to load and handle data. Also, the iris dataset is small which easily fits into the memory and does not require any special transformations or scaling, to begin with.
The goal of this machine learning project is to classify the flowers into among the three species — virginica, setosa, or versicolor based on length and width of petals and sepals.
2) Social Media Sentiment Analysis using Twitter Dataset
Platforms like Twitter, Facebook, YouTube, Reddit generate huge amounts of big data that can be mined in various ways to understand trends, public sentiments, and opinions. A sentiment analyzer learns about various sentiments behind a “content piece” through machine learning and predicts the same using AI. Also, Twitter data is considered a definitive entry point for beginners to practice sentiment analysis. Hence, using Twitter dataset, one can get a captivating blend of tweet contents and other related metadata such as hashtags, retweets, location and more which pave way for insightful analysis. Using Twitter data you can find out what the world is saying about a topic whether it is movies, sentiments about any trending topic. Probably, working with the Twitter dataset will help you understand the challenges associated with social media data mining and also learn about classifiers in depth.
3) Sales Forecasting using Walmart Dataset
Walmart dataset has sales data for 98 products across 45 outlets. Also, the dataset contains sales per store, per department on weekly basis. The goal of this machine learning project is to forecast sales for each department in each outlet consequently which will help them make better data-driven decisions for channel optimization and inventory planning. Certainly, the challenging aspect of working with Walmart dataset is that it contains selected markdown events which affect sales and should be taken into consideration.
In the book Moneyball, the Oakland A’s revolutionized baseball through analytical player scouting. Furthermore, they built a competitive squad while spending only 1/3 of what large market teams like the Yankees were paying for salaries.
First, if you haven’t read the book yet, you should check it out. Ceratinly, It’s one of our favorites!
Fortunately, the sports world has a ton of data to play with. Data for teams, games, scores, and players are all tracked and freely available online.
There are plenty of fun machine learning projects for beginners. For example, you could try…
Sports Betting… Predict box scores given the data available at the time right before each new game.
Talent scouting… Use college statistics to predict which players would have the best professional careers.
General managing… Create clusters of players based on their strengths in order to build a well-rounded team.
Sports is also an excellent domain for practicing data visualization and exploratory analysis. You can use these skills to help you decide which types of data to include in your analyses.
Sports Statistics Database — Sports statistics and historical data covering many professional sports and several college ones. The clean interface makes it easier for web scraping.
Sports Reference — Another database of sports statistics. More cluttered interface, but individual tables can be exported as CSV files.
cricsheet.org — Ball-by-ball data for international and IPL cricket matches. CSV files for IPL and T20 internationals matches are available.
As the name suggests (no points for guessing), this dataset provides the data on all the passengers who were aboard the RMS Titanic when it sank on 15 April 1912 after colliding with an iceberg in the North Atlantic ocean. Also, it is the most commonly used and referred to data set for beginners in data science. With 891 rows and 12 columns, this data set provides a combination of variables based on personal characteristics such as age, class of ticket and sex, and tests one’s classification skills.
Objective: Predict the survival of the passengers aboard RMS Titanic.
Advance level projects
This is where an aspiring data scientist makes the final push into the big leagues. After acquiring the necessary basics and honing them in the first two levels, it is time to confidently play the big game. Certainly, these datasets provide a platform for putting to use all the learnings and take on new, and more complex challenges.
This data set is a part of the Yelp Dataset Challenge conducted by crowd-sourced review platform, Yelp. It is a subset of the data of Yelp’s businesses, reviews, and users, provided by the platform for educational and academic purposes.
In 2017, the tenth round of the Yelp Dataset Challenge was held and the data set contained information about local businesses in 12 metropolitan areas across 4 countries.
Rich data comprising 4,700,000 reviews, 156,000 businesses, and 200,000 pictures provides an ideal source of data for multi-faceted data projects. Projects such as natural language processing and sentiment analysis, photo classification, and graph mining among others, are some of the projects that can be carried out using this dataset containing diverse data. The data set is available in JSON and SQL formats.
Objective: Provide insights for operational improvements using the data available.
With the increasing demand to analyze large amounts of data within small time frames, organizations prefer working with the data directly over samples. Consequently, this presents a herculean task for a data scientist with a limitation of time.
This dataset contains information on reported incidents of crime in the city of Chicago from 2001 to the present. It does not contain data from the most recent seven days. Not included in the data set, is data on murder, where data is recorded for each victim.
It contains 6.51 million rows and 22 columns and is a multi-classification problem. In order to achieve mastery over working with abundant data, this dataset can serve as the ideal stepping stone.
Objective: Explore the data, and provide insights and forecasts about crimes in Chicago.
KKD cup is a popular data mining and knowledge discovery competition held annually. It is one of the first-ever data science competition which dates back to 1997.
Every year, the KDD cup provides data scientists with an opportunity to work with data sets across different disciplines. Some of the problems tackled in the past include
Identifying which authors correspond to the same person
Predicting the click-through rate of ads using the given query and user information
Development of algorithms for Computer Aided Detection (CAD) of early-stage breast cancer among others.
The latest edition of the challenge was held in 2017 and required participants to predict the traffic flow through highway tollgates.
Objective: Solve or make predictions for the problem presented every year.
Undertaking different kinds of projects is one of the good ways through which one can progress in any field. Certainly, this allows an individual to have hands on at the problems faced during the implementation phase. Also, it is easier to learn concepts by applying them. Finally, you will have a feeling of doing actual work rather than just being all lost in the theoretical part.
There are wonderful competitions available on kaggle and other similar data science competition platforms. Hence, make sure you take some time out and jump into these competitions. Whether you are a beginner or a pro, certainly, there is a lot of learning available while attempting these projects.