On Wednesday, September 5, 2018, the data science world was following its usual routine. Enthusiasts, students, and professionals were looking for datasets through a plain Google search, on Kaggle, in the UCI Machine Learning Repository, and in many other places. Then Google released a tool that has already begun to change, and will probably keep changing, how people acquire their datasets. The tool is called Dataset Search, and it has been one of the most discussed topics in data science this week. You can try it out from this link.
Most of you are probably already familiar with what a dataset is. In case you’re not, here is a gentle introduction. Feel free to skip this section if you’re comfortable with datasets.
What is a dataset?
In simple words, a dataset is an organized collection of data about a particular domain of interest. The domain may be anything from global temperature changes to housing prices to breast cancer to stock markets: anything the user wants to explore in order to solve a problem. Some popular examples are the Iris dataset, which contains sepal and petal lengths and widths for three species of iris flowers; the MNIST dataset, which contains images of the handwritten digits 0 through 9; and the Boston Housing Price dataset, which pairs house prices with features such as the average number of rooms and the per capita crime rate. A dataset usually holds a large amount of historical data about its domain. Data scientists use that data to build, train, and test models, which can later make predictions or decisions on their own. Data is the single most important ingredient of any such model, and datasets are the bread and butter of a data scientist.
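To make the idea concrete, here is a minimal sketch of what a dataset looks like once it is loaded into a program. It assumes a tiny Iris-style table embedded as CSV text; the four rows and the helper name load_rows are only for illustration, and a real dataset would have many more rows and would normally be read from a file.

```python
import csv
import io
from collections import Counter

# A tiny Iris-style sample embedded as CSV text (illustrative rows only).
RAW = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""


def load_rows(text):
    """Parse CSV text into (features, label) pairs.

    Hypothetical helper: every column except 'species' is treated as a
    numeric feature, and 'species' is treated as the label.
    """
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        label = record["species"]
        features = {
            name: float(value)
            for name, value in record.items()
            if name != "species"
        }
        rows.append((features, label))
    return rows


rows = load_rows(RAW)
species_counts = Counter(label for _, label in rows)

print("number of rows:", len(rows))
print("feature columns:", sorted(rows[0][0]))
print("rows per species:", dict(species_counts))
```

Each row is one observation, each column is one measured feature, and the species column is the label a model would try to predict.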
Note: If you’re not familiar with what data science is, please read this article first.
Hopefully, by now everyone has a general idea of what a dataset is and why it is important in data science. This now brings us to the next question.
Where can I get a dataset?
You have a problem and you need data to explore it. So you start searching in the most obvious place, where everybody searches for almost everything: Google. Of course, there are other search engines, and we may have used them now and then, but we have to admit that Google is the most dominant of them all. Whether we want a dataset or just want to test our internet connection, we all go to Google. And Google simply gives us links to datasets on Kaggle or in the UCI Machine Learning Repository. These are really good resources. In fact, Kaggle not only has datasets but also hosts many competitions to develop machine learning algorithms, some of which offer prize money for the best entries.
What does Google Dataset Search do differently?
Ultimately, Dataset Search also provides links to Kaggle, the UCI ML Repository, and other websites that host the relevant dataset. So why would we need Dataset Search at all? Let’s look at an example. Suppose we want to find a dataset on global temperature changes. The following image shows the result of searching in plain old Google.
And now let’s search for the same thing using Dataset Search.
The difference is clear. A plain Google search gives us general results, which may or may not be relevant to us. Google Dataset Search gives us more specific results that are to the point. The left pane lists the relevant websites from which we can acquire the dataset; the first one here is from Kaggle. The right pane shows details for the entry selected on the left: when the dataset was last updated, who provides it, the available download formats, and a description. Try it and you’ll see for yourself. The page above also contains a plot of the relevant data.
So, Google Dataset Search is simply a search engine designed specifically to make relevant datasets easier to find. Whereas traditional Google search returns more generic results, Dataset Search returns specific results with extra information. On top of that, according to Natasha Noy, Research Scientist at Google AI, the search engine will point to a dataset wherever it is hosted, whether that is the publisher’s site, the author’s personal web page, or a digital library, which may not always be easy to reach through traditional search.
If you’ve used Google Scholar, you’ll see that the concept is quite similar. Google Scholar helps us find articles and research papers more easily. It also lets us create a library of our favorite articles to read. Dataset Search doesn’t have as many features yet. Google has introduced it as a beta version, which means there are likely to be many changes and, hopefully, new features that make access to data easier.
But not all data may be open and hence available for inspection. In Noy’s words, “The metadata needs to be open, the dataset itself does not need to be. For an analogy, think of a search you do on Google Scholar: It may well take you to a publisher’s website where the article is behind a paywall. Our goal is to help users discover where the data is and then access it directly from the provider.”
How does it work under the hood?
Dataset Search relies on schema.org markup supplied by dataset providers. You can learn about the markup from the official schema.org page. It lets publishers describe their data in a structured way so that search engines such as Google can understand the content of a page more reliably. Google has published guidelines for dataset providers on how to describe their datasets so that Dataset Search can index them. It encourages providers to include information such as the creator of the dataset, the methodology of its collection, the dates of its publication and last update, and the terms under which the data can or cannot be used. This helps search engines surface the right dataset to anyone searching for it.
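As a rough illustration of what such a description can look like, the sketch below builds a metadata record using the schema.org Dataset vocabulary and wraps it in the JSON-LD script tag that a provider would place in a page. The property names (name, creator, measurementTechnique, datePublished, dateModified, license, distribution) come from schema.org; the helper name build_dataset_markup and every value, including the example.org URLs, are hypothetical placeholders and not taken from any real dataset.

```python
import json


def build_dataset_markup(name, description, url, creator_name, method,
                         date_published, date_modified, license_url,
                         download_url, encoding_format):
    """Return a dict describing a dataset with schema.org Dataset properties.

    Hypothetical helper for illustration; it covers the kinds of fields the
    article mentions (creator, collection method, dates, usage terms).
    """
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "creator": {"@type": "Organization", "name": creator_name},
        "measurementTechnique": method,
        "datePublished": date_published,
        "dateModified": date_modified,
        "license": license_url,
        "distribution": [{
            "@type": "DataDownload",
            "encodingFormat": encoding_format,
            "contentUrl": download_url,
        }],
    }


# All values below are made-up placeholders.
markup = build_dataset_markup(
    name="Example global temperature anomalies",
    description="Monthly temperature anomalies (placeholder description).",
    url="https://example.org/datasets/temperature",
    creator_name="Example Climate Lab",
    method="Aggregated station measurements (placeholder)",
    date_published="2018-01-15",
    date_modified="2018-09-01",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    download_url="https://example.org/datasets/temperature.csv",
    encoding_format="text/csv",
)

# Providers typically embed the JSON-LD in the dataset's landing page.
script_tag = (
    '<script type="application/ld+json">\n'
    + json.dumps(markup, indent=2)
    + "\n</script>"
)
print(script_tag)
```

Note that only this metadata has to be publicly readable for a search engine to index the dataset; the download link itself can still point to data that requires registration or payment.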
Right now, Dataset Search is best at finding datasets in areas such as social sciences, environmental sciences, government data, and data provided by news organizations. As more data providers adopt the schema markup, the variety of datasets that can be found through Dataset Search should grow.
In summary, Dataset Search aims to be a one-stop shop for the datasets you need to find. Need data from both NASA and NOAA? There is no need to keep checking their websites. Google’s new search engine will handle that for you in a much more efficient way. These are exciting times in data science, with newer and better tools being developed. Given that these advancements come from tech giants like Google, we can expect more such tools in the future. We can also expect other tech giants to bring their own tools to keep this healthy competition going, and it is ultimately we who will benefit from it. Dataset Search is going to be an important tool in the arsenal of a data scientist. Sure, it’s not the sharpest tool at the moment, but it’s one that is likely to get better with time.