AI Datasets

Recent advances in Deep Learning are only possible because of the availability of large datasets: machine learning algorithms need data, and they can only produce astonishing results if they have enough of it to find patterns in. Before we get into possible data sources, let’s do a quick overview of how supervised Artificial Intelligence systems work.

Supervised Learning

There are three classes of learning algorithms used in production systems. The first is Reinforcement Learning, used in cases where an agent can explore a space (either known or unknown) and figure out rules and policies for how to act in that space. As a concrete example, consider the AI bots that can beat humans at chess, Go, or StarCraft II: they explore the possible moves in the game space and come up with a policy that determines the optimal move for any situation. The next class is unsupervised learning, in which an algorithm is given a dataset and has to discover structure in that data on its own; for example, an algorithm that clusters users by how likely they are to buy a particular product. Finally, we have the most common form of learning algorithm, supervised learning. In supervised learning, we have some form of input (images, data points about a user, audio, and so on) and are trying to predict some output: what is in the image, what group the user falls into, what words are spoken in the audio file, and so on.
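
To make the input-to-output idea concrete, here is a minimal supervised learning sketch using scikit-learn; the feature values and labels are invented purely for illustration:

```python
# Minimal supervised-learning sketch with scikit-learn.
# The features and labels below are invented for illustration only.
from sklearn.linear_model import LogisticRegression

# Each row is an input (say, two data points about a user);
# each label is the output we want to predict (buys / doesn't buy).
X_train = [[25, 1], [40, 0], [35, 1], [50, 0]]
y_train = [1, 0, 1, 0]

model = LogisticRegression()
model.fit(X_train, y_train)       # learn patterns from labeled examples

print(model.predict([[30, 1]]))   # predict the label for a new input
```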

In this article, we will point to some sources of data for feeding supervised learning algorithms. The more data we have, the better the algorithms are at finding patterns and making predictions. Fortunately, many datasets are freely available for training your supervised learning models.

Public Datasets

Here, we outline a few freely available datasets that can be used for a variety of supervised learning tasks.

CIFAR

There are two datasets under the CIFAR name. CIFAR-10 contains 60,000 images that map to 10 different classes, while CIFAR-100 contains 60,000 images that map to 100 different classes.
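
Both variants are easy to load with common frameworks. For example, a minimal sketch with torchvision, where "./data" is just an example download path:

```python
# Download and load CIFAR-10 with torchvision (CIFAR-100 works the same
# way via torchvision.datasets.CIFAR100). "./data" is an example path.
from torchvision import datasets, transforms

train_set = datasets.CIFAR10(
    root="./data",
    train=True,
    download=True,
    transform=transforms.ToTensor(),
)

image, label = train_set[0]
print(image.shape)   # torch.Size([3, 32, 32]) -- 32x32 RGB images
print(label)         # integer class index in the range 0-9
```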

ImageNet

This is one of the largest public datasets available. ImageNet contains over 14 million images in over 20,000 categories, and many innovative neural network architectures have been developed using ImageNet as a benchmark.
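
Because ImageNet is such a common benchmark, most frameworks ship models pretrained on it. A minimal sketch with a recent torchvision (0.13 or later), where a random tensor stands in for a real image:

```python
# Load a ResNet-50 pretrained on ImageNet and run one dummy image
# through it. The random tensor stands in for a real 224x224 RGB image.
import torch
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.eval()

dummy_image = torch.randn(1, 3, 224, 224)   # batch of one fake image
with torch.no_grad():
    logits = model(dummy_image)
print(logits.shape)   # torch.Size([1, 1000]) -- one score per class
```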

OpenImages

This is an initiative put forth by Google. Rather than hosting image files directly, the project provides URLs for over 9 million images that map to 6,000 categories, and Open Images continues to be updated as more image URLs are added.
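
Since the project distributes URLs rather than image files, the first step is usually fetching the images yourself. A minimal sketch with the requests library; the CSV file name and its "url" column are hypothetical stand-ins for the actual Open Images metadata files:

```python
# Download images from a CSV of URLs. "image_urls.csv" and its "url"
# column are illustrative; adapt them to the real Open Images metadata.
import csv
import requests

with open("image_urls.csv", newline="") as f:
    for i, row in enumerate(csv.DictReader(f)):
        resp = requests.get(row["url"], timeout=10)
        if resp.status_code == 200:
            with open(f"image_{i}.jpg", "wb") as out:
                out.write(resp.content)
```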

YouTube-8M

This is a massive dataset drawn from YouTube videos, with annotations on a frame-by-frame basis covering over 4,000 labeled entities.

CelebFaces

Who doesn’t love celebrities? This dataset contains more than 200,000 celebrity face images, each with 40 attribute annotations.
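
torchvision also includes a loader for this dataset (under the name CelebA). A minimal sketch, with "./data" as an example path; note that the automatic download can occasionally fail because of hosting quotas on the original files:

```python
# Load CelebA with torchvision; each sample pairs a face image with its
# 40 binary attribute annotations. "./data" is an example path.
from torchvision import datasets, transforms

celeba = datasets.CelebA(
    root="./data",
    split="train",
    target_type="attr",
    download=True,
    transform=transforms.ToTensor(),
)

image, attrs = celeba[0]
print(attrs.shape)   # torch.Size([40]) -- one flag per attribute
```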

COCO

COCO is a large-scale object detection, segmentation, and captioning dataset. It is a joint effort between many of the powerhouses in AI: Google Brain, Facebook AI Research, Microsoft, and others. While many datasets focus on classifying a single class within a picture, COCO actually has pixel-level segmentation, which can be useful for a variety of tasks.
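
The official pycocotools package is the usual way to work with these annotations. A minimal sketch, assuming the standard COCO 2017 annotation files have been downloaded; the path below follows the dataset’s usual layout:

```python
# Explore COCO annotations with the official pycocotools API.
# The annotation-file path assumes the standard COCO 2017 layout.
from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2017.json")

# Find all images containing the "dog" category.
cat_ids = coco.getCatIds(catNms=["dog"])
img_ids = coco.getImgIds(catIds=cat_ids)

# Load the pixel-level segmentation masks for the first such image.
ann_ids = coco.getAnnIds(imgIds=img_ids[0], catIds=cat_ids)
for ann in coco.loadAnns(ann_ids):
    mask = coco.annToMask(ann)   # binary mask, one per object instance
    print(mask.shape)
```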

COIL-100

The Columbia Object Image Library contains images of 100 objects, each photographed through a full 360-degree rotation. This can be useful for deep learning tasks where parts of an image might be occluded and need to be reconstructed.

Conclusion

This list is by no means exhaustive, but hopefully it gives an idea of what’s possible with public datasets. When faced with an AI challenge, it’s useful to do a quick search to see if datasets exist that match the problem we’re working on. Even if there’s not an exact match, more often than not, there will be one (or more) datasets that can be used as a starting point to help make even better models.

Often, researchers release not only these great datasets but also the network architectures and trained weights used in their examples. With transfer learning, you can save a lot of model training time by using an existing model and its parameters as a starting point. For example, if you had to detect whether a picture contains a particular kind of computer that doesn’t appear in any existing dataset, you could start with an architecture trained on ImageNet and use a small set of your own images for the final training step to solve your specific problem, as the sketch below shows.
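
A minimal sketch of that workflow, assuming PyTorch with a recent torchvision (0.13 or later); the two-class head stands in for the hypothetical computer-detection task:

```python
# Transfer-learning sketch: reuse an ImageNet-pretrained ResNet and
# retrain only a new final layer for a hypothetical two-class problem
# ("contains the computer" vs "does not").
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Freeze the pretrained backbone so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the 1000-class ImageNet head with a 2-class head.
model.fc = nn.Linear(model.fc.in_features, 2)

# From here, train model.fc on your small labeled image set as usual.
```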