Research Springboard

Public and Open-Source Datasets

These links will take you to sites where you can download datasets. Many of them were likely compiled for specific purposes. As mentioned in the video for this module, this is just one method of finding data, and it is only a small sampling of what is out there if you search for datasets.

APIs

It is important to note that many APIs were not designed explicitly for data collection; in fact, some prohibit it. Dataset aggregation through an API has become something of a phenomenon, and it doesn’t always work. Here are some links to APIs, or really, to their documentation. We encourage you to look through the documentation, observe the examples, and get a feel for what is possible and for the intended uses. Many APIs require that you register for an API key to query the system.
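As a minimal sketch of what such a query can look like, here is a Python snippet that sends a request to a made-up endpoint with a made-up API key; the real URL, parameters, and authentication scheme will differ for every API, so always check the documentation first.

    import json
    import urllib.parse
    import urllib.request

    # Hypothetical endpoint and key; substitute the values from the API's documentation.
    API_KEY = "YOUR_API_KEY"
    params = urllib.parse.urlencode({"q": "street photography", "per_page": 10, "api_key": API_KEY})
    url = f"https://api.example.com/v1/search?{params}"

    # Query the API and parse the JSON response into a Python dictionary.
    with urllib.request.urlopen(url) as response:
        data = json.loads(response.read().decode("utf-8"))

    # Peek at the beginning of the response to see what fields came back.
    print(json.dumps(data, indent=2)[:500])

Notice how little code it takes once you know the endpoint and the key; most of the effort is in reading the documentation.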

Video “How to Create AI Datasets”

This is a succinct video from an AI researcher that goes a little further into the steps of creating your own dataset. Though it doesn’t go into technical methods, it gives a great overview of the process and the possibilities of collection and preprocessing.

Batch Downloading Scripts

Some of these may or may not still work, and may or may not be aligned with API usage guidelines. If one is broken, it might be because paid tools for accessing the data are now available and free access has been blocked. Note the use of Python here and in many similar scripts (try searching for more). Below, we point out what Python may be using “under the hood” to query and download; a small sketch of that idea follows the list.

  • Instagram-Scraper: A command-line application written in Python that scrapes and downloads your (or the world’s) Instagram posts.
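To demystify what a scraper like this is doing under the hood, here is a hedged sketch in Python; the URLs are invented, and a real scraper would first collect them by paging through an API or parsing HTML.

    import urllib.request
    from pathlib import Path

    # Invented media URLs; a real script would gather these programmatically.
    urls = [
        "https://example.com/media/photo_001.jpg",
        "https://example.com/media/photo_002.jpg",
    ]

    out_dir = Path("downloads")
    out_dir.mkdir(exist_ok=True)

    # Loop over the list and save each file locally.
    for url in urls:
        filename = out_dir / url.split("/")[-1]
        print(f"Fetching {url} -> {filename}")
        urllib.request.urlretrieve(url, str(filename))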

Reading Between The Lines (of Code)

This section lists some technical tools for data collection and preprocessing. If reading about them goes over your head or doesn’t make sense yet, just note that these tools exist and try to imagine what possibilities they present. You might be surprised to find that a lot of them are already installed on your computer.

These two are non-interactive command-line tools that can not only make API requests (which might return JSON data, for instance) but also do many other things on the network from a terminal. For example, wget can recursively download entire websites to make local copies. If a simple command can bring you data from the web, what happens when it is put in a loop with variables as arguments in a script?
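Here is a sketch of exactly that (with a made-up URL pattern, and assuming wget is installed): a short Python script that calls wget in a loop, changing one argument each pass. This is the whole trick behind many batch downloading scripts.

    import subprocess

    # Hypothetical URL pattern; the page number is the variable that changes each pass.
    base = "https://example.com/archive/page_{}.html"

    for page in range(1, 6):
        url = base.format(page)
        # Equivalent to typing `wget <url>` in the terminal five separate times.
        subprocess.run(["wget", "--quiet", url], check=True)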

Data Preprocessing (Cleaning)

As mentioned in both the module video and the video above, raw data might need to be cleaned or preprocessed before being used to train a model. These are some very powerful and widely used tools for doing that.

This command-line tool is used in countless image and video preprocessing scripts. It can slice videos into images and stitch images into videos. It can convert between many file types, crop, resize, and much more. Definitely a tool to be aware of.
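If the tool linked here is ffmpeg, which the description matches, a minimal sketch of slicing a video into still images from within a Python script might look like the following; the input filename and the one-frame-per-second rate are placeholders.

    import subprocess
    from pathlib import Path

    # Make a folder to hold the extracted frames.
    Path("frames").mkdir(exist_ok=True)

    # Extract one frame per second from a (hypothetical) input video as numbered PNGs.
    subprocess.run(
        ["ffmpeg", "-i", "input.mp4", "-vf", "fps=1", "frames/frame_%04d.png"],
        check=True,
    )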

This is a tricky tool to wield, but there is nothing quite like it for JSON data. Again, JSON is a very common format for API responses, but sometimes you only need a few of the name/value fields from the objects that get returned. jq can filter JSON in just about any way imaginable, and very efficiently (which starts to matter with large datasets).
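As a small, hedged illustration (assuming jq is installed, and using invented field names), here is a jq filter that keeps only two fields from each object in a JSON array, driven from Python to match the scripting theme above.

    import subprocess

    # Invented API-style response: an array of objects with more fields than we need.
    raw_json = """[
      {"id": 1, "title": "First item", "meta": {"likes": 10}},
      {"id": 2, "title": "Second item", "meta": {"likes": 42}}
    ]"""

    # Keep only the id and title of each object; the filter string is plain jq syntax.
    result = subprocess.run(
        ["jq", "[.[] | {id, title}]"],
        input=raw_json,
        capture_output=True,
        text=True,
        check=True,
    )
    print(result.stdout)

The same filter works just as well on a file or on data streamed straight out of a wget call, which is where jq really earns its keep.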

Whatever format your data is in, there is probably a Python library built to interface with that type of file. The built-in functions of the language make it a solid option for filtering or organizing data.
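As one small illustration (the filename and column names here are hypothetical), the built-in csv module can filter a spreadsheet-style file down to just the rows you want.

    import csv

    # Keep only the rows whose "label" column equals "cat":
    # a tiny example of filtering a dataset with nothing but the standard library.
    with open("annotations.csv", newline="") as src, open("cats_only.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row["label"] == "cat":
                writer.writerow(row)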

These two are next level. For instance, they can swap every period in a file for a semicolon, or add up every number that follows a ‘$’ symbol. They fully support regular expressions (regex), and when working with datasets in the terabytes their simplicity can go a long way. If you follow the links and start reading about them, try to take the abstract and esoteric language of their descriptions at face value.
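To make those two example tasks concrete, here is a sketch of each one using Python’s built-in re module and string methods; the sample text is invented, and the point is only to show how simple the underlying operations are.

    import re

    text = "Lunch was $12.50. Coffee was $3. Total under $20."

    # Task 1: swap every period for a semicolon.
    swapped = text.replace(".", ";")

    # Task 2: add up every number that follows a '$' symbol, using a regex.
    amounts = [float(m) for m in re.findall(r"\$(\d+(?:\.\d+)?)", text)]
    total = sum(amounts)

    print(swapped)
    print(total)  # 35.5

The command-line tools described above do exactly this kind of work, line by line, on files far too large to open in an editor.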