Ready, Set, Data!
A collection of interesting datasets and the tools to convert them into ready-to-use formats.
Features
- curated and cleaned datasets: quality over quantity
- all tools and pipelines are streaming: first results are available immediately
- fields and units are clearly labeled and properly typed
- data is output in immediately usable formats (Parquet, Arrow, DuckDB, SQLite)
- datasets conform to reasonable standards (UTF-8, RFC3339 dates, decimal lat/long coords, SI units)
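As a quick illustration of the timestamp convention (this is not code from the repo, just the standard-library way to produce such values), an RFC3339-style timestamp with an explicit UTC offset:

```python
from datetime import datetime, timezone

# An RFC3339 / ISO-8601 timestamp with an explicit UTC offset:
ts = datetime(2023, 4, 1, 12, 30, tzinfo=timezone.utc).isoformat()
print(ts)  # 2023-04-01T12:30:00+00:00
```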
Setup
Requires Python 3.8+.
git clone https://github.com/saulpw/readysetdata.git
cd readysetdata
Then, from within the repository, run one of:
make setup
or
pip install .
or
python3 setup.py install
Datasets
Output is generated for all available formats and put in the OUTPUT directory (output/ by default).
Size and time estimates are for JSONL output on a small instance.
make movielens
(150MB, 3 tables, 5 minutes; 2019 data)
- 84k movies and 28m ratings from MovieLens
make imdb
(20GB, 7 tables, 1 hour; updated daily)
- 9m movies/TV shows (1m rated), 7m TV episodes, 12m people from IMDb.
make geonames
(500MB, 2 tables, 10 minutes; updated quarterly)
make infoboxes
(2.5GB, 3800+ categories, 12 hours; updated monthly)
- 4m Wikipedia infoboxes organized by type, in JSONL format
See results immediately as they accumulate in output/wp-infoboxes.
make tpch
(500MB, 8 tables, 20 seconds; generated randomly)
make fakedata
(13MB, 3 tables, 5 seconds; generated randomly)
- generated with Faker
- joinable products, customers, and orders tables for a fake business
- Unicode data, including Japanese and Arabic names and addresses
- includes geo lat/long coords, numeric arrays, and arrays of structs
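The real target is generated with Faker; the stdlib-only sketch below just illustrates what "joinable" means for the three tables (all names and column choices here are illustrative, not the actual schema):

```python
import random

def fake_tables(n_customers=3, n_products=3, n_orders=5, seed=0):
    # Joinable tables: every order references an existing customer and product.
    rng = random.Random(seed)
    customers = [{"customer_id": i, "name": f"customer-{i}"}
                 for i in range(n_customers)]
    products = [{"product_id": i, "price": round(rng.uniform(1, 100), 2)}
                for i in range(n_products)]
    orders = [{"order_id": i,
               "customer_id": rng.randrange(n_customers),
               "product_id": rng.randrange(n_products)}
              for i in range(n_orders)]
    return customers, products, orders
```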
Supported output formats
Pass -f <formats> to individual scripts; separate multiple formats with commas. By default, all available formats are output.
- Apache Parquet:
parquet
- Apache Arrow IPC format: arrow and arrows
- DuckDB:
duckdb
- SQLite:
sqlite
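Once a dataset has been generated, the SQLite output can be inspected with nothing but the standard library. A minimal sketch (the path in the comment is hypothetical; substitute whichever dataset you built):

```python
import sqlite3

def list_tables(db_path):
    # Return the names of all tables in a SQLite database file.
    with sqlite3.connect(db_path) as con:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
    return [name for (name,) in rows]

# e.g. list_tables("output/movielens.sqlite")  # hypothetical output path
```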
Scripts
These live in the scripts/ directory. Some of them require the readysetdata module to be installed. For the moment, set PYTHONPATH=. and run them from the top-level directory.
remote-unzip.py <url> <filename>
Extract <filename> from the .zip file at <url> and stream it to stdout. Only that one file is downloaded; the entire .zip does not need to be fetched.
download.py <url>
Download from <url> and stream to stdout. The data for e.g. https://example.com/path/to/file.csv will be cached at cache/example.com/path/to/file.csv.
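The cache layout implied above, mapping a URL onto a path under cache/, can be sketched as follows (this mirrors the documented example, not necessarily the script's exact logic):

```python
from pathlib import Path
from urllib.parse import urlparse

def cache_path(url, root="cache"):
    # Map https://example.com/path/to/file.csv -> cache/example.com/path/to/file.csv
    parts = urlparse(url)
    return Path(root) / parts.netloc / parts.path.lstrip("/")
```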
xml2json.py <tag>
Parse XML from stdin and emit JSONL to stdout for each occurrence of the given <tag>.
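A streaming XML-to-JSONL pass like this can be sketched with the standard library's iterparse, which emits elements as they complete rather than loading the whole document (a minimal sketch of the idea, not the actual script):

```python
import json
import sys
import xml.etree.ElementTree as ET

def xml_to_jsonl(stream, tag, out=sys.stdout):
    # Stream matching <tag> elements as JSON lines, one record per element.
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == tag:
            record = {child.tag: child.text for child in elem}
            record.update(elem.attrib)
            out.write(json.dumps(record) + "\n")
            elem.clear()  # free memory as we go
```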
demux-jsonl.py <field>
Parse JSONL from stdin and append each record verbatim to <field-value>.jsonl, where <field-value> is the record's value for the given <field>.
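The demultiplexing step can be sketched in a few lines (a minimal sketch of the idea; demux-jsonl.py itself may differ, e.g. in how it buffers writes):

```python
import json
from pathlib import Path

def demux_jsonl(lines, field, outdir="."):
    # Append each input line verbatim to <field-value>.jsonl in outdir.
    # Opening the file per line is simple but slow; a real pipeline
    # would keep file handles open.
    outdir = Path(outdir)
    for line in lines:
        value = json.loads(line)[field]
        with open(outdir / f"{value}.jsonl", "a", encoding="utf-8") as f:
            f.write(line.rstrip("\n") + "\n")
```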
Credits
Created and curated by Saul Pwanson. Licensed for use under Apache 2.0.
Enabled by Apache Arrow and Voltron Data.