Ready, Set, Data!
A collection of interesting datasets and the tools to convert them into ready-to-use formats.
Features
- curated and cleaned datasets: quality over quantity
- all tools and pipelines are streaming: first results are available immediately
- fields and units are clearly labeled and properly typed
- data is output in immediately usable formats (Parquet, Arrow, DuckDB, SQLite)
- datasets conform to reasonable standards (UTF-8, RFC3339 dates, decimal lat/long coords, SI units)
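As a quick illustration of the timestamp convention (this is not code from the repo, just the standard-library way to produce such values), an RFC3339-style timestamp with an explicit UTC offset:

```python
from datetime import datetime, timezone

# An RFC3339 / ISO-8601 timestamp with an explicit UTC offset:
ts = datetime(2023, 4, 1, 12, 30, tzinfo=timezone.utc).isoformat()
print(ts)  # 2023-04-01T12:30:00+00:00
```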
Setup
Requires Python 3.8+.
git clone https://github.com/saulpw/readysetdata.git
cd readysetdata
Then, from within the repository, run one of:
make setup
or
pip install .
or
python3 setup.py install
Datasets
Output is generated for all available formats and put in the OUTPUT directory (output/ by default).
Size and time estimates are for JSONL output on a small instance.
make movielens
(150MB, 3 tables, 5 minutes; 2019 data)
- 84k movies and 28m ratings from MovieLens
make imdb
(20GB, 7 tables, 1 hour; updated daily)
- 9m movies/TV shows (1m rated), 7m TV episodes, 12m people from IMDb.
make geonames
(500MB, 2 tables, 10 minutes; updated quarterly)
make infoboxes
(2.5GB, 3800+ categories, 12 hours; updated monthly)
- 4m Wikipedia infoboxes organized by type, in JSONL format
See results immediately as they accumulate in output/wp-infoboxes.
make tpch
(500MB, 8 tables, 20 seconds; generated randomly)
make fakedata
(13MB, 3 tables, 5 seconds; generated randomly)
- generated with Faker
- joinable products, customers, and orders tables for a fake business
- Unicode data, including Japanese and Arabic names and addresses
- includes geo lat/long coords, numeric arrays, and arrays of structs
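The real target is generated with Faker; the stdlib-only sketch below just illustrates what "joinable" means for the three tables (all names and column choices here are illustrative, not the actual schema):

```python
import random

def fake_tables(n_customers=3, n_products=3, n_orders=5, seed=0):
    # Joinable tables: every order references an existing customer and product.
    rng = random.Random(seed)
    customers = [{"customer_id": i, "name": f"customer-{i}"}
                 for i in range(n_customers)]
    products = [{"product_id": i, "price": round(rng.uniform(1, 100), 2)}
                for i in range(n_products)]
    orders = [{"order_id": i,
               "customer_id": rng.randrange(n_customers),
               "product_id": rng.randrange(n_products)}
              for i in range(n_orders)]
    return customers, products, orders
```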
Supported output formats
Pass -f <formats> to individual scripts; separate multiple formats with commas. By default, all available formats are output.
- Apache Parquet:
parquet
- Apache Arrow IPC format: arrow and arrows
- DuckDB:
duckdb
- SQLite:
sqlite
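Once a dataset has been generated, the SQLite output can be inspected with nothing but the standard library. A minimal sketch (the path in the comment is hypothetical; substitute whichever dataset you built):

```python
import sqlite3

def list_tables(db_path):
    # Return the names of all tables in a SQLite database file.
    with sqlite3.connect(db_path) as con:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
    return [name for (name,) in rows]

# e.g. list_tables("output/movielens.sqlite")  # hypothetical output path
```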
Scripts
These live in the scripts/ directory. Some of them require the readysetdata module to be installed. For the moment, set PYTHONPATH=. and run them from the top-level directory.
remote-unzip.py <url> <filename>
Extract <filename> from the .zip file at <url> and stream it to stdout. Only that one file is downloaded; the entire .zip does not need to be fetched.
download.py <url>
Download from <url> and stream to stdout. The data for e.g. https://example.com/path/to/file.csv will be cached at cache/example.com/path/to/file.csv.
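The cache layout implied above, mapping a URL onto a path under cache/, can be sketched as follows (this mirrors the documented example, not necessarily the script's exact logic):

```python
from pathlib import Path
from urllib.parse import urlparse

def cache_path(url, root="cache"):
    # Map https://example.com/path/to/file.csv -> cache/example.com/path/to/file.csv
    parts = urlparse(url)
    return Path(root) / parts.netloc / parts.path.lstrip("/")
```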
xml2json.py <tag>
Parse XML from stdin and emit JSONL to stdout for each occurrence of the given <tag>.
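A streaming XML-to-JSONL pass like this can be sketched with the standard library's iterparse, which emits elements as they complete rather than loading the whole document (a minimal sketch of the idea, not the actual script):

```python
import json
import sys
import xml.etree.ElementTree as ET

def xml_to_jsonl(stream, tag, out=sys.stdout):
    # Stream matching <tag> elements as JSON lines, one record per element.
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == tag:
            record = {child.tag: child.text for child in elem}
            record.update(elem.attrib)
            out.write(json.dumps(record) + "\n")
            elem.clear()  # free memory as we go
```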
demux-jsonl.py <field>
Parse JSONL from stdin and append each record verbatim to <field-value>.jsonl, where <field-value> is the record's value for the given <field>.
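The demultiplexing step can be sketched in a few lines (a minimal sketch of the idea; demux-jsonl.py itself may differ, e.g. in how it buffers writes):

```python
import json
from pathlib import Path

def demux_jsonl(lines, field, outdir="."):
    # Append each input line verbatim to <field-value>.jsonl in outdir.
    # Opening the file per line is simple but slow; a real pipeline
    # would keep file handles open.
    outdir = Path(outdir)
    for line in lines:
        value = json.loads(line)[field]
        with open(outdir / f"{value}.jsonl", "a", encoding="utf-8") as f:
            f.write(line.rstrip("\n") + "\n")
```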
Credits
Created and curated by Saul Pwanson. Licensed for use under Apache 2.0.
Enabled by Apache Arrow and Voltron Data.