lm-datasets

lm-datasets is a collection of datasets for language model training, including scripts for downloading, preprocessing, and sampling.

Installation

pip install lm-datasets

Usage

To download and extract the plain text of one or more datasets, run the following command:

python -m lm_datasets.extract_plaintext $DATASET_ID $OUTPUT_DIR
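
To process several datasets in one go, the command above can be called once per dataset ID from a small driver script. The following is a minimal sketch, not part of lm-datasets itself: the helper names `build_extract_command` and `extract_many` are hypothetical, and the example assumes that the source names `wikinews` and `wikiquote` from the tables below are valid dataset IDs.

```python
import subprocess
import sys
from pathlib import Path


def build_extract_command(dataset_id: str, output_dir: str) -> list[str]:
    """Build the argument list for one extract_plaintext call (hypothetical helper)."""
    return [sys.executable, "-m", "lm_datasets.extract_plaintext", dataset_id, output_dir]


def extract_many(dataset_ids: list[str], output_dir: str, dry_run: bool = False) -> list[list[str]]:
    """Run the extraction once per dataset ID and return the commands that were built.

    With dry_run=True nothing is executed; the commands are only returned.
    """
    commands = [build_extract_command(dataset_id, output_dir) for dataset_id in dataset_ids]
    if not dry_run:
        Path(output_dir).mkdir(parents=True, exist_ok=True)
        for command in commands:
            # check=True raises CalledProcessError if the extraction fails
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Assumed dataset IDs, taken from the source names listed in this README
    for command in extract_many(["wikinews", "wikiquote"], "output", dry_run=True):
        print(" ".join(command))
```

Running one process per dataset keeps a failure in one dataset from silently affecting the others, since `check=True` stops the loop at the first failing command.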

By default, output is saved as JSONL files. To change the output format, use the --output_format argument (the example below also sets --output_compression):

python -m lm_datasets.extract_plaintext $DATASET_ID $OUTPUT_DIR --output_format parquet --output_compression zstd
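
The default JSONL output can be read back with the Python standard library. The sketch below makes several assumptions that are not confirmed by this README: that the output files end in `.jsonl`, that each line is a JSON object, and that the document text is stored under a `text` key. The function name `iter_texts` is hypothetical; adjust the field name to match the actual output.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_texts(output_dir: str, text_field: str = "text") -> Iterator[str]:
    """Yield the text of every record found in the *.jsonl files of output_dir.

    Assumes one JSON object per line with the text under `text_field`.
    """
    for path in sorted(Path(output_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                record = json.loads(line)
                # Ignore records that are not objects or lack the text field
                if isinstance(record, dict) and text_field in record:
                    yield record[text_field]
```

Because the function is a generator, large outputs are streamed line by line instead of being loaded into memory at once.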

Available datasets

A list or table of all available datasets can be printed with the following command:

python -m lm_datasets.print_stats --print_output md

Token count by language

Language	Tokens
bg	53 B
ca	5 B
code	250 B
cs	128 B
da	34 B
de	795 B
el	108 B
en	6 T
es	674 B
et	15 B
eu	696 M
fi	55 B
fr	655 B
ga	767 M
gl	70 M
hr	8 B
hu	179 B
it	386 B
lt	24 B
lv	14 B
mt	4 B
nl	238 B
nn	307 M
no	9 B
pl	223 B
pt	187 B
ro	77 B
sh	2 M
sk	47 B
sl	11 B
sr	10 B
sv	89 B
uk	47 B
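
To compare or sum the abbreviated counts in these tables, they can be converted to integers. The sketch below assumes that the suffixes stand for thousand (k), million (M), billion (B), and trillion (T); the function name `parse_token_count` is hypothetical. Since the table values are rounded, the results are approximate.

```python
# Assumed meaning of the suffixes used in the token count tables
_SUFFIXES = {"k": 10**3, "M": 10**6, "B": 10**9, "T": 10**12}


def parse_token_count(value: str) -> int:
    """Convert an abbreviated count such as '53 B' or '0' to an integer."""
    parts = value.split()
    if len(parts) == 1:
        # Plain number without a suffix, e.g. '0'
        return int(parts[0])
    number, suffix = parts
    return round(float(number) * _SUFFIXES[suffix])


# Example: approximate combined token count of three languages from the table
total = sum(parse_token_count(count) for count in ["53 B", "5 B", "696 M"])
print(total)  # 58696000000
```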

Token count by source

Source	Tokens
academic_slovene_kas	1 B
bgnc_admin_eur	79 M
bgnc_news_corpus	18 M
brwac	3 B
bulgarian_news	283 M
bulnc	567 M
cabernet	712 M
cc_gigafida	127 M
colossal_oscar	208 B
croatian_news_engri	695 M
curlicat	410 M
danewsroom	472 M
danish_gigaword	1 B
dewac	2 B
dialogstudio	0
dk_clarin	441 M
enc2021	0
estonian_reference_corpus	175 M
eurlex	121 B
euscrawl	423 M
ga_bilingual_legistation	4 M
ga_universal_dependencies	3 M
greek_legal_code	45 M
greek_web_corpus	3 B
hrwac	1 B
itwac	2 B
korpus_malti	366 M
legal_mc4	29 B
macocu	23 B
marcell_legislative_subcorpus_v2	31 M
norwegian_cc	5 B
opengptx	26 B
openlegaldata	10 B
oscar	9 T
oscar_opengptx	245 B
parlamento_pt	819 M
pes2o	42 B
pl_nkjp	1 M
pl_parliamentary_corpus	671 M
proof_pile	8 B
redpajama	46 B
seimas_lt_en	48 k
sk_court_decisions	11 B
sk_laws	45 M
slwac_web	1 B
sonar	500 M
sonar_new_media	36 M
spanish_legal	3 B
srpkor	0
starcoder	250 B
state_related_latvian_web	1 M
styria_news	409 M
sv_gigaword	1 B
syn_v9	5 B
uk_laws	579 M
wiki	12 B
wikibooks	353 M
wikihow	2 M
wikinews	79 M
wikiquote	268 M
wikisource	2 B
wikivoyage	132 M
ylenews	0

Dataset viewer

We provide a web-based application, built with Streamlit, for browsing all datasets and their text content. To start the app, run the following command:

streamlit run viewer/app.py $RAW_DATASETS_DIR $PROCESSED_DATASET_DIR

Development & Contributions

Setup environment

git clone git@github.com:malteos/lm-datasets.git
cd lm-datasets

conda create -n lm-datasets python=3.10
conda activate lm-datasets

pip install -r requirements.txt

Install the pre-commit hooks

This repository uses git hooks to validate code quality and formatting.

pre-commit install
git config --bool flake8.strict true  # Makes the commit fail if flake8 reports an error

To run the hooks:

pre-commit run --all-files

Testing

The tests can be executed with:

pytest --doctest-modules --cov-report term --cov=lm_datasets

License

Apache 2.0

(Please note that the actual datasets are released with different licenses)
