This repo contains code developed as part of a collaboration with the International Consortium of Investigative Journalists (ICIJ).
This script loads a TSV of MAUDE records and applies all labeling functions.
python preprocess.py \
--outdir results/gender/ \
--train data/MaudeFull8M.tsv \
--chunksize 1000000 \
--n_procs 36
This takes ~96 minutes to run on 8M records using 36 CPU cores.
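As a rough illustration of the chunked processing described above, the sketch below reads a TSV in fixed-size chunks and applies a list of labeling functions to each row. It is not the code in preprocess.py: the two labeling functions, the `text` column name, and the helper names `iter_chunks` and `apply_lfs` are hypothetical, chosen only for the example. In a parallel setup, each chunk could be handed to a worker process instead of being processed in a loop.

```python
import csv
import io
import re
from itertools import islice

# Hypothetical labeling functions (not the repository's): each maps a report
# text to "MALE", "FEMALE", or None (abstain).
def lf_male_terms(text):
    return "MALE" if re.search(r"\b(male|man|he|his)\b", text, re.I) else None

def lf_female_terms(text):
    return "FEMALE" if re.search(r"\b(female|woman|she|her)\b", text, re.I) else None

LFS = [lf_male_terms, lf_female_terms]

def iter_chunks(reader, chunksize):
    """Yield lists of at most `chunksize` rows from a csv reader."""
    while True:
        chunk = list(islice(reader, chunksize))
        if not chunk:
            return
        yield chunk

def apply_lfs(rows, text_col="text"):
    """Return one list of labeling-function outputs per row.

    `text_col` is an assumed column name, not the actual MAUDE schema.
    """
    return [[lf(row[text_col]) for lf in LFS] for row in rows]

# Tiny in-memory TSV standing in for a MAUDE file.
sample = (
    "id\ttext\n"
    "1\tThe patient said she felt pain.\n"
    "2\tHe reported a device failure.\n"
    "3\tDevice malfunctioned.\n"
)
reader = csv.DictReader(io.StringIO(sample), delimiter="\t")

results = []
for chunk in iter_chunks(reader, chunksize=2):
    results.extend(apply_lfs(chunk))
```

Here `results` holds one row of votes per record, with None wherever a labeling function abstained.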
All documents are derived from the FDA's public MAUDE database. All datasets can be downloaded with:
./download.sh
- MaudeSample20k 16MB. Uniform random sample of 20k records.
data/MaudeSample20k.tsv
- MaudeSample500k 400MB. Uniform random sample of 500k records.
data/MaudeSample500k.tsv
- MaudeSample2M 1.5GB. Uniform random sample of 2M records.
data/MaudeSample2M.tsv
All documents are labeled with y ∈ {MALE, FEMALE, UNK}
- Uniform random sample of 1000 docs from MaudeSample2M
data/labels/1k.sample.seed_1234.FINAL.tsv.bz2
1MB. 1 annotator.
- Gender terms query sample of 1000 docs from MaudeSample2M
data/labels/1k.query_sample.seed_1234.FINAL.tsv.bz2
1MB. 1 annotator.
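The labeled files are bz2-compressed TSVs, so they can be read without decompressing them first. The following is a minimal sketch using only the Python standard library. It assumes each file starts with a header row; the `label` column name and the helper names `read_labels` and `label_counts` are hypothetical, since the actual column layout is not documented here.

```python
import bz2
import csv
from collections import Counter

def read_labels(path):
    """Read a bz2-compressed TSV into a list of dicts keyed by the header row.

    Assumes the first line of the file is a header.
    """
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as f:
        return list(csv.DictReader(f, delimiter="\t"))

def label_counts(rows, label_col="label"):
    """Count how often each label occurs.

    `label_col` is an assumed column name; adjust it to the real header.
    """
    return Counter(row[label_col] for row in rows)
```

For example, `label_counts(read_labels("data/labels/1k.sample.seed_1234.FINAL.tsv.bz2"))` would give the distribution over MALE, FEMALE, and UNK, provided the label column is named as assumed.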
These are model-predicted labels (either the majority vote of LFs or an end model such as BERT).
All files are on Dropbox here.
- GENDER
MV_2M_MAUDE_38_LFs_2019-6-21.tsv.bz2
15MB.
- GENDER
MV_DEATH_INJURY_MAUDE_38_LFs_2019-6-25.tsv.bz2
23MB.
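To make the "majority vote of LFs" idea concrete, here is a minimal sketch of how votes from several labeling functions could be combined into one label per document. It is an illustration under stated assumptions, not necessarily how the MV files above were produced: abstentions are represented as None, and both ties and all-abstain cases fall back to UNK.

```python
from collections import Counter

ABSTAIN = None  # assumed representation of a labeling function that does not vote

def majority_vote(votes, default="UNK"):
    """Return the most common non-abstain vote.

    Falls back to `default` when every labeling function abstains or when
    the top two labels are tied (an assumption made for this sketch).
    """
    counts = Counter(v for v in votes if v is not ABSTAIN)
    if not counts:
        return default
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return default
    return ranked[0][0]
```

With this rule, a document where two labeling functions vote MALE and one votes FEMALE is labeled MALE, while a one-to-one split is labeled UNK.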