Analyzing MAUDE with Snorkel

This repo contains code developed as part of a collaboration with the International Consortium of Investigative Journalists (ICIJ).

I. Labeling Functions

This script loads a TSV of MAUDE records and applies all labeling functions.

python preprocess.py \
	--outdir results/gender/ \
	--train data/MaudeFull8M.tsv \
	--chunksize 1000000 \
	--n_procs 36

This takes ~96 minutes to run on 8M records using 36 CPU cores.
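For readers new to labeling functions (LFs), the sketch below shows the general idea: each LF inspects a record's text and either votes for a label or abstains. This is a minimal illustration in plain Python, not the code in preprocess.py. The function names, the keyword lists, and the integer label codes are all assumptions made for this example.

```python
import re

# Hypothetical integer codes; the repo's actual encoding may differ.
ABSTAIN, MALE, FEMALE = -1, 0, 1

# Illustrative keyword patterns only; real LFs are likely more careful.
FEMALE_RE = re.compile(r"\b(she|her|hers|female|woman)\b", re.IGNORECASE)
MALE_RE = re.compile(r"\b(he|him|his|male|man)\b", re.IGNORECASE)


def lf_female_terms(text):
    """Vote FEMALE if the text contains a female term, otherwise abstain."""
    return FEMALE if FEMALE_RE.search(text) else ABSTAIN


def lf_male_terms(text):
    """Vote MALE if the text contains a male term, otherwise abstain."""
    return MALE if MALE_RE.search(text) else ABSTAIN


def apply_lfs(texts, lfs):
    """Return one row of votes per text, with one column per LF."""
    return [[lf(text) for lf in lfs] for text in texts]


if __name__ == "__main__":
    records = [
        "The patient reported that she felt pain.",
        "Device malfunctioned.",
    ]
    print(apply_lfs(records, [lf_female_terms, lf_male_terms]))
```

The result is a votes matrix (records by LFs) that a downstream step can aggregate into a single label per record.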

All documents are derived from the FDA's public MAUDE database. All datasets can be downloaded with:

./download.sh

MaudeSample20k (16MB). Uniform random sample of 20k records. data/MaudeSample20k.tsv
MaudeSample500k (400MB). Uniform random sample of 500k records. data/MaudeSample500k.tsv
MaudeSample2M (1.5GB). Uniform random sample of 2M records. data/MaudeSample2M.tsv
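This README does not say how the samples above were drawn. As one possible approach, the sketch below draws a reproducible uniform random sample from a file too large to hold in memory, using reservoir sampling (Algorithm R). The function name and the default seed are assumptions made for this example.

```python
import random


def reservoir_sample(lines, k, seed=1234):
    """Return k items chosen uniformly at random from an iterable.

    Streams over the input once, so the full file never has to fit in memory.
    The seed value is arbitrary and only makes the result reproducible.
    """
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(line)
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = line
    return sample


if __name__ == "__main__":
    toy_records = (f"record_{n}" for n in range(1000))
    print(reservoir_sample(toy_records, k=5))
```

Passing an open file handle as `lines` samples rows directly from a TSV. A header row, if present, would need to be skipped first.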

All documents are labeled with y ∈ {MALE, FEMALE, UNK}

Uniform random sample of 1000 docs from MaudeSample2M. data/labels/1k.sample.seed_1234.FINAL.tsv.bz2 (1MB). 1 annotator.
Gender terms query sample of 1000 docs from MaudeSample2M. data/labels/1k.query_sample.seed_1234.FINAL.tsv.bz2 (1MB). 1 annotator.
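The label files are bzip2-compressed TSVs, which Python can read without decompressing them to disk first. The sketch below counts how often each label appears in one column. It assumes the file has a header row, and the column name passed as `label_col` is hypothetical, so check the actual header before using it.

```python
import bz2
import csv
from collections import Counter


def label_distribution(path, label_col):
    """Count label values in one column of a bz2-compressed TSV.

    Assumes the first row is a header; `label_col` must match a header name.
    """
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        return Counter(row[label_col] for row in reader)
```

For example, `label_distribution("data/labels/1k.sample.seed_1234.FINAL.tsv.bz2", "label")` would return counts per label, if the label column is in fact named `label`.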

These are model-predicted labels (either the majority vote of the LFs or the output of an end model such as BERT).
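To make "majority vote of LFs" concrete, here is a minimal sketch of one way to aggregate per-record LF votes into a single label. It is an illustration, not the repo's implementation. Using `None` for abstentions and resolving ties and all-abstain rows to UNK are both assumptions.

```python
from collections import Counter


def majority_vote(votes, abstain=None, default="UNK"):
    """Return the most common non-abstain vote for one record.

    Falls back to `default` when every LF abstains or the top two labels tie.
    """
    counts = Counter(v for v in votes if v != abstain)
    if not counts:
        return default
    top = counts.most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return default
    return top[0][0]


if __name__ == "__main__":
    print(majority_vote(["FEMALE", "FEMALE", "MALE", None]))
    print(majority_vote([None, None]))
```

A plain majority vote weights every LF equally. A learned label model or an end model would instead account for differences in LF accuracy.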

All files are on Dropbox here.