hieuqtran / icij-maude

Weakly supervised classification of adverse event reports from the FDA's MAUDE database.

Analyzing MAUDE with Snorkel

This repo contains code developed as part of a collaboration with the International Consortium of Investigative Journalists (ICIJ).

I. Labeling Functions

The preprocess.py script loads a TSV of MAUDE records and applies all labeling functions.

python preprocess.py \
	--outdir results/gender/ \
	--train data/MaudeFull8M.tsv \
	--chunksize 1000000 \
	--n_procs 36

This takes ~96 minutes to run on 8M records using 36 CPU cores.
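For readers new to labeling functions, the idea can be illustrated with a minimal sketch. It is not the code in preprocess.py: the two keyword-based functions, the integer vote encoding, and the `chunks` helper are all hypothetical and only show the general pattern of applying every labeling function to records one chunk at a time.

```python
import re
from itertools import islice

# Hypothetical vote encoding; the repo's own encoding may differ.
ABSTAIN, MALE, FEMALE = 0, 1, 2


def lf_female_pronoun(text):
    """Vote FEMALE if a female pronoun appears, otherwise abstain."""
    return FEMALE if re.search(r"\b(she|her|hers)\b", text, re.I) else ABSTAIN


def lf_male_pronoun(text):
    """Vote MALE if a male pronoun appears, otherwise abstain."""
    return MALE if re.search(r"\b(he|him|his)\b", text, re.I) else ABSTAIN


# Illustrative only; the repo's file names mention 38 LFs.
LFS = [lf_female_pronoun, lf_male_pronoun]


def apply_lfs(texts):
    """Return one row of votes per document, one column per labeling function."""
    return [[lf(text) for lf in LFS] for text in texts]


def chunks(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


docs = [
    "The patient reported that she felt pain.",
    "He was hospitalized.",
    "Device malfunctioned.",
]

# Process the documents chunk by chunk and stack the votes into one matrix.
label_matrix = [row for chunk in chunks(docs, 2) for row in apply_lfs(chunk)]
```

Because each chunk is processed independently, the chunks could in principle be handed to separate worker processes, which is presumably what the --chunksize and --n_procs options control.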

II. Datasets

All documents are derived from the FDA's public MAUDE database. All datasets can be downloaded with:

./download.sh

1. Unlabeled Documents

  • MaudeSample20k 16MB. Uniform random sample of 20k records. data/MaudeSample20k.tsv
  • MaudeSample500k 400MB. Uniform random sample of 500k records. data/MaudeSample500k.tsv
  • MaudeSample2M 1.5GB. Uniform random sample of 2M records. data/MaudeSample2M.tsv
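The README does not say how these samples were drawn. As one possible approach, a uniform random sample of k records can be taken in a single pass with reservoir sampling, without loading the full file into memory. The sketch below is an assumption for illustration; the function name, the seed, and the toy records are hypothetical.

```python
import random


def reservoir_sample(lines, k, seed=0):
    """Return k items chosen uniformly at random from an iterable (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(line)
        else:
            # Keep item i with probability k / (i + 1) by drawing a slot in [0, i].
            j = rng.randint(0, i)
            if j < k:
                sample[j] = line
    return sample


# Toy stand-in for the lines of a large TSV file.
records = [f"record_{i}\ttext" for i in range(1000)]
subset = reservoir_sample(records, 20)
```

For a real file, an open file handle can be passed in place of `records`, since the function only iterates once over its input.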

2. GENDER Labels

All documents are labeled with y ∈ {MALE, FEMALE, UNK}, where UNK denotes unknown gender.

3. Model-generated Labels

These are model-predicted labels, taken from either the majority vote of the labeling functions (LFs) or an end model such as BERT.
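As a rough illustration of the majority-vote option, the sketch below combines per-document LF votes into a single label. It assumes that each LF votes MALE, FEMALE, or abstains, and that documents with no votes or a tied vote fall back to UNK; neither rule is confirmed by the repo, and the `ABSTAIN` marker is hypothetical.

```python
from collections import Counter

ABSTAIN = "ABSTAIN"  # hypothetical marker for an LF that casts no vote
UNK = "UNK"


def majority_vote(votes):
    """Return the most common non-abstain vote, or UNK on a tie or no votes."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return UNK
    ranked = counts.most_common(2)
    # Assumed tie-breaking rule: a tie between the top two labels yields UNK.
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return UNK
    return ranked[0][0]


labels = [
    majority_vote(["MALE", "MALE", "FEMALE"]),
    majority_vote(["ABSTAIN", "ABSTAIN"]),
    majority_vote(["MALE", "FEMALE"]),
]
```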

All files are on Dropbox here.

  • GENDER MV_2M_MAUDE_38_LFs_2019-6-21.tsv.bz2 15MB.
  • GENDER MV_DEATH_INJURY_MAUDE_38_LFs_2019-6-25.tsv.bz2 23MB.
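The .tsv.bz2 files can be read without decompressing them to disk first. The sketch below uses only the Python standard library. The column layout of the real files is not documented here, so the demo writes a tiny file with two hypothetical columns (a record ID and a label) and reads it back.

```python
import bz2
import csv
import os
import tempfile


def read_tsv_bz2(path, limit=None):
    """Read rows from a bz2-compressed TSV file, optionally stopping after `limit` rows."""
    rows = []
    with bz2.open(path, "rt", encoding="utf-8", newline="") as handle:
        for i, row in enumerate(csv.reader(handle, delimiter="\t")):
            if limit is not None and i >= limit:
                break
            rows.append(row)
    return rows


# Demo file with hypothetical columns; the real files may be laid out differently.
tmp_path = os.path.join(tempfile.mkdtemp(), "demo.tsv.bz2")
with bz2.open(tmp_path, "wt", encoding="utf-8", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows([["1001", "FEMALE"], ["1002", "UNK"]])

rows = read_tsv_bz2(tmp_path)
```

Passing a small `limit` is a cheap way to inspect the first few rows of a large file before committing to a full read.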
