tutubalinaev / EBM-NLP

EBM-NLP

This corpus release contains 4,993 abstracts annotated with (P)articipants, (I)nterventions, and (O)utcomes. Training labels are sourced from Amazon Mechanical Turk (AMT) workers and aggregated to reduce noise. Test labels are collected from medical professionals. A sample annotated document looks like:

[Figure: sample annotated abstract]

Full annotations are available in ebm_nlp_*.tar.gz. The archive is organized as follows.

  • documents/: Documents are labeled by their PubMed identification number (PMID). Each document has two files:
    • documents/{PMID}.text: Raw text of the abstract
    • documents/{PMID}.tokens: Tokenized text to which the labels are assigned
  • annotations/{aggregated|individual}/: Since each document is multiply-annotated, we present two versions of the data:
    • aggregated: Recommended. One set of labels per document, derived from a voting strategy.
    • individual: All labels from each worker (multiply-annotated documents, noisy)
  • .../{starting_spans|hierarchical_labels}/
    • starting_spans/: Labels for P/I/O text spans
    • hierarchical_labels/: Detailed labels for each starting span
  • .../{participants|interventions|outcomes}/: Labels for each P/I/O element are separated since they occasionally overlap (for 3% of tokens). An example of combining them for joint learning can be found in https://github.com/bepnye/EBM-NLP/tree/master/models/lstm-crf
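The layout above can be navigated with a small helper, and the three separate span label sequences can be merged into one tag per token. The following is a minimal sketch, not the method used in the linked lstm-crf example. The names `annotation_dir` and `combine_spans` are hypothetical. It assumes that a starting-span label of 0 means "no label" and any non-zero value means the token lies inside a span, and it resolves overlaps with a fixed priority (P, then I, then O), which is an arbitrary choice made for illustration. File names inside the annotation directories are not described here, so the sketch stops at the directory level.

```python
from pathlib import Path

VERSIONS = ("aggregated", "individual")
LEVELS = ("starting_spans", "hierarchical_labels")
ELEMENTS = ("participants", "interventions", "outcomes")


def annotation_dir(root, version, level, element):
    """Return <root>/annotations/<version>/<level>/<element> (hypothetical helper)."""
    if version not in VERSIONS:
        raise ValueError(f"unknown version: {version}")
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    if element not in ELEMENTS:
        raise ValueError(f"unknown element: {element}")
    return Path(root) / "annotations" / version / level / element


def combine_spans(p_labels, i_labels, o_labels):
    """Merge per-token P/I/O span labels into a single tag per token.

    Assumption: 0 means no label, non-zero means inside a span.
    Overlapping tokens are resolved by priority P > I > O.
    "N" marks a token with no label in any of the three sequences.
    """
    if not (len(p_labels) == len(i_labels) == len(o_labels)):
        raise ValueError("label sequences must have the same length")
    tags = []
    for p, i, o in zip(p_labels, i_labels, o_labels):
        if int(p):
            tags.append("P")
        elif int(i):
            tags.append("I")
        elif int(o):
            tags.append("O")
        else:
            tags.append("N")
    return tags
```

For example, `combine_spans([1, 0, 0, 0], [1, 1, 0, 0], [0, 0, 1, 0])` tags the first token as P even though it is also labeled as an intervention. A multi-label encoding would preserve such overlaps at the cost of a larger tag set.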

The label mappings for each PIO element are:

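| label | P           | I             | O               |
|-------|-------------|---------------|-----------------|
| 0     | No label    | No label      | No label        |
| 1     | Age         | Surgical      | Physical        |
| 2     | Sex         | Physical      | Pain            |
| 3     | Sample size | Drug          | Mortality       |
| 4     | Condition   | Educational   | Adverse effects |
| 5     |             | Psychological | Mental          |
| 6     |             | Other         | Other           |
| 7     |             | Control       |                 |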
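
For use in code, the label mappings above can be written as a lookup table. This is a sketch under assumptions: the names `PIO_LABELS` and `decode_labels` are hypothetical, and the sketch reads the table as giving the P column labels 0 to 4, the I column labels 0 to 7, and the O column labels 0 to 6.

```python
# Label id -> name for each PIO element, transcribed from the table above.
PIO_LABELS = {
    "participants": {
        0: "No label", 1: "Age", 2: "Sex", 3: "Sample size", 4: "Condition",
    },
    "interventions": {
        0: "No label", 1: "Surgical", 2: "Physical", 3: "Drug",
        4: "Educational", 5: "Psychological", 6: "Other", 7: "Control",
    },
    "outcomes": {
        0: "No label", 1: "Physical", 2: "Pain", 3: "Mortality",
        4: "Adverse effects", 5: "Mental", 6: "Other",
    },
}


def decode_labels(element, label_ids):
    """Map a sequence of label ids (ints or digit strings) to label names."""
    names = PIO_LABELS[element]
    return [names[int(label_id)] for label_id in label_ids]
```

An id that is not defined for the given element, such as 5 for participants, raises a `KeyError` rather than being silently mapped.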
