1. DeepEventMine

A deep learning model to predict named entities, triggers, and nested events from biomedical texts.

The model and results are reported in our paper:

DeepEventMine: End-to-end Neural Nested Event Extraction from Biomedical Texts, Bioinformatics, 2020.

1.1. Features

Based on pre-trained BERT
Predict nested entities and nested events
Provide our trained models on the seven biomedical tasks
Reproduce the results reported in our Bioinformatics paper
Predict for new data given raw text input or PubMed ID
Visualize the predicted entities and events with brat

1.2. Tasks

DeepEventMine has been trained and evaluated on the following tasks (six BioNLP shared tasks and MLEE).

cg: Cancer Genetics (CG), 2013
ge11: GENIA Event Extraction (GENIA), 2011
ge13: GENIA Event Extraction (GENIA), 2013
id: Infectious Diseases (ID), 2011
epi: Epigenetics and Post-translational Modifications (EPI), 2011
pc: Pathway Curation (PC), 2013
mlee: Multi-Level Event Extraction (MLEE)

1.3. Our trained models and scores

2. Preparation

2.1. Requirements

Python 3.6.5
PyTorch (torch==1.1.0 torchvision==0.3.0, cuda92)

virtualenv -p python3 pytorch-env
source pytorch-env/bin/activate
export CUDA_VISIBLE_DEVICES=0
CUDA_PATH=/usr/local/cuda pip install torch==1.1.0 torchvision==0.3.0
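To confirm that the active environment matches the versions pinned above, a small check like the following may help. It is only a sketch: the helper names `installed_versions` and `mismatches` are hypothetical and not part of DeepEventMine, and it assumes that a local build suffix such as `+cu92` in a version string can be ignored.

```python
import sys

# Versions pinned in the Requirements list above.
PINNED = {"python": "3.6.5", "torch": "1.1.0", "torchvision": "0.3.0"}


def installed_versions():
    """Return the versions found in the current environment (None if missing)."""
    found = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    for name in ("torch", "torchvision"):
        try:
            module = __import__(name)
            # Assumption: drop a local build suffix such as "+cu92" before comparing.
            found[name] = module.__version__.split("+")[0]
        except ImportError:
            found[name] = None
    return found


def mismatches(found, pinned=PINNED):
    """Return {package: (found, expected)} for every package that differs."""
    return {
        name: (found.get(name), expected)
        for name, expected in pinned.items()
        if found.get(name) != expected
    }


if __name__ == "__main__":
    for name, (have, want) in mismatches(installed_versions()).items():
        print("{}: found {}, expected {}".format(name, have, want))
```

The script prints nothing when all three versions match.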

Install Python packages

sh install.sh

2.2. BERT

Download the SciBERT model (PyTorch version) from AllenNLP

sh download.sh bert

2.3. DeepEventMine

Download pre-trained DeepEventMine model on a given task
[task] = cg (or pc, ge11, epi, etc)

sh download.sh deepeventmine [task]

2.4. Brat

To visualize the output using brat, download brat v1.3

sh download.sh brat

Install brat based on the brat instructions

cd brat/brat-v1.3_Crunchy_Frog/
./install.sh -u
python2 standalone.py

3. Predict (BioNLP tasks)

3.1. Prepare data

Download corpora

Download the original data sets from the BioNLP shared tasks.
[task] = cg, pc, ge11, etc

sh download.sh bionlp [task]

Preprocess data

Tokenize texts and prepare data for prediction

sh preprocess.sh bionlp

Generate configs

If using GPU: [gpu] = 0, otherwise: [gpu] = -1
[task] = cg, pc, etc

sh run.sh config [task] [gpu]

3.2. Predict

For development and test sets (given gold entities)

CG task: [task] = cg
PC task: [task] = pc
Similarly for: ge11, ge13, epi, id, mlee

sh run.sh predict [task] gold dev
sh run.sh predict [task] gold test

Check the output in the path

experiments/[task]/predict-gold-dev/
experiments/[task]/predict-gold-test/
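To run the prediction step for several tasks in one go, the commands above can be generated programmatically. The following is a minimal sketch that only builds the command lines and the matching output paths shown above; the function name `predict_commands` is hypothetical and not part of the repository.

```python
# Task identifiers listed in Section 1.2.
TASKS = ["cg", "ge11", "ge13", "id", "epi", "pc", "mlee"]


def predict_commands(tasks=TASKS, splits=("dev", "test")):
    """Return (command, output_dir) pairs for prediction given gold entities."""
    jobs = []
    for task in tasks:
        for split in splits:
            # Same form as: sh run.sh predict [task] gold dev
            command = ["sh", "run.sh", "predict", task, "gold", split]
            output_dir = "experiments/{}/predict-gold-{}/".format(task, split)
            jobs.append((command, output_dir))
    return jobs


if __name__ == "__main__":
    for command, output_dir in predict_commands():
        print("{}  # output: {}".format(" ".join(command), output_dir))
```

Each command list can be passed to `subprocess.run(command, check=True)` from the repository root once the configs for that task have been generated.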

3.3. Evaluate

Retrieve the original offsets and create a zip archive

sh run.sh offset [task] gold dev
sh run.sh offset [task] gold test

Submit the zipped file to the shared task evaluation sites.

Evaluate events

Evaluate event prediction for PC and CG tasks on the development sets using the shared task scripts.
Evaluation options: s (softboundary), p (partialrecursive)

sh run.sh eval [task] gold dev sp

4. End-to-end

4.1. Input: a single PMID or PMCID

Abstract

sh pubmed.sh e2e pmid 1370299 cg 0

Full text

sh pubmed.sh e2e pmcid PMC4353630 cg 0

Input: PMID 1370299 or PMCID PMC4353630 (a single PubMed ID from which the raw text is retrieved)
Model to predict: DeepEventMine trained on cg (Cancer Genetics 2013), (other options: pc, ge11, etc)
GPU: 0 (if CPU: -1)
Output: predictions in brat format, with brat visualization

T24	Organism 1248 1254	bovine
T25	Gene_or_gene_product 1255 1259	u-PA
T55	Positive_regulation 1107 1116	increased
T57	Localization 1170 1179	migration
T58	Negative_regulation 1260 1267	blocked
...

T23	Gene_or_gene_product 1184 1188	u-PA
T56	Positive_regulation 1157 1166	increases
E9	Positive_regulation:T56 Theme:T23

T26	Gene_or_gene_product 1320 1325	c-src
T62	Gene_expression 1326 1336	expression
E10	Gene_expression:T62 Theme:T26

T61	Positive_regulation 1310 1319	increased
E24	Positive_regulation:T61 Theme:E10
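The last two groups above show a nested event: E24 takes event E10 as its Theme. To read such output programmatically, a small parser can resolve the nesting. The sketch below assumes tab-separated brat standoff lines as shown above and handles only contiguous text-bound annotations (T lines) and events (E lines); the names `parse_ann` and `render` are hypothetical and not part of DeepEventMine.

```python
from collections import namedtuple

Span = namedtuple("Span", "id type start end text")
Event = namedtuple("Event", "id type trigger args")


def parse_ann(lines):
    """Parse brat T (text-bound) and E (event) lines into two dicts keyed by ID."""
    spans, events = {}, {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("T"):
            ann_id, info, text = line.split("\t", 2)
            ann_type, start, end = info.split(" ")
            spans[ann_id] = Span(ann_id, ann_type, int(start), int(end), text)
        elif line.startswith("E"):
            ann_id, info = line.split("\t", 1)
            head, *rest = info.split()
            ann_type, trigger = head.split(":")
            # Each argument is ROLE:ID, where ID is an entity (T) or an event (E).
            args = [tuple(item.split(":")) for item in rest]
            events[ann_id] = Event(ann_id, ann_type, trigger, args)
    return spans, events


def render(ann_id, spans, events):
    """Render an annotation as text, expanding event arguments recursively."""
    if ann_id in spans:
        return spans[ann_id].text
    event = events[ann_id]
    parts = [
        "{}={}".format(role, render(target, spans, events))
        for role, target in event.args
    ]
    return "{}[{}]({})".format(event.type, spans[event.trigger].text, ", ".join(parts))


SAMPLE = [
    "T26\tGene_or_gene_product 1320 1325\tc-src",
    "T62\tGene_expression 1326 1336\texpression",
    "E10\tGene_expression:T62 Theme:T26",
    "T61\tPositive_regulation 1310 1319\tincreased",
    "E24\tPositive_regulation:T61 Theme:E10",
]

if __name__ == "__main__":
    spans, events = parse_ann(SAMPLE)
    print(render("E24", spans, events))
    # prints: Positive_regulation[increased](Theme=Gene_expression[expression](Theme=c-src))
```

The recursion in `render` follows event arguments that point to other events, which is how the nested structure is encoded in the output.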

4.2. Input: a list of PMIDs

Given an arbitrary name for your raw text data, for example "my-pubmed"
Prepare a list of PMIDs and PMCIDs in the file

data/my-pubmed/pmid.txt

sh pubmed.sh e2e pmids my-pubmed cg 0

4.3. Input: raw text files

Given an arbitrary name for your raw text data, for example "my-pubmed"
Prepare your raw text files in the path

data/my-pubmed/text/PMID-*.txt
data/my-pubmed/text/PMC-*.txt

sh pubmed.sh e2e rawtext my-pubmed cg 0

5. Predict for new data (step-by-step)

Input: your own raw text or PubMed ID
Output: predicted entities and events in brat format

5.1. Raw text

Given an arbitrary name for your raw text data, for example "my-pubmed"
Prepare your own raw text in the following path

data/my-pubmed/text/PMID-*.txt
data/my-pubmed/text/PMC-*.txt
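If the raw text is produced by another script, the files can be written directly into the layout above. This is a minimal sketch: the helper `write_raw_text` is hypothetical, and both the UTF-8 encoding and the use of a bare identifier after the `PMID-` or `PMC-` prefix are assumptions, since the pattern above only fixes the prefix and the `.txt` extension.

```python
import os

# Filename prefixes taken from the patterns PMID-*.txt and PMC-*.txt above.
ALLOWED_PREFIXES = ("PMID", "PMC")


def write_raw_text(data_name, prefix, identifier, text, root="data"):
    """Write one document to <root>/<data_name>/text/<prefix>-<identifier>.txt."""
    if prefix not in ALLOWED_PREFIXES:
        raise ValueError("prefix must be one of {}".format(ALLOWED_PREFIXES))
    text_dir = os.path.join(root, data_name, "text")
    os.makedirs(text_dir, exist_ok=True)
    path = os.path.join(text_dir, "{}-{}.txt".format(prefix, identifier))
    # Assumption: the text files are UTF-8 encoded.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return path
```

For example, `write_raw_text("my-pubmed", "PMID", "1370299", abstract)` creates `data/my-pubmed/text/PMID-1370299.txt`, where `abstract` is a string holding the text.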

5.2. PubMed ID

Alternatively, you can retrieve the raw text automatically from a PubMed ID or PMC ID

Get raw text

PubMed ID list

To get the full text for a PMC ID, the article must be available in ePub format (in the current version).
Prepare your list of PubMed IDs and PMC IDs in the file

data/my-pubmed/pmid.txt

Get text from the PubMed ID

sh pubmed.sh pmids my-pubmed

PubMed ID

You can also get the text by directly providing a PubMed ID or PMC ID

sh pubmed.sh pmid 1370299
sh pubmed.sh pmcid PMC4353630

Preprocess

sh pubmed.sh preprocess my-pubmed

5.3. Predict

Generate config

Generate config for prediction
The data name to predict: my-pubmed
The trained model used for prediction: cg (or pc, ge11, etc)
If using GPU: [gpu] = 0, otherwise: [gpu] = -1

sh pubmed.sh config my-pubmed cg 0

Predict

sh pubmed.sh predict my-pubmed

Retrieve the original offsets

sh pubmed.sh offset my-pubmed

Check the output in

experiments/my-pubmed/results/ev-last/my-pubmed-brat
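The step-by-step commands of Sections 5.2 and 5.3 can be chained in order. The sketch below only assembles the command lines listed above; the function name `pipeline_commands` is hypothetical, and it assumes that the preprocess step is also needed when the raw text was prepared by hand as in Section 5.1.

```python
def pipeline_commands(data_name, model="cg", gpu=0, fetch=True):
    """Return the Section 5 commands in the order they should run."""
    steps = []
    if fetch:
        # Only needed when the text is fetched from data/<data_name>/pmid.txt.
        steps.append(["sh", "pubmed.sh", "pmids", data_name])
    steps.append(["sh", "pubmed.sh", "preprocess", data_name])
    # gpu: 0 to use the GPU, -1 to run on CPU.
    steps.append(["sh", "pubmed.sh", "config", data_name, model, str(gpu)])
    steps.append(["sh", "pubmed.sh", "predict", data_name])
    steps.append(["sh", "pubmed.sh", "offset", data_name])
    return steps


if __name__ == "__main__":
    for command in pipeline_commands("my-pubmed", model="cg", gpu=0):
        print(" ".join(command))
```

Running each list with `subprocess.run(command, check=True)` from the repository root stops the chain at the first failing step.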

6. Visualization

6.1. Prepare data

Copy the predicted data into the brat folder to visualize
For the raw text prediction:

sh pubmed.sh brat my-pubmed cg

Or for the shared task

sh run.sh brat [task] gold dev
sh run.sh brat [task] gold test

6.2. Visualize

The data to visualize is located in

brat/brat-v1.3_Crunchy_Frog/data/my-pubmed-brat
brat/brat-v1.3_Crunchy_Frog/data/[task]-brat

7. Acknowledgements

This work is based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO). This work is also supported by PRISM (Public/Private R&D Investment Strategic Expansion PrograM).

8. Citation

@article{10.1093/bioinformatics/btaa540,
    author = {Trieu, Hai-Long and Tran, Thy Thy and Duong, Khoa N A and Nguyen, Anh and Miwa, Makoto and Ananiadou, Sophia},
    title = "{DeepEventMine: End-to-end Neural Nested Event Extraction from Biomedical Texts}",
    journal = {Bioinformatics},
    year = {2020},
    month = {06},
    issn = {1367-4803},
    doi = {10.1093/bioinformatics/btaa540},
    url = {https://doi.org/10.1093/bioinformatics/btaa540},
    note = {btaa540},
    eprint = {https://academic.oup.com/bioinformatics/article-pdf/doi/10.1093/bioinformatics/btaa540/33399046/btaa540.pdf},
}
