
GRAM-CNN

GRAM-CNN is a novel end-to-end approach for biomedical NER tasks. To automatically label a word, the method uses only the local context around that word, so it requires no domain-specific knowledge or feature engineering and can in principle be applied to any NER problem. GRAM-CNN was evaluated on three well-known biomedical datasets covering different BioNER entity types, obtaining F1-scores of 87.38% on the BioCreative II dataset, 86.65% on the NCBI dataset, and 72.57% on the JNLPBA dataset. These results place GRAM-CNN among the leading biomedical NER methods.
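To illustrate the local-context idea, here is a minimal NumPy sketch (not the repository's TensorFlow implementation) of a multi-width n-gram CNN with max-over-time pooling; the random filters and function names are illustrative, while the kernel widths [2, 3, 4] with 40 filters each mirror the -K/-k settings used in the examples below.

```python
import numpy as np

def gram_cnn_features(window, kernel_sizes=(2, 3, 4), num_kernels=(40, 40, 40), seed=0):
    """Toy n-gram CNN: convolve filters of several widths over the
    embeddings of a word's local context and max-pool each width."""
    rng = np.random.default_rng(seed)
    context_len, emb_dim = window.shape
    pooled = []
    for k, n in zip(kernel_sizes, num_kernels):
        # n random filters, each spanning k consecutive embeddings (one n-gram)
        filters = rng.standard_normal((n, k * emb_dim)) * 0.1
        # responses at every valid position in the window
        convs = np.stack([filters @ window[i:i + k].ravel()
                          for i in range(context_len - k + 1)])
        # max-over-time pooling keeps the strongest n-gram response per filter
        pooled.append(convs.max(axis=0))
    # concatenated feature vector; the paper feeds this to a tagging layer
    return np.concatenate(pooled)

window = np.random.default_rng(1).standard_normal((7, 200))  # 7-word context, 200-d embeddings
print(gram_cnn_features(window).shape)  # (120,) = 40 + 40 + 40
```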

Pre-trained embeddings are from:
https://github.com/cambridgeltl/BioNLP-2016
Some code (loader.py and utils.py) is adapted from:
https://github.com/glample/tagger
https://github.com/carpedm20/lstm-char-cnn-tensorflow/blob/master/models/TDNN.py
The examples have to be run from the src directory.
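To sanity-check the embeddings before training, a minimal gensim sketch (assuming the binary word2vec format in which the BioNLP-2016 vectors are distributed):

```python
from gensim.models import KeyedVectors

# Path as passed to --pre_emb in the examples below (relative to src/).
vectors = KeyedVectors.load_word2vec_format(
    "../embeddings/bio_nlp_vec/PubMed-shuffle-win-30.bin", binary=True)
print(vectors.vector_size)                      # 200, matching -w 200 below
print(vectors.most_similar("protein", topn=5))  # nearest neighbours in the space
```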
Source code for the paper:


Requirements:


Datasets (in dataset folder):

Train GRAMCNN example:

> python train.py --train ../dataset/NLPBA/train/train.eng --dev ../dataset/NLPBA/train/dev.eng --test ../dataset/NLPBA/test/Genia4EReval1.iob2 --pre_emb ../embeddings/bio_nlp_vec/PubMed-shuffle-win-30.bin -W 100 -H 1 -D 0.5 --lower 1 -A 0 --tag_scheme iob -P 0 -S 0 -w 200 -K 2,3,4 -k 40,40,40
  • This trains a one-layer bidirectional LSTM with hidden size 100 and a dropout ratio of 0.5; -P 0 means the LSTM model is used.
> python train.py --train dataset/NLPBA/train/train.eng --dev dataset/NLPBA/train/dev.eng --test dataset/NLPBA/test/Genia4EReval1.iob2 --pre_emb embeddings/bio_nlp_vec/PubMed-shuffle-win-30.bin -D 0.5 -A 0 -W 675 -w 200 -H 7 --lower 1 -K 2,3,4 -k 40,40,40 -P 1 -S 0 --tag_scheme iob
  • With -P 1, this trains the GRAM-CNN network; -W and -H have no effect here, and the dropout ratio is 0.5.
  • Detailed parameter settings are described in src/train.py; for the full list of options run:
> python train.py --help

Infer GRAMCNN example:

  • To test a pre-trained model, just replace train.py with infer.py.
  • The result output file is written to the evaluation directory.
> python infer.py --train ../dataset/NLPBA/train/train.eng --dev ../dataset/NLPBA/train/dev.eng --test ../dataset/NLPBA/test/Genia4EReval1.iob2 --pre_emb ../embeddings/bio_nlp_vec/PubMed-shuffle-win-30.bin -W 675 -H 12 -D 0.5 --lower 1 -A 0 --tag_scheme iob -P 1 -S 1 -w 200 -K 2,3,4 -k 40,40,40
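To score the output independently, here is a rough entity-level F1 sketch, assuming a conlleval-style output file (one token per line, gold and predicted IOB tags in the last two columns); the filename is hypothetical and sentence boundaries are ignored for simplicity:

```python
def entities(tags):
    """Collect (start, end, type) spans from an IOB tag sequence."""
    spans, start, etype = set(), None, None
    for i, t in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        inside = t.startswith("I-") and start is not None and t[2:] == etype
        if start is not None and not inside:
            spans.add((start, i, etype))
            start = None
        if t.startswith("B-") or (t.startswith("I-") and start is None):
            start, etype = i, t[2:]  # lenient: a stray I- also opens a span
    return spans

gold, pred = [], []
with open("../evaluation/output.txt") as f:  # hypothetical output filename
    for line in f:
        parts = line.split()
        if len(parts) >= 2:
            gold.append(parts[-2])
            pred.append(parts[-1])

g, p = entities(gold), entities(pred)
tp = len(g & p)
prec = tp / len(p) if p else 0.0
rec = tp / len(g) if g else 0.0
f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
print(f"precision={prec:.2%} recall={rec:.2%} F1={f1:.2%}")
```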

Example pre-trained model:

JNLPBA:

  • use_word=True, use_char=True, drop_out=0.5, hidden_size=675, hidden_layer=12, lower=True, allemb=False, kernels=[2, 3, 4], num_kernels=[40, 40, 40], padding=True, pts=True, w_emb=200

To run the example model on NCBI and BC2, change the kernel range in n_gram.py to [1,10]; a sketch of that change follows.
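This is only a sketch of the edit, assuming n_gram.py defines the n-gram widths as a Python list; the actual variable name in the source may differ.

```python
# In src/n_gram.py: widen the n-gram kernel widths for NCBI / BC2.
# (Illustrative variable name; locate the corresponding list in the file.)
kernels = list(range(1, 11))  # kernel range [1, 10] instead of the JNLPBA setting
```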
