GRAM-CNN is a novel end-to-end approach for biomedical named entity recognition (NER). To label a word automatically, the method uses the local information around that word. As a result, GRAM-CNN requires no task-specific knowledge or feature engineering and can, in principle, be applied to any existing NER problem.

GRAM-CNN was evaluated on three well-known biomedical datasets containing different BioNER entity types. It obtained an F1-score of 87.38% on the Biocreative II dataset, 86.65% on the NCBI dataset, and 72.57% on the JNLPBA dataset. These results place GRAM-CNN at the forefront of biomedical NER methods.
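To make "local information around the word" concrete, here is a minimal sketch that collects every n-gram window covering a given token. It only illustrates the idea of local context; it is not the GRAM-CNN implementation, and the function name `ngram_windows`, the example sentence, and the widths 2, 3 and 4 are chosen purely for illustration.

```python
def ngram_windows(tokens, index, widths=(2, 3, 4)):
    """For each width n, return every n-gram of `tokens` that covers position `index`.

    Hypothetical helper for illustration only; not part of this repository.
    """
    windows = {}
    for n in widths:
        grams = []
        # An n-gram starting at `start` covers positions start .. start + n - 1,
        # so it contains `index` when index - n + 1 <= start <= index.
        first = max(0, index - n + 1)
        last = min(index, len(tokens) - n)
        for start in range(first, last + 1):
            grams.append(tuple(tokens[start:start + n]))
        windows[n] = grams
    return windows


sentence = "mutations in the BRCA1 gene increase risk".split()
context = ngram_windows(sentence, sentence.index("BRCA1"))
```

In a convolutional model, each such window would be mapped to a feature vector by a learned filter; the sketch stops at enumerating the windows.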
Pre-trained embeddings are from:
https://github.com/cambridgeltl/BioNLP-2016
Some code (loader.py and utils.py) is adapted from:
https://github.com/glample/tagger
https://github.com/carpedm20/lstm-char-cnn-tensorflow/blob/master/models/TDNN.py
The examples must be run from the src directory.
Source code for the paper:
- TensorFlow 1.0.0 : pip install tensorflow==1.0.0
- gensim 0.13.2 : pip install gensim==0.13.2
- numpy : pip install numpy
- Python 2.7
- pre-trained embeddings: download from https://drive.google.com/open?id=0BzMCqpcgEJgiUWs0ZnU0NlFTam8 and put them into the embeddings folder
- Biocreative II (http://biocreative.sourceforge.net/biocreative_2_dataset.html)
- NCBI (https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/)
- JNLPBA (http://www.nactem.ac.uk/tsujii/GENIA/ERtask/report.html)
> python train.py --train ../dataset/NLPBA/train/train.eng --dev ../dataset/NLPBA/train/dev.eng --test ../dataset/NLPBA/test/Genia4EReval1.iob2 --pre_emb ../embeddings/bio_nlp_vec/PubMed-shuffle-win-30.bin -W 100 -H 1 -D 0.5 --lower 1 -A 0 --tag_scheme iob -P 0 -S 0 -w 200 -K 2,3,4 -k 40,40,40
- This trains a one-layer bidirectional LSTM network with hidden size 100 and dropout ratio 0.5. Setting -P to 0 selects the LSTM model.
> python train.py --train ../dataset/NLPBA/train/train.eng --dev ../dataset/NLPBA/train/dev.eng --test ../dataset/NLPBA/test/Genia4EReval1.iob2 --pre_emb ../embeddings/bio_nlp_vec/PubMed-shuffle-win-30.bin -D 0.5 -A 0 -W 675 -w 200 -H 7 --lower 1 -K 2,3,4 -k 40,40,40 -P 1 -S 0 --tag_scheme iob
- Setting -P to 1 trains the GRAM-CNN network. -W and -H have no effect here; the dropout ratio is 0.5.
- Detailed parameter settings are in src/train.py:
> python train.py --help
- To test the pre-trained model, replace train.py with infer.py.
- The result output file is in the evaluation directory.
> python infer.py --train ../dataset/NLPBA/train/train.eng --dev ../dataset/NLPBA/train/dev.eng --test ../dataset/NLPBA/test/Genia4EReval1.iob2 --pre_emb ../embeddings/bio_nlp_vec/PubMed-shuffle-win-30.bin -W 675 -H 12 -D 0.5 --lower 1 -A 0 --tag_scheme iob -P 1 -S 1 -w 200 -K 2,3,4 -k 40,40,40
JNLPBA:
To run the example model on NCBI and BC2, change the kernel range in n_gram.py to [1,10].
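The F1-scores quoted at the top of this README are entity-level metrics. This README does not say which scoring script or output format is used, so the following is only a sketch of the common exact-match convention: a predicted entity counts as correct when its start, end, and type all match a gold entity. The helper names `iob_spans` and `span_f1` and the example tags are illustrative assumptions, not part of this repository.

```python
def iob_spans(tags):
    """Convert a list of IOB tags into a set of (start, end, type) spans (end exclusive).

    Hypothetical helper. An I- tag with no matching open entity starts a new one,
    which is a lenient choice made for this sketch.
    """
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing entity
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans


def span_f1(gold_tags, pred_tags):
    """Return (precision, recall, F1) under exact span matching."""
    gold, pred = iob_spans(gold_tags), iob_spans(pred_tags)
    correct = len(gold & pred)
    precision = correct / float(len(pred)) if pred else 0.0
    recall = correct / float(len(gold)) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Illustrative tags only: one entity matches exactly, one is shifted by a token.
gold = ["B-protein", "I-protein", "O", "B-DNA", "O"]
pred = ["B-protein", "I-protein", "O", "O", "B-DNA"]
precision, recall, f1 = span_f1(gold, pred)
```

Exact matching is strict: a prediction that overlaps a gold entity but misses its boundary by one token counts as both a false positive and a false negative.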