ChemTextMining

Short description of repository

This repository contains code and additional resources for experiments on disease entity extraction using conditional random fields (CRF). It is structured as follows:

webmd_corpus.json - the full corpus.
classification_pipe.py - entry point of the program.
vocabularies - directory with the vocabularies used, in txt format; each line contains one vocabulary entity.
clustered words - directory with words clustered using the Brown clustering algorithm.
corpus_json - directory with the datasets used in the experiments.
word2vec - directory with word embeddings.
The remaining files are utility code.
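
Since each vocabulary file holds one entity per line, it can be read into a set for fast membership checks. The following is a minimal sketch, not the repository's own loader: the function name `load_vocabulary` is hypothetical, and UTF-8 encoding is an assumption.

```python
def load_vocabulary(path):
    """Read a one-entity-per-line text file into a set of entries.

    Hypothetical helper, not part of the repository; UTF-8 is assumed.
    """
    entries = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            entry = line.strip()
            if entry:  # skip blank lines
                entries.add(entry)
    return entries
```

A set keeps lookups such as `token in vocabulary` constant-time, which is convenient when a dictionary match is used as a token feature.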

Usage:

How to load the vectors (to use them in your own application):

     from gensim.models import KeyedVectors

     w2v = KeyedVectors.load_word2vec_format('path/to/vectors', binary=True)

Install all the requirements.
```
     pip install -r requirements.txt
```
Perl also needs to be installed.
In the "features" dictionary, specify the features of the current token, chosen from the list of available features. You also need to define the context size by setting k_prev (the number of tokens to look back) and k_next (the number of tokens to look ahead), and the features for each context token (in prev_features and next_features).
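
To illustrate how such a configuration could translate into CRF input, here is a minimal sketch of context-window feature extraction. It is not the repository's implementation: the extractor names (`lower`, `is_title`, `suffix3`), the layout of the configuration dictionary, and the function `token_features` are all assumptions made for illustration. Only the keys `features`, `k_prev`, `k_next`, `prev_features` and `next_features` come from the description above.

```python
# Hypothetical feature extractors; the repository's available features may differ.
EXTRACTORS = {
    "lower": lambda tok: tok.lower(),
    "is_title": lambda tok: tok.istitle(),
    "suffix3": lambda tok: tok[-3:],
}


def enabled(flags):
    """Return the names of features switched on in a {name: bool} dictionary."""
    return [name for name, on in flags.items() if on]


def token_features(tokens, i, config):
    """Build a feature dictionary for tokens[i] plus its left and right context."""
    feats = {}
    # Features of the current token.
    for name in enabled(config["features"]):
        feats[name] = EXTRACTORS[name](tokens[i])
    # Features of up to k_prev tokens before the current one.
    for offset in range(1, config["k_prev"] + 1):
        j = i - offset
        if j < 0:
            break  # ran past the start of the sentence
        for name in enabled(config["prev_features"]):
            feats[f"-{offset}:{name}"] = EXTRACTORS[name](tokens[j])
    # Features of up to k_next tokens after the current one.
    for offset in range(1, config["k_next"] + 1):
        j = i + offset
        if j >= len(tokens):
            break  # ran past the end of the sentence
        for name in enabled(config["next_features"]):
            feats[f"+{offset}:{name}"] = EXTRACTORS[name](tokens[j])
    return feats


# Assumed configuration layout, for illustration only.
config = {
    "features": {"lower": True, "suffix3": True, "is_title": False},
    "k_prev": 1,
    "k_next": 1,
    "prev_features": {"lower": True},
    "next_features": {"lower": True},
}
tokens = ["Severe", "headache", "today"]
print(token_features(tokens, 1, config))
```

Prefixing context features with their offset (for example `-1:lower`) keeps them distinct from the current token's features, so the model can weight each position separately.
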
Run the code:
```
   python classification_pipe.py
```

The output has the following format:

 ...
 Average exact score:
   precision	recall	fscore

 Average weak score:
   precision	recall	fscore
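
The two score types can be sketched as follows. This is an illustration under stated assumptions, not the repository's evaluation code: entities are represented as `(start, end)` character spans with an exclusive end, an "exact" match is taken to mean identical boundaries, and a "weak" match is taken to mean any overlap between a predicted and a gold span. All function names are hypothetical.

```python
def prf(pred_hits, n_pred, gold_hits, n_gold):
    """Compute precision, recall and F-score from hit counts."""
    precision = pred_hits / n_pred if n_pred else 0.0
    recall = gold_hits / n_gold if n_gold else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore


def overlaps(a, b):
    """True if half-open spans a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def exact_score(gold, pred):
    """Score predictions that match a gold span on both boundaries."""
    gold_set, pred_set = set(gold), set(pred)
    hits = len(gold_set & pred_set)
    return prf(hits, len(pred_set), hits, len(gold_set))


def weak_score(gold, pred):
    """Score predictions that overlap a gold span (assumed definition of 'weak')."""
    pred_hits = sum(any(overlaps(p, g) for g in gold) for p in pred)
    gold_hits = sum(any(overlaps(g, p) for p in pred) for g in gold)
    return prf(pred_hits, len(pred), gold_hits, len(gold))


gold = [(0, 2), (5, 7)]
pred = [(0, 2), (6, 8), (10, 11)]
print("exact:", exact_score(gold, pred))
print("weak: ", weak_score(gold, pred))
```

Under these assumptions the weak score is never lower than the exact score, because every exact match is also an overlap.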

Note: To use different word embedding vectors, specify them in the load_w2v_model function in crf/features.py.

Citing:

Miftahutdinov, Z., Tutubalina, E., Tropsha, A.: Identifying Disease-related Expressions in Reviews using Conditional Random Fields.

http://www.dialog-21.ru/media/3932/miftahutdinovzshetal.pdf

BibTeX:

@inproceedings{miftahutdinov2017,
              title={Identifying Disease-related Expressions in Reviews using Conditional Random Fields},
              author={Miftahutdinov, Zulfat and Tutubalina, Elena and Tropsha, Alexander},
              booktitle={Proceedings of International Conference Dialog},
              volume={1},
              pages={155-167},
              year={2017}
}

Tutubalina, EV and Miftahutdinov, Z Sh and Nugmanov, RI and Madzhidov, TI and Nikolenko, SI and Alimova, IS and Tropsha, AE: Using semantic analysis of texts for the identification of drugs with similar therapeutic effects.

link to paper

BibTeX:

@article{tutubalina2017using,
        title={Using semantic analysis of texts for the identification of drugs with similar therapeutic effects},
        author={Tutubalina, EV and Miftahutdinov, Z Sh and Nugmanov, RI and Madzhidov, TI and Nikolenko, SI and Alimova, IS and Tropsha, AE},
        journal={Russian Chemical Bulletin},
        volume={66},
        number={11},
        pages={2180--2189},
        year={2017},
        publisher={Springer}
}
