mariosaenger / bio-re-with-entity-embeddings

Large-scale biomedical relation extraction with entity and pair embeddings

Large-scale entity and entity pair embeddings

This repository contains source code to learn dense semantic representations for biomedical entities and pairs of entities as used in Sänger and Leser: "Large-scale Entity Representation Learning for Biomedical Relationship Extraction" (Bioinformatics, 2020).

The approach performs biomedical relation extraction at the corpus level, based on entity and entity pair embeddings learned from the complete PubMed corpus. For this, we focus on all articles mentioning a certain biomedical entity (e.g., the mutation V600E) or pair of entities within the article title or abstract. We concatenate all articles mentioning the entity or entity pair and apply paragraph vectors (Le and Mikolov, 2014) to learn an embedding for each distinct entity or entity pair.

Content: Usage | Pre-trained Entity Embeddings | Embedding Training | Supported Entity Types | Citation | Acknowledgements

Usage

The implementation of the embeddings is based on Gensim. The following snippet highlights the basic use of the pre-trained embeddings.

from gensim.models import KeyedVectors

# Loading pre-trained entity model
model = KeyedVectors.load("mutation-v0500.bin")

# Print number of distinct entities of the model
print(f"Distinct entities: {len(model.vocab)}\n")

# Get the embedding for a specific entity
entity_embedding = model["rs113488022"]
print(f"Embedding of rs113488022:\n{entity_embedding}\n")

# Find similar entities
print("Most similar entities to rs113488022:")
top5_nearest_neighbors = model.most_similar("rs113488022", topn=5)
for i, (entity_id, sim) in enumerate(top5_nearest_neighbors):
    print(f" {i+1}: {entity_id} (similarity: {sim:.3f})")

This should output:

Distinct entities: 47498

Embedding of rs113488022:
[ 1.15715809e-01  4.90018785e-01 -6.05004542e-02 -8.35603476e-02
  9.20398310e-02 -1.51171118e-01  4.01901715e-02 -2.36775234e-01
  ...
]

Most similar entities to rs113488022:
 1: rs121913227 (similarity: 0.690)
 2: rs121913364 (similarity: 0.628)
 3: rs121913529 (similarity: 0.610)
 4: rs121913357 (similarity: 0.573)
 5: rs11554290 (similarity: 0.571)
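
Note that the snippet above uses the gensim 3.x API: in gensim 4.0 the `vocab` attribute of `KeyedVectors` was removed in favor of `key_to_index`. A small version-tolerant helper (written here as an illustration, not part of the repository):

```python
def num_entities(model):
    """Vocabulary size of a KeyedVectors model across gensim versions."""
    mapping = getattr(model, "key_to_index", None)  # gensim >= 4.0
    if mapping is None:
        mapping = model.vocab                       # gensim 3.x
    return len(mapping)
```

With a model loaded as above, `num_entities(model)` replaces `len(model.vocab)` regardless of the installed gensim version.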

Pre-trained Entity Embeddings

| Entity Type | Identifier | #Entities | Vocabulary | v500 | v1000 | v1500 | v2000 |
|-------------|------------|-----------|------------|------|-------|-------|-------|
| Cell line | Cellosaurus ID | 4,654 | Vocab | Vectors | Vectors | Vectors | Vectors |
| Chemical | MeSH | 109,716 | Vocab | Vectors | Vectors | Vectors | Vectors |
| Disease | MeSH | 10,712 | Vocab | Vectors | Vectors | Vectors | Vectors |
| Disease | DOID | 3,157 | Vocab | Vectors | Vectors | Vectors | Vectors |
| Drug | DrugBank ID | 5,966 | Vocab | Vectors | Vectors | Vectors | Vectors |
| Gene | NCBI Gene ID | 171,686 | Vocab | Vectors | Vectors | Vectors | Vectors |
| Mutation | RS-Identifier | 47,498 | Vocab | Vectors | Vectors | Vectors | Vectors |
| Species | NCBI Taxonomy | 176,989 | Vocab | Vectors | Vectors | Vectors | Vectors |

Train your own embeddings

To compute entity and entity pair embeddings, we utilize the complete PubMed corpus and make use of the data and entity annotations provided by PubTator Central.

Download resources

  • Download annotations from PubTator Central:
python download_resources.py --resources pubtator_central

Note: The annotation data requires > 70GB of disk space.
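
The downloaded annotations come in the standard PubTator export format: per document, a `PMID|t|title` line, a `PMID|a|abstract` line, then tab-separated annotation lines (`PMID`, start, end, mention, type, concept id), with documents separated by blank lines. A minimal parser sketch for that format (the function name is my own, not part of this repository):

```python
def parse_pubtator(lines):
    """Yield one dict per document from a PubTator-format line iterator."""
    doc = {"pmid": None, "title": "", "abstract": "", "annotations": []}
    for line in lines:
        line = line.rstrip("\n")
        if not line:                        # blank line ends a document
            if doc["pmid"]:
                yield doc
            doc = {"pmid": None, "title": "", "abstract": "", "annotations": []}
        elif "|t|" in line:
            doc["pmid"], _, doc["title"] = line.split("|", 2)
        elif "|a|" in line:
            _, _, doc["abstract"] = line.split("|", 2)
        else:
            fields = line.split("\t")
            if len(fields) >= 6:            # pmid, start, end, mention, type, id
                doc["annotations"].append(
                    {"mention": fields[3], "type": fields[4], "id": fields[5]}
                )
    if doc["pmid"]:                         # last document (no trailing blank)
        yield doc

sample = [
    "123|t|BRAF V600E study",
    "123|a|We analyse the V600E mutation.",
    "123\t15\t20\tV600E\tMutation\trs113488022",
    "",
]
docs = list(parse_pubtator(sample))
print(docs[0]["annotations"])  # [{'mention': 'V600E', 'type': 'Mutation', 'id': 'rs113488022'}]
```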

Learn entity embeddings

Learning entity embeddings can be done in two steps:

  • Prepare entity annotations:
python prepare_entity_dataset.py --working_dir _out --entity_type mutation

We support entity types cell line, chemical, disease, drug, gene, mutation, and species.

  • Run representation learning:
python learn_embeddings.py --input_file _out/mutation/doc2vec_input.txt \
                           --config_file ../resources/configurations/doc2vec-0500.config \
                           --model_name mutation-v0500 \
                           --output_dir _out/mutation  

Example configurations can be found in resources/configurations.

Learn entity pair embeddings

To learn entity pair embeddings, preparation of the entity annotations has to be performed first (see above). Analogously to the entity embeddings, learning of pair embeddings is performed in two steps:

  • Prepare pair annotations:
python prepare_pair_dataset.py --working_dir _out --source_type mutation --target_type disease

We support entity types disease, drug, and mutation.

  • Run representation learning:
python learn_embeddings.py --input_file _out/mutation-disease/doc2vec_input.txt \
                           --config_file ../resources/configurations/doc2vec-0500.config \
                           --model_name mutation-disease-v0500 \
                           --output_dir _out/mutation-disease  

Example configurations can be found in resources/configurations.
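
Conceptually, the pair-preparation step assigns to each entity pair the articles that mention both of its entities. A toy sketch of that intersection (the data and the `__`-joined pair-key format are illustrative, not the repository's exact conventions):

```python
# Toy stand-in data: entity id -> PMIDs of articles mentioning it
mutation_articles = {
    "rs113488022": {"pmid1", "pmid2", "pmid3"},
}
disease_articles = {
    "MESH:D008545": {"pmid2", "pmid3", "pmid4"},   # melanoma
}

pair_articles = {}
for mut, mut_pmids in mutation_articles.items():
    for dis, dis_pmids in disease_articles.items():
        shared = mut_pmids & dis_pmids             # articles mentioning both
        if shared:
            # pair key format here is illustrative only
            pair_articles[f"{mut}__{dis}"] = shared

print(sorted(pair_articles["rs113488022__MESH:D008545"]))  # ['pmid2', 'pmid3']
```

The concatenated texts of each pair's shared articles then form the input document for `learn_embeddings.py`, exactly as for single entities.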

Supported entity types

| Entity Type | Identifier | Example |
|-------------|------------|---------|
| Cell line | Cellosaurus ID | CVCL:0027 (Hep-G2) |
| Chemical | MeSH | MESH:D000068878 (hTrastuzumab) |
| Disease | MeSH | MESH:D006984 (hypertrophic chondrocytes) |
| Disease | Disease Ontology ID (DOID) 1 | DOID:60155 (visual agnosia) |
| Drug | DrugBank ID | DB00166 (lipoic acid) |
| Gene | NCBI Gene ID | NCBI:673 (BRAF) |
| Mutation | RS-Identifier | rs113488022 (V600E) |
| Species | NCBI Taxonomy | TAXON:9606 (human) |

1: Use option "--entity_type disease-doid" when calling prepare_entity_dataset.py to normalize disease annotations to the Disease Ontology.

Citation

Please use the following bibtex entry to cite our work:

@article{saenger2020entityrepresentation,
  title={Large-scale Entity Representation Learning for Biomedical Relationship Extraction},
  author={S{\"a}nger, Mario and Leser, Ulf},
  journal={Bioinformatics},
  year={2020},
  publisher={Oxford University Press}
}

Acknowledgements

  • We use the annotations from PubTator Central to compute the entity embeddings. For further details, refer to:

    Wei, Chih-Hsuan, et al. "PubTator central: automated concept annotation for biomedical full text articles." Nucleic acids research 47.W1 (2019): W587-W593.

  • We use information from the Disease Ontology to normalize disease annotations. For further details, refer to:

    Schriml, Lynn M., et al. "Human Disease Ontology 2018 update: classification, content and workflow expansion." Nucleic acids research 47.D1 (2019): D955-D962.

  • We use the paragraph vectors model to perform entity representation learning. For further details, refer to:

    Le, Quoc, and Tomas Mikolov. "Distributed representations of sentences and documents." International conference on machine learning. 2014.
