mariosaenger / bio-re-with-entity-embeddings

Large-scale biomedical relation extraction with entity and pair embeddings

Large-scale entity and entity pair embeddings

This repository contains source code to learn dense semantic representations for biomedical entities and pairs of entities as used in Sänger and Leser: "Large-scale Entity Representation Learning for Biomedical Relationship Extraction" (Bioinformatics, 2020).

The approach performs biomedical relation extraction at the corpus level, based on entity and entity pair embeddings learned from the complete PubMed corpus. For this, we focus on all articles mentioning a certain biomedical entity (e.g., the mutation V600E) or pair of entities within the article title or abstract. We concatenate all articles mentioning the entity or entity pair and apply paragraph vectors (Le and Mikolov, 2014) to learn an embedding for each distinct entity or entity pair.

Content: Usage | Pre-trained Entity Embeddings | Embedding Training | Supported Entity Types | Citation | Acknowledgements

Usage

The implementation of the embeddings is based on Gensim. The following snippet highlights the basic use of the pre-trained embeddings.

from gensim.models import KeyedVectors

# Loading pre-trained entity model
model = KeyedVectors.load("mutation-v0500.bin")

# Print number of distinct entities of the model
print(f"Distinct entities: {len(model.vocab)}\n")

# Get the embedding for a specific entity
entity_embedding = model["rs113488022"]
print(f"Embedding of rs113488022:\n{entity_embedding}\n")

# Find similar entities
print("Most similar entities to rs113488022:")
top5_nearest_neighbors = model.most_similar("rs113488022", topn=5)
for i, (entity_id, sim) in enumerate(top5_nearest_neighbors):
    print(f" {i+1}: {entity_id} (similarity: {sim:.3f})")

This should output:

Distinct entities: 47498

Embedding of rs113488022:
[ 1.15715809e-01  4.90018785e-01 -6.05004542e-02 -8.35603476e-02
  9.20398310e-02 -1.51171118e-01  4.01901715e-02 -2.36775234e-01
  ...
]

Most similar entities to rs113488022:
 1: rs121913227 (similarity: 0.690)
 2: rs121913364 (similarity: 0.628)
 3: rs121913529 (similarity: 0.610)
 4: rs121913357 (similarity: 0.573)
 5: rs11554290 (similarity: 0.571)
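
Note that the snippet above uses the gensim 3.x API: in gensim 4.0 the `vocab` attribute of `KeyedVectors` was removed in favor of `key_to_index`. A small version-tolerant helper (written here as an illustration, not part of the repository):

```python
def num_entities(model):
    """Vocabulary size of a KeyedVectors model across gensim versions."""
    mapping = getattr(model, "key_to_index", None)  # gensim >= 4.0
    if mapping is None:
        mapping = model.vocab                       # gensim 3.x
    return len(mapping)
```

With a model loaded as above, `num_entities(model)` replaces `len(model.vocab)` regardless of the installed gensim version.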

Pre-trained Entity Embeddings

| Entity Type | Identifier | #Entities | Vocabulary | v500 | v1000 | v1500 | v2000 |
|-------------|------------|-----------|------------|------|-------|-------|-------|
| Cell line | Cellosaurus ID | 4,654 | Vocab | Vectors | Vectors | Vectors | Vectors |
| Chemical | MeSH | 109,716 | Vocab | Vectors | Vectors | Vectors | Vectors |
| Disease | MeSH | 10,712 | Vocab | Vectors | Vectors | Vectors | Vectors |
| Disease | DOID | 3,157 | Vocab | Vectors | Vectors | Vectors | Vectors |
| Drug | DrugBank ID | 5,966 | Vocab | Vectors | Vectors | Vectors | Vectors |
| Gene | NCBI Gene ID | 171,686 | Vocab | Vectors | Vectors | Vectors | Vectors |
| Mutation | RS-Identifier | 47,498 | Vocab | Vectors | Vectors | Vectors | Vectors |
| Species | NCBI Taxonomy | 176,989 | Vocab | Vectors | Vectors | Vectors | Vectors |

Train your own embeddings

To compute entity and entity pair embeddings, we utilize the complete PubMed corpus and make use of the data and entity annotations provided by PubTator Central.

Download resources

  • Download annotations from PubTator Central:
python download_resources.py --resources pubtator_central

Note: The annotation data requires > 70GB of disk space.
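
The downloaded annotations come in the standard PubTator export format: per document, a `PMID|t|title` line, a `PMID|a|abstract` line, then tab-separated annotation lines (`PMID`, start, end, mention, type, concept id), with documents separated by blank lines. A minimal parser sketch for that format (the function name is my own, not part of this repository):

```python
def parse_pubtator(lines):
    """Yield one dict per document from a PubTator-format line iterator."""
    doc = {"pmid": None, "title": "", "abstract": "", "annotations": []}
    for line in lines:
        line = line.rstrip("\n")
        if not line:                        # blank line ends a document
            if doc["pmid"]:
                yield doc
            doc = {"pmid": None, "title": "", "abstract": "", "annotations": []}
        elif "|t|" in line:
            doc["pmid"], _, doc["title"] = line.split("|", 2)
        elif "|a|" in line:
            _, _, doc["abstract"] = line.split("|", 2)
        else:
            fields = line.split("\t")
            if len(fields) >= 6:            # pmid, start, end, mention, type, id
                doc["annotations"].append(
                    {"mention": fields[3], "type": fields[4], "id": fields[5]}
                )
    if doc["pmid"]:                         # last document (no trailing blank)
        yield doc

sample = [
    "123|t|BRAF V600E study",
    "123|a|We analyse the V600E mutation.",
    "123\t15\t20\tV600E\tMutation\trs113488022",
    "",
]
docs = list(parse_pubtator(sample))
print(docs[0]["annotations"])  # [{'mention': 'V600E', 'type': 'Mutation', 'id': 'rs113488022'}]
```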

Learn entity embeddings

Learning entity embeddings can be done in two steps:

  • Prepare entity annotations:
python prepare_entity_dataset.py --working_dir _out --entity_type mutation

We support entity types cell line, chemical, disease, drug, gene, mutation, and species.

  • Run representation learning:
python learn_embeddings.py --input_file _out/mutation/doc2vec_input.txt \
                           --config_file ../resources/configurations/doc2vec-0500.config \
                           --model_name mutation-v0500 \
                           --output_dir _out/mutation  

Example configurations can be found in resources/configurations.

Learn entity pair embeddings

To learn entity pair embeddings, preparation of the entity annotations has to be performed first (see above). Analogously to the entity embeddings, learning of pair embeddings is performed in two steps:

  • Prepare pair annotations:
python prepare_pair_dataset.py --working_dir _out --source_type mutation --target_type disease

We support entity types disease, drug, and mutation.

  • Run representation learning:
python learn_embeddings.py --input_file _out/mutation-disease/doc2vec_input.txt \
                           --config_file ../resources/configurations/doc2vec-0500.config \
                           --model_name mutation-disease-v0500 \
                           --output_dir _out/mutation-disease  

Example configurations can be found in resources/configurations.
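
Conceptually, the pair-preparation step assigns to each entity pair the articles that mention both of its entities. A toy sketch of that intersection (the data and the `__`-joined pair-key format are illustrative, not the repository's exact conventions):

```python
# Toy stand-in data: entity id -> PMIDs of articles mentioning it
mutation_articles = {
    "rs113488022": {"pmid1", "pmid2", "pmid3"},
}
disease_articles = {
    "MESH:D008545": {"pmid2", "pmid3", "pmid4"},   # melanoma
}

pair_articles = {}
for mut, mut_pmids in mutation_articles.items():
    for dis, dis_pmids in disease_articles.items():
        shared = mut_pmids & dis_pmids             # articles mentioning both
        if shared:
            # pair key format here is illustrative only
            pair_articles[f"{mut}__{dis}"] = shared

print(sorted(pair_articles["rs113488022__MESH:D008545"]))  # ['pmid2', 'pmid3']
```

The concatenated texts of each pair's shared articles then form the input document for `learn_embeddings.py`, exactly as for single entities.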

Supported entity types

| Entity Type | Identifier | Example |
|-------------|------------|---------|
| Cell line | Cellosaurus ID | CVCL:0027 (Hep-G2) |
| Chemical | MeSH | MESH:D000068878 (hTrastuzumab) |
| Disease | MeSH | MESH:D006984 (hypertrophic chondrocytes) |
| Disease | Disease Ontology ID (DOID) 1 | DOID:60155 (visual agnosia) |
| Drug | DrugBank ID | DB00166 (lipoic acid) |
| Gene | NCBI Gene ID | NCBI:673 (BRAF) |
| Mutation | RS-Identifier | rs113488022 (V600E) |
| Species | NCBI Taxonomy | TAXON:9606 (human) |

1: Use option "--entity_type disease-doid" when calling prepare_entity_dataset.py to normalize disease annotations to the Disease Ontology.

Citation

Please use the following bibtex entry to cite our work:

@article{saenger2020entityrepresentation,
  title={Large-scale Entity Representation Learning for Biomedical Relationship Extraction},
  author={S{\"a}nger, Mario and Leser, Ulf},
  journal={Bioinformatics},
  year={2020},
  publisher={Oxford University Press}
}

Acknowledgements

  • We use the annotations from PubTator Central to compute the entity embeddings. For further details, refer to:

    Wei, Chih-Hsuan, et al. "PubTator central: automated concept annotation for biomedical full text articles." Nucleic acids research 47.W1 (2019): W587-W593.

  • We use information from the Disease Ontology to normalize disease annotations. For further details, refer to:

    Schriml, Lynn M., et al. "Human Disease Ontology 2018 update: classification, content and workflow expansion." Nucleic acids research 47.D1 (2019): D955-D962.

  • We use the paragraph vectors model to perform entity representation learning. For further details, refer to:

    Le, Quoc, and Tomas Mikolov. "Distributed representations of sentences and documents." International conference on machine learning. 2014.
