KEWER: Knowledge graph Entity and Word Embedding for Retrieval


This repository contains code and data for the ECIR 2020 paper "Joint Word and Entity Embeddings for Entity Retrieval from a Knowledge Graph" [pdf, slides, presentation].

KEWER embeddings trained with the categories, literals, and predicates structural components and with unigram probabilities are available here: https://academictorrents.com/details/4778f904ca10f059eaaf27bdd61f7f7fc93abc6e.

Entity Retrieval example

KEWER significantly improves entity retrieval for complex queries. Below are the top 10 results for the query "wonders of the ancient world" obtained using BM25F and KEWER. Relevant results are italicized, and highly relevant results are boldfaced.

| BM25F | KEWER |
| --- | --- |
| Seven Wonders of the Ancient World | Colossus of Rhodes |
| 7 Wonders of the Ancient World (video game) | Statue of Zeus at Olympia |
| Wonders of the World | Temple of Artemis |
| Seven Ancient Wonders | List of archaeoastronomical sites by country |
| The Seven Fabulous Wonders | Hanging Gardens of Babylon |
| The Seven Wonders of the World (album) | Antikythera mechanism |
| Times of India's list of seven wonders of India | Timeline of ancient history |
| Lighthouse of Alexandria | Wonders of the World |
| 7 Wonders (board game) | Lighthouse of Alexandria |
| Colossus of Rhodes | Great Pyramid of Giza |
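The embedding-based ranking above can be sketched as scoring each entity by the cosine similarity between its vector and a query vector built from word vectors (the paper additionally weights query words, e.g. by unigram probabilities). All vectors below are toy values for illustration, not actual KEWER embeddings, and the entity names are shortened:

```python
import numpy as np

def cos(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional embeddings; real KEWER vectors are trained on
# random walks over DBpedia and are much higher-dimensional.
emb = {
    "wonders": np.array([0.9, 0.1, 0.0]),
    "ancient": np.array([0.1, 0.9, 0.0]),
    "world":   np.array([0.2, 0.2, 0.6]),
    "<dbpedia:Colossus_of_Rhodes>":       np.array([0.5, 0.6, 0.2]),
    "<dbpedia:Lighthouse_of_Alexandria>": np.array([0.4, 0.5, 0.3]),
    "<dbpedia:7_Wonders_(board_game)>":   np.array([0.6, 0.0, 0.1]),
}

def query_vec(terms):
    # Represent the query as the sum of its word vectors
    # (a plain unweighted sum for simplicity).
    return np.sum([emb[t] for t in terms], axis=0)

q = query_vec(["wonders", "ancient", "world"])
entities = [e for e in emb if e.startswith("<")]
ranked = sorted(entities, key=lambda e: cos(q, emb[e]), reverse=True)
```

Because both query words and entities live in the same embedding space, entities related to the query topic (here the ancient wonders) score higher than entities that only match the surface words.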

Download dataset

To download the dataset, which is a subset of English DBpedia 2015-10, simply run the make-dataset.sh script. Verify that it produced the following files and directories in the dbpedia-2015-10-kewer directory:

$ tree --dirsfirst dbpedia-2015-10-kewer
dbpedia-2015-10-kewer
├── graph
│   ├── infobox_properties_en.ttl
│   ├── mappingbased_literals_en.ttl
│   └── mappingbased_objects_en.ttl
├── labels
│   ├── anchor_text_en.ttl
│   ├── category_labels_en.ttl
│   ├── dbpedia_2015-10.nt
│   ├── infobox_property_definitions_en.ttl
│   └── labels_en.ttl
├── article_categories_en.ttl
├── short_abstracts_en.ttl
└── transitive_redirects_en.ttl

2 directories, 11 files

Train KEWER embeddings

  1. Generate the indexed file with the filtered entities: make-indexed.sh.
  2. Install the required packages:

     $ conda create --name kewer --file requirements.txt
     $ conda activate kewer

  3. Train the embeddings:

     $ cd embeddings/KEWER
     $ ./gen_graph.py
     $ ./gen_walks.py --cat --outfile data/walks-cat.txt
     $ ./replace_uris.py --pred --lit --infile data/walks-cat.txt --outfile data/sents-cat-pred-lit.txt
     # optional, shuffle the sentences:
     $ shuf data/sents-cat-pred-lit.txt -o data/sents-cat-pred-lit.txt
     $ ./train_w2v.py --infile data/sents-cat-pred-lit.txt --outfiles data/kewer
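Once training finishes, the resulting vectors can be loaded for downstream retrieval. As a sketch, assuming the output follows the standard word2vec text format (first line "vocab_size dim", then one token and its components per line; the actual files written by train_w2v.py under the data/kewer prefix may differ), a minimal loader looks like this. The toy-vectors.txt file below is a hypothetical stand-in created just for the demo:

```python
import numpy as np

def load_w2v_text(path):
    # Parse embeddings in the plain word2vec text format:
    # header line "<vocab_size> <dim>", then "<token> <v1> ... <vdim>".
    vecs = {}
    with open(path, encoding="utf-8") as f:
        n, dim = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip("\n").split(" ")
            vecs[parts[0]] = np.array(parts[1:1 + dim], dtype=np.float32)
    assert len(vecs) == n, "header count does not match number of rows"
    return vecs

# Demo with a tiny hand-written file in the same format.
with open("toy-vectors.txt", "w", encoding="utf-8") as f:
    f.write("2 3\n")
    f.write("wonders 0.1 0.2 0.3\n")
    f.write("<http://dbpedia.org/resource/Colossus_of_Rhodes> 0.4 0.5 0.6\n")

vecs = load_w2v_text("toy-vectors.txt")
```

Note that KEWER stores words and entity URIs in one vocabulary, so a single lookup table serves both query terms and candidate entities.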

Repo structure

- bm25f/: scripts to optimize and run retrieval for the BM25F baseline using a Galago fork (https://sourceforge.net/projects/galago-fork/). You need to provide an index in the index/ directory to run the scripts. To convert .ttl files into the trecweb format that can be indexed with Galago, https://github.com/teanalab/dbpedia2fields can be used.
- embeddings/: scripts to train KEWER and the Jointly baseline.
- entity-extraction/: scripts to perform entity linking in queries using DBpedia Spotlight, Nordlys LTR, and SMAPH.
- interpolation-el/: interpolation BM25F+KEWER_el-SM.
- interpolation/: interpolation BM25F+KEWER.
- qrels/: relevance judgments from DBpedia-Entity v2.
- queries/: query folds in JSON and TSV formats.
- retrieval/: ranking of entities using embeddings only, without interpolation, as in Table 2 of the paper.
- word2vec/: scripts for the BM25F+word2vec baseline.
- eval.sh: evaluate result runs using the provided qrels file.
- make-dataset.sh: download the DBpedia 2015-10 dataset.
- make-indexed.sh: generate the 'indexed' file with the filtered entities.
- queries-v2_stopped.txt: DBpedia-Entity v2 queries.
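The BM25F+KEWER runs in interpolation/ combine the two rankings by linearly interpolating their scores. The actual scripts may normalize scores differently and tune the mixing weight on the query folds; as a sketch, a min-max-normalized linear interpolation (with a hypothetical alpha parameter) looks like this:

```python
def interpolate(bm25f, kewer, alpha=0.5):
    """Linearly interpolate two {entity: score} runs.

    Scores are min-max normalized per run; entities missing from one
    run contribute 0 from that run. alpha weights the BM25F side.
    """
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero for constant runs
        return {e: (s - lo) / span for e, s in scores.items()}

    b, k = norm(bm25f), norm(kewer)
    entities = set(b) | set(k)
    return {e: alpha * b.get(e, 0.0) + (1 - alpha) * k.get(e, 0.0)
            for e in entities}

# Usage with toy run scores (not real system output).
bm25f_run = {"Colossus_of_Rhodes": 12.3, "Wonders_of_the_World": 9.8}
kewer_run = {"Colossus_of_Rhodes": 0.92, "Statue_of_Zeus_at_Olympia": 0.88}
fused = interpolate(bm25f_run, kewer_run, alpha=0.5)
```

Min-max normalization matters here because BM25F and embedding similarities live on very different scales, so raw-score mixing would let one run dominate regardless of alpha.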

Cite

@InProceedings{Nikolaev:2020:KEWER,
  author="Nikolaev, Fedor and Kotov, Alexander",
  title="Joint Word and Entity Embeddings for Entity Retrieval from a Knowledge Graph",
  booktitle="Advances in Information Retrieval",
  year="2020",
  publisher="Springer International Publishing",
  address="Cham",
  pages="141--155",
  isbn="978-3-030-45439-5"
}

Contact

If you have any questions or suggestions, send an email to fedor@wayne.edu or create a GitHub issue.

License

MIT License