The WikipediaParser folder contains the code to extract and prepare the Wikipedia dump.
The links below provide the pre-trained concept, word, and entity vectors produced by this project:
- The Wikipedia ID to Title map file (maps the Wikipedia ID of a page to its title): https://web.cs.dal.ca/~sherkat/Files/ID_title_map.zip (210 MB)
- Trained Concepts, Entities and Words (concepts and entities are represented by their Wikipedia ID; a loading sketch follows this list):
  - ConVec: https://web.cs.dal.ca/~sherkat/Files/WikipediaClean5Negative300Skip10.zip (3.3 GB)
  - ConVec Fine Tuned: https://dalu-my.sharepoint.com/personal/eh379022_dal_ca/_layouts/15/guestaccess.aspx?docid=0602754033d8a4e65aa8c841aa2efc491&authkey=AZGkUWbSwjiO4SrsPub0LBI (6.86 GB)
  - ConVec Heuristic: https://dalu-my.sharepoint.com/personal/eh379022_dal_ca/_layouts/15/guestaccess.aspx?docid=0b42ef5ba2d3247ccab2c10d5c1691a47&authkey=AYZLBIfjQLta9aAmSaRzymY (3.75 GB)
  - ConVec Only Anchors: https://dalu-my.sharepoint.com/personal/eh379022_dal_ca/_layouts/15/guestaccess.aspx?docid=0592fd0392e1f4ec89d84a512e6482d03&authkey=AVpBqVspTRnF0Ra6xedMVIk (3.57 GB)
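
The exact on-disk format of the released archives is not documented here, so the snippet below is only a minimal sketch of how the files could be used together. It assumes the ConVec vectors are in the standard word2vec text format readable by gensim (switch to `binary=True` if the extracted file is binary), that the ID-to-title map is a comma-delimited text file with one `page_id,title` pair per line, and the file names and example page ID are placeholders.

```python
# Minimal sketch: load ConVec vectors and the Wikipedia ID -> title map,
# then look up nearest neighbours of a concept by its Wikipedia page ID.
# File names, the delimiter, and the example page ID are assumptions.
from gensim.models import KeyedVectors

# Assumes word2vec text format; use binary=True if the file is binary.
vectors = KeyedVectors.load_word2vec_format(
    "WikipediaClean5Negative300Skip10.txt", binary=False
)

# Assumes one "page_id,title" pair per line in the map file.
id_to_title = {}
with open("ID_title_map.csv", encoding="utf-8") as f:
    for line in f:
        page_id, _, title = line.rstrip("\n").partition(",")
        id_to_title[page_id] = title

# Concepts and entities are keyed by their Wikipedia page ID, so translate
# the neighbours of a (placeholder) query ID back to readable titles.
query_id = "18957"  # placeholder Wikipedia page ID
if query_id in vectors:
    for neighbour_id, score in vectors.most_similar(query_id, topn=5):
        print(id_to_title.get(neighbour_id, neighbour_id), round(score, 3))
```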
Please cite the following paper if you use the code, datasets, or vector embeddings:
@inproceedings{NLDBSherkat2017,
  author    = {Ehsan Sherkat and Evangelos Milios},
  title     = {Vector Embedding of Wikipedia Concepts and Entities},
  booktitle = {Natural Language Processing and Information Systems, NLDB 2017},
  year      = {2017},
  doi       = {10.1007/978-3-319-59569-6_50},
  url       = {https://arxiv.org/abs/1702.03470}
}