We developed a degree-based sampling method to generate 42 alignment-oriented datasets from real-world large-scale KGs, preserving different heterogeneities of the original KGs. We selected three state-of-the-art embedding-based entity alignment methods for evaluation and comparison. Furthermore, we observed that multi-mapping relations and literal embedding are the two main obstacles for embedding-based entity alignment, and we attempted some preliminary solutions. Specifically, we leveraged several enhanced KG embedding models to handle multi-mapping relations and used word2vec to incorporate literal similarities into embeddings. Our findings indicate that the performance of existing embedding-based methods depends on the characteristics of the datasets and that not all KG embedding models are suitable for entity alignment. Alignment-oriented KG embedding remains an open problem.
We considered the following four aspects when building our datasets: source KG, dataset language, entity size, and the difference in degree distributions between the extracted datasets and the original KGs. We selected three well-known KGs as our sources: DBpedia (2016-10), Wikidata (20160801) and YAGO3. For DBpedia, we also formed two cross-lingual datasets: English-French and English-German. In terms of entity size, we sampled two kinds of datasets, with 15K and 100K entities, respectively. Each dataset has two versions, V1 and V2: V1 approximates the degree distribution of the source KG, while V2 fits a doubled average degree. Due to the lack of sufficient prior alignment, we only built V1 for the cross-lingual DBP-100K datasets. For each version, we generated three samples to mitigate the effect of sampling randomness. Each dataset consists of five files:
- ent_links: reference entity alignment
- triples_1: relation triples of sampled entities in KG1
- triples_2: relation triples of sampled entities in KG2
- attr_triples_1: attribute triples of sampled entities in KG1
- attr_triples_2: attribute triples of sampled entities in KG2
All datasets can be downloaded from Datahub or Dropbox, where the three folders named "_1", "_2" and "_3" contain our three samples.
The figure below shows an example of the degree distributions of a source KG and its sampled datasets; the sampled dataset in the figure is DBP-WD-15K. The red curves correspond to the V1 version and the blue curves to the V2 version; solid curves represent the source KG, and dotted curves represent the sampled dataset.
The statistics of the 100K datasets are shown below.
DBP-WD-100K

| | Sample | V1: DBpedia | V1: Wikidata | V2: DBpedia | V2: Wikidata |
|---|---|---|---|---|---|
| Relations | S1 | 358 | 216 | 333 | 221 |
| | S2 | 364 | 211 | 333 | 226 |
| | S3 | 368 | 217 | 347 | 221 |
| | AVG | 363 | 215 | 338 | 223 |
| Attributes | S1 | 463 | 807 | 349 | 740 |
| | S2 | 486 | 791 | 390 | 731 |
| | S3 | 466 | 783 | 402 | 756 |
| | AVG | 472 | 794 | 380 | 742 |
| Rel. triples | S1 | 257,398 | 226,585 | 497,241 | 503,836 |
| | S2 | 259,100 | 224,863 | 493,865 | 484,209 |
| | S3 | 269,471 | 237,846 | 519,713 | 517,948 |
| | AVG | 261,990 | 229,765 | 503,606 | 501,998 |
| Attr. triples | S1 | 399,424 | 593,332 | 385,004 | 838,155 |
| | S2 | 398,373 | 587,581 | 397,852 | 830,654 |
| | S3 | 397,787 | 619,950 | 389,973 | 856,447 |
| | AVG | 398,528 | 600,288 | 390,943 | 841,752 |
DBP-YG-100K

| | Sample | V1: DBpedia | V1: YAGO | V2: DBpedia | V2: YAGO |
|---|---|---|---|---|---|
| Relations | S1 | 326 | 30 | 311 | 31 |
| | S2 | 358 | 31 | 320 | 31 |
| | S3 | 337 | 30 | 303 | 31 |
| | AVG | 340 | 30 | 311 | 31 |
| Attributes | S1 | 404 | 24 | 347 | 24 |
| | S2 | 415 | 24 | 335 | 23 |
| | S3 | 402 | 24 | 343 | 23 |
| | AVG | 407 | 24 | 342 | 23 |
| Rel. triples | S1 | 261,038 | 277,779 | 457,197 | 535,106 |
| | S2 | 281,143 | 318,434 | 443,115 | 522,817 |
| | S3 | 280,904 | 313,147 | 457,888 | 529,100 |
| | AVG | 274,362 | 303,120 | 452,733 | 529,008 |
| Attr. triples | S1 | 425,648 | 141,936 | 442,973 | 108,338 |
| | S2 | 413,532 | 131,411 | 442,122 | 111,467 |
| | S3 | 420,947 | 136,464 | 448,000 | 105,639 |
| | AVG | 420,042 | 136,604 | 444,365 | 108,481 |
DBP(en_fr)-100K-V1 and DBP(en_de)-100K-V1

| | Sample | en (en_fr) | fr (en_fr) | en (en_de) | de (en_de) |
|---|---|---|---|---|---|
| Relations | S1 | 329 | 257 | 305 | 163 |
| | S2 | 331 | 254 | 310 | 167 |
| | S3 | 331 | 256 | 305 | 169 |
| | AVG | 330 | 256 | 307 | 166 |
| Attributes | S1 | 332 | 469 | 360 | 494 |
| | S2 | 331 | 478 | 361 | 494 |
| | S3 | 331 | 480 | 357 | 489 |
| | AVG | 331 | 476 | 359 | 492 |
| Rel. triples | S1 | 367,096 | 294,440 | 273,093 | 230,586 |
| | S2 | 367,190 | 294,378 | 274,256 | 232,439 |
| | S3 | 367,328 | 294,471 | 275,022 | 232,364 |
| | AVG | 367,205 | 294,430 | 274,124 | 231,796 |
| Attr. triples | S1 | 403,321 | 361,330 | 437,144 | 684,663 |
| | S2 | 402,443 | 361,648 | 436,472 | 685,318 |
| | S3 | 402,764 | 361,788 | 439,633 | 689,150 |
| | AVG | 402,843 | 361,589 | 437,750 | 686,377 |
Folder "code" contains two subfolders:
- "comparative_method" contains the code of all comparative methods. The correspondence between code files and methods is as follows:
- "MTransE.py": MTransE
- "IPTransE.py": IPTransE
- "JAPE.py": JAPE
- "TransD_plus.py": TransD+
- "TransH_plus.py": TransH+
- "TransH_2plus.py": TransH++
- "Label2Vec.py": Label2Vec
- "data_handler" contains the code of our degree-based sampling method.
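The actual sampling algorithm lives in "data_handler". Purely as an illustration of the degree-based idea (not the repository's implementation), one way to fit a target average degree is to iteratively drop the lowest-degree entities from the triple set:

```python
# Hypothetical sketch of degree-based sampling: shrink a set of relation
# triples until the entities' average degree reaches a target value.
# This illustrates the idea only; the real algorithm is in "data_handler".
from collections import Counter


def sample_by_degree(triples, target_avg_degree):
    """Drop minimum-degree entities until the average degree meets the target."""
    triples = set(triples)
    while triples:
        degree = Counter()
        for head, _, tail in triples:
            degree[head] += 1
            degree[tail] += 1
        if sum(degree.values()) / len(degree) >= target_avg_degree:
            break  # target reached
        # remove every entity currently at the minimum degree
        min_deg = min(degree.values())
        drop = {e for e, d in degree.items() if d == min_deg}
        triples = {t for t in triples if t[0] not in drop and t[2] not in drop}
    return triples
```

Each iteration strictly shrinks the triple set, so the loop terminates; on a dense core with sparse periphery, pruning pendant entities raises the average degree toward the target.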
The code is based on Python 3 and depends on TensorFlow, SciPy, NumPy and scikit-learn.
To run the code, modify the training data path and the supervision ratio in the code file and then execute `python3 <code_file>.py`. For example, to run MTransE on DBP-WD-15K-V1 with 30% supervision, first set the two parameters in the main function of `MTransE.py` to `"../ISWC2018/dbp_wd_15k_V1/"` and `0.3`, respectively, and then execute `python3 MTransE.py`. Logs and results are printed to the screen during running.
A simpler alternative is to pass the settings on the command line: `python3 <code_file>.py <data_folder> <supervision_ratio>`. For the above example, you can directly execute `python3 MTransE.py ../ISWC2018/dbp_wd_15k_V1/ 0.3`.
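The command-line handling might look like the following sketch (the function name and defaults here are illustrative, not the repository's actual code):

```python
# Hypothetical sketch of reading the data folder and supervision ratio
# from the command line, falling back to the defaults set in the file.
import sys


def parse_run_args(argv, default_folder, default_ratio):
    """Return (data_folder, supervision_ratio) from argv, or the defaults."""
    folder = argv[1] if len(argv) > 1 else default_folder
    ratio = float(argv[2]) if len(argv) > 2 else default_ratio
    assert 0.0 < ratio <= 1.0, "supervision ratio must be in (0, 1]"
    return folder, ratio


if __name__ == "__main__":
    # e.g. python3 MTransE.py ../ISWC2018/dbp_wd_15k_V1/ 0.3
    folder, ratio = parse_run_args(sys.argv, "../ISWC2018/dbp_wd_15k_V1/", 0.3)
    print(folder, ratio)
```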
As for the hyperparameters of the compared methods, you can modify them as needed in the file "param.py".
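As a purely illustrative sketch of the kind of hyperparameters that translation-based embedding methods typically expose (the names and values below are assumptions, not the actual contents of "param.py"):

```python
# Illustrative hyperparameter module in the style of a "param.py".
# NOTE: names and values are examples only, not the repository's settings.
embedding_dim = 75    # dimension of entity/relation embeddings
learning_rate = 0.01  # optimizer step size
batch_size = 2000     # triples per training batch
epochs = 500          # number of training epochs
margin = 1.0          # margin of the translation-based ranking loss
neg_samples = 1       # negative triples sampled per positive triple
```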
The file "detailed_result.csv" contains our detailed experimental results, and the folder "figure" contains figures of these results.
If you have any difficulty or question about our datasets, source code or reproducing the experimental results, please email qhzhang.nju@gmail.com, zqsun.nju@gmail.com or whu@nju.edu.cn.