We developed a degree-based sampling method to generate 42 alignment-oriented datasets from real-world large-scale KGs, preserving different heterogeneities of the original KGs. We selected three state-of-the-art embedding-based entity alignment methods for evaluation and comparison. Furthermore, we observed that multi-mapping relations and literal embedding are the two main obstacles for embedding-based entity alignment, and we attempted some preliminary solutions. Specifically, we leveraged several enhanced KG embedding models to handle multi-mapping relations and used word2vec to incorporate literal similarities into embeddings. Our findings indicate that the performance of existing embedding-based methods depends on the characteristics of the datasets and that not all KG embedding models are suitable for entity alignment. Alignment-oriented KG embedding remains an open problem.
We considered the following four aspects when building our datasets: source KG, dataset language, entity size, and the difference in degree distributions between the extracted datasets and the original KGs. We selected three well-known KGs as our sources: DBpedia (2016-10), Wikidata (20160801) and YAGO3. For DBpedia, we also formed two cross-lingual datasets: English-French and English-German. In terms of entity size, we sampled two kinds of datasets, with 15K and 100K entities, respectively. Each dataset has two versions, V1 and V2: V1 approximates the degree distribution of the source KG, while V2 fits a doubled average degree. Due to the lack of sufficient prior alignment, we only built V1 for the cross-lingual DBP-100K datasets. For each version, we generated three samples to mitigate the effect of sampling randomness. Each dataset consists of five files:
- ent_links: reference entity alignment
- triples_1: relation triples of sampled entities in KG1
- triples_2: relation triples of sampled entities in KG2
- attr_triples_1: attribute triples of sampled entities in KG1
- attr_triples_2: attribute triples of sampled entities in KG2
All datasets can be downloaded from Datahub or Dropbox, where the three folders named "_1", "_2" and "_3" contain our three samples.
The figure below shows an example of the degree distributions of a source KG and its sampled datasets; the sampled dataset in the figure is DBP-WD-15K. The red curves correspond to the V1 version and the blue curves to the V2 version; solid curves represent the source KG, and dotted curves represent the sampled dataset.
The statistics of the 100K datasets are shown below.
DBP-WD-100K

| | Sample | V1: DBpedia | V1: Wikidata | V2: DBpedia | V2: Wikidata |
|---|---|---|---|---|---|
| Relations | S1 | 358 | 216 | 333 | 221 |
| | S2 | 364 | 211 | 333 | 226 |
| | S3 | 368 | 217 | 347 | 221 |
| | AVG | 363 | 215 | 338 | 223 |
| Attributes | S1 | 463 | 807 | 349 | 740 |
| | S2 | 486 | 791 | 390 | 731 |
| | S3 | 466 | 783 | 402 | 756 |
| | AVG | 472 | 794 | 380 | 742 |
| Rel. triples | S1 | 257,398 | 226,585 | 497,241 | 503,836 |
| | S2 | 259,100 | 224,863 | 493,865 | 484,209 |
| | S3 | 269,471 | 237,846 | 519,713 | 517,948 |
| | AVG | 261,990 | 229,765 | 503,606 | 501,998 |
| Attr. triples | S1 | 399,424 | 593,332 | 385,004 | 838,155 |
| | S2 | 398,373 | 587,581 | 397,852 | 830,654 |
| | S3 | 397,787 | 619,950 | 389,973 | 856,447 |
| | AVG | 398,528 | 600,288 | 390,943 | 841,752 |
DBP-YG-100K

| | Sample | V1: DBpedia | V1: YAGO | V2: DBpedia | V2: YAGO |
|---|---|---|---|---|---|
| Relations | S1 | 326 | 30 | 311 | 31 |
| | S2 | 358 | 31 | 320 | 31 |
| | S3 | 337 | 30 | 303 | 31 |
| | AVG | 340 | 30 | 311 | 31 |
| Attributes | S1 | 404 | 24 | 347 | 24 |
| | S2 | 415 | 24 | 335 | 23 |
| | S3 | 402 | 24 | 343 | 23 |
| | AVG | 407 | 24 | 342 | 23 |
| Rel. triples | S1 | 261,038 | 277,779 | 457,197 | 535,106 |
| | S2 | 281,143 | 318,434 | 443,115 | 522,817 |
| | S3 | 280,904 | 313,147 | 457,888 | 529,100 |
| | AVG | 274,362 | 303,120 | 452,733 | 529,008 |
| Attr. triples | S1 | 425,648 | 141,936 | 442,973 | 108,338 |
| | S2 | 413,532 | 131,411 | 442,122 | 111,467 |
| | S3 | 420,947 | 136,464 | 448,000 | 105,639 |
| | AVG | 420,042 | 136,604 | 444,365 | 108,481 |
DBP(en_fr)-100K-V1 and DBP(en_de)-100K-V1

| | Sample | en (en_fr) | fr (en_fr) | en (en_de) | de (en_de) |
|---|---|---|---|---|---|
| Relations | S1 | 329 | 257 | 305 | 163 |
| | S2 | 331 | 254 | 310 | 167 |
| | S3 | 331 | 256 | 305 | 169 |
| | AVG | 330 | 256 | 307 | 166 |
| Attributes | S1 | 332 | 469 | 360 | 494 |
| | S2 | 331 | 478 | 361 | 494 |
| | S3 | 331 | 480 | 357 | 489 |
| | AVG | 331 | 476 | 359 | 492 |
| Rel. triples | S1 | 367,096 | 294,440 | 273,093 | 230,586 |
| | S2 | 367,190 | 294,378 | 274,256 | 232,439 |
| | S3 | 367,328 | 294,471 | 275,022 | 232,364 |
| | AVG | 367,205 | 294,430 | 274,124 | 231,796 |
| Attr. triples | S1 | 403,321 | 361,330 | 437,144 | 684,663 |
| | S2 | 402,443 | 361,648 | 436,472 | 685,318 |
| | S3 | 402,764 | 361,788 | 439,633 | 689,150 |
| | AVG | 402,843 | 361,589 | 437,750 | 686,377 |
Folder "code" contains two subfolders:
- "comparative_method" contains the code of all comparative methods. The correspondence between code files and methods is as follows:
- "MTransE.py": MTransE
- "IPTransE.py": IPTransE
- "JAPE.py": JAPE
- "TransD_plus.py": TransD+
- "TransH_plus.py": TransH+
- "TransH_2plus.py": TransH++
- "Label2Vec.py": Label2Vec
- "data_handler" contains the code of our degree-based sampling method.
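The actual sampling algorithm lives in "data_handler". Purely as an illustration of the degree-based idea (not the repository's implementation), one way to fit a target average degree is to iteratively drop the lowest-degree entities from the triple set:

```python
# Hypothetical sketch of degree-based sampling: shrink a set of relation
# triples until the entities' average degree reaches a target value.
# This illustrates the idea only; the real algorithm is in "data_handler".
from collections import Counter


def sample_by_degree(triples, target_avg_degree):
    """Drop minimum-degree entities until the average degree meets the target."""
    triples = set(triples)
    while triples:
        degree = Counter()
        for head, _, tail in triples:
            degree[head] += 1
            degree[tail] += 1
        if sum(degree.values()) / len(degree) >= target_avg_degree:
            break  # target reached
        # remove every entity currently at the minimum degree
        min_deg = min(degree.values())
        drop = {e for e, d in degree.items() if d == min_deg}
        triples = {t for t in triples if t[0] not in drop and t[2] not in drop}
    return triples
```

Each iteration strictly shrinks the triple set, so the loop terminates; on a dense core with sparse periphery, pruning pendant entities raises the average degree toward the target.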
The code is based on Python 3 and depends on TensorFlow, SciPy, NumPy and scikit-learn.
To run the code, modify the training data path and the supervision ratio in the code file and then execute `python3 <code_file>.py`. For example, to run MTransE on DBP-WD-15K-V1 with 30% supervision, first set the two parameters in the main function of `MTransE.py` to `"../ISWC2018/dbp_wd_15k_V1/"` and `0.3`, respectively, and then execute `python3 MTransE.py`. Logs and results are printed to the screen during running.
A simpler alternative is to pass the settings on the command line: `python3 <code_file>.py <data_folder> <supervision_ratio>`. For the above example, you can directly execute `python3 MTransE.py ../ISWC2018/dbp_wd_15k_V1/ 0.3`.
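The command-line handling might look like the following sketch (the function name and defaults here are illustrative, not the repository's actual code):

```python
# Hypothetical sketch of reading the data folder and supervision ratio
# from the command line, falling back to the defaults set in the file.
import sys


def parse_run_args(argv, default_folder, default_ratio):
    """Return (data_folder, supervision_ratio) from argv, or the defaults."""
    folder = argv[1] if len(argv) > 1 else default_folder
    ratio = float(argv[2]) if len(argv) > 2 else default_ratio
    assert 0.0 < ratio <= 1.0, "supervision ratio must be in (0, 1]"
    return folder, ratio


if __name__ == "__main__":
    # e.g. python3 MTransE.py ../ISWC2018/dbp_wd_15k_V1/ 0.3
    folder, ratio = parse_run_args(sys.argv, "../ISWC2018/dbp_wd_15k_V1/", 0.3)
    print(folder, ratio)
```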
As for the hyperparameters of the compared methods, you can modify them as needed in the file "param.py".
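As a purely illustrative sketch of the kind of hyperparameters that translation-based embedding methods typically expose (the names and values below are assumptions, not the actual contents of "param.py"):

```python
# Illustrative hyperparameter module in the style of a "param.py".
# NOTE: names and values are examples only, not the repository's settings.
embedding_dim = 75    # dimension of entity/relation embeddings
learning_rate = 0.01  # optimizer step size
batch_size = 2000     # triples per training batch
epochs = 500          # number of training epochs
margin = 1.0          # margin of the translation-based ranking loss
neg_samples = 1       # negative triples sampled per positive triple
```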
The file "detailed_result.csv" contains our detailed experimental results, and the folder "figure" contains figures of these results.
If you have any difficulty or question about our datasets, source code or reproducing the experimental results, please email qhzhang.nju@gmail.com, zqsun.nju@gmail.com or whu@nju.edu.cn.