The code repository of the paper "Computing Graph Edit Distance via Neural Graph Matching" (PVLDB 2023). This figure gives an overview of our GEDGNN model.
We implement GEDGNN and the other baseline machine learning models in `models.py` and `GedMatrix.py`. Import these models by:

```python
from models import GPN, SimGNN, GedGNN, TaGSim
```
The k-best matching algorithm used for post-processing is implemented in `kbest_matching_with_lb.py`. Import it by:

```python
from kbest_matching_with_lb import KBestMSolver
```
To reproduce the experimental results in our paper, please refer to the folder `experiments/`, which contains three projects (Overall Performance, Path Results and Powerlaw Graphs). The first two projects correspond to the experiments on real graph data sets in Table 3, Figure 3 and Figure 5 of the paper; the last one corresponds to the experiments on large-scale synthetic power-law graphs in Figure 4 of the paper. They share the same core code but differ slightly in ground-truth generation and evaluation metrics.
Taking `Overall Performance` as an example, the content of a project is shown in the figure below.
All projects can be run on a Windows system without a GPU. To run them on Linux, or to train GEDGNN with a GPU, minor adjustments to the source code are necessary.
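For the GPU case, the usual PyTorch adjustment looks roughly like the sketch below. This is a hypothetical illustration, not the repository's actual code: the variable names and the exact places where the model and tensors must be moved depend on `src/main.py`.

```python
import torch

# Pick a device once: fall back to CPU when no GPU is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical usage: move the model and every input batch to that device.
# model = GedGNN(args).to(device)
# batch = {k: v.to(device) for k, v in batch.items()}
print(device.type)
```

Note that the pinned `torch==1.8.2+cpu` wheel has no CUDA support, so GPU training also requires installing a CUDA build of the same PyTorch version.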
Install Python 3.8 and the packages specified in `requirements.txt`:
- dgl==0.7.0
- matplotlib==3.3.4
- networkx==2.5
- numpy==1.20.1
- scipy==1.6.2
- texttable==1.6.4
- torch==1.8.2+cpu
- torch_geometric==2.0.4
- tqdm==4.59.0
Under the `Overall Performance` directory, we can train GEDGNN for 20 epochs and test the corresponding models using:

```bash
python src/main.py --model-name GedGNN --dataset AIDS --model-epoch-start 0 --model-epoch-end 20 --model-train 1
```
Run post-processing using the 20th-epoch model by:

```bash
python src/main.py --model-name GedGNN --dataset AIDS --model-epoch-start 20 --model-epoch-end 20 --model-train 0
```
`AIDS` for the parameter `dataset` can be replaced by the other data sets `Linux` and `IMDB`.
`GedGNN` for the parameter `model-name` can be replaced by the baseline models `SimGNN`, `GPN` and `TaGSim`.
The `Path Results` and `Powerlaw Graphs` projects are run in the same way. The detailed running parameters for training, testing and post-processing can be found in `arg.txt` under each directory.
In the following, we present a running demo of `Overall Performance`.
- Firstly, we train GEDGNN for 2 epochs.

```bash
python src/main.py --model-name GedGNN --dataset AIDS --model-epoch-start 0 --model-epoch-end 2 --model-train 1 --demo
```
Note: the parameter `--demo` triggers a quick demo run in which only a small number of training and testing graph pairs are used. Please do not use `--demo` in formal evaluations.
In each epoch, the project will do the following things sequentially.
- Train the model for 1 epoch, and then output the training results.
- Store the latest model under `model_save/`. Note that all historical models are stored as `AIDS_1`, `AIDS_2`, ... for further evaluation.
- Test the latest model, and then output the testing results. Note that in this step we merely test the GED value prediction generated by the machine learning model. For GEDGNN, the results denote the performance of GEDGNN-value.
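The per-epoch routine above can be sketched as follows. This is a hypothetical outline with made-up helper names, not the repository's actual training loop; it only illustrates the `<dataset>_<epoch>` checkpoint naming.

```python
import os
import tempfile

def run_epochs(dataset, num_epochs, save_dir):
    """Sketch of the per-epoch routine: train, checkpoint, test."""
    saved = []
    for epoch in range(1, num_epochs + 1):
        # 1. Train the model for one epoch and output training results (omitted here).
        # 2. Store the latest model as <dataset>_<epoch>, keeping all history.
        path = os.path.join(save_dir, f"{dataset}_{epoch}")
        open(path, "w").close()  # stand-in for torch.save(model.state_dict(), path)
        saved.append(os.path.basename(path))
        # 3. Test the latest model on GED value prediction only (GEDGNN-value).
    return saved

with tempfile.TemporaryDirectory() as d:
    print(run_epochs("AIDS", 2, d))  # checkpoints named AIDS_1, AIDS_2
```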
- Secondly, we run the post-processing algorithm using `AIDS_2` (the model from the final epoch).

```bash
python src/main.py --model-name GedGNN --dataset AIDS --model-epoch-start 2 --model-epoch-end 2 --model-train 0 --demo
```
The results of post-processing denote the performance of GEDGNN-matching. Recall that an edit path is generated by the post-processing algorithm, and the predicted GED value is the length of this path.
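Since the predicted GED is the length of a concrete edit path, it is always an upper bound on the true GED. On small graphs, predictions can be sanity-checked against the exact GED, e.g. with `networkx` (already in `requirements.txt`); the two graphs below are illustrative toys, not taken from the data sets.

```python
import networkx as nx

# Two small test graphs: a 3-node path and a 3-node triangle.
g1 = nx.path_graph(3)      # edges: 0-1, 1-2
g2 = nx.complete_graph(3)  # edges: 0-1, 1-2, 0-2

# Exact GED by exhaustive search; feasible only for small graphs.
ged = nx.graph_edit_distance(g1, g2)
print(ged)  # a single edge insertion turns the path into the triangle
```

Exhaustive search becomes infeasible quickly as graphs grow, which is exactly why learned models such as GEDGNN are needed for the larger pairs.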
- All results are sequentially output into `result/results.txt`. A sample result file, which depicts the output format, is shown below.