This assignment aims to analyze the performance of PARIS, a framework for Knowledge Graph (KG) Alignment (and more), based on a probabilistic method. Specifically, we test it against two small KGs (FB15k and DB15k) and analyze what fraction of the real alignment PARIS gets right. In order to do so, Precision, Recall, F1 score, and total running time are computed and elaborated. This README enables to easily run our evaluation code to produce the same results we achieved (some small difference are acceptable during the intrinsic randomization).
The minimum requirements for the main.py
are:
Python
(tested on version 3.8.2)- pip (tested on version 20.1.1) (For package installation, if needed)
numpy
(tested on version 1.18.1)pandas
(tested on version 1.0.4)matplotlib
(tested on version 3.1.3)
We used a LaTeX backend for our plots so that they have a "LaTeX style" and we can use its syntax for axes labeling, titles, and anything else. If you want to reproduce the plots (using the --plots
flag, see Usage instruction below), you will need to have LaTeX installed locally on your machine. The installation really depends on your OS, so we recommend to go to LaTeX official website for more information.
For Ubuntu users, we recommend using apt-get
to install. We installed the full version on our machines (Note that this version is around 6GB big):
sudo apt-get install texlive-full
You can try a smaller version if you don't have enough space, but after that, you may get some missing LaTeX packets, so Google for them is the best choice.
We provide a notebook Analysis.ipynb
with disclosed output with all the plots we produce in the report and in the main.py
. If you don't want to install LaTeX you can go over this notebook to see how the plots have been generated (Note: that might be some small differences in how these plots are done, especially regarding the figure size). This notebook provide also the same aggregated metrics you can obtain by running the script.
- Open CMD/Bash
- Move to the root folder, where the
main.py
is located - Execute the command
python3 main.py
, eventually adding one or more of the following arguments:
Optional arguments:
-h, --help show this help message and exit
--no_paris Use this flag to avoid running PARIS and load precomputed
results from pickle instead. If not set, the full algorithm will
be executed 20 times for 3 different seed fractions
(10%/20%/50%).This may require about an hour.
--plots Produce the same plots as shown in the report and in the
notebook and save them to pdf for later use.
Whatever your choice, aggregated metrics for the last iteration only and 95% confidence interval will be computed. If you choose to run the complete PARIS algorithm intermediate metrics for each run will be printed to the console.
.
├── data
| ├── original # Original txt of DB15k and FB15k
| | ├── DB15K_EntityTriples.txt
| | ├── FB15K_EntityTriples.txt
│ | └── DB15K_SameAsLink.txt
│ ├── pkl # Precomputed pickles for all the possible seeds
| | ├── 0.1
| | | ├── f1_scores.pkl
| | | ├── precisions.pkl
│ | | ├── recalls.pkl
| | | └── timings.pkl
│ | ├── 0.2
| | | ├── f1_scores.pkl
| | | ├── precisions.pkl
│ | | ├── recalls.pkl
| | | └── timings.pkl
│ | └── 0.5
| | ├── f1_scores.pkl
| | ├── precisions.pkl
│ | ├── recalls.pkl
| | └── timings.pkl
| ├── DB15K_EntityTriples.nt # Converted in N-Triples format
| ├── FB15K_EntityTriples.nt # Converted in N-Triples format
| └── DB15K_SameAsLink.nt # Converted in N-Triples format
|
├── plots # Folder to store the plots
| ├── ....
| └── Different kind of plots .pdf
|
├── main.py # Main entry point
├── paris_0_3.jar # PARIS JAR to execute the experiments
├── plot.py # Plotting utility script
├── requirements.txt # Python requirements
├── Analysis.ipynb # Notebook with disclosed output
├── Report.pdf # Report in PDF
└── README.md
Note: additional folders will be created during the code execution to store PARIS intermediate results and seeded N-Triples dataset. These folders are not showed here as they are not necessary to start the algorithm (The script will create them) and they will be overwritten at each run.
Due to the probabilistic nature of PARIS, together with the random seed generation, the user is advised that some small differences are possible among different runs. However, we found PARIS to be a very stable algorithm so the overall conclusion remains the same. Our configuration: especially for the timing analysis, it is important to consider the configuration we used to run our experiments. That is a Lenovo Thinkpad T450s with 12GB RAM and Intel I7-5600U CPU
Moreover, a special note has to be added for the two jupyter notebooks: since we are using the Latex style for the plots, sometimes we have faced unexpected troubles at rendering such graphics (running the cells which should output the plot does not have any effect). We didn't manage to solve this bug, which occurred irregularily and is not well documented on the Internet. If you would ever run the code and face such trouble, we noticed that it is enough to re-run the matplotlib
import and the latex style cells multiple times to solve such issues and render regularly the plots.
- Manuel Leone
- Stefano Huber