ngmarchant / exchanger-experiments

Scripts for reproducing the experiments in our JSSAM article on Bayesian Graphical Entity Resolution

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Experiments for "Exchangeable clustering priors for Bayesian entity resolution"

This folder contains files required to reproduce the experiments for the following paper:

N. G. Marchant, B. I. P. Rubinstein, R. C. Steorts, "Bayesian Graphical Entity Resolution using Exchangeable Random Partition Priors," Journal of Survey Statistics and Methodology, 2023, smac030. DOI: 10.1093/jssam/smac030. arXiv: 2301.02962.

Data sets

Four of the five data sets are included in the datasets directory:

  • rest: with filename fz-nophone.arff.gz. Original source.
  • cora: with filename cora.arff.gz. Modified from original source to correct erroneous ground truth labels.
  • RLdata: with filename RLdata10000.csv.gz. From the RecordLinkage R package.
  • synthdata: with filenames matching the pattern gen_link-conf-mu-*_dist-conf-*_seed-*_exp-num-recs-*_records.csv.gz. The Python notebook used to generate the synthetic data is included.

nltcs is available from NACDA after signing a data usage agreement.

Dependencies

To run the experiments for our model, you must install the exchanger R package. It is hosted on GitHub and can be installed from within R using devtools::install_github("cleanzr/exchanger").

Similarly, to run the experiments for the model of Sadinle (2014), you must install the BDD R package. It is also hosted on GitHub and can be installed from within R using devtools::install_github("cleanzr/BDD").

Other dependencies include the following R packages from CRAN:

  • comparator
  • clevr
  • tidyverse
  • ggdist
  • egg

R scripts

The following scripts define functions that are shared across the experiments:

  • run_ours.R
  • run_sadinle.R
  • util.R

To run all of the experiments for one of the models, execute the following in a terminal:

$ Rscript run_<model>_all.R

replacing <model> with:

  • ours for our model,
  • blink for the model by Steorts (2015),
  • ours-blinkdist for our model with the distortion model by Steorts (2015), or
  • sadinle for the model by Sadinle (2014).

To run an experiment for a particular data set and model, execute the following in a terminal:

Rscript run_<model>_<dataset>.R

where <model> is defined as above and <dataset> can be one of cora, nltcs, restaurant, RLdata10000 or synthdata.

Output of an experiment

Each experiment will produce several files:

  • <prefix>_result.rds: the saved state of the model and Markov chain
  • <prefix>_eval.txt: pairwise and clustering evaluation metrics computed for a point estimate.
  • <prefix>_trace-*.png: various diagnostic plots. These vary for each model.

Reproducing tables and figures

After running all experiments, tables and figures can be reproduced as follows:

  • Figure 3 can be reproduced by running plot_err-num_ents_comparison.R
  • Table 2 and Figure S7 can be reproduced by running evaluate_prior_dist_model.R
  • Figure 4 can be reproduced by running plot_dist-level_comparison.R
  • Table 3 can be reproduced by running evaluate_models.R
  • Figures S5 and S6 can be reproduced by running evaluate_synthdata.R
  • Figure S8 can be reproduced by running plot_ep-params.R
  • Figure S9 can be reproduced by running plot_m_sadinle.R
  • The diagnostic plots in Appendix G can be reproduced by running plot_geweke.R and plot_trace.R

About

Scripts for reproducing the experiments in our JSSAM article on Bayesian Graphical Entity Resolution

License:GNU General Public License v3.0


Languages

Language:R 81.7%Language:Jupyter Notebook 18.3%