Code appendix accompanying the paper "Entity Resolution and the Downstream Task: A Case Study of North Carolina Voter Registration Records"
This appendix contains the following folders:
- 0_create_data -- Scripts that either simulate data (geco) or clean the publicly available web data used in the paper (caswell_voters).
- 1_record_linkage -- Configuration files for performing record linkage using the dblink (v0.1.0) package. Also includes an R script for organizing the resulting posterior draws.
- 2_canonicalization -- Scripts for performing canonicalization on the Caswell County and GeCO data sets.
- 3_downstream_task -- Scripts for performing regression tasks after canonicalization has occurred.
- 4_figures_and_tables -- Scripts to create all figures and tables in the paper after all other code has been run.
The code is presented in the order it should be run, from the folder labeled 0 up to the folder labeled 4.
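The run order follows the numeric prefix of each folder name. As a minimal sketch (not part of the appendix itself), the ordering can be recovered from the folder names alone. The function name `run_order` and the `FOLDERS` list are illustrative assumptions; only the folder names come from the list above.

```python
import re

# Folder names as listed in this appendix, deliberately shuffled here.
FOLDERS = [
    "3_downstream_task",
    "0_create_data",
    "4_figures_and_tables",
    "1_record_linkage",
    "2_canonicalization",
]


def run_order(folders):
    """Return folder names sorted by their leading integer prefix."""
    def prefix(name):
        match = re.match(r"(\d+)_", name)
        if match is None:
            raise ValueError("no numeric prefix in folder name: %r" % name)
        return int(match.group(1))

    # Sort numerically so that a prefix such as 10 would follow 2.
    return sorted(folders, key=prefix)


if __name__ == "__main__":
    for folder in run_order(FOLDERS):
        print(folder)
```

Sorting on the parsed integer rather than on the raw string keeps the order correct if more than ten numbered folders are ever present.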
The following is a list of packages and technologies that must be installed and where they can be found.
MySQL
See https://dev.mysql.com/doc/mysql-getting-started/en/ for details on getting started with MySQL.
R version > 4.1.1
- ggplot2 (CRAN)
- dplyr (CRAN)
- tidyr (CRAN)
- babynames (CRAN)
- readxl (CRAN)
- rvest (CRAN)
- sparklyr (CRAN)
- sparklyr.nested (https://github.com/mitre/sparklyr.nested)
- representr (CRAN)
- rstanarm (CRAN)
- tidyverse (CRAN)
- knitr (CRAN)
- kableExtra (CRAN)
Python 2.7.16
- geco-data-generator-corruptor (https://dmm.anu.edu.au/geco/index.php)
- datetime (Python standard library)
- random (Python standard library)
Apache Spark
- spark-2.3.1
- dblink v0.1.0 (https://github.com/cleanzr/dblink/releases/tag/v0.1)
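Before running the scripts, it may help to confirm that the command-line tools for the technologies listed above are on the PATH. The following is a rough sketch only: the executable names (`mysql`, `Rscript`, `python`, `spark-submit`) are assumptions about a typical installation, the helper names are hypothetical, and the check does not verify versions or installed packages. It uses `shutil.which`, so it needs Python 3.3 or later and is separate from the Python 2.7.16 environment required above.

```python
import shutil

# Assumed executable names for each technology; adjust for your installation.
REQUIRED_TOOLS = {
    "MySQL": "mysql",
    "R": "Rscript",
    "Python": "python",
    "Apache Spark": "spark-submit",
}


def find_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Map each technology to the path of its executable, or None if not found."""
    return {label: which(exe) for label, exe in tools.items()}


def missing_tools(found):
    """Return the sorted labels of technologies whose executable was not found."""
    return sorted(label for label, path in found.items() if path is None)


if __name__ == "__main__":
    found = find_tools()
    for label, path in sorted(found.items()):
        print("%s: %s" % (label, path if path else "NOT FOUND"))
```

The `which` parameter is injectable so the lookup can be exercised without any of the tools actually being installed.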