This repository contains code and data for a feasibility study of formally private methods for histograms, means, quantiles, and linear regression by Andres Felipe Barrientos, Aaron R. Williams, Joshua Snoke, and Claire McKay Bowen.
This repository contains code and data for A Feasibility Study of Differentially Private Summary Statistics and Regression Analyses with Evaluations on Administrative and Survey Data (JASA). The paper evaluates the results of several differentially private methods for a range of statistics on two case studies.
One case study uses confidential tax microdata that is not available for reproduction. The second case study uses public microdata from the 1994 to 1996 Current Population Survey Annual Social and Economic Supplements (CPS ASEC).
The .R
scripts in analyses/
iterate the tests with different methods, applications, and data. 01
is summary statistics for the SOI, 02
is regression methods for the SOI, 03
is summary statistics for the CPS, and 04
is regression methods for the CPS. These scripts were mostly run on AWS instances of varying sizes, though the scripts contain non-iterated examples that are commented out.
analyses/01_soi_histograms.R
analyses/01_soi_means.R
analyses/01_soi_quantiles.R
analyses/02_soi_regression.R
analysis/03_cps_histograms.R
analyses/03_cps_mean-income.R
analyses/03_cps_mean-earned-income.R
analyses/03_cps_quantiles.R
analyses/04_cps_female-regression.R
analyses/04_cps_male-regression.R
The iteration scripts write results and summaries of results to the results/
folder. The following five documents contain documentation about the applications and summarize the results from results/
:
analyses/01_soi-puf-summary-stats.Rmd
analyses/02_soi-puf-regression.Rmd
analyses/03_cps-summary-stats.Rmd
analyses/04_cps_female-regression.Rmd
analyses/04_cps_male-regression.Rmd
The results of those documents can be viewed here (for the version on the master
branch):
- 01 PUF Summary Statistics
- 02 PUF Regression Analysis
- 03 CPS Summary Statistics
- 04 CPS Regression Analysis (Female)
- 04 CPS Regression Analysis (Male)
Most of the methods have wrapper functions like lm_*
and quantile_*
in R/
. The wrapper functions provide a consistent interface around methods with varying styles in R and Python. These methods are in dp-code/
.
tax-code/
and R/prep_*.R
contain code for setting up the example analyses.
The CPS example uses data pulled from IPUMS CPS.
Sarah Flood, Miriam King, Renae Rodgers, Steven Ruggles, J. Robert Warren and Michael Westberry. Integrated Public Use Microdata Series, Current Population Survey: Version 10.0 [dataset]. Minneapolis, MN: IPUMS, 2022. https://doi.org/10.18128/D030.V10.0
The data are loaded using library(ipumsr)
and the functions read_ipums_ddi()
and read_ipums_micro()
. These functions use the *.xml
and *.dat.gz
files in the data/
directory to load the CPS microdata into R.
cps_00008.xml
is used for quantiles and means. cps_00009.xml
is used for histograms. cps_00012.xml
and cps_00013.xml
are used for linear regressions.
- The histogram iteration programs do not use parallel computing.
- The quantile iteration programs do not use parallel computing because they reference python code. The "smooth" method is very slow.
- The mean programs use 14 workers and need a computer like an AWS c5.4xlarge.
- The regression programs use 34 workers and need a computer like an AWS c5.9xlarge.
The programs rely on R packages from CRAN and GitHub, and Python packages from PIP, Conda, and GitHub. config.sh
sets up the computing environment including the packages and Conda environments. library(synthpuf)
is cloned by config.sh
but needs to be built by opening the package in RStudio and building.
Please contact Aaron R. Williams with questions.