Improved decision making for water lead testing in U.S. child care facilities using machine-learned Bayesian networks

Last updated: January 23, 2024

This repository provides all the data and code needed to recreate the Bayesian Network (BN) models used to predict building-wide water lead risk in child care facilities in the publication below:

Mulhern, R. E.; Kondash, A.; Norman, E.; Johnson, J.; Levine, K.; McWilliams, A.; Napier, M.; Weber, F.; Stella, L.; Wood, E.; Lee Pow Jackson, C.; Colley, S.; Cajka, J.; MacDonald Gibson, J.; Redmon, J. H. "Improved Decision Making for Water Lead Testing in U.S. Child Care Facilities Using Machine-Learned Bayesian Networks." Environmental Science & Technology 2023, 57 (46), 17959–17970.

Data

The data set for this work is based on first-draw water lead sampling from over 4,000 child care centers in North Carolina, collected by the Clean Water for Carolina Kids program: https://www.cleanwaterforcarolinakids.org/. The data set is deidentified and provides eight binary target variables as well as compiled predictor variables for machine learning. The target variables are defined as follows:

maxabove1 - Whether the maximum first-draw lead concentration at each facility exceeded 1 ppb.
maxabove5 - Whether the maximum first-draw lead concentration at each facility exceeded 5 ppb.
maxabove10 - Whether the maximum first-draw lead concentration at each facility exceeded 10 ppb.
maxabove15 - Whether the maximum first-draw lead concentration at each facility exceeded 15 ppb.
perc90above1 - Whether the 90th percentile first-draw lead concentration at each facility exceeded 1 ppb.
perc90above5 - Whether the 90th percentile first-draw lead concentration at each facility exceeded 5 ppb.
perc90above10 - Whether the 90th percentile first-draw lead concentration at each facility exceeded 10 ppb.
perc90above15 - Whether the 90th percentile first-draw lead concentration at each facility exceeded 15 ppb.
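
As an illustration of the definitions above, the following minimal sketch derives the eight binary targets from per-sample concentrations. The published data set already contains these targets, so this is only a reading aid. The data frame `samples`, its columns `facility_id` and `lead_ppb`, and the values in it are hypothetical, and the 90th percentile is computed with R's default `quantile()` method, which may differ from the method used in the publication.

```r
# Hypothetical first-draw results, one row per sample (not real data).
samples <- data.frame(
  facility_id = c("A", "A", "A", "B", "B", "C"),
  lead_ppb    = c(0.5, 3.0, 12.0, 0.2, 0.8, 16.0)
)

# Thresholds in ppb, taken from the target variable names.
thresholds <- c(1, 5, 10, 15)

# Group the concentrations by facility.
per_facility <- split(samples$lead_ppb, samples$facility_id)

# Build one row of eight logical targets per facility.
targets <- do.call(rbind, lapply(names(per_facility), function(id) {
  x <- per_facility[[id]]
  max_pb <- max(x)
  # Assumption: default quantile type; the paper's method is not stated here.
  p90_pb <- unname(quantile(x, probs = 0.9))
  # "Exceeded" is read as strictly greater than the threshold.
  flags <- c(max_pb > thresholds, p90_pb > thresholds)
  names(flags) <- c(paste0("maxabove", thresholds),
                    paste0("perc90above", thresholds))
  data.frame(facility_id = id, t(flags))
}))

print(targets)
```

Each row of `targets` corresponds to one facility, with one logical column per target variable.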

The data set needed to reproduce the analysis and its data dictionary are located in the folder: childcare_lead_BNmodels/data

Software

All models were built in R using RStudio, which can be downloaded here: https://posit.co/download/rstudio-desktop/

Additional software packages that will need to be downloaded and installed from CRAN include:

dplyr - used for data wrangling
tidyverse - used for data wrangling
ggplot2 - used for plotting
ggrepel - used for plotting labels
gRain - used for plotting BNs
visNetwork - used for plotting BN structures
Rgraphviz - used for plotting BN structures (distributed through Bioconductor rather than CRAN)
bnlearn - used for learning Bayesian network structures

ForestDisc - used for random forest discretizations of continuous variables
ROCR - used to evaluate performance using ROC curve
purrr - used to compile ROC values from nested lists
caret - used to generate the confusion matrix
zoo - used to smooth the fit of the improvement plots

To estimate conditional probabilities of the network with missing data, the latest release of bnlearn may need to be downloaded here: https://www.bnlearn.com/releases/bnlearn_4.9-20230207.tar.gz
R scripts

All scripts below are located in the folder: childcare_lead_BNmodels/scripts

The main code required to build a model for each target is Mulhern_et_al_BN_model_script_as_published.R. The target node must be set manually by the user. This script lets the user visualize and save the outputs for a single model at a time.

If only a single target is of interest, no other scripts are necessary. This script may also serve as a template for other open-source machine learning applications of Bayesian networks, since it walks through the basic pre-processing and machine learning steps: defining numerical and categorical variables, splitting the data set into training and test sets, discretizing continuous variables, learning the network structure, selecting significant predictor nodes, and assessing the model's performance.
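
For readers using the script as a template, the steps named above can be sketched on synthetic data as follows. This is not the published script. The variables `building_age` and `well_water` and the simulated data are hypothetical. Tertile cut points stand in for the ForestDisc discretization, hill-climbing (`bnlearn::hc`) is an assumed structure-learning algorithm, and the Markov blanket of the target is used as a simple stand-in for selecting significant predictor nodes.

```r
set.seed(42)

# Synthetic data: one numerical predictor, one categorical predictor, and
# a binary target named after one of the real targets.
n <- 500
building_age <- runif(n, 0, 100)
well_water <- factor(sample(c("no", "yes"), n, replace = TRUE))
p <- plogis(-3 + 0.04 * building_age + 1.0 * (well_water == "yes"))
maxabove1 <- factor(ifelse(runif(n) < p, "yes", "no"), levels = c("no", "yes"))
dat <- data.frame(building_age, well_water, maxabove1)

# Split into training (80%) and test (20%) sets.
train_idx <- sample(seq_len(n), size = round(0.8 * n))
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# Discretize the continuous variable with cut points learned on the
# training set only, then apply the same cut points to the test set.
breaks <- unname(quantile(train$building_age, probs = c(0, 1/3, 2/3, 1)))
breaks[1] <- -Inf
breaks[4] <- Inf
cut_age <- function(x) cut(x, breaks = breaks, labels = c("low", "mid", "high"))
train$building_age <- cut_age(train$building_age)
test$building_age <- cut_age(test$building_age)

if (requireNamespace("bnlearn", quietly = TRUE)) {
  # Learn the network structure (assumed algorithm: hill-climbing).
  net <- bnlearn::hc(train)
  # Estimate conditional probability tables; the Bayesian estimator with a
  # small imaginary sample size avoids empty cells.
  fit <- bnlearn::bn.fit(net, train, method = "bayes", iss = 1)
  # Stand-in for predictor selection: the target's Markov blanket.
  selected <- bnlearn::mb(net, "maxabove1")
  # Predict the target on the held-out set and tabulate the results.
  pred <- predict(fit, node = "maxabove1", data = test, method = "bayes-lw")
  confusion <- table(predicted = pred, observed = test$maxabove1)
  print(selected)
  print(confusion)
}
```

Learning the cut points on the training set alone keeps information from the test set out of the model, which is the reason the split comes before the discretization in this sketch.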

To summarize the outputs of all eight models shown in the cited manuscript, the above script must be run eight times, once for each target node. The subsequent scripts then summarize the outputs of all eight models. These additional scripts are described below and should be run in the order listed, since the outputs of some are used as inputs to others:

improvement_summary.R - This script generates Figures 4 and 6 in the manuscript to compare the F-scores, sensitivity improvement, and sampling reduction metrics achieved by the BN models compared to the various alternative heuristics.
performance_summary.R - This script generates Figure 2 and Figure S9 in the manuscript and an overall summary table of the performance metrics of all eight models.
sigvars_summary_all_models.R - This script generates Figure 3 in the manuscript to visualize the frequency of variables selected across all eight models.
network_structure_summary.R - This script generates clean versions of the network structures for all eight target nodes. Interactive versions of the outputs can be seen at: https://www.cleanwaterforcarolinakids.org/publications/bn_models
tornado_chart_summary.R - This script generates the tornado plots in Figures S11 through S15 in the manuscript Supporting Information. These plots help visualize the effect of important variables on water lead risk across all models where they were selected.
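
The workflow described in this section (eight runs of the main script, then the summary scripts in order) could be organized with a small driver like the sketch below. The function `run_summaries` is hypothetical, and the sketch assumes the working directory is the parent of the repository folder. The summary scripts themselves may expect a different working directory or specific output file locations, so check each script before relying on this.

```r
# The eight target nodes; set each one manually in the main script
# and run that script once per target before running the summaries.
targets <- c(paste0("maxabove", c(1, 5, 10, 15)),
             paste0("perc90above", c(1, 5, 10, 15)))

# Summary scripts in the order given in this README.
scripts_dir <- file.path("childcare_lead_BNmodels", "scripts")
summary_scripts <- c(
  "improvement_summary.R",
  "performance_summary.R",
  "sigvars_summary_all_models.R",
  "network_structure_summary.R",
  "tornado_chart_summary.R"
)

# Hypothetical helper: sources each summary script in order and stops at
# the first one that cannot be found.
run_summaries <- function(dir = scripts_dir) {
  for (s in summary_scripts) {
    path <- file.path(dir, s)
    if (!file.exists(path)) stop("Script not found: ", path)
    message("Running ", s)
    source(path, echo = FALSE)
  }
  invisible(summary_scripts)
}

message("Targets to model first: ", paste(targets, collapse = ", "))
```

Once all eight models have been built and saved, `run_summaries()` would execute the five summary scripts in sequence.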

Questions

Questions about the code in this repository should be directed to RTI International staff at joejohnson@rti.org, jredmon@rti.org, akondash@rti.org, and cleanwater@rti.org.
