praful-dodda / LCS_placement_sims

Comparing distributions of low-cost sensors in terms of accuracy and equity of real-time air quality information


LCS_placement_sims

Investigating factors affecting accuracy and equity of real-time air quality information, resulting from deployment of low-cost sensors.

Analysis (chronologically, after Obtaining Data)

Functions used:

  • AQI_equation.R -- used to calculate the EPA AQI classifications from the "observed" concentrations.
  • Calibrate_PA.R -- prepares the PurpleAir (PA) data and compares it with nearby reference measurements, to inform the simulation of LCS measurement error.
  • Sim_functions.R -- for each trial, samples locations to "deploy" LCS, simulates measurement error at those locations, assigns each grid point the PM2.5 information from the nearest monitor/sensor, and calculates metrics comparing the air quality that is reported vs experienced, overall and for marginalized subpopulations, weighted and unweighted by population density.
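
As a rough illustration of the trial logic described for Sim_functions.R, here is a minimal Python sketch (the repository's own code is in R). It makes several assumptions that are not taken from the source: the function and key names are hypothetical, the AQI classes use the pre-2024 EPA PM2.5 breakpoints, the only metric shown is an AQI misclassification rate, and reference monitors are left out so that only sensors report values.

```python
import math
import random

# Upper PM2.5 bounds (ug/m3) of the first five AQI classes. Assumption: the
# pre-2024 EPA 24-hour breakpoints; AQI_equation.R may use different values.
AQI_UPPER_BOUNDS = [12.0, 35.4, 55.4, 150.4, 250.4]


def aqi_class(pm25):
    """Return 0 (Good) through 5 (Hazardous) for a PM2.5 concentration."""
    for k, upper in enumerate(AQI_UPPER_BOUNDS):
        if pm25 <= upper:
            return k
    return len(AQI_UPPER_BOUNDS)


def run_trial(grid, candidates, n_sensors, error_frac, rng):
    """One simulated deployment (hypothetical interface).

    grid: list of dicts with keys "xy" (projected coordinates), "pm25", "pop".
    candidates: indices of grid points where a sensor may be placed.
    error_frac: SD of the sensor error as a fraction of the true PM2.5.
    """
    deployed = rng.sample(candidates, n_sensors)
    # Simulate what each deployed sensor reports (true value plus noise).
    reported = {
        i: max(0.0, grid[i]["pm25"] + rng.gauss(0.0, error_frac * grid[i]["pm25"]))
        for i in deployed
    }
    wrong = wrong_pop = total_pop = 0.0
    for point in grid:
        # Each grid point "sees" the reading of its nearest sensor.
        nearest = min(deployed, key=lambda i: math.dist(point["xy"], grid[i]["xy"]))
        miss = aqi_class(reported[nearest]) != aqi_class(point["pm25"])
        wrong += miss
        wrong_pop += miss * point["pop"]
        total_pop += point["pop"]
    return {
        "misclass_unweighted": wrong / len(grid),
        "misclass_weighted": wrong_pop / total_pop,
    }
```

Restricting the loop to grid points that belong to a given subpopulation would give the corresponding subgroup metric under the same assumptions.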

Scripts in order:

  1. Run_sims_split-up.R -- runs 100 trials per placement strategy and number of LCS "deployed", saving both the results from each trial and the average metrics across the 100 trials. Note: scripts to run these in parallel are in the On_cluster folder.
  2. Merging_results.R -- combines the results across the placement strategies and numbers of LCS "deployed", creating two results files: one weighted by population density and one unweighted.
  3. Make_table1.R -- generates tables summarizing the effects of different types and amounts of sensor measurement error when LCS are placed (a) at all PurpleAir locations and (b) at all schools in California.
  4. Plot_multiple_MEs.R -- generates plots for manuscript (can easily run locally after transferring results files from cluster).
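
The first step above (repeated trials per placement strategy and sensor count, keeping both per-trial results and their averages) could be organized roughly as follows. This is a Python sketch under assumptions, not the repository's R code: `run_experiment`, `trial_fn`, and the result keys are hypothetical, and `trial_fn` stands in for whatever function performs a single simulated deployment.

```python
import random
from statistics import mean


def run_experiment(trial_fn, strategies, sensor_counts, n_trials=100, seed=0):
    """Run n_trials per (strategy, sensor count) and average each metric.

    trial_fn(strategy, n_sensors, rng) must return a dict of numeric metrics.
    """
    rng = random.Random(seed)  # one seeded generator keeps runs reproducible
    per_trial = []
    averages = {}
    for strategy in strategies:
        for n_sensors in sensor_counts:
            results = [trial_fn(strategy, n_sensors, rng) for _ in range(n_trials)]
            # Keep every trial so the spread across trials can be inspected later.
            for t, metrics in enumerate(results):
                per_trial.append(
                    {"strategy": strategy, "n_sensors": n_sensors, "trial": t, **metrics}
                )
            # Average each metric across the trials of this configuration.
            averages[(strategy, n_sensors)] = {
                key: mean(r[key] for r in results) for key in results[0]
            }
    return per_trial, averages
```

Each (strategy, sensor count) pair is independent of the others, which is what makes it straightforward to split the work across cluster jobs.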

Additionally, the Summarize_MEs.R script can be used to calculate the standard deviation (weighted and unweighted by population density) of sensor measurement error at all PurpleAir sensor locations when the error is simulated differentially, either by drawing from a Normal distribution with mean zero and a standard deviation of 10% or 25% of "true" PM2.5, or by drawing from EPA calibration residuals associated with the same decile of "true" PM2.5.
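
To make the two differential error models and the weighted standard deviation concrete, here is a small Python sketch. All names are hypothetical and the details are assumptions rather than the contents of Summarize_MEs.R: the weights are treated as frequency-style weights with no small-sample correction, and the decile cut points and residual pools are assumed to be supplied by the caller.

```python
import bisect
import math
import random


def simulate_normal_errors(true_pm25, frac, rng):
    """Draw one error per location from Normal(0, (frac * true PM2.5)^2)."""
    return [rng.gauss(0.0, frac * x) for x in true_pm25]


def simulate_residual_errors(true_pm25, cut_points, residual_pools, rng):
    """Draw each error from the residual pool matching the value's bin.

    cut_points: sorted interior bin edges (9 edges for deciles).
    residual_pools: one list of calibration residuals per bin.
    """
    return [
        rng.choice(residual_pools[bisect.bisect_right(cut_points, x)])
        for x in true_pm25
    ]


def weighted_sd(values, weights):
    """Standard deviation with each value weighted, e.g. by population density."""
    total = sum(weights)
    mu = sum(w * v for v, w in zip(values, weights)) / total
    var = sum(w * (v - mu) ** 2 for v, w in zip(values, weights)) / total
    return math.sqrt(var)
```

Passing equal weights to `weighted_sd` gives the unweighted (population) standard deviation, so one function covers both summaries.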

All scripts used to generate figures and tables are in the subfolder Generate-Figures-Tables-Info.

Obtaining Data (chronologically, prior to Analysis)

Scripts used to download and process the data sets upon which the simulations are based:

  • EPA_AQS.R -- processes AQS monitoring data from California, setting a few negative values to zero and only keeping daily averages from days with 18 or more hours observed. The AQS summary files can be found on the EPA website. We used the PM2.5 88101 and 88502 summary files from 2020.
  • Di et al. PM2.5 exposure estimates: these data can now be accessed here; however, they are in a slightly different format than what we originally received from the authors and used in our code.
    • QD_locations.R -- identifies which grid points in each file created by Di et al. are in California using a spatial overlay.
    • QD_get_CA.R -- cycles through the daily PM2.5 Di et al. files and extracts the measurements for grid points in California.
    • Combining_Di_data.R -- combines the daily PM2.5 estimates for California into one file (to be read in all at once).
  • Get_Nearest_PA_locations.R -- identifies PurpleAir sensors which are located within 50 meters of an AQS reference monitor in California. Uses a list of PurpleAir sensors located outdoors, which is obtained in PA_historical_data.ipynb (below). All PurpleAir locations (indoor + outdoor) were obtained from their website on 4/16/21.
  • PA_historical_data.ipynb -- uses a wrapper module for the PurpleAir API to download data from outdoor PurpleAir sensors in California. As currently written, the user must specify "parent" or "child" (in the places indicated in the script) to obtain data from PurpleAir channels A or B, respectively. PurpleAir data (from 2020) were obtained via the API on 1/11/22 (channel A) and 1/23/22 (channel B).
  • School-Locs.R -- extracts the locations of public schools in California from a national shapefile, accessible here.
  • Road_lengths.R -- calculates lengths of major roads/highways within 50, 100, 250, and 500 meters (circular buffers) of each grid point in California, using the National Highway Planning Network shapefile, which can be accessed here.
  • Download-Census-ACS-data.R -- downloads and combines sociodemographic variables at the Census block group and Census tract levels with a shapefile of all the block groups, which contains general information such as population density. The script uses the package tidycensus. The variables downloaded can be changed in Census_variables.yml or Census_variables_tracts.yml.
  • Merge_CA.R -- combines static information (locations of monitors and sensors, sociodemographic info, etc.) to use in the simulations. The CalEnviroScreen (CES) data can be downloaded here. We used CES 3.0 in this analysis; now, CES 4.0 is available.
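
Two of the processing rules in the list above (the AQS filter and the 50-meter collocation check) can be sketched in a few lines of Python. This is an illustration under assumptions, not the code in EPA_AQS.R or Get_Nearest_PA_locations.R: the record keys (`mean`, `n_obs`, `id`, `lat`, `lon`) are hypothetical, and distances use the haversine formula on a spherical Earth rather than whatever spatial method the R scripts use.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, spherical approximation


def clean_daily_records(records, min_hours=18):
    """Keep days with enough hourly observations; set negative means to zero."""
    return [
        {**r, "mean": max(r["mean"], 0.0)}
        for r in records
        if r["n_obs"] >= min_hours
    ]


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def collocated_pairs(sensors, monitors, max_m=50.0):
    """Return (sensor id, monitor id) pairs that lie within max_m meters."""
    return [
        (s["id"], m["id"])
        for s in sensors
        for m in monitors
        if haversine_m(s["lat"], s["lon"], m["lat"], m["lon"]) <= max_m
    ]
```

The all-pairs loop is fine for a few hundred monitors; a spatial index would be the natural replacement for larger inputs.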

Questions?

Contact Ellen Considine, ellen_considine@g.harvard.edu


License: MIT License

