chopin: Computation for Climate and Health research On Parallelized INfrastructure
Objective and target users
Objective
This package automates
parallelization
in spatial operations with chopin functions as well as
sf/terra
functions. With GDAL-compatible files and
database tables, chopin functions help to calculate spatial
variables from vector and raster data with no external software
requirements.
For whom chopin is useful
Following user groups will find this package useful to accelerate
the covariate calculation process for further analysis and modeling:
Environmental health researchers and data analysts
Health geographers and spatial epidemiologists
Spatial analysts who need to perform geospatial operations with
large datasets
We assume that users–
Can run R functions following relevant instructions;
Understood and planned what they want to calculate; and
Collected datasets they need
Notes on data restrictions
This package works best with two-dimensional (planar)
geometries. Users should disable s2 spherical geometry mode in
sf by setting. Running any chopin functions at spherical or
three-dimensional (e.g., including M/Z dimensions) geometries may
produce incorrect or unexpected results.
sf::sf_use_s2(FALSE)
Basic design
Processing functions accept
sf/terra’s
classes for spatial data. Raster-vector overlay is done with
exactextractr.
From version 0.3.0, this package supports three basic functions that
are readily parallelized over multithread environments:
extract_at: extract raster values with point buffers or
polygons.
extract_at_buffer: extract raster values at circular
buffers; kernel weight can be applied
summarize_aw: area-weighted covariates based on target and
reference polygons
When processing points/polygons in parallel, the entire study area
will be divided into partly overlapped grids or processed through
its own hierarchy. We suggest two flowcharts to help which function
to use for parallel processing below. The upper flowchart is
raster-oriented and the lower one is vector-oriented. They are
separated but supplementary to each other. When a user follows the
raster-oriented one, they might visit the vector-oriented flowchart
at each end of the raster-oriented flowchart.
par_grid: parallelize over artificial grid polygons that are
generated from the maximum extent of inputs. par_make_gridset
is used to generate the grid polygons before running this
function.
par_hierarchy: parallelize over hierarchy coded in identifier
fields (for example, census blocks in each county in the US)
par_multirasters: parallelize over multiple raster files
These functions are designed to be used with future and doFuture
packages to parallelize over multiple CPU threads. Users can choose
the number of threads to be used in the parallelization process.
Users always need to register parallel workers with future and
doFuture before running the three functions above.
doFuture::registerDoFuture()
future::plan(future::multicore, workers=4L)
# future::multisession, future::cluster are available,# See future.batchtools and future.callr for other options# the number of workers are up to users' choice
To run the examples
RStudio: download and open this document then press “Run All Chunks
Above”, “Run All Chunks Below”, or “Restart R and Run All Chunks”,
whichever is appropriate.
Visual Studio Code (with R extension): download and open this
document then press “Run Above” at the last code chunk.
If you prefer command line (i.e., in Unix-like operating systems),
run:
git clone https://github.com/Spatiotemporal-Exposures-and-Toxicology/chopin
cd chopin
Rscript -e \
"knitr::purl(\"README.Rmd\", \"README_run.r\")source(\"README_run.r\")"
Installation
chopin can be installed using remotes::install_github (also
possible with pak::pak or devtools::install_github).
Examples will navigate par_grid, par_hierarchy, and
par_multirasters functions in chopin to parallelize geospatial
operations.
# check and install packages to run examplespkgs<- c("chopin", "dplyr", "sf", "terra",
"future", "future.apply", "doFuture", "testthat")
# install packages if anything is unavailablerlang::check_installed(pkgs)
# disable spherical geometries# it does the same as sf::sf_use_s2(FALSE)
options(sf_use_s2=FALSE)
# parallelization-safe random number generator
set.seed(2024, kind="L'Ecuyer-CMRG")
par_grid: parallelize over artificial grid polygons
Please refer to a small example below for extracting mean altitude
values at circular point buffers and census tracts in North
Carolina.
Before running code chunks below, set the cloned chopin repository
as your working directory with setwd()
We use an elevation dataset with and a moderate spatial resolution
(approximately 400 meters or 0.25 miles).
# data preparationwdir<- system.file("extdata", package="chopin")
path_srtm<- file.path(wdir, "nc_srtm15_otm.rds")
# terra SpatRaster objects are wrapped when exported to rds filesrtm<-terra::unwrap(readRDS(path_srtm))
terra::crs(srtm) <-"EPSG:5070"srtm#> class : SpatRaster #> dimensions : 1534, 2281, 1 (nrow, ncol, nlyr)#> resolution : 391.5026, 391.5026 (x, y)#> extent : 1012872, 1905890, 1219961, 1820526 (xmin, xmax, ymin, ymax)#> coord. ref. : NAD83 / Conus Albers (EPSG:5070) #> source(s) : memory#> name : file928c3830468b #> min value : -3589.291 #> max value : 1946.400
terra::plot(srtm)
ncpoints_tr<-terra::vect(ncpoints)
system.time(
ncpoints_srtm<-chopin::extract_at(
vector=ncpoints_tr,
raster=srtm,
id="pid",
mode="buffer",
radius=1e4L# 10,000 meters (10 km)
)
)
#> user system elapsed #> 9.925 0.197 10.149
Generate regular grid computational regions
chopin::par_make_gridset takes a spatial dataset to generate
regular grid polygons with nx and ny arguments with padding.
Users will have both overlapping (by the degree of radius) and
non-overlapping grids, both of which will be utilized to split
locations and target datasets into sub-datasets for efficient
processing.
compregions is a list object with two elements named original
(non-overlapping grid polygons) and padded (overlapping by
padding). The figures below illustrate the grid polygons with and
without overlaps.
Using the grid polygons, we distribute the task of averaging
elevations at 10,000 circular buffer polygons, which are generated
from the random locations, with 10 kilometers radius by
chopin::par_grid.
Users always need to register multiple CPU threads (logical
cores) for parallelization.
chopin::par_* functions are flexible in terms of supporting
generic spatial operations in sf and terra, especially where two
datasets are involved.
Users can inject generic functions’ arguments (parameters) by
writing them in the ellipsis (...) arguments, as demonstrated
below:
future::plan(future::multicore, workers=4L)
system.time(
ncpoints_srtm_mthr<-chopin::par_grid(
grids=compregions,
grid_target_id=NULL,
fun_dist=chopin::extract_at,
vector=ncpoints_tr,
raster=srtm,
id="pid",
mode="buffer",
radius=1e4L
)
)
#> Your input function was successfully run at CGRIDID: 4#> Your input function was successfully run at CGRIDID: 5#> Warning: [buffer] empty SpatVector#> Warning: [buffer] empty SpatVector#> Warning: [buffer] empty SpatVector#> Your input function was successfully run at CGRIDID: 6#> Your input function was successfully run at CGRIDID: 7#> Your input function was successfully run at CGRIDID: 8#> Your input function was successfully run at CGRIDID: 10#> Warning: [buffer] empty SpatVector#> Your input function was successfully run at CGRIDID: 11#> Your input function was successfully run at CGRIDID: 12#> Your input function was successfully run at CGRIDID: 13#> Your input function was successfully run at CGRIDID: 14#> Your input function was successfully run at CGRIDID: 15#> Your input function was successfully run at CGRIDID: 16#> Your input function was successfully run at CGRIDID: 17#> Your input function was successfully run at CGRIDID: 18#> Your input function was successfully run at CGRIDID: 19#> Your input function was successfully run at CGRIDID: 20#> Your input function was successfully run at CGRIDID: 21#> Your input function was successfully run at CGRIDID: 22#> Your input function was successfully run at CGRIDID: 23#> Your input function was successfully run at CGRIDID: 24#> Your input function was successfully run at CGRIDID: 25#> Your input function was successfully run at CGRIDID: 26#> Your input function was successfully run at CGRIDID: 27#> Your input function was successfully run at CGRIDID: 28#> Your input function was successfully run at CGRIDID: 29#> Your input function was successfully run at CGRIDID: 30#> Your input function was successfully run at CGRIDID: 31#> Your input function was successfully run at CGRIDID: 32#> Your input function was successfully run at CGRIDID: 33#> Your input function was successfully run at CGRIDID: 34#> Warning: [buffer] empty SpatVector#> Your input function was successfully run at CGRIDID: 37#> Your input function was successfully run at CGRIDID: 38#> Your input function was successfully run at CGRIDID: 39#> Warning: [buffer] empty SpatVector#> Warning: [buffer] empty SpatVector#> user system elapsed #> 8.418 1.567 3.447
colnames(ncpoints_srtm_mthr)[2] <-"mean_par"ncpoints_compar<- merge(ncpoints_srtm, ncpoints_srtm_mthr)
# Are the calculations equal?
all.equal(ncpoints_compar$mean, ncpoints_compar$mean_par)
#> [1] TRUE
chopin::par_hierarchy: parallelize geospatial computations using intrinsic data hierarchy
In real world datasets, we usually have nested/exhaustive
hierarchies. For example, land is organized by
administrative/jurisdictional borders where multiple levels exist.
In the U.S. context, a state consists of several counties, counties
are split into census tracts, and they have a group of block groups.
chopin::par_hierarchy leverages such hierarchies to parallelize
geospatial operations, which means that a group of lower-level
geographic units in a higher-level geography is assigned to a
process.
A demonstration below shows that census tracts are grouped by their
counties then each county will be processed in a CPU thread.
Read data
path_nchrchy<- file.path(wdir, "nc_hierarchy.gpkg")
nc_data<-path_nchrchync_county<-sf::st_read(nc_data, layer="county")
#> Reading layer `county' from data source #> `/tmp/RtmpdYJu7A/temp_libpath16ff5d6da0d43d/chopin/extdata/nc_hierarchy.gpkg' #> using driver `GPKG'#> Simple feature collection with 100 features and 1 field#> Geometry type: POLYGON#> Dimension: XY#> Bounding box: xmin: 1054155 ymin: 1341756 xmax: 1838923 ymax: 1690176#> Projected CRS: NAD83 / Conus Albers
nc_tracts<-sf::st_read(nc_data, layer="tracts")
#> Reading layer `tracts' from data source #> `/tmp/RtmpdYJu7A/temp_libpath16ff5d6da0d43d/chopin/extdata/nc_hierarchy.gpkg' #> using driver `GPKG'#> Simple feature collection with 2672 features and 1 field#> Geometry type: MULTIPOLYGON#> Dimension: XY#> Bounding box: xmin: 1054155 ymin: 1341756 xmax: 1838923 ymax: 1690176#> Projected CRS: NAD83 / Conus Albers
par_multirasters: parallelize over multiple rasters
There is a common case of having a large group of raster files at
which the same operation should be performed.
chopin::par_multirasters is for such cases. An example below
demonstrates where we have five elevation raster files to calculate
the average elevation at counties in North Carolina.
The figure below shows the padded grids (50 kilometers), primary
roads, and points. Primary roads will be selected by a padded grid
per iteration and used to calculate the distance from each point to
the nearest primary road. Padded grids and their overlapping areas
will look different according to padding argument in
chopin::par_make_gridset.
# terra::nearest run
system.time(
restr<-terra::nearest(x=pnts, y=rd1)
)
#> user system elapsed #> 0.462 0.003 0.466
# we use four threads that were configured above
system.time(
res<-chopin::par_grid(
grids=nccompreg,
fun_dist=terra::nearest,
x=pnts,
y=rd1
)
)
#> Your input function was successfully run at CGRIDID: 1#> Your input function was successfully run at CGRIDID: 2#> Your input function was successfully run at CGRIDID: 3#> Your input function was successfully run at CGRIDID: 4#> Your input function was successfully run at CGRIDID: 5#> Your input function was successfully run at CGRIDID: 6#> Your input function was successfully run at CGRIDID: 7#> Your input function was successfully run at CGRIDID: 8#> user system elapsed #> 0.776 0.562 0.521
We will compare the results from the single-thread and multi-thread
calculation.
Users should be mindful of potential caveats in the parallelization
of nearest feature search, which may result in no or excess distance
depending on the distribution of the target dataset to which the
nearest feature is searched.
For example, when one wants to calculate the nearest interstate
from rural homes with fine grids, some grids may have no
interstates then homes in such grids will not get any distance
to the nearest interstate.
Such problems can be avoided by choosing nx, ny, and
padding values in par_make_gridset meticulously.
Why parallelization is slower than the ordinary function run?
Parallelization may underperform when the datasets are too small to
take advantage of divide-and-compute approach, where parallelization
overhead is involved. Overhead here refers to the required amount of
computational resources for transferring objects to multiple
processes.
Since the demonstrations above use quite small datasets, the
advantage of parallelization was not as noticeable as it was
expected. Should a large amount of data (spatial/temporal resolution
or number of files, for example) be processed, users could see the
efficiency of this package. Please refer to a
vignette
in this package for the demonstration of various climate/weather
datasets.
Last edited: March 22, 2024
About
Scalable GIS methods for environmental and climate data analysis