Alven8816 / DEML_PM2.5_estimation

Shared materials


DEML PM2.5 Estimation Presentation (2023-04-21)

This is the sharing material for a presentation at Jinan University.

R code demonstration

Check that the required packages are installed first:

#install.packages("pacman")
library(pacman)
p_load("devtools","SuperLearner","ranger","CAST","caret","skimr","gbm","xgboost","hexbin")

Please install the 'deeper' R package:

Installation

You can install the development version of deeper from GitHub with:

library(devtools)
install_github("Alven8816/deeper")

Steps

Our newly built R package mainly includes three steps:

* Step 1: establish the base models

Use predictModel() or predictModel_parallel() to establish the base models. The tuningModel() function can be used to tune the parameters and find the best single base model.

* Step 2: stack the meta models

We use the stack_ensemble(), stack_ensemble.fit(), or stack_ensemble_parallel() function to stack the meta models and obtain a DEML model.

* Step 3: prediction based on the new data set

After the DEML model is established, predict() can be used to predict on an unseen data set.

To assess the performance of the models, assess.plot() can be used to compare the original observations (y in the test set) with the predictions. assess.plot() also returns a scatter plot of the points.

The other functions, including CV.predictModel() (adapted from SuperLearner), CV.predictModel_parallel(), and CV.stack_ensemble_parallel(), can be used to conduct external cross-validation and assess the base models and DEML models on each fold; a sketch follows below.
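As a hedged illustration, external cross-validation might be called as below. This is only a sketch under assumptions: Y_train and X_train are placeholder names, and CV.predictModel() is assumed to mirror the Y/X/base_model/cvControl interface of predictModel(); check ?CV.predictModel for the exact signature.

# Sketch only: argument layout assumed to match predictModel();
# Y_train/X_train are placeholders for the training outcome/predictors
cv_base <- CV.predictModel(
  Y = Y_train,
  X = X_train,
  base_model = c("SL.xgboost", "SL.ranger"),
  cvControl = list(V = 5)
)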

Note: DEML cannot directly handle missing values; missing-value imputation is recommended before using the DEML model (see the sketch below).
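Because DEML needs complete data, here is a minimal base-R sketch of median imputation; dedicated imputation packages (e.g. mice or missForest) may be preferable in practice.

# Replace NA in each numeric column with that column's median
impute_median <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
    }
  }
  df
}

# e.g. run impute_median() on your data before model building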


Example

This is a basic example that shows how to use deeper:

1. Data preparation

Load the example data

library(deeper)

## obtain the example data
data("envir_example")

#knitr::kable(envir_example[1:6,])
summary(envir_example)

Separate the data into training and testing datasets

80% of the data is used for training and the remainder for testing.
set.seed(1234)
size <-
  caret::createDataPartition(y = envir_example$PM2.5, p = 0.8, list = FALSE)
trainset <- envir_example[size, ]
testset <- envir_example[-size, ]
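As an optional sanity check, confirm the split sizes:

nrow(trainset) # about 80% of the rows
nrow(testset)  # the remaining rows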

Identify the dependent and independent variables

y <- c("PM2.5")
x <- colnames(envir_example[-c(1, 6)]) # except "date" and "PM2.5"

2. The models that can be used

# we need to select the algorithms in advance
SuperLearner::listWrappers()
data(model_list) # get more details about the algorithms

In the 'type' column, "R" means the algorithm can be used for regression or classification; "N" means it can be used for regression but the variables are required to be numeric; "C" means it can only be used for classification. If your predictors include factor columns, see the sketch below for converting them to numeric.
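If any predictors are factor columns, the "N"-type learners need them converted to numeric first. A minimal base-R sketch using dummy (one-hot) coding:

# model.matrix() expands factor columns into numeric dummy variables;
# only needed when the predictors are not already numeric
X_num <- as.data.frame(model.matrix(~ . - 1, data = envir_example[, x]))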

3. Model Building

3.1 Tuning the parameters of a base model

# method 1: using the deeper package's tuningModel() function
ranger <-
  tuningModel(
    basemodel  = 'SL.ranger',
    params = list(num.trees = 100),
    tune = list(mtry = c(1, 3, 7))
  )

3.2 Establish DEML model

3.2.1 Training the (base) ensemble model

# train the base models on the training set

pred_m2 <-
  predictModel(
    Y = trainset[, y],
    X = trainset[, x],
    base_model = c("SL.xgboost", ranger),
    cvControl = list(V = 5)
  )

# predict the new dataset
pred_m2_new <- deeper::predict(object = pred_m2, newX = testset[, x])

# the base model prediction
head(pred_m2_new$pre_base$library.predict)

The results show the weight, R2, and root-mean-square error (RMSE) of each model. "ranger_1", "ranger_2", and "ranger_3" denote the random forest models with mtry = 1, 3, and 7, respectively.

The results show that mtry = 3 gives the best RF model for this prediction. This method can be used to tune suitable parameters for the algorithms.
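As an optional check, each base learner's test-set predictions can also be scored directly. This assumes pre_base$library.predict is a matrix with one column per base model, as the head() output above suggests:

# per-base-model R2 on the test set
apply(pred_m2_new$pre_base$library.predict, 2,
      function(p) caret::R2(pred = p, obs = testset[, y]))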

3.2.2 Training the (base) ensemble model with parallel computing

## conduct the spatial CV

# Create a list with 7 elements (one per fold); each element contains the row indices of the validation set for that fold

indices <-
  CAST::CreateSpacetimeFolds(trainset, spacevar = "code", k = 7)

# Rows of validation set on each fold

v_raw <- indices$indexOut

names(v_raw) <- seq_len(7)

pred_m3 <- predictModel_parallel(
  Y = trainset[, y],
  X = trainset[, x],
  base_model = c("SL.xgboost", ranger),
  cvControl = list(V = length(v_raw), validRows = v_raw),
  #number_cores = 4,
  seed = 1
)
## when number_cores is missing, the user will be prompted to set one based on the operating system.

# pred_m3 <- predictModel_parallel(
#     Y = trainset[,y],
#     X = trainset[,x],
#     base_model = c("SL.xgboost",ranger),
#     cvControl = list(V = length(v_raw), validRows = v_raw),
#     seed = 1
#   )

#You have 8 cpu cores, How many cpu core you want to use:
# type the number to continue the process.


# prediction
pred_m3_new <- deeper::predict(object = pred_m3, newX = testset[, x])

head(pred_m3_new$pre_base$library.predict)
# test dataset performance
test_p1 <-
  assess.plot(pre = pred_m3_new$pre_base$pred, obs = testset[, y])

print(test_p1$plot)

3.3 Stacked meta models

Our DEML model includes several meta-models built on the trained base models' results. Unlike other ensemble (stacking) methods, the DEML model allows selecting several meta-models, running them in parallel, and ensembling all of them with optimal weights. (This step is optional.)

# include the original features as meta-model inputs
pred_stack_10 <-
  stack_ensemble_parallel(
    object = pred_m3,
    Y = trainset[, y],
    meta_model = c("SL.ranger", "SL.xgboost", "SL.glmnet"),
    original_feature = TRUE,
    X = trainset[, x],
    number_cores = 4
  )

pred_stack_10_new <-
  deeper::predict(object = pred_stack_10, newX = testset[, x])

caret::R2(pred = pred_stack_10_new$pre_meta$pred, obs = testset$PM2.5)
caret::RMSE(pred = pred_stack_10_new$pre_meta$pred, obs = testset$PM2.5)

4. Assess and plot the results

test_p2 <-
  assess.plot(pre = pred_stack_10_new$pre_meta$pred, obs = testset[, y])

print(test_p2$plot)

Citation

Wenhua Yu, Shanshan Li, Tingting Ye, Rongbin Xu, Jiangning Song, Yuming Guo (2022). Deep ensemble machine learning framework for the estimation of PM2.5 concentrations. Environmental Health Perspectives. https://doi.org/10.1289/EHP9752
