LIFE Healthy Forest

Benchmarking classifiers (SVM, RF, XGBoost) on four different pathogens:

Armillaria
Diplodia
Heterobasidion
Fusarium

Hyperparameter tuning:

Sequential model-based optimization (SMBO)
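As a rough illustration of the SMBO idea (not the tuning code used in this project), the loop below fits a cheap surrogate model to all hyperparameter evaluations made so far and uses it to choose the next value to evaluate. The objective function, the candidate grid and the quadratic surrogate are assumptions made purely for this sketch. Real SMBO implementations typically use a probabilistic surrogate and an acquisition function such as expected improvement, whereas this toy version simply picks the surrogate's predicted minimum.

```r
# Toy objective standing in for an expensive cross-validated error (hypothetical).
objective <- function(x) (x - 2)^2 + 1

# Initial design: a few hyperparameter values that have already been evaluated.
design <- data.frame(x = c(-4, 0, 5))
design$y <- objective(design$x)

# Candidate hyperparameter values the optimizer may propose.
candidates <- seq(-5, 5, by = 0.1)

for (i in 1:5) {
  # Surrogate: quadratic regression fitted to all evaluations so far.
  surrogate <- lm(y ~ poly(x, 2, raw = TRUE), data = design)
  # Propose the not-yet-evaluated candidate with the lowest predicted error.
  open <- setdiff(candidates, design$x)
  pred <- predict(surrogate, newdata = data.frame(x = open))
  x_next <- open[which.min(pred)]
  # Evaluate the true objective there and add the result to the design.
  design <- rbind(design, data.frame(x = x_next, y = objective(x_next)))
}

# Best hyperparameter value found within the evaluation budget.
best <- design[which.min(design$y), ]
```

The point of the pattern is that the expensive objective is evaluated only at points the surrogate considers promising, instead of over the whole grid.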

Data

The data is stored at Mendeley Data and is downloaded and processed when the project is executed. The local storage directory is “./data”. Once the intermediate R objects have been created, the downloaded data is deleted again: it serves only as a starting point, and removing it keeps the directory small. Only raster files are kept, because raster processing requires a file stored on disk.
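The clean-up step described above could look roughly like the following sketch. The helper name clean_data_dir() and the list of raster file extensions are assumptions made for illustration; the project's own scripts may handle this differently.

```r
# Hypothetical helper: delete everything in `data_dir` except raster files.
# The extensions in `raster_ext` are an assumption, adjust them as needed.
clean_data_dir <- function(data_dir = "data", raster_ext = c("tif", "tiff")) {
  files <- list.files(data_dir, recursive = TRUE, full.names = TRUE)
  is_raster <- tolower(tools::file_ext(files)) %in% raster_ext
  to_remove <- files[!is_raster]
  unlink(to_remove)
  # Return the removed paths invisibly so the caller can log them.
  invisible(to_remove)
}
```

Calling clean_data_dir("data") after the intermediate objects have been built would leave only the raster files on disk.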

Workflow

This project is set up with a drake workflow, which ensures reproducibility. Intermediate targets/objects are stored in a hidden .drake directory.

The R library of this project is managed by packrat, which ensures that the exact same package versions are used when recreating the project. Calling packrat::restore() installs all required packages in their specific versions.

Please note that this project was built with R version 3.5.1 on a Debian 9 operating system. The packrat packages from this project are not compatible with R versions prior to 3.5.0. For reproducibility, we recommend replicating the analysis using the included Dockerfile. Instructions can be found here. However, it should be possible to reproduce the analysis on any other operating system.
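Before restoring the package library, it may help to confirm that the local R installation meets the version requirement stated above. The following is a minimal sketch of such a check; the variable names are illustrative and the check is not part of the project's scripts.

```r
# Minimum R version required by the packrat library of this project.
required <- "3.5.0"

# getRversion() returns a version object that can be compared to a string.
r_ok <- getRversion() >= required
if (!r_ok) {
  stop("R >= ", required, " is required, but this session runs R ",
       as.character(getRversion()))
}

# packrat must be available before packrat::restore() can be called.
if (!requireNamespace("packrat", quietly = TRUE)) {
  message("packrat is not installed; install it with install.packages(\"packrat\")")
}
```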

To clone the project, a working installation of git is required. Open a terminal in the directory of your choice and execute:

git clone https://venus.geogr.uni-jena.de/bi28yuv/pathogen-modelling

Then open an R session in the cloned directory and run:

packrat::restore()
source("scripts/drake.R")
make(plan, keep_going = TRUE, console_log_file = stdout())
# use more cores with make(plan, jobs = <number of cores>)

Runtime

Predicted total runtime

source("scripts/drake.R")
predict_runtime(drake_config(), from_scratch = TRUE, targets_only = TRUE)

## Warning: Some targets were never actually timed, And no hypothetical time was specified in `known_times`. Assuming a runtime of 0 for these targets:
##   benchmark_evaluation_report
##   bm_kknn
##   bm_rf
##   bm_svm
##   bm_xgboost
##   bm_brt
##   bm_gam_diplodia
##   bm_gam_fusarium
##   bm_gam_armillaria
##   bm_gam_heterobasidion
##   prediction_rf
##   prediction_svm
##   prediction_xgboost
##   prediction_kknn
##   prediction_glm
##   prediction_gam_diplodia
##   prediction_gam_fusarium
##   prediction_gam_armillaria
##   prediction_gam_heterobasidion

## [1] "6951.184s (~1.93 hours)"

Acceleration by parallelization of make() call

time <- c()
for (jobs in 1:10){
  time[jobs] <- predict_runtime(
    drake_config(),
    jobs = jobs,
    from_scratch = TRUE,
    known_times = build_times(targets_only = TRUE)$elapsed
  )
}

library(ggplot2)
ggplot(data.frame(time = time / 3600, jobs = ordered(1:10), group = 1)) +
  geom_line(aes(x = jobs, y = time, group = group)) +
  scale_y_continuous(breaks = 0:10 * 4, limits = c(0, 29)) +
  ggpubr::theme_pubr() +
  xlab("jobs argument of make()") +
  ylab("Predicted runtime of make() (hours)")
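To turn the predicted runtimes into a concrete choice for the jobs argument, one option is to take the smallest number of jobs whose predicted runtime is within a small tolerance of the best one. The helper pick_jobs() and its 5% default tolerance are assumptions made for this sketch, not part of drake or of this project.

```r
# Hypothetical helper: smallest number of jobs whose predicted runtime is
# within `tolerance` (relative) of the fastest predicted runtime.
# `time` is a numeric vector where element i is the runtime for jobs = i.
pick_jobs <- function(time, tolerance = 0.05) {
  stopifnot(is.numeric(time), length(time) > 0, !anyNA(time))
  which(time <= min(time) * (1 + tolerance))[1]
}

# Usage with the `time` vector computed in the loop above:
# pick_jobs(time)
```

Adding more jobs beyond this point would occupy extra cores without a noticeable reduction in predicted runtime.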

Docker

A Dockerfile is available in docker/. It was generated by the R package containerit and contains all packrat packages and system libraries which have been used to run the analysis. The code below shows how it was created; since the file already exists, there is no need to run it again.

remotes::install_github("pat-s/containerit@packrat")
library(containerit)
container = dockerfile(".", packrat = TRUE)
write(container, "docker/Dockerfile")

A Docker image can be built from this Dockerfile by executing docker build -t image . within the ./docker directory; a container can then be started from that image.

Next, the analysis can be started by calling

source("scripts/drake.R")
make(plan, keep_going = TRUE, console_log_file = stdout())

Dependency graphs

The dependency graph of this analysis (subjective grouping) can be visualized with the following code.

Note that intermediate files have been gathered into the following categories:

Data
Task
Learner
mlr_settings
benchmark
prediction

# vis_drake_graph() produces a visNetwork graph, not a ggplot object,
# so no ggplot2 theme is added to it.
vis_drake_graph(drake_config(), group = "stage", clusters = c("data", "prediction",
                                                              "mlr_settings"),
                targets_only = TRUE, show_output_files = FALSE,
                navigationButtons = FALSE, selfcontained = TRUE,
                file = "drake.png")

If all intermediate objects should be visualized (not recommended):

vis_drake_graph(drake_config())
