Advanced features of R, things I've learned, notes, templates, and benchmarks. In this course I include things I learned during package development.
- Better string interpolation:
stringr::str_interp
. - Better loops: conclusion
lapply
is better, if you have big data usefuture.apply::future_lapply()
- Consider the shape of your data, if you have a lot of columns calculations will take a long time, parallelisation might be useful.
- If you must use a
for
loop, initialise its vector output:results <- vector(mode = "list", length = how_many_elements_do_you_expect)
- Use this function for printing in parallel code:
messageParallel <- function(...) {system(sprintf('echo "%s"', paste0(..., collapse = "")))}
- Use
data.table
:data.table
is an extension ofdata.frame
they are compatible.- Reshape your data (longer/wider) with
data.table
'smelt()
anddcast()
- a lot faster. - Read/write
csv
files withdata.table::fread()
/data.table::fwrite()
.
- Matrices take up less space.
- S4 class slot type check happens before class initialisation.
Here is some code and syntax I found myself using and re-using.
Typically in R
people build strings this way using paste
.
how_long <- "very long"
paste('This is a', how_long, 'string...', sep = " ")
## [1] "This is a very long string..."
A more natural and JavaScript
way of building strings is using stringr
:
how_long <- "very long"
stringr::str_interp('This is a ${how_long} string...')
## [1] "This is a very long string..."
This allows for more natural syntax in the following way:
for (i in seq_along(iris[,-ncol(iris)])) {
print(stringr::str_interp('The mean for the column ${colnames(iris)[i]} is: ${round(mean(iris[,i]), 5)}'))
}
## [1] "The mean for the column Sepal.Length is: 5.84333"
## [1] "The mean for the column Sepal.Width is: 3.05733"
## [1] "The mean for the column Petal.Length is: 3.758"
## [1] "The mean for the column Petal.Width is: 1.19933"
If you come from other programming languages you might instinctively use a for
loop in this way.
separate_species <- vector(mode = "list", length = length(unique(iris$Species)))
for (i in seq_along(unique(iris$Species))) {
separate_species[[i]] <- iris[as.character(iris$Species) == as.character(unique(iris$Species)[i]), -ncol(iris)]
max_val <- max(separate_species[[i]])
print(max_val)
}
## [1] 5.8
## [1] 7
## [1] 7.9
This has the side effect of polluting our global environment with variables.
max_val
## [1] 7.9
i
## [1] 3
In we can use something much better, apply
type loops in conjunction with list
s. This allows us to manipulate our data inside anonymous functions, and use lapply
. The apply family of functions have a couple of advantages:
- These are optimised in
R
and run onC
code a lot faster thanfor
loops. - These are embarrassingly parallel problems; these can easily be swapped out for
future
versions and run in parallel -future.apply
package.
species_names <- as.character(unique(iris$Species))
names(species_names) <- species_names
apply_sep <- lapply(species_names, function(species_names) {
species <- iris[iris$Species == species_names, -ncol(iris)]
print(max(species))
return(species)
})
## [1] 5.8
## [1] 7
## [1] 7.9
Let's benchmark these two methods. Applies also work over data.frames
, use all applies for going over columns, and apply
with a margin of 1
for going over rows.
Let's start by simulating some data.
generated_data <- (function(offset_min, offset_max, num_cols, num_samples) {
generated_data <- list(
subject_a = replicate(n = num_samples, expr = runif(n = num_cols, min = 10, max = 15), simplify = "data.frame"),
subject_b = replicate(n = num_samples, expr = runif(n = num_cols, min = 10 + offset_min, max = 15 + offset_max), simplify = "data.frame")
)
generated_data <- mapply(function(data, group_name) { # this is a multivariate apply
data <- as.data.frame(t(data))
colnames(data) <- paste(rep("gene", ncol(data)), 1:ncol(data), sep = "_")
data$subject <- group_name
data$condition <- "control"
return(data[,c("subject", "condition", setdiff(colnames(data), c("subject", "condition")))])
}, generated_data, names(generated_data), SIMPLIFY = FALSE)
bound_data <- do.call("rbind", unname(generated_data))
bound_data$subject <- make.names(bound_data$subject, unique = TRUE)
bound_data[((nrow(bound_data) / 2) + 1):nrow(bound_data),]$condition <- "test"
return(bound_data)
})(100, 150, num_cols = 10000, num_samples = 40)
# let's save this for later
data.table::fwrite(generated_data, "./data/generated-data.csv")
We will generalise the above into a function:
generateData <- function(offset_min, offset_max, num_cols, num_samples) {
generated_data <- list(
subject_a = replicate(n = num_samples, expr = runif(n = num_cols, min = 10, max = 15), simplify = "data.frame"),
subject_b = replicate(n = num_samples, expr = runif(n = num_cols, min = 10 + offset_min, max = 15 + offset_max), simplify = "data.frame")
)
generated_data <- mapply(function(data, group_name) { # this is a multivariate apply
data <- as.data.frame(t(data))
colnames(data) <- paste(rep("gene", ncol(data)), 1:ncol(data), sep = "_")
data$subject <- group_name
data$condition <- "control"
return(data[,c("subject", "condition", setdiff(colnames(data), c("subject", "condition")))])
}, generated_data, names(generated_data), SIMPLIFY = FALSE)
bound_data <- do.call("rbind", unname(generated_data))
bound_data$subject <- make.names(bound_data$subject, unique = TRUE)
bound_data[((nrow(bound_data) / 2) + 1):nrow(bound_data),]$condition <- "test"
return(bound_data)
}
Let's do a simple t-test a relatively simple operation; here's a preview of our data:
head(generated_data[, 1:5])
## subject condition gene_1 gene_2 gene_3
## 1 subject_a control 13.59077 14.76026 11.60643
## 2 subject_a.1 control 12.98023 12.75922 11.05997
## 3 subject_a.2 control 12.48429 11.62539 13.36036
## 4 subject_a.3 control 10.34328 13.32511 10.81589
## 5 subject_a.4 control 11.94569 13.62541 13.40303
## 6 subject_a.5 control 11.58461 12.90442 10.59633
tail(generated_data[, 1:5])
## subject condition gene_1 gene_2 gene_3
## 75 subject_b.34 test 152.5395 132.5556 157.4977
## 76 subject_b.35 test 128.8987 148.8747 131.3119
## 77 subject_b.36 test 128.8214 121.0501 152.2529
## 78 subject_b.37 test 132.7050 121.8610 156.1513
## 79 subject_b.38 test 151.6955 125.5584 148.8270
## 80 subject_b.39 test 112.0891 155.2052 149.5747
dim(generated_data)
## [1] 80 10002
Let's start benchmarking with a dataset of size: columns 10 k, rows 40.
Now let's do our calculations over these data; we will use 4 different methods to loop the data.
- Uninitialised for loop; this is a loop which grows a result vector. What happens here is that memory is allocated for every n size vector created at each iteration - not efficient.
- Initialised for loop; here we predict how big our results vector will be and create it at that size from the begining.
- Apply
lapply
loop. No need to create a vector this is done for us from the input data. - Future
future_lapply
the same as above but this code is run in parallel. As you might notice this might not be a lot faster thanlapply
; future parallel code will beat sequential code when the datasets become very large; the job is split up across each core of your processor.
for_vs_lapply <- microbenchmark::microbenchmark(
for_loop_uninitialised = ({
results <- vector()
for (i in 1:(ncol(generated_data) - 2)) {
results[i] <- t.test(
generated_data[, -c(1, 2)][generated_data$condition == "control", i],
generated_data[, -c(1, 2)][generated_data$condition == "test", i],
var.equal = TRUE
)$p.value
}
}),
for_loop_initialised = ({
results <- vector(mode = "list", length = (ncol(generated_data) - 2))
for (i in 1:(ncol(generated_data) - 2)) {
results[i] <- t.test(
generated_data[, -c(1, 2)][generated_data$condition == "control", i],
generated_data[, -c(1, 2)][generated_data$condition == "test", i],
var.equal = TRUE
)$p.value
}
}),
lapply_loop = ({
lapply(generated_data[,-c(1, 2)], function(column) {
# print(data)
t.test(
column[generated_data$condition == "control"],
column[generated_data$condition == "test"],
var.equal = TRUE
)$p.value
})
}),
future_lapply = ({
future::plan(strategy = "multisession", workers = future::availableCores())
future.apply::future_lapply(generated_data[,-c(1, 2)], function(column) {
t.test(
column[generated_data$condition == "control"],
column[generated_data$condition == "test"],
var.equal = TRUE
)$p.value
})
}),
times = 3
)
ggplot2::autoplot(for_vs_lapply)
Consider the shape of your data, how many subjects how many observations? Now let's up the data size to 10k columns and 1.5k rows.
generated_data <- generateData(100, 150, num_cols = 7500, num_samples = 1500)
Again the same calculations as above a t-test.
for_vs_lapply <- microbenchmark::microbenchmark(
for_loop_uninitialised = ({
results <- vector()
for (i in 1:(ncol(generated_data) - 2)) {
results[i] <- t.test(
generated_data[, -c(1, 2)][generated_data$condition == "control", i],
generated_data[, -c(1, 2)][generated_data$condition == "test", i],
var.equal = TRUE
)$p.value
}
}),
for_loop_initialised = ({
results <- vector(mode = "list", length = (ncol(generated_data) - 2))
for (i in 1:(ncol(generated_data) - 2)) {
results[i] <- t.test(
generated_data[, -c(1, 2)][generated_data$condition == "control", i],
generated_data[, -c(1, 2)][generated_data$condition == "test", i],
var.equal = TRUE
)$p.value
}
}),
lapply_loop = ({
lapply(generated_data[,-c(1, 2)], function(column) {
# print(data)
t.test(
column[generated_data$condition == "control"],
column[generated_data$condition == "test"],
var.equal = TRUE
)$p.value
})
}),
future_lapply = ({
future::plan(strategy = "multisession", workers = future::availableCores())
future.apply::future_lapply(generated_data[,-c(1, 2)], function(column) {
t.test(
column[generated_data$condition == "control"],
column[generated_data$condition == "test"],
var.equal = TRUE
)$p.value
})
}),
times = 3
)
ggplot2::autoplot(for_vs_lapply)
Messaging back to the console is a challenge when using parallel code; use these functions to print to the console from parallel code:
messageParallel <- function(...) {
system(sprintf('echo "%s"', paste0(..., collapse = "")))
}
generated_data <- generateData(100, 150, num_cols = 5, num_samples = 10)
Here we message back from a normal apply
type function.
invisible(mapply(function(column, column_name) {
message(stringr::str_interp('We are on this column: ${column_name}'))
t.test(
column[generated_data$condition == "control"],
column[generated_data$condition == "test"],
var.equal = TRUE
)$p.value
}, generated_data[,-c(1, 2)], names(generated_data[,-c(1, 2)]), SIMPLIFY = FALSE))
## We are on this column: gene_1
## We are on this column: gene_2
## We are on this column: gene_3
## We are on this column: gene_4
## We are on this column: gene_5
As you can see using the message
function from paralle code doesn't work.
future::plan(strategy = "multisession", workers = future::availableCores())
invisible(future.apply::future_mapply(function(column, column_name) {
message(stringr::str_interp('We are on this column: ${column_name}'))
t.test(
column[generated_data$condition == "control"],
column[generated_data$condition == "test"],
var.equal = TRUE
)$p.value
}, generated_data[,-c(1, 2)], names(generated_data[,-c(1, 2)]), SIMPLIFY = FALSE))
Here we use the messageParallel
function we defined above:
future::plan(strategy = "multisession", workers = future::availableCores())
invisible(future.apply::future_mapply(function(column, column_name) {
messageParallel(stringr::str_interp('We are on this column: ${column_name}'))
t.test(
column[generated_data$condition == "control"],
column[generated_data$condition == "test"],
var.equal = TRUE
)$p.value
}, generated_data[,-c(1, 2)], names(generated_data[,-c(1, 2)]), SIMPLIFY = FALSE))
## We are on this column: gene_1
## We are on this column: gene_2
## We are on this column: gene_3
## We are on this column: gene_4
## We are on this column: gene_5
We should heavily use data.table
. It has been heavily optimised and written mostly in multi-threaded C
and C++
. For instructions on installing a multi-threaded version of data.table
you can check out my guide here: Makevars.
Here I demonstrate melting data; pivot from wide to long dataset and benchmark the different methods.
df_data <- as.data.frame(tidyr::relig_income)
tib_data <- tidyr::as_tibble(tidyr::relig_income)
DT_data <- data.table::as.data.table(tidyr::relig_income) # converts data which is a data.frame to data.table *by reference*
We will be using the following methods for benchmark:
data.table::melt
tidyr::pivot_longer
reshape2::melt
reshape::melt
data.table::melt(DT_data, id.vars = "religion")
tidyr::pivot_longer(tib_data, -religion)
reshape2::melt(df_data, id.vars = "religion")
reshape::melt(df_data, id.vars = "religion")
We will also try these in combination with different data types. Some of these methods cast the data to other types or pass them to other methods; specifically data.table
passes to reshape2
if it receives a data.frame
instead of a data.table
.
bench_pivoting <- microbenchmark::microbenchmark(
dt_longer = data.table::melt(DT_data, id.vars = "religion"),
dt_convert_longer = data.table::melt(data.table::as.data.table(df_data), id.vars = "religion"),
dt_df_longer = data.table::melt(df_data, id.vars = "religion"),
tidyr_longer = tidyr::pivot_longer(tib_data, -religion),
reshape2 = reshape2::melt(df_data, id.vars = "religion"),
reshape = reshape::melt(df_data, id.vars = "religion"),
times = 100
)
ggplot2::autoplot(bench_pivoting) +
ggplot2::labs(title = "Comparing pivot long format data 1000 iterations", subtitle = "data.table is memory efficient - variables by reference not copies")
We previously saved some data; let's try and read it back in. A lot of different options: read.csv
, read_csv
, vroom
, fread
.
reading_csv <- microbenchmark::microbenchmark(
base_read_csv = ({data <- read.csv("./data/generated-data.csv")}),
datatable_fread = ({data <- data.table::fread("./data/generated-data.csv")}),
readr_read_csv = ({data <- readr::read_csv("./data/generated-data.csv")}),
vroom_read_csv = ({data <- vroom::vroom("./data/generated-data.csv")}),
times = 3
)
ggplot2::autoplot(reading_csv)
Now let's write some data. As before data.table
comes out on top again.
read_in_data <- data.table::fread("./data/generated-data.csv")
writing_csv <- microbenchmark::microbenchmark(
base_write_csv = ({write.csv(read_in_data, "./data/write-test-generated-data.csv")}),
datatable_fwrite = ({data.table::fwrite(read_in_data, "./data/write-test-generated-data.csv")}),
times = 3
)
ggplot2::autoplot(writing_csv)
Matrices are better than data.frames. Consider the shape of your data, do you have a large number of columns?
Here in this example we have 200000 rows and 2 columns.
m <- matrix(1:400000, 200000, 2)
d <- data.frame(m)
object.size(m)
## 1600216 bytes
object.size(d)
## 1600848 bytes
dim(m)
## [1] 200000 2
In this next example we have 200000 columns, and 2 rows.
m <- matrix(1:400000, 2, 200000)
d <- data.frame(m)
object.size(m)
## 1600216 bytes
object.size(d)
## 25600608 bytes
dim(m)
## [1] 2 200000
In conclusion the more columns we have the larger the object is. This can be mitigated by using a matrix
rather than data.frame
.
I had a question whether class initialisation happens before or after validity check. Specifically I wanted to know if I can pass a list
type object and convert to a data.table
in the initialize
method.
Let's start with a data set; this is a list
of data.table
s.
DT <- data.table::data.table(iris)
DT[, row_name := 1:nrow(DT)]
ls <- list(DT[, c("row_name", "Sepal.Length", "Sepal.Width")], DT[, c("row_name", "Petal.Length", "Petal.Width")])
# merge(ls[[1]], ls[[2]], by = "row_name")
Let's create the class.
InitListDT <- setClass(
Class = "InitListDT",
slots = list(
list_to_dt = "data.table"
),
prototype = list(
list_to_dt = data.table::data.table()
)
)
setMethod("initialize", "InitListDT", function(.Object, ...) {
.Object <- callNextMethod(.Object, ...)
.Object@list_to_dt <- Reduce(function(...) {
merge(..., by = "row_name")
}, .Object@list_to_dt)
return(.Object)
})
InitListDT(list_to_dt = ls)
## Error in validObject(.Object): invalid class "InitListDT" object: invalid object for slot "list_to_dt" in class "InitListDT": got class "list", should be or extend class "data.table"
In my example shown above you can see this is not possible. The class slots are set and checked before they are passed over to the initialisation method. We can solve this by allow for a list
or data.table
type in this slot.
setClassUnion(
"list_OR_data.table",
members = c("list", "data.table")
)
## Warning: class "data.table" is defined (with package slot 'data.table') but no
## metadata object found to revise superClass information---not imported? Making a
## copy in package '.GlobalEnv'
InitListDT <- setClass(
Class = "InitListDT",
slots = list(
list_to_dt = "list_OR_data.table"
),
prototype = list(
list_to_dt = data.table::data.table()
),
validity = function(object) {
if(data.table::is.data.table(object@list_to_dt)) {
if(colnames(object@list_to_dt)[1] != "row_name") {
stop('First column name must be "row_name".')
}
}
}
)
setMethod("initialize", "InitListDT", function(.Object, ...) {
.Object <- callNextMethod(.Object, ...)
.Object@list_to_dt <- Reduce(function(...) {
merge(..., by = "row_name")
}, .Object@list_to_dt)
return(.Object)
})
object <- InitListDT(list_to_dt = ls)
head(object@list_to_dt)
## row_name Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1: 1 5.1 3.5 1.4 0.2
## 2: 2 4.9 3.0 1.4 0.2
## 3: 3 4.7 3.2 1.3 0.2
## 4: 4 4.6 3.1 1.5 0.2
## 5: 5 5.0 3.6 1.4 0.2
## 6: 6 5.4 3.9 1.7 0.4
A more simple example as follows:
Test <- setClass(
"Test",
slots = list(
yeet = "character"
),
prototype = list(
yeet = character()
),
validity = function(object) {
print('Hello from validity.')
if(object@yeet[1] != "a") {
stop('First element of slot yeet does not equal "a".')
}
}
)
setMethod("initialize", "Test", function(.Object, ...) {
.Object <- callNextMethod(.Object, ...)
print('Hello from initialisation.')
.Object@yeet[1] <- "a"
validObject(.Object)
return(.Object)
})
Test(yeet = c("b", "d", "e"))
## [1] "Hello from validity."
## Error in validityMethod(object): First element of slot yeet does not equal "a".
As we can see the print first comes from the validity check. In the vase of a valid object we can re-check validity after initialisation using the function validObject
.
Test(yeet = c("a", "d", "e"))
## [1] "Hello from validity."
## [1] "Hello from initialisation."
## [1] "Hello from validity."
## An object of class "Test"
## Slot "yeet":
## [1] "a" "d" "e"
I'm wondering if every time a slot gets modified the validation function gets called. We previously created an object of type InitlistDT
. The first row must be named "row_name" if it's a data.table
. Let's try modifying this and see if we get an error. We will do it through a direct modification first and then by using a setter
method.
test <- object@list_to_dt
colnames(test)[1] <- "yeet"
object@list_to_dt <- test
No error was thrown, let's try using a setter
method now.
object <- InitListDT(list_to_dt = ls)
setGeneric("accessSlot<-", function(obj, value) {
standardGeneric("accessSlot<-")
})
## [1] "accessSlot<-"
setMethod("accessSlot<-", "InitListDT", function(obj, value) {
obj@list_to_dt <- value
validObject(obj) # call validation function here
return(obj)
})
accessSlot(object) <- test
## Error in validityMethod(object): First column name must be "row_name".
If we don't explicitly call the validObject
function in out setter
method then nothing is checked.
sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: aarch64-apple-darwin20.6.0 (64-bit)
## Running under: macOS Big Sur 11.4
##
## Matrix products: default
## BLAS: /opt/homebrew/Cellar/openblas/0.3.18/lib/libopenblasp-r0.3.18.dylib
## LAPACK: /opt/homebrew/Cellar/r/4.1.2/lib/R/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.1 xfun_0.29 bslib_0.3.1
## [4] purrr_0.3.4 reshape2_1.4.4 listenv_0.8.0
## [7] colorspace_2.0-2 vctrs_0.3.8 generics_0.1.1
## [10] htmltools_0.5.2 yaml_2.2.1 utf8_1.2.2
## [13] rlang_0.4.12 jquerylib_0.1.4 pillar_1.6.4
## [16] glue_1.6.0 DBI_1.1.1 bit64_4.0.5
## [19] lifecycle_1.0.1 plyr_1.8.6 stringr_1.4.0
## [22] munsell_0.5.0 gtable_0.3.0 future_1.23.0
## [25] codetools_0.2-18 evaluate_0.14 knitr_1.37
## [28] tzdb_0.2.0 fastmap_1.1.0 parallel_4.1.2
## [31] fansi_1.0.2 highr_0.9 Rcpp_1.0.8
## [34] readr_2.1.1 scales_1.1.1 vroom_1.5.7
## [37] jsonlite_1.7.2 bit_4.0.4 farver_2.1.0
## [40] parallelly_1.30.0 microbenchmark_1.4.9 hms_1.1.1
## [43] ggplot2_3.3.5 digest_0.6.29 stringi_1.7.6
## [46] dplyr_1.0.7 grid_4.1.2 cli_3.1.0
## [49] tools_4.1.2 magrittr_2.0.1 sass_0.4.0
## [52] tibble_3.1.6 crayon_1.4.2 tidyr_1.1.4
## [55] future.apply_1.8.1 pkgconfig_2.0.3 ellipsis_0.3.2
## [58] data.table_1.14.2 rstudioapi_0.13 assertthat_0.2.1
## [61] rmarkdown_2.11 reshape_0.8.8 R6_2.5.1
## [64] globals_0.14.0 compiler_4.1.2