df module data.frame assembly is too slow
mschubert opened this issue · comments
With hpc_2.0, it is no problem to run > 1M function calls with very short runtimes and get their results: my tests finished in under 2 hours of runtime.
However, the subsequent data.frame assembly takes over 24 hours. This needs to be quicker to be really useful.
The critical part is in data_frame/call:
Rprof()
names(result) = 1:length(result)
index$rep = add_rep
if (!result_only) {
    rownames(index) = as.character(1:nrow(index))
    result = lapply(names(result), function(i) {
        if (is.null(names(result[[1]])))
            c(as.list(index[i,,drop=FALSE]), result=as.list(result[[i]]))
        else
            c(as.list(index[i,,drop=FALSE]), as.list(result[[i]]))
    })
}
if (tidy)
    result = dplyr::rbind_all(lapply(result, as.data.frame))
Rprof(NULL)
This gives the following profiling results:
> summaryRprof()
$by.self
                     self.time self.pct total.time total.pct
".Call"                 372.52    46.50     376.00     46.94
"pmatch"                223.60    27.91     225.74     28.18
"as.list"                90.30    11.27     327.62     40.90
"match"                  14.82     1.85      39.98      4.99
"deparse"                12.38     1.55      51.58      6.44
"data.frame"              7.96     0.99      93.26     11.64
$by.total
                     total.time total.pct self.time self.pct
"<Anonymous>"            801.06    100.00      1.96     0.24
"lapply"                 424.96     53.05      0.88     0.11
"FUN"                    424.68     53.01      1.50     0.19
".Call"                  376.00     46.94    372.52    46.50
"as.list"                327.62     40.90     90.30    11.27
"["                      235.76     29.43      0.42     0.05
"[.data.frame"           235.34     29.38      4.58     0.57
"pmatch"                 225.74     28.18    223.60    27.91
"as.data.frame.list"      97.26     12.14      0.36     0.04
800 s runtime with 145,000 rows (4 columns) total.
The bottleneck seems to be as.list and dplyr's .Call in rbind_all.
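One possible way around the per-row as.list and [.data.frame calls is to assemble column-wise instead of building one list per row. The following is only a sketch, not the fix that went into hpc: `assemble_columnwise` is a hypothetical helper, and it assumes every element of `result` is an atomic vector of the same length (with the same names, if named).

```r
# Hypothetical helper, not part of the hpc code base.
# Assumes: length(result) == nrow(index), and every result element is an
# atomic vector of equal length (and equal names, if it has names).
assemble_columnwise = function(index, result) {
    stopifnot(nrow(index) == length(result))
    # One matrix with one row per call, instead of one data.frame per call
    mat = do.call(rbind, lapply(result, unlist))
    # Mimic the "result" naming of the original code for unnamed results
    if (is.null(colnames(mat)))
        colnames(mat) = if (ncol(mat) == 1) "result" else
            paste0("result", seq_len(ncol(mat)))
    # A single cbind of the whole index with all result columns
    cbind(index, as.data.frame(mat, stringsAsFactors = FALSE))
}

index = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
result = list(c(m = 1, s = 0.1), c(m = 2, s = 0.2), c(m = 3, s = 0.3))
df = assemble_columnwise(index, result)
```

Going through a matrix coerces all result values to one common type, so this only fits results that are homogeneous (e.g. all numeric); mixed-type results would still need a list-based approach.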
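With 14,500 rows, .Call takes < 15% (< 2 s), compared with over 370 s at 145,000 rows. Ten times the rows cost more than 180 times the .Call time, so this does not seem to be O(n) at all.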
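To put a rough number on the scaling, the slope of log(time) against log(rows) can be estimated from the two measurements above. This is a back-of-the-envelope sketch: `scaling_exponent` is a hypothetical helper, and since 2 s is only an upper bound for the 14,500-row run, the result is a lower bound on the exponent.

```r
# Hypothetical helper: slope of log(seconds) against log(n).
# A slope near 1 suggests linear scaling, near 2 quadratic.
scaling_exponent = function(n, seconds) {
    fit = lm(log(seconds) ~ log(n))
    unname(coef(fit)[2])
}

# .Call timings quoted in this issue; the first is an upper bound (< 2 s)
rows = c(14500, 145000)
call_seconds = c(2, 372.52)
exponent = scaling_exponent(rows, call_seconds)
print(exponent)
```

With these figures the exponent comes out above 2, i.e. worse than quadratic over this range, although two data points are far too few to pin down the actual complexity.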
ref: tidyverse/dplyr#1396
hpc-internal issues solved in db81467, rest dependent on upstream