mschubert / ebits

R bioinformatics toolkit incubator

df module data.frame assembly is too slow

mschubert opened this issue · comments

With hpc_2.0, it is no problem to run > 1M function calls with very short runtimes and collect their results: my tests finished in under 2 hours.

However, the subsequent data.frame assembly takes over 24 hours. This needs to be much quicker for the module to be really useful.

The critical part is in data_frame/call:

Rprof()
    names(result) = 1:length(result)
    index$rep = add_rep

    if (!result_only) {
        rownames(index) = as.character(1:nrow(index))
        result = lapply(names(result), function(i) {
            if (is.null(names(result[[1]])))
                c(as.list(index[i,,drop=FALSE]), result=as.list(result[[i]]))
            else
                c(as.list(index[i,,drop=FALSE]), as.list(result[[i]]))
        })
    }
    if (tidy)
        result = dplyr::rbind_all(lapply(result, as.data.frame))
Rprof(NULL)

This gives the following profiling results:

> summaryRprof()
$by.self
                           self.time self.pct total.time total.pct
".Call"                       372.52    46.50     376.00     46.94
"pmatch"                      223.60    27.91     225.74     28.18
"as.list"                      90.30    11.27     327.62     40.90
"match"                        14.82     1.85      39.98      4.99
"deparse"                      12.38     1.55      51.58      6.44
"data.frame"                    7.96     0.99      93.26     11.64

$by.total
                           total.time total.pct self.time self.pct
"<Anonymous>"                  801.06    100.00      1.96     0.24
"lapply"                       424.96     53.05      0.88     0.11
"FUN"                          424.68     53.01      1.50     0.19
".Call"                        376.00     46.94    372.52    46.50
"as.list"                      327.62     40.90     90.30    11.27
"["                            235.76     29.43      0.42     0.05
"[.data.frame"                 235.34     29.38      4.58     0.57
"pmatch"                       225.74     28.18    223.60    27.91
"as.data.frame.list"            97.26     12.14      0.36     0.04

That is about 800 s of total runtime for 145,000 rows (4 columns).

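The bottleneck seems to be as.list (including the per-row index[i,,drop=FALSE] subsetting inside it, which accounts for the pmatch time) and dplyr's .Call in rbind_all.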
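
A minimal sketch of one way to sidestep both costs, assuming every call returns the same named scalar numeric fields: build each result column once and attach the index in a single step, rather than creating one list or data.frame per row. The objects below are hypothetical stand-ins for `index` and `result`, not the module's actual data, and this is not the fix that went into the repository.

```r
# Hypothetical stand-ins for the `index` and `result` objects in the issue
n = 1000
index = data.frame(a = seq_len(n),
                   b = rep(c("x", "y"), length.out = n),
                   stringsAsFactors = FALSE)
result = lapply(seq_len(n), function(i) list(mean = i / 2, sd = i / 10))

# Assumption: all results share the field names of the first one
fields = names(result[[1]])

# Extract each field across all results in one pass (one vector per column)
cols = lapply(fields, function(f) vapply(result, function(r) r[[f]], numeric(1)))
names(cols) = fields

# Attach the index columns once; rows stay in call order, so no row-wise
# subsetting of `index` is needed
assembled = cbind(index, as.data.frame(cols))
```

This only works when the results are regular. Results with differing fields or lengths would still need a row-binding step.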

With 14,500 rows, .Call takes < 15% of the runtime (< 2 s), so this does not seem to be O(n) at all.
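
A rough way to check the scaling is to time the assembly at a few input sizes and compare. The sketch below uses base R's `do.call(rbind, ...)` as a stand-in for `dplyr::rbind_all`, so it illustrates the measurement, not dplyr's behaviour. The function name `assemble` and the chosen sizes are made up for illustration.

```r
# Hypothetical row-wise assembly: one data.frame per call, then bind them all
assemble = function(n) {
    rows = lapply(seq_len(n), function(i) data.frame(id = i, value = i * 2))
    do.call(rbind, rows)
}

# Time the assembly at doubling input sizes
sizes = c(500, 1000, 2000)
elapsed = sapply(sizes, function(n) system.time(assemble(n))[["elapsed"]])

# For O(n) behaviour, seconds should roughly double from one row to the next
timing = data.frame(rows = sizes, seconds = elapsed)
print(timing)
```

If the time grows much faster than the row count, the binding step is superlinear at these sizes.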

The hpc-internal issues are solved in db81467; the rest depends on upstream.