Memory issues when reading a large collection
demirev opened this issue · comments
I have an issue where reading a big collection from mongo results in additional memory being taken up by the R process and never released back.
E.g. in the snippet below I create some data with 1.6M rows and 16 columns, which I insert into mongo:

library(dplyr)
library(purrr)
library(mongolite)

fakeData <- map_dfr(1:50000, function(x) bind_cols(iris[1:32, ], mtcars))

connection <- mongo(
  db = "test",
  collection = "test0",
  url = "mongodb://localhost:27017"
)
connection$insert(fakeData)
Then after restarting the R process I run
library(dplyr)
library(purrr)
library(mongolite)
connection <- mongo(
  db = "test",
  collection = "test0",
  url = "mongodb://localhost:27017"
)
fakeReadout <- connection$find('{}')
rm(fakeReadout)
connection$disconnect()
rm(connection)
gc(full = TRUE)
After reading and then removing the data (as well as disconnecting the collection and running gc just in case), I now have an empty workspace, but R takes up ~18% of my RAM as indicated by top.
Is this some issue related to mongolite's memory management, or am I missing something?
PS: the output of sessionInfo:
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=bg_BG.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=bg_BG.UTF-8
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=bg_BG.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=bg_BG.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] mongolite_2.3.0 purrr_0.3.4     dplyr_1.0.4

loaded via a namespace (and not attached):
 [1] assertthat_0.2.1 crayon_1.3.4     R6_2.5.0         jsonlite_1.7.1   DBI_1.1.0        lifecycle_0.2.0  magrittr_2.0.1   pillar_1.4.6
 [9] rlang_0.4.10     rstudioapi_0.13  blob_1.2.1       vctrs_0.3.6      generics_0.1.0   ellipsis_0.3.1   tools_4.0.3      glue_1.4.2
[17] compiler_4.0.3   askpass_1.1      pkgconfig_2.0.3  openssl_1.4.3    tidyselect_1.1.0 tibble_3.0.4
I think R keeps some memory in its pool even after the objects have been collected. Did you check whether the problem also appears without mongolite?
Yes: reading the same dataset from an RDS file results in only 4% memory allocation to R (out of 16G), so a 4-5x difference compared to using find.
That doesn't prove that it is leaking, though. If you run your original experiment several times, does it keep adding more and more memory, or does it stabilize? The simplification step in connection$find() just needs a lot of memory, which R might keep around for future allocations.
I think it stabilizes, so it could be just the normal behavior. I'll confirm tomorrow after I try it a couple more times though.
If this turns out to be the case, do you have any advice on how to reduce the memory usage of find()? Or maybe a rule of thumb for how much memory I need relative to the size of the data I'm trying to read?
I originally ran into this in a workflow where the data I want to read is small enough to fit in memory, but running the query takes up all the resources and the OOM killer steps in.
The most memory expensive part is the final operation which has to combine all of the json documents into one big dataframe.
If you set a handler function in your find() to incrementally process chunks of data, then it will probably need much less memory.
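As an illustration, a minimal sketch of the handler approach (assuming the test0 collection created earlier in this thread and a local mongod; the per-chunk work here just counts rows, but it could equally write each chunk to disk or aggregate it):

```r
library(mongolite)

connection <- mongo(
  db = "test",
  collection = "test0",
  url = "mongodb://localhost:27017"
)

# The handler is called once per page of results with a data frame,
# so only one chunk of documents needs to be materialized at a time.
total <- 0
connection$find('{}', pagesize = 5000, handler = function(chunk) {
  total <<- total + nrow(chunk)  # or: write chunk to disk, aggregate, etc.
})
```

With a handler, find() returns nothing and the full result set is never assembled in memory at once.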
Thanks, I'll try that. Meanwhile I'll mark this as closed, since it doesn't seem to be a leak.
Just for completeness:
I solved my memory issues by using a custom handler as suggested. Specifically:
findWithHandler <- function(connection, ...) {
  handler <- R6::R6Class(
    classname = "mongo_handler",
    public = list(
      data = list(),
      handle = function(x) {
        count <- parent.frame(n = 2)$count # count comes from within the mongo_stream_in function
        self$data[[as.character(count + length(x))]] <- x
      },
      reduce = function() {
        dplyr::bind_rows(self$data[order(as.numeric(names(self$data)))])
      }
    )
  )
  thisHandler <- handler$new()
  connection$find(handler = thisHandler$handle, ...)
  thisHandler$reduce()
}
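For completeness, a usage sketch (assuming the same test0 collection and local mongod as earlier in the thread):

```r
connection <- mongolite::mongo(
  db = "test",
  collection = "test0",
  url = "mongodb://localhost:27017"
)

# Reads the whole collection, but accumulates per-chunk data frames
# instead of one big list of parsed documents.
fakeReadout <- findWithHandler(connection, query = '{}')
```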
Now, that is almost the same thing that happens within the mongo_stream_in function if no handler is supplied. The only difference is that mongo_stream_in stores the object page (the output of mongo_cursor_next_page, a direct conversion of the mongo JSON documents to lists) in the out environment at each step, and converts all page objects to a single data frame at the end, while the handler above stores data frames at each step and binds them together to return the final data (the argument passed to the handler is as.data.frame(simplify(page))).
It turns out that as.data.frame(page) is just much, much smaller than page. For the data I was working with it was something like a 5x+ difference in object.size, so avoiding storing the intermediate lists and working with data frames throughout enabled me to load the data in memory.
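The size difference is easy to reproduce without mongo at all. A rough sketch (the list-of-named-lists here only approximates the shape of a parsed page, not mongolite's actual internals):

```r
# Simulate a "page": one named list per document, as produced by
# parsing JSON documents into R lists.
page <- lapply(seq_len(1000), function(i) {
  list(Sepal.Length = 5.1, Sepal.Width = 3.5, Species = "setosa")
})

# The same data as a single columnar data frame
df <- do.call(rbind, lapply(page, as.data.frame))

# The row-wise list representation is several times larger, because
# every document carries its own list and names overhead.
as.numeric(object.size(page)) / as.numeric(object.size(df))
```

This is why converting each chunk to a data frame as it arrives, rather than keeping all the intermediate lists around until the end, lowers the peak memory requirement.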