richfitz / thor

:zap::computer::zap: R client for the Lightning Memory-Mapped Database

Home Page: https://richfitz.github.io/thor

document / discover map size memory limit?

cboettig opened this issue · comments

Is there a way to determine in advance what's the largest number of key-value pairs / data size I can store? Apologies since this is probably me just totally failing to understand the idea of LMDB (I was just thinking it's a faster BDB...)

Here's a quick example: a dataset of key-value pairs that is roughly 100 MB, which I try to load into thor with mput:

curl::curl_download("https://data.carlboettiger.info/data/86/df/86df0d24994a34dfc5e638f8b378c1b6d52ff1a051c12d93aa04984c15bf9624", "test.tsv.gz")
df <- read.table("test.tsv.gz", col.names = c("url", "id"), stringsAsFactors = FALSE)

env <- thor::mdb_env(tempfile())
env$mput(df$id, df$url)

#Error in thor_mput(txn_ptr, db$.ptr, key, value, overwrite, append) : 
#  Error in mdb: MDB_MAP_FULL: Environment mapsize limit reached: thor_mput -> mdb_put (code: -30792)

In general, would you suggest just sticking with a DBI approach for handling large key-value stores? (I'm just looking for the most efficient way to retrieve the value for a key when the whole store won't fit in RAM, ideally with an embedded solution.)

I am not sure how to determine the limit, or even how it copes with bigger-than-RAM data, but this is tunable with the mapsize argument to mdb_env():

 mapsize: Maximum size database may grow to; used to size the memory
          mapping.  If database grows larger than ``map_size``, an
          error will be thrown and the user must close and reopen the
          ‘mdb_env’.  On 64-bit there is no penalty for making this
          huge (say 1TB). Must be <2GB on 32-bit.
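Given that, the original example can presumably be made to work by opening the environment with a larger map up front. A hedged sketch (the 2^30-byte / 1 GiB figure is my own choice to comfortably exceed the ~100 MB dataset, not a documented recommendation):

```r
library(thor)

# mapsize is in bytes; 2^30 = 1 GiB, well above the ~100 MB of data.
# df is the key-value data frame from the example above.
env <- mdb_env(tempfile(), mapsize = 2^30)
env$mput(df$id, df$url)  # should no longer hit MDB_MAP_FULL
```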

Thanks, perfect! Sorry I missed that. The docs don't say what the units are or what the default is, but I'm guessing it's an integer number of bytes?

#define DEFAULT_MAPSIZE 1048576

suggests the default size is ~1 MiB?
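A quick sanity check on that arithmetic, assuming the constant is a byte count:

```r
# DEFAULT_MAPSIZE is 1048576 bytes, which is exactly 2^20, i.e. 1 MiB
1048576 == 2^20   # TRUE
1048576 / 2^20    # 1 (MiB)
```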

Yeah, and there's a comment above it saying it's too small. I need to do a CRAN update on this ASAP - any requests for a better default? What size would you need for it to work well for you? I've only used thor on very small databases, so I haven't actually hit this limit.

I dunno -- what's the downside to setting it too large? Is there any performance hit for setting it much larger than you need? (The docs say "no penalty" on 64-bit, so I assume that's okay.) What happens if it is bigger than RAM? I'm also not clear whether I can grow an existing database by pointing at the same file but calling mdb_env() with a larger mapsize later.
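On that last point, the quoted docs say the user "must close and reopen the mdb_env" once the map fills, which suggests growth by reopening may work. A hedged sketch of what that might look like (the close() method name is an assumption about thor's API, and I haven't verified that existing data survives the reopen):

```r
path <- tempfile()
env <- thor::mdb_env(path, mapsize = 2^20)   # start at the 1 MiB default
# ... write until MDB_MAP_FULL is raised ...
env$close()                                  # assumed method name
env <- thor::mdb_env(path, mapsize = 2^30)   # reopen the same path with 1 GiB
```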

Guess we can experiment; I'll report back if I learn anything. For the moment, maybe 1 GB would be a good default, though if there's no obvious downside it might be better to set it arbitrarily large.

I'd definitely add whatever the limit is to the function documentation, along with the units. Now that I know about it I'm pretty comfortable setting it for my data, so the default no longer bothers me.

Really enjoying this package!