python-cachier / cachier

Persistent, stale-free, local and cross-machine caching for Python functions.

Cross-machine cache

saloni10 opened this issue

I was going through the docs for cachier, and I am really interested in using it in my project.
There are two features listed in the docs that confuse me a bit:
Local caching using pickle files.
Cross-machine caching using MongoDB.
What do these mean?

Some context: I have 3 pods (servers) and a single NFS-based disk storage mounted on them.
Since cachier is thread-safe, I was hoping it suits my purpose of having a persistent cache on the NFS storage. I want to use pickle-based caching, since databases like SQLite/MongoDB are not recommended on NFS-based storage.

Does it support my purpose?
Is saving as pickle objects not suited for cross-machine caching with cachier?
What does it mean to have cross-machine caching using MongoDB?
What does local caching using pickle files mean?

What this means is that different Python kernels - meaning processes - on the same machine can share a cache between themselves using the pickle-based core.

Also, Python kernels/processes across different machines can share a cache amongst themselves using the MongoDB-based core, which uses a dedicated collection on a MongoDB server (which you need to set up).
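For concreteness, here is a minimal sketch of both cores; the MongoDB host, database, and collection names are placeholders you would replace with your own:

```python
import pymongo
from cachier import cachier

# Pickle-based core (the default): processes on the same machine
# share the cache files written under the cache directory.
@cachier()
def heavy_local(x, y):
    return x ** y

def my_mongetter():
    # Placeholder connection details; point this at your own server.
    client = pymongo.MongoClient("mongodb://my-mongo-host:27017/")
    return client["cachier_db"]["cachier_cache"]

# MongoDB-based core: any machine that can reach the MongoDB server
# shares this cache through the collection returned by the mongetter.
@cachier(mongetter=my_mongetter)
def heavy_shared(x, y):
    return x ** y
```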

Regarding your question about what will work for you:

  1. I think the MongoDB-based core will work for you out of the box, quite naively treating your pods as separate servers and ignoring the shared NFS disk. But it will require you to host a MongoDB server somewhere.

  2. In your special case of three servers sharing the same disk, I think the pickle-based core will also work for you as a cross-machine cache; you just need to have your code refer to the same path, whichever server it runs on, and it should work out of the box (just do not disable cache reloading with pickle_reload=False); see the sketch below. But this is not what this package was built for, so I can't assure you of that.
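If I understand your setup correctly, that would look something like this; the mount path is a placeholder, and the only requirement is that it resolves to the same NFS directory on all three pods:

```python
from cachier import cachier

# '/mnt/nfs/cachier' is a placeholder for your shared NFS mount path.
# Leave pickle_reload at its default (True) so each process picks up
# cache entries written by the others.
@cachier(cache_dir="/mnt/nfs/cachier")
def expensive_query(x, y, z):
    ...
```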

I hope this helps. Please reopen the issue if you have any more questions. :)

Thanks for the quick response, really appreciate it.
I tried the pickle solution (only on a single machine right now). There's another thought on my mind:

  1. When I use the cachier decorator on a function f(x, y, z), the return value for every permutation of arguments is stored in the same pickle file. This pickle becomes really huge with more calls, and reading and writing through cachier thus becomes slow. In my case the pickle file grows to around 2-3 GB and even more.
  2. Is there a workaround where we could possibly have different pickle files for calls to the same function with different args, to avoid large pickle files?
    Does this ask/question make sense?

Also, quoting the reply:
“just do not disable cache reloading with pickle_reload=False). But this is not what this package was built for, so I can't assure you of that.”

Curious as to why you think that the package might not support, or might not have been built for, this kind of scenario?

Yes, that is a known issue:
#29

You definitely could, and should, suggest a PR to solve this. As long as you provide a configuration flag to turn this on/off, plus documentation, it's all good.
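For illustration only, the idea behind such a PR, one pickle file per argument set instead of one file per function, could be sketched as a standalone decorator like this (this is not cachier's API, just a rough outline of the approach):

```python
import hashlib
import os
import pickle
from functools import wraps

def per_call_pickle_cache(cache_dir):
    """Cache each call in its own pickle file, keyed by a hash of the args."""
    os.makedirs(cache_dir, exist_ok=True)

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Hash the function name and arguments into a per-call filename.
            key = hashlib.sha256(
                pickle.dumps((func.__name__, args, sorted(kwargs.items())))
            ).hexdigest()
            path = os.path.join(cache_dir, key + ".pkl")
            if os.path.exists(path):
                with open(path, "rb") as f:
                    return pickle.load(f)
            result = func(*args, **kwargs)
            with open(path, "wb") as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator
```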

Regarding your last question: I just never checked whether different machines can share a cache via a shared disk. The first problem that comes to mind is that the pickle-based core is thread- and process-safe because threads/processes acquire a lock when writing to the file. Intuitively, this should also work for 3 different machines sharing a drive; otherwise, the whole mechanism of sharing disks by mounting would randomly break on file rewrites. But just try it!

Maybe the specific locking package I use doesn't apply here or something... :)
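For anyone who wants to "just try it", a quick check could look something like this; run it once on one pod and then again on another, with the placeholder path pointing at the shared mount:

```python
import time
from cachier import cachier

@cachier(cache_dir="/mnt/nfs/cachier")  # placeholder shared NFS path
def slow_square(x):
    time.sleep(5)  # stand-in for real work
    return x * x

start = time.time()
print(slow_square(7))
print(f"took {time.time() - start:.2f}s")
# On the first pod this takes ~5 seconds; if cross-machine caching
# works over NFS, the same call on a second pod returns almost instantly.
```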