grantjenks / python-diskcache

Python disk-backed cache (Django-compatible). Faster than Redis and Memcached. Pure-Python.

Home Page: http://www.grantjenks.com/docs/diskcache/


Cache is not successfully retrieved using memoize decorator with big nested object as function input

rderollepot opened this issue

Hello,

First of all, thanks for this great library!

I'm seeing what looks like strange behaviour with the memoize decorator, and although I'm not sure the way I'm trying to use the library completely fits its original purpose, I'm reporting it in case it helps uncover a hidden bug.

My use case is the following: for map-matching purposes, I need to compute some parameters used by the algorithm that only depend on a road network graph. Some of them are computationally demanding, which is why I'm interested in caching them to disk. However, my input parameter (the road network graph) is an object from a custom class that stores a big networkx graph and some pandas DataFrames with information about the nodes and edges, so it's quite heavy. That's why I'm not sure this case fits the original purpose of the library, but I tried anyway.

Somehow, the cache does not always work, but I think I found some reproducible patterns. Despite digging into the code, I could not find a bug to fix and submit a PR: the problem seems to lead into the pickle or sqlite3 libraries, and I don't know how to dig deeper. So instead I'm submitting the script below with the strange patterns I identified, which will hopefully help in finding a bug or show what I am doing wrong. I tried to remove as much as possible while still being able to reproduce the issue, but you still need my big RoadNet object to run the script; I'll add a download link at the bottom of the post. The script has 5 test cases and is intended to be executed twice per case to see whether the cache works.

My environment is macOS Ventura 13.1, Python 3.10, diskcache 5.4 (and scikit-learn 1.1.3), using PyCharm 2022.3.1.

from diskcache import Cache
import pickle
from sklearn.neighbors import BallTree

cache = Cache(r'cache')

@cache.memoize()
def my_first_func(my_obj):
    print("Getting in my_first_func...")
    tree = BallTree(my_obj.nodes_atts_df[['y', 'x']].values)
    return tree

@cache.memoize()
def my_second_func(my_obj):
    print("Getting in my_second_func...")
    tree = BallTree(my_obj.nodes_atts_df[['y', 'x']].values)
    return tree

@cache.memoize()
def my_third_func(my_ndarray):
    print("Getting in my_third_func...")
    tree = BallTree(my_ndarray)
    return tree

@cache.memoize()
def my_fourth_func(my_obj):
    print("Getting in my_fourth_func...")
    tree = BallTree(my_obj.nodes_atts_df[['y', 'x']].values)
    return 0

@cache.memoize()
def my_fifth_func(my_obj):
    print("Getting in my_fifth_func...")
    return my_obj

if __name__ == '__main__':
    PATH_TO_FILE = r'data/RoadNet_bordeaux.pickle'
    TEST_CASE = 1

    with open(PATH_TO_FILE, 'rb') as handle:
        RoadNet = pickle.load(handle)

    if TEST_CASE == 1:
        # cache doesn't work
        my_first_func(RoadNet)
    elif TEST_CASE == 2:
        # cache still doesn't work
        my_first_func(RoadNet)
        # cache surprisingly works
        my_second_func(RoadNet)
    elif TEST_CASE == 3:
        # cache works fine
        my_third_func(RoadNet.nodes_atts_df[['y', 'x']].values)
    elif TEST_CASE == 4:
        # cache doesn't work
        my_fourth_func(RoadNet)
    elif TEST_CASE == 5:
        # cache works fine
        my_fifth_func(RoadNet)

And here is the file (link expires February 23rd): https://filesender.renater.fr/?s=download&token=62b6b17c-0547-4e5b-a426-265f460ce55f

Hope it helps!

Best,
Romain

When the object is serialized, how big is it? Sometimes the cache evicts a large item due to the default size limit of 1 GB.
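If eviction is the suspect, it is easy to rule out by reopening the same cache directory with a larger limit and checking what it holds (a minimal sketch; the 4 GB figure is arbitrary):

    from diskcache import Cache

    # Reopen the same directory the script uses, with the size limit raised
    # well past the 1 GB default (2**30 bytes).
    cache = Cache(r'cache', size_limit=4 * 2**30)

    print(len(cache))       # number of entries currently stored
    print(cache.volume())   # estimated size of the cache on disk, in bytes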

If that’s not it, then the pickle-serialized bytes are likely different from one call to the next. This problem has been reported before (try searching the issues). Use a different serializer if possible.
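One way to test that hypothesis outside the cache (a small diagnostic sketch; PATH_TO_FILE is the path from the script above, and pickletools.optimize mirrors what DiskCache applies to keys):

    import pickle
    import pickletools

    PATH_TO_FILE = r'data/RoadNet_bordeaux.pickle'

    with open(PATH_TO_FILE, 'rb') as handle:
        road_net = pickle.load(handle)

    # The memoize key includes the pickled argument, so if these bytes are not
    # identical from one serialization to the next, the cached entry is never hit.
    first = pickletools.optimize(pickle.dumps(road_net))
    second = pickletools.optimize(pickle.dumps(road_net))
    print(first == second)

    # Note: the bytes can also differ between interpreter runs (e.g. set
    # iteration order under hash randomization), which a single-process
    # check like this cannot detect.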

One more note: large keys are not ideal. Consider using a hashing pattern instead, like:

from hashlib import sha256

def func(big):
    @cache.memoize()
    def _func(sig):
        nonlocal big
        …
    # serialize() stands in for whatever turns big into stable bytes
    sig = sha256(serialize(big)).digest()
    return _func(sig)

Then the cache only needs to store 256 bits as the key. And you can more easily inspect the hash signature of the object.
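Applied to the first function from the script above, the pattern could look something like this (a sketch only, not tested on the real RoadNet object; hashing the coordinate array rather than the pickled object is an assumption about what the result depends on, and the helper name is made up):

    from hashlib import sha256

    from diskcache import Cache
    from sklearn.neighbors import BallTree

    cache = Cache(r'cache')

    def my_first_func(my_obj):
        # Hash only the data the result actually depends on; the digest becomes
        # the small, inspectable cache key instead of the whole pickled object.
        coords = my_obj.nodes_atts_df[['y', 'x']].values
        sig = sha256(coords.tobytes()).hexdigest()

        @cache.memoize()
        def _my_first_func(sig):
            print("Getting in my_first_func...")
            return BallTree(coords)  # coords comes from the enclosing scope

        return _my_first_func(sig)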

Thanks for your quick answer and all those ideas !

The serialized pickle file takes 45 MB on disk.

I had noticed issue #54, yeah (which I think you're referring to?), but I thought it had been solved by pickletools.optimize. Rereading your caveats section, I realize now that you already mentioned it was still not completely certain. Also, cases 2 and 5 above, where the cache works fine while still using the big object as input, steered me away from that explanation. Out of curiosity (and for my mental health 😄), what do you think might explain that?

Thanks also for the nice hashing-pattern suggestion, which I'll try immediately!

Using your suggested pattern, I managed to do what I was trying to accomplish, thanks 👍

That second link is a great find, and what it describes is likely what's happening here and elsewhere. I wonder if it can be fixed in pickle? It seems like rather surprising behavior. Note that DiskCache already runs pickletools.optimize to improve the likelihood of identical serialized byte strings.

In the meantime, I suggest using a different serializer, like JSONDisk, which is more consistent but supports fewer types.
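The serializer is chosen when the cache is constructed (a minimal sketch; the directory name is arbitrary, and JSON-incompatible values such as a BallTree would still need to be converted first):

    from diskcache import Cache, JSONDisk

    # JSONDisk serializes keys and values as JSON, which is deterministic,
    # but it only accepts JSON-compatible types (dicts, lists, strings, numbers, ...).
    cache = Cache(r'cache_json', disk=JSONDisk)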

Yeah, I'd hope this behavior would be fixed in pickle, but one of the answers on SO says:

But you really shouldn't assume that pickling the same object will always give you the same result

so I'm not sure this is unexpected behavior from pickle, and it kind of messes with my mental model of what pickle is 😵‍💫
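For what it's worth, the point is easy to see even with tiny objects; for example, two equal dicts built in different insertion orders already pickle to different bytes:

    import pickle

    d1 = {'a': 1, 'b': 2}
    d2 = {'b': 2, 'a': 1}

    print(d1 == d2)                              # True: the dicts are equal
    print(pickle.dumps(d1) == pickle.dumps(d2))  # False: insertion order leaks into the bytes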

Using your suggested pattern and modifying what I supply as input to the cached function, I managed to implement proper caching that works consistently so far, so my problem is solved and the issue can be closed.

This pickle thing still itches me though... 😄