two dataframes with different values hitting cache incorrectly

Question

monstrorivas opened this issue 4 years ago

Are pandas dataframes supported as function arguments in a @cached decorated function?

I tried to simplify this example with a smaller dataframe but @cached does seem to behave as one would expect for smaller dataframes.

However, when I ran the minimal code below with the attached data, two clearly different dataframes were treated as identical by the @cached-decorated function. As a result, df2 never reaches which_df; the call returns the value cached for df1, because df2 is assumed to be equal to df1 (and it is not!).

This is the test to replicate the problem. Please use the attached data to reproduce the unexpected behavior described in this issue.

import pandas as pd
from memoization import cached

@cached()
def which_df(df):
    # print("got inside function")
    return df.name
    
    
df1 = pd.read_pickle('memoization_test.pkl')
df1.name = "This is DF No. 1"
df2 = df1.interpolate()
df2.name = "This is DF No. 2"

df1.equals(df2)   # ==> False, since they are not identical
print(which_df(df1) + ', and it should be DF No. 1')
print(which_df(df2) + ', BUT it should be DF No. 2')

memoization_test.zip

lonelyenvoy · Answer 1 · Tue May 12 2020 23:43:54 GMT+0800 (China Standard Time)

Hi monstrorivas,

Thanks for reporting this. This was a bug: in order to memoize a call that takes a pandas dataframe, memoization converted the dataframe to a string using str() and assumed that this string exactly represented the dataframe's internal state. This is true for built-in types, but not for dataframes, because a dataframe omits part of its content when it is turned into a string. Take your data for example:

>>> print(str(df1))
                            with_nans
2019-01-12 03:20:30-06:00  655.559113
2019-01-12 03:21:00-06:00  658.763224
2019-01-12 03:21:30-06:00  655.639191
2019-01-12 03:22:00-06:00  651.353745
2019-01-12 03:22:30-06:00  648.590169
...                               ...
2019-02-11 13:18:00-06:00  668.615855
2019-02-11 13:18:30-06:00  673.101573
2019-02-11 13:19:00-06:00  675.024038
2019-02-11 13:19:30-06:00  676.706156
2019-02-11 13:20:00-06:00  663.969849

[87600 rows x 1 columns]

So two dataframes can be considered equal if you merely compare them using str(), as long as the first 5 rows, the last 5 rows, the column headers, and the reported shape are the same.
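The collision described above can be reproduced without the attached data. The following is a minimal sketch, assuming the default pandas display options (display.max_rows = 60, display.min_rows = 10), which are pinned explicitly here; the frames and the column name "value" are made up for illustration.

```python
import pandas as pd

# Two frames of 100 rows that differ only in one row near the middle.
df_a = pd.DataFrame({"value": [float(i) for i in range(100)]})
df_b = df_a.copy()
df_b.loc[50, "value"] = 0.0

# Pin the display options to the pandas defaults so the result does not
# depend on local settings. With more than 60 rows, only the first 5 and
# last 5 rows are rendered.
with pd.option_context("display.max_rows", 60, "display.min_rows", 10):
    same_text = str(df_a) == str(df_b)

print(same_text)          # True: the differing row is hidden behind "..."
print(df_a.equals(df_b))  # False: the contents really differ
```

Any cache key derived from str() alone would therefore map both frames to the same entry.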

To address this issue, I have published a new release v0.3.1. Please run pip install --upgrade memoization to upgrade and read the tutorial about custom cache keys so that pandas dataframes can be properly cached. Feel free to ask for help if needed.

Alberto Rivas · Answer 2 · Wed May 13 2020 01:25:34 GMT+0800 (China Standard Time)

OK, that makes sense. As a workaround, I pickle the dataframe before passing it to the memoized function and then deserialize it inside the function. Would something like that work in your implementation, instead of using str()?

Could you give me an example of what to use for the custom_key_maker for a dataframe?

Judah Rand · Answer 3 · Tue Oct 19 2021 23:03:12 GMT+0800 (China Standard Time)

Why not assemble all arguments into a single dictionary and pickle.dumps it? The result can be hashed with hashlib.md5, xxhash.xxh3_64, or similar.
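A minimal sketch of that suggestion, using only pickle and hashlib.md5 from the standard library. The helper name arguments_key is hypothetical, and sorting the keyword arguments is an extra assumption added here so that keyword order does not change the key.

```python
import hashlib
import pickle

import pandas as pd


def arguments_key(*args, **kwargs):
    """Hypothetical helper: one digest covering every positional and keyword argument."""
    # Sort keyword arguments so f(a=2, b=3) and f(b=3, a=2) produce the same key.
    payload = {"args": args, "kwargs": sorted(kwargs.items())}
    blob = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
    return hashlib.md5(blob).hexdigest()


df_a = pd.DataFrame({"value": [float(i) for i in range(100)]})
df_b = df_a.copy()
df_b.loc[50, "value"] = 0.0  # differs only in a row that str() would hide

print(arguments_key(df_a, window=3) == arguments_key(df_b, window=3))  # False
print(arguments_key(1, a=2, b=3) == arguments_key(1, b=3, a=2))        # True
```

Pickle output is not a canonical form, so two equal objects built in different ways may serialize to different bytes. That leads to extra cache misses rather than wrong hits, which is the safer failure mode for this bug.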