joblib / joblib

Computing with Python functions.

Home Page: http://joblib.readthedocs.org

ambiguous hash result for pydantic objects with default attributes

SiLiKhon opened this issue

I've been trying to define a cached pipeline parameterized by a pydantic object as the only input to my function. I noticed, however, that the result of joblib.hash() depends on whether you pass a value for a default-valued attribute at construction, even if the value you pass is the same as the default. See the MRE:

import pydantic
import joblib

class Config(pydantic.BaseModel):
    field1: int
    field2: int = 5

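# field2 is not passed, so it takes its default value
c1 = Config(field1=0)
# c1.dict() contains both fields, so field2 is passed explicitly here
c2 = Config(**c1.dict())
c3 = Config(**c1.dict())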

print(f"{(c1 == c2)=}")
print(f"{(c2 == c3)=}")
print(f"{(joblib.hash(c1) == joblib.hash(c2))=}")
print(f"{(joblib.hash(c2) == joblib.hash(c3))=}")
print(f"{joblib.__version__=}")
print(f"{pydantic.__version__=}")

Output:

(c1 == c2)=True
(c2 == c3)=True
(joblib.hash(c1) == joblib.hash(c2))=False
(joblib.hash(c2) == joblib.hash(c3))=True
joblib.__version__='1.3.2'
pydantic.__version__='1.10.13'

Expected True in all comparisons.
If I remove the default field (field2: int = 5) everything works as expected.

I might've found the cause of this issue: pydantic objects have a private __fields_set__ attribute, and it differs between the objects in this case:

# continuing from the MRE above
print(f"{c1.__fields_set__=}")
print(f"{c2.__fields_set__=}")
print(f"{c3.__fields_set__=}")

print(f"{(joblib.hash(c1) == joblib.hash(c2))=}")
c1.__fields_set__.add("field2")
print(f"{(joblib.hash(c1) == joblib.hash(c2))=}")

Output:

c1.__fields_set__={'field1'}
c2.__fields_set__={'field2', 'field1'}
c3.__fields_set__={'field2', 'field1'}

(joblib.hash(c1) == joblib.hash(c2))=False
(joblib.hash(c1) == joblib.hash(c2))=True

In pydantic v2, they state that this attribute is designed to only store the names for the fields explicitly set (note that the name of this attribute has changed to __pydantic_fields_set__ in v2).

The joblib hash is based on the pickle representation of the hashed object. For pydantic.BaseModel, this representation includes __fields_set__, which is why these two objects' hashes differ.
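To illustrate the mechanism, here is a minimal standard-library sketch. It does not use pydantic or joblib: `Model`, its `fields_set` and `values` attributes, and `pickle_hash` are hypothetical stand-ins, and the real joblib hasher is more elaborate than hashing raw pickle bytes. The point is only that any hash derived from the pickled state will see bookkeeping that `==` ignores.

```python
import hashlib
import pickle


class Model:
    """Toy stand-in for a model class (hypothetical, not pydantic's code)."""

    def __init__(self, field1, field2=None):
        # Record which fields the caller passed explicitly.
        self.fields_set = {"field1"} if field2 is None else {"field1", "field2"}
        # Store the effective values, filling in the default for field2.
        self.values = {"field1": field1, "field2": 5 if field2 is None else field2}

    def __eq__(self, other):
        # Equality compares field values only and ignores fields_set.
        return self.values == other.values


def pickle_hash(obj):
    # Simplified stand-in for a pickle-based hash: digest of the pickled bytes.
    return hashlib.sha256(pickle.dumps(obj)).hexdigest()


a = Model(field1=0)            # field2 left at its default
b = Model(field1=0, field2=5)  # same value, passed explicitly

print(a == b)                                          # True
print(pickle_hash(a) == pickle_hash(b))                # False
print(pickle_hash(a.values) == pickle_hash(b.values))  # True
```

The two objects compare equal, but their pickled state differs because `fields_set` is part of it. Hashing only the field values removes the difference.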

I don't think we can have a solution for this in joblib, as it depends on the behavior of pydantic, but we can probably improve the doc to make this easier to debug, stating that caching only works with objects that can be hashed consistently.
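In the meantime, one possible user-side workaround is to hash the plain field values rather than the model object. This is a sketch, not something joblib or pydantic provides: `config_hash` is a hypothetical helper, and it assumes the field values themselves hash consistently (plain ints here). It uses `model_dump()` when available (pydantic v2) and falls back to `dict()` (pydantic v1).

```python
import joblib
import pydantic


class Config(pydantic.BaseModel):
    field1: int
    field2: int = 5


def config_hash(config):
    # Hypothetical helper: hash only the field values, so that the
    # "explicitly set" bookkeeping of the model is left out of the hash.
    if hasattr(config, "model_dump"):
        values = config.model_dump()  # pydantic v2
    else:
        values = config.dict()  # pydantic v1
    return joblib.hash(values)


c1 = Config(field1=0)            # field2 left at its default
c2 = Config(field1=0, field2=5)  # same value, passed explicitly

print(config_hash(c1) == config_hash(c2))
```

The same idea should carry over to a cached function: passing the dumped dict as the argument, instead of the model, keeps the cache key independent of how the model was constructed.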

Closing this with the PR that adds a note to the doc.
Feel free to reopen if there is something we can do to improve this in joblib.