ambiguous hash result for pydantic objects with default attributes
SiLiKhon opened this issue
I've been trying to define a cached pipeline parameterized by a pydantic object as the only input to my function. However, I noticed that the joblib.hash() result depends on whether you pass a value for a default-valued attribute at construction, even if the passed value is the same as the default. See the MRE:
import pydantic
import joblib
class Config(pydantic.BaseModel):
    field1: int
    field2: int = 5
c1 = Config(field1=0)
c2 = Config(**c1.dict())
c3 = Config(**c1.dict())
print(f"{(c1 == c2)=}")
print(f"{(c2 == c3)=}")
print(f"{(joblib.hash(c1) == joblib.hash(c2))=}")
print(f"{(joblib.hash(c2) == joblib.hash(c3))=}")
print(f"{joblib.__version__=}")
print(f"{pydantic.__version__=}")
Output:
(c1 == c2)=True
(c2 == c3)=True
(joblib.hash(c1) == joblib.hash(c2))=False
(joblib.hash(c2) == joblib.hash(c3))=True
joblib.__version__='1.3.2'
pydantic.__version__='1.10.13'
Expected: True in all comparisons.
If I remove the default field (field2: int = 5), everything works as expected.
I might have found the cause of this issue: pydantic objects have a private attribute, __fields_set__, that differs in this case:
# continuing from the MRE above
print(f"{c1.__fields_set__=}")
print(f"{c2.__fields_set__=}")
print(f"{c3.__fields_set__=}")
print(f"{(joblib.hash(c1) == joblib.hash(c2))=}")
c1.__fields_set__.add("field2")
print(f"{(joblib.hash(c1) == joblib.hash(c2))=}")
Output:
c1.__fields_set__={'field1'}
c2.__fields_set__={'field2', 'field1'}
c3.__fields_set__={'field2', 'field1'}
(joblib.hash(c1) == joblib.hash(c2))=False
(joblib.hash(c1) == joblib.hash(c2))=True
In pydantic v2, they state that this attribute is designed to store only the names of the fields that were explicitly set (note that the attribute has been renamed to __pydantic_fields_set__ in v2).
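Given the observation above, one possible user-side workaround is to hash the field values instead of the model object. The sketch below is only an illustration: config_hash is a hypothetical helper name (it exists in neither joblib nor pydantic), and it assumes every field value is itself something joblib.hash handles consistently.

```python
import joblib
import pydantic


class Config(pydantic.BaseModel):
    field1: int
    field2: int = 5


def config_hash(cfg: pydantic.BaseModel) -> str:
    """Hypothetical helper: hash only the field values of a model.

    The plain dict carries no record of which fields were set explicitly,
    so that bookkeeping cannot influence the hash.
    """
    # .dict() is the pydantic v1 API used in this thread.
    return joblib.hash(cfg.dict())


c1 = Config(field1=0)          # field2 left at its default
c2 = Config(**c1.dict())       # field2 passed explicitly, same value

hashes_match = config_hash(c1) == config_hash(c2)
print(f"{hashes_match=}")
```

With this helper, c1 and c2 should hash identically, since both produce the same dict of field values.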
The joblib hash is based on the pickle representation of the hashed object. For pydantic.BaseModel, this representation includes __fields_set__ (see here), which is why these two objects' hashes differ.
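To see this directly, you can inspect the state that pickle will serialize. This is a minimal sketch: fields_set_in_state is a hypothetical helper name, and the substring match on "fields_set" is an assumption made so that the check does not depend on the exact key name, which differs between pydantic versions.

```python
import pydantic


class Config(pydantic.BaseModel):
    field1: int
    field2: int = 5


def fields_set_in_state(model: pydantic.BaseModel) -> dict:
    """Hypothetical helper: return the 'fields set' entry of the pickle state."""
    # BaseModel.__getstate__() returns the dict that pickle serializes.
    state = model.__getstate__()
    return {key: set(value) for key, value in state.items() if "fields_set" in key}


c1 = Config(field1=0)          # only field1 set explicitly
c2 = Config(**c1.dict())       # both fields set explicitly

state1 = fields_set_in_state(c1)
state2 = fields_set_in_state(c2)
print(state1)
print(state2)
```

The two models compare equal, yet their pickle states differ in this one entry, which is enough to change the hash.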
I don't think we can have a solution for this in joblib, as it depends on the behavior of pydantic, but we can probably improve the doc to make this easier to debug, stating that caching only works with objects that can be hashed consistently.
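In the meantime, a workaround on the user side could look like the sketch below: cache a function that takes the plain dict of field values, and wrap it with a thin function that accepts the model. The names run_pipeline, _run_pipeline, and calls are hypothetical and used only for illustration, and the temporary cache directory is just for the demo.

```python
import tempfile

import joblib
import pydantic


class Config(pydantic.BaseModel):
    field1: int
    field2: int = 5


# Throwaway cache location for this demo.
memory = joblib.Memory(location=tempfile.mkdtemp(), verbose=0)

calls = []  # records every real (non-cached) execution


@memory.cache
def _run_pipeline(values: dict) -> int:
    # Stand-in for the expensive computation.
    calls.append(values)
    return values["field1"] + values["field2"]


def run_pipeline(cfg: Config) -> int:
    # Pass only the field values, so the cache key ignores which
    # fields were set explicitly at construction time.
    return _run_pipeline(cfg.dict())


c1 = Config(field1=0)
c2 = Config(**c1.dict())

r1 = run_pipeline(c1)
r2 = run_pipeline(c2)
print(r1, r2, len(calls))
```

Here the second call should be served from the cache, so the function body runs only once even though c1 and c2 were constructed differently.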
Closing this with the PR that adds a note to the doc.
Feel free to reopen if there is something we can do to improve this in joblib.