Regression deserializing a subclassed source
pchanial opened this issue · comments
The following code works correctly when cythonization is disabled:
from apischema import deserialize
@dataclass
class Foo:
bar: str
class MyDict(dict):
pass
data = MyDict({'bar': 'baz'})
expected = Foo(bar='baz')
actual = deserialize(Foo, data)
Cythonization triggers an LSP violation: the last line fails with the error:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
apischema/utils.py:400: in wrapper
return wrapped(*args, **kwargs)
apischema/deserialization/__init__.py:896: in deserialize
return deserialization_method(
apischema/deserialization/methods.pyx:463: in apischema.deserialization.methods.SimpleObjectMethod.deserialize
cpdef deserialize(self, object data):
apischema/deserialization/methods.pyx:464: in apischema.deserialization.methods.SimpleObjectMethod.deserialize
return SimpleObjectMethod_deserialize(self, data)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> data2: dict = data
E TypeError: Expected dict, got MyDict
apischema/deserialization/methods.pyx:912: TypeError
I'd indeed use Cython cast to builtin type in order to improve performance, as builtin operations can be optimized. For example, in
operator used for object deserialization can use PyDict_Contains
instead of PySequence_Contains
(this one calling PyDict_Contains
using a function pointer).
However, following your issue, I've made a quick benchmark, and the result was surprising: adding the cast (with an additional variable) is in fact slower for small lists (but it's better as expected for bigger lists). That's why I've finally removed this cast.
@pchanial By the way, could you give more details about your use case with buitin subtypes?
Thanks for the quick fix. We were just emulating PEP 584 in some unit tests on a Python 3.8 environment, but there could be less trivial other use cases, such as relying on defaultdict, Counter or OrderedDict.
Btw, it would be great to make public some benchmarks internal to apischema so that contributors could check performance regressions.
I don't think defaultdict
would be a good idea because deserialization uses in
operator, but PEP 584 emulation is indeed an understandable use case.
Btw, it would be great to make public some benchmarks internal to apischema so that contributors could check performance regressions.
Benchmark is not a trivial things, and apischema's one is just here to give a rough estimation of the relative performance in comparison to alternatives. But it's really poor and not reliable if you want some precise results, like tracking performance regressions.
I don't have any other benchmark available for the moment, but I don't think it matter a lot. If a performance regression is introduced, I think it will be visible directly in the code, as apischema performance mostly comes from algorithmic optimization.
Out of curiosity, why does the deserializer use the in
operator?