Regression deserializing a subclassed source

pchanial opened this issue 2 years ago

The following code works correctly when cythonization is disabled:

from dataclasses import dataclass

from apischema import deserialize

@dataclass
class Foo:
    bar: str

class MyDict(dict):
    pass

data = MyDict({'bar': 'baz'})
expected = Foo(bar='baz')
actual = deserialize(Foo, data)

With cythonization enabled, the Liskov substitution principle (LSP) is violated: the last line fails with the following error:

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
apischema/utils.py:400: in wrapper
    return wrapped(*args, **kwargs)
apischema/deserialization/__init__.py:896: in deserialize
    return deserialization_method(
apischema/deserialization/methods.pyx:463: in apischema.deserialization.methods.SimpleObjectMethod.deserialize
    cpdef deserialize(self, object data):
apischema/deserialization/methods.pyx:464: in apischema.deserialization.methods.SimpleObjectMethod.deserialize
    return SimpleObjectMethod_deserialize(self, data)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   data2: dict = data
E   TypeError: Expected dict, got MyDict

apischema/deserialization/methods.pyx:912: TypeError
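The traceback suggests that the compiled line `data2: dict = data` accepts only an exact dict and rejects subclasses. The following is a rough pure-Python emulation of that check, not apischema's or Cython's actual code; the helper name `assign_as_exact_dict` is hypothetical, and other details of Cython's typed assignment are ignored. It also shows one possible workaround on the caller's side, which is to copy the data into a plain dict first.

```python
def assign_as_exact_dict(data):
    """Hypothetical stand-in for the compiled line `data2: dict = data`."""
    # Exact type comparison: a dict subclass does not pass,
    # even though isinstance(data, dict) would be True.
    if type(data) is not dict:
        raise TypeError(f"Expected dict, got {type(data).__name__}")
    return data


class MyDict(dict):
    pass


data = MyDict({"bar": "baz"})

try:
    assign_as_exact_dict(data)
except TypeError as err:
    message = str(err)
else:
    message = None

# Possible workaround: convert to a plain dict before handing the data over.
plain = assign_as_exact_dict(dict(data))
```

Copying with `dict(data)` costs one shallow copy per call, so it is only a stopgap for affected versions.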

Joseph Perez · Answer 1 · Mon Feb 07 2022 04:11:06 GMT+0800 (China Standard Time)

I did indeed use a Cython cast to the builtin type in order to improve performance, as operations on builtin types can be optimized. For example, the in operator used for object deserialization can call PyDict_Contains directly instead of PySequence_Contains (which ends up calling PyDict_Contains through a function pointer).

However, following your issue, I ran a quick benchmark, and the result was surprising: adding the cast (with an additional variable) is in fact slower for small lists (although it is faster, as expected, for bigger lists). That's why I ended up removing this cast.

Joseph Perez · Answer 2 · Mon Feb 07 2022 04:15:38 GMT+0800 (China Standard Time)

@pchanial By the way, could you give more details about your use case with builtin subtypes?

Pierre Chanial · Answer 3 · Mon Feb 07 2022 18:25:38 GMT+0800 (China Standard Time)

Thanks for the quick fix. We were just emulating PEP 584 in some unit tests in a Python 3.8 environment, but there could be other, less trivial use cases, such as relying on defaultdict, Counter or OrderedDict.
Btw, it would be great to make public some benchmarks internal to apischema so that contributors could check performance regressions.
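For readers unfamiliar with the use case: PEP 584 added the | and |= operators to dict in Python 3.9, so emulating it on Python 3.8 typically means subclassing dict. The sketch below is an assumption about what such an emulation could look like, not the code from the unit tests mentioned above; the class name `MergeableDict` is hypothetical.

```python
class MergeableDict(dict):
    """Hypothetical dict subclass providing PEP 584 style merging on Python 3.8."""

    def __or__(self, other):
        if not isinstance(other, dict):
            return NotImplemented
        merged = type(self)(self)
        merged.update(other)  # values from the right operand win
        return merged

    def __ror__(self, other):
        if not isinstance(other, dict):
            return NotImplemented
        merged = type(self)(other)
        merged.update(self)  # self is the right operand here
        return merged

    def __ior__(self, other):
        self.update(other)
        return self


merged = MergeableDict({"a": 1}) | {"b": 2}
reflected = {"a": 1} | MergeableDict({"a": 0, "b": 2})

in_place = MergeableDict({"a": 1})
alias = in_place
in_place |= {"c": 3}
```

Any instance of such a class is still a dict for isinstance purposes, which is why passing it to a deserializer that expects a mapping seems reasonable.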

Joseph Perez · Answer 4 · Mon Feb 07 2022 19:35:18 GMT+0800 (China Standard Time)

I don't think defaultdict would be a good idea because deserialization uses the in operator, but PEP 584 emulation is indeed an understandable use case.
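A short illustration of the defaultdict remark, using only the standard library: a membership test with in never calls default_factory, so code that checks for a key with in before reading it will treat the key as missing. How apischema actually reads fields is not shown in this thread, so this only demonstrates the defaultdict behavior itself.

```python
from collections import defaultdict

data = defaultdict(lambda: "default")

# Membership test: default_factory is not called and no key is created.
present_before = "bar" in data

# Item access: default_factory is called and the key is stored.
value = data["bar"]

present_after = "bar" in data
```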

> Btw, it would be great to make public some benchmarks internal to apischema so that contributors could check performance regressions.

Benchmarking is not a trivial thing, and apischema's benchmark is only there to give a rough estimate of its performance relative to alternatives. It is really poor and not reliable if you want precise results, such as tracking performance regressions.
I don't have any other benchmark available for the moment, but I don't think it matters a lot. If a performance regression is introduced, I think it will be visible directly in the code, as apischema's performance mostly comes from algorithmic optimization.

Pierre Chanial · Answer 5 · Tue Feb 08 2022 08:33:00 GMT+0800 (China Standard Time)

Out of curiosity, why does the deserializer use the in operator?