jcrist / msgspec

A fast serialization and validation library, with builtin support for JSON, MessagePack, YAML, and TOML

Home Page: https://jcristharif.com/msgspec/


Make msgspec more aware of large data and other serialization protocols

fungs opened this issue

Description

This feature request is more of a discussion about how to use or generalize the current functionality to support large data and other serialization protocols. In my application, I'm looking for a specification and validation layer in front of a custom binary serialization protocol. Generally, this works well using the to_builtins() / myprotocol_dumps() pair. However, some assumptions geared towards small data and schemaless formats like JSON are suboptimal for my use case.
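
For illustration, here is roughly what that pairing looks like today; myprotocol_dumps() is a placeholder for my custom binary protocol, not a real function:

```python
import msgspec

class Point(msgspec.Struct):
    x: float
    y: float
    tags: list[str]

def myprotocol_dumps(obj) -> bytes:
    """Placeholder for my custom binary protocol's encoder.

    Assumed to accept plain builtins (dict/list/str/int/float/bool/None).
    """
    ...

# msgspec reduces the Struct to builtins, the custom protocol produces the bytes.
payload = myprotocol_dumps(msgspec.to_builtins(Point(1.0, 2.0, ["a", "b"])))
```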

First observation

The JSON-inspired set of basic Python data types in the intermediate "built-in" state does not match many binary serialization protocols. These protocols typically have a much richer type system, for which the corresponding Python objects need to be passed through unchanged. On the other hand, protocols that require a schema often don't have a dict/map type, because a mapping is then merely a list of 2-tuples and object attribute names are not transferred at all. That is not an issue per se; it just shows that the definition of the intermediate types is an arbitrary split of the serialization chain, with the first part being done by msgspec and custom type reductions and the second part being done by the backend serializer. The intermediate representation is basically an interface.
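
If I read the current API correctly, the builtin_types argument of to_builtins() already helps here by letting some richer types pass through untouched; a minimal sketch of my own usage:

```python
import datetime
import uuid

import msgspec

class Record(msgspec.Struct):
    id: uuid.UUID
    created: datetime.datetime
    blob: bytes

rec = Record(uuid.uuid4(), datetime.datetime.now(datetime.timezone.utc), b"\x00\x01")

# Types listed in builtin_types are passed through as-is instead of being
# reduced to JSON-friendly forms, so a binary backend can encode them natively.
builtins_repr = msgspec.to_builtins(
    rec,
    builtin_types=(uuid.UUID, datetime.datetime, bytes),
)
# {'id': UUID(...), 'created': datetime.datetime(...), 'blob': b'\x00\x01'}
```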

Second observation

The intermediate state is currently materialized in memory before being passed to the backend serializer. This means that in the worst case we keep three copies (or more?) of the data in memory: the original, the intermediate representation, and the serialized output (if the backend cannot stream the data right away). The same happens during deserialization.

There may be multiple solutions to this; I can think of one right away: take a generator approach, representing the basic objects as a stream that can be consumed by the backend serializer (here dumps()), and vice versa for the other direction.
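
Purely as an illustration of the idea (none of this is msgspec API), a streaming reduction could yield events instead of building the whole builtins tree:

```python
from typing import Any, Iterator

def to_builtins_stream(obj: Any) -> Iterator[tuple[str, Any]]:
    """Hypothetical generator-based counterpart to to_builtins().

    Yields (event, value) pairs so the backend never needs the full
    intermediate tree in memory at once.
    """
    if isinstance(obj, dict):
        yield ("map_start", len(obj))
        for key, value in obj.items():
            yield from to_builtins_stream(key)
            yield from to_builtins_stream(value)
        yield ("map_end", None)
    elif isinstance(obj, (list, tuple)):
        yield ("array_start", len(obj))
        for item in obj:
            yield from to_builtins_stream(item)
        yield ("array_end", None)
    else:
        yield ("scalar", obj)

def myprotocol_dump_stream(events: Iterator[tuple[str, Any]], fileobj) -> None:
    """Placeholder backend: writes each event straight to the output stream."""
    for event, value in events:
        ...  # encode each event incrementally
```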

Third observation

Typed serialization schemes differentiate between more efficient uniform arrays, in which all elements have the same type, and JSON/Python-like dynamic lists of arbitrary objects. The intermediate state encodes everything as a native Python list, so the backend serializer must be told about the original object type to select the proper encoding. The same probably goes for distinguishing dictionaries that result from class objects from dictionaries used as (derived) container types.
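
One workaround I can see today is to recover the original type information via msgspec.inspect and let the backend choose the encoding based on it; a rough sketch, assuming I read the inspect module right:

```python
import msgspec
import msgspec.inspect as mi

class Matrix(msgspec.Struct):
    values: list[float]   # candidate for a uniform/typed array encoding
    meta: list            # heterogeneous, needs a dynamic list encoding

info = mi.type_info(Matrix)
for field in info.fields:
    t = field.type
    if isinstance(t, mi.ListType) and not isinstance(t.item_type, mi.AnyType):
        print(field.name, "-> uniform array of", t.item_type)
    else:
        print(field.name, "-> dynamic list / other")
```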

Fourth observation

Right now, if I wanted to do all this properly, I would probably have to re-implement the entire encode and decode pipelines, including type validation, for my backend serializer. In the end, I would just borrow the type specification syntax, Struct, and the utility functions. I'm not sure whether the validation code could be reused. That suggests msgspec should probably try to better enable third-party pipelines, as not all serialization protocols can or should be included.
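
To make that concrete, this is about as far as I can get today by only borrowing msgspec's pieces (Struct definitions, to_builtins() for reduction, convert() for validation); myprotocol_dumps()/myprotocol_loads() are again placeholders:

```python
from typing import Any, Generic, Type, TypeVar

import msgspec

T = TypeVar("T")

def myprotocol_dumps(obj: Any) -> bytes:
    """Placeholder backend encoder (builtins -> bytes)."""
    ...

def myprotocol_loads(data: bytes) -> Any:
    """Placeholder backend decoder (bytes -> builtins)."""
    ...

class MyProtocolCodec(Generic[T]):
    """Hypothetical third-party codec built around msgspec's reusable parts."""

    def __init__(self, type: Type[T]):
        self.type = type

    def encode(self, obj: T) -> bytes:
        # msgspec handles the Struct -> builtins reduction.
        return myprotocol_dumps(msgspec.to_builtins(obj))

    def decode(self, data: bytes) -> T:
        # msgspec.convert() validates the builtins against the schema
        # and rebuilds the typed Struct.
        return msgspec.convert(myprotocol_loads(data), type=self.type)
```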

What I haven't mentioned so far: MSGSPEC IS GREAT WORK!