Planned features and improvements for Ceras v5

Question

rikimaru0345 opened this issue 5 years ago

This issue tracks ideas for new features and improvements for the next version of Ceras (v5).

Breaking Changes

The following are changes that can't be implemented in the current version (v4) because they change the binary format Ceras uses.

ReferenceFormatter (Done!)

Together with all the other changes, there is a chance to optimize the most common code-paths taken by the ReferenceFormatter<>!

It turns out that back-references are relatively rare, so we can treat them as a special case!
That lets us switch from a VarInt to a fixed 1-byte prefix that tells us about the upcoming data.

Simple cases:
- Null
- NewObject: followed directly by the object data
Extended cases:
- NewDerivedObject: followed by a Type and the object data
- InlineType: followed by a Type, which is used as the object itself
- ExternalObject: followed by a fixed Int32 for the external ID
- BackReference: followed by a fixed Int32 for the ID of the already encountered object

The most common cases by far (99%+) are Null, NewObject, and NewDerivedObject.
The important thing to note here is that all the common cases will profit from much faster read/write performance.

Surprisingly, all the uncommon cases will be faster as well: even though we have an additional byte, there are fewer branches in total (considering that VarInt already contains 4+ branches by itself).

Rework ReferenceFormatter using the scheme described above
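
To make the prefix scheme above concrete, here is a minimal sketch. The enum name, its byte values, and the helper methods are all made up for illustration and are not the actual Ceras code; only the case names and the "fixed Int32" payloads come from the list above.

```csharp
using System;

// Hypothetical names and byte values; not the actual Ceras implementation.
public enum RefPrefix : byte
{
    Null = 0,
    NewObject = 1,
    NewDerivedObject = 2,
    InlineType = 3,
    ExternalObject = 4,
    BackReference = 5,
}

public static class RefPrefixSketch
{
    // Null needs nothing but the prefix byte.
    public static int WriteNull(byte[] buffer, int offset)
    {
        buffer[offset] = (byte)RefPrefix.Null;
        return offset + 1;
    }

    // A back-reference is the prefix byte followed by a fixed 4-byte object ID.
    public static int WriteBackReference(byte[] buffer, int offset, int objectId)
    {
        buffer[offset] = (byte)RefPrefix.BackReference;
        WriteInt32Fixed(buffer, offset + 1, objectId);
        return offset + 5;
    }

    // Reading is a single byte load followed by one switch on the result.
    public static string Describe(byte[] buffer, ref int offset)
    {
        var prefix = (RefPrefix)buffer[offset++];
        switch (prefix)
        {
            case RefPrefix.Null:
                return "null";
            case RefPrefix.BackReference:
                return "back-reference to object " + ReadInt32Fixed(buffer, ref offset);
            case RefPrefix.ExternalObject:
                return "external object " + ReadInt32Fixed(buffer, ref offset);
            default:
                // NewObject, NewDerivedObject and InlineType would hand over to the
                // type/object formatters, which are out of scope for this sketch.
                throw new NotSupportedException(prefix.ToString());
        }
    }

    // Little-endian, always 4 bytes.
    static void WriteInt32Fixed(byte[] buffer, int offset, int value)
    {
        buffer[offset] = (byte)(value & 0xFF);
        buffer[offset + 1] = (byte)((value >> 8) & 0xFF);
        buffer[offset + 2] = (byte)((value >> 16) & 0xFF);
        buffer[offset + 3] = (byte)((value >> 24) & 0xFF);
    }

    static int ReadInt32Fixed(byte[] buffer, ref int offset)
    {
        int value = buffer[offset]
                    | (buffer[offset + 1] << 8)
                    | (buffer[offset + 2] << 16)
                    | (buffer[offset + 3] << 24);
        offset += 4;
        return value;
    }
}
```

The point of the sketch is the shape of the read path: one byte, one switch, and fixed-size payloads for the rare cases, instead of decoding a VarInt first.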

Type Serialization (Done!)

Type Codes (for framework types)
In the rare case that Ceras has to embed a Type into the binary, it is written using its full name, which is perfectly fine for user-defined types.
But for framework types (like List<>, Int32, ...) we could write something like a "builtin type-code" to save space.
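
A rough sketch of what such a lookup could look like. The table contents, the code values, and the `TryGetCode` name are assumptions for illustration; the issue does not specify which types get codes or how they are numbered.

```csharp
using System;
using System.Collections.Generic;

public static class TypeCodeSketch
{
    // Hypothetical table; the real set of codes is not specified in this issue.
    static readonly Dictionary<Type, byte> Codes = new Dictionary<Type, byte>
    {
        { typeof(int), 1 },
        { typeof(string), 2 },
        { typeof(List<>), 3 },
    };

    // Returns true and a 1-byte code for known framework types.
    // When it returns false, the caller would fall back to writing the full type name.
    public static bool TryGetCode(Type type, out byte code)
    {
        // A closed generic such as List<int> is looked up via its open definition List<>;
        // its generic arguments would have to be written separately.
        Type key = type.IsGenericType && !type.IsGenericTypeDefinition
            ? type.GetGenericTypeDefinition()
            : type;
        return Codes.TryGetValue(key, out code);
    }
}
```

With a table like this, `List<int>` would cost two code bytes (one for `List<>`, one for `Int32`) instead of a full type-name string.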

AotGenerator (Done!)

Instead of generating the content of the formatters using strings, the new AotGenerator should actually construct the expression tree that the "normal" Ceras uses (when not using VersionTolerance), and then convert that into a source-code string. That way we get some huge advantages:
- Improvements to DynamicFormatter will automatically be available in generated formatters!
- Drastically reduced potential for bugs, because the generator no longer has to essentially "rewrite" what the DynamicFormatter does
- Performance features like merge-blitting are automatically implemented in AOT code as well!
Split the old AotGenerator.exe into a .dll and an .exe, so that usage in Unity is much easier. There could be a Unity-script that automatically listens for changes and recompiles the
Add an attribute to generated formatters. That way when CerasAutoGenConfigAttribute is used, Ceras knows that it should ignore any old generated formatters while re-generating them.

Encoding (Done!)

Improved String Encoding
Currently we have to iterate over every string twice because we must know how many bytes it will require (using GetByteCount). That takes time. Another approach is to guess the byte-length, then write, then see if we have to relocate the string (in other words, do it all again).
Every serializer I know of does one of those two things.
I would prefer Ceras to be more efficient by encoding strings with a smarter scheme. The idea is that we'd write up to 254 bytes and then, if there are still characters left, encode the remaining bytes in one big block.
The only blocking issue here is that String.Create is only available in .NET Standard 2.1; without it we'd pay a performance penalty at deserialization time (having to allocate a char array, then create a new string from it). However, the impact might be negligible (memcpy is much faster than the UTF-8 decoding step), the char array can be thread-local and recycled, and we can avoid the penalty completely on netstandard2.1 later. If we don't make this change now, on the other hand, we'd have to live with the less efficient encoding forever.
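
For reference, the two-pass approach mentioned at the start of this section looks roughly like this. This is only a sketch of the baseline being criticized, not the proposed scheme; the class name and the fixed Int32 length prefix are assumptions for illustration.

```csharp
using System;
using System.Text;

public static class StringTwoPassSketch
{
    // Writes [Int32 byte length][UTF-8 bytes] and returns the offset after the data.
    public static int Write(byte[] buffer, int offset, string value)
    {
        // Pass 1: walk the whole string just to learn its encoded size.
        int byteCount = Encoding.UTF8.GetByteCount(value);

        // Length prefix (native byte order via BitConverter; fine for a sketch).
        BitConverter.GetBytes(byteCount).CopyTo(buffer, offset);

        // Pass 2: walk the string again to actually encode it.
        Encoding.UTF8.GetBytes(value, 0, value.Length, buffer, offset + 4);
        return offset + 4 + byteCount;
    }

    public static string Read(byte[] buffer, ref int offset)
    {
        int byteCount = BitConverter.ToInt32(buffer, offset);
        string value = Encoding.UTF8.GetString(buffer, offset + 4, byteCount);
        offset += 4 + byteCount;
        return value;
    }
}
```

The length has to be known before the first payload byte is written, which is exactly why the string is traversed twice.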

Config

Ability to configure formatter per Member!
- "Late initialization" to allow changes to the TypeConfig for as long as possible (until the first serialization or deserialization)
- Ensure that declaring types of members using a custom formatter are in fact handled by DynamicFormatter
config.IntegerEncoding
Allow users to decide when they want to use fixed encoding vs. variable encoding. For example, if you want to use Ceras for networking, you want as much compression as you can get; every CPU cycle that goes towards sending less data is worth it. So you could opt to encode all int, short, long, ... with variable encoding, making Ceras use WriteUInt32 instead of WriteUInt32Fixed.
Or, if your aim is to save data to disk (save-game, settings, level-data, game-database, ...), you want things to go fast, so you can always use fixed encoding, which is larger (always 2/4/8 bytes) but much faster.
- UseReinterpretFormatter is now superseded by IntegerEncoding
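
To illustrate the size/speed trade-off, here is a sketch of the two encodings. The method names are borrowed from the text above, but the bodies are a generic 7-bit varint and a little-endian fixed write; whether Ceras uses exactly this layout is an assumption.

```csharp
public static class IntegerEncodingSketch
{
    // Variable-length: 7 payload bits per byte, high bit set while more bytes follow.
    // Takes 1 to 5 bytes for a UInt32. Returns the offset after the written bytes.
    public static int WriteUInt32(byte[] buffer, int offset, uint value)
    {
        while (value >= 0x80)
        {
            buffer[offset++] = (byte)((value & 0x7F) | 0x80);
            value >>= 7;
        }
        buffer[offset++] = (byte)value;
        return offset;
    }

    // Fixed: always 4 bytes (little-endian), no loop and no data-dependent branches.
    public static int WriteUInt32Fixed(byte[] buffer, int offset, uint value)
    {
        buffer[offset] = (byte)(value & 0xFF);
        buffer[offset + 1] = (byte)((value >> 8) & 0xFF);
        buffer[offset + 2] = (byte)((value >> 16) & 0xFF);
        buffer[offset + 3] = (byte)((value >> 24) & 0xFF);
        return offset + 4;
    }
}
```

With this layout, the value 300 takes two bytes (0xAC 0x02) in variable encoding and four bytes in fixed encoding, while `uint.MaxValue` takes five bytes in variable encoding.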
PersistentName for Types and Members. Influences member order. Enables Ceras to work together with obfuscators.
- Add this new setting to TypeConfig
- Automatically set by [MemberConfig], [DataMember], or member.Name
- Maybe have "config.OnGetPersistentName", so the user can do all sorts of trickery (maybe having encrypted type names even in the attributes, only decrypting them when Ceras needs them)

Version Tolerance

Right now (in v4) when an application wants to read any binary data with Ceras, it must already have the correct Types (classes and structs). In other words, the format must be known.

This is fine for very high-performance use cases, but sometimes you want to trade in a bit of performance for a bit more leeway in terms of compatibility.
For example having a server-client network scenario where the client is slightly outdated...

Or another scenario would be an application wanting to inspect or even modify data when it doesn't know the specific format.

Formats like JSON, XML, or MsgPack embed additional information so they can be completely self-describing.
With the following improvements, the Ceras format can be self-describing as well!

That way Ceras could even handle simple cases (like changing an int field to a float or something) automatically, and provide an API for the user to handle more complicated cases.

More information in embedded Schema
- Type.Name: allow for types to change their name! Also makes Ceras compatible with obfuscators!
- MemberType: each member also records its type, in addition to the name. Allows for members changing their type, and even automatic conversion.
- Formatter: embed an ID for each used formatter (reinterpret, array, list, varint, dynamic, user, ...). That allows us to be robust against changes in IntegerEncoding, or warn the user when they're trying to read something but some formatter is missing! (maybe the old data was written using a user-created formatter)
Ensure only the Schema of types handled by DynamicFormatter/SchemaDynamicFormatter is actually written (but allow users to manually generate/write a Schema of any type)
Make Schema public and add an OnSchemaRead callback that is called when Ceras has loaded a new Schema and produced the mappings and conversions needed to load the old data. That way users can see how the format changed and what Ceras did to resolve the differences. Also provide a way to save/load a Schema to/from byte[].
(maybe) Inspect / .ToJson()
With all the new information in Schema, it should be possible to let users inspect the data, and even to (one-way) convert it into a JSON string.
config.VersionTolerance.PrefixSize setting to let the user select the prefix size of members (currently fixed UInt32). It should be possible to select ushort, byte, and even varint. Especially interesting for networking purposes.
config.EmbedSchema setting that you can disable, in which case you're responsible for somehow storing the Schema manually. Could be useful for network scenarios.
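
As an illustration of why a per-member size prefix matters for version tolerance, here is a small sketch. It assumes that each member is stored as a fixed UInt32 size followed by its payload, and that the reader learns from the Schema which written members the current type still has. The class, the method names, and the `memberIsKnown` array are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class SizePrefixSketch
{
    // Appends one member as [UInt32 size][payload].
    public static void WriteMember(List<byte> output, byte[] payload)
    {
        output.AddRange(BitConverter.GetBytes((uint)payload.Length));
        output.AddRange(payload);
    }

    // Walks the members in written order. memberIsKnown[i] says whether the
    // current version of the type still has the i-th written member.
    public static List<byte[]> ReadKnownMembers(byte[] data, bool[] memberIsKnown)
    {
        var result = new List<byte[]>();
        int offset = 0;
        foreach (bool known in memberIsKnown)
        {
            int size = (int)BitConverter.ToUInt32(data, offset);
            offset += 4;
            if (known)
            {
                var payload = new byte[size];
                Array.Copy(data, offset, payload, 0, size);
                result.Add(payload);
            }
            // Skipping an unknown member only needs its size, not its type.
            offset += size;
        }
        return result;
    }
}
```

A smaller prefix (ushort, byte, varint) would shrink the per-member overhead of 4 bytes shown here, which is where a PrefixSize setting would come in.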

Non-breaking changes

Type encoding should use size-limited strings to prevent an attacker from overloading the serializer that way.
(Maybe) Special handling for very large structs (>64 bytes). We could have a ISerializeByRef interface implemented by DynamicFormatter, ReinterpretFormatter and ArrayFormatter.
Ensure all lookups of private methods actually work in .NET Core as well (e.g. "GetUninitializedObject", which is private there)
Try to automatically select a constructor in more cases. Maybe filter the ones we can't use / map, then use the one that takes the most arguments?
When used in Unity: Catch and rethrow MissingMethodException and tell the user what the problem actually is (IL2CPP either removing a method, or not generating a generic instantiation for it). Explain how it can be fixed: Add link.xml for stripped methods. Call generic methods in their closed form beforehand. Maybe we could even generate some code for the user to copy-paste in the latter case.
Support open and half-open MethodInfos. (comment)
Change exception when no ctor is found to tell people about [CerasConstructor]
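
The constructor-selection idea from the list above ("filter the ones we can't use / map, then use the one that takes the most arguments") could be sketched like this. The matching rule used here (parameter name equals a public field or property name, ignoring case, with a compatible type) is an assumption; the issue does not say how parameters would be mapped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

public static class CtorSelectionSketch
{
    // Returns the public constructor with the most parameters among those whose
    // parameters can all be mapped to a member, or null if none qualifies.
    public static ConstructorInfo SelectConstructor(Type type)
    {
        // Collect public instance fields and properties by name (case-insensitive).
        var members = new Dictionary<string, Type>(StringComparer.OrdinalIgnoreCase);
        foreach (var field in type.GetFields(BindingFlags.Public | BindingFlags.Instance))
            members[field.Name] = field.FieldType;
        foreach (var property in type.GetProperties(BindingFlags.Public | BindingFlags.Instance))
            members[property.Name] = property.PropertyType;

        return type.GetConstructors()
            // Filter: every parameter must map to a member of a compatible type.
            .Where(ctor => ctor.GetParameters().All(p =>
                p.Name != null
                && members.TryGetValue(p.Name, out var memberType)
                && p.ParameterType.IsAssignableFrom(memberType)))
            // Prefer the constructor that restores the most members in one call.
            .OrderByDescending(ctor => ctor.GetParameters().Length)
            .FirstOrDefault();
    }
}
```

For `KeyValuePair<string, int>`, this picks the `(key, value)` constructor because both parameters map to the `Key` and `Value` properties. When the method returns null, that would be the place for the improved exception that points people to [CerasConstructor].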