Retaining byte string serialization variants

chrysn opened this issue a year ago · comments

Byte strings have two widespread serialization variants in diagnostic notation: 'text' and h'74657874' (plus the rarer b32, h32 and b64 prefixes, which I personally don't care about, but hey, they're there). It would be nice if the variant used could be preserved, maybe as an extra Option property of ByteString.
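A minimal sketch of what such an optional property could look like. The `BytesEncoding` enum, the `encoding` field and the `to_diag` method are hypothetical names chosen for illustration; they are not the current cbor-diag-rs API. The sketch also assumes the quoted form is only used for printable ASCII and falls back to hex otherwise.

```rust
/// Hypothetical hint for how a byte string is written in diagnostic notation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BytesEncoding {
    /// Quoted form, e.g. 'text'
    Text,
    /// Hex form, e.g. h'74657874'
    Hex,
}

/// Hypothetical, simplified stand-in for the crate's byte string node.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ByteString {
    pub data: Vec<u8>,
    /// `Some` when the value came from diagnostic notation,
    /// `None` when it was decoded from binary CBOR.
    pub encoding: Option<BytesEncoding>,
}

impl ByteString {
    /// Render the byte string, honouring the hint where that is safe.
    pub fn to_diag(&self) -> String {
        // Assumption: only printable ASCII is emitted in the quoted form.
        let printable = self.data.iter().all(|b| (0x20..=0x7e).contains(b));
        match self.encoding {
            Some(BytesEncoding::Text) if printable => {
                let mut out = String::from("'");
                for &b in &self.data {
                    // Escape the quote and the backslash themselves.
                    if b == b'\'' || b == b'\\' {
                        out.push('\\');
                    }
                    out.push(b as char);
                }
                out.push('\'');
                out
            }
            // No hint, a hex hint, or unprintable data: use the hex form.
            _ => {
                let mut out = String::from("h'");
                for b in &self.data {
                    out.push_str(&format!("{:02x}", b));
                }
                out.push('\'');
                out
            }
        }
    }
}
```

With this shape, the same four bytes render as 'text' or as h'74657874' depending only on the hint, and a missing hint falls back to the encoder's default (hex in this sketch).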

RFC 8610 Appendix G (Extended Diagnostic Notation) provides even more options (including internal whitespace and embedded CBOR); they are more complex and not really on my wish list, but it might be good to be aware of them when implementing, so as not to duplicate work if they later become relevant.

Preserving the variant would be especially convenient when building diagnostic notation programmatically.

This would probably share patterns with #117, in that it is a property that is set when coming from DN, but unavailable when coming from CBOR. Filling those gaps when going from arbitrary CBOR to DN could be done by the user at the AST stage by applying arbitrary heuristics, some of which may be provided by cbor-diag-rs, but the choice is ultimately application-specific. (For example, a simple universal heuristic would be taking the ratio of printable ASCII characters; a more application-specific choice might be guided by CDDL.)
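The printable-ASCII ratio heuristic mentioned above could be sketched roughly as follows. The names `printable_ratio`, `guess_encoding` and `BytesEncoding`, as well as the caller-supplied threshold, are assumptions made for this example and are not part of cbor-diag-rs.

```rust
/// Hypothetical hint type, as it might be attached to a byte string node.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BytesEncoding {
    Text,
    Hex,
}

/// Fraction of bytes in the printable ASCII range 0x20..=0x7e.
/// An empty input yields 0.0.
pub fn printable_ratio(data: &[u8]) -> f64 {
    if data.is_empty() {
        return 0.0;
    }
    let printable = data
        .iter()
        .filter(|b| (0x20..=0x7e).contains(*b))
        .count();
    printable as f64 / data.len() as f64
}

/// Suggest the quoted form when at least `threshold` of the bytes are
/// printable ASCII, and the hex form otherwise.
pub fn guess_encoding(data: &[u8], threshold: f64) -> BytesEncoding {
    if printable_ratio(data) >= threshold {
        BytesEncoding::Text
    } else {
        BytesEncoding::Hex
    }
}
```

A threshold of 1.0 only suggests the quoted form when every byte is printable; lower thresholds would require the encoder to escape the remaining bytes, which is exactly the kind of trade-off that is better left to the application.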

chrysn · Answer 1 · Thu Jul 27 2023 06:41:07 GMT+0800 (China Standard Time)

Looking at the implementation a bit more closely, I noticed that when serializing into diagnostic notation, the tags that indicate special handling on the JSON side are conveniently also used to guide display in diagnostics.

While it's perfectly practical to keep the handling of those tags in there, a full solution to retaining serialization details would allow moving that code out into a more heuristic annotation step. It could look like this: binary CBOR doesn't get any diagnostic-format hints at parsing time, and all unannotated byte strings are expressed using the diagnostic encoder's default. But if the tree is passed through a mutating walker in between that fills in hints (interpreting the tags being one source of hints), then that step would fill the gaps. As a benefit, the user would have the option to either preserve the serialization types for data ingested from diagnostic notation, or to clear them all out to purely apply the encoder's preferences, or to replace the original versions with what the (or rather, some) annotator sets.
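As a rough sketch of such an annotation step, assuming a heavily simplified AST: the `Item` and `Hint` types and the functions `annotate_from_tags` and `clear_hints` are hypothetical and do not mirror the real cbor-diag-rs data model. The tag numbers are the expected-conversion tags from RFC 8949 (21 for base64url, 22 for base64, 23 for base16).

```rust
/// Hypothetical display hint for a byte string.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Hint {
    Text,
    Hex,
    Base64,
    Base64Url,
}

/// Hypothetical, heavily simplified AST.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Item {
    Bytes { data: Vec<u8>, hint: Option<Hint> },
    Array(Vec<Item>),
    Tag(u64, Box<Item>),
    Other,
}

/// Mutating walker: fill in missing hints from enclosing conversion tags.
/// Hints that are already present (e.g. from parsed DN) are preserved.
pub fn annotate_from_tags(item: &mut Item, inherited: Option<Hint>) {
    match item {
        Item::Bytes { hint, .. } => {
            if hint.is_none() {
                *hint = inherited;
            }
        }
        Item::Array(items) => {
            for child in items {
                annotate_from_tags(child, inherited);
            }
        }
        Item::Tag(tag, inner) => {
            // An inner conversion tag overrides an outer one.
            let next = match *tag {
                21 => Some(Hint::Base64Url),
                22 => Some(Hint::Base64),
                23 => Some(Hint::Hex),
                _ => inherited,
            };
            annotate_from_tags(inner, next);
        }
        Item::Other => {}
    }
}

/// Drop all hints so that only the encoder's defaults apply.
pub fn clear_hints(item: &mut Item) {
    match item {
        Item::Bytes { hint, .. } => *hint = None,
        Item::Array(items) => {
            for child in items {
                clear_hints(child);
            }
        }
        Item::Tag(_, inner) => clear_hints(inner),
        Item::Other => {}
    }
}
```

Running `annotate_from_tags(&mut tree, None)` after parsing keeps ingested hints and fills only the gaps; running `clear_hints` first would instead replace every hint with what the annotator sets.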

chrysn · Answer 2 · Thu Jul 27 2023 08:07:44 GMT+0800 (China Standard Time)

One aspect of this is that not only do the strings have serialization variants, their diagnostic notation may also be a concatenation of differently encoded chunks. I'm not sure what a good level of modelling would be here. Full round-tripping of arbitrary diagnostic notation strings may or may not be desirable; if it is not (and I wouldn't need it), preservation of diagnostic notation would be best-effort. (So for example, h'4141' and 'AA' might round-trip, but h'41' 'A' would become either of them.)

If we went for full round-tripping, the options would be to have ByteString contain a single Vec<u8> and a parallel Vec<(length, encoding)> of hints (easy to manipulate on the CBOR side), or a Vec of individually encoded byte strings (easy to manipulate on the diagnostic notation side). But I'm not sure we need it; I hope we don't.
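To make the two candidate layouts concrete, here is a hedged sketch of both, with conversions between them. All type and function names (`Encoding`, `FlatBytes`, `Chunk`, `flatten`, `to_chunks`) are invented for this illustration and are not taken from cbor-diag-rs.

```rust
/// Hypothetical per-chunk encoding.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Encoding {
    Text,
    Hex,
}

/// Layout 1: one contiguous buffer plus parallel (length, encoding) hints.
/// Easy to manipulate on the CBOR side, since `data` is the full value.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FlatBytes {
    pub data: Vec<u8>,
    pub hints: Vec<(usize, Encoding)>,
}

/// Layout 2: a list of individually encoded chunks.
/// Easy to manipulate on the diagnostic notation side.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Chunk {
    pub data: Vec<u8>,
    pub encoding: Encoding,
}

/// Concatenate chunks into the flat layout, recording each chunk's length.
pub fn flatten(chunks: &[Chunk]) -> FlatBytes {
    let mut data = Vec::new();
    let mut hints = Vec::with_capacity(chunks.len());
    for chunk in chunks {
        data.extend_from_slice(&chunk.data);
        hints.push((chunk.data.len(), chunk.encoding));
    }
    FlatBytes { data, hints }
}

impl FlatBytes {
    /// Split the buffer back into chunks. Returns `None` when the hint
    /// lengths do not add up to exactly `data.len()`, which can happen if
    /// `data` was edited without updating the hints.
    pub fn to_chunks(&self) -> Option<Vec<Chunk>> {
        let mut rest = self.data.as_slice();
        let mut chunks = Vec::with_capacity(self.hints.len());
        for &(len, encoding) in &self.hints {
            if len > rest.len() {
                return None;
            }
            let (head, tail) = rest.split_at(len);
            chunks.push(Chunk { data: head.to_vec(), encoding });
            rest = tail;
        }
        if rest.is_empty() {
            Some(chunks)
        } else {
            None
        }
    }
}
```

The `None` case in `to_chunks` shows the cost of the flat layout: the hints can drift out of sync with the data, whereas the chunk list cannot represent an inconsistent state but requires concatenation to obtain the actual byte string. For h'41' 'A', the flat layout would hold the bytes 0x41 0x41 with hints (1, Hex) and (1, Text).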