protojson: AllowDuplicate unmarshal option

Question

emcfarlane opened this issue 8 months ago

Is your feature request related to a problem? Please describe.
Mapping proto to an existing JSON API whose top-level object contains a duplicate field fails with the error:

"proto: (line 17562:5): duplicate field \"num_found\""

This is definitely a bug in the API (which I suspect is due to a dev not wanting to scroll to the top to see how many items there are in the list 🙃), but the response can't currently be ingested by protojson.

Describe the solution you'd like
Opposite of golang/go#48298 for protojson.

Describe alternatives you've considered
Considered using encoding/json to decode into map[string]json.RawMessage, re-encoding, and then decoding with protojson. This doesn't solve the issue when the field name conflicts with its JSON name, i.e. numFound and num_found.
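For reference, the round trip described above could look roughly like the following. This is a minimal sketch using only encoding/json; the helper name collapseDuplicates is made up for illustration, and the final protojson.Unmarshal call is only mentioned in a comment so the snippet stays standard-library only. It relies on encoding/json keeping the last value it sees for a repeated map key.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// collapseDuplicates (hypothetical helper) round-trips a JSON object through
// map[string]json.RawMessage. encoding/json overwrites a map entry when a
// key repeats, so exact duplicate names collapse to the last value seen.
// Only the top-level object is affected; nested values are kept as raw bytes.
func collapseDuplicates(data []byte) ([]byte, error) {
	var fields map[string]json.RawMessage
	if err := json.Unmarshal(data, &fields); err != nil {
		return nil, err
	}
	return json.Marshal(fields)
}

func main() {
	// Exact duplicate: the second num_found replaces the first.
	out, err := collapseDuplicates([]byte(`{"num_found": 1, "docs": [], "num_found": 2}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // {"docs":[],"num_found":2}

	// Different spellings of the same proto field both survive, so a later
	// protojson.Unmarshal (not imported here) would still see a conflict.
	out, err = collapseDuplicates([]byte(`{"numFound": 1, "num_found": 2}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

The second case is the limitation mentioned above: the map is keyed by the literal JSON name, so numFound and num_found are treated as distinct keys.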

Joe Tsai · Answer 1 · Mon Oct 02 2023 08:19:48 GMT+0800 (China Standard Time)

I can't speak on behalf of the protobuf owners today, but I can imagine the following future.

There's a prototype for a new JSON implementation that we'll be more widely discussing in the near future on the golang/go repository. There's a decent chance that it (or something like it) may be adopted as the "encoding/json/v2" package.

If that future arrives, I can imagine that the protojson package could add the following functions:

func MarshalEncode(e *jsontext.Encoder, m proto.Message, o json.Option) error
func UnmarshalDecode(d *jsontext.Decoder, m proto.Message, o json.Option) error

where the existing Marshal and Unmarshal functions are implemented in terms of MarshalEncode and UnmarshalDecode.

These functions could now respect the jsontext.AllowDuplicateNames option (or any other jsontext option) as the JSON serialization is essentially delegated to the jsontext package.

As a fun historical note: The protojson package does not use the encoding/json package at all since it targets a stricter standard of JSON (RFC 7493). Consequently, it implements its own JSON tokenizer. The v2 JSON prototype can be thought of as a spiritual successor to the tokenizer that's implemented internally in the protobuf module. In that possible future, we should be able to entirely delete google.golang.org/protobuf/internal/encoding/json in favor of encoding/json/jsontext, which would be better supported, much more performant, and RFC 7493 compliant by default.

lfolger · Answer 2 · Wed Oct 04 2023 19:55:21 GMT+0800 (China Standard Time)

It seems the C++ implementation doesn't support such an option right now (https://protobuf.dev/reference/cpp/api-docs/google.protobuf.util.json_util/#JsonParseOptions).

We would only consider supporting this if it were supported by all other major languages. I therefore suggest filing a request to support this across implementations and add it to the proto spec.

If such a request is accepted we can revisit this issue.

Sorry if this is not the answer you were hoping for, but language inconsistencies have caused trouble in the past, and we are careful about adding features that are not enforced by the spec.

Edward McFarlane · Answer 3 · Wed Oct 04 2023 19:59:15 GMT+0800 (China Standard Time)

Thanks both. Happy to close. I think the default should be to error, and I understand not wanting to add an option for it. I will fix my original issue another way.

Edward McFarlane · Answer 4 · Wed Oct 04 2023 19:59:53 GMT+0800 (China Standard Time)

@lfolger does the C++ implementation check for duplicate fields?

lfolger · Answer 5 · Wed Oct 04 2023 21:01:49 GMT+0800 (China Standard Time)

I think it does but I didn't double check, so don't take my word for it.

I didn't have time to dig through the implementation or write a test.

Edward McFarlane · Answer 6 · Thu Oct 05 2023 02:13:46 GMT+0800 (China Standard Time)

It does look like C++ explicitly accepts multiple keys, with the last one winning: https://github.com/protocolbuffers/protobuf/blob/17e06c108dc8a3f43ea0f909999d74d7166f9733/src/google/protobuf/json/json_test.cc#L590-L596
This is consistent with proto encoding. Maybe protojson should do the same for consistency?

Joe Tsai · Answer 7 · Thu Oct 05 2023 02:26:44 GMT+0800 (China Standard Time)

When I implemented protojson, I did so by following a Google-internal "specification" that called for compliance with RFC 7493, which clearly rejects duplicate names. I strongly recommend against changing the default behavior of duplicate name handling, as allowing duplicates can result in privilege escalation attacks (e.g., CVE-2017-12635), especially in an ecosystem like protobuf where you have multiple different JSON implementations.
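To make the strict behavior concrete, here is a minimal sketch of rejecting duplicate names at any nesting depth by walking the token stream from encoding/json. It is not how protojson does it (protojson has its own internal tokenizer); the function names checkDuplicateNames and checkValue are invented for this example, and the error text only imitates the message quoted in the issue.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// checkValue consumes exactly one JSON value from dec and returns an error
// if any object inside that value repeats a member name.
func checkValue(dec *json.Decoder) error {
	tok, err := dec.Token()
	if err != nil {
		return err
	}
	delim, ok := tok.(json.Delim)
	if !ok {
		return nil // scalar value: nothing to check
	}
	switch delim {
	case '{':
		seen := map[string]bool{}
		for dec.More() {
			keyTok, err := dec.Token()
			if err != nil {
				return err
			}
			key, ok := keyTok.(string)
			if !ok {
				return fmt.Errorf("unexpected object key token %v", keyTok)
			}
			if seen[key] {
				return fmt.Errorf("duplicate field %q", key)
			}
			seen[key] = true
			if err := checkValue(dec); err != nil {
				return err
			}
		}
	case '[':
		for dec.More() {
			if err := checkValue(dec); err != nil {
				return err
			}
		}
	default:
		return fmt.Errorf("unexpected delimiter %v", delim)
	}
	_, err = dec.Token() // consume the closing '}' or ']'
	return err
}

// checkDuplicateNames reports the first repeated member name found in data.
func checkDuplicateNames(data []byte) error {
	return checkValue(json.NewDecoder(bytes.NewReader(data)))
}

func main() {
	fmt.Println(checkDuplicateNames([]byte(`{"num_found": 1, "docs": [], "num_found": 2}`)))
	fmt.Println(checkDuplicateNames([]byte(`{"a": 1, "b": {"a": 2}}`)))
}
```

Names are compared after string decoding and each object gets its own set, so the same name may appear in sibling or nested objects without being flagged.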

> This is consistent with proto encoding. Maybe protojson should do the same for consistency?

For proto encoding, there is a well-specified semantic for how "duplicate" names work. This is not the case for JSON (the interchange format itself rather than protobuf's use of it), which leaves it as undefined behavior. Technically, JSON for protobuf specifically could define a semantic for duplicate names, but that's a poor precedent to set since many JSON-only libraries do not faithfully expose this.

That said, it is an argument for potentially allowing you to configure the behavior, and I think my proposal above would be the cleanest way to expose it (but will take a while before the API appears).

Edward McFarlane · Answer 8 · Thu Oct 05 2023 02:34:48 GMT+0800 (China Standard Time)

Excited for the new JSON lib, and thanks for the clarification. Agree the best solution is for the new APIs to be available.