JSON Unmarshal of some Unicode escape sequence results in error

Question

JSON Unmarshal of some Unicode escape sequence results in error

denis-zyk opened this issue 6 months ago · comments

What version of protobuf and what language are you using?

protoc --version
libprotoc 23.4

protoc-gen-go --version
protoc-gen-go v1.31.0

Golang lib:

google.golang.org/protobuf v1.30.0

What did you do?
Steps to reproduce the behavior:

Have a JSON string containing the following Unicode escape sequence \udca0, e.g. {"name":"'unsafe-eval'; object?src\udca0'none'"}.
Unmarshal byte slice of JSON string in question to protobuf message using google.golang.org/protobuf/encoding/protojson package.
Got error proto: syntax error (line 1:9): invalid escape code "'none'" in string.

Code snippet for reproducing issue:

package main

import (
	"encoding/json"
	"fmt"
	"google.golang.org/protobuf/encoding/protojson"
	
	"pkg.my.package/microservices/models"
)

func main() {
	j1 := []byte(`{"name":"'unsafe-eval'; object?src\uc2a0'none'"}`)	# unmarshal is OK
	j2 := []byte(`{"name":"'unsafe-eval'; object?src\udca0'none'"}`)	# unmarshal error in protojson

	// protojson lib

	// models.Client is a golang struct compiled using `protoc` binary, from Client protobuf message.
	// Protobuf message for the sake of reproducing has 1 string field:
	//   Name       string       `protobuf:"bytes,1,opt,name=name,proto3" json:"name"`
	outProto1 := &models.Client{}
	outProto2 := &models.Client{}

	unmarshaler := protojson.UnmarshalOptions{DiscardUnknown: true}

	err1 := unmarshaler.Unmarshal(j1, outProto1)
	err2 := unmarshaler.Unmarshal(j2, outProto2)

	fmt.Println("protojson lib")
	fmt.Println(string(j1), err1, outProto1)
	fmt.Println(string(j2), err2, outProto2)

	// core lib
	outCore1 := make(map[string]interface{})
	outCore2 := make(map[string]interface{})

	err1 = json.Unmarshal(j1, &outCore1)
	err2 = json.Unmarshal(j2, &outCore2)

	fmt.Println("core lib")
	fmt.Println(string(j1), err1, outCore1)
	fmt.Println(string(j2), err2, outCore2)
}

The example code snippet above produces the following output:

protojson lib
{"name":"'unsafe-eval'; object?src\uc2a0'none'"} <nil> name:"'unsafe-eval'; object?src슠'none'"
{"name":"'unsafe-eval'; object?src\udca0'none'"} proto: syntax error (line 1:9): invalid escape code "'none'" in string 
core lib
{"name":"'unsafe-eval'; object?src\uc2a0'none'"} <nil> map[name:'unsafe-eval'; object?src슠'none']
{"name":"'unsafe-eval'; object?src\udca0'none'"} <nil> map[name:'unsafe-eval'; object?src�'none']

What did you expect to see?
Unmarshalled value name:"'unsafe-eval'; object?src�'none'". I.e. same unmarshalled string value as in Golang core JSON lib.

What did you see instead?
Unpopulated message and an error proto: syntax error (line 1:9): invalid escape code "'none'" in string.

Anything else we should know about your project / environment?
Reproduced & confirmed using the followed golang versions and environments:

Linux: go version go1.20.11 linux/amd64
macOS: go version go1.21.4 darwin/arm64

Damien Neil · Answer 1 · Tue Nov 21 2023 00:22:49 GMT+0800 (China Standard Time)

U+DCA0 is a Unicode surrogate, and should not appear in valid UTF-8 text. A conformant UTF-8 decoder is required to reject input containing an unpaired surrogate; see "How do I convert an unpaired UTF-16 surrogate to UTF-8?" here: https://unicode.org/faq/utf_bom.html

I'm not sure about the history that led to encoding/json accepting this invalid input and protojson rejecting it, but I don't think protojson is doing the wrong thing here.

Cassondra Foesch · Answer 2 · Tue Nov 21 2023 01:41:49 GMT+0800 (China Standard Time)

I think this is still generating an improper error message. It seems to be complaining that the invalid escape code is 'none' (which is not at all being used as an escape code), when it should probably be reporting the invalid escape code to be dca0, right?

Damien Neil · Answer 3 · Tue Nov 21 2023 01:50:28 GMT+0800 (China Standard Time)

Yes, you're right; I wasn't looking at the error message. The error is definitely wrong, or at least confusing.

Also, the protojson parser expects a surrogate to be followed by \uXXXX. I'm not certain if we should be expecting the potential for input like \udca0 (the surrogate) followed by a single unescaped character with no \u prefix.

Cassondra Foesch · Answer 4 · Tue Nov 21 2023 02:40:01 GMT+0800 (China Standard Time)

Yeah… not sure what the proper behavior should be in error conditions. But also, do we check at all that a Hi Surrogate is followed by a Lo Surrogate, and vice versa, or do we just check that surrogates occur in pairs?

I smell the familiar scent of a rabbit role into a pedantic reading of standards… 😩 At least the minimal solution here seems to be that if a surrogate isn’t followed by a surrogate, we report the escape itself as invalid? Or some sort of message that surrogates must be paired?

Joe Tsai · Answer 5 · Tue Nov 21 2023 03:28:45 GMT+0800 (China Standard Time)

The JSON mapping for protobuf follows RFC 7493, which requires strict checking of surrogate halves (see section 2.1 of RFC 7493). The "encoding/json" package only adheres to RFC 8259, which leaves split surrogate halves as undefined behavior (see section 8.2 of RFC 8259). As a side note, the "encoding/json/v2" draft proposal will target compliance with RFC 7293 by default.

I'm not certain if we should be expecting the potential for input like \udca0 (the surrogate) followed by a single unescaped character with no \u prefix.

A standalone surrogate half is invalid and should be rejected. The validity of a surrogate pair is determine using utf16.DecodeRune.

I don't see anything wrong with what protojson is doing. It correctly identified a surrogate half and was now expecting another escaped surrogate half. The error message says:

proto: syntax error (line 1:9): invalid escape code "'none'" in string

Strictly speaking, this error message is correct as 'none' is literally an invalid escape code.

Joe Tsai · Answer 6 · Tue Nov 21 2023 03:31:34 GMT+0800 (China Standard Time)

The proposed future "encoding/json/jsontext" package produces a better error message:

jsontext: invalid surrogate pair `\udca0'none'` within string

as it preserves the context that this is occurring within a surrogate pair.

Cassondra Foesch · Answer 7 · Tue Nov 21 2023 09:09:19 GMT+0800 (China Standard Time)

Interesting, I see what you mean with the “'none' is an invalid escape code” message. It also just coincidentally has the same length one would expect a \uXXXX escape sequence to be.

I do like the error message from jsontext which makes it clear exactly what is actually going wrong.

lfolger · Answer 8 · Tue Nov 21 2023 20:47:23 GMT+0800 (China Standard Time)

Improving the error message seems reasonasble to use and we are happy to accept a contribution.