golang / protobuf

Go support for Google's protocol buffers

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

JSON Unmarshal of some Unicode escape sequence results in error

denis-zyk opened this issue · comments

What version of protobuf and what language are you using?

protoc --version
libprotoc 23.4

protoc-gen-go --version
protoc-gen-go v1.31.0

Golang lib:

google.golang.org/protobuf v1.30.0

What did you do?
Steps to reproduce the behavior:

  1. Have a JSON string containing the following Unicode escape sequence \udca0, e.g. {"name":"'unsafe-eval'; object?src\udca0'none'"}.
  2. Unmarshal byte slice of JSON string in question to protobuf message using google.golang.org/protobuf/encoding/protojson package.
  3. Got error proto: syntax error (line 1:9): invalid escape code "'none'" in string.

Code snippet for reproducing issue:

package main

import (
	"encoding/json"
	"fmt"
	"google.golang.org/protobuf/encoding/protojson"
	
	"pkg.my.package/microservices/models"
)

func main() {
	j1 := []byte(`{"name":"'unsafe-eval'; object?src\uc2a0'none'"}`)	# unmarshal is OK
	j2 := []byte(`{"name":"'unsafe-eval'; object?src\udca0'none'"}`)	# unmarshal error in protojson

	// protojson lib

	// models.Client is a golang struct compiled using `protoc` binary, from Client protobuf message.
	// Protobuf message for the sake of reproducing has 1 string field:
	//   Name       string       `protobuf:"bytes,1,opt,name=name,proto3" json:"name"`
	outProto1 := &models.Client{}
	outProto2 := &models.Client{}

	unmarshaler := protojson.UnmarshalOptions{DiscardUnknown: true}

	err1 := unmarshaler.Unmarshal(j1, outProto1)
	err2 := unmarshaler.Unmarshal(j2, outProto2)

	fmt.Println("protojson lib")
	fmt.Println(string(j1), err1, outProto1)
	fmt.Println(string(j2), err2, outProto2)

	// core lib
	outCore1 := make(map[string]interface{})
	outCore2 := make(map[string]interface{})

	err1 = json.Unmarshal(j1, &outCore1)
	err2 = json.Unmarshal(j2, &outCore2)

	fmt.Println("core lib")
	fmt.Println(string(j1), err1, outCore1)
	fmt.Println(string(j2), err2, outCore2)
}

The example code snippet above produces the following output:

protojson lib
{"name":"'unsafe-eval'; object?src\uc2a0'none'"} <nil> name:"'unsafe-eval'; object?src슠'none'"
{"name":"'unsafe-eval'; object?src\udca0'none'"} proto: syntax error (line 1:9): invalid escape code "'none'" in string 
core lib
{"name":"'unsafe-eval'; object?src\uc2a0'none'"} <nil> map[name:'unsafe-eval'; object?src슠'none']
{"name":"'unsafe-eval'; object?src\udca0'none'"} <nil> map[name:'unsafe-eval'; object?src�'none']

What did you expect to see?
Unmarshalled value name:"'unsafe-eval'; object?src�'none'". I.e. same unmarshalled string value as in Golang core JSON lib.

What did you see instead?
Unpopulated message and an error proto: syntax error (line 1:9): invalid escape code "'none'" in string.

Anything else we should know about your project / environment?
Reproduced & confirmed using the followed golang versions and environments:

Linux: go version go1.20.11 linux/amd64
macOS: go version go1.21.4 darwin/arm64

U+DCA0 is a Unicode surrogate, and should not appear in valid UTF-8 text. A conformant UTF-8 decoder is required to reject input containing an unpaired surrogate; see "How do I convert an unpaired UTF-16 surrogate to UTF-8?" here: https://unicode.org/faq/utf_bom.html

I'm not sure about the history that led to encoding/json accepting this invalid input and protojson rejecting it, but I don't think protojson is doing the wrong thing here.

I think this is still generating an improper error message. It seems to be complaining that the invalid escape code is 'none' (which is not at all being used as an escape code), when it should probably be reporting the invalid escape code to be dca0, right?

Yes, you're right; I wasn't looking at the error message. The error is definitely wrong, or at least confusing.

Also, the protojson parser expects a surrogate to be followed by \uXXXX. I'm not certain if we should be expecting the potential for input like \udca0 (the surrogate) followed by a single unescaped character with no \u prefix.

Yeah… not sure what the proper behavior should be in error conditions. But also, do we check at all that a Hi Surrogate is followed by a Lo Surrogate, and vice versa, or do we just check that surrogates occur in pairs?

I smell the familiar scent of a rabbit role into a pedantic reading of standards… 😩 At least the minimal solution here seems to be that if a surrogate isn’t followed by a surrogate, we report the escape itself as invalid? Or some sort of message that surrogates must be paired?

The JSON mapping for protobuf follows RFC 7493, which requires strict checking of surrogate halves (see section 2.1 of RFC 7493). The "encoding/json" package only adheres to RFC 8259, which leaves split surrogate halves as undefined behavior (see section 8.2 of RFC 8259). As a side note, the "encoding/json/v2" draft proposal will target compliance with RFC 7293 by default.

I'm not certain if we should be expecting the potential for input like \udca0 (the surrogate) followed by a single unescaped character with no \u prefix.

A standalone surrogate half is invalid and should be rejected. The validity of a surrogate pair is determine using utf16.DecodeRune.

I don't see anything wrong with what protojson is doing. It correctly identified a surrogate half and was now expecting another escaped surrogate half. The error message says:

proto: syntax error (line 1:9): invalid escape code "'none'" in string

Strictly speaking, this error message is correct as 'none' is literally an invalid escape code.

The proposed future "encoding/json/jsontext" package produces a better error message:

jsontext: invalid surrogate pair `\udca0'none'` within string

as it preserves the context that this is occurring within a surrogate pair.

Interesting, I see what you mean with the “'none' is an invalid escape code” message. It also just coincidentally has the same length one would expect a \uXXXX escape sequence to be.

I do like the error message from jsontext which makes it clear exactly what is actually going wrong.

Improving the error message seems reasonasble to use and we are happy to accept a contribution.