rurban / Cpanel-JSON-XS

Improved fork of JSON-XS

Home Page: http://search.cpan.org/dist/Cpanel-JSON-XS/

When utf8 is off, reject \uXXXX

FGasper opened this issue:

When utf8() mode is off this module will happily produce things like this:

> perl -MJSON -MData::Dumper -e'print Dumper( JSON->new()->decode(q<["é\u0100"]>) )'
$VAR1 = [
          "\x{c3}\x{a9}\x{100}"
        ];

This will (almost) always cause problems in applications, as it’s a “hybrid” text/binary string: the two UTF-8 bytes of é, followed by the literal character U+0100.
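
To see why the mixed string is harmful, here is a small illustration (not from the original report; the variable name is made up): re-encoding the decoded value to UTF-8 double-encodes the é bytes while encoding U+0100 once, so a round trip no longer reproduces the input.

use Encode;

# The hybrid value from the dump above: the two UTF-8 bytes of é,
# followed by the single character U+0100.
my $hybrid = "\x{c3}\x{a9}\x{100}";

# Re-encoding treats \x{c3} and \x{a9} as two separate characters, so é
# comes out double-encoded: C3 83 C2 A9 C4 80 instead of C3 A9 C4 80.
printf "%v02X\n", Encode::encode('UTF-8', $hybrid);   # C3.83.C2.A9.C4.80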

PROPOSAL: There should at least be a warning, if not an exception, when this happens.

This proposal would invalidate the following in t/14_latin1.t:

is($xs->decode ("\"\\u0012\x{89}\\u0abc\""), "\x{12}\x{89}\x{abc}");
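
If the change below is accepted, that assertion would presumably be replaced by one expecting the new error; a rough sketch using Test::More’s like(), assuming the error text from the patch below:

eval { $xs->decode ("\"\\u0012\x{89}\\u0abc\"") };
like ($@, qr/illegal >255 \\u escape with utf8 disabled/,
      'wide \u escape rejected when utf8 is off');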

This patch seems to do the trick:

diff --git a/XS.xs b/XS.xs
index 3fb3b16..9cea426 100644
--- a/XS.xs
+++ b/XS.xs
@@ -3392,6 +3392,11 @@ _decode_str (pTHX_ dec_t *dec, char endstr)
                       dec_cur = dec->cur;
                       if (hi == (UV)-1)
                         goto fail;
+
+                      if (!(dec->json.flags & F_UTF8) && hi > 0xff) {
+                        ERR ("illegal >255 \\u escape with utf8 disabled");
+                      }
+
                      if (dec->json.flags & F_BINARY)
                         ERR ("illegal unicode character in binary string");

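With the patch applied, the expectation is that the original one-liner dies with the new error rather than returning the hybrid string (a sketch; the trailing part of the error message, with the character offset, is elided):

> perl -MCpanel::JSON::XS -e'Cpanel::JSON::XS->new->decode(q<["é\u0100"]>)'
illegal >255 \u escape with utf8 disabled, ...
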
If this is acceptable I’ll update the tests and submit a PR.

Well, looks good to me if it doesn't break any tests. But that's only the decoding part; encode would be more important, imho: "Be liberal in what you accept, and strict in what you write."
So maybe a warning would be better.

It definitely breaks tests, but the tests are testing for a nonsensical result: there’s nothing useful to do with a string that’s partly-bytes, partly-characters. I’m happy to update/remove as appropriate.

This won’t affect encoding. encode() without utf8 mimics JavaScript’s JSON API, which has its uses. For example, maybe you want to assemble the JSON string as part of some larger character string, then UTF-8-encode that larger string.
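
For concreteness, a sketch of that embedding pattern (the surrounding markup and names here are invented): keep the JSON as a character string, splice it into the larger string, and UTF-8-encode the whole thing once at the end.

use JSON;
use Encode;

# encode() without utf8 enabled returns a character string,
# much like JavaScript's JSON.stringify().
my $json  = JSON->new->encode({ name => "Andr\x{e9}" });

# Embed it in a larger character string ...
my $page  = "<meta charset='utf-8'><script>var data = $json;</script>";

# ... then encode the whole document to UTF-8 in one place.
my $bytes = Encode::encode('UTF-8', $page);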

I'm not sure this is the right thing to do. Decoding without utf8 after the string has been decoded from UTF-8 is perfectly reasonable - such as is done by Mojo::Message::Response->json, since the body is decoded from bytes before passing to the decoder. The failure mode where you end up with mixed encoding is decoding from JSON before decoding from UTF-8.
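
A sketch of that legitimate flow (data and variable names are illustrative, not taken from Mojolicious): the transport layer decodes the UTF-8 octets first, so the JSON decoder only ever sees a character string and utf8 mode is correctly off.

use JSON;
use Encode;

# Raw HTTP body: UTF-8 octets on the wire.
my $body_bytes = qq{["\xc3\xa9\\u0100"]};

# Step 1: decode the transport encoding, as Mojo does before passing to the decoder.
my $body_chars = Encode::decode('UTF-8', $body_bytes);

# Step 2: decode JSON with utf8 off; the result is a consistent
# character string, "\x{e9}\x{100}".
my $data = JSON->new->decode($body_chars);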

Ah true, the decode-of-character-string case is legit.