rurban / Cpanel-JSON-XS

Improved fork of JSON-XS

Home Page: http://search.cpan.org/dist/Cpanel-JSON-XS/

When utf8 is off, reject \uXXXX

FGasper opened this issue:

When utf8() mode is off this module will happily produce things like this:

> perl -MJSON -MData::Dumper -e'print Dumper( JSON->new()->decode(q<["é\u0100"]>) )'
$VAR1 = [
          "\x{c3}\x{a9}\x{100}"
        ];

This will (almost) always cause problems in applications, as it’s a “hybrid” text/binary string: the two UTF-8 bytes of é, followed by the literal character U+0100.
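
To see why the mixed string is harmful, here is a small illustration (not from the original report; the variable name is made up): re-encoding the decoded value to UTF-8 double-encodes the é bytes while encoding U+0100 once, so a round trip no longer reproduces the input.

use Encode;

# The hybrid value from the dump above: the two UTF-8 bytes of é,
# followed by the single character U+0100.
my $hybrid = "\x{c3}\x{a9}\x{100}";

# Re-encoding treats \x{c3} and \x{a9} as two separate characters, so é
# comes out double-encoded: C3 83 C2 A9 C4 80 instead of C3 A9 C4 80.
printf "%v02X\n", Encode::encode('UTF-8', $hybrid);   # C3.83.C2.A9.C4.80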

PROPOSAL: There should at least be a warning, if not an exception, when this happens.

This proposal would invalidate the following in t/14_latin1.t:

is($xs->decode ("\"\\u0012\x{89}\\u0abc\""), "\x{12}\x{89}\x{abc}");
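
If the change below is accepted, that assertion would presumably be replaced by one expecting the new error; a rough sketch using Test::More’s like(), assuming the error text from the patch below:

eval { $xs->decode ("\"\\u0012\x{89}\\u0abc\"") };
like ($@, qr/illegal >255 \\u escape with utf8 disabled/,
      'wide \u escape rejected when utf8 is off');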

This patch seems to do the trick:

diff --git a/XS.xs b/XS.xs
index 3fb3b16..9cea426 100644
--- a/XS.xs
+++ b/XS.xs
@@ -3392,6 +3392,11 @@ _decode_str (pTHX_ dec_t *dec, char endstr)
                       dec_cur = dec->cur;
                       if (hi == (UV)-1)
                         goto fail;
+
+                      if (!(dec->json.flags & F_UTF8) && hi > 0xff) {
+                        ERR ("illegal >255 \\u escape with utf8 disabled");
+                      }
+
                      if (dec->json.flags & F_BINARY)
                         ERR ("illegal unicode character in binary string");

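With the patch applied, the expectation is that the original one-liner dies with the new error rather than returning the hybrid string (a sketch; the trailing part of the error message, with the character offset, is elided):

> perl -MCpanel::JSON::XS -e'Cpanel::JSON::XS->new->decode(q<["é\u0100"]>)'
illegal >255 \u escape with utf8 disabled, ...
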
If this is acceptable I’ll update the tests and submit a PR.

Well, looks good to me if it doesn't break any tests. But that's only the decoding part; encode would be more important, imho: "Be liberal in what you accept, and strict in what you write."
So maybe a warning would be better.

It definitely breaks tests, but the tests are testing for a nonsensical result: there’s nothing useful to do with a string that’s partly-bytes, partly-characters. I’m happy to update/remove as appropriate.

This won’t affect encoding. encode() without utf8 mimics JavaScript’s JSON API, which has its uses. For example, maybe you want to assemble the JSON string as part of some larger character string, then UTF-8-encode that larger string.
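
For concreteness, a sketch of that embedding pattern (the surrounding markup and names here are invented): keep the JSON as a character string, splice it into the larger string, and UTF-8-encode the whole thing once at the end.

use JSON;
use Encode;

# encode() without utf8 enabled returns a character string,
# much like JavaScript's JSON.stringify().
my $json  = JSON->new->encode({ name => "Andr\x{e9}" });

# Embed it in a larger character string ...
my $page  = "<meta charset='utf-8'><script>var data = $json;</script>";

# ... then encode the whole document to UTF-8 in one place.
my $bytes = Encode::encode('UTF-8', $page);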

I'm not sure this is the right thing to do. Decoding without utf8 after the string has been decoded from UTF-8 is perfectly reasonable - such as is done by Mojo::Message::Response->json, since the body is decoded from bytes before passing to the decoder. The failure mode where you end up with mixed encoding is decoding from JSON before decoding from UTF-8.
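
A sketch of that legitimate flow (data and variable names are illustrative, not taken from Mojolicious): the transport layer decodes the UTF-8 octets first, so the JSON decoder only ever sees a character string and utf8 mode is correctly off.

use JSON;
use Encode;

# Raw HTTP body: UTF-8 octets on the wire.
my $body_bytes = qq{["\xc3\xa9\\u0100"]};

# Step 1: decode the transport encoding, as Mojo does before passing to the decoder.
my $body_chars = Encode::decode('UTF-8', $body_bytes);

# Step 2: decode JSON with utf8 off; the result is a consistent
# character string, "\x{e9}\x{100}".
my $data = JSON->new->decode($body_chars);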

Ah true, the decode-of-character-string case is legit.