Invalid Unicode Escapes
cyanskies opened this issue
The following tests contain invalid Unicode escapes:
- tests/valid/key/quoted-unicode.json
- tests/valid/string/quoted-unicode.json
Both tests contain strings with this sequence: "\ud800\udc00 \udbff\udfff"
All four of these escapes are surrogate code points (U+D800 to U+DFFF), which fall outside the range of Unicode scalar values. I suspect they're supposed to be \UXXXXXXXX style escapes that have been generated incorrectly.
Are you sure that's invalid? I think that's just how JSON works because it's always in UTF-16 or something, but I'd have to read the spec to be sure.
e.g. \U0010ffff in TOML is 0xdb 0xff 0xdf 0xff in UTF-16 BE, and that fits with the \udbff\udfff in the JSON.
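A quick way to check that claim is a sketch like the one below. It uses Python's standard json module purely for illustration; it is not part of toml-test or of either parser discussed here, and the variable names are made up for the example.

```python
import json

# U+10FFFF, i.e. what the TOML escape \U0010ffff denotes: one code point.
ch = "\U0010ffff"

# Encode it as UTF-16 big-endian; the result is a surrogate pair.
utf16_be = ch.encode("utf-16-be")
print(utf16_be.hex(" "))  # db ff df ff

# A JSON parser that follows RFC 8259 joins the two \u escapes
# back into that single code point.
decoded = json.loads('"\\udbff\\udfff"')
print(decoded == ch)  # True
```

So the pair in the JSON file and the single \U escape in the TOML file name the same character.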
Which language/JSON parser are you using for this? Most languages seem to work fine with these escapes, but I'm always open to changing something if it improves compatibility.
I guess it's this one: cyanskies/another-toml-cpp#11
Maybe I'm missing something, but I don't see a way to run the tests?
I run the tests using this repo https://github.com/cyanskies/another-toml-test
It builds encoder and decoder executables that I test using the precompiled toml-test executable.
I'm using an in-tree copy of SimpleJSON.
I was assuming that JSON was in UTF-8 and passing the string across directly, so it might be my mistake then.
From https://datatracker.ietf.org/doc/html/rfc8259 :
To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a 12-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".
So it seems the behaviour is correct.
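For a decoder that handles \uXXXX escapes by hand, the pair has to be combined into one code point before it is written out as UTF-8. A minimal sketch of that arithmetic, assuming the two 16-bit values have already been parsed from the escapes (the function name combine_surrogates is hypothetical, not taken from SimpleJSON or any other library mentioned here):

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one Unicode code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The G clef example from RFC 8259: "\uD834\uDD1E" is U+1D11E.
g_clef = combine_surrogates(0xD834, 0xDD1E)
print(hex(g_clef))                          # 0x1d11e
print(chr(g_clef).encode("utf-8").hex(" "))  # f0 9d 84 9e

# The two pairs from the toml-test files.
print(hex(combine_surrogates(0xD800, 0xDC00)))  # 0x10000
print(hex(combine_surrogates(0xDBFF, 0xDFFF)))  # 0x10ffff
```

A lone surrogate (a high one with no low one after it, or the reverse) has no valid UTF-8 encoding, which is why this sketch rejects it rather than passing it through.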
That SimpleJSON hasn't been updated since 2016. Maybe it's bugged? Your project doesn't compile for me.
toml++ uses https://github.com/nlohmann/json: https://github.com/marzer/tomlplusplus/tree/master/vendor
toml11 has something they wrote themselves: https://github.com/ToruNiina/toml11/blob/master/tests/check_toml_test.cpp