Critical issue with Unicode parsing

ferd opened this issue 13 years ago

This issue is derived from a bug we have in socket.io-erlang, which turns out to be a problem with misultin.

Someone reported to us that strings would fail when containing complex unicode characters and sent us a sample. The user would input a string like "~m~7~m~Привет!" and have crashes happening all the time.

Looking at the stack trace he sent us, I saw the following data in the request body: <<"data=%7Em%7E7%7Em%7E%D0%9F%D1%80%D0%B8%D0%B2%D0%B5%D1%82%21">>. Decoding it with misultin, it returns:

3> misultin_utility:parse_qs(S).
[{"data",
   [126,109,126,55,126,109,126,208,159,209,128,208,184,208,178,208,181,209,130,33]}]
4> io:format("~ts~n",[[126,109,126,55,126,109,126,208,159,209,128,208,184,208,178,208,181,209,130,33]]).              
~m~7~m~Ð�Ñ�Ð¸Ð²ÐµÑ�!
ok
5> io:format("~ts~n", [list_to_binary([126,109,126,55,126,109,126,208,159,209,128,208,184,208,178,208,181,209,130,33])]).
~m~7~m~Привет!
ok

The issue, as far as I can see, is that the unicode is parsed as a binary, which turns all code points into bytes (0..255). Then the binaries are blindly turned into lists, but lists don't use the same unicode format in Erlang -- you actually need to convert the bytes to code points, which can be greater than 255, and then output them with a ~ts combination instead of just ~s.

The problem is that you're likely using binary_to_list to convert strings, but when you have unicode data with grapheme clusters, what you need is something more like this:

6> io:format("~ts~n",[unicode:characters_to_list(<<126,109,126,55,126,109,126,208,159,209,128,208,184,208,178,208,181,209,130,33>>)]).
~m~7~m~Привет!
ok

This use of unicode:characters_to_list(Str) will work with IO lists and convert to whatever format you use. Note that the strings returned might now be invalid IO lists and will need to be converted back with unicode:characters_to_binary/1 before being pushed back into a socket.
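A minimal sketch of the two conversions and the round trip back to a binary, using only binary_to_list/1 and the standard unicode module. The module and function names (utf8_demo, naive/1, decoded/1, to_wire/1) are hypothetical and are not part of misultin:

```erlang
-module(utf8_demo).
-export([naive/1, decoded/1, to_wire/1]).

%% Byte-for-byte conversion: each UTF-8 byte becomes one list element,
%% so a two-byte character ends up as two integers in 0..255.
naive(Bin) when is_binary(Bin) ->
    binary_to_list(Bin).

%% Decode the binary as UTF-8: each list element is one code point.
%% unicode:characters_to_list/1 returns an {error, ...} or
%% {incomplete, ...} tuple on bad input; this sketch crashes in that case.
decoded(Bin) when is_binary(Bin) ->
    Chars = unicode:characters_to_list(Bin),
    true = is_list(Chars),
    Chars.

%% Encode a list of code points back to a UTF-8 binary before it is
%% pushed into a socket.
to_wire(Chars) when is_list(Chars) ->
    unicode:characters_to_binary(Chars).
```

With the 20-byte binary from the session above, naive/1 gives a list of 20 elements, decoded/1 gives 14, and to_wire(decoded(Bin)) gives back the original binary.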

Roberto Ostinelli · Answer 1 · Mon Sep 26 2011 19:14:37 GMT+0800 (China Standard Time)

hi.

misultin reads the body of a request as binary. when a misultin_utility:parse_qs(Body). is called, that is the moment when the quoted strings get converted to a list.

can you please provide a use case so we can check whether switching to unicode:characters_to_list/1 actually solves the issue?

your note on invalid io_list() is taken, obviously these are related and so will be investigated.

Fred Hebert · Answer 2 · Mon Sep 26 2011 19:23:08 GMT+0800 (China Standard Time)

See what I posted in my problem description. This unicode string with <<"data=%7Em%7E7%7Em%7E%D0%9F%D1%80%D0%B8%D0%B2%D0%B5%D1%82%21">> is precisely the content-body provoking a failure. If you want a full stacktrace, see yrashk/socket.io-erlang#56 (comment)

The path of the function seems to go from misultin_req:parse_post/1 to misultin_utility:parse_qs(Req#req.body) when we get an application/x-www-form-urlencoded form submitted. Note that by reading the code for multipart/form-data forms, I see that you also have a bunch of list_to_binary/1 calls within parse_multipart_form_data/2, which are also guaranteed to cause problems on some unicode input.

Roberto Ostinelli · Answer 3 · Mon Sep 26 2011 19:41:44 GMT+0800 (China Standard Time)

hi @ferd,

simply having a content-body like this does not provoke a failure in misultin.

<html>
<head>
    <meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
    <form action="http://localhost:8080/" method="post">
        <input type="hidden" name="data" value="~m~7~m~Привет!">
        <input type="submit" value="GO!">
    </form>
</body>
</html>

if you use this code against, for instance, https://github.com/ostinelli/misultin/blob/dev/examples/misultin_echo.erl you will have it correctly display:

<misultin_test>
    <method>POST</method>
    <param>
        <name>data</name>
        <value>~m~7~m~Привет!</value>
    </param>
</misultin_test>

obviously this works since nothing is done at developer side to actually do something with the unicode string, which simply gets passed back to the browser as is in binary form.

so what you are asking is actually to have correctly formed lists from incoming unicode strings, instead of mere lists, something which has until now been left for the developer to do.

am i correct?

Fred Hebert · Answer 4 · Mon Sep 26 2011 19:52:09 GMT+0800 (China Standard Time)

It works if you pass the string through without modifying it, yes, because the IO list will be a valid sequence of bytes, but not a Unicode string. The difference being that the list contains bytes that, when converted to a binary, give you some representation of Unicode.

If you want a quick correctness test (untested here), I would suggest using curl or wget (or any other raw HTTP tool) and sending in the following content-body: data=%7Em%7E7%7Em%7E%D0%9F%D1%80%D0%B8%D0%B2%D0%B5%D1%82%21. This will allow you to test with the same representation as we used.

As an operation, output the length of the resulting list you receive. We would expect to see the result as being '14' (~m~7~m~Привет!), but it will likely turn out to be larger than that once you print it. This won't break the code you output, but will help show how things can break. If you want to play with it more, split the list at character 12 and see if it splits into "~m~7~m~Приве" + "т!" or if things are done differently.

This will show that, yes, we can't have mere lists, but we need lists that convert sequences of bytes (the binary unicode representation) to unicode strings (lists with precise codepoints and grapheme clusters). It is so far left for the developer to do, but it's the developer's responsibility to reverse-engineer the issue (that you need to convert the list to binary, then back to unicode). Using native binaries (pushing it on the dev in an easier way) or converting to unicode strings in the first place would work, I think.
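The length and split checks described above can be sketched as follows. This is only an illustration: it starts from the already percent-decoded value of data rather than going through misultin, and the module name split_demo is hypothetical:

```erlang
-module(split_demo).
-export([run/0]).

run() ->
    %% Percent-decoded value of "data": ~m~7~m~ followed by six
    %% two-byte Cyrillic characters and "!".
    Value = <<126,109,126,55,126,109,126,208,159,209,128,208,184,
              208,178,208,181,209,130,33>>,
    Bytes = binary_to_list(Value),              % list of bytes
    Chars = unicode:characters_to_list(Value),  % list of code points
    %% Splitting the byte list at 12 cuts the third Cyrillic character
    %% in half: 208 stays in the head, 184 moves to the tail.
    ByteSplit = lists:split(12, Bytes),
    %% Splitting the code point list at 12 cuts between characters.
    CharSplit = lists:split(12, Chars),
    [{byte_length, length(Bytes)},
     {char_length, length(Chars)},
     {byte_split, ByteSplit},
     {char_split, CharSplit}].
```

The byte list has 20 elements and the code point list has 14. Only the second split ends with [1090,33], which is "т!".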

Roberto Ostinelli · Answer 5 · Mon Sep 26 2011 20:12:58 GMT+0800 (China Standard Time)

yes, this is because parse_qs/1 does return data in UTF-8 format, as per mochiweb implementation, where it has been pulled from: "the return value is a list of octets, and the octets are assumed to be probably UTF-8".

so what is returned is UTF-8 and you are asking for the unicode representation.

what you are asking is totally sensible, though i'm trying to understand what is the best option for this.

Fred Hebert · Answer 6 · Mon Sep 26 2011 20:17:37 GMT+0800 (China Standard Time)

The problem is 'octets are assumed to be probably UTF-8'. This works in binary because octets (or bytes) in a binary do represent unicode. However, when converting them to a list, they have to be turned into code points.

An easy example is the decomposed string 'é' (latin e + combining acute accent). The binary representation of such a string is <<101,204,129>>. The correct string representation, however, is [101,769]. The problem being that code points such as 769 cannot be represented as a single byte. So the string representation [101,204,129] will be equivalent to an entirely different Unicode sequence (also all valid characters). That's why you need careful conversion using the unicode module.
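A small sketch of what goes wrong when the byte list is later treated as a list of code points and encoded again, using the same 'e' + combining accent example. The module name combining_demo is hypothetical; the calls are from the standard unicode module:

```erlang
-module(combining_demo).
-export([run/0]).

run() ->
    Utf8 = <<101,204,129>>,   % "e" followed by U+0301, encoded as UTF-8

    %% Proper decoding gives two code points and round-trips cleanly.
    [101,769] = unicode:characters_to_list(Utf8),
    Utf8 = unicode:characters_to_binary([101,769]),

    %% binary_to_list/1 keeps the three bytes as three integers.
    Bytes = binary_to_list(Utf8),
    [101,204,129] = Bytes,

    %% Encoding that list as if it held code points treats 204 and 129
    %% as U+00CC and U+0081, so the result is a different, longer binary.
    <<101,195,140,194,129>> = unicode:characters_to_binary(Bytes),
    ok.
```

Every match in run/0 succeeds, so the function returns ok. The last match is the double-encoding case: three bytes in, five bytes out.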

Roberto Ostinelli · Answer 7 · Mon Sep 26 2011 20:22:34 GMT+0800 (China Standard Time)

how would you recommend deciding when misultin should convert to unicode and when not to?

Fred Hebert · Answer 8 · Mon Sep 26 2011 20:27:25 GMT+0800 (China Standard Time)

That's a very good question. In general, always converting to unicode will not break ASCII or latin-1 strings: they are usually the same for the 0..255 range and it can be done safely. Any time you convert to a list, you could do it that way.

Obviously, doing so will break the behaviour of people actually expecting a byte list out of their content -- they will want to keep the raw conversion. I figure the only truly safe way to do things will be by adding a function (or argument) that specifies that you expect unicode content from the body of the post and branch differently that way. Either that or raw binaries, which will make it simpler to just push the problem back to the dev, but would require the annoying task of duplicating your parsing code.

I'm not exactly sure what would be the nicest way there.

Roberto Ostinelli · Answer 9 · Mon Sep 26 2011 20:38:04 GMT+0800 (China Standard Time)

so you are suggesting, for instance, a function like Req:parse_post(unicode) which basically means adding a simple unicode:characters_to_list/1 in the path somewhere [easily done], so as to use the current parser and avoid having the developer build her own.

however the developer should also be aware that she needs to convert her unicode output back to io_list() at some point, prior to sending it to the socket.

would you think that adding the unicode option as stated here above would help solve this?

Fred Hebert · Answer 10 · Mon Sep 26 2011 20:42:16 GMT+0800 (China Standard Time)

It would help (the unicode:characters_to_list/1 needs to be called as a string conversion). If you specify that you only accept iolists, then users of binaries have nothing to change, and users of strings will have to convert their stuff back.

Note that you could also add a 'unicode' option to the 'send' operation you do, which would convert from unicode to a binary. All binaries are fair play for the output (and binaries are part of io lists, so this is safe). I'm not sure I'm especially clear here?
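One way the option discussed here could look, as a rough sketch only. It assumes the [{Key, Value}] byte-list shape that parse_qs/1 returns in the session at the top of this issue. The module qs_unicode and its functions are hypothetical and do not describe how misultin implements the option:

```erlang
-module(qs_unicode).
-export([convert/2, to_socket/1]).

%% Pairs: [{Key, Value}] where both are lists of bytes.
%% raw keeps the byte lists untouched for callers who expect bytes.
convert(Pairs, raw) ->
    Pairs;
%% unicode decodes each byte list as UTF-8 into a list of code points.
convert(Pairs, unicode) ->
    [{to_chars(K), to_chars(V)} || {K, V} <- Pairs].

to_chars(Bytes) ->
    case unicode:characters_to_list(list_to_binary(Bytes)) of
        Chars when is_list(Chars) -> Chars;
        %% {error, ...} or {incomplete, ...}: the bytes are not valid UTF-8.
        _ -> erlang:error({invalid_utf8, Bytes})
    end.

%% Output side: encode code points (or mixed chardata) as a UTF-8
%% binary, which is always safe to hand to a socket.
to_socket(Chars) ->
    unicode:characters_to_binary(Chars).
```

Callers that pass raw see no change, which covers the people who expect a byte list. Callers that pass unicode get real strings and are expected to go through to_socket/1 (or an equivalent) on the way out.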

Roberto Ostinelli · Answer 11 · Mon Sep 26 2011 22:18:44 GMT+0800 (China Standard Time)

can you please check if this can be an optimum solution for you too?
https://github.com/ostinelli/misultin/blob/unicode/examples/misultin_unicode.erl

there's a branch for this issue, called 'unicode'.

also, the todo list includes Res:resource to also support the unicode option.

Roberto Ostinelli · Answer 12 · Thu Nov 17 2011 01:02:52 GMT+0800 (China Standard Time)

no feedback on this issue. i will probably add unicode support in REST and then close this issue.

Roberto Ostinelli · Answer 13 · Sun Nov 20 2011 02:14:56 GMT+0800 (China Standard Time)

closed bcb74f2