japgolly / nyaya

Random Data Generation and/or Property Testing in Scala & Scala.JS.

Generated strings should be valid UTF-16

ochrons opened this issue

The current implementation uses random 16-bit characters, which may not form a valid UTF-16 string. Strings should instead be formed by converting code points into UTF-16.

https://docs.oracle.com/javase/tutorial/i18n/text/unicode.html

The easiest way is to generate a random sequence of code points (Ints from 0x000000 to 0x10ffff, inclusive), then encode it in UTF-16 using new String(codePoints: Array[Int], offset: Int, count: Int)
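For illustration, a minimal sketch of that suggestion using plain scala.util.Random (the helper name is made up):

import scala.util.Random

// Random code points across the full range; String's Array[Int]
// constructor encodes code points >= 0x10000 as surrogate pairs.
def randomCodePointString(rnd: Random, len: Int): String = {
  val cps = Array.fill(len)(rnd.nextInt(0x110000)) // 0x000000 to 0x10ffff
  new String(cps, 0, cps.length)
}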

@sjrd Doesn't work. An example of a random code point is 0xd84f, which is a valid code point but, being an unpaired surrogate, not valid Unicode text.

scala> val u = "UTF-32"
u: String = UTF-32

scala> val cp = 0xd84f
cp: Int = 55375

scala> val s = new String(Array[Int](cp), 0, 1)
s: String = ?

scala> val b = s.getBytes(u)
b: Array[Byte] = Array(0, 0, -1, -3)

scala> s.codePoints.toArray
res6: Array[Int] = Array(55375)

scala> (new String(b, u)).codePoints.toArray
res7: Array[Int] = Array(65533)

Actually, parsing a bunch of random bytes as UTF-16 works (or at least it seems to, over 1,000,000 reps x 8 bytes). I think this will do fine.

byte.list.map(b => new String(b.toArray, "UTF-16"))

Random bytes have the downside of generating a lot of 0xFFFD (the replacement character for invalid input), because a sizeable part of the range is invalid: about 1/32 of it (the surrogate block 0xD800-0xDFFF), unless a high surrogate is correctly followed by a low surrogate to form a code point beyond 0x10000.

Also, this strategy generates very few valid code points beyond 0x10000 (about 1/4096), which may or may not be a problem. These code points are very rarely used, but it would be good to have more of them in the test data.
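A quick, unscientific way to observe both effects with plain scala.util.Random (exact counts will vary by seed):

import scala.util.Random

// Decode random bytes as UTF-16, then count replacement chars and
// astral code points. Java's "UTF-16" decoder assumes big-endian
// when no byte-order mark is present.
val rnd    = new Random(0)
val bytes  = Array.fill(1000000)(rnd.nextInt().toByte)
val s      = new String(bytes, "UTF-16")
val fffd   = s.count(_ == '\uFFFD')
val astral = s.codePoints.toArray.count(_ >= 0x10000)
println(s"U+FFFD: $fffd, code points beyond 0x10000: $astral")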

It would be good, but I don't know how to do it (without a big lookup table of code points that are valid Unicode).

Well, all code points between 0x0 and 0x10FFFF are valid Unicode, but each of them will generate 1-2 UTF-16 chars. And if you do generate random code points uniformly, about 16/17 of them will fall outside the Basic Multilingual Plane (where practically all the useful Unicode chars are), so you would need to bias the random results heavily to generate a "representative" sample of Unicode chars.
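For example, a hypothetical biased generator (the 9:1 split is arbitrary, and surrogates are not yet excluded; that's addressed below):

import scala.util.Random

// Mostly-BMP code points, occasionally astral. Illustrative only.
def biasedCodePoint(rnd: Random): Int =
  if (rnd.nextInt(10) < 9) rnd.nextInt(0x10000)  // BMP: U+0000..U+FFFF
  else 0x10000 + rnd.nextInt(0x100000)           // beyond the BMP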

all code points between 0 - 0x10FFFF are valid Unicode

That doesn't seem to be the case. Have you looked at the snippet I posted above?

Yeah, you might want to avoid 0xD800-0xDFFF as they are used for surrogate pairs. All of those probably end up as 0xFFFD in the conversion, which is ok, but not that useful.

Indeed, you're right. You should also exclude the range 0xd800-0xdfff.

This discussion prompts me to suggest that there be two ways to generate Strings (sketched after the list):

  • An arbitrary sequence of Chars (no restriction).
  • A sequence of non-surrogate code points (0x0000-0xd7ff ∪ 0xe000-0x10ffff), converted to UTF-16.
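A minimal sketch of both, using plain scala.util.Random rather than nyaya's Gen API (names are illustrative):

import scala.util.Random

// 1. Arbitrary Chars, no restriction: may contain unpaired
//    surrogates, i.e. not necessarily valid UTF-16.
def arbitraryCharString(rnd: Random, len: Int): String =
  Array.fill(len)(rnd.nextInt(0x10000).toChar).mkString

// 2. Non-surrogate code points, encoded as UTF-16. There are
//    0x110000 - 0x800 = 0x10F800 valid scalar values.
def validUnicodeString(rnd: Random, len: Int): String = {
  val cps = Array.fill(len) {
    val n = rnd.nextInt(0x10F800)
    if (n < 0xD800) n else n + 0x800 // skip the surrogate block
  }
  new String(cps, 0, cps.length)
}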

Thanks guys, I'll take another look at this. In fact I'll bloody prove that it works (that's what Domain is for) instead of relying on random sampling.

You know, I knew all about UTF-8, UTF-16 and a bunch of other Japanese & "western" charsets 10 years ago. Knew aaaaaaall about it. It seems the world has moved on since then; I feel like I don't know much anymore.

Fixed (see).