japgolly / nyaya

Random Data Generation and/or Property Testing in Scala & Scala.JS.

Generated strings should be valid UTF-16

ochrons opened this issue

The current implementation uses random 16-bit characters, which may not form a valid UTF-16 string. Strings should instead be formed by converting code points into UTF-16.

https://docs.oracle.com/javase/tutorial/i18n/text/unicode.html

The easiest way is to generate a random sequence of code points (Ints from 0x000000 to 0x10ffff, inclusive), then encode it in UTF-16 using new String(codePoints: Array[Int], offset: Int, count: Int)
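For illustration, a minimal sketch of that suggestion using plain scala.util.Random (the helper name is made up):

import scala.util.Random

// Random code points across the full range; String's Array[Int]
// constructor encodes code points >= 0x10000 as surrogate pairs.
def randomCodePointString(rnd: Random, len: Int): String = {
  val cps = Array.fill(len)(rnd.nextInt(0x110000)) // 0x000000 to 0x10ffff
  new String(cps, 0, cps.length)
}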

@sjrd Doesn't work. An example of a random code point is 0xd84f, which is a valid code point but, being an unpaired surrogate, not valid Unicode text.

scala> val u = "UTF-32"
u: String = UTF-32

scala> val cp = 0xd84f
cp: Int = 55375

scala> val s = new String(Array[Int](cp), 0, 1)
s: String = ?

scala> val b = s.getBytes(u)
b: Array[Byte] = Array(0, 0, -1, -3)

scala> s.codePoints.toArray
res6: Array[Int] = Array(55375)

scala> (new String(b, u)).codePoints.toArray
res7: Array[Int] = Array(65533)

Actually, parsing a bunch of random bytes as UTF-16 works (or at least it seems to, over 1,000,000 reps x 8 bytes). I think this will do fine.

byte.list.map(b => new String(b.toArray, "UTF-16"))

Random bytes have the downside of generating a lot of 0xFFFD (the replacement character for invalid input), because a sizeable part of the range is invalid: about 1/32 of it (the surrogate block 0xD800-0xDFFF), unless a high surrogate is correctly followed by a low surrogate to form a code point beyond 0x10000.

Also, this strategy generates very few valid code points beyond 0x10000 (about 1/4096), which may or may not be a problem. These code points are very rarely used, but it would be good to have more of them in the test data.
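A quick, unscientific way to observe both effects with plain scala.util.Random (exact counts will vary by seed):

import scala.util.Random

// Decode random bytes as UTF-16, then count replacement chars and
// astral code points. Java's "UTF-16" decoder assumes big-endian
// when no byte-order mark is present.
val rnd    = new Random(0)
val bytes  = Array.fill(1000000)(rnd.nextInt().toByte)
val s      = new String(bytes, "UTF-16")
val fffd   = s.count(_ == '\uFFFD')
val astral = s.codePoints.toArray.count(_ >= 0x10000)
println(s"U+FFFD: $fffd, code points beyond 0x10000: $astral")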

It would be good, but I don't know how to do it (without a big lookup table of code points that are valid Unicode).

Well, all code points between 0x0 and 0x10FFFF are valid Unicode, but each of them will generate 1-2 UTF-16 chars. And if you do generate random code points uniformly, about 16/17 of them will fall outside the Basic Multilingual Plane (where practically all the useful Unicode chars are), so you would need to bias the random results heavily to generate a "representative" sample of Unicode chars.
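For example, a hypothetical biased generator (the 9:1 split is arbitrary, and surrogates are not yet excluded; that's addressed below):

import scala.util.Random

// Mostly-BMP code points, occasionally astral. Illustrative only.
def biasedCodePoint(rnd: Random): Int =
  if (rnd.nextInt(10) < 9) rnd.nextInt(0x10000)  // BMP: U+0000..U+FFFF
  else 0x10000 + rnd.nextInt(0x100000)           // beyond the BMP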

all code points between 0 - 0x10FFFF are valid Unicode

That doesn't seem to be the case. Have you looked at the snippet I posted above?

Yeah, you might want to avoid 0xD800-0xDFFF as they are used for surrogate pairs. All of those probably end up as 0xFFFD in the conversion, which is ok, but not that useful.

Indeed, you're right. You should also exclude the range 0xd800-0xdfff.

This discussion prompts me to suggest that there be two ways to generate Strings (sketched after the list):

  • An arbitrary sequence of Chars (no restriction).
  • A sequence of non-surrogate code points (0x0000-0xd7ff ∪ 0xe000-0x10ffff), converted to UTF-16.
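A minimal sketch of both, using plain scala.util.Random rather than nyaya's Gen API (names are illustrative):

import scala.util.Random

// 1. Arbitrary Chars, no restriction: may contain unpaired
//    surrogates, i.e. not necessarily valid UTF-16.
def arbitraryCharString(rnd: Random, len: Int): String =
  Array.fill(len)(rnd.nextInt(0x10000).toChar).mkString

// 2. Non-surrogate code points, encoded as UTF-16. There are
//    0x110000 - 0x800 = 0x10F800 valid scalar values.
def validUnicodeString(rnd: Random, len: Int): String = {
  val cps = Array.fill(len) {
    val n = rnd.nextInt(0x10F800)
    if (n < 0xD800) n else n + 0x800 // skip the surrogate block
  }
  new String(cps, 0, cps.length)
}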

Thanks guys, I'll take another look at this. In fact I'll bloody prove that it works (that's what Domain is for) instead of relying on random sampling.

You know, I knew all about UTF-8, UTF-16 and a bunch of other Japanese & "western" charsets 10 years ago. Knew aaaaaaall about it. It seems the world has moved on since then; I feel like I don't know much anymore.

Fixed (see).