Normalisation bug with Hangul / symbols sequence

Question

stedolan opened this issue 7 years ago

I re-ran the fuzzing job after #8 was fixed, and it found another case:

                   s: [U+c100 U+20d2 U+11c1 U+11c1 "섀⃒ᇁᇁ"]
            toNFC(s): [U+c11a U+20d2 U+11c1 "섚⃒ᇁ"]
            toNFD(s): [U+1109 U+1164 U+20d2 U+11c1 U+11c1 "섀⃒ᇁᇁ"]
     toNFD(toNFC(s)): [U+1109 U+1164 U+11c1 U+20d2 U+11c1 "섚⃒ᇁ"]

The last two lines should be equal.
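
The same round trip can be checked against an independent implementation. What follows is a minimal sketch, assuming Python's standard `unicodedata` module as the reference; it is not the library discussed in this issue, and the helper name `code_points` is made up for illustration.

```python
import unicodedata

def code_points(text):
    """Format a string as a list of U+XXXX code points."""
    return ["U+%04X" % ord(ch) for ch in text]

# Input sequence from the report.
s = "\uc100\u20d2\u11c1\u11c1"

nfc = unicodedata.normalize("NFC", s)
nfd = unicodedata.normalize("NFD", s)
nfd_of_nfc = unicodedata.normalize("NFD", nfc)

print("s:              ", code_points(s))
print("toNFC(s):       ", code_points(nfc))
print("toNFD(s):       ", code_points(nfd))
print("toNFD(toNFC(s)):", code_points(nfd_of_nfc))

# The invariant from the report: decomposing after composing must give
# the same result as decomposing directly.
assert nfd_of_nfc == nfd
```

If the implementation composes only when nothing intervenes between the syllable and the trailing consonant, toNFC(s) should come back identical to s here.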

Daniel Bünzli · Answer 1 · Fri May 26 2017 23:46:04 GMT+0800 (China Standard Time)

Thanks. It seems that I broke something in the 9459c90 fix since this was previously correct:

> unftrip -a --nfc
섀⃒ᇁᇁ
U+C100
U+20D2
U+11C1
U+11C1
U+000A

Daniel Bünzli · Answer 2 · Sat May 27 2017 00:15:57 GMT+0800 (China Standard Time)

The bug I introduced was that I would combine two characters with ccc=0 even if there was a character between them with ccc<>0, which is not what the composition algorithm mandates. In this case I would compose U+C100 with U+11C1, which yields U+C11A, but the U+20D2 between the two prevents this.
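
The blocking rule described above can be illustrated with a deliberately simplified sketch. It is not the library's code: it handles only the Hangul "LV syllable + trailing consonant" case, and the names `compose_lv_t`, `is_lv_syllable`, `is_trailing_jamo` and the `ignore_blocking` flag are hypothetical. The flag mimics the faulty behaviour of looking past a character with ccc<>0.

```python
import unicodedata

# Hangul constants from the Unicode conjoining jamo arithmetic.
S_BASE, S_COUNT = 0xAC00, 11172
T_BASE, T_COUNT = 0x11A7, 28

def is_lv_syllable(cp):
    """True for a precomposed syllable without a trailing consonant."""
    return S_BASE <= cp < S_BASE + S_COUNT and (cp - S_BASE) % T_COUNT == 0

def is_trailing_jamo(cp):
    """True for a trailing consonant (U+11A8..U+11C2)."""
    return T_BASE < cp < T_BASE + T_COUNT

def compose_lv_t(text, ignore_blocking=False):
    """Compose LV + T pairs only; a simplified view of canonical composition."""
    out = []
    starter = None  # index in `out` of the most recent character with ccc = 0
    for ch in text:
        cp = ord(ch)
        if (starter is not None and is_trailing_jamo(cp)
                and is_lv_syllable(ord(out[starter]))):
            # A trailing jamo has ccc = 0, so any character sitting between
            # it and the starter blocks the composition.
            adjacent = starter == len(out) - 1
            if adjacent or ignore_blocking:
                out[starter] = chr(ord(out[starter]) + cp - T_BASE)
                continue
        out.append(ch)
        if unicodedata.combining(ch) == 0:
            starter = len(out) - 1
    return "".join(out)

s = "\uc100\u20d2\u11c1\u11c1"
correct = compose_lv_t(s)                      # U+20D2 blocks the composition
buggy = compose_lv_t(s, ignore_blocking=True)  # composes across U+20D2
print(["U+%04X" % ord(c) for c in correct])
print(["U+%04X" % ord(c) for c in buggy])
```

With `ignore_blocking=True` the sketch reproduces the sequence reported as toNFC(s) in the question; with the default it leaves the input unchanged.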

Further testing and breakage welcome.

Stephen Dolan · Answer 3 · Tue May 30 2017 17:33:09 GMT+0800 (China Standard Time)

For the record, further fuzzing revealed no more bugs. I ran afl-fuzz for a few days (at 10k tests/sec), and it got to pending=0 with no new paths found for more than a day (which is the closest that afl-fuzz ever gets to saying that it's "done"). The test was to check that these equations hold on arbitrary input sequences.
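
A property check of this kind can be sketched without afl-fuzz. The equations themselves are not listed in this thread, so the ones below are an assumption: the usual normalisation round-trip identities. The sketch uses Python's `unicodedata` in place of the library under test, plain seeded random generation instead of coverage-guided fuzzing, and a hypothetical code point pool biased toward Hangul and combining marks.

```python
import random
import unicodedata

def check_equations(s):
    """Assert the assumed round-trip identities for one input string."""
    nfc = unicodedata.normalize("NFC", s)
    nfd = unicodedata.normalize("NFD", s)
    assert unicodedata.normalize("NFD", nfc) == nfd  # the case from this issue
    assert unicodedata.normalize("NFC", nfd) == nfc
    assert unicodedata.normalize("NFC", nfc) == nfc  # idempotence
    assert unicodedata.normalize("NFD", nfd) == nfd

# Hypothetical pool: jamo, some Hangul syllables, combining marks, letters.
POOL = (
    list(range(0x1100, 0x1200))
    + list(range(0xAC00, 0xAC00 + 2 * 588))
    + list(range(0x0300, 0x0370))
    + list(range(0x20D0, 0x20F1))
    + list(range(0x41, 0x5B))
)

def random_sequence(rng, max_len=8):
    """Draw a short string of code points from the pool."""
    length = rng.randint(0, max_len)
    return "".join(chr(rng.choice(POOL)) for _ in range(length))

rng = random.Random(0)  # fixed seed so a failure can be replayed
checked = 0
for _ in range(2000):
    check_equations(random_sequence(rng))
    checked += 1
print("checked", checked, "sequences")
```

Unlike afl-fuzz, this loop has no coverage feedback, so it gives no signal comparable to pending=0; it only shows the shape of the test.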

Daniel Bünzli · Answer 4 · Tue May 30 2017 17:40:05 GMT+0800 (China Standard Time)

Cool, thanks for the report!