dbuenzli / uunf

Unicode text normalization for OCaml

Home Page: http://erratique.ch/software/uunf

Normalisation bug with Hangul / symbols sequence

stedolan opened this issue

I re-ran the fuzzing job after #8 was fixed, and it found another case:

                   s: [U+c100 U+20d2 U+11c1 U+11c1 "섀⃒ᇁᇁ"]
            toNFC(s): [U+c11a U+20d2 U+11c1 "섚⃒ᇁ"]
            toNFD(s): [U+1109 U+1164 U+20d2 U+11c1 U+11c1 "섀⃒ᇁᇁ"]
     toNFD(toNFC(s)): [U+1109 U+1164 U+11c1 U+20d2 U+11c1 "섚⃒ᇁ"]

The last two lines should be equal.

Thanks. It seems that I broke something in the 9459c90 fix since this was previously correct:

> unftrip -a --nfc
섀⃒ᇁᇁ
U+C100
U+20D2
U+11C1
U+11C1
U+000A

The bug I introduced was that I would combine two characters with ccc=0 even if there was a character with ccc<>0 between them, which is not what the composition algorithm mandates. In this case I would compose U+C100 with U+11C1, which yields U+C11A, but the U+20D2 between the two prevents this.
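For readers who want to see the blocking rule in isolation, here is a minimal sketch of a composition pass restricted to Hangul. It is not Uunf's implementation: the names `ccc`, `compose_pair` and `compose` are made up for this illustration, the `ccc` table is a toy that only knows U+20D2 (ccc=1), and the input is assumed to be already decomposed and canonically ordered. The Hangul constants are the ones given in the Unicode Standard.

```ocaml
(* Hangul composition constants from the Unicode Standard. *)
let s_base = 0xAC00
let l_base = 0x1100
let v_base = 0x1161
let t_base = 0x11A7
let l_count = 19
let v_count = 21
let t_count = 28
let s_count = l_count * v_count * t_count

(* Toy canonical combining class table (assumption): only U+20D2 is a
   non-starter here. A real implementation reads the Unicode database. *)
let ccc u = if u = 0x20D2 then 1 else 0

let is_lv u =
  u >= s_base && u < s_base + s_count && (u - s_base) mod t_count = 0

let is_t u = u > t_base && u < t_base + t_count

(* Algorithmic Hangul composition: L + V -> LV and LV + T -> LVT. *)
let compose_pair s c =
  if s >= l_base && s < l_base + l_count
     && c >= v_base && c < v_base + v_count
  then Some (s_base + ((s - l_base) * v_count + (c - v_base)) * t_count)
  else if is_lv s && is_t c then Some (s + (c - t_base))
  else None

(* Composition pass over a list of code points.
   [done_rev]: finished output, reversed.
   [starter]:  last starter seen, if any.
   [tail_rev]: characters kept after that starter, reversed. *)
let compose cps =
  let flush done_rev starter tail_rev =
    match starter with
    | None -> tail_rev @ done_rev
    | Some s -> tail_rev @ (s :: done_rev)
  in
  let rec go done_rev starter tail_rev = function
    | [] -> List.rev (flush done_rev starter tail_rev)
    | c :: rest ->
      (* [c] is blocked from the starter if the character just before it
         is a starter or has a combining class >= that of [c]. *)
      let blocked =
        match tail_rev with
        | [] -> false
        | b :: _ -> ccc b = 0 || ccc b >= ccc c
      in
      let composed =
        match starter with
        | Some s when not blocked -> compose_pair s c
        | _ -> None
      in
      match composed with
      | Some sc -> go done_rev (Some sc) tail_rev rest
      | None ->
        if ccc c = 0
        then go (flush done_rev starter tail_rev) (Some c) [] rest
        else go done_rev starter (c :: tail_rev) rest
  in
  go [] None [] cps

let () =
  (* toNFD(s) from the report: U+20D2 blocks the first U+11C1. *)
  assert (compose [0x1109; 0x1164; 0x20D2; 0x11C1; 0x11C1]
          = [0xC100; 0x20D2; 0x11C1; 0x11C1]);
  (* Without the intervening mark, LV + T composes to U+C11A. *)
  assert (compose [0x1109; 0x1164; 0x11C1] = [0xC11A])
```

The first assertion is the sequence from this issue: because U+20D2 sits between U+C100 and U+11C1, the trailing consonant stays separate. The second shows the same pair composing once nothing stands between them.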

Further testing and breakage welcome.

For the record, further fuzzing revealed no more bugs. I ran afl-fuzz for a few days (at 10k tests/sec), and it got to pending=0 with no new paths found for more than a day (which is the closest that afl-fuzz ever gets to saying that it's "done"). The test was to check that these equations hold on arbitrary input sequences.

Cool, thanks for the report !