JuliaStrings / utf8proc

a clean C library for processing UTF-8 Unicode data

Home Page: http://juliastrings.github.io/utf8proc/

normalization does not commute with case-folding?

stevengj opened this issue

I noticed an odd case in JuliaLang/julia#52408 (comment):

julia> using Unicode: normalize

julia> s = "J\uf72\uec8\u345\u315\u5bf\u5bb\U1d16d\u5b0\u334\u35c"
"J"

julia> normalize(s, casefold=true) == normalize(normalize(s), casefold=true)
false

julia> normalize(normalize(s, casefold=true)) == normalize(normalize(s), casefold=true)
false

(The Julia Unicode.normalize function calls utf8proc, and defaults to NFC normalization.)

I'm not sure whether this is a bug in utf8proc or just a quirk of Unicode itself. It would be good to try it with ICU or some other library.

I get something similar in Python 3:

>>> import unicodedata
>>> s = "J\u0f72\u0ec8\u0345\u0315\u05bf\u05bb\U0001d16d\u05b0\u0334\u035c"
>>> unicodedata.normalize("NFC", s.casefold()) == unicodedata.normalize("NFC", s).casefold()
False
>>> unicodedata.normalize("NFC", s.casefold()) == unicodedata.normalize("NFC", unicodedata.normalize("NFC", s).casefold())
False

So I guess this is a weird quirk of Unicode?

That's quite unfortunate. Seems like exactly the kind of thing the Unicode Consortium is supposed to think through and avoid.
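For what it's worth, my understanding is that the Unicode Standard does account for this: its definition of canonical caseless matching (Section 3.13, as I recall) normalizes both before and after folding, comparing NFD(toCasefold(NFD(X))) rather than a single fold-and-normalize pass. The sketch below assumes Python's `str.casefold` is an acceptable stand-in for toCasefold; `canonical_caseless_key` is a hypothetical helper name, not part of any library.

```python
import unicodedata

def canonical_caseless_key(s):
    """NFD, then case-fold, then NFD again: a stand-in for NFD(toCasefold(NFD(X)))."""
    decomposed = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", decomposed.casefold())

# The string from the original report.
s = "J\u0f72\u0ec8\u0345\u0315\u05bf\u05bb\U0001d16d\u05b0\u0334\u035c"

# The key should not depend on whether the input was normalized beforehand,
# because NFD of an already-normalized string equals NFD of the original.
for form in ("NFC", "NFD"):
    pre = unicodedata.normalize(form, s)
    print(form, canonical_caseless_key(pre) == canonical_caseless_key(s))
```

With the inner NFD in place, pre-normalizing the input should no longer change the result, which is the property the one-pass version in this issue lacks.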