normalization does not commute with case-folding?

stevengj opened this issue 8 months ago

Steven G. Johnson commented 8 months ago

I noticed an odd case in JuliaLang/julia#52408 (comment):

julia> using Unicode: normalize

julia> s = "J\uf72\uec8\u345\u315\u5bf\u5bb\U1d16d\u5b0\u334\u35c"
"J"

julia> normalize(s, casefold=true) == normalize(normalize(s), casefold=true)
false

julia> normalize(normalize(s, casefold=true)) == normalize(normalize(s), casefold=true)
false

(The Julia Unicode.normalize function calls utf8proc, and defaults to NFC normalization.)

I'm not sure whether this is a bug or just odd behavior in Unicode itself. It would be good to check the same string with ICU or another library.

Steven G. Johnson · Answer 1 · Fri Dec 08 2023 11:44:11 GMT+0800 (China Standard Time)

I get something similar in Python 3:

>>> import unicodedata
>>> s = "J\u0f72\u0ec8\u0345\u0315\u05bf\u05bb\U0001d16d\u05b0\u0334\u035c"
>>> unicodedata.normalize("NFC", s.casefold()) == unicodedata.normalize("NFC", s).casefold()
False
>>> unicodedata.normalize("NFC", s.casefold()) == unicodedata.normalize("NFC", unicodedata.normalize("NFC", s).casefold())
False

So I guess this is a weird quirk of Unicode?
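
A much shorter string appears to show the same effect, which suggests the culprit is U+0345 COMBINING GREEK YPOGEGRAMMENI. It is a combining mark, but it case-folds to U+03B9 GREEK SMALL LETTER IOTA, which is a starter. The following is only a minimal sketch: the two-mark string is an illustrative reduction chosen here, not something taken from the original report, and it has not been checked against ICU.

```python
import unicodedata

def nfc(text):
    """Shorthand for NFC normalization."""
    return unicodedata.normalize("NFC", text)

# Illustrative reduction (an assumption, not from the original report):
# a base letter, then U+0345, then U+0334 COMBINING TILDE OVERLAY.
s = "J\u0345\u0334"

# Canonical combining class of U+0345 before and after case folding.
print(unicodedata.combining("\u0345"))
print(unicodedata.combining("\u0345".casefold()))

fold_then_nfc = nfc(s.casefold())   # fold first, then normalize
nfc_then_fold = nfc(s).casefold()   # normalize first, then fold

# Show the code points so the ordering difference is visible.
print([f"U+{ord(c):04X}" for c in fold_then_nfc])
print([f"U+{ord(c):04X}" for c in nfc_then_fold])
print(fold_then_nfc == nfc_then_fold)
```

Normalizing first sorts the marks by combining class, so U+0334 moves in front of U+0345 before folding. Folding first replaces U+0345 with a starter, and canonical reordering cannot move U+0334 across a starter, so the two results keep different orders.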

Stefan Karpinski · Answer 2 · Tue Dec 19 2023 20:49:57 GMT+0800 (China Standard Time)

That's quite unfortunate. Seems like exactly the kind of thing the Unicode Consortium is supposed to think through and avoid.
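
For what it's worth, the Unicode Standard does seem to account for this. Its definition of canonical caseless matching (Section 3.13, Default Case Algorithms) compares NFD(toCasefold(NFD(X))) rather than the case-folded strings alone, so that canonically equivalent inputs still match. Here is a minimal sketch of that comparison in Python, assuming `str.casefold()` is an adequate stand-in for toCasefold. The names `canonical_caseless_key` and `canonical_caseless_match` are hypothetical helpers, not library functions.

```python
import unicodedata

def canonical_caseless_key(text):
    """Return NFD(toCasefold(NFD(text))), with str.casefold() as toCasefold."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", decomposed.casefold())

def canonical_caseless_match(a, b):
    """True if a and b have the same canonical caseless key."""
    return canonical_caseless_key(a) == canonical_caseless_key(b)

# The string from the thread and its NFC form (canonically equivalent).
s = "J\u0f72\u0ec8\u0345\u0315\u05bf\u05bb\U0001d16d\u05b0\u0334\u035c"
t = unicodedata.normalize("NFC", s)

print(s.casefold() == t.casefold())    # plain case folding
print(canonical_caseless_match(s, t))  # normalize, fold, normalize again
```

The point of the design is that the inner NFD puts the marks into canonical order before U+0345 is folded into a starter, so canonically equivalent inputs reach the folding step as identical strings.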