normalization does not commute with case-folding?
stevengj opened this issue
I noticed an odd case in JuliaLang/julia#52408 (comment):
julia> using Unicode: normalize
julia> s = "J\uf72\uec8\u345\u315\u5bf\u5bb\U1d16d\u5b0\u334\u35c"
"J"
julia> normalize(s, casefold=true) == normalize(normalize(s), casefold=true)
false
julia> normalize(normalize(s, casefold=true)) == normalize(normalize(s), casefold=true)
false
(The Julia Unicode.normalize function calls utf8proc, and defaults to NFC normalization.)
Not sure if this is a bug or just weird behavior of Unicode. It would be good to try it out with ICU or some other library.
I get something similar in Python 3:
>>> import unicodedata
>>> s = "J\u0f72\u0ec8\u0345\u0315\u05bf\u05bb\U0001d16d\u05b0\u0334\u035c"
>>> unicodedata.normalize("NFC", s.casefold()) == unicodedata.normalize("NFC", s).casefold()
False
>>> unicodedata.normalize("NFC", s.casefold()) == unicodedata.normalize("NFC", unicodedata.normalize("NFC", s).casefold())
False
So I guess this is a weird quirk of Unicode?
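One way to narrow this down is to check which code points in s change their canonical combining class when case-folded, since that is what canonical reordering sorts on. A minimal sketch using only the standard unicodedata module (the helper name describe and the variable changed are just for illustration):

```python
import unicodedata

s = "J\u0f72\u0ec8\u0345\u0315\u05bf\u05bb\U0001d16d\u05b0\u0334\u035c"

def describe(text):
    """Return (code point, canonical combining class) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.combining(ch)) for ch in text]

# Collect the code points whose combining class differs before and after
# case folding. Canonical reordering sorts marks by this class, so a change
# here means normalization sees a different sequence after folding.
changed = [
    (f"U+{ord(ch):04X}", unicodedata.combining(ch), describe(ch.casefold()))
    for ch in s
    if [unicodedata.combining(c) for c in ch.casefold()] != [unicodedata.combining(ch)]
]
print(changed)
```

With the Unicode data shipped in Python 3, the only entry should be U+0345 (COMBINING GREEK YPOGEGRAMMENI): it is a combining mark with class 240, but it folds to U+03B9 (GREEK SMALL LETTER IOTA), which has class 0. Folding first therefore puts a starter in the middle of the mark sequence and blocks reordering across it, while normalizing first moves U+0345 to the end before it gets folded.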
That's quite unfortunate. Seems like exactly the kind of thing the Unicode Consortium is supposed to think through and avoid.
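For what it's worth, the Unicode Standard does seem to account for this in its definition of canonical caseless matching (section 3.13), which compares NFD(toCasefold(NFD(X))) rather than folding or normalizing only once. A rough sketch of that comparison key in Python, assuming str.casefold() is an acceptable stand-in for toCasefold (the function name canonical_caseless_key is made up for this example):

```python
import unicodedata

def canonical_caseless_key(text):
    """Sketch of NFD(toCasefold(NFD(text))), with str.casefold() as toCasefold."""
    nfd = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", nfd.casefold())

s = "J\u0f72\u0ec8\u0345\u0315\u05bf\u05bb\U0001d16d\u05b0\u0334\u035c"
nfc = unicodedata.normalize("NFC", s)

# Because the key normalizes before folding, it does not depend on whether
# the input was already normalized.
same_key = canonical_caseless_key(s) == canonical_caseless_key(nfc)
print(same_key)
```

This does not make normalization and case folding commute. It just fixes one order (normalize, fold, normalize again) so that canonically equivalent inputs end up with the same key.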