Guard t_toCaseFold_char with #if MIN_VERSION_base(4,16,0).

Question

Guard t_toCaseFold_char with #if MIN_VERSION_base(4,16,0).

Bodigrim opened this issue a year ago · comments

On GHC 9.6 t_toCaseFold_char fails on '\66937', which is VITHKUQI CAPITAL LETTER FE' (U+10579) added in Unicode 14.0. The test should be guarded with #if MIN_VERSION_base(4,16,0).

Xia Li-yao · Answer 1 · Wed Feb 08 2023 11:21:26 GMT+0800 (China Standard Time)

That may be the right fix for now, but it's worth noting it's only a temporary fix. Is there a longer term solution?

The root of the problem is that GHC and text don't use the same Unicode version. That makes our tests flaky, as in this case, and this is likely to happen again for any fix that looks only at base/GHC versions. Regardless of how we fix our tests, users of mismatched versions of GHC and text may also get very surprising bugs. This shouldn't happen for distributions that keep them in sync, hence the status of text as a core library that ships with GHC has been a pretty good mitigating factor so far; that might explain why we haven't seen reports of it already. But it seems worth reducing the risk further.

Ideally, the Unicode version should be set only once, and it seems reasonable that GHC should ultimately be responsible for this. This doesn't seem overly hard, but there are some executive decisions to be made, notably regarding who should fetch CaseFolding.txt and SpecialCase.txt. (I think that should end up in GHC, but we would then have to figure out how to maintain backward compatibility with existing GHCs).

I might still be useful to enable building text with an arbitrary Unicode version, independent of GHC. Imagine an application with Unicode-heavy data that they don't want to break when they upgrade or downgrade GHC.

It's fun to look at the QuickCheck test that failed. It looks like there are ~50 counterexamples (new upper case letters in Unicode 14), and Chars are sampled among the ~150k assigned ones. That's enough that we were bound to run into it, but it's not everyday that we get to look at a QuickCheck property that fails 3% of the time (with the standard 100 iterations).

ˌbodʲɪˈɡrʲim · Answer 2 · Tue Feb 14 2023 07:31:36 GMT+0800 (China Standard Time)

Fundamentally Data.Text.toLower and map Data.Char.toLower are different functions with different semantics, even if both were generated from the same Unicode database. So in my view it's purely internal issue how to test Data.Text.toLower: instead of combininig Data.Char.toLower and a hand-crafted list of exclusions, we could have incorporated SpecialCase.txt into test suite and match against it directly.

I'll go with a tactical fix for now, just to keep CI green.

Oleg Grenrus · Answer 3 · Tue Feb 14 2023 17:05:52 GMT+0800 (China Standard Time)

instead of combininig Data.Char.toLower and a hand-crafted list of exclusions

That is btw how Data.Text.toLower was implemented. List of exclusions/overrides and then calling Data.Char.toLower on the rest.

That didn't work well (e.g. issue #277, fix #402).

So probably won't work well for tests either :)