tc39 / proposal-regexp-v-flag

UTS18 set notation in regular expressions

MaybeSimpleCaseClosure broken on multi-character strings

waldemarhorwat opened this issue

MaybeSimpleCaseClosure only computes the case closure on single-character strings and passes other strings unchanged. This looks like it might be intentional, but it's also incorrect. Here's why.

/[\q{k}&&\q{K}]/vi will accept k, K, and the Kelvin symbol.

/[\q{kk}&&\q{Kk}]/vi will accept nothing. It should accept all variations of two k's of any case.
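To make the two cases above concrete, here is a minimal model of the behaviour described in this issue. It is a sketch and not the spec text: `CASE_CLASSES`, `caseVariants`, `closureAsDescribed`, and `intersect` are hypothetical names, the case data is a hand-written three-entry table rather than real Unicode data, and it assumes the closure is applied to each operand before the intersection.

```javascript
// Hypothetical mini case-equivalence data; a real engine would use Unicode tables.
const CASE_CLASSES = [["k", "K", "\u212A"]]; // U+212A is KELVIN SIGN

// All simple case variants of ch, including ch itself.
function caseVariants(ch) {
  return CASE_CLASSES.find((cls) => cls.includes(ch)) || [ch];
}

// Model of the behaviour described in the issue: only single-character
// strings get a case closure; longer strings pass through unchanged.
function closureAsDescribed(strings) {
  const out = new Set();
  for (const s of strings) {
    const chars = Array.from(s); // split by code point
    if (chars.length === 1) {
      for (const v of caseVariants(chars[0])) out.add(v);
    } else {
      out.add(s);
    }
  }
  return out;
}

function intersect(a, b) {
  return new Set([...a].filter((x) => b.has(x)));
}

// Models [\q{k}&&\q{K}] under /vi: both operands close to the same 3 characters.
const singleChar = intersect(closureAsDescribed(["k"]), closureAsDescribed(["K"]));

// Models [\q{kk}&&\q{Kk}] under /vi: neither operand is closed, so "kk" !== "Kk".
const multiChar = intersect(closureAsDescribed(["kk"]), closureAsDescribed(["Kk"]));

console.log(singleChar.size); // 3
console.log(multiChar.size);  // 0
```

In this model the single-character intersection keeps k, K, and the Kelvin sign, while the two-character intersection comes out empty, which is the inconsistency reported here.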

I see a couple of ways to proceed:

  • (fewer spec changes to fix the bug) Compute case closures in MaybeSimpleCaseClosure. When applied to CharSets of strings this will create case closure CharSets that are exponentially large, but this is just spec fiction and easily optimized by implementations.
  • (spec closer to how implementations would actually do this) Case-canonicalize everything instead of computing case closures. This requires that the sets you're subtracting from when computing complements such as in ^ must be case-canonicalized as well.
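As a rough illustration of why the first option produces exponentially large CharSets, the sketch below computes a full case closure for a multi-character string. The names `VARIANT_CLASSES`, `variantsOf`, and `fullClosure` are hypothetical, and the case data is again a hand-written toy table, not real Unicode data.

```javascript
// Hypothetical mini case-equivalence data; a real engine would use Unicode tables.
const VARIANT_CLASSES = [["k", "K", "\u212A"]]; // U+212A is KELVIN SIGN

function variantsOf(ch) {
  return VARIANT_CLASSES.find((cls) => cls.includes(ch)) || [ch];
}

// Full case closure of one string: every combination of per-character variants.
function fullClosure(s) {
  let results = [""];
  for (const ch of Array.from(s)) {
    const next = [];
    for (const prefix of results) {
      for (const v of variantsOf(ch)) next.push(prefix + v);
    }
    results = next;
  }
  return new Set(results);
}

const closedLeft = fullClosure("kk");
const closedRight = fullClosure("Kk");
const common = [...closedLeft].filter((x) => closedRight.has(x));

console.log(closedLeft.size);          // 9  (3 variants per position, 2 positions)
console.log(common.length);            // 9  (the two closures coincide)
console.log(fullClosure("kkkk").size); // 81 (3 to the power 4)
```

With full closures the intersection is no longer empty, but the set size grows as the product of the per-character variant counts, which is why an implementation would want to avoid materializing it.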

We discussed and resolved this with Waldemar a few weeks ago, without looking at this GitHub issue. We chose the second approach. Case-folding a set will replace each set element (consisting of any number of characters) with its Simple_Case_Folding equivalent, and the code point complement starts from a set of characters that don't case-fold, rather than from all code points.
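A minimal sketch of the canonicalization approach, under stated assumptions: `SIMPLE_FOLD` is a hand-written stand-in for the Simple_Case_Folding data, `TOY_UNIVERSE` is a five-character stand-in for "all code points", and every function name is hypothetical. It also reads "characters that don't case-fold" as characters that fold to themselves, which is an interpretation rather than something this thread spells out.

```javascript
// Hypothetical stand-in for Simple_Case_Folding; real data comes from Unicode.
const SIMPLE_FOLD = new Map([["K", "k"], ["\u212A", "k"]]); // U+212A is KELVIN SIGN

function foldChar(ch) {
  return SIMPLE_FOLD.get(ch) || ch;
}

// Fold a string code point by code point.
function foldString(s) {
  return Array.from(s).map(foldChar).join("");
}

// Case-fold a set: replace each element with its folded equivalent.
function foldSet(strings) {
  return new Set([...strings].map(foldString));
}

// Matching folds the input too, so only canonical forms are ever compared.
function matches(set, input) {
  return set.has(foldString(input));
}

// Models [\q{kk}&&\q{Kk}] under /vi: both operands fold to {"kk"}.
const left = foldSet(["kk"]);
const right = foldSet(["Kk"]);
const both = new Set([...left].filter((x) => right.has(x)));

console.log(matches(both, "KK"));      // true
console.log(matches(both, "k\u212A")); // true
console.log(matches(both, "kj"));      // false

// Models [^K] under /vi: the complement starts from characters that are
// already canonical, not from every code point.
const TOY_UNIVERSE = ["j", "k", "K", "\u212A", "l"];
const canonicalUniverse = TOY_UNIVERSE.filter((c) => foldChar(c) === c);
const excluded = foldSet(["K"]);
const complement = new Set(canonicalUniverse.filter((c) => !excluded.has(c)));

console.log([...complement]);          // [ 'j', 'l' ]
console.log(matches(complement, "K")); // false
```

Each set stays the same size as the pattern's own set, and the intersection, complement, and match steps all operate on folded strings only.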