IgnoreCase vs. complement vs. nested class

Question

IgnoreCase vs. complement vs. nested class

markusicu opened this issue 3 years ago · comments

Proposal

The current proposal is to use deep case closure: #30 (comment) --> #30 (comment) (limited to only simple case folding)

More specifically:

Define an abstract operation SimpleCaseClosure(A) where A is a CharSet.
(Note: scf = Unicode Simple_Case_Folding: the simple mappings in CaseFolding.txt, as in the Canonicalize(ch) operation)

For each single code point c in A, add every other code point d where scf(d)==c or where scf(c)==d

When building a CharSet from a character class or from a CharacterClassEscape, if IgnoreCase==true and /v is specified:

literal character c: create a CharSet A with just c and return SimpleCaseClosure(A)
range a-b: create a CharSet A with the one contiguous range of code points from a to b and return SimpleCaseClosure(A)
- Note: The result will often consist of two or more ranges. SimpleCaseClosure([a-z]) will include the separate range [A-Z] and several other non-adjacent code points.
\p{X}: resolve the property expression into a CharSet A and return SimpleCaseClosure(A)
\P{X}: resolve the property expression into a CharSet A, compute CharSet B = SimpleCaseClosure(A), return the code point complement of B
same with \w \W \s \S etc.: look up the property CharSet, compute the case closure, return the code point complement for backslash-uppercase escapes
set operations work as usual
[^...]: compute the inner character class expression into CharSet A and return the code point complement of A; if this is a top-level CharacterClass, then return with invert=false

Problem description

The current draft spec text includes a TODO to discuss whether to do something special about IgnoreCase.
This becomes interesting when looking at IgnoreCase + complement + nested classes.

Notation: In Unicode=true mode, ES regular expressions apply Unicode Simple_Case_Folding, which has the short name scf.

In our little working group, we had been chewing on this question on and off without coming to a conclusion.
We had been trying to rationalize and match the existing behavior, and discussed doing an early "case closure" when IgnoreCase=true (for each c in the CharSet, add any c2 if scf(c2)=c), at least when a nested class is complemented (and maybe for any nested class regardless), with the goal of being consistent with existing matching behavior.

Then we realized that the existing matching behavior is inconsistent with itself (or at least unintuitive).

Looking at the existing spec:

The spec already allows one level of nested classes in the form of CharacterClassEscapes inside CharacterClasses.
The complement of a CharacterClassEscape (\ plus uppercase letter) is resolved immediately, computing the code point complement of its CharSet.
The complement of a CharacterClass ([^ ClassRanges ]) is deferred via the "invert" boolean until the CharacterSetMatcher operates on the CharSet, rather than complementing the CharSet itself.
When IgnoreCase=true, then Canonicalize (which the CharacterSetMatcher invokes) applies the Unicode Simple_Case_Folding (scf) before applying the "invert" complement.

In other words, with IgnoreCase=true, the matching behavior for a CharacterClass is very different from that for a CharacterClassEscape, and using a complemented CharacterClassEscape inside a CharacterClass is different from a normal CharacterClassEscape inside a complemented CharacterClass.

Example:

re1=/\p{Ll}/giu
re2=/[^\P{Ll}]/giu

Naïvely, I expected these to behave the same. Actual results:

"aAbBcC4#".replaceAll(re1, 'X') outputs "XXXXXX4#"
- The CharSet contains all lowercase letters.
- invert=false
- For any letter m in the text, there is a letter n in the set such that scf(m)==scf(n).
- No match for digits, punctuation, etc.
"aAbBcC4#".replaceAll(re2, 'X') outputs "aAbBcC4#"
- The CharSet contains all Unicode code points other than lowercase letters: Uppercase/titlecase/other letters, digits, symbols, punctuation, unassigned, etc.
- invert=true
- For any letter m in the text, there is a letter n in the set such that scf(m)==scf(n), but "invert" negates "found".
- No match for any letters.
- No match for digits/punctuation/etc. either because they are in the CharSet directly and "invert" negates "found".
- In other words, under IgnoreCase \P{Ll} matches everything (with possible exceptions if there are unusual character properties) and so its CharacterClass complement matches nothing.

In our proposed spec text, we currently do nothing special. Just like in the current spec text, a CharacterClassEscape simply always evaluates to a CharSet, with a code point complement as appropriate. And a nested class with brackets (which does not use the CharacterClass production to avoid having to return a (CharSet, invert) pair) also simply evaluates to a CharSet, with a code point complement for [^ ClassRanges ].

We could consider doing early case closure of nested classes and properties, and/or doing a code point complement of the CharacterClass CharSet and removing the "invert" boolean, and/or something else. We need to weigh "improving" behavior vs. making it different from existing behavior for the same or similar patterns.

Mark Davis · Answer 1 · Thu Jun 03 2021 13:22:28 GMT+0800 (China Standard Time)

I propose that we adopt a uniform approach to case insensitively, where:

if C is a character class, and "a" represents a character that has case variants, and "1" represents a character that has no case variants, then:

if /C/ matches "a", then /C/i matches "a" and all case variants of "a"
if /C/ doesn't match "a" or any case variant of "a", then /C/i doesn't match "a" or any case variant of "a"
if /C/ matches "1", then /C/i matches "1"
if /C/ doesn't match "1", then /C/i doesn't match "1"

And a uniform approach to complement, where [^[^X]] == X, and [^\p{X}] == [\P{X}]

Andy Heninger · Answer 2 · Fri Jun 04 2021 04:46:20 GMT+0800 (China Standard Time)

In the earlier email thread on this question I had looked into the behavior of other regex implementations. Copying those results here, for the record ...

RE                /[^ab]/i    /[[a-m]&&[F-T]]/i   /[[a-m]--[F-T]]/i
-------------------------------------------------------------------
Java              deep        deep                na
ICU / Android     deep        deep                deep
Python RE         deep        na (future)         na
Python regex      deep        deep                deep
.NET              deep        na                  deep
Rust regex        deep        deep                deep
Perl              deep        deep                deep 
Oniguruma / Ruby  deep        shallow             na

ECMAScript        deep        na                  na
PCRE / PHP        deep        na                  na
RE2 / Go          deep        na                  na
grep ERE          deep        na                  nz

Here are the expected results for the patterns shown and the test strings "abcdefghi" "ABCDEFGHI". Matched letters are shown as themselves; non-matches as ".".The logic for what matches in each case is exactly as Markus described in his previous email.

Pattern                Deep Case Close       Shallow Case Close
----------------------------------------------------------------
/[^ab]/i               ..cdefghi ..CDEFGHI   abcdefghi ABCDEFGHI
/[[a-m]&&[F-T]]/i      .....fghi .....FGHI   ......... .........
/[[a-m]--[F-T]]/i      abcde.... ABCDE....   abcdefghi ABCDEFGHI

Notes:

Perl's support for set expression is experimental, and subject to change. I missed it in the previous iteration. See https://perldoc.perl.org/perlrecharclass#Extended-Bracketed-Character-Classes
It uses (?[ ... ])to delimit set expressions with a new, extended syntax.

Symmetric difference (exclusive OR) is supported by Perl, Python and Rust. '^' for Perl, '~' for the others.

PCRE set expression syntax.

.NET character class subtraction.

Rust character class syntax

Markus Scherer · Answer 3 · Sat Jun 05 2021 02:01:35 GMT+0800 (China Standard Time)

Thanks, Mark & Andy.

These discussions about desired and observed outcomes are useful, but now that we are at stage 2, what I am most looking for is concrete proposals for how to change the ECMAScript spec, together with whether & how that changes u flag behavior, and what the resulting behavior looks like for v expressions that are the same / similar / use features not available in u.

For example, if a proposal involves changing the current semantics of CharSet+invert for a CharacterClass vs. CharSet-only for a CharacterClassEscape, then we need to spell that out, and need to say whether we want to change semantics for existing code, or else document that the same expression will behave differently depending on the v flag.

Mark Davis · Answer 4 · Sat Jun 05 2021 02:08:34 GMT+0800 (China Standard Time)

I suggest that we only propose a change for v. That maintains complete backwards compatibility. If then the committee wants to extend the changes to non-V it's fairly easy to do. I'd be glad to compare what it looks like for the simple option that I outlined. But that would be after I get back from vacation so starting around the 14th

Mathias Bynens · Answer 5 · Fri Jun 11 2021 00:32:30 GMT+0800 (China Standard Time)

@schuay @msaboff, any opinions or preferences from an implementation point of view?

Jakob Gruber · Answer 6 · Mon Jun 14 2021 17:16:51 GMT+0800 (China Standard Time)

Currently no opinion since I'm not too familiar with the implementation. cc @hashseed

Yang Guo · Answer 7 · Mon Jun 14 2021 17:28:39 GMT+0800 (China Standard Time)

Intuitively I would also expect adding ignoreCase to match more, not less. That's just my personal opinion though.

In the current implementation in V8, iirc, we apply case folding close over for individual \p, which of course means that with inversion we end up excluding more. The solution would be to wait with closing over for case folding after the character class has been fully defined.

Markus Scherer · Answer 8 · Fri Jun 18 2021 01:11:14 GMT+0800 (China Standard Time)

Discussed in today's meeting with @mathiasbynens @macchiati @sffc --

We agreed that the behavior of /[^\P{Ll}]/giu is unintuitive/surprising, and that /[^\P{Ll}]/giv (note the v flag) should behave the same as /\p{Ll}/giu and /\p{Ll}/giv.

Therefore, we propose to make v behavior of IgnoreCase different from u behavior. Specifically, under v a CharacterClass always (in both a top-level class and a nested class) resolves the ^ by complementing the CharSet (just like a CharacterClassEscape with an uppercase letter), and it always keeps invert==false.

With this example:

"aAbBcC4#".replaceAll(/\p{Ll}/giu, 'X') outputs "XXXXXX4#"
"aAbBcC4#".replaceAll(/[^\P{Ll}]/giu, 'X') outputs "aAbBcC4#"
"aAbBcC4#".replaceAll(/\p{Ll}/giv, 'X') outputs "XXXXXX4#"
"aAbBcC4#".replaceAll(/[^\P{Ll}]/giv, 'X') outputs "XXXXXX4#"

More generally, this makes [^\p{X}] == [\P{X}] == \P{X} and [^\P{X}] == [\p{X}] == \p{X}, whether IgnoreCase or not.

Cc stage 3 reviewers @waldemarhorwat @gibson042 @msaboff

Waldemar Horwat · Answer 9 · Tue Jun 29 2021 08:18:00 GMT+0800 (China Standard Time)

The invert flag is there for a reason. Doing the above change would break fundamental regexps:

"aAbBcC".replaceAll(/[b]/giu, "X") outputs "aAXXcC"
"aAbBcC".replaceAll(/[^b]/giu, "X") outputs "XXbBXX"
"aAbBcC".replaceAll(/[b]/giv, "X") outputs "aAXXcC"
"aAbBcC".replaceAll(/[^b]/giv, "X") currently outputs "XXbBXX". The change would make it output "XXXXXX". Oops!

Markus Scherer · Answer 10 · Fri Jul 09 2021 01:10:23 GMT+0800 (China Standard Time)

"aAbBcC".replaceAll(/[^b]/giv, "X") currently outputs "XXbBXX". The change would make it output "XXXXXX". Oops!

[^b] includes B so it makes sense, under IgnoreCase, for this class to match both b and B.
Same as [\p{L}--b]: Includes B so should IgnoreCase-match both b and B.
Or any other class that excludes some but not all IgnoreCase kinds of "b".
[^b] should be exactly the same as [[\0-\u{10FFFF}]--b] and [[^b]] etc.
Different ways of arriving at the same class should behave consistently.

Andy Heninger · Answer 11 · Mon Jul 12 2021 07:03:56 GMT+0800 (China Standard Time)

The model I had in my head for ICU case insensitive regex matching is that it should produce the same results that you would get by case-folding both the input text and the pattern, and then doing a regular match with those.

Folding the pattern is tricky, though. Transform it by

folding literals immediately
folding ranges after establishing an unfolded start and end.
folding positive properties \p{X} into an equivalent set expression as soon as they are encountered.
negated properties, \P{X}: fetch the positive property, fold, then negate, yielding a set expression.

Then process the transformed pattern normally (case sensitive, no more folding, no case closing).

ICU follows this behavior, with the the fact that the actual implementation uses case-closed sets and case-insensitive string matching being an optimization.

Taking the example from above,

"aAbBcC".replaceAll(/[^b]/giv, "X")

The transformed pattern and input string become

"aabbcc".replaceAll(/[^b]/gv, "X") which should output XXbbXX

The case-sensitive matching of [^b] would also match a B, but that will never occur because the input string is folded.

One twist in ICU's implementation is that sets use simple case folding/closure, while string literals in the pattern match with full Unicode casing.

Markus Scherer · Answer 12 · Mon Jul 12 2021 08:04:46 GMT+0800 (China Standard Time)

Thanks, Andy, that's useful. It would replace the separate invert flag with early case closure of the set. More expensive while parsing the pattern, less while matching. And if we use the same early closure (before complement) for \P{X} outside of character classes, then everything should be consistent. It would be a change to the behavior of those, of course.

It still seems weird that [^b] looks like a set that should match B but under IgnoreCase does not match either b or B. Waldemar seems to like it that way.

So first we should decide what the various expressions mentioned here should match. I hope we can agree on equivalent expressions matching equivalently. I think that means getting rid of the invert flag one way or the other.

@aheninger What exactly do you mean with “full Unicode casing”? Full string case folding, where "ß" becomes "ss"? If so, do you do that on the fly while matching, or do you full-case-fold the whole input string and maintain a mapping from folded-string offsets back to input-string offsets? How do you handle partial matches, like one pattern s against one half of an input string "ß"?

Markus Scherer · Answer 13 · Mon Jul 12 2021 08:26:19 GMT+0800 (China Standard Time)

Note: When we case-fold a class/set for IgnoreCase, we could add the full-case-folding strings, like [ß] → [ß(ss)].
However, we said we will forbid the complement of a set with strings, and we will validate early using property metadata, not actual property data.
It would probably surprise people that /[^ß]/iv should throw a SyntaxError, and the validation logic would need access to the full-case-folding mappings (or at least the set of characters that full-case-fold to multi-character strings).
The validation logic would also need property data; for example, \p{Ll} which contains ß.
It could also make an expression that works with one Unicode version fail with the next one -- when a code point that was unassigned is assigned a character with a multi-character case folding.

So best to not do that.

Markus Scherer · Answer 14 · Fri Jul 16 2021 02:45:29 GMT+0800 (China Standard Time)

Note: If we wanted to do early case closure to emulate the current behavior of /[^b]/iu without the buggy invert boolean, that would be like the existing behavior of /\w/iu which also does early case closure of \w. I consider it a defect in the current spec that there is no early case closure of other CharacterClassEscapes.

Markus Scherer · Answer 15 · Fri Jul 16 2021 04:25:17 GMT+0800 (China Standard Time)

Looking over the discussion, and trying to read the tea leaves behind Waldemar's comments, I think we might be able to arrive at a consensus around the following:

consistent behavior as in our earlier proposal in #30 (comment)
deep, early case closure replacing the invert boolean as Andy described in #30 (comment)

Ok?

There is still an open question about Andy's approach of full case folding for matching of substrings (not code points in CharSets), see #30 (comment)

Mathias Bynens · Answer 16 · Fri Jul 16 2021 16:09:11 GMT+0800 (China Standard Time)

Looking over the discussion, and trying to read the tea leaves behind Waldemar's comments, I think we might be able to arrive at a consensus around the following:

consistent behavior as in our earlier proposal in #30 (comment)

deep, early case closure replacing the invert boolean as Andy described in #30 (comment)

Ok?

SGTM

Mathias Bynens · Answer 17 · Wed Sep 01 2021 02:21:54 GMT+0800 (China Standard Time)

During today’s TC39 meeting, @waldemarhorwat was concerned that this change breaks patterns like [^x] when the i flag is set. That is not the intention of our proposal, so either Waldemar discovered a spec bug, or we’re missing something. Waldemar, do you have more details?

Markus Scherer · Answer 18 · Wed Sep 01 2021 02:58:04 GMT+0800 (China Standard Time)

During today’s TC39 meeting, @waldemarhorwat was concerned that this change breaks patterns like [^x] when the i flag is set. That is not the intention of our proposal, so either Waldemar discovered a spec bug, or we’re missing something.

Yes, I am baffled.

Current behavior of /[^x]/iu: (see CharacterSetMatcher)

set contains only 'x'
invert = true
matching:
- step h: text 'x' or 'X' get Canonicalize(ch)ed to cc='x'
- step i: there is a member a='x' in the set with a==cc, so found=true
- step k: invert==true && found==true: no match
if the text is 't' then found=false but invert==true: yes match

Proposed behavior of /[^x]/iv using deep case closure:

set contains 'x'
case closure adds 'X' because Canonicalize('X')=='x'
complement: set contains all characters except 'x' and 'X'
invert = false (always)
matching:
- step h: text 'x' or 'X' get Canonicalize(ch)ed to cc='x'
- step i: there is not a member a in the set with a==cc, so found=false
- step j: invert==false && found==false: no match
if the text is 't' then found=true but invert==false: yes match

That is, the behavior for any expression /[^...]/i should be equivalent.

What we are fixing is cases like /\p{Ll}/giu (all letters) ≠ /[^\P{Ll}]/giu (no letters)

Waldemar Horwat · Answer 19 · Wed Sep 01 2021 23:57:16 GMT+0800 (China Standard Time)

Yes, I am baffled.

I was referring to #30 (comment) and #30 (comment) above.

Markus Scherer · Answer 20 · Thu Sep 02 2021 00:58:50 GMT+0800 (China Standard Time)

Yes, I am baffled.

I was referring to #30 (comment) and #30 (comment) above.

We addressed those. The current proposal is to use deep case closure: #30 (comment) --> #30 (comment) (limited to only simple case folding)

Markus Scherer · Answer 21 · Fri Sep 03 2021 01:34:46 GMT+0800 (China Standard Time)

I spelled out the proposed algorithm in this issue's description (text at the top).
PTAL
@macchiati @aheninger @mathiasbynens @gibson042 @bmeck @waldemarhorwat @sffc @schuay @msaboff @hashseed

Shane F. Carr · Answer 22 · Wed Sep 08 2021 08:28:41 GMT+0800 (China Standard Time)

The algorithm sounds good and well-defined; can you provide a few example regexes and what the behavior of those regexes would be with your proposed algorithm?

Markus Scherer · Answer 23 · Fri Sep 10 2021 00:49:04 GMT+0800 (China Standard Time)

Met today: Waldemar, Richard, Mark, Markus.
Waldemar and Richard agree that this proposal is good.
(After a small amendment to make the case closure symmetric. Case closure needs to add Kelvin for k, and k for Kelvin.)

Markus Scherer · Answer 24 · Fri Sep 24 2021 09:56:04 GMT+0800 (China Standard Time)

I updated the draft spec changes with the agreed algorithm, modulo inadvertent bugs. Review would be good. (If you have edit access to the doc, see the version history.)