tc39 / proposal-regexp-v-flag

UTS18 set notation in regular expressions

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

IgnoreCase vs. complement vs. nested class

markusicu opened this issue · comments

Proposal

The current proposal is to use deep case closure: #30 (comment) --> #30 (comment) (limited to only simple case folding)

More specifically:

Define an abstract operation SimpleCaseClosure(A) where A is a CharSet.
(Note: scf = Unicode Simple_Case_Folding: the simple mappings in CaseFolding.txt, as in the Canonicalize(ch) operation)

  • For each single code point c in A, add every other code point d where scf(d)==c or where scf(c)==d

When building a CharSet from a character class or from a CharacterClassEscape, if IgnoreCase==true and /v is specified:

  • literal character c: create a CharSet A with just c and return SimpleCaseClosure(A)
  • range a-b: create a CharSet A with the one contiguous range of code points from a to b and return SimpleCaseClosure(A)
    • Note: The result will often consist of two or more ranges. SimpleCaseClosure([a-z]) will include the separate range [A-Z] and several other non-adjacent code points.
  • \p{X}: resolve the property expression into a CharSet A and return SimpleCaseClosure(A)
  • \P{X}: resolve the property expression into a CharSet A, compute CharSet B = SimpleCaseClosure(A), return the code point complement of B
  • same with \w \W \s \S etc.: look up the property CharSet, compute the case closure, return the code point complement for backslash-uppercase escapes
  • set operations work as usual
  • [^...]: compute the inner character class expression into CharSet A and return the code point complement of A; if this is a top-level CharacterClass, then return with invert=false

Problem description

The current draft spec text includes a TODO to discuss whether to do something special about IgnoreCase.
This becomes interesting when looking at IgnoreCase + complement + nested classes.

Notation: In Unicode=true mode, ES regular expressions apply Unicode Simple_Case_Folding, which has the short name scf.

In our little working group, we had been chewing on this question on and off without coming to a conclusion.
We had been trying to rationalize and match the existing behavior, and discussed doing an early "case closure" when IgnoreCase=true (for each c in the CharSet, add any c2 if scf(c2)=c), at least when a nested class is complemented (and maybe for any nested class regardless), with the goal of being consistent with existing matching behavior.

Then we realized that the existing matching behavior is inconsistent with itself (or at least unintuitive).

Looking at the existing spec:

  • The spec already allows one level of nested classes in the form of CharacterClassEscapes inside CharacterClasses.
  • The complement of a CharacterClassEscape (\ plus uppercase letter) is resolved immediately, computing the code point complement of its CharSet.
  • The complement of a CharacterClass ([^ ClassRanges ]) is deferred via the "invert" boolean until the CharacterSetMatcher operates on the CharSet, rather than complementing the CharSet itself.
  • When IgnoreCase=true, then Canonicalize (which the CharacterSetMatcher invokes) applies the Unicode Simple_Case_Folding (scf) before applying the "invert" complement.

In other words, with IgnoreCase=true, the matching behavior for a CharacterClass is very different from that for a CharacterClassEscape, and using a complemented CharacterClassEscape inside a CharacterClass is different from a normal CharacterClassEscape inside a complemented CharacterClass.

Example:

re1=/\p{Ll}/giu
re2=/[^\P{Ll}]/giu

Naïvely, I expected these to behave the same. Actual results:

  1. "aAbBcC4#".replaceAll(re1, 'X') outputs "XXXXXX4#"
    • The CharSet contains all lowercase letters.
    • invert=false
    • For any letter m in the text, there is a letter n in the set such that scf(m)==scf(n).
    • No match for digits, punctuation, etc.
  2. "aAbBcC4#".replaceAll(re2, 'X') outputs "aAbBcC4#"
    • The CharSet contains all Unicode code points other than lowercase letters: Uppercase/titlecase/other letters, digits, symbols, punctuation, unassigned, etc.
    • invert=true
    • For any letter m in the text, there is a letter n in the set such that scf(m)==scf(n), but "invert" negates "found".
    • No match for any letters.
    • No match for digits/punctuation/etc. either because they are in the CharSet directly and "invert" negates "found".
    • In other words, under IgnoreCase \P{Ll} matches everything (with possible exceptions if there are unusual character properties) and so its CharacterClass complement matches nothing.

In our proposed spec text, we currently do nothing special. Just like in the current spec text, a CharacterClassEscape simply always evaluates to a CharSet, with a code point complement as appropriate. And a nested class with brackets (which does not use the CharacterClass production to avoid having to return a (CharSet, invert) pair) also simply evaluates to a CharSet, with a code point complement for [^ ClassRanges ].

We could consider doing early case closure of nested classes and properties, and/or doing a code point complement of the CharacterClass CharSet and removing the "invert" boolean, and/or something else. We need to weigh "improving" behavior vs. making it different from existing behavior for the same or similar patterns.

I propose that we adopt a uniform approach to case insensitively, where:

if C is a character class, and "a" represents a character that has case variants, and "1" represents a character that has no case variants, then:

  1. if /C/ matches "a", then /C/i matches "a" and all case variants of "a"

  2. if /C/ doesn't match "a" or any case variant of "a", then /C/i doesn't match "a" or any case variant of "a"

  3. if /C/ matches "1", then /C/i matches "1"

  4. if /C/ doesn't match "1", then /C/i doesn't match "1"


And a uniform approach to complement, where [^[^X]] == X, and [^\p{X}] == [\P{X}]

In the earlier email thread on this question I had looked into the behavior of other regex implementations. Copying those results here, for the record ...

RE                /[^ab]/i    /[[a-m]&&[F-T]]/i   /[[a-m]--[F-T]]/i
-------------------------------------------------------------------
Java              deep        deep                na
ICU / Android     deep        deep                deep
Python RE         deep        na (future)         na
Python regex      deep        deep                deep
.NET              deep        na                  deep
Rust regex        deep        deep                deep
Perl              deep        deep                deep 
Oniguruma / Ruby  deep        shallow             na

ECMAScript        deep        na                  na
PCRE / PHP        deep        na                  na
RE2 / Go          deep        na                  na
grep ERE          deep        na                  nz

Here are the expected results for the patterns shown and the test strings "abcdefghi" "ABCDEFGHI". Matched letters are shown as themselves; non-matches as ".".The logic for what matches in each case is exactly as Markus described in his previous email.

Pattern                Deep Case Close       Shallow Case Close
----------------------------------------------------------------
/[^ab]/i               ..cdefghi ..CDEFGHI   abcdefghi ABCDEFGHI
/[[a-m]&&[F-T]]/i      .....fghi .....FGHI   ......... .........
/[[a-m]--[F-T]]/i      abcde.... ABCDE....   abcdefghi ABCDEFGHI

Notes:

Perl's support for set expression is experimental, and subject to change. I missed it in the previous iteration. See https://perldoc.perl.org/perlrecharclass#Extended-Bracketed-Character-Classes
It uses (?[ ... ])to delimit set expressions with a new, extended syntax.

Symmetric difference (exclusive OR) is supported by Perl, Python and Rust. '^' for Perl, '~' for the others.

PCRE set expression syntax.

.NET character class subtraction.

Rust character class syntax

Thanks, Mark & Andy.

These discussions about desired and observed outcomes are useful, but now that we are at stage 2, what I am most looking for is concrete proposals for how to change the ECMAScript spec, together with whether & how that changes u flag behavior, and what the resulting behavior looks like for v expressions that are the same / similar / use features not available in u.

For example, if a proposal involves changing the current semantics of CharSet+invert for a CharacterClass vs. CharSet-only for a CharacterClassEscape, then we need to spell that out, and need to say whether we want to change semantics for existing code, or else document that the same expression will behave differently depending on the v flag.

@schuay @msaboff, any opinions or preferences from an implementation point of view?

Currently no opinion since I'm not too familiar with the implementation. cc @hashseed

Intuitively I would also expect adding ignoreCase to match more, not less. That's just my personal opinion though.

In the current implementation in V8, iirc, we apply case folding close over for individual \p, which of course means that with inversion we end up excluding more. The solution would be to wait with closing over for case folding after the character class has been fully defined.

Discussed in today's meeting with @mathiasbynens @macchiati @sffc --

We agreed that the behavior of /[^\P{Ll}]/giu is unintuitive/surprising, and that /[^\P{Ll}]/giv (note the v flag) should behave the same as /\p{Ll}/giu and /\p{Ll}/giv.

Therefore, we propose to make v behavior of IgnoreCase different from u behavior. Specifically, under v a CharacterClass always (in both a top-level class and a nested class) resolves the ^ by complementing the CharSet (just like a CharacterClassEscape with an uppercase letter), and it always keeps invert==false.

With this example:

  • "aAbBcC4#".replaceAll(/\p{Ll}/giu, 'X') outputs "XXXXXX4#"
  • "aAbBcC4#".replaceAll(/[^\P{Ll}]/giu, 'X') outputs "aAbBcC4#"
  • "aAbBcC4#".replaceAll(/\p{Ll}/giv, 'X') outputs "XXXXXX4#"
  • "aAbBcC4#".replaceAll(/[^\P{Ll}]/giv, 'X') outputs "XXXXXX4#"

More generally, this makes [^\p{X}] == [\P{X}] == \P{X} and [^\P{X}] == [\p{X}] == \p{X}, whether IgnoreCase or not.

Cc stage 3 reviewers @waldemarhorwat @gibson042 @msaboff

The invert flag is there for a reason. Doing the above change would break fundamental regexps:

  • "aAbBcC".replaceAll(/[b]/giu, "X") outputs "aAXXcC"
  • "aAbBcC".replaceAll(/[^b]/giu, "X") outputs "XXbBXX"
  • "aAbBcC".replaceAll(/[b]/giv, "X") outputs "aAXXcC"
  • "aAbBcC".replaceAll(/[^b]/giv, "X") currently outputs "XXbBXX". The change would make it output "XXXXXX". Oops!
  • "aAbBcC".replaceAll(/[^b]/giv, "X") currently outputs "XXbBXX". The change would make it output "XXXXXX". Oops!

[^b] includes B so it makes sense, under IgnoreCase, for this class to match both b and B.
Same as [\p{L}--b]: Includes B so should IgnoreCase-match both b and B.
Or any other class that excludes some but not all IgnoreCase kinds of "b".
[^b] should be exactly the same as [[\0-\u{10FFFF}]--b] and [[^b]] etc.
Different ways of arriving at the same class should behave consistently.

The model I had in my head for ICU case insensitive regex matching is that it should produce the same results that you would get by case-folding both the input text and the pattern, and then doing a regular match with those.

Folding the pattern is tricky, though. Transform it by

  • folding literals immediately
  • folding ranges after establishing an unfolded start and end.
  • folding positive properties \p{X} into an equivalent set expression as soon as they are encountered.
  • negated properties, \P{X}: fetch the positive property, fold, then negate, yielding a set expression.

Then process the transformed pattern normally (case sensitive, no more folding, no case closing).

ICU follows this behavior, with the the fact that the actual implementation uses case-closed sets and case-insensitive string matching being an optimization.

Taking the example from above,

"aAbBcC".replaceAll(/[^b]/giv, "X")

The transformed pattern and input string become

"aabbcc".replaceAll(/[^b]/gv, "X") which should output XXbbXX

The case-sensitive matching of [^b] would also match a B, but that will never occur because the input string is folded.

One twist in ICU's implementation is that sets use simple case folding/closure, while string literals in the pattern match with full Unicode casing.

Thanks, Andy, that's useful. It would replace the separate invert flag with early case closure of the set. More expensive while parsing the pattern, less while matching. And if we use the same early closure (before complement) for \P{X} outside of character classes, then everything should be consistent. It would be a change to the behavior of those, of course.

It still seems weird that [^b] looks like a set that should match B but under IgnoreCase does not match either b or B. Waldemar seems to like it that way.

So first we should decide what the various expressions mentioned here should match. I hope we can agree on equivalent expressions matching equivalently. I think that means getting rid of the invert flag one way or the other.

@aheninger What exactly do you mean with “full Unicode casing”? Full string case folding, where "ß" becomes "ss"? If so, do you do that on the fly while matching, or do you full-case-fold the whole input string and maintain a mapping from folded-string offsets back to input-string offsets? How do you handle partial matches, like one pattern s against one half of an input string "ß"?

Note: When we case-fold a class/set for IgnoreCase, we could add the full-case-folding strings, like [ß][ß(ss)].
However, we said we will forbid the complement of a set with strings, and we will validate early using property metadata, not actual property data.
It would probably surprise people that /[^ß]/iv should throw a SyntaxError, and the validation logic would need access to the full-case-folding mappings (or at least the set of characters that full-case-fold to multi-character strings).
The validation logic would also need property data; for example, \p{Ll} which contains ß.
It could also make an expression that works with one Unicode version fail with the next one -- when a code point that was unassigned is assigned a character with a multi-character case folding.

So best to not do that.

Note: If we wanted to do early case closure to emulate the current behavior of /[^b]/iu without the buggy invert boolean, that would be like the existing behavior of /\w/iu which also does early case closure of \w. I consider it a defect in the current spec that there is no early case closure of other CharacterClassEscapes.

Looking over the discussion, and trying to read the tea leaves behind Waldemar's comments, I think we might be able to arrive at a consensus around the following:

  • consistent behavior as in our earlier proposal in #30 (comment)
  • deep, early case closure replacing the invert boolean as Andy described in #30 (comment)

Ok?

There is still an open question about Andy's approach of full case folding for matching of substrings (not code points in CharSets), see #30 (comment)

Looking over the discussion, and trying to read the tea leaves behind Waldemar's comments, I think we might be able to arrive at a consensus around the following:

  • consistent behavior as in our earlier proposal in #30 (comment)
  • deep, early case closure replacing the invert boolean as Andy described in #30 (comment)

Ok?

SGTM

During today’s TC39 meeting, @waldemarhorwat was concerned that this change breaks patterns like [^x] when the i flag is set. That is not the intention of our proposal, so either Waldemar discovered a spec bug, or we’re missing something. Waldemar, do you have more details?

During today’s TC39 meeting, @waldemarhorwat was concerned that this change breaks patterns like [^x] when the i flag is set. That is not the intention of our proposal, so either Waldemar discovered a spec bug, or we’re missing something.

Yes, I am baffled.

Current behavior of /[^x]/iu: (see CharacterSetMatcher)

  • set contains only 'x'
  • invert = true
  • matching:
    • step h: text 'x' or 'X' get Canonicalize(ch)ed to cc='x'
    • step i: there is a member a='x' in the set with a==cc, so found=true
    • step k: invert==true && found==true: no match
  • if the text is 't' then found=false but invert==true: yes match

Proposed behavior of /[^x]/iv using deep case closure:

  • set contains 'x'
  • case closure adds 'X' because Canonicalize('X')=='x'
  • complement: set contains all characters except 'x' and 'X'
  • invert = false (always)
  • matching:
    • step h: text 'x' or 'X' get Canonicalize(ch)ed to cc='x'
    • step i: there is not a member a in the set with a==cc, so found=false
    • step j: invert==false && found==false: no match
  • if the text is 't' then found=true but invert==false: yes match

That is, the behavior for any expression /[^...]/i should be equivalent.

What we are fixing is cases like /\p{Ll}/giu (all letters) ≠ /[^\P{Ll}]/giu (no letters)

Yes, I am baffled.

I was referring to #30 (comment) and #30 (comment) above.

Yes, I am baffled.

I was referring to #30 (comment) and #30 (comment) above.

We addressed those. The current proposal is to use deep case closure: #30 (comment) --> #30 (comment) (limited to only simple case folding)

I spelled out the proposed algorithm in this issue's description (text at the top).
PTAL
@macchiati @aheninger @mathiasbynens @gibson042 @bmeck @waldemarhorwat @sffc @schuay @msaboff @hashseed

The algorithm sounds good and well-defined; can you provide a few example regexes and what the behavior of those regexes would be with your proposed algorithm?

Met today: Waldemar, Richard, Mark, Markus.
Waldemar and Richard agree that this proposal is good.
(After a small amendment to make the case closure symmetric. Case closure needs to add Kelvin for k, and k for Kelvin.)

I updated the draft spec changes with the agreed algorithm, modulo inadvertent bugs. Review would be good. (If you have edit access to the doc, see the version history.)