subtract a not-proper-subset

Question

markusicu opened this issue 3 years ago

In the TC39 meeting this week (2021-05-26), someone asked what happens in the subtraction A--B when B is not a proper subset of A.

My response was that this would be fine, just like in math, and that this is how sets are generally implemented in software: {ab}--{bc}={a}
The person asking felt that this is somehow different from math sets and might mask a bug in the pattern.
Mathias responded that it's important to support such a case, so that a regex author can make sure that a character class definitely does not match a certain set of things, without jumping through hoops to ensure that the subtracted set is a proper subset. (Requiring a proper subset could force a nested intersection with the left-side set: [A--[B&&A]] rather than just [A--B].)
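
As a rough illustration of the set semantics described above (not the proposal's spec text), the following sketch models character classes as plain JavaScript Set objects. The helper names difference and intersection are made up for this example. It shows that subtracting a set that is not a subset gives the same result as first intersecting it with the left-hand side.

```javascript
// Hypothetical helpers that model character classes as Sets of strings.
function difference(a, b) {
  // Keep every element of a that is not in b; elements only in b are ignored.
  return new Set([...a].filter((x) => !b.has(x)));
}

function intersection(a, b) {
  // Keep every element of a that is also in b.
  return new Set([...a].filter((x) => b.has(x)));
}

const A = new Set(["a", "b"]);
const B = new Set(["b", "c"]); // not a subset of A

const plain = difference(A, B);                    // models [A--B]
const guarded = difference(A, intersection(B, A)); // models [A--[B&&A]]

console.log([...plain]);   // [ 'a' ]
console.log([...guarded]); // [ 'a' ]
```

Under these assumptions the extra intersection adds nothing except length to the pattern.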

In my mind, character classes define sets in the mathematical sense, and set operations should behave as usual. From the current spec: “A CharSet is a mathematical set of characters.” and the semantics evaluate a CharacterClass and a CharacterClassEscape to a CharSet, using union operations as appropriate (“Return the union of ...”).

Mathias Bynens commented 3 years ago

PR: #36

Mathias Bynens · Answer 1 · Sat May 29 2021 00:47:27 GMT+0800 (China Standard Time)

@ljharb, does this rationale address your concerns? Should we add this to the FAQ or discuss it further?

Jordan Harband · Answer 2 · Tue Jun 01 2021 05:30:21 GMT+0800 (China Standard Time)

I'd certainly add it to the FAQ.

I'm not 100% convinced - programming simply isn't math, and the rules of math don't need to apply in programming - but I don't feel strongly enough to insist on any changes at this time.

Mathias Bynens · Answer 3 · Tue Jun 01 2021 14:12:36 GMT+0800 (China Standard Time)

The example from the slides was:

[\p{RGI_Emoji}--(🇧🇪)]
// → matches 👧🏿 and 🇫🇷 but not 🇧🇪

Perhaps a more illustrative example for this specific concern is the following:

[\p{SomePropertyOfStrings}--(foo)]

The set of strings to which SomePropertyOfStrings expands is defined by Unicode. It seems like a bad idea to make this snippet go from being valid (as long as SomePropertyOfStrings does not include the string foo) to being invalid and throwing an exception when Unicode decides to change SomePropertyOfStrings.

Jordan Harband · Answer 4 · Tue Jun 01 2021 14:24:47 GMT+0800 (China Standard Time)

Is that kind of change likely? Wouldn’t removal of a character from one of these sets be a likely breaking change to JS anyways, especially with this proposal increasing reliance on set membership?

Mark Davis · Answer 5 · Tue Jun 01 2021 22:04:39 GMT+0800 (China Standard Time)

That kind of change isn't, but many others are, because characters are continually being added, and properties do get refined over time, especially for longer-tail scripts. You really don't want the following to suddenly throw an exception or have other bizarre behavior:

[\p{prop1}--\p{prop2}]

Moreover, there are lots of circumstances where you don't want to have to figure out whether something is a proper subset or not. It would just make expressions needlessly complicated.

[\p{script=greek}--\p{lowercase_letter}]

That goes for intersection as well as set subtraction.

[\p{prop1}&&\p{prop2}]
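
For readers who want to try the Greek example in an engine without the proposed syntax, here is a minimal sketch that approximates the subtraction with a negative lookahead under the existing u flag. This is an emulation chosen for illustration, not the proposal's mechanism, and it only covers single code points, not properties of strings. The variable name greekNotLowercase is hypothetical.

```javascript
// Approximates [\p{Script=Greek}--\p{Lowercase_Letter}] for a single code point:
// "is Greek" and "is not a lowercase letter". Nothing here requires the
// lowercase letters to be a subset of the Greek script.
const greekNotLowercase = /^(?!\p{Lowercase_Letter})\p{Script=Greek}$/u;

console.log(greekNotLowercase.test("\u0391")); // true: GREEK CAPITAL LETTER ALPHA
console.log(greekNotLowercase.test("\u03B1")); // false: GREEK SMALL LETTER ALPHA is subtracted
console.log(greekNotLowercase.test("a"));      // false: Latin "a" was never in the Greek set
```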

Mathias Bynens · Answer 6 · Tue Jun 01 2021 22:07:05 GMT+0800 (China Standard Time)

It would also be an implementation/performance concern, as it’d require knowledge of all strings in each set at parse time — something we’ve explicitly been trying to avoid in our proposal.

Jordan Harband · Answer 7 · Tue Jun 01 2021 23:24:43 GMT+0800 (China Standard Time)

Characters being added is one of my concerns. If the character suddenly gets added, then any subtraction can go from a noop to removing a character, without the pattern or flags changing.

Markus Scherer · Answer 8 · Tue Jun 01 2021 23:36:35 GMT+0800 (China Standard Time)

Note that the question of subtracting a not-proper-subset applies to properties of characters just as much as to properties of strings.

“programming simply isn't math, and the rules of math don't need to apply in programming”

In general, this is true, but I am not aware of a commonly used set implementation in programming that does not adhere to what people think of as sets in the mathematical sense; in particular, I am not aware of any that treats removal of a key that is not in the set as an error.
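
JavaScript's own built-in Set is one example of this behavior. The short sketch below (variable names are illustrative) shows that Set.prototype.delete reports whether the element was present by returning a boolean, and does not throw when it was absent.

```javascript
// Set.prototype.delete returns a boolean instead of throwing
// when the element is not in the set.
const letters = new Set(["a", "b"]);

const removedB = letters.delete("b"); // "b" was present
const removedC = letters.delete("c"); // "c" was never in the set; no exception

console.log(removedB, removedC, [...letters]); // true false [ 'a' ]
```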

In regular expression engines that support set subtraction, I also don't see any mention of treating this as an error. They are defined as you would expect. See links to several here: https://github.com/tc39/proposal-regexp-set-notation#whats-the-precedent-in-other-regexp-flavors

Example spec text from XML Schema: “For any ·positive character group· or ·negative character group· G, and any ·character class expression· C, G-C is a valid ·character class subtraction·, identifying the set of all characters in C(G) that are not also in C(C).”

In the .NET regex, limited as the syntax is, there are actually examples that subtract a not-proper-subset:

https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions#CharacterClassSubtraction
Useful example: [\u0000-\uFFFF-[\s\p{P}\p{IsGreek}\x85]] (subtracts white space, punctuation, the IsGreek block, and U+0085 from the BMP range; \p{P} also includes supplementary punctuation characters, which lie outside that range)
Example of something not useful but not forbidden: [a-z-[0-9]]

Markus Scherer · Answer 9 · Tue Jun 01 2021 23:49:04 GMT+0800 (China Standard Time)

“Characters being added is one of my concerns. If the character suddenly gets added, then any subtraction can go from a noop to removing a character, without the pattern or flags changing.”

When you use properties, you want the regular expression to follow along with Unicode versions. If you want perfect stability, then you need to hardcode all ranges. It's a trade-off: stability vs. auto-updating. When dealing with natural languages, there is a benefit to using \p{L} vs. the current extension of [a-zA-Z].

Trivial example: Pick a range of code points that is unassigned now but where Unicode 14 adds a new script:

[[\u{16A70}-\u{16ACF}]--\p{L}]

This is currently a "noop", but once implementations support Unicode 14 it "goes to removing characters". For someone who writes something like this, it might very well be what they want. (For example, picking sample code points that are not letters.)
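
To see how a given engine currently treats that range, one could run a small probe like the sketch below. It uses only the existing u flag and \p{L}, and the variable names are illustrative. The split between letters and non-letters is deliberately not stated here, since it depends on the Unicode version the engine ships with.

```javascript
// Count how many code points in U+16A70..U+16ACF this engine classifies as letters.
const letter = /^\p{L}$/u;
let lettersInRange = 0;
let nonLettersInRange = 0;

for (let cp = 0x16a70; cp <= 0x16acf; cp++) {
  if (letter.test(String.fromCodePoint(cp))) {
    lettersInRange++;
  } else {
    nonLettersInRange++; // these are what the subtraction would keep
  }
}

// The two counts always add up to the 96 code points in the range;
// how they split depends on the engine's Unicode data.
console.log({ lettersInRange, nonLettersInRange });
```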

Markus Scherer · Answer 10 · Wed Jun 02 2021 00:04:02 GMT+0800 (China Standard Time)

Examples from https://github.com/tc39/proposal-regexp-set-notation#illustrative-examples which we got from real code:

[\p{White_Space}--\p{Line_Break=Glue}]
[\p{Emoji}--\p{ASCII}]
[\P{NFC_Quick_Check=No}--\p{Script=Common}--\p{Script=Inherited}--\p{Script=Unknown}]
[[\p{Bidi_Class=R}\p{Bidi_Class=AL}]--\p{Unassigned}]

It would be very awkward having to write these as subtracting proper subsets.

Markus Scherer · Answer 11 · Fri Jul 09 2021 01:53:28 GMT+0800 (China Standard Time)

Cc stage 3 reviewers @waldemarhorwat @gibson042 @msaboff

I feel strongly that we should treat character classes like mathematical sets, as the spec says already, and as motivated by the examples above and by comparison with other regex implementations. Thus I suggest that we add this to the FAQ and then close this issue.