tc39 / proposal-regexp-v-flag

UTS18 set notation in regular expressions

subtract a not-proper-subset

markusicu opened this issue

In the TC39 meeting this week (2021-may-26) someone asked what happens in subtraction A--B when B is not a proper subset of A.

  • My response was that this would be fine, just like in math, and that this is how people implement sets in general in computers: {a,b}--{b,c}={a}.
  • The person asking felt like this is somehow different from math sets and might mask a bug in the pattern.
  • Mathias responded that it's important to support such a case, so that a regex author can make sure that a character class definitely does not match a certain set of things, without jumping through hoops to make sure that that is a proper subset. (This could require a nested intersection with the left-side set: [A--[B&&A]] rather than just [A--B].)
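As a minimal sketch of the two forms mentioned above, the following models character classes as plain JavaScript Set objects. The helper names difference and intersection are made up for this illustration and are not part of the proposal. It suggests that A--B and A--(B&&A) give the same result, so the nested intersection adds nothing when subtraction follows the usual set semantics.

```javascript
// Set difference: every element of a that is not in b.
// Elements of b that are absent from a are simply ignored.
function difference(a, b) {
  return new Set([...a].filter((x) => !b.has(x)));
}

// Set intersection: every element of a that is also in b.
function intersection(a, b) {
  return new Set([...a].filter((x) => b.has(x)));
}

const A = new Set(["a", "b"]);
const B = new Set(["b", "c"]); // "c" is not in A, so B is not a subset of A

const direct = difference(A, B);                     // models A--B
const viaSubset = difference(A, intersection(B, A)); // models A--(B&&A)

console.log([...direct]);    // [ 'a' ]
console.log([...viaSubset]); // [ 'a' ]
```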

In my mind, character classes define sets in the mathematical sense, and set operations should behave as usual. From the current spec: “A CharSet is a mathematical set of characters.” and the semantics evaluate a CharacterClass and a CharacterClassEscape to a CharSet, using union operations as appropriate (“Return the union of ...”).

@ljharb, does this rationale address your concerns? Should we add this to the FAQ or discuss it further?

I'd certainly add it to the FAQ.

I'm not 100% convinced - programming simply isn't math, and the rules of math don't need to apply in programming - but I don't feel strongly enough to insist on any changes at this time.

The example from the slides was:

[\p{RGI_Emoji}--(🇧🇪)]
// → matches 👧🏿 and 🇫🇷 but not 🇧🇪 

Perhaps a more illustrative example for this specific concern is the following:

[\p{SomePropertyOfStrings}--(foo)]

The set of strings to which SomePropertyOfStrings expands is defined by Unicode. It seems like a bad idea to make this snippet go from being valid (as long as SomePropertyOfStrings includes the string foo) to being invalid and throwing an exception when Unicode decides to change SomePropertyOfStrings.
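To make the scenario concrete, here is a rough sketch that assumes a hypothetical strict rule: subtracting a string that the left-hand set does not contain throws. No engine is claimed to work this way. The function strictSubtract and both data sets are invented for the illustration.

```javascript
// Hypothetical strict rule: refuse to subtract anything that is not
// contained in the left-hand set. Not the behavior of any known engine.
function strictSubtract(a, b) {
  for (const item of b) {
    if (!a.has(item)) {
      throw new SyntaxError(`cannot subtract "${item}": not in the left-hand set`);
    }
  }
  return new Set([...a].filter((x) => !b.has(x)));
}

// Made-up stand-ins for two Unicode versions of a property of strings.
const propertyOld = new Set(["foo", "bar"]);
const propertyNew = new Set(["bar"]); // "foo" was dropped in the data update

const subtracted = new Set(["foo"]);

// Reports whether the same "pattern" is accepted for a given data version.
function isValid(propertySet) {
  try {
    strictSubtract(propertySet, subtracted);
    return true;
  } catch (e) {
    return false;
  }
}

console.log(isValid(propertyOld)); // true
console.log(isValid(propertyNew)); // false
```

The pattern text is identical in both calls. Only the underlying data changed, which is the stability problem the strict rule would introduce.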

Is that kind of change likely? Wouldn’t removal of a character from one of these sets be a likely breaking change to JS anyways, especially with this proposal increasing reliance on set membership?

It would also be an implementation/performance concern, as it’d require knowledge of all strings in each set at parse time — something we’ve explicitly been trying to avoid in our proposal.

Characters being added is one of my concerns. If the character suddenly gets added, then any subtraction can go from a noop to removing a character, without the pattern or flags changing.

Note that the question of subtracting a not-proper-subset applies to properties of characters just as much as to properties of strings.

“programming simply isn't math, and the rules of math don't need to apply in programming”

In general, this is true, but I am not aware of a commonly used set implementation in programming that does not adhere to what people think of as sets in the mathematical sense; in particular, I am not aware of any that treats removal of a key that is not in the set as an error.
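JavaScript's own built-in Set is one example of this behavior. In the short sketch below, deleting an absent element reports false rather than throwing. The variable names are chosen only for the illustration.

```javascript
const letters = new Set(["a", "b"]);

// Deleting a present element returns true and shrinks the set.
const removedB = letters.delete("b");

// Deleting an absent element returns false; it does not throw.
const removedC = letters.delete("c");

console.log(removedB, removedC, [...letters]); // true false [ 'a' ]
```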

In regular expression engines that support set subtraction, I also don't see any mention of treating this as an error. They are defined as you would expect. See links to several here: https://github.com/tc39/proposal-regexp-set-notation#whats-the-precedent-in-other-regexp-flavors

Example spec text from XML Schema: “For any ·positive character group· or ·negative character group· G, and any ·character class expression· C, G-C is a valid ·character class subtraction·, identifying the set of all characters in C(G) that are not also in C(C).”

In the .NET regex, limited as the syntax is, there are actually examples that subtract a not-proper-subset.

“Characters being added is one of my concerns. If the character suddenly gets added, then any subtraction can go from a noop to removing a character, without the pattern or flags changing.”

When you use properties, you want the regular expression to follow along with Unicode versions. If you want perfect stability, then you need to hardcode all ranges. It's a trade-off, stability vs. auto-updating. When dealing with natural languages, there is a benefit to using \p{L} vs. the current extension of [a-zA-Z].

Trivial example: Pick a range of code points that is unassigned now but where Unicode 14 adds a new script: [[\u{16A70}-\u{16ACF}]--\p{L}]. This is currently a "noop", but once implementations support Unicode 14 it "goes to removing characters". For someone who writes something like this, it might very well be what they want. (For example, picking sample code points that are not letters.)
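A minimal sketch for trying this pattern in an engine: it counts how many code points in the range still match after the subtraction. It assumes the v flag may be missing, so the pattern is built with new RegExp inside a try block. The names countMatches and nonLetters are made up for the illustration, and the count you see depends on the Unicode version your engine ships.

```javascript
// Count how many code points in [first, last] match the given regex.
function countMatches(regex, first, last) {
  let count = 0;
  for (let cp = first; cp <= last; cp++) {
    if (regex.test(String.fromCodePoint(cp))) count++;
  }
  return count;
}

let nonLetters = null; // stays null if the engine lacks the v flag
try {
  // new RegExp (not a literal) so that engines without the v flag throw
  // here at run time instead of failing to parse the whole script.
  const pattern = new RegExp("^[[\\u{16A70}-\\u{16ACF}]--\\p{L}]$", "v");
  nonLetters = countMatches(pattern, 0x16a70, 0x16acf);
} catch (e) {
  // The v flag is not supported by this engine.
}

// The range holds 96 code points. With data older than Unicode 14 the
// subtraction removes nothing; with newer data some of them are letters
// and drop out of the class.
console.log(nonLetters);
```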

Examples from https://github.com/tc39/proposal-regexp-set-notation#illustrative-examples which we got from real code:

  • [\p{White_Space}--\p{Line_Break=Glue}]
  • [\p{Emoji}--\p{ASCII}]
  • [\P{NFC_Quick_Check=No}--\p{Script=Common}--\p{Script=Inherited}--\p{Script=Unknown}]
  • [[\p{Bidi_Class=R}\p{Bidi_Class=AL}]--\p{Unassigned}]

It would be very awkward having to write these as subtracting proper subsets.

Cc stage 3 reviewers @waldemarhorwat @gibson042 @msaboff

I feel strongly that we should treat character classes like mathematical sets, as the spec says already, and as motivated by the examples above and by comparison with other regex implementations. Thus I suggest that we add this to the FAQ and then close this issue.