tc39 / proposal-regexp-v-flag

UTS18 set notation in regular expressions

only finite sets of strings

markusicu opened this issue

In the TC39 meeting today (2021-may-26) there was some discussion of whether we should prepare for character classes matching infinite sets of strings.

From the start, the proposal has been to extend character classes, and the supported Unicode properties, from finite sets of characters to finite sets of strings. This was the basis for the argument to use \p for properties of strings.

As an example, in UTS #51 there is a very clear distinction between

  1. an emoji zwj sequence, defined via a regular expression that matches an infinite set of strings
  2. the RGI emoji ZWJ sequence set (= the RGI_Emoji_ZWJ_Sequence property) which is a finite set of strings listed in a data file
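
To make the distinction concrete, here is a minimal sketch. The two-entry `sampleRgiZwjSequences` set is a hypothetical stand-in for the real data file, and `looseZwjSequence` is a deliberately simplified approximation of the UTS #51 grammar, not the actual definition. All names are made up for illustration.

```javascript
// Hypothetical stand-in for the RGI emoji ZWJ sequence data file:
// a finite set, here with only two entries.
const sampleRgiZwjSequences = new Set([
  '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}', // family: man, woman, girl
  '\u{1F469}\u200D\u{1F4BB}',                // woman technologist
]);

// Simplified approximation of an "emoji zwj sequence": one emoji followed
// by one or more (ZWJ + emoji) groups. The `+` quantifier is what makes
// the set of matching strings infinite.
const looseZwjSequence = /^\p{Emoji}(?:\u200D\p{Emoji})+$/u;

const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';
// Five "man" emoji joined by ZWJ: well-formed per the loose grammar,
// but not in the finite sample list.
const fiveMen = Array(5).fill('\u{1F468}').join('\u200D');

console.log(sampleRgiZwjSequences.has(family));  // true
console.log(looseZwjSequence.test(family));      // true
console.log(sampleRgiZwjSequences.has(fiveMen)); // false
console.log(looseZwjSequence.test(fiveMen));     // true
```

The finite set can be enumerated, so its complement or difference with another finite set is well defined. The pattern can only be tested against candidate strings.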

It would be possible to support named matchers for infinite sets of strings, that is, a kind of named sub-regular-expression, but that is very different from a finite set, needs to have separate syntax, and should not be allowed inside character classes.

I agree that named matchers for infinite sets of strings could be useful, but I'm not convinced this is part of the MVP. I would prefer pursuing it as a separate follow-up proposal. @waldemarhorwat, does that match your thinking?

That said, here are some thoughts:

It would be possible to support named matchers for infinite sets of strings, that is, a kind of named sub-regular-expression, but that is very different from a finite set, […]

Agreed.

[…] needs to have separate syntax, […]

Not sure I agree. I think we could totally use \p{…} for this as well if we decide to support this in the future. Nothing about our current proposal prevents us from doing that, since \p{SomeUnknownOrUnsupportedProperty} throws an exception.
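
A small sketch of that behavior, assuming a Unicode-mode regular expression (the `u` flag is used here; without `u` or `v`, `\p` is not a property escape at all). The helper name `isSupportedProperty` is made up for illustration.

```javascript
// Returns true if the engine accepts \p{name} in Unicode mode,
// false if constructing the regular expression throws a SyntaxError.
function isSupportedProperty(name) {
  try {
    new RegExp(`\\p{${name}}`, 'u');
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error; // anything else is unexpected, so re-throw it
  }
}

console.log(isSupportedProperty('Script=Greek'));                     // true
console.log(isSupportedProperty('SomeUnknownOrUnsupportedProperty')); // false
```

Because unknown names are rejected at construction time rather than silently matching nothing, a later proposal could give a new name a meaning without changing the behavior of any pattern that works today.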

[…] and should not be allowed inside character classes.

I’m not sure. Mark’s example of [\p{Valid_Emoji}--\p{RGI_Emoji}] seems compelling.
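
Valid_Emoji is not an existing property, so Mark's example cannot be run as written. As a rough sketch of the same `--` (set difference) shape, the code below uses two properties of characters that do exist, Emoji and ASCII. It tries the `v` flag and falls back to an equivalent lookahead under `u` on engines without `v` support. The function name is hypothetical, and this covers only single code points, not sets of strings.

```javascript
// Builds a matcher for "a single Emoji code point that is not ASCII".
function makeNonAsciiEmojiMatcher() {
  try {
    // Set difference inside a character class, as proposed for the v flag.
    return new RegExp('^[\\p{Emoji}--\\p{ASCII}]$', 'v');
  } catch {
    // Fallback for engines without the v flag: a negative lookahead
    // excludes ASCII before \p{Emoji} consumes the code point.
    return /^(?!\p{ASCII})\p{Emoji}$/u;
  }
}

const nonAsciiEmoji = makeNonAsciiEmojiMatcher();

console.log(nonAsciiEmoji.test('\u{1F600}')); // true: grinning face
console.log(nonAsciiEmoji.test('#'));         // false: Emoji, but also ASCII
console.log(nonAsciiEmoji.test('a'));         // false: not Emoji at all
```

The difference is computable here because both operands are finite sets. With an operand that matches an infinite set of strings, the class could no longer be expanded into a list of alternatives, which is the concern raised above.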

Proposed resolution: There is enough reserved syntax (e.g., curly braces) to enable wide-ranging extensions in the future, but we don't plan to build something specific into the proposed spec changes.

Cc stage 3 reviewers @waldemarhorwat @gibson042 @msaboff