tc39 / proposal-regexp-v-flag

UTS18 set notation in regular expressions

String composition syntax

sffc opened this issue

We have established that sets of strings need to be bounded.

Can we introduce helper syntax to make large-but-bounded sets of strings more friendly to write?

For example:

  1. Question mark: /[(ab?c)]/ == /[(ac|abc)]/
  2. Braces with a minimum and maximum: /[(ab{1,3}c)]/ == /[(abc|abbc|abbbc)]/
  3. Prefixing or suffixing of nested sets: /[(a[bc]d)]/ == /[(abd|acd)]/
  4. Prefixing or suffixing of nested properties: /[(a\p{L}d)]/ == the set of all letters prefixed with "a" and suffixed with "d"
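To make the first three forms concrete, here is a minimal sketch of how such a string could be desugared into a finite list. The function name expandPattern and the tiny grammar it accepts (ASCII literals, [..] classes that list their members explicitly, ? and {min,max}) are assumptions made for illustration; nothing here is proposed spec text, and properties such as \p{L} are deliberately not handled.

```javascript
// Hypothetical helper: expands a tiny subset of the syntax sketched above
// into the finite list of strings it denotes. Illustration only.
function expandPattern(pattern) {
  let results = [""];
  let i = 0;
  while (i < pattern.length) {
    // 1. Read one atom: a [..] class of listed characters, or one literal.
    let choices;
    if (pattern[i] === "[") {
      const close = pattern.indexOf("]", i);
      if (close === -1) throw new SyntaxError("unterminated class");
      choices = [...pattern.slice(i + 1, close)];
      i = close + 1;
    } else {
      choices = [pattern[i]];
      i += 1;
    }
    // 2. Read an optional quantifier: ? or {min,max}.
    let min = 1;
    let max = 1;
    if (pattern[i] === "?") {
      min = 0;
      i += 1;
    } else if (pattern[i] === "{") {
      const m = /^\{(\d+),(\d+)\}/.exec(pattern.slice(i));
      if (!m) throw new SyntaxError("expected {min,max}");
      min = Number(m[1]);
      max = Number(m[2]);
      i += m[0].length;
    }
    // 3. Build every repetition of the atom, from min up to max times.
    let reps = [""];
    for (let n = 0; n < min; n++) {
      reps = reps.flatMap((r) => choices.map((c) => r + c));
    }
    const atomStrings = [...reps];
    for (let n = min; n < max; n++) {
      reps = reps.flatMap((r) => choices.map((c) => r + c));
      atomStrings.push(...reps);
    }
    // 4. Append each repetition to every string built so far.
    results = results.flatMap((prefix) => atomStrings.map((s) => prefix + s));
  }
  // Different paths can produce the same string, so deduplicate.
  return [...new Set(results)];
}
```

Under these assumptions, expandPattern("ab?c") gives ["ac", "abc"] and expandPattern("a[bc]d") gives ["abd", "acd"]. The result list is fully materialized, which is exactly where the size concerns raised later in the thread come from.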

I can think of a number of use cases:

  • Emoji sequences where the skin-tone code point is optional
  • Alternate spellings of words
  • Regex compression

My initial reaction:

  1. Oh no, not another complication!
  2. Something like this might be ok if it's pretty explicit; I don't like putting a property in there like "all letters" which can be large and which will grow over time.
  3. For CLDR, Mark has cooked up something similar, a concise syntax for "ranges" of strings, something like abcd~df=abcd|abce|abcf|abdd|abde|abdf. Used to compress language subtag validity data.
  4. I don't want this to delay our proposal, but if we might want to do something like this, then we would need to at least require escaping of a lot of potential syntax characters inside string literals.
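As a rough illustration of the string-range idea in item 3, the sketch below reproduces the abcd~df example. The semantics are an assumption inferred from that single example (the text after ~ replaces the end of the start string, and each replaced position ranges independently from its start character to its end character); the real CLDR rules may differ, and expandStringRange is a hypothetical name.

```javascript
// Hypothetical helper; semantics guessed from the "abcd~df" example only.
function expandStringRange(range) {
  const [start, endSuffix] = range.split("~");
  if (endSuffix === undefined || endSuffix.length > start.length) {
    throw new SyntaxError("expected start~endSuffix");
  }
  // The part of the start string that the suffix does not cover is fixed.
  const fixed = start.slice(0, start.length - endSuffix.length);
  const tail = start.slice(fixed.length);
  let results = [fixed];
  for (let k = 0; k < tail.length; k++) {
    const lo = tail.charCodeAt(k);
    const hi = endSuffix.charCodeAt(k);
    if (hi < lo) throw new RangeError("range end precedes range start");
    // Extend every string built so far with each character in lo..hi.
    const next = [];
    for (const prefix of results) {
      for (let c = lo; c <= hi; c++) {
        next.push(prefix + String.fromCharCode(c));
      }
    }
    results = next;
  }
  return results;
}
```

With this reading, expandStringRange("abcd~df") yields abcd, abce, abcf, abdd, abde, abdf, matching the expansion quoted above.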

I agree that this would be a really nice feature but implementing it will be difficult. E.g. /[([ab]{64})]/ will resolve into 2^64 strings. This (really nice) feature will probably take us further away from ICU's UnicodeSet which (to my knowledge) is implemented as a character set and a list of strings.

Are we going to allow general regexps inside character classes? That seems to be the simplest way of doing this. If we go that route:

  • We need to allow character classes to include infinite sets.
  • Things like backtracking evaluation order and capturing parentheses will get weird.

“Are we going to allow general regexps inside character classes?”

I really do not want to go there.
And that's not what Shane is suggesting here -- “We have established that sets of strings need to be bounded.” -- he is “only” suggesting ways to abbreviate a finite list of literal strings.

That is, if and when we do this, we would end up with an algorithm in the spec for how to expand strings-with-wildcards into a fixed set of strings, rather than turning the result of a character class into some sort of nested-regex matcher.

As Michael points out, depending on what wildcards are supported, this could easily yield an astronomical number of strings and thus eat a lot of memory, so we should think about security implications.

“depending on what wildcards are supported, this could easily yield an astronomical number of strings”

Almost any wildcard that resolves into >1 strings can be used to cause a combinatorial explosion. I don't think that there are any useful wildcards that can be implemented safely if they all get de-sugared into strings.

Examples:

  • Character class: /[([a-z][a-z][a-z][a-z])]/ accepts 26^4 strings.
  • Character set: /[(\w\w\w\w)]/ accepts 63^4 strings.
  • Single character set + suffix: /[(\Wa)]/ accepts >1M strings.
  • Quantifier: /[(a{1,100}b{1,100}c{1,100}d{1,100})]/ accepts 100^4 strings.
  • Question mark quantifier: /[(0?1?2?3?4?5?6?7?8?9?_0?1?2?3?4?5?6?7?8?9?)]/ accepts 2^20 strings.
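The counts in these examples can be checked without expanding anything, which is also how an implementation could reject an oversized set up front. The sketch below is an assumption-laden illustration: countExpansions and atom are hypothetical names, each atom is described only by its number of alternatives and its repetition bounds, and the result is an upper bound because it does not account for duplicate strings.

```javascript
// Hypothetical helper: describes one atom by how many characters it can
// match and how many times it may repeat (max defaults to min).
const atom = (choices, min = 1, max = min) => ({ choices, min, max });

// Multiplies, over all atoms, the number of strings each atom can produce:
// sum of choices^n for n in min..max. BigInt avoids overflow for counts
// such as 2^64.
function countExpansions(atoms) {
  let total = 1n;
  for (const { choices, min, max } of atoms) {
    let perAtom = 0n;
    for (let n = min; n <= max; n++) {
      perAtom += BigInt(choices) ** BigInt(n);
    }
    total *= perAtom;
  }
  return total;
}

// [a-z][a-z][a-z][a-z]: four atoms with 26 alternatives each.
console.log(countExpansions(Array(4).fill(atom(26)))); // 456976n
// \w\w\w\w: four atoms with 63 alternatives each.
console.log(countExpansions(Array(4).fill(atom(63)))); // 15752961n
// a{1,100}b{1,100}c{1,100}d{1,100}: four single-character atoms, 1..100 times.
console.log(countExpansions(Array(4).fill(atom(1, 1, 100)))); // 100000000n
// Twenty optional digits plus one mandatory underscore.
console.log(countExpansions([...Array(20).fill(atom(1, 0, 1)), atom(1)])); // 1048576n
// [ab]{64} from the earlier comment.
console.log(countExpansions([atom(2, 64)])); // 18446744073709551616n
```

A guard such as comparing the result against some fixed limit before expanding would bound memory use, but any such limit is itself a design decision that this thread does not settle.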

+1 to exploring this further as a separate follow-up proposal.

We don’t need to do anything special as part of this proposal since \X (where X is an ASCII letter that currently doesn’t have a special escape sequence) is already reserved in the current upstream spec in u mode (and will also be in v mode). (We made sure of that here: https://web.archive.org/web/20141214085510/https://bugs.ecmascript.org/show_bug.cgi?id=3157)

If after further investigation we decide to add this functionality, we could handle it by adding a new prefix alongside \q{…} (for simple strings).

Discussed today with Markus, Mathias, Richard, Mark, Bradley, Shane.
We decided to not pursue these ideas in this proposal.
A future proposal could introduce string-range/abbreviation syntax using either parentheses or a backslash-with-new-letter combination different from \q{...}.