String composition syntax

Question

sffc opened this issue 3 years ago · comments

Shane F. Carr commented 3 years ago

We have established that sets of strings need to be bounded.

Can we introduce helper syntax to make large-but-bounded sets of strings more friendly to write?

For example:

Question mark: /[(ab?c)]/ == /[(ac|abc)]/
Braces with a minimum and maximum: /[(ab{1,3}c)]/ == /[(abc|abbc|abbbc)]/
Prefixing or suffixing of nested sets: /[(a[bc]d)]/ == /[(abd|acd)]/
Prefixing or suffixing of nested properties: /[(a\p{L}d)]/ == the set of all letters prefixed with "a" and suffixed with "d"
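
To make the intended equivalences concrete, here is a minimal sketch that models a pattern as a list of segments and takes their Cartesian product. The helper names (literal, optional, repeat, oneOf, expand) are hypothetical and only illustrate the idea; nothing here is proposed syntax or spec text. Note that a minimum of 1 in the braces example means the shortest result is abc.

```javascript
// Hypothetical helpers; none of these names come from the proposal.
// A pattern is modeled as a list of segments, and each segment is the
// list of alternative strings allowed at that position.
const literal = (s) => [s];
const optional = (s) => ["", s];
// All repetitions of s from min to max times, inclusive.
const repeat = (s, min, max) => {
  const out = [];
  for (let n = min; n <= max; n++) out.push(s.repeat(n));
  return out;
};
// One alternative per code point of the given string.
const oneOf = (chars) => Array.from(chars);

// Cartesian product of the segments, concatenating left to right.
function expand(segments) {
  return segments.reduce(
    (prefixes, alternatives) =>
      prefixes.flatMap((p) => alternatives.map((a) => p + a)),
    [""]
  );
}

console.log(expand([literal("a"), optional("b"), literal("c")]).join("|"));
// ac|abc
console.log(expand([literal("a"), repeat("b", 1, 3), literal("c")]).join("|"));
// abc|abbc|abbbc
console.log(expand([literal("a"), oneOf("bc"), literal("d")]).join("|"));
// abd|acd
```

A property such as \p{L} would fit the same model as one very large segment, which is where the size concerns come from.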

I can think of a number of use cases:

Emoji sequences where the skin-tone code point is optional
Alternate spellings of words
Regex compression

Markus Scherer · Answer 1 · Fri Sep 24 2021 02:34:46 GMT+0800 (China Standard Time)

My initial reaction:

Oh no, not another complication!
Something like this might be ok if it's pretty explicit; I don't like putting a property in there like "all letters" which can be large and which will grow over time.
For CLDR, Mark has cooked up something similar, a concise syntax for "ranges" of strings, something like abcd~df=abcd|abce|abcf|abdd|abde|abdf. Used to compress language subtag validity data.
I don't want this to delay our proposal, but if we might want to do something like this, then we would need to at least require escaping of a lot of potential syntax characters inside string literals.
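
For illustration, one way to read the string-range example in point 3 is sketched below. This is an assumption inferred from the single example abcd~df, not the actual CLDR definition: the last characters of the start string vary, position by position, from the start character up to the corresponding character of the end suffix. The function name expandStringRange is hypothetical.

```javascript
// Hypothetical reading of a string range "start~endSuffix": the last
// endSuffix.length characters of start vary, position by position,
// from the start character up to the matching end character.
function expandStringRange(start, endSuffix) {
  const startChars = Array.from(start);
  const endChars = Array.from(endSuffix);
  if (endChars.length > startChars.length) {
    throw new RangeError("end suffix is longer than the start string");
  }
  const split = startChars.length - endChars.length;
  // Begin with the fixed prefix, then extend it one position at a time.
  let results = [startChars.slice(0, split).join("")];
  for (let i = 0; i < endChars.length; i++) {
    const lo = startChars[split + i].codePointAt(0);
    const hi = endChars[i].codePointAt(0);
    if (lo > hi) throw new RangeError("range bounds are out of order");
    const next = [];
    for (const prefix of results) {
      for (let cp = lo; cp <= hi; cp++) {
        next.push(prefix + String.fromCodePoint(cp));
      }
    }
    results = next;
  }
  return results;
}

console.log(expandStringRange("abcd", "df").join("|"));
// abcd|abce|abcf|abdd|abde|abdf
```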

Michael Schmidt · Answer 2 · Fri Sep 24 2021 02:37:27 GMT+0800 (China Standard Time)

I agree that this would be a really nice feature but implementing it will be difficult. E.g. /[([ab]{64})]/ will resolve into 2⁶⁴ strings. This (really nice) feature will probably take us further away from ICU's UnicodeSet which (to my knowledge) is implemented as a character set and a list of strings.

Mark Davis · Answer 3 · Fri Sep 24 2021 02:43:41 GMT+0800 (China Standard Time)

My initial reaction is much like Markus's; let's not let this delay or derail the current proposal. Mark

Waldemar Horwat · Answer 4 · Fri Sep 24 2021 04:46:37 GMT+0800 (China Standard Time)

Are we going to allow general regexps inside character classes? That seems to be the simplest way of doing this. If we go that route:

We need to allow character classes to include infinite sets.
Things like backtracking evaluation order and capturing parentheses will get weird.

Markus Scherer · Answer 5 · Fri Sep 24 2021 05:08:43 GMT+0800 (China Standard Time)

> Are we going to allow general regexps inside character classes?

I really do not want to go there.
And that's not what Shane is suggesting here -- “We have established that sets of strings need to be bounded.” -- he is “only” suggesting ways to abbreviate a finite list of literal strings.

That is, if and when we do this, we would end up with an algorithm in the spec for how to expand strings-with-wildcards into a fixed set of strings, rather than turning the result of a character class into some sort of nested-regex matcher.

As Michael points out, depending on what wildcards are supported, this could easily yield an astronomical number of strings and thus eat a lot of memory, so we should think about security implications.
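
One possible mitigation, sketched here purely as an illustration, is to compute how many strings a pattern would expand to before materializing any of them, and to reject the pattern above some limit. The limit of 10000 and the function names are assumptions, not anything discussed for the spec. The product of segment sizes is an upper bound, since some expansions can produce duplicate strings.

```javascript
// Hypothetical guard; the limit and the names are illustrative only.
const MAX_STRINGS = 10000n;

// Number of strings a list of segments would expand to, computed as a
// BigInt product so that huge counts stay exact and nothing is allocated.
function countExpansions(segmentSizes) {
  return segmentSizes.reduce((total, size) => total * BigInt(size), 1n);
}

// Returns the count, or throws if the pattern would exceed the limit.
function checkExpansion(segmentSizes, limit = MAX_STRINGS) {
  const count = countExpansions(segmentSizes);
  if (count > limit) {
    throw new RangeError(
      `pattern expands to ${count} strings, limit is ${limit}`
    );
  }
  return count;
}

// a[bc]d has segment sizes 1, 2, 1.
console.log(checkExpansion([1, 2, 1]).toString()); // 2

// [ab]{64} has 64 positions with 2 choices each; it is rejected
// without expanding anything.
try {
  checkExpansion(new Array(64).fill(2));
} catch (e) {
  console.log(e.message);
}
```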

Michael Schmidt · Answer 6 · Fri Sep 24 2021 05:54:59 GMT+0800 (China Standard Time)

> depending on what wildcards are supported, this could easily yield an astronomical number of strings

Almost any wildcard that resolves into >1 strings can be used to cause a combinatorial explosion. I don't think that there are any useful wildcards that can be implemented safely if they all get de-sugared into strings.

Examples:

Character class: /[([a-z][a-z][a-z][a-z])]/ accepts 26⁴ strings.
Character set: /[(\w\w\w\w)]/ accepts 63⁴ strings.
Single character set + suffix: /[(\Wa)]/ accepts >1M strings.
Quantifier: /[(a{1,100}b{1,100}c{1,100}d{1,100})]/ accepts 100⁴ strings.
Question mark quantifier: /[(0?1?2?3?4?5?6?7?8?9?_0?1?2?3?4?5?6?7?8?9?)]/ accepts 2²⁰ strings.
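
The counts above can be checked with a few lines of exact arithmetic. This sketch assumes code-point (u or v) mode without case folding, so \w covers 63 characters and \W covers every other code point from U+0000 through U+10FFFF; the labels and variable names are only for illustration.

```javascript
// Exact counts for the examples, using BigInt arithmetic.
const WORD_CHARS = 26 + 26 + 10 + 1; // A-Z, a-z, 0-9 and _
const CODE_POINTS = 0x110000; // U+0000 through U+10FFFF

const counts = {
  "[a-z] four times": 26n ** 4n,
  "\\w four times": BigInt(WORD_CHARS) ** 4n,
  "\\W followed by a": BigInt(CODE_POINTS - WORD_CHARS),
  "four {1,100} quantifiers": 100n ** 4n,
  "twenty ? quantifiers": 2n ** 20n,
};

for (const [label, n] of Object.entries(counts)) {
  console.log(`${label}: ${n}`);
}
// [a-z] four times: 456976
// \w four times: 15752961
// \W followed by a: 1114049
// four {1,100} quantifiers: 100000000
// twenty ? quantifiers: 1048576
```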

Mathias Bynens · Answer 7 · Fri Sep 24 2021 14:58:48 GMT+0800 (China Standard Time)

+1 to exploring this further as a separate follow-up proposal.

We don’t need to do anything special as part of this proposal since \X (where X is an ASCII letter that currently doesn’t have a special escape sequence) is already reserved in the current upstream spec in u mode (and will also be in v mode). (We made sure of that here: https://web.archive.org/web/20141214085510/https://bugs.ecmascript.org/show_bug.cgi?id=3157)
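
The reservation can be observed directly. In the sketch below, \y is picked only as an example of an ASCII letter with no assigned escape meaning, and the helper name compiles is hypothetical.

```javascript
// Returns true if `source` compiles as a RegExp with the given flags.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

// Without the u flag, \y is accepted and treated as a literal "y".
console.log(compiles("\\y", "")); // true
// With the u flag, an unassigned letter escape is a SyntaxError,
// which leaves it free to be given a meaning later.
console.log(compiles("\\y", "u")); // false
```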

If after further investigation we decide to add this functionality, we could handle it by adding a new prefix alongside \q{…} (for simple strings).

Markus Scherer · Answer 8 · Fri Oct 01 2021 06:43:52 GMT+0800 (China Standard Time)

Discussed today with Markus, Mathias, Richard, Mark, Bradley, Shane.
We decided to not pursue these ideas in this proposal.
A future proposal could introduce string-range/abbreviation syntax using either parentheses or a backslash-with-new-letter combination different from \q{...}.