More syntax characters should be forbidden in ClassSyntaxCharacter

Question

sffc opened this issue 3 years ago

/(a*)/ matches strings with zero or more "a". But currently /[(a*)]/v matches the literal string "a*".
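For comparison, here is a minimal sketch of how the same characters behave today without the v flag. It does not exercise the draft /v behavior described above; it only shows that a legacy character class already treats ( a * ) as four literal members, while the group outside the class treats * as a quantifier. The variable names are illustrative only.

```javascript
// Outside a character class: * is a quantifier and () forms a capturing group.
const group = /^(a*)$/;
const groupResults = ["", "aaa", "a*"].map((s) => group.test(s));

// Inside a legacy character class (no v flag): ( a * ) are four literal members,
// and the class matches exactly one of them.
const legacyClass = /^[(a*)]$/;
const classResults = ["a", "*", "(", "a*"].map((s) => legacyClass.test(s));

console.log(groupResults); // [ true, true, false ]
console.log(classResults); // [ true, true, true, false ]
```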

I think we should try to be consistent where possible on the matching behavior of alternations () outside of character classes and sets of strings [()] (ClassStrings) inside of character classes, because wrapping a string alternation with [] should not cause the matching behavior to change in surprising ways. Concretely, I would like us to require escaping of all SyntaxCharacter in ClassSyntaxCharacter or at least in NonEmptyClassString.

Summarizing the position of other champions based on our discussion:

@mathiasbynens has pointed out that the behavior of alternations and ClassStrings already differs in the sense that alternations create capturing groups, but ClassStrings do not.

I am not as concerned about that behavior difference because it is easily discoverable and only affects the return value after matching has been performed.

@macchiati has pushed back on requiring more syntax characters to be escaped.

I come from the other angle, which is that we should err on the side of requiring escapes. Also, I could see us wanting to make these syntax characters have alternative behavior in the future (see #26).

@markusicu has advocated for keeping the definition of ClassSyntaxCharacter consistent both inside and outside of ClassStrings within the context of a character class. He points out that syntax characters like *, +, ?, etc., are interpreted as literals in character classes already.

My preference would be to require them to be escaped everywhere in a character class ([\*(\*)] instead of [*(*)]), but I would be okay with only requiring the escape in ClassStrings ([*(\*)]).

Markus Scherer · Answer 1 · Fri Jun 25 2021 01:00:40 GMT+0800 (China Standard Time)

Cc stage 3 reviewers @waldemarhorwat @gibson042 @msaboff

Mark Davis · Answer 2 · Fri Jun 25 2021 08:07:55 GMT+0800 (China Standard Time)

I think trying to have identical syntax inside and outside of character classes at this point muddies the water, and has the potential to cause people to give up on strings inside character classes.

Just because some expression X (eg, /(a*(b+a)|b*a|\p{emoji})/) would work outside of a character class doesn't mean that expression X should work inside a character class (eg, /[(a*(b+a)|b*a|\p{emoji})]/), at least without an extensive analysis of what all the implications are, and whether it would be useful.

There is a non-zero cognitive cost to requiring escapes on characters.

Markus Scherer · Answer 3 · Fri Jun 25 2021 08:17:46 GMT+0800 (China Standard Time)

I think trying to have identical syntax inside and outside of character classes at this point muddies the water, and has the potential to cause people to give up on strings inside character classes.

Just because some expression X (eg, /(a*(b+a)|b*a|\p{emoji})/) would work outside of a character class doesn't mean that expression X should work inside a character class (eg, /[(a*(b+a)|b*a|\p{emoji})]/), at least without an extensive analysis of what all the implications are, and whether it would be useful.

I think Shane is lobbying for throwing a syntax error for what looks like match operators but is put inside a character class string, rather than have it be silently accepted as literal string contents.

I myself am skeptical that regex authors would be confused here.

There is a non-zero cognitive cost to requiring escapes on characters.

I agree.

Looking for balance here, hoping for feedback from stage 3 reviewers.

Shane F. Carr · Answer 4 · Fri Jun 25 2021 09:32:10 GMT+0800 (China Standard Time)

I think Shane is lobbying for throwing a syntax error for what looks like match operators but is put inside a character class string, rather than have it be silently accepted as literal string contents.

Right. From a cognitive burden point of view, I think having something silently accepted that has surprising behavior is far worse than a few extra escape characters. To be clear, the characters we're talking about escaping already need to be escaped outside a character class, and they won't appear in every regular expression, so I am skeptical that there is a measurable cost to require these escapes.

Markus Scherer · Answer 5 · Fri Jun 25 2021 10:13:54 GMT+0800 (China Standard Time)

To be clear, the characters we're talking about escaping already need to be escaped outside a character class, and they won't appear in every regular expression, so I am skeptical that there is a measurable cost to require these escapes.

They don't currently need to be escaped inside a character class.

It seems like we have three choices (note, all of this is for inside a character class):

1. Keep it as is. Your hunch is that regex authors will be confused that * and ? and such become part of string literals rather than working as match operators, because of the (string|literal) syntax with parentheses. I am skeptical about that, since * and ? etc. are just literals in character classes today.
2. Require all SyntaxCharacters to be escaped anywhere inside a character class. Mark's hunch is that this is too onerous. We are in fact already requiring more characters to be escaped than before, but this adds more to that list.
3. Require all SyntaxCharacters to be escaped only inside a NonEmptyClassString but not elsewhere in a character class. Someone at the TC39 meeting voiced a desire for escaping to be more consistent across different parts of regular expressions, so this would create an inconsistency.

Shane F. Carr · Answer 6 · Fri Jun 25 2021 10:36:14 GMT+0800 (China Standard Time)

Yes, I think those are the three options.

Let's put it this way. Consider these four "zones": /A(B)[C(D)]/

Zone A: Outside of a character class
Zone B: Inside an alternation (but outside of a character class)
Zone C: Inside of a character class (but not in a NonEmptyClassString)
Zone D: Inside a NonEmptyClassString in a character class

We currently have two sets of escaping rules: one applies to zones A and B, and the other applies to zones C and D. The main difference is that *, +, ?, and a few other characters that are syntax in zones A and B are interpreted as literals in zones C and D. This is Markus's Option 1.

I am making the claim that it is surprising that the escaping rules are different between zones B and D, since they both look similar in the regular expression. So the minimal change would be to restrict zone D's escaping rules to be more like A and B (with the addition of |); that's Markus's Option 3.

However, for consistency's sake, the best route may be to unify all four of these zones to the same set of escape rules. This is Markus's Option 2. A sub-question of Option 2 would be whether we should also require | to be escaped in zones A and C, such that the four zones are actually 100% equivalent in their escape rules.

Mathias Bynens · Answer 7 · Fri Jun 25 2021 20:47:56 GMT+0800 (China Standard Time)

@mathiasbynens has pointed out that the behavior of alternations and ClassStrings already differs in the sense that alternations create capturing groups, but ClassStrings do not.

Another difference is that (x|xy|xyz) matches “longest string first” in ClassStrings, but in source order elsewhere. (I don’t think this is confusing, however — character classes have always been a wildly different context within regular expression patterns.)
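The ordering difference can be sketched as follows. This assumes an engine that supports the v flag as it eventually shipped, where the class string literal is spelled \q{...} rather than the (x|xy|xyz) form under discussion at this point in the thread. compileOrNull is a hypothetical helper so that the snippet degrades gracefully on engines without the v flag.

```javascript
// Hypothetical helper: returns a RegExp, or null if the engine rejects it
// (for example, because it does not support the v flag).
function compileOrNull(source, flags) {
  try {
    return new RegExp(source, flags);
  } catch (e) {
    return null;
  }
}

// Ordinary alternation tries alternatives in source order, so "x" wins.
const sourceOrder = /(x|xy|xyz)/.exec("xyz")[0];

// Class string literal under the v flag: the longest string is tried first.
const classStrings = compileOrNull("[\\q{x|xy|xyz}]", "v");
const longestFirst = classStrings ? classStrings.exec("xyz")[0] : null;

console.log(sourceOrder);  // "x"
console.log(longestFirst); // "xyz" where the v flag is supported, otherwise null
```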

Waldemar Horwat · Answer 8 · Wed Jun 30 2021 06:54:31 GMT+0800 (China Standard Time)

I am skeptical of requiring characters that do not have special meaning inside character classes to be escaped there.

This would break some commonly used idioms for no good reason: some folks like to write /foo[.]bar[.]baz/ as a way to escape the .'s, and I see little harm with letting them continue that practice.
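A small sketch of the idiom in question, using legacy syntax with no v flag. It shows that [.] behaves like \. and unlike a bare dot; the test strings are made up for illustration.

```javascript
const unescaped = /^foo.bar.baz$/;          // . matches any character
const classEscaped = /^foo[.]bar[.]baz$/;   // [.] matches only a literal dot
const backslashEscaped = /^foo\.bar\.baz$/; // \. matches only a literal dot

const idiomResults = [
  unescaped.test("fooXbarYbaz"),        // true
  classEscaped.test("fooXbarYbaz"),     // false
  classEscaped.test("foo.bar.baz"),     // true
  backslashEscaped.test("foo.bar.baz"), // true
];
console.log(idiomResults);
```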

Shane F. Carr · Answer 9 · Thu Jul 01 2021 03:56:59 GMT+0800 (China Standard Time)

@waldemarhorwat Does your comment relate to both zones C and D or only zone C (from #33 (comment))? In other words, would you be okay with requiring escapes within ClassStrings, such as:

/foo[(\.)]bar/v

Waldemar Horwat · Answer 10 · Fri Jul 02 2021 08:55:14 GMT+0800 (China Standard Time)

Why would you want to do that?

Shane F. Carr · Answer 11 · Fri Jul 02 2021 14:16:45 GMT+0800 (China Standard Time)

Why would you want to do that?

See a few comments up: #33 (comment)

Waldemar Horwat · Answer 12 · Sat Jul 03 2021 07:33:07 GMT+0800 (China Standard Time)

I still don't see the rationale.

Escape syntax currently depends only on whether one is inside or outside []. It doesn't change between inside or outside () — you're using two letters (A and B) to describe what is a single zone as far as escaping *, ., etc. is concerned. I see no reason to introduce artificial syntactic gotchas depending on being inside or outside (). We don't have that distinction in the regex language now and shouldn't add it.

Markus Scherer · Answer 13 · Fri Jul 09 2021 01:45:31 GMT+0800 (China Standard Time)

I still don't see the rationale.

Let me just explain where Shane is coming from, since he is out for a bit and I want to make progress on our list of issues. I am not endorsing Shane's suggestion.

Shane is looking at our use of (literal|strings) inside character classes and contends that that looks like an alternation. He thinks that regex authors might try to stick arbitrary regexes into this string literal syntax, for example [(a*)], and expect the class to match what those "expressions" match when they don't get a SyntaxError. So he wants to require escaping SyntaxCharacters at least in string literals.

We had chosen the round-parentheses-with-pipe syntax for string literals deliberately to make it look a little like an alternation, and most of us are not worried about regex authors confusing real alternations outside of character classes with the string literal syntax inside.

Richard Gibson · Answer 14 · Wed Jul 14 2021 17:17:42 GMT+0800 (China Standard Time)

I agree with @sffc about the risk here. Introducing structure inside a character class where none existed before will suggest to at least some practitioners that even more metacharacters have special meaning, and absence of syntax errors in such cases will let unintended regular expressions slip by—heck, just lack of sleep would probably be sufficient for me to misinterpret something like [\p{RGI_Emoji}--\p{ASCII}--(...)].

Mickey Rose · Answer 15 · Mon Aug 09 2021 21:12:51 GMT+0800 (China Standard Time)

A sub-question of Option 2 would be whether we should also require | to be escaped in zones A and C, such that the four zones are actually 100% equivalent in their escape rules.

| is a metacharacter in zone A, so it already needs to be escaped. Making escaping rules 100% equivalent across the whole regular expression would create the clunkiest regex flavour in existence. Not only would you have to escape [.?+*{|] in C, you'd also have to escape - in A because it's a metacharacter in C (along with all the set operators).
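To make the asymmetry concrete, here is a sketch using legacy syntax (no v flag) that contrasts - and | in zone A with the same characters in zone C. The variable names are illustrative.

```javascript
// Zone A (outside a class): - is an ordinary character, | separates alternatives.
const hyphenOutside = /^a-c$/;
const pipeOutside = /^(?:a|b)$/;

// Zone C (inside a legacy class): - forms a range, | is an ordinary member.
const hyphenInside = /^[a-c]$/;
const pipeInside = /^[a|b]$/;

const zoneResults = {
  hyphenOutside: [hyphenOutside.test("a-c"), hyphenOutside.test("b")], // [true, false]
  hyphenInside: [hyphenInside.test("b"), hyphenInside.test("-")],      // [true, false]
  pipeOutside: [pipeOutside.test("a"), pipeOutside.test("|")],         // [true, false]
  pipeInside: [pipeInside.test("a"), pipeInside.test("|")],            // [true, true]
};
console.log(zoneResults);
```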

We had chosen the round-parentheses-with-pipe syntax for string literals deliberately to make it look a little like an alternation

Alternation has nothing to do with parentheses, it's just the pipe. Instead of making parentheses special, you could add another modifier like ^ to change the interpretation of what follows:

[|x|xy|xyz]
// instead of
[(x|xy|xyz)]

Shane F. Carr · Answer 16 · Sat Aug 28 2021 06:23:13 GMT+0800 (China Standard Time)

The primary point of disagreement on this issue is the premise of the OP, that practitioners may experience unexpected behavior on /[(a*)b*]/. I have therefore scheduled this topic for the next TC39 Research Call on September 9.

Markus Scherer · Answer 17 · Fri Sep 17 2021 00:43:10 GMT+0800 (China Standard Time)

We brainstormed about “researching” this topic in the TC39 Research Call last week. The main action item from it was “Team needs to agree on value, path we want to take”

We discussed it this morning in the team meeting.

@macchiati feels strongly that this research would not be a good use of time; very difficult to measure, takes a lot of time, need to iterate, select the right people to do the test
@gibson042 agreed that this would be difficult to do
We think that we should move forward as is, but if there is still a strong concern, then we should fall back to the earlier-proposed {string|literal|syntax} with curly braces – making it again look less like an alternation (and more like in ICU and UTS 18).

Shane F. Carr · Answer 18 · Fri Sep 17 2021 02:59:15 GMT+0800 (China Standard Time)

I do not agree with the above post from Markus.

We had a very productive meeting last week in the TC39 Research Call. There was strong sentiment from that meeting that this is a very worthwhile topic to pursue.
Although there is some work on our end, we have help for writing the questions, running the survey, and gathering the data.
We should neither close this issue nor make any other changes to the syntax without gathering the data first.

Mathias Bynens · Answer 19 · Mon Sep 20 2021 18:40:37 GMT+0800 (China Standard Time)

We don’t need additional research to know that

there is strong pushback against requiring more escaping (see earlier comments in this thread)
we could address Shane’s concerns by reverting to the earlier {string|literal|syntax}

Given that, I would strongly prefer not spending time researching options other than those already on the table:

stick with (string|literal|syntax)
revert to {string|literal|syntax}

Shane F. Carr · Answer 20 · Tue Sep 21 2021 02:13:49 GMT+0800 (China Standard Time)

My concern is about standardizing a syntax that will have surprising behavior to practitioners reading and writing regular expressions.

My concern applies to both of those syntaxes:

[{a*|b*|c*}] for reasons discussed in #17 (comment): I claim that it is misleading for curly braces to not have a specifier character preceding them.
[(a*|b*|c*)] for reasons discussed in this thread: I claim that it is misleading for parentheses to have different syntax character rules depending on their position in the regular expression

In other words, "reverting" to the curly brace syntax does not address my concern.

My concerns are based on hypotheses. The reason data acquisition is appealing to me is that it would help validate or invalidate my hypotheses.

Markus Scherer · Answer 21 · Tue Sep 21 2021 03:05:34 GMT+0800 (China Standard Time)

[(a*|b*|c*)] for reasons discussed in this thread: I claim that it is misleading for parentheses to have different syntax character rules depending on their position in the regular expression

Note that currently (without our proposal and not just in ECMAScript), both parentheses and curly braces have very different rules inside vs. outside of character classes. Outside, parentheses are used for grouping and (with the question mark) various other syntax escapes, and curly braces are used for quantifiers (a{3,5}) and for enclosing details of \u, \p, and \P (and elsewhere also \b{g} etc.). Inside character classes, they are currently all just literal characters.

This means that practitioners have always had to be aware of very different syntax rules outside vs. inside of character classes.

Markus Scherer · Answer 22 · Tue Sep 21 2021 03:14:35 GMT+0800 (China Standard Time)

PS:

We know that several people really don't want to require more escaping than we need. I have a preference for consistent escaping inside of character classes. But I could live with more escaping inside [(string|literal|syntax)] than in the rest of a character class.

Also, we have settled before on the string literal syntax, but I could live with the more verbose [\q{string|literal|syntax}], with the \q prefix, as suggested in UTS 18. If we did go back to this one, then I think we should not need the additional escaping.

Mark Davis · Answer 23 · Tue Sep 21 2021 08:17:50 GMT+0800 (China Standard Time)

Re #17 (comment): I claim that it is misleading for curly braces to not have a specifier character preceding them.

Looking at:

"I'm not a fan of bare {} because when I read {}, I expect to find a modifier that tells me what the {} means. If there is a regular expression like /[p{letter}]/v, it looks like I am looking up a Unicode property "letter", but actually I am just matching the alternation "p" and "letter"."

People using regex seem to have very little problem with realizing that characters inside a CC and outside a CC are different: that in [a*] the * is a literal, and [a]* it is not. And expecting [p{letter}] to work exactly like [\p{letter}] would be a user error.

Given that people don't like excessive escaping, I think the choices at this point are clear:

a. stick with (string|literal|syntax)
b. revert to {string|literal|syntax}
c. reverting further to \q{string|literal|syntax}

I could live with any of these, but prefer (a) since that is what we had settled on.

Mark


Shane F. Carr · Answer 24 · Tue Sep 21 2021 09:46:05 GMT+0800 (China Standard Time)

Note that currently (without our proposal and not just in ECMAScript), both parentheses and curly braces have very different rules inside vs. outside of character classes. Outside, parentheses are used for grouping and (with the question mark) various other syntax escapes, and curly braces are used for quantifiers (a{3,5}) and for enclosing details of \u, \p, and \P (and elsewhere also \b{g} etc.). Inside character classes, they are currently all just literal characters.

This means that practitioners have always had to be aware of very different syntax rules outside vs. inside of character classes.

Thank you for this comment, which presents a counter-argument for my hypothesis.

People using regex seem to have very little problem with realizing that characters inside a CC and outside a CC are different: that in [a*] the * is a literal, and [a]* it is not. And expecting [p{letter}] to work exactly like [\p{letter}] would be a user error.

Yes, a user error. I firmly believe that an important part of our job as spec authors is to design a syntax that is resistant to user errors.
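The kind of slip being discussed can be sketched with today's syntax. The example uses the property value L rather than the thread's "letter" (an assumption made so the property escape is valid in ECMAScript), and shows that dropping the backslash silently turns a property lookup into a set of literal characters instead of raising an error.

```javascript
// Without the backslash and without the u flag, p { L } are literal members.
const literalClass = /^[p{L}]$/;
// With the backslash and the u flag, \p{L} matches any letter.
const propertyClass = /^[\p{L}]$/u;

const eAcute = "\u00e9"; // "é", a letter that is not p, {, L, or }

const slipResults = {
  literal: [literalClass.test("p"), literalClass.test("{"), literalClass.test(eAcute)],
  property: [propertyClass.test("p"), propertyClass.test("{"), propertyClass.test(eAcute)],
};
console.log(slipResults);
// literal:  [true, true, false]
// property: [true, false, true]
```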

Given that people don't like excessive escaping, I think the choices at this point are clear

Let's keep all options on the table so that we know clearly what we are working with.

a. stick with (string|literal|syntax)
b. revert to {string|literal|syntax}
c. reverting further to \q{string|literal|syntax}
d. amend (string|literal|syntax) with more escape rules (multiple ways to do that)

My hypothesis is that (a) will cause confusion to practitioners. @markusicu has offered a counter-argument.

My hypothesis is that (b) will also cause confusion to practitioners. @macchiati agrees. I feel more strongly about this hypothesis than the previous one.

I do not perceive substantial risk for (c) causing confusion to practitioners, but it comes with a (fairly small) ergonomics cost.

My hypothesis remains that (d) offers the best balance between ergonomics and understandability.
@macchiati is opposed, based on the assertion that it requires "excessive escaping," hurting ergonomics. I disagree with that assertion since we are talking about only a handful of syntax characters, and those characters are not particularly common (a claim we could quantify if needed).

So I really see two paths forward:

Agree as a champions group that the premise of this issue, the hypothesis that [(a*|b*|c*)] will cause confusion for practitioners, is false, and close the issue.
Agree as a champions group that the premise of the issue is true, and then choose one of the other choices that we have on the table. It seems that (c) is the most likely fallback option.

The reason I suggested bringing this to the TC39 Research Call was to validate the premise of this issue. I do not believe strongly enough in my hypothesis to suggest that we revert to option (c) without seeing additional data.

Mark Davis · Answer 25 · Tue Sep 21 2021 12:18:18 GMT+0800 (China Standard Time)

My hypothesis is that (b) will also cause confusion to practitioners.

@macchiati agrees. I feel more strongly about this hypothesis than the previous one.

That isn't what I agreed with (sounds like I wasn't clear). Strictly in terms of clarity, I think

1. \q{...} is best (but slightly more awkward)
2. {...} is somewhat worse than \q
3. (...) is somewhat worse than {...}

but I could live with any of them.

I don't think that escaping is required for any of them, except of course that

1. requires | and } be escaped inside, that is, after \q{
2. requires { be escaped outside, and | and } inside
3. requires ( be escaped outside, and | and ) inside

and \ itself, of course.

I only hear one person so far saying that "[(a*|b*|c*)] will cause confusion for practitioners"

Mark


Shane F. Carr · Answer 26 · Tue Sep 21 2021 13:01:53 GMT+0800 (China Standard Time)

That isn't what I agreed with (sounds like I wasn't clear).

Sorry for misunderstanding your position.

I only hear one person so far saying that "[(a*|b*|c*)] will cause confusion for practitioners"

That isn't precisely what I'm saying, but I can count 3 people who have raised the concern:

Myself
@gibson042 in #33 (comment)
Felienne when this topic was raised in the TC39 Research Call

Shane F. Carr · Answer 27 · Tue Sep 21 2021 13:06:23 GMT+0800 (China Standard Time)

Also, it is misrepresenting my position to say that I believe that "[(a*|b*|c*)] will cause confusion for practitioners". I am raising the hypothesis that it might cause confusion, a hypothesis which is based on anecdotal evidence.

I am more than happy to debate the merits of the hypothesis.

Mathias Bynens · Answer 28 · Tue Sep 21 2021 13:16:22 GMT+0800 (China Standard Time)

I would really like to unblock progress on this.

Since {this|syntax} doesn’t address Shane’s concerns, but \q{this|syntax} with the explicit prefix does, let’s just go with that? I’ve always liked this syntax (not “awkward” at all IMHO), it matches UTS#18, and deciding on this option would avoid the need for folks to invest time researching alternatives when a clear precedent already exists.

Mickey Rose · Answer 29 · Tue Sep 21 2021 16:57:10 GMT+0800 (China Standard Time)

@markusicu

Note that currently (without our proposal and not just in ECMAScript), both parentheses and curly braces have very different rules inside vs. outside of character classes.

That is true. However,

Inside character classes, they are currently all just literal characters.

This is slightly incorrect. Curly braces are kind of supervillain characters — they have 3 different meanings outside character class (literal /{x}/, quantifier /x{5}/, escape sequence delimiters /\u{7B}\p{L}/u), and 2 inside character class (literal /[{}]/, escape /[\u{7B}\p{L}]/u).
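The five roles listed above can be checked directly with existing syntax. This is a sketch only; the test strings are made up, and the literal-brace case relies on the web-compatibility grammar for patterns without the u flag.

```javascript
const braceRoles = {
  // Outside a class
  literalOutside: /{x}/.test("{x}"),                // braces as literal characters
  quantifier: /^x{5}$/.test("xxxxx"),               // braces as a quantifier
  escapeOutside: /^\u{7B}\p{L}$/u.test("{a"),       // braces delimiting \u and \p
  // Inside a class
  literalInside: /^[{}]$/.test("{"),                // braces as literal members
  escapeInside: /^[\u{7B}\p{L}]$/u.test("\u00e9"),  // braces delimiting \u and \p
  escapeInsideRejectsClose: /^[\u{7B}\p{L}]$/u.test("}"), // } is not in that class
};
console.log(braceRoles);
// All true except escapeInsideRejectsClose, which is false.
```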

Parentheses, on the other hand, were always literal inside character class.

Mark Davis · Answer 30 · Tue Sep 21 2021 22:13:20 GMT+0800 (China Standard Time)

I'm ok with that.

Markus Scherer · Answer 31 · Wed Sep 22 2021 02:37:07 GMT+0800 (China Standard Time)

Looks like we might be converging on \q{this|syntax}.
I am looking forward to settling this in our meeting this Thursday.

Note

In the draft spec changes so far, inside character classes, we require escaping () so that we can use them for string literals, and we require escaping {}: “reserved for future extensions, and for readability”.

If we go back to \q{string|literals} then we need not require escaping {} except inside string literals.

Follow-up questions

Should we keep requiring escaping () as well as {} “for future extensions, and for readability”? Or should we limit future extensions in exchange for more literal punctuation characters?

We could keep requiring escaping now, and we could stop requiring escaping later if practitioners complain. (“Old” /v expressions would continue to be valid.)

If we didn't require escaping {} outside of string literals, we would have different escaping rules in string literals vs. elsewhere in character classes. That might argue for requiring escaping them. On the other hand, if we allowed them as literals, then we could also revisit the requirement to escape | outside of string literals. Currently we require that “only” for consistent escaping.

Shane F. Carr · Answer 32 · Wed Sep 22 2021 06:06:15 GMT+0800 (China Standard Time)

The status quo is [(a|b|c)]. Changing it to [\q{a|b|c}] means that we think there's a problem with the status quo. I am not confident enough in my hypothesis to suggest that we change.

Mark Davis · Answer 33 · Wed Sep 22 2021 07:26:54 GMT+0800 (China Standard Time)

We have heard a number of voices that the status quo is not acceptable, if it includes requiring escapes for characters like *. Going back to \q addresses that issue.

Mark Davis · Answer 34 · Wed Sep 22 2021 07:31:12 GMT+0800 (China Standard Time)

On the escaping if we go back to \q{...}. I suggest that:

Outside of the \q we not require escaping for (, ), {, or }.
Inside of the \q we require escaping } and |, but not (, ), or {.

Waldemar Horwat · Answer 35 · Wed Sep 22 2021 08:17:49 GMT+0800 (China Standard Time)

I prefer the status quo of [(a|b|c)] but don't feel particularly strongly about it.

I don't want to add more contexts where { can be used freely but } must be escaped. Those invariably eventually cause trouble as we've learned with [ab[cd]ef] in non-unicode mode. Either require both { and } to be escaped or neither.
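A sketch of the [ab[cd]ef] pitfall, as I understand it: without the u flag the inner [ is just another class member, the first ] closes the class, and the trailing ef] is matched literally. With the u flag the same source is rejected. The helper name compiles is hypothetical.

```javascript
// Non-unicode mode: parsed as the class [ab[cd] followed by the literal text "ef]".
const nested = /^[ab[cd]ef]$/;
const nestedResults = [
  nested.test("aef]"), // true: "a" from the class, then "ef]"
  nested.test("[ef]"), // true: "[" is a class member
  nested.test("e"),    // false: "e" is not in the class
];

// Hypothetical helper: reports whether a pattern compiles with the given flags.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    return false;
  }
}

// Unicode mode rejects the stray ] instead of silently accepting it.
const compilesInUnicodeMode = compiles("[ab[cd]ef]", "u");

console.log(nestedResults);         // [ true, true, false ]
console.log(compilesInUnicodeMode); // false
```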

Shane F. Carr · Answer 36 · Wed Sep 22 2021 08:41:04 GMT+0800 (China Standard Time)

I am one of the original advocates for [(a|b|c)] exactly because it allows practitioners to use an existing frame of reference. \q may be a regression.

In other words, if the hypothesis is invalid, I would prefer sticking with [(a|b|c)] with no extra escapes over \q.

I'm dissatisfied with the dismissal of [(a|b|c)] with extra escapes, but I don't have enough will power to continue advocating for that option.

Markus Scherer · Answer 37 · Thu Sep 23 2021 00:39:27 GMT+0800 (China Standard Time)

Looks like we are still struggling to settle this via comments. Meeting tomorrow.

Extra escapes

Sounds like we won't require escaping more characters like * and ? inside character classes, strings or not.

String literal syntax

Shane advocated for (string|literals) because that looked familiar but then had concerns that it looked too familiar, and in the discussion we came up with additional ways that it's something totally different from pipe / pipe+parentheses outside of character classes.

Looks like everyone can live with (string|literals) or \q{string|literals}.

Mark and Mathias most recently lobbied for \q{string|literals} -- for clarity, and to address the concerns with parentheses.

Consistent escaping and future extensions

We need to decide, for inside character classes,

whether we want to require escaping the same set of characters inside string literals and elsewhere in character classes
whether we want to require escaping () and {} in order to reserve them for future extensions (we have had this in the draft for a long time)

Markus Scherer · Answer 38 · Fri Sep 24 2021 01:20:12 GMT+0800 (China Standard Time)

Meeting today with Richard, Mathias, Mark, and myself:

1. We agreed on no extra escapes for “Only in SyntaxCharacter: ^ $ . * + ?”. This is the same as in the draft spec changes so far. (That quote is from the ClassSyntaxCharacter production there.)
2. We agreed on conservative escaping of parentheses and curly braces, for future extensions, even if we don't use one or both for string literals. This is the same as in the draft spec changes so far. If there is push-back on this, then we can drop required escaping except for whatever string literal syntax we end up with. We could even drop the requirement after standardization.
3. We did not make a decision on the string literal syntax, waiting for Shane. In discussion, we are leaning towards \q{}.
   a. We like that \q{} signals that “something is very different here”, and
   b. we like that \q{} does not add a requirement for more escaping. (But now that I think about it, we would still need to require escaping } and so we should also require escaping { -- at least inside of string literals, and for clarity and consistency in character classes in general.)
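A minimal sketch of the "no extra escapes" point, assuming only the existing `u` flag: the characters `^ $ . * + ?` are already plain literals inside a character class, and the agreement above keeps it that way for the new mode. The test strings are illustrative, not taken from the draft spec.

```javascript
// Inside a character class these syntax characters need no backslash.
// ("^" is kept away from the first position, where it would negate the class.)
const literals = /^[.*+?$^]+$/u;

const literalResults = [".*+?$^", "*", "a*", ""].map((s) => literals.test(s));

// Outside a class the same characters are operators, so "a*" is a quantified "a".
const quantified = /^a*$/u;
const quantifiedResults = ["", "aaa", "a*"].map((s) => quantified.test(s));

console.log(literalResults, quantifiedResults);
```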

Shane F. Carr · Answer 39 · Fri Sep 24 2021 03:24:18 GMT+0800 (China Standard Time)

I would like to resolve #46 first.

If our vision is for sets of strings to become more expressive, then we should use (). If our vision is for sets of strings to be dumb, then we should use \q{}.

So, the following options are all okay with me:

1. Adopt the extended syntax in issue 46 with ()
2. Declare issue 46 out of scope for now, use (), and reserve the syntax characters for future use
3. Declare issue 46 out of scope for now, and use \q{} with () reserved for future use

The following options are not okay with me without further research:

Declare issue 46 out of scope for now, use (), and don't reserve the syntax characters for future use
Declare issue 46 out of scope for now, and use \q{} without reserving () for future use

Markus Scherer · Answer 40 · Fri Sep 24 2021 04:15:35 GMT+0800 (China Standard Time)

> If our vision is for sets of strings to become more expressive, then we should use (). If our vision is for sets of strings to be dumb, then we should use \q{}.

I don't see how one informs the other very much. String literals with wildcards could be done either way.

If anything, the ideas for wildcards are likely to make string literals even more different from expressions outside of character classes (some things similar, but much of it different), so it is probably better not to use ().

> So, the following options are all okay with me:
>
> Adopt the extended syntax in issue 46 with ()

I think that this could easily take a couple of months of tossing around ideas for syntax and semantics of wildcards. I don't want to delay our proposal by that much.

> Declare issue 46 out of scope for now, use (), and reserve the syntax characters for future use
>
> Declare issue 46 out of scope for now, and use \q{} with () reserved for future use

I was skeptical about requiring more escapes based on the hypothesis that practitioners might be confused.
However, if there are plausible ideas for a future extension of fancy string literals, then I might be ok with requiring additional escaping just inside string literals.
So I would be ok with these options. Leaning towards the third one -- \q{} with more escaping inside.

Mark Davis · Answer 41 · Fri Sep 24 2021 04:15:55 GMT+0800 (China Standard Time)

I favor at this point not expanding the scope, and instead:

3.1. Declare issue 46 out of scope for now, and use \q{...}. If and when we ever want to do something along the lines of #46, we can handle it by having a new introducer for strings with fancy syntax: \δ{...}, where δ is a suitable available ASCII letter.

Mathias Bynens · Answer 42 · Fri Sep 24 2021 13:31:30 GMT+0800 (China Standard Time)

> I favor at this point not expanding the scope, and instead:
>
> 3.1. Declare issue 46 out of scope for now, and use \q{...}. If and when we ever want to do something along the lines of #46, we can handle it by having a new introducer for strings with fancy syntax: \δ{...}, where δ is a suitable available ASCII letter.

This is my preference as well.

Shane F. Carr · Answer 43 · Fri Sep 24 2021 16:01:08 GMT+0800 (China Standard Time)

I continue to believe that () is an intuitive way for practitioners to write sets of strings, and if we adopt something like #46 in the future, then I would like to use () for that purpose. I therefore continue to stand by my position in option 3 that using \q{} for "dumb" sets of strings in this proposal should be predicated on reserving () for future use.

Markus Scherer · Answer 44 · Fri Sep 24 2021 23:25:19 GMT+0800 (China Standard Time)

The discussion here and in 46 is pushing me away from () and over to \q{} precisely because that lets us define one simple syntax now, and later simply use a new prefix (backslash-new-letter) for something fancier.

Markus Scherer · Answer 45 · Fri Oct 01 2021 06:42:22 GMT+0800 (China Standard Time)

Discussed today with Markus, Mathias, Richard, Mark, Bradley, Shane.
We decided to go back to \q{string|literal|syntax}, not require escaping more characters, and keep consistent escaping inside vs. outside of string literals.
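A minimal sketch of the `\q{string|literal|syntax}` form, assuming an engine that implements it under the `v` flag. The helper name `compileV` and the sample pattern are hypothetical; the pattern is built at run time so that engines without the flag fall back to `null` instead of failing to parse the script.

```javascript
// Hypothetical helper: returns a v-mode RegExp, or null if the engine
// does not support the v flag (or rejects the pattern).
function compileV(source) {
  try {
    return new RegExp(source, "v");
  } catch (e) {
    return null;
  }
}

// A class containing the strings "abc" and "d" plus the range x-z.
const stringSet = compileV("^[\\q{abc|d}x-z]$");

const stringSetResults =
  stringSet === null
    ? null
    : ["abc", "d", "y", "ab"].map((s) => stringSet.test(s));

console.log(stringSetResults);
```

Unlike an alternation such as `(abc|d)` outside a class, the `\q{...}` form creates no capturing group; it only adds strings to the set.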