tc39 / proposal-regexp-v-flag

UTS18 set notation in regular expressions

More syntax characters should be forbidden in ClassSyntaxCharacter

sffc opened this issue

/(a*)/ matches strings with zero or more "a". But currently /[(a*)]/v matches the literal string "a*".
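As a minimal sketch of the underlying contrast, the snippet below uses only legacy (non-v) syntax, so it makes no claim about how any draft of the v flag parses `[(a*)]`. It shows that `*` is a quantifier outside a character class and a plain literal inside one. The variable names are illustrative only.

```javascript
// Outside a character class, * is a quantifier: zero or more "a".
const outsideClass = /^(a*)$/;
// Inside a legacy (non-v) character class, * is just a literal character.
const insideClass = /^[a*]$/;

const quantifierResults = {
  outsideMatchesRun: outsideClass.test("aaa"), // true: "aaa" is a run of "a"
  insideMatchesStar: insideClass.test("*"),    // true: a single literal "*"
  insideMatchesRun: insideClass.test("aaa"),   // false: a class matches one character
};
console.log(quantifierResults);
```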

I think we should try to be consistent where possible on the matching behavior of alternations () outside of character classes and sets of strings [()] (ClassStrings) inside of character classes, because wrapping a string alternation with [] should not cause the matching behavior to change in surprising ways. Concretely, I would like us to require escaping of all SyntaxCharacter in ClassSyntaxCharacter or at least in NonEmptyClassString.

Summarizing the position of other champions based on our discussion:

@mathiasbynens has pointed out that the behavior of alternations and ClassStrings already differs in the sense that alternations create capturing groups, but ClassStrings do not.

  • I am not as concerned about that behavior difference because it is easily discoverable and only affects the return value after matching has been performed.

@macchiati has pushed back on requiring more syntax characters to be escaped.

  • I come from the other angle, which is that we should err on the side of requiring escapes. Also, I could see us wanting to make these syntax characters have alternative behavior in the future (see #26).

@markusicu has advocated for keeping the definition of ClassSyntaxCharacter consistent both inside and outside of ClassStrings within the context of a character class. He points out that syntax characters like *, +, ?, etc., are interpreted as literals in character classes already.

  • My preference would be to require them to be escaped everywhere in a character class ([\*(\*)] instead of [*(*)]), but I would be okay with only requiring the escape in ClassStrings ([*(\*)]).

I think trying to have identical syntax inside and outside of character classes at this point muddies the water, and has the potential to cause people to give up on strings inside character classes.

Just because some expression X (eg, /(a*(b+a)|b*a|\p{emoji})/) would work outside of a character class doesn't mean that expression X should work inside a character class (eg, /[(a*(b+a)|b*a|\p{emoji})]/), at least without an extensive analysis of what all the implications are, and whether it would be useful.

There is a non-zero cognitive cost to requiring escapes on characters.

I think trying to have identical syntax inside and outside of character classes at this point muddies the water, and has the potential to cause people to give up on strings inside character classes.

Just because some expression X (eg, /(a*(b+a)|b*a|\p{emoji})/) would work outside of a character class doesn't mean that expression X should work inside a character class (eg, /[(a*(b+a)|b*a|\p{emoji})]/), at least without an extensive analysis of what all the implications are, and whether it would be useful.

I think Shane is lobbying for throwing a syntax error for what looks like match operators but is put inside a character class string, rather than have it be silently accepted as literal string contents.

I myself am skeptical that regex authors would be confused here.

There is a non-zero cognitive cost to requiring escapes on characters.

I agree.

Looking for balance here, hoping for feedback from stage 3 reviewers.

I think Shane is lobbying for throwing a syntax error for what looks like match operators but is put inside a character class string, rather than have it be silently accepted as literal string contents.

Right. From a cognitive burden point of view, I think having something silently accepted that has surprising behavior is far worse than a few extra escape characters. To be clear, the characters we're talking about escaping already need to be escaped outside a character class, and they won't appear in every regular expression, so I am skeptical that there is a measurable cost to require these escapes.

To be clear, the characters we're talking about escaping already need to be escaped outside a character class, and they won't appear in every regular expression, so I am skeptical that there is a measurable cost to require these escapes.

They don't currently need to be escaped inside a character class.

It seems like we have three choices (note, all of this is for inside a character class):

  1. Keep it as is. Your hunch is that regex authors will be confused that * and ? and such become part of string literals rather than working as match operators, because of the (string|literal) syntax with parentheses. I am skeptical about that, since * and ? etc. are just literals in character classes today.
  2. Require all SyntaxCharacters to be escaped anywhere inside a character class. Mark's hunch is that that is too onerous. We are in fact already requiring more characters to be escaped than before, and this would add more to that list.
  3. Require all SyntaxCharacters to be escaped only inside a NonEmptyClassString but not elsewhere in a character class. Someone at the TC39 meeting voiced a desire for escaping to be more consistent across different parts of regular expressions, so this would create an inconsistency.

Yes, I think those are the three options.

Let's put it this way. Consider these four "zones": /A(B)[C(D)]/

  • Zone A: Outside of a character class
  • Zone B: Inside an alternation (but outside of a character class)
  • Zone C: Inside of a character class (but not in a NonEmptyClassString)
  • Zone D: Inside a NonEmptyClassString in a character class

We currently have two sets of escaping rules: one applies to zones A and B, and the other applies to zones C and D. The main difference is that *, +, ?, and a few other characters that are syntax in zones A and B are interpreted as literals in zones C and D. This is Markus's Option 1.

I am making the claim that it is surprising that the escaping rules are different between zones B and D, since they both look similar in the regular expression. So the minimal change would be to restrict zone D's escaping rules to be more like A and B (with the addition of |); that's Markus's Option 3.

However, for consistency's sake, the best route may be to unify all four of these zones to the same set of escape rules. This is Markus's Option 2. A sub-question of Option 2 would be whether we should also require | to be escaped in zones A and C, such that the four zones are actually 100% equivalent in their escape rules.

@mathiasbynens has pointed out that the behavior of alternations and ClassStrings already differs in the sense that alternations create capturing groups, but ClassStrings do not.

Another difference is that (x|xy|xyz) matches “longest string first” in ClassStrings, but in source order elsewhere. (I don’t think this is confusing, however — character classes have always been a wildly different context within regular expression patterns.)
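The two differences mentioned here (capturing and match order) can be sketched as follows. This is a hedged illustration: it uses the `\q{...}` spelling that the thread eventually settles on rather than the `[(x|xy|xyz)]` draft spelling, and it assumes an engine with v-flag support. The helper `matchClassString` is a hypothetical name; it returns null when the engine rejects the flag.

```javascript
// Alternation outside a character class tries alternatives in source order
// and creates a capturing group.
const sourceOrderMatch = /(x|xy|xyz)/.exec("xyz");
console.log(sourceOrderMatch[0]);     // the first alternative that matches
console.log(sourceOrderMatch.length); // whole match plus one capture

// Assumption: the engine supports the v flag and \q{...} class strings.
// On engines without that support the constructor throws and we return null.
function matchClassString(input) {
  try {
    const re = new RegExp("[\\q{x|xy|xyz}]", "v");
    const m = re.exec(input);
    return m ? { text: m[0], groups: m.length } : null;
  } catch (e) {
    return null;
  }
}

const classStringMatch = matchClassString("xyz");
console.log(classStringMatch);
```

On an engine that implements the v flag, the class-string match is expected to be the longest string, with no capture group recorded.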

I am skeptical of requiring characters that do not have special meaning inside character classes to be escaped there.

This would break some commonly used idioms for no good reason: some folks like to write /foo[.]bar[.]baz/ as a way to escape the .'s, and I see little harm with letting them continue that practice.
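For readers unfamiliar with the idiom, a small sketch of what `[.]` buys the author in today's (non-v) syntax; the variable names are illustrative:

```javascript
// [.] is a one-character class containing a literal dot.
const bracketDots = /^foo[.]bar[.]baz$/;
// A bare . matches any character except a line terminator.
const bareDots = /^foo.bar.baz$/;

const dotResults = {
  bracketMatchesDots: bracketDots.test("foo.bar.baz"),  // true
  bracketMatchesOther: bracketDots.test("fooXbarXbaz"), // false
  bareMatchesOther: bareDots.test("fooXbarXbaz"),       // true
};
console.log(dotResults);
```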

@waldemarhorwat Does your comment relate to both zones C and D or only zone C (from #33 (comment))? In other words, would you be okay with requiring escapes within ClassStrings, such as:

/foo[(\.)]bar/v

Why would you want to do that?

Why would you want to do that?

See a few comments up: #33 (comment)

I still don't see the rationale.

Escape syntax currently depends only on whether one is inside or outside []. It doesn't change between inside or outside () — you're using two letters (A and B) to describe what is a single zone as far as escaping *, ., etc. is concerned. I see no reason to introduce artificial syntactic gotchas depending on being inside or outside (). We don't have that distinction in the regex language now and shouldn't add it.

I still don't see the rationale.

Let me just explain where Shane is coming from, since he is out for a bit and I want to make progress on our list of issues. I am not endorsing Shane's suggestion.

Shane is looking at our use of (literal|strings) inside character classes and contends that that looks like an alternation. He thinks that regex authors might try to stick arbitrary regexes into this string literal syntax, for example [(a*)], and expect the class to match what those "expressions" match when they don't get a SyntaxError. So he wants to require escaping SyntaxCharacters at least in string literals.


We had chosen the round-parentheses-with-pipe syntax for string literals deliberately to make it look a little like an alternation, and most of us are not worried about regex authors confusing real alternations outside of character classes with the string literal syntax inside.

I agree with @sffc about the risk here. Introducing structure inside a character class where none existed before will suggest to at least some practitioners that even more metacharacters have special meaning, and absence of syntax errors in such cases will let unintended regular expressions slip by—heck, just lack of sleep would probably be sufficient for me to misinterpret something like [\p{RGI_Emoji}--\p{ASCII}--(...)].

A sub-question of Option 2 would be whether we should also require | to be escaped in zones A and C, such that the four zones are actually 100% equivalent in their escape rules.

| is a metacharacter in zone A, so it already needs to be escaped. Making escaping rules 100% equivalent across the whole regular expression would create the clunkiest regex flavour in existence. Not only would you have to escape [.?+*{|] in C, you'd also have to escape - in A because it's a metacharacter in C (along with all the set operators).
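A rough sketch of the asymmetry described above, again limited to legacy (non-v) syntax: `|` is syntax outside a class and a literal inside, while `-` is the reverse.

```javascript
const zoneResults = {
  // Zone A/B: | separates alternatives, so a lone "|" does not match.
  pipeOutsideIsSyntax: /^(?:a|b)$/.test("|"),  // false
  // Zone C (legacy syntax): | is a literal member of the class.
  pipeInsideIsLiteral: /^[a|b]$/.test("|"),    // true
  // Zone A/B: - is an ordinary literal.
  dashOutsideIsLiteral: /^a-z$/.test("a-z"),   // true
  // Zone C: - between two characters forms a range, so "-" itself is not matched.
  dashInsideIsRange: /^[a-z]$/.test("m"),      // true
  dashInsideMatchesDash: /^[a-z]$/.test("-"),  // false
};
console.log(zoneResults);
```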

We had chosen the round-parentheses-with-pipe syntax for string literals deliberately to make it look a little like an alternation

Alternation has nothing to do with parentheses, it's just the pipe. Instead of making parentheses special, you could add another modifier like ^ to change the interpretation of what follows:

[|x|xy|xyz]
// instead of
[(x|xy|xyz)]

The primary point of disagreement on this issue is the premise of the OP, that practitioners may experience unexpected behavior on /[(a*)b*]/. I have therefore scheduled this topic for the next TC39 Research Call on September 9.

We brainstormed about “researching” this topic in the TC39 Research Call last week. The main action item from it was “Team needs to agree on value, path we want to take.”

We discussed it this morning in the team meeting.

  • @macchiati feels strongly that this research would not be a good use of time; very difficult to measure, takes a lot of time, need to iterate, select the right people to do the test
  • @gibson042 agreed that this would be difficult to do
  • We think that we should move forward as is, but if there is still a strong concern, then we should fall back to the earlier-proposed {string|literal|syntax} with curly braces – making it again look less like an alternation (and more like in ICU and UTS 18).

I do not agree with the above post from Markus.

  1. We had a very productive meeting last week in the TC39 Research Call. There was strong sentiment from that meeting that this is a very worthwhile topic to pursue.
  2. Although there is some work on our end, we have help for writing the questions, running the survey, and gathering the data.
  3. We should neither close this issue nor make any other changes to the syntax without gathering the data first.

We don’t need additional research to know that

  • there is strong pushback against requiring more escaping (see earlier comments in this thread)
  • we could address Shane’s concerns by reverting to the earlier {string|literal|syntax}

Given that, I would strongly prefer not spending time researching options other than those already on the table:

  1. stick with (string|literal|syntax)
  2. revert to {string|literal|syntax}

My concern is about standardizing a syntax that will have surprising behavior to practitioners reading and writing regular expressions.

My concern applies to both of those syntaxes:

  1. [{a*|b*|c*}] for reasons discussed in #17 (comment): I claim that it is misleading for curly braces to not have a specifier character preceding them.
  2. [(a*|b*|c*)] for reasons discussed in this thread: I claim that it is misleading for parentheses to have different syntax character rules depending on their position in the regular expression

In other words, "reverting" to the curly brace syntax does not address my concern.

My concerns are based on hypotheses. The reason data acquisition is appealing to me is that it would help validate or invalidate my hypotheses.

  1. [(a*|b*|c*)] for reasons discussed in this thread: I claim that it is misleading for parentheses to have different syntax character rules depending on their position in the regular expression

Note that currently (without our proposal and not just in ECMAScript), both parentheses and curly braces have very different rules inside vs. outside of character classes. Outside, parentheses are used for grouping and (with the question mark) various other syntax escapes, and curly braces are used for quantifiers (a{3,5}) and for enclosing details of \u, \p, and \P (and elsewhere also \b{g} etc.). Inside character classes, they are currently all just literal characters.

This means that practitioners have always had to be aware of very different syntax rules outside vs. inside of character classes.

PS:

We know that several people really don't want to require more escaping than we need. I have a preference for consistent escaping inside of character classes. But I could live with more escaping inside [(string|literal|syntax)] than in the rest of a character class.

Also, we have settled before on the string literal syntax, but I could live with the more verbose [\q{string|literal|syntax}], with the \q prefix, as suggested in UTS 18. If we did go back to this one, then I think we should not need the additional escaping.

Note that currently (without our proposal and not just in ECMAScript), both parentheses and curly braces have very different rules inside vs. outside of character classes. Outside, parentheses are used for grouping and (with the question mark) various other syntax escapes, and curly braces are used for quantifiers (a{3,5}) and for enclosing details of \u, \p, and \P (and elsewhere also \b{g} etc.). Inside character classes, they are currently all just literal characters.

This means that practitioners have always had to be aware of very different syntax rules outside vs. inside of character classes.

Thank you for this comment, which presents a counter-argument for my hypothesis.

People using regex seem to have very little problem with realizing that characters inside a CC and outside a CC are different: that in [a*] the * is a literal, and [a]* it is not. And expecting [p{letter}] to work exactly like [\p{letter}] would be a user error.

Yes, a user error. I firmly believe that an important part of our job as spec authors is to design a syntax that is resistant to user errors.

Given that people don't like excessive escaping, I think the choices at this point are clear

Let's keep all options on the table so that we know clearly what we are working with.

a. stick with (string|literal|syntax)
b. revert to {string|literal|syntax}
c. revert further to \q{string|literal|syntax}
d. amend (string|literal|syntax) with more escape rules (multiple ways to do that)

My hypothesis is that (a) will cause confusion to practitioners. @markusicu has offered a counter-argument.

My hypothesis is that (b) will also cause confusion to practitioners. @macchiati agrees. I feel more strongly about this hypothesis than the previous one.

I do not perceive substantial risk for (c) causing confusion to practitioners, but it comes with a (fairly small) ergonomics cost.

My hypothesis remains that (d) offers the best balance between ergonomics and understandability.
@macchiati is opposed, based on the assertion that it requires "excessive escaping," hurting ergonomics. I disagree with that assertion since we are talking about only a handful of syntax characters, and those characters are not particularly common (a claim we could quantify if needed).

So I really see two paths forward:

  1. Agree as a champions group that the premise of this issue, the hypothesis that [(a*|b*|c*)] will cause confusion for practitioners, is false, and close the issue.
  2. Agree as a champions group that the premise of the issue is true, and then choose one of the other choices that we have on the table. It seems that (c) is the most likely fallback option.

The reason I suggested bringing this to the TC39 Research Call was to validate the premise of this issue. I do not believe strongly enough in my hypothesis to suggest that we revert to option (c) without seeing additional data.

That isn't what I agreed with (sounds like I wasn't clear).

Sorry for misunderstanding your position.

I only hear one person so far saying that "[(a*|b*|c*)] will cause confusion for practitioners"

That isn't precisely what I'm saying, but I can count 3 people who have raised the concern:

  1. Myself
  2. @gibson042 in #33 (comment)
  3. Felienne when this topic was raised in the TC39 Research Call

Also, it is misrepresenting my position to say that I believe that "[(a*|b*|c*)] will cause confusion for practitioners". I am raising the hypothesis that it might cause confusion, a hypothesis which is based on anecdotal evidence.

I am more than happy to debate the merits of the hypothesis.

I would really like to unblock progress on this.

Since {this|syntax} doesn’t address Shane’s concerns, but \q{this|syntax} with the explicit prefix does, let’s just go with that? I’ve always liked this syntax (not “awkward” at all IMHO), it matches UTS#18, and deciding on this option would avoid the need for folks to invest time researching alternatives when a clear precedent already exists.

@markusicu

Note that currently (without our proposal and not just in ECMAScript), both parentheses and curly braces have very different rules inside vs. outside of character classes.

That is true. However,

Inside character classes, they are currently all just literal characters.

This is slightly incorrect. Curly braces are kind of supervillain characters: they have 3 different meanings outside character class (literal /{x}/, quantifier /x{5}/, escape sequence delimiters /\u{7B}\p{L}/u), and 2 inside character class (literal /[{}]/, escape /[\u{7B}\p{L}]/u).

Parentheses, on the other hand, were always literal inside character class.
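The meanings listed above can be checked one by one. This is only a sketch: the literal use of `{` outside a class relies on the web-compatibility (Annex B) grammar, so it is tested without the u flag, and the property names are made up for illustration.

```javascript
const braceResults = {
  // Outside a class, non-u mode: {x} is not a valid quantifier, so it is literal.
  literalOutside: /^{x}$/.test("{x}"),
  // Outside a class: {5} is a quantifier.
  quantifierOutside: /^x{5}$/.test("xxxxx"),
  // Outside a class, u mode: braces delimit \u{...} and \p{...} escapes.
  escapeOutside: /^\u{7B}\p{L}$/u.test("{k"),
  // Inside a class: braces are literal members.
  literalInside: /^[{}]$/.test("}"),
  // Inside a class, u mode: braces still delimit escapes.
  escapeInside: /^[\u{7B}\p{L}]$/u.test("{"),
  // Parentheses inside a legacy class are always literal.
  parenInside: /^[()]$/.test("("),
};
console.log(braceResults);
```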

Looks like we might be converging on \q{this|syntax}.
I am looking forward to settling this in our meeting this Thursday.

Note

In the draft spec changes so far, inside character classes, we require escaping () so that we can use them for string literals, and we require escaping {}: “reserved for future extensions, and for readability”.

If we go back to \q{string|literals} then we need not require escaping {} except inside string literals.

Follow-up questions

Should we keep requiring escaping () as well as {} “for future extensions, and for readability”? Or should we limit future extensions in exchange for more literal punctuation characters?

We could keep requiring escaping now, and we could stop requiring escaping later if practitioners complain. (“Old” /v expressions would continue to be valid.)

If we didn't require escaping {} outside of string literals, we would have different escaping rules in string literals vs. elsewhere in character classes. That might argue for requiring escaping them. On the other hand, if we allowed them as literals, then we could also revisit the requirement to escape | outside of string literals. Currently we require that “only” for consistent escaping.

The status quo is [(a|b|c)]. Changing it to [\q{a|b|c}] means that we think there's a problem with the status quo. I am not confident enough in my hypothesis to suggest that we change.

On the escaping if we go back to \q{...}. I suggest that:

  • Outside of the \q we not require escaping for (, ), {, or }.
  • Inside of the \q we require escaping } and |, but not (, ), or {.

I prefer the status quo of [(a|b|c)] but don't feel particularly strongly about it.

I don't want to add more contexts where { can be used freely but } must be escaped. Those invariably eventually cause trouble as we've learned with [ab[cd]ef] in non-unicode mode. Either require both { and } to be escaped or neither.
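To illustrate the kind of trouble referred to here, a small sketch of how `[ab[cd]ef]` behaves today. In non-unicode mode the class ends at the first `]`, and the trailing `ef]` is read as literal text. The helper name `compilesWithFlag` is hypothetical.

```javascript
// Non-unicode mode: parsed as the class [ab[cd] followed by the literal "ef]".
const legacyNested = /^[ab[cd]ef]$/;

// Returns true when the source compiles under the given flags.
function compilesWithFlag(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    return false;
  }
}

const nestedResults = {
  matchesClassThenLiteral: legacyNested.test("aef]"), // true
  bracketIsClassMember: legacyNested.test("[ef]"),    // true
  matchesSingleE: legacyNested.test("e"),             // false
  // In u mode the unescaped trailing ] is a syntax error.
  compilesInUnicodeMode: compilesWithFlag("[ab[cd]ef]", "u"), // false
};
console.log(nestedResults);
```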

I am one of the original advocates for [(a|b|c)] exactly because it allows practitioners to use an existing frame of reference. \q may be a regression.

In other words, if the hypothesis is invalid, I would prefer sticking with [(a|b|c)] with no extra escapes over \q.

I'm dissatisfied with the dismissal of [(a|b|c)] with extra escapes, but I don't have enough will power to continue advocating for that option.

Looks like we are still struggling to settle this via comments. Meeting tomorrow.

Extra escapes

Sounds like we won't require escaping more characters like * and ? inside character classes, strings or not.

String literal syntax

Shane advocated for (string|literals) because that looked familiar but then had concerns that it looked too familiar, and in the discussion we came up with additional ways that it's something totally different from pipe / pipe+parentheses outside of character classes.

Looks like everyone can live with (string|literals) or \q{string|literals}.

Mark and Mathias most recently lobbied for \q{string|literals} -- for clarity, and to address the concerns with parentheses.

Consistent escaping and future extensions

We need to decide, for inside character classes,

  • whether we want to require escaping the same set of characters inside string literals and elsewhere in character classes
  • whether we want to require escaping () and {} in order to reserve them for future extensions (we have had this in the draft for a long time)

Meeting today with Richard, Mathias, Mark, and myself:

  1. We agreed on no extra escapes for “Only in SyntaxCharacter: ^ $ . * + ?”. This is the same as in the draft spec changes so far. (That quote is from the ClassSyntaxCharacter production there.)
  2. We agreed on conservative escaping of parentheses and curly braces, for future extensions, even if we don't use one or both for string literals. This is the same as in the draft spec changes so far. If there is push-back on this, then we can drop required escaping except for whatever string literal syntax we end up with. We could even drop the requirement after standardization.
  3. We did not make a decision on the string literal syntax, waiting for Shane. In discussion, we are leaning towards \q{}.
    a. We like that \q{} signals that “something is very different here”, and
    b. we like that \q{} does not add a requirement for more escaping. (But now that I think about it, we would still need to require escaping } and so we should also require escaping { -- at least inside of string literals, and for clarity and consistency in character classes in general.)

I would like to resolve #46 first.

If our vision is for sets of strings to become more expressive, then we should use (). If our vision is for sets of strings to be dumb, then we should use \q{}.

So, the following options are all okay with me:

  1. Adopt the extended syntax in issue 46 with ()
  2. Declare issue 46 out of scope for now, use (), and reserve the syntax characters for future use
  3. Declare issue 46 out of scope for now, and use \q{} with () reserved for future use

The following options are not okay with me without further research:

  1. Declare issue 46 out of scope for now, use (), and don't reserve the syntax characters for future use
  2. Declare issue 46 out of scope for now, and use \q{} without reserving () for future use

If our vision is for sets of strings to become more expressive, then we should use (). If our vision is for sets of strings to be dumb, then we should use \q{}.

I don't see how one informs the other very much. String literals with wildcards could be done either way.

If anything, the ideas for wildcards are likely to end up with string literals being yet more different from expressions outside of character classes (some stuff similar, but much different), so probably actually better not to use ().

So, the following options are all okay with me:

  1. Adopt the extended syntax in issue 46 with ()

I think that this could easily take a couple of months of tossing around ideas for syntax and semantics of wildcards. I don't want to delay our proposal by that much.

  1. Declare issue 46 out of scope for now, use (), and reserve the syntax characters for future use
  2. Declare issue 46 out of scope for now, and use \q{} with () reserved for future use

I was skeptical about requiring more escapes based on the hypothesis that practitioners might be confused.
However, if there are plausible ideas for a future extension of fancy string literals, then I might be ok with requiring additional escaping just inside string literals.
So I would be ok with these options. Leaning towards the third one -- \q{} with more escaping inside.

I favor at this point not expanding the scope, and instead:

3.1. Declare issue 46 out of scope for now, and use \q{...}. If and when we ever want to do something along the lines of #46, we can handle it by having a new introducer for strings with fancy syntax: \δ{...}, where δ is a suitable available ASCII letter.

I favor at this point not expanding the scope, and instead:

3.1. Declare issue 46 out of scope for now, and use \q{...}. If and when we ever want to do something along the lines of #46, we can handle it by having a new introducer for strings with fancy syntax: \δ{...}, where δ is a suitable available ASCII letter.

This is my preference as well.

I continue to believe that () is an intuitive way for practitioners to write sets of strings, and if we adopt something like #46 in the future, then I would like to use () for that purpose. I therefore continue to stand by my position in option 3 that using \q{} for "dumb" sets of strings in this proposal should be predicated on reserving () for future use.

The discussion here and in 46 is pushing me away from () and over to \q{} precisely because that lets us define one simple syntax now, and later simply use a new prefix (backslash-new-letter) for something fancier.

Discussed today with Markus, Mathias, Richard, Mark, Bradley, Shane.
We decided to go back to \q{string|literal|syntax}, not require escaping more characters, and keep consistent escaping inside vs. outside of string literals.
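A brief sketch of what that decision means for the example that opened this issue, assuming an engine that implements the v flag with `\q{...}`. The helper `tryClassString` is a hypothetical name and returns null on engines that reject the flag.

```javascript
// Assumption: v-flag support. Without it, the constructor throws and we return null.
function tryClassString() {
  try {
    const re = new RegExp("^[\\q{a*|b}]$", "v");
    return {
      matchesLiteralStar: re.test("a*"), // * is literal inside \q{...}
      matchesSecondString: re.test("b"),
      matchesRunOfA: re.test("aaa"),     // no quantifier semantics
    };
  } catch (e) {
    return null;
  }
}

const classStringResults = tryClassString();
console.log(classStringResults);
```

With no extra escapes required, `*` inside `\q{...}` stays a literal, and the explicit `\q` prefix is what signals that the contents are not an alternation.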