tc39 / proposal-regexp-modifiers

Regular Expression Pattern Modifiers for ECMAScript

Home Page: https://arai-a.github.io/ecma262-compare/?pr=3221

Supported Modifier Flags

rbuckton opened this issue

In the October 2021 plenary, @michaelficarra asked that we outline and provide motivating examples for each flag we are considering as a supported modifier.

The flags currently under consideration are:

  • i — ignore-case
    • Rationale — Toggling ignore-case is useful when different parts of a pattern need different case sensitivity, or when parsing patterns provided via JSON configuration. It is especially useful when working with complex Unicode character ranges.
    • Example — Match an uppercase ASCII letter followed by one or more uppercase or lowercase ASCII letters or apostrophes (')
      const re = /^[A-Z](?i)[a-z']+$/;
      re.test("O'Neill"); // true
      re.test("o'neill"); // false
      
      // alternatively (defaulting to ignore-case):
      const re2 = /^(?-i:[A-Z])[a-z']+$/i;
    • Example — Match a word starting with D followed by a word starting with D or d (from the .NET documentation; see footnote 1)
      const re = /\b(D\w+)(?ix)\s(d\w+)\b/g;
      const input = "double dare double Double a Drooling dog The Dreaded Deep";
      re.exec(input); // ["Drooling dog", "Drooling", "dog"]
      re.exec(input); // ["Dreaded Deep", "Dreaded", "Deep"]
  • m — multiline
    • Rationale — Flexibility in matching beginning-of-buffer vs. beginning-of-line or end-of-buffer vs. end-of-line in a complex pattern.
    • Example — Match a frontmatter block at the start of a file
      const re = /^---(?m)$\n((?:^(?!---$).*$\n)*)^---$/; // \n consumes each line break; ^ and $ are zero-width
      re.test("---a"); // false
      re.test("---\n---"); // true
      re.test("---\na: b\n---"); // true
  • s — dot-all (i.e., "single line")
    • Rationale — Control over . matching semantics within a pattern.
    • Example
      const re = /a.c(?s:.)*x.z/;
      re.test("a\ncx\nz"); // false
      re.test("abcdxyz"); // true
      re.test("aBc\nxYz"); // true
  • x — Extended Mode. This flag is proposed by https://github.com/tc39/proposal-regexp-x-mode
    • Rationale — Would allow control over significant whitespace handling in a pattern.
    • Example — Disabling x mode when composing a complex pattern:
      const idPattern = String.raw`[a-z]{2} \d{4}`; // space required; String.raw keeps \d intact
      const re = new RegExp(String.raw`
        # match the id
        (?<id>(?-x:${idPattern}))
        
        # match a separator
        :\s
        
        # match the value
        (?<value>\w+)
      `, "x");
      
      re.exec("aa0123: foo")?.groups; // undefined
      re.exec("aa 0123: foo")?.groups; // { id: "aa 0123", value: "foo" }

Flags likely too complex to support:

  • u — Unicode. This flag affects how a pattern is parsed, not how it is matched. Supporting it would likely require a cover grammar and additional static semantics.
  • v — Extended Unicode. This flag is proposed by https://github.com/tc39/proposal-regexp-set-notation as an extension of the u flag and would have the same difficulties.
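
To illustrate the point that u changes how a pattern is parsed, here is a minimal sketch that compiles the same source text with and without the u flag. The sample sources and variable names are only illustrative and do not come from the proposal.

```javascript
// The same source text is parsed differently depending on the u flag.
const source = String.raw`\u{61}`;

const legacy = new RegExp(source);       // parsed as the letter "u" repeated 61 times
const unicode = new RegExp(source, "u"); // parsed as a code point escape for "a"

const legacyMatchesA = legacy.test("a");
const legacyMatchesUs = legacy.test("u".repeat(61));
const unicodeMatchesA = unicode.test("a");

// Some sources are valid in only one of the two modes.
const identityEscape = String.raw`\a`;
const legacyAccepts = new RegExp(identityEscape).test("a");
let unicodeRejects = false;
try {
  new RegExp(identityEscape, "u");
} catch (err) {
  unicodeRejects = err instanceof SyntaxError;
}
```

Because the grammar itself differs, a mid-pattern switch would change which parse rules apply to the rest of the source, which is the difficulty described above.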

Flags that will never be supported:

  • g — Global. This flag affects the index at which matching starts and not the matching behavior itself. Changing it mid-pattern would have no effect.
  • y — Sticky. This flag affects the index at which matching starts and not the matching behavior itself. Changing it mid-pattern would have no effect.
  • d — Indices. This flag affects the match result. Changing it mid-pattern would have no effect.
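
A short sketch of why these three flags act outside the pattern itself: g and y only change where matching starts (via lastIndex), and d only changes the shape of the result. The input string and variable names are illustrative.

```javascript
const input = "ab ab";

// g: exec() resumes from lastIndex, so successive calls walk through the input.
const globalRe = /ab/g;
const firstIndex = globalRe.exec(input).index;
const secondIndex = globalRe.exec(input).index;

// y: the match must begin exactly at lastIndex.
const stickyRe = /ab/y;
stickyRe.lastIndex = 1;
const stickyResult = stickyRe.exec(input); // "ab" does not start at index 1

// d: the matched text is the same; the result only gains an `indices` array.
const withIndices = /a(b)/d.exec(input);
const withoutIndices = /a(b)/.exec(input);
const sameText = withIndices[0] === withoutIndices[0];
const groupRange = withIndices.indices[1]; // [start, end] of capture group 1
```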

Footnotes

  1. https://docs.microsoft.com/en-us/dotnet/standard/base-types/miscellaneous-constructs-in-regular-expressions#inline-options

For the examples, can you share how you'd do it without the relevant proposal?

i

Simple cases like /[A-Z][A-Za-z]/ are trivial:

// match an uppercase ASCII letter followed by a mixed-case ASCII letter

// with 'i' modifier:
/[A-Z](?i)[A-Z]/

// without 'i' modifier:
/[A-Z][A-Za-z]/

However, more complex cases are far from trivial:

// match a mixed case "hello" followed by the exact characters "World"

// with 'i' modifier:
/(?i:hello) World/

// without 'i' modifier:
/[Hh][Ee][Ll][Ll][Oo] World/
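
The hand-expanded form above can be generated mechanically, which shows how much bookkeeping the modifier removes. This is a minimal sketch, assuming ASCII-only input; `asciiIgnoreCase` is a hypothetical helper, not part of the proposal or any library.

```javascript
// Hypothetical helper: expands each ASCII letter of a literal into a
// two-character class, e.g. "h" -> "[Hh]". Other characters are escaped
// so that they match literally.
function asciiIgnoreCase(literal) {
  return Array.from(literal, (ch) => {
    if (/[A-Za-z]/.test(ch)) {
      return `[${ch.toUpperCase()}${ch.toLowerCase()}]`;
    }
    return ch.replace(/[\\^$.*+?()[\]{}|\/]/g, "\\$&");
  }).join("");
}

const expanded = asciiIgnoreCase("hello"); // "[Hh][Ee][Ll][Ll][Oo]"
const helloWorld = new RegExp(`^${expanded} World$`);

const mixedCaseHello = helloWorld.test("HeLLo World");
const lowerCaseWorld = helloWorld.test("hello world");
```

The sketch only covers ASCII letters; it does not attempt the case folding that the i flag performs for other characters.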

m

If you are in u mode, you could emulate non-m mode when in m mode using the proposed \A and \z buffer boundaries. However, if you are not in u mode, there's no way to match the buffer boundaries when in m mode:

// with 'm' modifier:
/^---(?m)$\n((?:^(?!---$).*$\n)*)^---$/

// without the 'm' modifier, in 'u' mode:
/\A---$\n((?:^(?!---$).*$\n)*)^---$/mu

// without the 'm' modifier, not in 'u' mode: not possible to invert when in 'm' mode

s

It's fairly complicated to invert the s flag in a RegExp without modifiers, and easy to get wrong:

// match /a.b/ outside of 's' mode, then /.+/ in 's' mode, then /c.d/ outside of 's' mode
// with 's' modifier
/a.b(?s:.)+c.d/

// without 's' modifier
/a.b(?:.|[\r\n\u2028\u2029])+c.d/

// match /a.b/ inside of 's' mode, then /.+/ outside of 's' mode, then /c.d/ inside of 's' mode
// with 's' modifier
/a.b(?-s:.+)c.d/s

// without 's' modifier
/a.b(?:(?![\r\n\u2028\u2029]).)+c.d/s
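
The two modifier-free emulations above can be checked against a few inputs. This is only a sketch; the sample strings are made up for illustration.

```javascript
// Emulates /a.b(?s:.)+c.d/: only the middle part may cross line terminators.
const middleDotAll = /a.b(?:.|[\r\n\u2028\u2029])+c.d/;

// Emulates /a.b(?-s:.+)c.d/s: only the middle part must stay on one line.
const middleSingleLine = /a.b(?:(?![\r\n\u2028\u2029]).)+c.d/s;

const dotAllResults = [
  middleDotAll.test("a-b\n\nc-d"), // line breaks only in the middle part
  middleDotAll.test("a\nb__c-d"),  // line break inside the leading a.b
];

const singleLineResults = [
  middleSingleLine.test("a\nb__c\nd"), // line breaks only in a.b and c.d
  middleSingleLine.test("a-b\nc-d"),   // line break is the whole middle part
];
```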

There's nothing with [^\s\S] for the dotAll case?

I'm not sure I understand what you mean. Can you clarify?

If you mean using [\s\S] to match everything, that's feasible for the first s example, sure. I don't see how it helps with the second example though.

I just want to share a little trick to emulate m and non-m mode without using ^ and $. This might be relevant for transpilers.

- /^ $/ == /(?<![\s\S]) (?![\s\S])/
- /^ $/m == /(?<!.) (?!.)/ // no `s` flag!

This works for both u and non-u mode.
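
A quick way to sanity-check the trick is to compare the positions at which each assertion matches. The following is a sketch; `positions` and the sample text are assumptions made for the illustration.

```javascript
// Collects the index of every (zero-width) match of a global regex.
function positions(re, text) {
  return Array.from(text.matchAll(re), (m) => m.index);
}

const text = "ab\ncd\r\ne";

// Non-multiline: ^ and $ match only at the buffer boundaries.
const bufferStart = [positions(/^/g, text), positions(/(?<![\s\S])/g, text)];
const bufferEnd = [positions(/$/g, text), positions(/(?![\s\S])/g, text)];

// Multiline: ^ and $ also match next to line terminators (emulation has no s flag).
const lineStart = [positions(/^/gm, text), positions(/(?<!.)/g, text)];
const lineEnd = [positions(/$/gm, text), positions(/(?!.)/g, text)];

// Each pair should contain two identical arrays.
const allEqual = [bufferStart, bufferEnd, lineStart, lineEnd].every(
  ([native, emulated]) => JSON.stringify(native) === JSON.stringify(emulated)
);
```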

The modifiers supported by this proposal will be limited to i, m, and s. This set may be extended by future proposals (such as the x-mode proposal), but doing so is out of scope for this proposal.