Supported Modifier Flags


rbuckton opened this issue 3 years ago · comments

In the October 2021 plenary, @michaelficarra asked that we outline and provide motivating examples for each flag we are considering as a supported modifier.

The flags currently under consideration are:

i — ignore-case
- Rationale — Toggling ignore-case is especially useful when matching patterns with varying case sensitivity, or when parsing patterns provided via JSON configuration. It is also useful when working with complex Unicode character ranges.
- Example — Match an uppercase ASCII letter followed by one or more uppercase or lowercase ASCII letters or '
```
const re = /^[A-Z](?i)[a-z']+$/;
re.test("O'Neill"); // true
re.test("o'neill"); // false

// alternatively (defaulting to ignore-case):
const re2 = /^(?-i:[A-Z])[a-z']+$/i;
```
- Example — Match word starting with D followed by word starting with D or d (from .NET documentation, see ¹)
```
const re = /\b(D\w+)(?ix)\s(d\w+)\b/g;
const input = "double dare double Double a Drooling dog The Dreaded Deep";
re.exec(input); // ["Drooling dog", "Drooling", "dog"]
re.exec(input); // ["Dreaded Deep", "Dreaded", "Deep"]
```
m — multiline
- Rationale — Flexibility in matching beginning-of-buffer vs. beginning-of-line or end-of-buffer vs. end-of-line in a complex pattern.
- Example — Match a frontmatter block at the start of a file
```
const re = /^---(?m)$((?:^(?!---$).*$)*)^---$/;
re.test("---a"); // false
re.test("---\n---"); // true
re.test("---\na: b\n---"); // true
```
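For readers who want to try the same idea in an engine without modifier support, here is a minimal sketch. It is not the pattern from the issue: it keeps the `m` flag on for the whole pattern and stands in for the start-of-buffer anchor with a negative lookbehind. The name `frontmatter` and the assumption that every line ends in `\n` are illustrative only.

```javascript
// Sketch only: (?<![\s\S]) succeeds where no character precedes the position,
// so it acts as a start-of-buffer anchor even though the `m` flag is set.
// Assumption: lines are terminated by "\n", which the pattern consumes explicitly.
const frontmatter = /(?<![\s\S])---\n((?:(?!---$).*\n)*)---$/m;

frontmatter.test("---a");            // false: no line break after the opening fence
frontmatter.test("---\n---");        // true: empty block
frontmatter.test("---\na: b\n---");  // true: one line of content
frontmatter.test("x\n---\n---");     // false: block is not at the start of the buffer
```

The capture group holds the block body, including the trailing line break of each content line.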

s — dot-all (i.e., "single line")
- Rationale — Control over . matching semantics within a pattern.
- Example
```
const re = /a.c(?s:.)*x.z/;
re.test("a\ncx\nz"); // false
re.test("abcdxyz"); // true
re.test("aBc\nxYz"); // true
```
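Without modifiers, the same behavior can be approximated today by swapping the dot-all `.` for a character class that matches any code unit. A minimal sketch, in which the name `emulated` is illustrative and not part of the proposal:

```javascript
// [\s\S] matches any character, including line terminators,
// so it plays the role of (?s:.) while the other dots keep non-dot-all semantics.
const emulated = /a.c[\s\S]*x.z/;

emulated.test("a\ncx\nz");  // false: the first `.` cannot match "\n"
emulated.test("abcdxyz");   // true
emulated.test("aBc\nxYz");  // true: "\n" is consumed by [\s\S]*
```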

x — Extended Mode. This flag is proposed by https://github.com/tc39/proposal-regexp-x-mode
- Rationale — Would allow control over significant whitespace handling in a pattern.
- Example — Disabling x mode when composing a complex pattern:
```
// String.raw keeps the backslash in \d; a plain template literal would turn it into "d"
const idPattern = String.raw`[a-z]{2} \d{4}`; // space required
const re = new RegExp(String.raw`
  # match the id
  (?<id>(?-x:${idPattern}))
  
  # match a separator
  :\s
  
  # match the value
  (?<value>\w+)
`, "x");

re.exec("aa0123: foo")?.groups; // undefined
re.exec("aa 0123: foo")?.groups; // { id: "aa 0123", value: "foo" }
```

Flags likely too complex to support:

u — Unicode. This flag affects how a pattern is parsed, not how it is matched. Supporting it would likely require a cover grammar and additional static semantics.
v — Extended Unicode. This flag is proposed by https://github.com/tc39/proposal-regexp-set-notation as an extension of the u flag and would have the same difficulties.

Flags that will never be supported:

g — Global. This flag affects the index at which matching starts and not the matching behavior itself. Changing it mid-pattern would have no effect.
y — Sticky. This flag affects the index at which matching starts and not the matching behavior itself. Changing it mid-pattern would have no effect.
d — Indices. This flag affects the match result. Changing it mid-pattern would have no effect.
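The point about g, y, and d can be seen with existing syntax. The sketch below uses only standard `RegExp` behavior; the variable names are illustrative. In each case the flag changes where matching starts or what the result object contains, not which text the pattern itself matches.

```javascript
const text = "ab ab";

// g: successive exec() calls resume from lastIndex.
const globalRe = /ab/g;
const first = globalRe.exec(text).index;   // 0
const second = globalRe.exec(text).index;  // 3

// y: the match must begin exactly at lastIndex.
const stickyRe = /ab/y;
stickyRe.lastIndex = 1;
const stickyResult = stickyRe.exec(text);  // null, nothing matches at index 1

// d: the result gains an `indices` array; the matched text is unchanged.
const indicesRe = /ab/d;
const span = indicesRe.exec(text).indices[0]; // [0, 2]
```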

¹ https://docs.microsoft.com/en-us/dotnet/standard/base-types/miscellaneous-constructs-in-regular-expressions#inline-options

Jordan Harband · Answer 1 · Tue Nov 23 2021 04:43:57 GMT+0800 (China Standard Time)

For the examples, can you share how you'd do it without the relevant proposal?

Ron Buckton · Answer 2 · Tue Nov 23 2021 09:28:31 GMT+0800 (China Standard Time)

`i`

Simple cases like /[A-Z][A-Za-z]/ are trivial:

```
// match an uppercase ASCII letter followed by a mixed-case ASCII letter

// with 'i' modifier:
/[A-Z](?i)[A-Z]/

// without 'i' modifier:
/[A-Z][A-Za-z]/
```

However, more complex cases are far from trivial:

```
// match a mixed case "hello" followed by the exact characters "World"

// with 'i' modifier:
/(?i:hello) World/

// without 'i' modifier:
/[Hh][Ee][Ll][Ll][Oo] World/
```
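To give a sense of the manual work involved, here is a rough sketch of generating that expansion programmatically. `asciiCaseInsensitive` is a hypothetical helper written for this illustration. It handles ASCII letters only and says nothing about Unicode case folding, which is where a hand-rolled approach becomes much harder.

```javascript
// Hypothetical helper: expands each ASCII letter of a literal string into a
// two-character class and escapes other pattern syntax characters.
function asciiCaseInsensitive(literal) {
  return Array.from(literal, (ch) => {
    if (/^[a-z]$/i.test(ch)) {
      return `[${ch.toUpperCase()}${ch.toLowerCase()}]`;
    }
    // Escape characters that carry special meaning in a pattern.
    return ch.replace(/[.*+?^${}()|[\]\\\/]/g, "\\$&");
  }).join("");
}

const helloWorld = new RegExp(`${asciiCaseInsensitive("hello")} World`);

helloWorld.source;               // "[Hh][Ee][Ll][Ll][Oo] World"
helloWorld.test("HeLLo World");  // true
helloWorld.test("hello world");  // false: "World" stays case-sensitive
```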

`m`

If you are in u mode, you could emulate non-m mode when in m mode using the proposed \A and \z buffer boundaries. However, if you are not in u mode, there's no way to match the buffer boundaries when in m mode:

```
// with 'm' modifier:
/^---(?m)$((?:^(?!---$).*$)*)^---$/

// without the 'm' modifier, in 'u' mode:
/\A---$((?:^(?!---$).*$)*)^---$/mu

// without the 'm' modifier, not in 'u' mode: not possible to invert when in 'm' mode
```

`s`

It's fairly complicated to invert the s flag in a RegExp without modifiers, and easy to get wrong:

```
// match /a.b/ outside of 's' mode, then /.+/ in 's' mode, then /c.d/ outside of 's' mode
// with 's' modifier
/a.b(?s:.)+c.d/

// without 's' modifier
/a.b(?:.|[\r\n\u2028\u2029])+c.d/

// match /a.b/ inside of 's' mode, then /.+/ outside of 's' mode, then /c.d/ inside of 's' mode
// with 's' modifier
/a.b(?-s:.+)c.d/s

// without 's' modifier
/a.b(?:(?![\r\n\u2028\u2029]).)+c.d/s
```
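The two modifier-free patterns above run in current engines, so their behavior can be checked directly. A small sketch follows; the variable names and sample strings are chosen for illustration.

```javascript
// Emulates (?s:.)+ inside a pattern that is otherwise not in dot-all mode.
const dotAllEmulated = /a.b(?:.|[\r\n\u2028\u2029])+c.d/;

dotAllEmulated.test("aXb\n\ncYd");  // true: the middle part spans line breaks
dotAllEmulated.test("a\nb--cYd");   // false: the first `.` cannot match "\n"

// Emulates (?-s:.+) inside a pattern that is otherwise in dot-all mode.
const lineOnlyEmulated = /a.b(?:(?![\r\n\u2028\u2029]).)+c.d/s;

lineOnlyEmulated.test("a\nb--c\nd");  // true: the outer dots may match "\n"
lineOnlyEmulated.test("aXb\ncYd");    // false: the middle part rejects "\n"
```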

Jordan Harband · Answer 3 · Tue Nov 23 2021 11:30:51 GMT+0800 (China Standard Time)

There's nothing with [^\s\S] for the dotAll case?

Ron Buckton · Answer 4 · Tue Nov 23 2021 13:27:35 GMT+0800 (China Standard Time)

I'm not sure I understand what you mean. Can you clarify?

Ron Buckton · Answer 5 · Tue Nov 23 2021 13:29:05 GMT+0800 (China Standard Time)

If you mean using [\s\S] to match everything, that's feasible for the first s example, sure. I don't see how it helps with the second example though.

Michael Schmidt · Answer 6 · Wed Mar 16 2022 21:07:25 GMT+0800 (China Standard Time)

I just want to share a little trick to emulate m and non-m mode without using ^ and $. This might be relevant for transpilers.

- /^ $/ == /(?<![\s\S]) (?![\s\S])/
- /^ $/m == /(?<!.) (?!.)/ // no `s` flag!

This works for both u and non-u mode.
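As a quick way to try the equivalences above, the following sketch compares the positions at which each anchor and its lookaround emulation match. The helper `positions` and the sample string are illustrative; only non-`u` mode is exercised here.

```javascript
// Collect the index of every zero-width match of `re` in `text`.
function positions(re, text) {
  return Array.from(text.matchAll(re), (m) => m.index);
}

// Mixed line terminators, so that "\r" and "\n" are each treated as a boundary.
const sample = "ab\ncd\r\nef";

const pairs = [
  [/^/g, /(?<![\s\S])/g],  // start of buffer
  [/$/g, /(?![\s\S])/g],   // end of buffer
  [/^/gm, /(?<!.)/g],      // start of line (emulation has no `s` flag)
  [/$/gm, /(?!.)/g],       // end of line (emulation has no `s` flag)
];

// True when every emulation matches at exactly the same positions as its anchor.
const allAgree = pairs.every(
  ([anchor, emulated]) =>
    JSON.stringify(positions(anchor, sample)) ===
    JSON.stringify(positions(emulated, sample))
);
```

Because `matchAll` requires the `g` flag, every pattern in `pairs` carries it; the flag does not affect where the assertions succeed.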

Ron Buckton · Answer 7 · Wed Jun 08 2022 04:13:52 GMT+0800 (China Standard Time)

The modifiers supported by this proposal will be limited to i, m, and s. Future proposals (such as the x-mode proposal) may change this set, but doing so is out of scope here.
