Supported Modifier Flags
rbuckton opened this issue · comments
In the October 2021 plenary, @michaelficarra asked that we outline and provide motivating examples for each flag we are considering as a supported modifier.
The flags currently under consideration are:
i: ignore-case
- Rationale: Toggling ignore-case is especially useful when matching patterns with varying case sensitivity, or when parsing patterns provided via JSON configuration. It is also useful when working with complex Unicode character ranges.
- Example: Match an uppercase ASCII letter followed by uppercase or lowercase ASCII letters or '
const re = /^[A-Z](?i)[a-z']+$/;
re.test("O'Neill"); // true
re.test("o'neill"); // false
// alternatively (defaulting to ignore-case):
const re2 = /^(?-i:[A-Z])[a-z']+$/i;
- Example: Match a word starting with D followed by a word starting with D or d (from .NET documentation, see 1)
const re = /\b(D\w+)(?ix)\s(d\w+)\b/g;
const input = "double dare double Double a Drooling dog The Dreaded Deep";
re.exec(input); // ["Drooling dog", "Drooling", "dog"]
re.exec(input); // ["Dreaded Deep", "Dreaded", "Deep"]
m: multiline
- Rationale: Flexibility in matching beginning-of-buffer vs. beginning-of-line or end-of-buffer vs. end-of-line in a complex pattern.
- Example: Match a frontmatter block at the start of a file
const re = /^---(?m)$((?:^(?!---$).*$)*)^---$/;
re.test("---a"); // false
re.test("---\n---"); // true
re.test("---\na: b\n---"); // true
s: dot-all (i.e., "single line")
- Rationale: Control over . matching semantics within a pattern.
- Example
const re = /a.c(?s:.)*x.z/;
re.test("a\ncx\nz"); // false
re.test("abcdxyz"); // true
re.test("aBc\nxYz"); // true
x: Extended Mode. This flag is proposed by https://github.com/tc39/proposal-regexp-x-mode
- Rationale: Would allow control over significant whitespace handling in a pattern.
- Example: Disabling x mode when composing a complex pattern:
const idPattern = String.raw`[a-z]{2} \d{4}`; // space required; String.raw keeps the backslash in \d
const re = new RegExp(String.raw`
  # match the id
  (?<id>(?-x:${idPattern}))
  # match a separator
  :\s
  # match the value
  (?<value>\w+)
`, "x");
re.exec("aa0123: foo")?.groups; // undefined
re.exec("aa 0123: foo")?.groups; // { id: "aa 0123", value: "foo" }
Flags likely too complex to support:
u: Unicode. This flag affects how a pattern is parsed, not how it is matched. Supporting it would likely require a cover grammar and additional static semantics.
v: Extended Unicode. This flag is proposed by https://github.com/tc39/proposal-regexp-set-notation as an extension of the u flag and would have the same difficulties.
Flags that will never be supported:
g: Global. This flag affects the index at which matching starts and not the matching behavior itself. Changing it mid pattern would have no effect.
y: Sticky. This flag affects the index at which matching starts and not the matching behavior itself. Changing it mid pattern would have no effect.
d: Indices. This flag affects the match result. Changing it mid pattern would have no effect.
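As a rough illustration of why these three flags act on the whole match operation rather than on part of a pattern, here is a small sketch using only existing RegExp behavior. The input string and variable names are illustrative assumptions, not taken from the proposal.

```javascript
const input = "abcabc";

// No g or y: the search always starts at index 0.
const plainIndex = /b/.exec(input).index; // 1

// y: the match must begin exactly at lastIndex (0 here), so this fails.
const stickyResult = /b/y.exec(input); // null

// g: the search starts at lastIndex, which the caller controls.
const globalRe = /b/g;
globalRe.lastIndex = 2;
const globalIndex = globalRe.exec(input).index; // 4

// d: the same match is found; the result just gains an `indices` array.
const firstIndices = /b/d.exec(input).indices[0]; // [1, 2]
```

In each case the flag changes where the engine starts or what the result object carries, so there is no sub-pattern for a modifier to scope it to.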
Footnotes
For the examples, can you share how you'd do it without the relevant proposal?
i
Simple cases like /[A-Z][A-Za-z]/ are trivial:
// match an uppercase ASCII letter followed by a mixed-case ASCII letter
// with 'i' modifier:
/[A-Z](?i)[A-Z]/
// without 'i' modifier:
/[A-Z][A-Za-z]/
However, more complex cases are far from trivial:
// match a mixed case "hello" followed by the exact characters "World"
// with 'i' modifier:
/(?i:hello) World/
// without 'i' modifier:
/[Hh][Ee][Ll][Ll][Oo] World/
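To show how mechanical (and noisy) the modifier-free rewrite is, here is a minimal sketch of a helper that expands a literal into per-character classes. The name `expandIgnoreCase` is hypothetical, and the helper assumes ASCII-only input; it does not attempt full Unicode case folding.

```javascript
// Hypothetical helper: emulate (?i:literal) for ASCII text without modifiers.
function expandIgnoreCase(literal) {
  return Array.from(literal, (ch) => {
    const lower = ch.toLowerCase();
    const upper = ch.toUpperCase();
    // Cased letters become a two-character class such as [Hh].
    if (lower !== upper) return `[${upper}${lower}]`;
    // Everything else is emitted literally, escaping regex syntax characters.
    return ch.replace(/[\\^$.*+?()[\]{}|\/]/g, "\\$&");
  }).join("");
}

const helloSource = expandIgnoreCase("hello"); // "[Hh][Ee][Ll][Ll][Oo]"
const helloWorld = new RegExp(`^${helloSource} World$`);

const mixedCaseMatches = helloWorld.test("HeLLo World"); // true
const lowerWorldMatches = helloWorld.test("hello world"); // false
```

The generated source is equivalent for this input, but it is far harder to read than /(?i:hello) World/.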
m
If you are in u mode, you could emulate non-m mode when in m mode using the proposed \A and \z buffer boundaries. However, if you are not in u mode, there's no way to match the buffer boundaries when in m mode:
// with 'm' modifier:
/^---(?m)$((?:^(?!---$).*$)*)^---$/
// without the 'm' modifier, in 'u' mode:
/\A---$((?:^(?!---$).*$)*)^---$/mu
// without the 'm' modifier, not in 'u' mode: not possible to invert when in 'm' mode
s
It's fairly complicated to invert the s flag in a RegExp without modifiers, and easy to get wrong:
// match /a.b/ outside of 's' mode, then /.+/ in 's' mode, then /c.d/ outside of 's' mode
// with 's' modifier
/a.b(?s:.)+c.d/
// without 's' modifier
/a.b(?:.|[\r\n\u2028\u2029])+c.d/
// match /a.b/ inside of 's' mode, then /.+/ outside of 's' mode, then /c.d/ inside of 's' mode
// with 's' modifier
/a.b(?-s:.+)c.d/s
// without 's' modifier
/a.b(?:(?![\r\n\u2028\u2029]).)+c.d/s
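The two modifier-free patterns above run in current engines, so their behavior can be checked directly. This is a small sketch; the variable names and test strings are illustrative assumptions.

```javascript
// Emulates /a.b(?s:.)+c.d/: only the middle run may cross line terminators.
const dotAllInside = /a.b(?:.|[\r\n\u2028\u2029])+c.d/;
const insideCrossesMiddle = dotAllInside.test("axb\n\ncyd"); // true
const insideRejectsEdge = dotAllInside.test("a\nb\ncyd"); // false

// Emulates /a.b(?-s:.+)c.d/s: only the middle run must stay on one line.
const dotAllOutside = /a.b(?:(?![\r\n\u2028\u2029]).)+c.d/s;
const outsideCrossesEdges = dotAllOutside.test("a\nbxyzc\nd"); // true
const outsideRejectsMiddle = dotAllOutside.test("a\nbx\nyc\nd"); // false
```

Both rewrites depend on spelling out the full set of line terminators by hand, which is where mistakes tend to creep in.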
There's nothing with [^\s\S] for the dotAll case?
I'm not sure I understand what you mean. Can you clarify?
If you mean using [\s\S] to match everything, that's feasible for the first s example, sure. I don't see how it helps with the second example though.
I just want to share a little trick to emulate m and non-m mode without using ^ and $. This might be relevant for transpilers.
- /^ $/ == /(?<![\s\S]) (?![\s\S])/
- /^ $/m == /(?<!.) (?!.)/ // no `s` flag!
This works for both u and non-u mode.
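The lookaround equivalences above can be spot-checked against the built-in anchors. This is a minimal sketch, assuming an engine with lookbehind support; the sample strings and variable names are illustrative.

```javascript
// Buffer boundaries: no character of any kind before or after.
const bufferOnly = /(?<![\s\S])a(?![\s\S])/; // behaves like /^a$/
// Line boundaries: no non-line-terminator character before or after (no s flag).
const lineAware = /(?<!.)a(?!.)/; // behaves like /^a$/m

const samples = ["a", "b\na\nc", "ba", "a\r\nb", "\u2028a"];

// Collect any sample where an emulation disagrees with the real anchors.
const mismatches = samples.filter(
  (s) =>
    bufferOnly.test(s) !== /^a$/.test(s) ||
    lineAware.test(s) !== /^a$/m.test(s)
);

const bufferOnlyOnMultiline = bufferOnly.test("b\na\nc"); // false
const lineAwareOnMultiline = lineAware.test("b\na\nc"); // true
```

The trick works because . without the s flag excludes exactly the line terminators that ^ and $ treat as line boundaries in m mode.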
The modifiers supported by this proposal will be limited to i, m, and s. These may be potentially changed by future proposals (such as the x-mode proposal), but doing so is out of scope.