tc39 / proposal-regexp-v-flag

UTS18 set notation in regular expressions

Re. note in the README about the user-friendliness of RegExp syntax

pygy opened this issue · comments

I know it is pretty late to change minds that have possibly long been set at this point, but I took a hiatus from coding for a while and just discovered this proposal, whose syntax additions are non-trivial, to say the least.

I'd like to react to this remark in the README, which IMO misses the forest for the trees:

However, we found that this is not very developer-friendly.

In particular, one would have to write the prefix and use the u flag. Waldemar pointed out that the prefix looks like it should be enough, and therefore a developer may well accidentally omit adding the u flag. Although this aspect could be addressed by using a more complicated prefix that is currently invalid with and without the u flag (like (?[), doing so would come at the cost of readability.

Also, the use of a backslash-letter prefix would want to enclose the new syntax in {curly braces} because other such syntax (\p{property}, \u{12345}, …) uses curly braces – but not using [square brackets] for the outermost level of a character class looks strange.

Finally, when an expression has several new-syntax character classes, the prefix would have to be used on each one, which is clunky.
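
For context, the existing \p{...} escape already shows the failure mode the README worries about: without the u flag the pattern still parses, it just means something else. A minimal sketch (the variable names are only for illustration):

```javascript
// `\p{...}` is a Unicode property escape only when the `u` flag is set.
// Without it, legacy (Annex B) parsing treats `\p` as a plain "p" and
// `{Lu}` as literal text, so no error is raised.
const withoutU = /\p{Lu}/;
const withU = /\p{Lu}/u;

console.log(withoutU.test("A"));     // false: looks for the text "p{Lu}"
console.log(withoutU.test("p{Lu}")); // true
console.log(withU.test("A"));        // true: "A" is an uppercase letter
```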

My main concerns with this are

  1. the updates to the escaping rules, which are already non-trivial in the post u flag world. By optimizing the syntax for this feature, you're defacing the rest of it by adding new, even more complex escaping rules.
  2. RegExp syntax is not user friendly to begin with, and whatever you do while extending the syntax, it will not be user friendly.

1. Escaping rules are already baroque in u regexps

Currently, /[[]/u and /[\-]/u are valid RegExps, while /[]]/u and /\-/u aren't.
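
These four cases can be checked by compiling each source with and without the u flag. The sketch below uses a small hypothetical helper, isValid, that is not part of any API; it only wraps the RegExp constructor in a try/catch.

```javascript
// Hypothetical helper: reports whether a pattern source compiles
// with the given flags. Only SyntaxError is treated as "invalid".
function isValid(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

// Backslashes are doubled because these are string literals.
const sources = ["[[]", "[\\-]", "[]]", "\\-"];

const results = sources.map((source) => ({
  source,
  withU: isValid(source, "u"),
  withoutU: isValid(source, ""),
}));

for (const r of results) {
  console.log(`/${r.source}/  u: ${r.withU}  no flag: ${r.withoutU}`);
}
```

With the u flag, the first two sources compile and the last two throw a SyntaxError; without any flag, all four compile.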

I understand the rationale for reserving escape sequences involving letters for future extensions, but the sigil escaping rules are arbitrary and must be memorized, which is not trivial given how esoteric RegExps are as a language. Escape rules for non-unicode RegExps were much friendlier to beginners and occasional users, and most JS users are not RegExp experts.

The more syntax you add, the more complex the escaping rules become, and the more expert one has to be to use the new features (without resorting to trial and error until the JS parser accepts your input).

Also, generally, more escaping makes RegExps even more unreadable.

A \op{...} prefix for set operations would leave the escaping rules alone.

2. RegExp syntax is atrocious.

New RegExp features should not rely on its syntax as the primary way to use them.

RegExp syntax was devised by Kleene to describe the regular formalism (where sub-parsers were abstracted as capital letters, e.g. A|B representing an abstract disjunction) and adopted as a write-only language for text searching in QED by Thompson. The syntax was never meant for what we use it for today, and it fails abysmally for large expressions. Debugging and refactoring them is very painful, as you are all aware.

While some syntax would be useful for serialization and re-parsing, we would be better off with a JS API that lets users combine or diff character sets, and a new RegExp syntax that lets one splice regexps together.

const latinUC = RegExp.intersect(/\p{Uppercase}/u, /\p{sc=Latin}/u)

const latinWord = /\b(?>latinUC)+\b/u

// alternatively
const {sequence, Quantifier} = RegExp
const oneOrMore = new Quantifier("+")

const lw = sequence(/\b/, oneOrMore(latinUC), /\b/)  // the `u` flag is contagious

// => /\b\op{\p{Uppercase}&&\p{sc=Latin}}+\b/u

This would let users re-use sub-expressions, test them individually, and globally write better code that takes advantage of the world-class engine that Irregexp is.
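
Something close to this can be approximated today in userland. The sketch below is not the API proposed above: intersect, oneOrMore and sequence are hypothetical helpers that splice .source strings together, and the intersection is emulated with a lookahead, which only works when both operands match exactly one character. It also assumes the operands contain no capture groups or backreferences, since those would be renumbered by the splice.

```javascript
// Union of the flags of all operands, so that `u` is "contagious".
function mergeFlags(...regexps) {
  return [...new Set(regexps.flatMap((re) => [...re.flags]))].join("");
}

// Emulated intersection: the next character must match `a` (lookahead)
// and is then consumed by `b`. Assumes both match a single character.
function intersect(a, b) {
  return new RegExp(`(?=${a.source})(?:${b.source})`, mergeFlags(a, b));
}

// One or more repetitions of a sub-expression.
function oneOrMore(re) {
  return new RegExp(`(?:${re.source})+`, re.flags);
}

// Concatenation, with each part isolated in a non-capturing group.
function sequence(...parts) {
  const source = parts.map((p) => `(?:${p.source})`).join("");
  return new RegExp(source, mergeFlags(...parts));
}

const latinUC = intersect(/\p{Uppercase}/u, /\p{sc=Latin}/u);
const lw = sequence(/\b/, oneOrMore(latinUC), /\b/);

console.log(latinUC.test("A"));               // true
console.log(latinUC.test("a"));               // false: not uppercase
console.log(latinUC.test("Δ"));               // false: uppercase, but Greek
console.log("Hello ABC world".match(lw)[0]);  // "ABC"
```

Each piece can be tested on its own before being combined, which is the property argued for above. The string splicing is fragile, though, which is part of why a native composition mechanism would be preferable.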

I've opened a discussion about composition here: https://es.discourse.group/t/regexp-composition/1278

It would be possible, as a separate proposal, to have a parallel, procedural way of constructing regexes.
For this proposal here, the committee seems comfortable with extending and modifying the character class syntax, and has let us progress to stage 3.
Thanks for picking up the discussion on discourse.