tc39 / proposal-regexp-unicode-property-escapes

Proposal to add Unicode property escapes `\p{…}` and `\P{…}` to regular expressions in ECMAScript.

Home Page: https://tc39.github.io/proposal-regexp-unicode-property-escapes/

Separator: `=` vs. `:`

mathiasbynens opened this issue · comments

Perl does both:

$ perl -Mutf8 -E 'say "π" =~ /\p{Script=Greek}/'
1

$ perl -Mutf8 -E 'say "π" =~ /\p{Script:Greek}/'
1

We only want to support one, but which one? The current proposal uses =, but why not :?

00:35:54 <bterlson> what is the rationale for `=` in `Script=` btw?
00:37:40 <bterlson> I mean, why not `Script:foo`
00:38:09 <mathiasbynens> no strong preference, but I think we should only support either `:` or `=` but not both: https://github.com/mathiasbynens/es-regexp-unicode-property-escapes#why-not-support--as-a-separator-in-addition-to-
00:38:29 <bterlson> both is absurd
00:38:40 <mathiasbynens> Perl does both!
00:38:46 <bterlson> absurd
00:38:50 <mathiasbynens> :)
00:39:06 <bterlson> `:` aligns with property syntax
00:43:06 <mathiasbynens> hmm yeah that makes sense… although property name grammar in \p{} is much more restrictive than Identifier

: aligns with property syntax, but that’s where the similarity ends — property name/value grammar in \p{…} is much more restrictive than Identifier.

=, on the other hand, reminds me of SQL, where \p{property=value} becomes something like SELECT * FROM symbols WHERE property = 'value';, i.e. match all symbols where the value for property $property is $value. I like the mental model of querying the Unicode Database.
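To make the querying model concrete, here is a minimal JavaScript sketch. It assumes an engine that implements the `=` form under the `u` flag, as the current proposal describes; the `symbols` array and the `greek` variable are made up for illustration.

```javascript
// Hypothetical sample data; any list of single-character strings would do.
const symbols = ['π', 'a', 'λ', '1'];

// Roughly: SELECT * FROM symbols WHERE Script = 'Greek';
// Assumes the engine supports \p{…} when the `u` flag is set.
const greek = symbols.filter((symbol) => /^\p{Script=Greek}$/u.test(symbol));

console.log(greek); // [ 'π', 'λ' ]
```

The `filter` call plays the role of the WHERE clause: each symbol is kept only if its Script property equals Greek.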

I'd say it's arbitrary. Any separator would do.

I still prefer : slightly as I like to think about it like creating an options bag, but my only strong preference is to not do both.

I prefer = slightly, but that may just be because that's the first syntax I saw @hashseed implement and it looked nice to me.

Time for a twitter poll! :-P

https://twitter.com/bterlson/status/764184006095048704

ECMAScript’s RegExps are learning more about Unicode with the \p proposal. What syntax should it use?

330 votes:

  • 52% /\p{Script:Greek}/
  • 28% /\p{Script=Greek}/
  • 20% Why not both?

This seems possibly confusing as @bmeck points out.

let foo;
`${foo=1}`; // foo = 1
/\p{foo=1}/; // syntax error?

Not really sure why that’s confusing… one is a string template and the other is a regexp literal. Syntax is entirely different…

It's possibly confusing because, in order to understand what foo=1 is doing, you have to understand that the syntax is entirely different despite looking identical (and even the surrounding syntax is similar, what with the curlies and all).
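A small sketch of how the two look-alike forms behave, assuming an engine that implements the proposal with `=` under the `u` flag. The variable names (`text`, `unicodeError`, `legacy`) are illustrative only, and the regular expressions are built with the `RegExp` constructor so that the failure can be caught at run time instead of stopping the whole script at parse time.

```javascript
let foo;

// Template literal: `${…}` holds an arbitrary expression, so this assigns.
const text = `${foo = 1}`;
console.log(text, foo); // 1 1

// RegExp with the `u` flag: `foo=1` has to be a Unicode property/value
// pair, and `foo` is not a Unicode property, so construction fails.
let unicodeError = null;
try {
  new RegExp('\\p{foo=1}', 'u');
} catch (error) {
  unicodeError = error;
}
console.log(unicodeError instanceof SyntaxError); // true

// Without the `u` flag, web-compatible engines treat `\p` as a plain `p`
// and the braces as literal characters, so nothing is assigned or queried.
const legacy = new RegExp('\\p{foo=1}');
console.log(legacy.test('p{foo=1}')); // true
```

In neither regular expression does `=` assign anything; it only ever separates a property name from a value.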

In theory I like : better, but in practice I use and teach = because that's what I see much more frequently in the wild and more regex engines support it. I think of regex as a language of its own embedded within other languages without any syntactic relationship to the languages that embed it. Note that the = in \p{…=…} aligns with the = in (?=…) for positive lookaheads and (?<=…) for positive lookbehinds.
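As a rough illustration of that alignment, the sketch below puts the `=` of a property escape next to the `=` of a lookahead and a lookbehind. It assumes an engine that supports `\p{…=…}` under the `u` flag and also supports lookbehind assertions, which is a separate feature; the names `aheadOfGreek` and `behindGreek` are invented for the example.

```javascript
// Lookahead: match "a" only when the next character is a Greek letter.
const aheadOfGreek = /a(?=\p{Script=Greek})/u;

// Lookbehind: match "a" only when the previous character is a Greek letter.
const behindGreek = /(?<=\p{Script=Greek})a/u;

console.log(aheadOfGreek.test('aπ')); // true
console.log(aheadOfGreek.test('ab')); // false
console.log(behindGreek.test('πa')); // true
console.log(behindGreek.test('ba')); // false
```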

I performed an extremely unscientific survey of my locally checked-out git projects (which of course includes my own code):

$ ack -ch '\\p\{\w+=\w+\}'
814
$ ack -ch '\\p\{\w+:\w+\}'
121

Also, the regex docs for Java and ICU only include =, as do the specification for Unicode Sets and the Unicode CLDR data files that use them. Lastly, I've never seen a regex engine that solely supports :, but I would love to hear about it if anyone knows of one.

In theory I like : better […]

Seems like most people feel that way.

@patch makes a very good point in favor of =, though:

I think of regex as a language of its own embedded within other languages without any syntactic relationship to the languages that embed it. Note that the = in \p{…=…} aligns with the = in (?=…) for positive lookaheads and (?<=…) for positive lookbehinds.

I’m slightly leaning towards sticking to = now.

@bterlson What do you think?

I find @patch's arguments the most persuasive so far and am convinced that regexp experts will generally prefer =. I'm not sure JS developers would generally find it more approachable because they may not be regexp experts, have experience with other engines, know much about Unicode, see the correspondence between lookaheads, etc.

I cannot argue strongly in favor of : so I support moving forward with =. The twitter poll is clearly in favor of :, though, fwiw :)

At TC39, we decided to reverse the judgement here and go with :.

I feel like we didn't represent the FAQ entry contents well... do you @littledan? If not maybe we can do a quick re-check?

A quick re-check sounds good! If the decision made in this issue is reversed, I’d love to hear the rationale for it.

OK, I'll see if we have time to discuss this at this TC39 meeting later. The rationale was that = is used for property assignment, but the examples in the FAQ seem to show that RegExps already assign a new meaning to =.

Cc @allenwb who made the point for : rather than =.