tc39 / ecma262

Status, process, and documents for ECMA-262

Home Page:https://tc39.es/ecma262/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Annex B: Specify non-standard RegExp static properties

claudepache opened this issue · comments

Update: Link to proposal: https://github.com/tc39/proposal-regexp-legacy-features


All web browsers expose non-standard properties on the RegExp constructor, that are updated each time a match is done.

The list of properties that are exposed by all common browsers are the following. They typically expose other properties not mentioned here, but not shared by all implementations:

$1, $2, $3, $4, $5, $6, $7, $8, $9,

and (in pairs, where the first and the second one return the same value):

input, $_
lastMatch, $&
lastParen, $+
leftContext, $`
rightContext, $'

Some observations:

  • their value is always a string;
  • some implementations expose them as data properties, others as accessor properties;
  • those properties are non-enumerable;
  • they cannot be deleted or modified manually.

It would be useful to specify them. I think that the following should be required:

  1. Those properties should be accessors. That would be more honest about the changing nature of their value;
  2. They should be deletable. Indeed, they represent global communication channels that may be unwanted in some environments;
  3. It should be specified when they are updated, presumably sometime during the execution of the builtin RegExp#exec method. This is useful if we want, in some future, separate the part of RegExp#exec that does the real work, and the part that does stateful updates on various objects.

After that, the only remaining non-standard-but-widely-supported JS feature involving regular expressions is their support for octal escapes, even in strict mode code:

(function() {
  'use strict';
  console.log(/\123/.test('S'));
}());

https://javascript.spec.whatwg.org/#octal-escapes-in-regular-expression-literals

After that, the only remaining non-standard-but-widely-supported JS feature is octal escapes in regular expressions:

I think those are already specified in B.1.4 (search for "LegacyOctalEscapeSequence")

AFAIK re.test(str) also changes them, as well as str.split(re) or even a more classic replace so while I agree about having them as configurable accessors, I think RegExp.reset() to clean up all leaks at once would make way more sense, otherwise devs might feel safe to delete RegExp.$1 forgetting to delete RegExp['$&'] too or others.

AFAIK re.test(str) also changes them, as well as str.split(re) or even a more classic replace

In ES2015, all RegExp methods and all String methods that work with regexp have been rewritten to call internally Regexp#exec, so that interesting stuff is happening in only one place.

I think RegExp.reset() to clean up all leaks at once would make way more sense, otherwise devs might feel safe to delete RegExp.$1 forgetting to delete RegExp['$&'] too or others.

The complete list of offending properties may be disclosed with Object.getOwnProperties(RegExp). Sure, a RemoveAllWarts() method might be considered, but new API should not be proposed here (that should be a separate discussion starting on es-discuss according to CONTRIBUTING.md#Feature Requests).

Please also note there's interesting variations in all these when you have more than 10 matches. For example, in IE, lastParen breaks with more than 11 capturing groups.

You're also omitting the multiline / $* pair, which is in every browser except (I think) IE 9-11.

You're also omitting the multiline / $* pair, which is in every browser except (I think) IE 9-11.

Yes, I've omitted them because I've noticed that they're absent from IE11 and I couldn't check Edge.

Could you confirm that they're present in Edge?

(Also, just learned what it means: https://twitter.com/bterlson/status/563540725770375168, ugh; I'd prefer to kill it.)

(Also, just learned what it means: https://twitter.com/bterlson/status/563540725770375168, ugh; I'd prefer to kill it.)

Well, just tested it:

(function() { 
    RegExp.multiline = true; 
    if (!RegExp.multiline) 
        return 'RegExp.multiline cannot be set to true...';
    if (/foo/.multiline)
        return "RegExp.multiline forces multiline flag on new RegExp :-(";
    else
        return "RegExp.multiline has no effect :-)";
})()

Safari and Chrome seem to ignore that misfeature, let's kill it.

https://code.google.com/p/v8/issues/detail?id=3870#c11 << these probably do need to be annex-B-ified, since apparently Edge wasn't able to get rid of them :(

Indeed. @goyakin is working on a complete list which includes most (maybe all?) of those things.

https://code.google.com/p/v8/issues/detail?id=3870#c11 << these probably do need to be annex-B-ified, since apparently Edge wasn't able to get rid of them :(

Yes, that corresponds to the list of properties I've given in the first comment (obtained as the intersection of the results of Object.getOwnPropertyNames(RegExp) among various browsers).

Note sure whether this was already covered somewhere above, but I would be great if the Annex B specification of the RefExp static properties was specified as only happening for direct instances of RegExp.

In other words, RegExp subclass should not get these static own properties updated and then this abomination goes away if you use a RegExp subclass. Also, the the static accessor properties for them on RegExp should have an internal check to make sure that the this value is %RegExp%.

Then devs can do:

class RegExpNoStatics extends RegExp {};  //really, that's all there is to it

match = RegExpNoStatics(/somePattern/).exec(aString); //no global state will be modified

I’ve written a draft here: https://github.com/claudepache/es-regexp-legacy-static-properties

It implements the suggestion of @allenwb in previous comment. The alternative design is to update the values returned by the %RegExp% static properties also for proper subclasses of RegExp. It is, I think, the only design decision that is really debatable.

Another open design decision (related to the [[LegacyConstructor]] internal slot from @claudepache's draft): V8 seems to store the non-standard properties in current Realm whereas SpiderMonkey uses the realm of the RegExp object instance. What do other browsers do?

@anba does this codepen show what you're talking about? http://codepen.io/anon/pen/RWqpza I think not because everyone that I see prints "Origin realm".

@bterlson If you replace re.exec('ttttt'); with RegExp.prototype.exec.call(re, 'ttttt');, Chrome displays "Current realm" whereas Firefox displays "Origin realm".

@anba awesome, thanks, updated the pen.

And, Edge matches Chrome (ie. properties are updated on the %RegExp% from the exec method's realm not the instance's realm). I guess I'd argue for these semantics to be codified on a "majority rules" basis :)

@claudepache

Note that Claude's proposal cases the legacy static property to all answer the empty string when invoked upon a RegExp subclass constructor.

I think it would be better if they answered undefined in those cases (I could also argue for throwing a TypeError, but I prefer undefined so that, from a subclass class perspective, it looks more like those methods simply aren't there.)

This can be specified by making the first line of the algorithm of the getter function of each of those methods:

  1. If SameValue(this value, %RegExp%) is false, return undefined.

@allenwb

Note that Claude's proposal cases the legacy static property to all answer the empty string when invoked upon a RegExp subclass constructor.

Worse: They return answers bound to the last match made with a direct instance of RegExp.

I think it would be better if they answered undefined in those cases (I could also argue for throwing a TypeError, but I prefer undefined so that, from a subclass class perspective, it looks more like those methods simply aren't there.)

I just didn’t try to be too smart. I'm not sure what degree of smartness to include in those inherently unsound methods. I’ve updated my proposal according to your suggestion, because it is more consistent with the "RegExp subclasses don’t use those properties" semantics.

V8 seems to store the non-standard properties in current Realm whereas SpiderMonkey uses the realm of the RegExp object instance.

I’ve spontaneously specced what appears to be SpiderMonkey’s semantics, because it is what made sense to me. I would change for the other way if there is consensus for that.

Why would it make more sense to save it in the object's realm?

Because all the real work is made by the [[RegExpMatcher]] internal slot of that object, and because the lastIndex property of that object is updated. It was a philosophical reason, not a practical one.

crossOriginRE.exec(...);
RegExp.lastParen === ???

The "current realm" is the realm of crossOriginRE.exec (rather than the realm of its caller), which coincides with the realm of crossOriginRE, so that the code is broken anyway. It would make a difference if you try RegExp.prototype.exec.call(crossOriginRE, ...) instead.

Re: behaviour on RegExp subclass constructors.
@allenwb wrote:

I think it would be better if they answered undefined in those cases (I could also argue for throwing a TypeError, but I prefer undefined so that, from a subclass class perspective, it looks more like those methods simply aren't there.)

After reflection, I think that throwing a TypeError rather than returning undefined would be both more consistent with the rest of the language (arguments.caller in strict-mode functions, Regexp#sticky on non-regexpes, etc.), and more useful for programmers (strong signal that the feature is not available, preventing latent bugs).

Other opinions?

Re: RegExp.prototype.exec.call(otherFrame_regexp)

V8 seems to store the non-standard properties in current Realm whereas SpiderMonkey uses the realm of the RegExp object instance.

A third option would be to disable the update of non-standard properties when Realms don't match, the same way it is disabled for instances of proper subclasses of RegExp. What might make sense: if somehow I manage to hack regexpes and/or RegExp.prototype.exec in my frame so that it avoids to litter the RegExp constructor, I might not want my hack to be defeated when another frame plays with my regexp.

As a side note, I think that it might be worthwhile to also neuter RegExp.prototype.compile for proper subclasses of RegExp: if the subclass constructor do non-trivial things, RegExp.prototype.compile might not do what you want.

I've updated my draft in that sense. I will eventually write a motivation for disabling legacy unsound features for non-direct instances of RegExp.

I will eventually write a motivation for disabling legacy unsound features for non-direct instances of RegExp.

Done.

The proposal diverges from browser implementations in a couple of points and, I think, should be updated to match them.

The properties aren't enumerable in the proposal, whereas most of them actually are on Chrome, Firefox, and Edge. These enumerable properties are:

  • input
  • lastMatch
  • lastParen
  • leftContext
  • rightContext
  • $1, ..., $9

The alias properties (e.g., $_ and $&) match the browsers though.

Another thing is that the setter of input (and of its alias $_) is undefined in the proposal. However, input is modifiable on Chrome, Firefox, and Edge, and it returns the string version of the value it is assigned to, e.g.

RegExp.input = 123
// RegExp.input === '123'

There seems to be disagreement among browsers regarding the setters of other properties. Firefox matches the proposal and has them as undefined. Chrome and Edge, on the other hand, implement them as no-op operations. Given that an exception would be thrown when these properties are assigned to in strict mode, I think it would be good to have them as no-op operations in the spec.

@goyakin

The properties aren't enumerable in the proposal, whereas most of them actually are on Chrome, Firefox, and Edge.

corrected.

Another thing is that the setter of input (and of its alias $_) is undefined in the proposal. However, input is modifiable on Chrome, Firefox, and Edge, and it returns the string version of the value it is assigned to,

corrected.

There seems to be disagreement among browsers regarding the setters of other properties. Firefox matches the proposal and has them as undefined. Chrome and Edge, on the other hand, implement them as no-op operations. Given that an exception would be thrown when these properties are assigned to in strict mode, I think it would be good to have them as no-op operations in the spec.

I disagree: silent failure is a bad thing; throwing an error (as do Firefox and Safari in strict mode) is better for preventing latent bugs.

https://codereview.chromium.org/1416093006 seems to have made these accessors non-enumerable, so at least in Chrome, for the time being, they match the original proposal (re: enumerability)

If it doesn't break anything, it might be considered safe to change it to make them non-enumerable (does yangguo have a github account? I'd ping him for comment if I knew)

https://codereview.chromium.org/1416093006 seems to have made these accessors non-enumerable, so at least in Chrome, for the time being, they match the original proposal (re: enumerability)

If it doesn't break anything, it might be considered safe to change it to make them non-enumerable

Noted for future consideration when it'll reach stable release: tc39/proposal-regexp-legacy-features#1

As my draft goes beyond the bare description of legacy RegExp properties, by defining a mechanism that deactivate them for e.g. subclasses, I suppose it should be pushed as proposal.

At whatwg/javascript#29 (comment) I write

The RegExp statics are mutable global state and a global communications channel.
https://github.com/google/caja/wiki/RegexpsLeakMatchGlobally
Their existence must not be normative, so that a system (like SES's repairES5) that removes them produces a state that is still considered a conformant ES implementation.

Regarding the RegExp statics, the important issue is
http://wiki.ecmascript.org/doku.php?id=conventions:make_non-standard_properties_configurable
which we should propose and make normative. I just verified that Chrome 48.0.2564.103 ships with these statics configurable, and actually deletable, demonstrating that it is web compatible to require this.

Any objection to standardizing the Chrome 48+ semantics?

At I write at whatwg/javascript#29 (comment)

I am certainly against specifying these as normative. As I say above:

Their existence must not be normative, so that a system (like SES's repairES5) that removes them
produces a state that is still considered a conformant ES implementation.
However, in light of Claude's point at tc39/proposal-regexp-legacy-features#3 and my reply at tc39/proposal-regexp-legacy-features#3 (comment) , if inter-realm leakage were also banned, I would see benefit to these being normative-optional rather than continuing as unspecified. i.e., where there is no normative requirement that they be present, but there is a normative requirement that, if they are present, then

  • they must actually be deletable. If deleted, they do not magically reappear.
  • they must not leak info cross realm.

With these, yes, I think I would support normative-optional.

Since this is now a stage 3 proposal, should this issue be closed?

Maybe just close it as part of the commit that merges that into master? That'd be cleaner.

Sounds good, although that's an unspecified time in the future :-)

Closing the issue SGTM