Performance implications of line terminator normalization

Question

Performance implications of line terminator normalization

tschneidereit opened this issue 8 years ago · comments

The line terminator normalization is quite unfortunate because it prevents engines from just handing out slices of existing sources, or at least simply memcpying the relevant part of the source. At least in the general case, and not without additional work during source scanning to detect \r characters.

@michaelficarra you introduced the normalization, so maybe you can explain the reasoning? I saw that it was discussed at a tc39 f2f, but unfortunately the notes are very terse, so I really don't understand why it's required.

(There seems to be an additional problem with template strings: it seems unlikely to me that we really want F.p.toString mess with \r characters inside them. Properly handling that would require parsing the source again - or keeping information about template strings around - IINM.)

Michael Ficarra · Answer 1 · Mon Nov 14 2016 01:07:04 GMT+0800 (China Standard Time)

@tschneidereit That's a great question. As you suspect, it does have to do with templates. Templates normalise line terminators (even in the raw value!). See these lines from the spec

The TRV of LineTerminatorSequence::<LF> is the code unit value 0x000A.
The TRV of LineTerminatorSequence::<CR> is the code unit value 0x000A.
The TRV of LineTerminatorSequence::<LS> is the code unit value 0x2028.
The TRV of LineTerminatorSequence::<PS> is the code unit value 0x2029.
The TRV of LineTerminatorSequence::<CR><LF> is the sequence consisting of the code unit value 0x000A.

TRV determines the raw value and TV (which determines the computed value) defers to it for LineTerminatorSequences.

So the normalisation is for consistency with templates and so that a programmer can't observe a change in line terminator encoding (intentional or otherwise). Fortunately, you only need to store pointers into the raw source and can do line terminator normalisation when the toString result is requested. And if it's ever a perf concern, that result can be memoised.

Till Schneidereit · Answer 2 · Mon Nov 14 2016 01:59:42 GMT+0800 (China Standard Time)

Templates normalise line terminators (even in the raw value!)

Oh, I wasn't aware of this. Thanks for the clarification.

So the normalisation is for consistency with templates and so that a programmer can't observe a change in line terminator encoding (intentional or otherwise)

I'm not sure I really buy that: there are various situations in which authors have to deal with this kind of potential inconsistency already, such as loading a script over XHR (or by whatever other means) as a string and the evaluating it. What's special about this situation that it requires the normalization of source text? And why isn't consistency with other ways of getting source text more important?

Fortunately, you only need to store pointers into the raw source and can do line terminator normalisation when the toString result is requested. And if it's ever a perf concern, that result can be memoised.

I'm mostly concerned about frameworks using F.p.toString for reflection purposes in which case it seems likely they'd only do the reflection once (one hopes), but need that to be fast. I also don't think memoization is a good idea here because of the increased memory usage.

Michael Ficarra · Answer 3 · Mon Nov 14 2016 09:37:11 GMT+0800 (China Standard Time)

I'm open to hearing some pros/cons of line terminator normalisation. I also suspect many at TC39 would have strong opinions. For instance, I think it was @allenwb who originally pointed out that we must normalise line terminators. But I'll add an agenda item to the next meeting to give other implementors an opportunity to voice their concerns about it.

I'm mostly concerned about frameworks using F.p.toString for reflection purposes in which case it seems likely they'd only do the reflection once (one hopes), but need that to be fast.

I'm not really concerned about the reflection use cases. TC39 agreed that we should introduce proper APIs for reflecting on function parameter lists. Frameworks that reflect on parameter lists via F.p.toString will continue to work, but ones that want to be fast will move to the new reflection APIs.

edit: See tc39/agendas#219

Jordan Harband · Answer 4 · Mon Nov 14 2016 11:04:28 GMT+0800 (China Standard Time)

One thing worth noting is that not normalizing line terminators might expose implementation-specific information (like, that it's running on windows, eg) - it seems like that might be a security concern.

Till Schneidereit · Answer 5 · Mon Nov 14 2016 20:51:48 GMT+0800 (China Standard Time)

But I'll add an agenda item to the next meeting to give other implementors an opportunity to voice their concerns about it.

Thanks. If the current consensus remains, we'll cope with it of course. I just think it's unfortunate - and exchanges one kind of inconsistency for another, without much gain.

One thing worth noting is that not normalizing line terminators might expose implementation-specific information (like, that it's running on windows, eg) - it seems like that might be a security concern.

Can you elaborate on that? Without normalization we'd just return the (apart from WTF-16 encoding) unmodified source text. What's the information you think that'd leak? The JS engine running on Windows wouldn't change the returned string.

Allen Wirfs-Brock · Answer 6 · Tue Nov 15 2016 00:49:51 GMT+0800 (China Standard Time)

(There seems to be an additional problem with template strings: it seems unlikely to me that we really want F.p.toString mess with \r characters inside them. Properly handling that would require parsing the source again - or keeping information about template strings around - IINM.)

Unescaped line terminators are already normalized by template literals. See note at the end of 11.8.6.1 (or the actual TV/TRV definitions)

John-David Dalton · Answer 7 · Tue Nov 15 2016 00:56:35 GMT+0800 (China Standard Time)

With issues like #16 and this I think this proposal may need some retooling. The current toString behavior has been dealt with for years so adding extra overhead to spec-ify and close-gaps-in the de facto behavior may not be so great.

Till Schneidereit · Answer 8 · Tue Nov 15 2016 00:57:35 GMT+0800 (China Standard Time)

Unescaped line terminators are already normalized by template literals. See note at the end of 11.8.6.1 (or the actual TV/TRV definitions)

Right, @michaelficarra made me aware of that. See the additional discussion above, too.

Michael Ficarra · Answer 9 · Fri Dec 02 2016 02:47:22 GMT+0800 (China Standard Time)

An update after this was discussed yesterday in the committee:

Some committee members very strongly supported the normalisation, going as far to say they would block advancement of this proposal without it.
4 of the present implementers voiced strong reservations about implementing the normalisation.
Legitimate use cases exist for both normalised and non-normalised F.p.toString output.
A possible performant implementation was identified, but it would take a substantial amount of implementation effort. Implementers do not want to do this.
An advanced user of F.p.toString without normalisation can always do the normalisation themselves using .replace(/\r\n?|\n/g, '\n').

Action items:

I will try to get usage data of scripts that use CRLFs and also require performant F.p.toString.
I will request hard data about performance implications of the most performant implementation that implementers are willing to implement.

I will bring this back up with the committee once I feel that we have enough additional input that we can advance the proposal; with or without normalisation.

Jordan Harband · Answer 10 · Tue Jan 10 2017 06:44:10 GMT+0800 (China Standard Time)

@michaelficarra given "Some committee members very strongly supported the normalisation, going as far to say they would block advancement of this proposal without it.", and #22, is the implication that you are confident you can persuade these committee members at the next meeting?

Michael Ficarra · Answer 11 · Tue Jan 10 2017 08:26:46 GMT+0800 (China Standard Time)

@ljharb I plan to proceed with this proposal only in a state in which it will actually be implemented. If some committee members will block it in its current state, I will withdraw the proposal. There's no point in maintaining a specification that is completely disassociated from reality.