Remove bounds checks in Lexer by padding source

Question

overlookmotel opened this issue 7 months ago

This came up on the thread about SIMD in the Lexer (#2296 (comment)), but I'm opening a separate issue for it here, as it could benefit the Lexer's performance even without SIMD.

Currently, bounds checks (and the resulting branches) occur on every single call to Lexer::peek, Lexer::next_eq, etc.

The idea is to allow removing all bounds checks by padding the source text with sentinel bytes indicating end of file. Instead of bounds checks, the lexer knows it's reached EOF when the EOF sentinel is found.

As the lexer is constantly checking byte values anyway, this should be more performant than checking byte values and doing bounds checks.
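A minimal sketch of the sentinel idea, not OXC's or ratel's actual code. The names `EOF_SENTINEL`, `pad_source` and `count_numbers` are hypothetical, and using 0x00 as the sentinel is an assumption made for illustration. The loop never compares a position against the buffer length; it stops when it reads the sentinel.

```rust
/// Assumed sentinel value for this sketch (hypothetical choice).
const EOF_SENTINEL: u8 = 0;

/// Copies `source` into a new buffer with one trailing sentinel byte.
fn pad_source(source: &str) -> Vec<u8> {
    let mut buf = Vec::with_capacity(source.len() + 1);
    buf.extend_from_slice(source.as_bytes());
    buf.push(EOF_SENTINEL);
    buf
}

/// Counts runs of ASCII digits, using the sentinel instead of a length check.
fn count_numbers(padded: &[u8]) -> usize {
    // One up-front check establishes the invariant the unsafe code relies on.
    assert_eq!(padded.last(), Some(&EOF_SENTINEL), "buffer must end with the sentinel");

    let mut ptr = padded.as_ptr();
    let mut count = 0;
    let mut in_number = false;
    loop {
        // SAFETY: `ptr` starts at the first byte of a non-empty buffer and only
        // moves past bytes that are not the sentinel. The last byte is the
        // sentinel, so `ptr` never moves beyond it.
        let byte = unsafe { *ptr };
        if byte == EOF_SENTINEL {
            // This sketch also stops at a NUL inside the source. A real lexer
            // would compare positions here (on this rare path only) to tell
            // a NUL character from the end of the file.
            return count;
        }
        let is_digit = byte.is_ascii_digit();
        if is_digit && !in_number {
            count += 1;
        }
        in_number = is_digit;
        // SAFETY: `byte` is not the sentinel, so it is not the last byte and
        // at least one more byte follows.
        ptr = unsafe { ptr.add(1) };
    }
}
```

For example, `count_numbers(&pad_source("a 12 + 345;"))` walks the whole input with a single comparison per byte for the EOF test.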

ratel uses this technique (code here and here).

In addition, to make SIMD both faster and easier to implement, we could extend this approach and pad the source with SIMD_LANES (16/32/64) bytes. Then a SIMD block read is always valid, so there is no need to implement scalar fallbacks for the end of the file.
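A rough sketch of the wider padding, under stated assumptions: `LANES` is fixed at 16, the padding byte is 0x00, and `pad_for_simd`, `read_block` and `find_byte` are hypothetical names. No real SIMD intrinsics are used; a plain 16-byte array stands in for a vector register, and the slice indexing here is still bounds-checked. The point is only that a full block can be read from any position inside the source, so no scalar tail loop is needed.

```rust
/// Assumed block width; real code would match the target's vector size.
const LANES: usize = 16;
/// Assumed padding byte for this sketch.
const PAD: u8 = 0;

/// Copies `source` into a buffer followed by `LANES` padding bytes.
fn pad_for_simd(source: &str) -> Vec<u8> {
    let mut buf = Vec::with_capacity(source.len() + LANES);
    buf.extend_from_slice(source.as_bytes());
    buf.resize(source.len() + LANES, PAD);
    buf
}

/// Reads one full block starting at `pos`.
/// For any `pos` below the source length, the padding guarantees that
/// `pos + LANES` does not exceed the buffer length.
fn read_block(padded: &[u8], pos: usize) -> [u8; LANES] {
    let mut block = [0u8; LANES];
    block.copy_from_slice(&padded[pos..pos + LANES]);
    block
}

/// Finds the first `needle` in the source, one block at a time.
fn find_byte(padded: &[u8], source_len: usize, needle: u8) -> Option<usize> {
    let mut pos = 0;
    while pos < source_len {
        let block = read_block(padded, pos);
        // Stand-in for a single SIMD compare over all lanes.
        if let Some(i) = block.iter().position(|&b| b == needle) {
            let found = pos + i;
            // A match inside the padding is not part of the source.
            return (found < source_len).then_some(found);
        }
        pos += LANES;
    }
    None
}
```

The last block may overlap the padding, which is why the result is checked against `source_len` before it is returned.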

This has been tried before in OXC and was removed:

#43
#90

It is unclear to me at present whether the disadvantages which led to its removal still apply. The new Source API may make it more ergonomic than it was previously. In any case, if it enables SIMD, that may change the cost/benefit calculus.

My questions about this approach are:

Does the cost of copying the whole source text at the start outweigh the gain of removing bounds checks everywhere else?
Could we add an entry point to the parser which takes a mutable String as source text? If that String already has excess capacity, we could apply this approach without the cost of copying the source text.
For immutable &str input, could we use page table tricks like slice_deque does to create an extended buffer, while only copying the last chunk of the source?
Is there a way to build a safe abstraction which statically forces every part of the lexer to check for and handle the EOF sentinel? (rather than making the entire lexer a pit of unsafe)

The last is my greatest concern. It would be really nice to use the type system to enforce safety.
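One possible shape for such an abstraction, offered as a sketch only and not as a proposal that exists in OXC: all names below (`Source`, `Peeked`, `NotEof`, `count_bytes`) are hypothetical, and 0x00 is an assumed sentinel. The position can only advance through a `NotEof` value, which only `peek` can create and which borrows the source exclusively, so callers outside the module must match on the EOF case before they can move forward. The single unchecked read is confined to the module.

```rust
mod source {
    /// Assumed sentinel value for this sketch.
    const SENTINEL: u8 = 0;

    pub struct Source {
        buf: Vec<u8>, // source bytes followed by one SENTINEL
        pos: usize,   // invariant: pos < buf.len()
    }

    /// Result of a peek; callers must handle both variants.
    pub enum Peeked<'a> {
        Byte(NotEof<'a>),
        Eof,
    }

    /// Proof that the current byte is not the end of the file.
    /// Fields are private, so only `Source::peek` can construct it.
    pub struct NotEof<'a> {
        src: &'a mut Source,
        byte: u8,
    }

    impl Source {
        pub fn new(text: &str) -> Self {
            let mut buf = Vec::with_capacity(text.len() + 1);
            buf.extend_from_slice(text.as_bytes());
            buf.push(SENTINEL);
            Source { buf, pos: 0 }
        }

        pub fn peek(&mut self) -> Peeked<'_> {
            // SAFETY: `pos < buf.len()` holds on construction and is
            // preserved by `NotEof::consume`, the only place `pos` changes.
            let byte = unsafe { *self.buf.get_unchecked(self.pos) };
            // The position comparison runs only when the sentinel value is seen.
            if byte == SENTINEL && self.pos == self.buf.len() - 1 {
                Peeked::Eof
            } else {
                Peeked::Byte(NotEof { src: self, byte })
            }
        }
    }

    impl NotEof<'_> {
        pub fn byte(&self) -> u8 {
            self.byte
        }

        /// Advances past the byte. Safe because this value proves the
        /// current byte is not the final sentinel.
        pub fn consume(self) -> u8 {
            self.src.pos += 1;
            self.byte
        }
    }
}

/// Usage example: counts bytes, including NUL characters in the source.
pub fn count_bytes(text: &str) -> usize {
    let mut src = source::Source::new(text);
    let mut n = 0;
    loop {
        match src.peek() {
            source::Peeked::Byte(b) => {
                b.consume();
                n += 1;
            }
            source::Peeked::Eof => return n,
        }
    }
}
```

Whether this stays ergonomic once multi-byte lookahead and SIMD reads are involved is exactly the open question; the sketch covers only single-byte peek and advance.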

Boshen · Answer 1 · Fri Feb 09 2024 23:34:34 GMT+0800 (China Standard Time)

> Does the cost of copying the whole source text at the start outweigh the gain of removing bounds checks everywhere else?

> Could we add an entry point to the parser which takes a mutable String as source text? If that String already has excess capacity, we could apply this approach without the cost of copying the source text.

There is no way the parser can own the string.

I think we should measure the combined performance of cloning the string + removing all bounds checks. In theory, it should still be faster.

Boshen · Answer 2 · Fri Feb 09 2024 23:38:15 GMT+0800 (China Standard Time)

> Is there a way to build a safe abstraction which statically forces every part of the lexer to check for and handle the EOF sentinel? (rather than making the entire lexer a pit of unsafe)

I don't think so... I don't know about you, but I'm pretty confident given all the tests + Miri + 2 fuzzers.

I think you are essentially writing C at this point :-)

overlookmotel · Answer 3 · Fri Feb 09 2024 23:47:44 GMT+0800 (China Standard Time)

> There is no way the parser can own the string.

Does it have to? Can it not borrow a &mut String from the user but "hold on" to that mutable reference (the way it currently holds on to an immutable &str) to prevent any changes to the string content afterwards?
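A sketch of what holding on to a `&mut String` could look like, under assumptions: `BorrowedPadded` and its methods are hypothetical names, not an existing OXC API, and the sentinel is a single NUL byte. While the wrapper is alive, the borrow checker prevents the caller from touching the string, and the sentinel is removed again on drop. If the string already has spare capacity, appending the sentinel does not copy the source.

```rust
/// Assumed sentinel for this sketch; one byte in UTF-8.
const SENTINEL: char = '\0';

/// Holds the caller's string mutably for as long as the padded view is in use.
pub struct BorrowedPadded<'a> {
    text: &'a mut String,
    source_len: usize,
    had_to_grow: bool,
}

impl<'a> BorrowedPadded<'a> {
    pub fn new(text: &'a mut String) -> Self {
        let source_len = text.len();
        // With no spare capacity, the push below has to reallocate,
        // which copies the existing contents.
        let had_to_grow = text.capacity() == source_len;
        text.push(SENTINEL);
        BorrowedPadded { text, source_len, had_to_grow }
    }

    /// Source bytes followed by the sentinel.
    pub fn padded_bytes(&self) -> &[u8] {
        self.text.as_bytes()
    }

    /// The original source text, without the sentinel.
    pub fn source(&self) -> &str {
        &self.text[..self.source_len]
    }

    /// Whether appending the sentinel required growing the allocation.
    pub fn had_to_grow(&self) -> bool {
        self.had_to_grow
    }
}

impl Drop for BorrowedPadded<'_> {
    fn drop(&mut self) {
        // Hand the string back to the caller exactly as it was received.
        self.text.truncate(self.source_len);
    }
}
```

This only covers callers that can lend a `String` mutably; it does not help when the source is an immutable `&str`.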

> I think you are essentially writing C at this point :-)

I really don't want to be C! I have some potential ideas for a safe API, but I'm just not sure if it can be made ergonomic enough.

If anyone else has any ideas on that, I'd love to hear them.

Boshen · Answer 4 · Fri Feb 09 2024 23:58:32 GMT+0800 (China Standard Time)

The source text can come from another thread, e.g. this absurd usage https://github.com/oxc-project/oxc/blob/main/crates/oxc_parser/examples/multi-thread.rs

> I think we should measure the combined performance of cloning the string + removing all bounds checks. In theory, it should still be faster.

Let's try this so we don't break the public APIs on our first try.