oxc-project / oxc

⚓ A collection of JavaScript tools written in Rust.

Home Page:https://oxc.rs

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Make `Atom` safe and faster

Boshen opened this issue · comments

Atom had a memory leak which is not desirable: #1803

I removed the memory and miri test to the parser: #2294

The change introduced some performance regressions and a unsafe

// SAFETY: It is unsafe to use this string after the allocator is dropped.
Self(AtomImpl::Arena(unsafe { std::mem::transmute(s) }))

The next goal is to remove this unsafe.

Can I tackle this? I already have a WIP which steals a lot of the efficient code of CompactString. @Boshen I know you're aware of that, but obviously you ran out of patience waiting for me (I don't blame you). I seem to have got side-tracked by trying to optimize the lexer!

But now that the memory leak is solved, perhaps it's a little less urgent?

I want to enable as many good engineering practices as possible so I was eager to make the code memory-leak free so we can turn on miri.

It wasn't about impatience 😅

Sorry, that wasn't criticism or a barbed comment. It was more of an apology that I said I'd do it, and then I haven't. I started hacking on the lexer just to gain familiarity with the codebase, but have ended up going down a performance-chasing rabbit hole! It's too much fun.

Anyway, I think I'm getting towards the end of that work now. #2288 is the major building block for lexer optimization, and I should have the PR which builds on it to speed up lexing identifiers (the one that gives the 10% parser speed up) ready in a few days. So should be able to turn my attention to Atom pretty soon.

That sound OK to you?

That sound OK to you?

I can't wait! But take your time and enjoy the fun!

Instead of exposing a single Atom type, I'm thinking along the lines of:

  • ArenaStr<'a> with the functionaries of inlined string + arena allocated str
  • CompactStr with the functionaries of inlined string + heap allocated str

For maximum performance, ArenaStr<'a> should be the size of u32, with only a pointer and a length (sans capacity).

The ArenaStr<'a> will have to_compact_str and to_string for downstream usages, the user has the choice of choosing the most convenient one for them.

This is really nice - nice combination of efficiency and usability.

Just to check a couple of things:

ArenaStr<'a> is what will be used in the AST? to_compact_str / to_string are for downstream users, if they don't want to deal with lifetimes in their own code, or if they want to mutate the string?

I'm unclear how ArenaStr<'a> can be size of a u32. I was imagining ArenaStr<'a> being 16 bytes:

  • Bytes 0-7: Pointer
  • Bytes 8-11: Length (u32)
  • Bytes 12-15: Unused (though could find uses for this e.g. index into Vec of escaped strings, or flags for "String is in source code" / "String is escaped" / "String is Unicode").

We could probably squeeze it down smaller using pointer compression, but:

  1. That optimization could also apply to Vecs and Boxs in the AST, so I was thinking of it as a broader issue.
  2. 16 is probably a good size for making majority of strings be stored inline (most JS identifiers are under 16 bytes).

Or have I misunderstood, and you have something else in mind?

ArenaStr<'a> is what will be used in the AST?

Yes.

ArenaStr<'a> can be size of a u32. I was imagining ArenaStr<'a> being 16 bytes

16 bytes would be even better! If you can shrink it down :-)

Alright, we are going to solve this problem in a few steps:

  • refactor the Atom API without changing the implementation, i.e. add CompactStr and to_compact_str. @Boshen
  • Change some of the downstream usages to use CompactStr, i.e. all crates other than the AST and parser crate should use CompactStr @Boshen
  • Change Atom to ArenaStr<'a> within the ast and parser crates @Boshen
  • Rework the ArenaStr<'a> implementation to make it inlinable. @overlookmotel

Sounds like a plan!

Do we need CompactStr to be mutable? If not, we could make our own version of CompactStr which has same size and layout as ArenaStr. It would only differ in where it stores out-of-line content for strings over 16 bytes (heap instead of reference to source text/arena).

Advantage of that would be that for strings under 16 bytes, converting from ArenaStr<'a> to CompactStr (or back) would be free - just transmute.

But if CompactStr needs to be mutable, this wouldn't work, as it'd require a capacity field.

16 bytes would be even better! If you can shrink it down :-)

Yes, this is totally possible. Same size as Box<str>, and use compact_str's trick of smuggling inline/arena discriminant + inline length in the last byte (only byte values 0-191 are legal for last byte of a UTF-8 string, so you have values 192-255 free to represent other states).