Make `Atom` safe and faster
Boshen opened this issue · comments
`Atom` had a memory leak, which is not desirable: #1803
I removed the memory and miri test to the parser: #2294
The change introduced some performance regressions and an `unsafe`:
oxc/crates/oxc_span/src/atom.rs, lines 103 to 104 in 6002560
The next goal is to remove this `unsafe`.
Can I tackle this? I already have a WIP which steals a lot of the efficient code of `CompactString`. @Boshen I know you're aware of that, but obviously you ran out of patience waiting for me (I don't blame you). I seem to have got side-tracked by trying to optimize the lexer!
But now that the memory leak is solved, perhaps it's a little less urgent?
I want to enable as many good engineering practices as possible so I was eager to make the code memory-leak free so we can turn on miri.
It wasn't about impatience 😅
Sorry, that wasn't criticism or a barbed comment. It was more of an apology that I said I'd do it, and then I haven't. I started hacking on the lexer just to gain familiarity with the codebase, but have ended up going down a performance-chasing rabbit hole! It's too much fun.
Anyway, I think I'm getting towards the end of that work now. #2288 is the major building block for lexer optimization, and I should have the PR which builds on it to speed up lexing identifiers (the one that gives the 10% parser speed up) ready in a few days. So I should be able to turn my attention to `Atom` pretty soon.
That sound OK to you?
> That sound OK to you?
I can't wait! But take your time and enjoy the fun!
Instead of exposing a single `Atom` type, I'm thinking along the lines of:
- `ArenaStr<'a>` with the functionality of an inlined string + arena-allocated str
- `CompactStr` with the functionality of an inlined string + heap-allocated str
For maximum performance, `ArenaStr<'a>` should be the size of `u32`, with only a pointer and a length (sans capacity).
`ArenaStr<'a>` will have `to_compact_str` and `to_string` for downstream usages, so users can choose whichever is most convenient for them.
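To make the proposed split concrete, here is a minimal sketch of how the two types could relate. It is only an illustration under assumptions: `ArenaStr::new`, `as_str` and the field layouts are hypothetical, a plain `&'a str` stands in for the arena-backed storage, and `Box<str>` stands in for the inline-or-heap storage of `CompactStr`. Only the names `ArenaStr<'a>`, `CompactStr`, `to_compact_str` and `to_string` come from the proposal above.

```rust
use std::fmt;

/// Hypothetical borrowed string for AST nodes: pointer + length, no capacity.
/// A plain `&'a str` stands in for arena-backed (and later inlinable) storage.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ArenaStr<'a>(&'a str);

/// Hypothetical owned, lifetime-free counterpart for downstream crates.
/// `Box<str>` stands in for an inline-or-heap representation.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CompactStr(Box<str>);

impl<'a> ArenaStr<'a> {
    /// Wrap a string slice that lives as long as the arena / source text.
    pub fn new(s: &'a str) -> Self {
        Self(s)
    }

    pub fn as_str(&self) -> &'a str {
        self.0
    }

    /// Copy the contents into an owned value that carries no lifetime.
    pub fn to_compact_str(&self) -> CompactStr {
        CompactStr(self.0.into())
    }
}

impl CompactStr {
    pub fn as_str(&self) -> &str {
        &self.0
    }
}

/// Implementing `Display` provides `to_string()` through the `ToString` blanket impl.
impl fmt::Display for ArenaStr<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.0)
    }
}
```

With this shape, code in the AST and parser keeps the cheap `Copy` handle, while downstream code calls `to_compact_str()` or `to_string()` once and drops the lifetime.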
This is really nice - nice combination of efficiency and usability.
Just to check a couple of things:
- `ArenaStr<'a>` is what will be used in the AST?
- `to_compact_str` / `to_string` are for downstream users, if they don't want to deal with lifetimes in their own code, or if they want to mutate the string?
I'm unclear how `ArenaStr<'a>` can be the size of a `u32`. I was imagining `ArenaStr<'a>` being 16 bytes:
- Bytes 0-7: Pointer
- Bytes 8-11: Length (`u32`)
- Bytes 12-15: Unused (though we could find uses for this, e.g. index into a `Vec` of escaped strings, or flags for "String is in source code" / "String is escaped" / "String is Unicode").
We could probably squeeze it down smaller using pointer compression, but:
- That optimization could also apply to `Vec`s and `Box`s in the AST, so I was thinking of it as a broader issue.
- 16 is probably a good size for making the majority of strings be stored inline (most JS identifiers are under 16 bytes).
Or have I misunderstood, and you have something else in mind?
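The 16-byte layout described above can be written down as a `#[repr(C)]` struct to check the arithmetic. This is a sketch under assumptions, not the actual oxc type: `ArenaStrRepr`, its field names, `new` and `as_str` are all hypothetical, and the size check only holds on 64-bit targets.

```rust
use std::marker::PhantomData;

/// Hypothetical 16-byte layout (on 64-bit targets) for an arena string handle.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct ArenaStrRepr<'a> {
    /// Bytes 0-7: pointer to the start of the string data.
    ptr: *const u8,
    /// Bytes 8-11: length in bytes.
    len: u32,
    /// Bytes 12-15: currently unused; room for flags or an index.
    extra: u32,
    /// Ties the handle to the lifetime of the arena / source text.
    _marker: PhantomData<&'a str>,
}

// Compile-time check of the intended size on 64-bit targets.
#[cfg(target_pointer_width = "64")]
const _: () = assert!(std::mem::size_of::<ArenaStrRepr<'static>>() == 16);

impl<'a> ArenaStrRepr<'a> {
    /// Returns `None` if the string is too long for a `u32` length.
    pub fn new(s: &'a str) -> Option<Self> {
        let len = u32::try_from(s.len()).ok()?;
        Some(Self { ptr: s.as_ptr(), len, extra: 0, _marker: PhantomData })
    }

    pub fn as_str(&self) -> &'a str {
        // SAFETY: `ptr` and `len` were taken from a valid `&'a str` in `new`
        // and are never modified afterwards.
        unsafe {
            let bytes = std::slice::from_raw_parts(self.ptr, self.len as usize);
            std::str::from_utf8_unchecked(bytes)
        }
    }
}
```

Note that this version still needs `unsafe` to rebuild the `&str`; it is only meant to show where the 16 bytes go.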
> `ArenaStr<'a>` is what will be used in the AST?
Yes.
> `ArenaStr<'a>` can be size of a `u32`. I was imagining `ArenaStr<'a>` being 16 bytes
16 bytes would be even better! If you can shrink it down :-)
Alright, we are going to solve this problem in a few steps:
- Refactor the `Atom` API without changing the implementation, i.e. add `CompactStr` and `to_compact_str`. @Boshen
- Change some of the downstream usages to use `CompactStr`, i.e. all crates other than the AST and parser crates should use `CompactStr`. @Boshen
- Change `Atom` to `ArenaStr<'a>` within the AST and parser crates. @Boshen
- Rework the `ArenaStr<'a>` implementation to make it inlinable. @overlookmotel
Sounds like a plan!
Do we need `CompactStr` to be mutable? If not, we could make our own version of `CompactStr` which has the same size and layout as `ArenaStr`. It would only differ in where it stores out-of-line content for strings over 16 bytes (heap instead of reference to source text/arena).
The advantage of that would be that for strings under 16 bytes, converting from `ArenaStr<'a>` to `CompactStr` (or back) would be free - just `transmute`.
But if `CompactStr` needs to be mutable, this wouldn't work, as it'd require a `capacity` field.
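The cost of a `capacity` field is visible in the standard library types themselves. The sketch below compares the immutable `Box<str>` (pointer + length) with the growable `String` (pointer + length + capacity). The function names `handle_sizes` and `freeze` are hypothetical, and the concrete sizes (16 and 24 bytes) are what current 64-bit targets give in practice rather than a documented guarantee.

```rust
use std::mem::size_of;

/// Sizes in bytes of an immutable string handle and a growable one.
/// `Box<str>` stores pointer + length; `String` also stores a capacity.
pub fn handle_sizes() -> (usize, usize) {
    (size_of::<Box<str>>(), size_of::<String>())
}

/// Giving up mutability drops the capacity field:
/// `String::into_boxed_str` shrinks the allocation to fit and returns `Box<str>`.
pub fn freeze(s: String) -> Box<str> {
    s.into_boxed_str()
}
```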
> 16 bytes would be even better! If you can shrink it down :-)
Yes, this is totally possible. Same size as `Box<str>`, and use `compact_str`'s trick of smuggling inline/arena discriminant + inline length in the last byte (only byte values 0-191 are legal for the last byte of a UTF-8 string, so you have values 192-255 free to represent other states).
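The last-byte trick can be sketched for the inline case alone. This is a simplified illustration, not `compact_str`'s actual code: `InlineBuf`, `INLINE_CAP` and `LEN_TAG` are hypothetical names, and the out-of-line (arena or heap) state is left out. The idea is that a full 16-byte string ends in a real UTF-8 byte (always below 192), while a shorter string stores `192 + len` in the last byte.

```rust
/// Total size of the hypothetical inline buffer, in bytes.
const INLINE_CAP: usize = 16;
/// First byte value that can never end a valid UTF-8 string (0b1100_0000).
const LEN_TAG: u8 = 192;

/// Inline string storage: up to 16 bytes of UTF-8, with no separate length field.
/// Tag values 208-255 stay free, e.g. for an "out of line" marker (not modelled here).
#[derive(Clone, Copy)]
pub struct InlineBuf([u8; INLINE_CAP]);

impl InlineBuf {
    /// Returns `None` if the string does not fit inline.
    pub fn new(s: &str) -> Option<Self> {
        if s.len() > INLINE_CAP {
            return None;
        }
        let mut buf = [0u8; INLINE_CAP];
        buf[..s.len()].copy_from_slice(s.as_bytes());
        if s.len() < INLINE_CAP {
            // The last byte is not needed for data, so it stores the length tag.
            buf[INLINE_CAP - 1] = LEN_TAG + s.len() as u8;
        }
        // For a full 16-byte string the last byte is string data, always < 192.
        Some(Self(buf))
    }

    pub fn len(&self) -> usize {
        let last = self.0[INLINE_CAP - 1];
        if last < LEN_TAG {
            INLINE_CAP
        } else {
            (last - LEN_TAG) as usize
        }
    }

    pub fn as_str(&self) -> &str {
        // Checked conversion keeps this sketch free of `unsafe`.
        std::str::from_utf8(&self.0[..self.len()]).expect("buffer holds valid UTF-8")
    }
}
```

A real implementation would likely skip the UTF-8 re-validation in `as_str`, which is where `unsafe` tends to reappear; the checked version is used here only to keep the sketch safe.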
Continue on oxc-project/backlog#46