yberreby / rgo

[STALLED] A Go compiler, written in Rust.


String interning

yberreby opened this issue · comments

We might want to intern identifier strings. String interning speeds up comparisons, which are likely going to be very frequent during various stages of the compilation pipeline. It should also reduce memory allocation during parsing.

rustc uses a string interner, from which we could take inspiration.
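To make the idea concrete, here is a minimal single-threaded sketch of what such an interner could look like. The names `Name`, `Interner`, `intern` and `resolve` are assumptions chosen for illustration; this is neither rgo's code nor rustc's actual implementation. For simplicity it stores each string twice (once as the map key, once in the table), which a real implementation would probably want to avoid.

```rust
use std::collections::HashMap;

/// Hypothetical handle type: an index into the interner's string table.
/// Comparing two handles is a plain integer comparison.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct Name(u32);

/// Illustrative single-threaded interner.
#[derive(Default)]
pub struct Interner {
    map: HashMap<String, Name>,
    strings: Vec<String>,
}

impl Interner {
    pub fn new() -> Self {
        Self::default()
    }

    /// Returns the existing handle for `s`, or allocates a new one.
    pub fn intern(&mut self, s: &str) -> Name {
        if let Some(&name) = self.map.get(s) {
            return name;
        }
        let name = Name(self.strings.len() as u32);
        self.map.insert(s.to_owned(), name);
        self.strings.push(s.to_owned());
        name
    }

    /// Looks up the string behind a handle.
    pub fn resolve(&self, name: Name) -> &str {
        &self.strings[name.0 as usize]
    }
}
```

With this shape, interning the same identifier twice yields equal handles, so later passes can compare `Name` values instead of string contents.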

commented

Is it possible that rgo will have some kind of parallel parsing? In that case, we should consider making the interner thread-safe with synchronisation primitives.
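As a rough sketch of what a thread-safe interner could look like, the version below guards both tables with a single `std::sync::RwLock`. The type and method names (`SyncInterner`, `Tables`, `Name`) are hypothetical, and this is only one possible design: lookups of known strings take the shared read lock, and only new strings take the exclusive write lock.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical handle type: an index into the string table.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct Name(u32);

#[derive(Default)]
struct Tables {
    map: HashMap<String, Name>,
    strings: Vec<String>,
}

/// Illustrative interner that can be shared between parser threads.
#[derive(Default)]
pub struct SyncInterner {
    tables: RwLock<Tables>,
}

impl SyncInterner {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn intern(&self, s: &str) -> Name {
        {
            // Fast path: many threads may hold the read lock at once.
            let t = self.tables.read().unwrap();
            if let Some(&name) = t.map.get(s) {
                return name;
            }
        } // read lock released here

        // Slow path: exclusive access to insert a new string.
        let mut t = self.tables.write().unwrap();
        // Re-check, since another thread may have inserted `s`
        // between releasing the read lock and taking the write lock.
        if let Some(&name) = t.map.get(s) {
            return name;
        }
        let name = Name(t.strings.len() as u32);
        t.map.insert(s.to_owned(), name);
        t.strings.push(s.to_owned());
        name
    }

    /// Returns a copy of the string, because a reference cannot
    /// outlive the lock guard.
    pub fn resolve(&self, name: Name) -> String {
        self.tables.read().unwrap().strings[name.0 as usize].clone()
    }
}
```

Note that `resolve` has to clone here, which is one of the costs of keeping the tables behind a lock.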

Yes, it is possible. I'm not sure a shared interner would be very efficient, though, because of false sharing and the synchronization overhead. Maybe it would be more efficient not to intern during parsing, and to do a single interning pass once all files have been parsed?

commented

Well, that seems more like delaying the problem than solving it, if semantic/codegen could be parallel.

Does it? AFAICT we shouldn't need to mutate the interner once parsing is complete, so the other passes could read interned strings concurrently without false sharing or synchronization.
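A small sketch of that claim, assuming a hypothetical read-only `FrozenInterner` that is never mutated after parsing: several scoped threads borrow it immutably and resolve IDs with no lock at all. The function `total_len_parallel` is just a stand-in for whatever work a later pass would do.

```rust
use std::thread;

/// Hypothetical read-only table; the index is the interned ID.
pub struct FrozenInterner {
    strings: Vec<String>,
}

impl FrozenInterner {
    pub fn new(strings: Vec<String>) -> Self {
        FrozenInterner { strings }
    }

    pub fn resolve(&self, id: usize) -> &str {
        &self.strings[id]
    }
}

/// Sums the lengths of the names behind `ids`, splitting the work
/// across threads that all read the same interner without locking.
pub fn total_len_parallel(interner: &FrozenInterner, ids: &[usize], workers: usize) -> usize {
    let workers = workers.max(1);
    // Chunk size is at least 1, since `chunks(0)` would panic.
    let chunk = ((ids.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = ids
            .chunks(chunk)
            .map(|part| {
                s.spawn(move || {
                    part.iter()
                        .map(|&id| interner.resolve(id).len())
                        .sum::<usize>()
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .sum::<usize>()
    })
}
```

Shared `&FrozenInterner` references are enough here because nothing is mutated, so the compiler accepts the concurrent reads without any `Mutex` or `RwLock`.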

commented

Hmmm, I guess you're right on that.

commented

Not sure how names would be stored in the AST, if interning is done in a separate pass.

They could be stored as an enum that can hold either a String or an interner ID, I guess.
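One way that enum could look, as a sketch only: `Ident`, its variants, and `intern_in_place` are hypothetical names, and the small `Interner` is included just to keep the example self-contained. The parser would produce `Ident::Raw`, and a later pass would rewrite each node to `Ident::Interned`.

```rust
use std::collections::HashMap;

/// Hypothetical interner ID.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct Name(u32);

/// A name in the AST: a plain string straight out of the parser,
/// or an ID once the interning pass has run.
#[derive(Debug, PartialEq, Eq)]
pub enum Ident {
    Raw(String),
    Interned(Name),
}

/// Minimal interner, included only to make the sketch self-contained.
#[derive(Default)]
pub struct Interner {
    map: HashMap<String, Name>,
    strings: Vec<String>,
}

impl Interner {
    pub fn intern(&mut self, s: &str) -> Name {
        if let Some(&name) = self.map.get(s) {
            return name;
        }
        let name = Name(self.strings.len() as u32);
        self.map.insert(s.to_owned(), name);
        self.strings.push(s.to_owned());
        name
    }
}

impl Ident {
    /// Replaces a `Raw` identifier with its interned ID.
    /// Already-interned identifiers are left untouched.
    pub fn intern_in_place(&mut self, interner: &mut Interner) {
        if let Ident::Raw(s) = self {
            let name = interner.intern(s);
            // The original String is dropped here.
            *self = Ident::Interned(name);
        }
    }
}
```

Every later pass would then have to handle (or reject) the `Raw` variant, which is part of the cost of this design.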

The problem with this approach (interning as a separate pass) is that we're allocating many strings just to throw them away immediately, though. I wonder how other parallel compilers do it?

Your synchronized approach is starting to look more interesting. And we could convert the synchronized interner to one without synchronization once parsing is finished, since we won't mutate it further.

commented

Liking the sound of that idea.

The enum, or stripping away the synchronization after parsing?

commented

I was referring to the stripping away method.

However, I now think it could be a bad idea to get rid of synchronisation after parsing. I'm sure there are valid use cases for modifying/adding identifiers in semantic analysis, especially if rgo is used as a library in another crate.

commented

I've got a working implementation using RwLock right now. I don't see any way stripping away synchronisation would work.

"I don't see any way stripping away synchronisation would work."

Here's how I see it.

This is rustc's interner:

pub struct Interner<T> {
    map: RefCell<HashMap<T, Name>>,
    vect: RefCell<Vec<T> >,
}

The idea is to have two different structs: one where the HashMap and the Vec are wrapped in RwLock / Mutex, and one where they are stored as-is, or in a RefCell (I don't know why a RefCell is used in rustc's interner). During parsing, we'd use the first kind of struct, then we'd move out the map and the vector into the "plain" struct for use in other passes.
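A rough sketch of the two-struct idea, under the assumption that identifiers are stored as `String` and that the names `SyncInterner`, `FrozenInterner` and `freeze` are made up for illustration. `freeze` takes `self` by value, so it can only be called once no parser thread still holds a reference, and `RwLock::into_inner` then hands back the map and vector without copying them.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical interner ID.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct Name(u32);

/// Used during (possibly parallel) parsing.
#[derive(Default)]
pub struct SyncInterner {
    map: RwLock<HashMap<String, Name>>,
    vect: RwLock<Vec<String>>,
}

/// Used by later passes: same data, no locks, read-only API.
pub struct FrozenInterner {
    map: HashMap<String, Name>,
    vect: Vec<String>,
}

impl SyncInterner {
    pub fn intern(&self, s: &str) -> Name {
        // Simplest correct scheme: always lock `map` first, then `vect`.
        // A read-lock fast path could be added on top of this.
        let mut map = self.map.write().unwrap();
        if let Some(&name) = map.get(s) {
            return name;
        }
        let mut vect = self.vect.write().unwrap();
        let name = Name(vect.len() as u32);
        vect.push(s.to_owned());
        map.insert(s.to_owned(), name);
        name
    }

    /// Moves the tables out of their locks once parsing is finished.
    pub fn freeze(self) -> FrozenInterner {
        FrozenInterner {
            map: self.map.into_inner().unwrap(),
            vect: self.vect.into_inner().unwrap(),
        }
    }
}

impl FrozenInterner {
    pub fn get(&self, s: &str) -> Option<Name> {
        self.map.get(s).copied()
    }

    pub fn resolve(&self, name: Name) -> &str {
        &self.vect[name.0 as usize]
    }
}
```

IDs handed out during parsing stay valid after `freeze`, since the vector is moved rather than rebuilt.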

Given the added complexity of parallel parsing and the design trade-offs it implies, maybe it would be better to use a sequential implementation first, then, when the project is more mature, switch to a parallel one if the performance gain is substantial.

commented

The biggest problem I see is from a usability standpoint.

If we strip off synchronisation, then if one thread modifies an interner, the other threads won't see it.
Unless we make the result read-only, but that provides a limited API to clients and semantic passes.

I was thinking of making it read-only, yes. Cloning the interner seems wasteful.

Could you provide an example of a situation where a semantic pass would need to modify the interner? One case I can think of would be rewriting the AST to apply our own optimizations before LLVM.

Either way, I think this is premature optimization. Parsing is usually much, much faster than other passes such as codegen, and we cannot make an informed decision yet: the parser is not finished, and the type checking, semantic analysis, translation and codegen passes have not been written.

commented

I agree, let's leave it until later.