gimli-rs / gimli

A library for reading and writing the DWARF debugging format

Home Page: https://docs.rs/gimli/

Add a way to share parsed `Abbreviations` between `Unit`s

Swatinem opened this issue

I think I’m hitting an interesting pathological case with a massive 5.5G DWARF file:

The file has ~280k Units, and ~1.5k abbrevs. It looks like those abbrevs are shared by all the units (they all have abbrev_offset = 0).

That means that Unit::new re-parses the same ~1.5k abbrevs over and over again for every unit, holds each copy in memory, and only frees them at the end.
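To put a rough number on the duplicated work, here is a small std-only sketch. `UnitHeader` is a hypothetical stand-in (not gimli's type) that keeps only the abbreviation offset, and the unit and entry counts are the approximate figures quoted above.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a unit header; only the abbrev offset matters here.
struct UnitHeader {
    debug_abbrev_offset: u64,
}

/// Table parses performed when every unit parses its own abbreviation table.
fn parses_without_sharing(units: &[UnitHeader]) -> usize {
    units.len()
}

/// Table parses performed when each distinct offset is parsed only once.
fn parses_with_sharing(units: &[UnitHeader]) -> usize {
    units
        .iter()
        .map(|u| u.debug_abbrev_offset)
        .collect::<HashSet<u64>>()
        .len()
}

fn main() {
    // Approximate shape of the file described above: ~280k units, all at offset 0.
    let units: Vec<UnitHeader> = (0..280_000)
        .map(|_| UnitHeader { debug_abbrev_offset: 0 })
        .collect();
    let entries_per_table = 1_500usize;

    let naive = parses_without_sharing(&units) * entries_per_table;
    let shared = parses_with_sharing(&units) * entries_per_table;
    println!("abbrev entries parsed without sharing: {naive}"); // 420000000
    println!("abbrev entries parsed with sharing:    {shared}"); // 1500
}
```

With these assumed numbers, that is on the order of 420 million abbreviation entries decoded where 1,500 would do.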

A way to share the Abbreviations between Units would be nice to avoid this duplicated work.

Do you need to share them (reference them simultaneously from multiple Units), or is it enough to be able to reuse the Abbreviations from a previous Unit that you've finished processing?

We are currently keeping all the Units in memory because, I think, there might be cross-references between them.

This means that we can’t easily reuse the Abbreviations from a previous Unit. I wonder if we could lift that restriction at some point, but thus far we construct all the Units lazily and keep them around.

See also getsentry/symbolic#683 for my workaround upstream. I essentially copied all of Unit::new which is unfortunate.

Having a second constructor that takes Abbreviations would make that workaround a bit less ugly. Then I can mem::swap stuff around as I please to avoid the re-parsing.
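A rough sketch of what such a second constructor could look like. All types here are simplified, hypothetical stand-ins rather than gimli's real definitions, and the name `new_with_abbreviations` and the use of `Arc` are assumptions made for illustration only.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a parsed abbreviation table.
struct Abbreviations {
    codes: Vec<u64>,
}

impl Abbreviations {
    /// Pretends to parse the table at `offset`; real code would decode .debug_abbrev.
    fn parse(_debug_abbrev: &[u8], _offset: u64) -> Abbreviations {
        Abbreviations { codes: (1..=1_500).collect() }
    }
}

/// Hypothetical stand-in for a unit header.
struct UnitHeader {
    debug_abbrev_offset: u64,
}

struct Unit {
    header: UnitHeader,
    abbreviations: Arc<Abbreviations>,
}

impl Unit {
    /// Behaviour described in this issue: every call parses the table again.
    fn new(debug_abbrev: &[u8], header: UnitHeader) -> Unit {
        let parsed = Abbreviations::parse(debug_abbrev, header.debug_abbrev_offset);
        Unit::new_with_abbreviations(header, Arc::new(parsed))
    }

    /// Hypothetical second constructor: the caller supplies an already parsed table.
    fn new_with_abbreviations(header: UnitHeader, abbreviations: Arc<Abbreviations>) -> Unit {
        Unit { header, abbreviations }
    }
}

fn main() {
    let debug_abbrev: &[u8] = &[];

    // Parse once, then hand the same table to every unit.
    let shared = Arc::new(Abbreviations::parse(debug_abbrev, 0));
    let units: Vec<Unit> = (0..4)
        .map(|_| {
            let header = UnitHeader { debug_abbrev_offset: 0 };
            Unit::new_with_abbreviations(header, Arc::clone(&shared))
        })
        .collect();
    assert!(units.iter().all(|u| Arc::ptr_eq(&u.abbreviations, &shared)));

    // The plain constructor still works but allocates its own copy.
    let own = Unit::new(debug_abbrev, UnitHeader { debug_abbrev_offset: 0 });
    assert!(!Arc::ptr_eq(&own.abbreviations, &shared));

    println!(
        "offset {} has {} codes shared by {} units",
        units[0].header.debug_abbrev_offset,
        shared.codes.len(),
        units.len()
    );
}
```

With a shared pointer like this, the caller decides when to reuse a table, and no swapping is needed before or after each use.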

This is really an interesting case and very much depends on the underlying DWARF data.

For my pathological case, there is only a single Abbreviations table with ~1500 entries in debug_abbrev that is shared by all CUs.

But other files (for example electron.debug) are rather duplicated and seem to have one abbreviation table per CU.

I think the second case already wastes bytes in the raw DWARF by not deduplicating these tables, but on the other hand gimli only parses as much as it needs (well, except for the abbrevs that are duplicated across CUs).
The first case, however, highlights a shortcoming in gimli: it wastes a lot of CPU and memory parsing the complete Abbreviations table for each CU, even though it could get away with parsing it only once.

This is definitely a shortcoming in gimli, which I was aware of in the past and forgot about. It wasn't a problem before we added Dwarf and Unit.

mem::swap sounds too error prone. You would need to swap every time before and after calling something that uses it.

Ideally there would be a reference to an abbreviations cache somewhere, perhaps in Dwarf. I'm not sure how hard it will be to design the ownership for that though.

> before and after calling something that uses it.

My current workaround is very brittle anyway, in the sense that I very carefully avoided calling anything that uses it, which means things can break easily in the future if people are not careful :-(

And yes, having an Abbreviations cache in Dwarf sounds like a good solution to me. But as you mentioned, ownership might be a pain there, and would either require some more lifetime parameters or mean just going with Arc.
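For illustration, a minimal std-only sketch of the Arc variant of such a cache. `AbbreviationsCache`, `get_or_parse`, and the simplified `Dwarf` and `Abbreviations` structs are hypothetical names invented for this example, not gimli's API. The cache is keyed by the offset into debug_abbrev and hands out `Arc` clones, so no extra lifetime parameters are needed.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a parsed abbreviation table.
struct Abbreviations {
    entry_count: usize,
}

/// Hypothetical cache keyed by the table's offset into debug_abbrev.
#[derive(Default)]
struct AbbreviationsCache {
    tables: Mutex<HashMap<u64, Arc<Abbreviations>>>,
}

impl AbbreviationsCache {
    /// Returns the cached table for `offset`, calling `parse` only on a miss.
    fn get_or_parse<F>(&self, offset: u64, parse: F) -> Arc<Abbreviations>
    where
        F: FnOnce(u64) -> Abbreviations,
    {
        let mut tables = self.tables.lock().unwrap();
        let entry: &Arc<Abbreviations> = tables
            .entry(offset)
            .or_insert_with(|| Arc::new(parse(offset)));
        Arc::clone(entry)
    }
}

/// Simplified stand-in for the struct that would own the cache.
struct Dwarf {
    abbreviations_cache: AbbreviationsCache,
}

fn main() {
    let dwarf = Dwarf { abbreviations_cache: AbbreviationsCache::default() };
    let mut parses = 0usize;

    // Every unit asks for the table at offset 0; only the first request parses.
    for _unit in 0..10_000 {
        let abbrevs = dwarf.abbreviations_cache.get_or_parse(0, |_| {
            parses += 1;
            Abbreviations { entry_count: 1_500 }
        });
        assert_eq!(abbrevs.entry_count, 1_500);
    }
    println!("units: 10000, table parses: {parses}"); // table parses: 1
}
```

One trade-off in this sketch is that the lock is held while parsing, which keeps the code short but serializes concurrent misses. A real design would also need to decide how parse errors are cached or propagated.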

@mstange suggested that it might be dsymutil that merges/deduplicates all the Abbreviations, and indeed, the macOS dSYM for https://github.com/electron/electron/releases/tag/v20.1.4 only has a single Abbreviations table with ~1100 entries. But it only has ~18k CUs which limits the combinatorial explosion.

My hack in getsentry/symbolic#683 to deduplicate the parsing takes symcache_debug from ~26 to ~18 seconds, and peak RSS from ~6.7G to ~2.6G, which are very good results.

Here's a dSYM with 1902 shared abbreviations and 9927 CUs, which can be used to test potential fixes: https://storage.googleapis.com/profiler-get-symbols-fixtures/XUL-E2C3444B769A3A1887EDA7C34A07A56C0.dSYM.tar.bz2