Add a way to share parsed `Abbreviations` between `Unit`s
Swatinem opened this issue · comments
I think I’m hitting an interesting pathological case with a massive 5.5G DWARF file:
The file has ~280k Units, and ~1.5k abbrevs. It looks like those abbrevs are shared by all the units (they all have abbrev_offset = 0).
That means that `Unit::new` is re-parsing the same ~1.5k abbrevs over and over again, holding them in memory and freeing those in the end.
A way to share the `Abbreviations` between `Unit`s would be nice to avoid this duplicated work.
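To make the scale of the duplication concrete, here is a small self-contained sketch that counts how many units point at each abbreviation offset. It does not use gimli; `UnitHeaderInfo`, `units_per_abbrev_offset` and `redundant_parses` are hypothetical names invented for illustration, and the only assumption is that each unit header carries an abbrev offset.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a parsed unit header; only the field needed here.
#[derive(Debug, Clone, Copy)]
pub struct UnitHeaderInfo {
    /// Offset of this unit's abbreviation table within `.debug_abbrev`.
    pub abbrev_offset: usize,
}

/// Counts how many units reference each distinct abbreviation offset.
pub fn units_per_abbrev_offset(headers: &[UnitHeaderInfo]) -> HashMap<usize, usize> {
    let mut counts = HashMap::new();
    for header in headers {
        *counts.entry(header.abbrev_offset).or_insert(0) += 1;
    }
    counts
}

/// Number of table parses that per-unit parsing performs beyond the minimum
/// of one parse per distinct offset.
pub fn redundant_parses(headers: &[UnitHeaderInfo]) -> usize {
    headers.len() - units_per_abbrev_offset(headers).len()
}
```

With ~280k units that all have `abbrev_offset = 0`, all but one of the parses are redundant under this model.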
Do you need to share them (reference them simultaneously from multiple `Unit`s), or is it enough to be able to reuse the `Abbreviations` from a previous `Unit` that you've finished processing?
We are currently keeping all the `Unit`s in memory because there might be cross-references I think.
This means that we can’t easily reuse the `Abbreviations` from a previous `Unit`. I wonder if we could lift that restriction at some point, but thus far we construct all the `Unit`s lazily and keep them around.
See also getsentry/symbolic#683 for my workaround upstream. I essentially copied all of `Unit::new`, which is unfortunate.
Having a second constructor that takes `Abbreviations` would make that workaround a bit less ugly. Then I can `mem::swap` stuff around as I please to avoid the re-parsing.
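A rough sketch of what such a second constructor could look like, using toy types rather than gimli's real ones. `Abbreviations::parse`, `Unit::new` and `Unit::with_abbreviations` below are hypothetical and do not mirror gimli's signatures, and sharing through `Arc` (instead of swapping tables in and out) is an assumption of this sketch.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a parsed abbreviation table.
#[derive(Debug)]
pub struct Abbreviations {
    pub codes: Vec<u64>,
}

impl Abbreviations {
    /// Toy "parser": treats every byte as one abbreviation code.
    pub fn parse(debug_abbrev: &[u8]) -> Abbreviations {
        Abbreviations {
            codes: debug_abbrev.iter().map(|&b| u64::from(b)).collect(),
        }
    }
}

/// Hypothetical stand-in for a unit that holds on to its abbreviations.
#[derive(Debug)]
pub struct Unit {
    pub offset: usize,
    pub abbreviations: Arc<Abbreviations>,
}

impl Unit {
    /// Models the per-unit behaviour: every call parses the table again.
    pub fn new(offset: usize, debug_abbrev: &[u8]) -> Unit {
        Unit::with_abbreviations(offset, Arc::new(Abbreviations::parse(debug_abbrev)))
    }

    /// The second constructor: takes an already parsed table, so the caller
    /// decides whether to parse or to reuse.
    pub fn with_abbreviations(offset: usize, abbreviations: Arc<Abbreviations>) -> Unit {
        Unit { offset, abbreviations }
    }
}
```

A caller would parse once and pass `Arc::clone(&table)` to each unit that uses the same abbrev offset, so all of those units can stay alive at the same time.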
This is really an interesting case and very much depends on the underlying DWARF data.
For my pathological case, there is only a single `Abbreviations` table with ~1500 entries in `debug_abbrev` that is shared by all CUs.
But other files (for example electron.debug) are rather duplicated and seem to have one abbreviation table per CU.
I think the second case already wastes bytes in the raw DWARF by not deduplicating these tables, but on the other hand is only parsing as much as needed in gimli (well, except for the abbrevs that are duplicated across CUs).
The first case however highlights a shortcoming in gimli as it is wasting a lot of CPU and memory to parse the complete `Abbreviations` table for each CU, even though it could get away with only parsing a single one.
This is definitely a shortcoming in gimli, which I was aware of in the past and forgot about. It wasn't a problem before we added `Dwarf` and `Unit`.
`mem::swap` sounds too error prone. You would need to swap every time before and after calling something that uses it.
Ideally there would be a reference to an abbreviations cache somewhere, perhaps in `Dwarf`. I'm not sure how hard it will be to design the ownership for that though.
> before and after calling something that uses it.
My current workaround is very brittle anyway in the sense that I very carefully avoided calling anything that uses it. Which means things can break easily in the future if people are not careful :-(
And yes, having an `Abbreviations` cache in `Dwarf` sounds like a good solution to me. But as you mentioned, ownership might be a pain there and either require some more lifetime parameters, or just going with `Arc`.
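For discussion, a minimal sketch of the `Arc` variant: a cache keyed by abbrev offset that parses each table at most once and hands out shared handles. This is not gimli's API; `AbbreviationsCache`, `get` and `parse_count` are hypothetical, and the "parser" is a toy that reads bytes up to a 0 terminator.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a parsed abbreviation table.
#[derive(Debug)]
pub struct Abbreviations {
    pub codes: Vec<u64>,
}

/// Hypothetical cache that could live next to the section data.
#[derive(Default)]
pub struct AbbreviationsCache {
    tables: Mutex<HashMap<usize, Arc<Abbreviations>>>,
    /// Counts real parses; only here so the effect can be observed.
    parses: AtomicUsize,
}

impl AbbreviationsCache {
    /// Returns the table at `offset`, parsing it only on first use.
    /// Returns `None` if `offset` is past the end of the section.
    pub fn get(&self, debug_abbrev: &[u8], offset: usize) -> Option<Arc<Abbreviations>> {
        // The lock is held while parsing, which keeps the sketch simple but
        // serializes concurrent parses.
        let mut tables = self.tables.lock().unwrap();
        if let Some(table) = tables.get(&offset) {
            return Some(Arc::clone(table));
        }
        let bytes = debug_abbrev.get(offset..)?;
        // Toy parser: one code per byte, terminated by a 0 byte.
        let codes = bytes
            .iter()
            .take_while(|&&b| b != 0)
            .map(|&b| u64::from(b))
            .collect();
        self.parses.fetch_add(1, Ordering::Relaxed);
        let table = Arc::new(Abbreviations { codes });
        tables.insert(offset, Arc::clone(&table));
        Some(table)
    }

    /// Number of tables that were actually parsed so far.
    pub fn parse_count(&self) -> usize {
        self.parses.load(Ordering::Relaxed)
    }
}
```

Going with `Arc` avoids extra lifetime parameters on the unit type at the cost of a reference count per handle. Whether the cache should hold strong references forever or be evictable is left open here.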
@mstange suggested that it might be dsymutil that merges/deduplicates all the `Abbreviations`, and indeed, the macOS dSYM for https://github.com/electron/electron/releases/tag/v20.1.4 only has a single `Abbreviations` table with ~1100 entries. But it only has ~18k CUs which limits the combinatorial explosion.
My hack in getsentry/symbolic#683 to deduplicate the parsing speeds `symcache_debug` up from ~26 -> ~18 seconds, and reduces peak RSS from ~6.7G -> ~2.6G, which are very good results.
Here's a dSYM with 1902 shared abbreviations and 9927 CUs, which can be used to test potential fixes: https://storage.googleapis.com/profiler-get-symbols-fixtures/XUL-E2C3444B769A3A1887EDA7C34A07A56C0.dSYM.tar.bz2