Add a way to share parsed `Abbreviations` between `Unit`s

Swatinem opened this issue 2 years ago

I think I’m hitting an interesting pathological case with a massive 5.5G DWARF file:

The file has ~280k Units, and ~1.5k abbrevs. It looks like those abbrevs are shared by all the units (they all have abbrev_offset = 0).

That means that Unit::new is re-parsing the same ~1.5k abbrevs over and over again, holding a separate copy in memory for every Unit and only freeing them at the end.

A way to share the Abbreviations between Units would be nice to avoid this duplicated work.
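To put a rough number on the duplicated work, here is a back-of-the-envelope sketch using the approximate figures from this report. The function names are made up for illustration, and the counts are estimates rather than measurements.

```rust
/// Abbreviation entries parsed when every unit re-parses its own table.
fn entries_parsed_per_unit(units: u64, abbrevs_per_table: u64) -> u64 {
    units * abbrevs_per_table
}

/// Abbreviation entries parsed when each distinct table is parsed only once.
fn entries_parsed_shared(distinct_tables: u64, abbrevs_per_table: u64) -> u64 {
    distinct_tables * abbrevs_per_table
}
```

With ~280k units and ~1.5k abbrevs, `entries_parsed_per_unit(280_000, 1_500)` comes to about 420 million parsed entries, while `entries_parsed_shared(1, 1_500)` is just 1,500.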

Philip Craig · Answer 1 · Wed Sep 14 2022 20:49:03 GMT+0800 (China Standard Time)

Do you need to share them (reference them simultaneously from multiple Units), or is it enough to be able to reuse the Abbreviations from a previous Unit that you've finished processing?

Arpad Borsos · Answer 2 · Wed Sep 14 2022 20:55:48 GMT+0800 (China Standard Time)

We are currently keeping all the Units in memory because, I think, there might be cross-references between them.

This means that we can’t easily reuse the Abbreviations from a previous Unit. I wonder if we could lift that restriction at some point, but thus far we construct all the Units lazily and keep them around.

Arpad Borsos · Answer 3 · Wed Sep 14 2022 21:02:09 GMT+0800 (China Standard Time)

See also getsentry/symbolic#683 for my workaround upstream. I essentially copied all of Unit::new which is unfortunate.

Having a second constructor that takes Abbreviations would make that workaround a bit less ugly. Then I can mem::swap stuff around as I please to avoid the re-parsing.
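A minimal sketch of what such a second constructor could look like. Everything here is a simplified stand-in, not gimli's actual types or API: `with_abbreviations` is a hypothetical name, and the sketch assumes the table is handed over behind an `Arc` so several units can hold it at once, instead of being swapped around.

```rust
use std::sync::Arc;

/// Stand-in for a parsed abbreviation table (hypothetical, not gimli's type).
struct Abbreviations {
    codes: Vec<u64>,
}

/// Stand-in for a unit header; only the field relevant to this sketch.
struct UnitHeader {
    debug_abbrev_offset: usize,
}

/// Stand-in for a unit that holds a shared, already-parsed table.
struct Unit {
    header: UnitHeader,
    abbreviations: Arc<Abbreviations>,
}

impl Unit {
    /// Hypothetical second constructor: the caller supplies the parsed
    /// table, so nothing is re-parsed here.
    fn with_abbreviations(header: UnitHeader, abbreviations: Arc<Abbreviations>) -> Self {
        Unit { header, abbreviations }
    }
}
```

The caller would parse the table once, wrap it in an `Arc`, and pass a clone of that `Arc` to every unit whose header has the same abbrev offset.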

Arpad Borsos · Answer 4 · Wed Sep 14 2022 21:16:06 GMT+0800 (China Standard Time)

This is really an interesting case and very much depends on the underlying DWARF data.

For my pathological case, there is only a single Abbreviations table with ~1500 entries in debug_abbrev that is shared by all CUs.

But other files (for example electron.debug) are rather duplicated and seem to have one abbreviation table per CU.

I think the second case already wastes bytes in the raw DWARF by not deduplicating these tables, but on the other hand gimli only parses as much as needed there (well, except for the abbrevs that are duplicated across CUs).

The first case, however, highlights a shortcoming in gimli: it wastes a lot of CPU and memory parsing the complete Abbreviations table for each CU, even though it could get away with parsing it only once.

Philip Craig · Answer 5 · Wed Sep 14 2022 21:21:06 GMT+0800 (China Standard Time)

This is definitely a shortcoming in gimli, which I was aware of in the past and forgot about. It wasn't a problem before we added Dwarf and Unit.

mem::swap sounds too error prone. You would need to swap every time before and after calling something that uses it.

Ideally there would be a reference to an abbreviations cache somewhere, perhaps in Dwarf. I'm not sure how hard it will be to design the ownership for that though.

Arpad Borsos · Answer 6 · Wed Sep 14 2022 21:25:09 GMT+0800 (China Standard Time)

> before and after calling something that uses it.

My current workaround is very brittle anyway in the sense that I very carefully avoided calling anything that uses it. Which means things can break easily in the future if people are not careful :-(

And yes, having an Abbreviations cache in Dwarf sounds like a good solution to me. But as you mentioned, ownership might be a pain there and either require some more lifetime parameters, or just going with Arc.
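For illustration, the `Arc` variant of such a cache could look roughly like the sketch below. The type and method names (`AbbreviationsCache`, `get_or_parse`) are hypothetical stand-ins and not gimli's API. The sketch assumes tables are keyed by their offset into debug_abbrev and that the caller passes in the parsing step as a closure.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Stand-in for a parsed abbreviation table (hypothetical, not gimli's type).
struct Abbreviations {
    codes: Vec<u64>,
}

/// Cache of parsed tables, keyed by their offset into debug_abbrev.
struct AbbreviationsCache {
    tables: Mutex<HashMap<usize, Arc<Abbreviations>>>,
}

impl AbbreviationsCache {
    fn new() -> Self {
        AbbreviationsCache {
            tables: Mutex::new(HashMap::new()),
        }
    }

    /// Returns the cached table for `offset`, running `parse` only when
    /// that offset has not been seen before.
    /// Note: the lock is held while parsing, which keeps the sketch simple
    /// but serializes concurrent parses.
    fn get_or_parse<F>(&self, offset: usize, parse: F) -> Arc<Abbreviations>
    where
        F: FnOnce(usize) -> Abbreviations,
    {
        let mut tables = self.tables.lock().unwrap();
        tables
            .entry(offset)
            .or_insert_with(|| Arc::new(parse(offset)))
            .clone()
    }
}
```

With all units pointing at offset 0, only the first lookup would parse; every later unit gets a clone of the same `Arc`. The cost of going with `Arc` instead of extra lifetime parameters is a reference-count update per unit.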

Arpad Borsos · Answer 7 · Thu Sep 15 2022 20:34:28 GMT+0800 (China Standard Time)

@mstange suggested that it might be dsymutil that merges/deduplicates all the Abbreviations, and indeed, the macOS dSYM for https://github.com/electron/electron/releases/tag/v20.1.4 only has a single Abbreviations table with ~1100 entries. But it only has ~18k CUs which limits the combinatorial explosion.

My hack in getsentry/symbolic#683 to deduplicate the parsing speeds symcache_debug up from ~26 to ~18 seconds and cuts peak RSS from ~6.7G to ~2.6G, which are very good results.

Markus Stange · Answer 8 · Fri Sep 16 2022 03:31:10 GMT+0800 (China Standard Time)

Here's a dSYM with 1902 shared abbreviations and 9927 CUs, which can be used to test potential fixes: https://storage.googleapis.com/profiler-get-symbols-fixtures/XUL-E2C3444B769A3A1887EDA7C34A07A56C0.dSYM.tar.bz2