eliben / pyelftools

Parsing ELF and DWARF in Python

Insane memory consumption because of caching mechanism

ShellCode33 opened this issue · comments

Hey there, thanks a lot for your awesome work. I'd like to report an issue I'm having with memory consumption.

I'm working with a huge ELF file (914 MiB) and trying to iterate over its DIEs. The thing is, I'm unable to do so because the OOM killer kicks in at some point (FYI, I have 32 GiB of RAM). I'm only interested in a few DIEs out of the whole lot, so there's no need for me to cache everything.

Though I managed to hack my way around it by reimplementing methods of CompileUnit and DIE to remove cache usage, I think it would be great to be able to disable the cache completely when there's no need for it.

I don't have time to implement this right now, but maybe someone could add an optional use_cache parameter to CompileUnit's constructor, which would let users of this library choose the tradeoff they want (CPU time vs. RAM usage).
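To make the proposal concrete, here is a minimal, hedged sketch of what such an opt-out could look like. `LazyCompileUnit`, `use_cache`, and `parse_die` are illustrative names, not pyelftools API: the point is only the CPU-vs-RAM tradeoff of memoizing parsed DIEs versus re-parsing them on every access.

```python
# Sketch of the proposed opt-out (NOT actual pyelftools API): a container
# that either memoizes parsed DIEs (fast repeat access, unbounded memory)
# or re-parses on every request (constant memory, more CPU).

class LazyCompileUnit:
    """Stand-in for a CompileUnit, illustrating an optional DIE cache."""

    def __init__(self, parse_die, die_count, use_cache=True):
        self._parse_die = parse_die      # callable: index -> DIE object
        self._die_count = die_count
        self._use_cache = use_cache
        self._cache = {} if use_cache else None

    def get_die(self, index):
        if self._use_cache:
            if index not in self._cache:
                self._cache[index] = self._parse_die(index)
            return self._cache[index]
        # Cache disabled: parse fresh every time, retain nothing.
        return self._parse_die(index)

    def iter_dies(self):
        for i in range(self._die_count):
            yield self.get_die(i)


# Demo: count how many raw parses happen across two full passes.
parses = []
cu = LazyCompileUnit(lambda i: parses.append(i) or ("DIE", i),
                     die_count=3, use_cache=False)
for _ in cu.iter_dies():
    pass
for _ in cu.iter_dies():
    pass
print(len(parses))  # → 6: with the cache off, each DIE is re-parsed per pass
```

With `use_cache=True` the same two passes would parse each DIE only once, at the cost of keeping every parsed DIE alive, which is exactly the behavior that blows up on a 914 MiB input.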

Did I hear that right - 32 MB of RAM on the machine where you parse? Forget about Python and rework in C. The Python runtime alone will happily eat most of that. Or parse on a real computer and pass the parse results to that 32 MB thing.

Oops, I meant 32GiB of RAM obviously

In general, this library is probably not a great fit for very large inputs. It has several performance issues and wasn't designed to ingest enormous ELF binaries.

Yes, that's what I figured. I went with gimli in Rust and wrote Python bindings instead.

Given its intensely interlinked nature (even more so in DWARF v5), DWARF doesn't easily lend itself to firehose, cache-nothing parsing. Take abbrevs, for example: every DIE contains a reference into the abbrev table. It would be crazy to go back to the abbrev table and parse it again for every DIE.

Also, to begin with, pyelftools starts by pulling every debug-related section into an in-memory buffer and does the rest of its work against those buffers. That alone will add a few gigabytes to your working set.

I can envision a version of pyelftools that is optimized for memory, but the one we have isn't it.

@ShellCode33: what exactly is your desired use case? I can see top-to-bottom iteration working without caching, but random access without caching would incur a crazy amount of I/O.