Question: Is there a special reason for using libunwind on systems with an ELF-based file format?

RalfKornmannEnvision opened this issue 4 years ago · comments

RalfKornmannEnvision commented 4 years ago

I totally understand that it's the natural solution to include the unwind information as DWARF when the object file format is ELF, and then to use libunwind to consume this data. LLVM-based compilers do it the same way.

But while stepping through the actual unwind function (stepWithDwarf) in the past, I could not help noticing that the whole process is very inefficient. First it needs to decode and parse the DWARF FDE. After this it loops over all possible registers to check whether each one is used at all and, if so, the code needs to figure out what type of register it is and how it needs to be handled. This loop might not be that bad on X86 and X64, with only 8 and 32 possible registers. But for ARM64 there are already 95, and it gets even worse with ARM, which uses 287.
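To make the cost concrete, here is a minimal sketch of the kind of loop described above. It is not libunwind's code: `RegRule`, `RegLocation`, `WalkStats` and `restoreRegisters` are hypothetical names, and the rule set is reduced to three cases for illustration. The point is only that every possible register is visited, even when the frame saved just two of them.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Register count for ARM64 as quoted in the post above.
constexpr std::size_t kArm64RegisterCount = 95;

// Hypothetical, simplified restore rules; real DWARF CFI has more kinds.
enum class RegRule : std::uint8_t { Unused, InFrameSlot, InRegister };

struct RegLocation {
    RegRule rule = RegRule::Unused;
    std::size_t index = 0;  // frame slot or source register, depending on rule
};

struct WalkStats {
    std::size_t visited = 0;   // registers the loop looked at
    std::size_t restored = 0;  // registers that actually had a rule
};

// Generic restore: visits every possible register and dispatches on its rule.
template <std::size_t N>
WalkStats restoreRegisters(const std::array<RegLocation, N>& rules,
                           const std::vector<std::uint64_t>& frameSlots,
                           std::array<std::uint64_t, N>& regs) {
    // InRegister rules must read the values from before this step.
    const std::array<std::uint64_t, N> previous = regs;
    WalkStats stats;
    for (std::size_t i = 0; i < N; ++i) {
        ++stats.visited;
        switch (rules[i].rule) {
            case RegRule::Unused:
                break;
            case RegRule::InFrameSlot:
                assert(rules[i].index < frameSlots.size());
                regs[i] = frameSlots[rules[i].index];
                ++stats.restored;
                break;
            case RegRule::InRegister:
                assert(rules[i].index < N);
                regs[i] = previous[rules[i].index];
                ++stats.restored;
                break;
        }
    }
    return stats;
}
```

With a frame that only saved x29 and x30, the loop still performs 95 iterations to restore 2 values, which is the overhead the post is worried about.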

This all might not be a big issue for C++ and similar languages where exceptions are literal exceptions or not used at all. But with C# and .NET we have a GC that needs to walk the stack quite regularly.

If I understand it correctly, the regular VM for .NET Core/.NET 5 has a custom stack walk that is more in line with what MSVC does:
ARM: https://docs.microsoft.com/en-us/cpp/build/arm64-exception-handling?view=vs-2019
AMD64: https://docs.microsoft.com/en-us/cpp/build/exception-handling-x64?view=vs-2019
ARM64: https://docs.microsoft.com/en-us/cpp/build/arm64-exception-handling?view=vs-2019

This looks way more efficient.

Therefore I would like to ask whether there is a reason not to use a custom stack walk solution in CoreRT, too.
I must confess that I haven't profiled how bad it is in reality. But as GC pauses are one of the bigger issues with game development in C#, I get a little uneasy when seeing all these loops in the stepping code.

Michal Strehovský · Answer 1 · Wed Sep 09 2020 18:34:40 GMT+0800 (China Standard Time)

I think the main reason was that it was the fastest to bring up. Same reason why we use LLVM to write out object files instead of a handwritten PE/ELF/Mach-O emitter.

There's some discussion around that here: #3784. Again, deferring to @jkotas's opinion, but I think we would still want a handwritten unwinder eventually.

RalfKornmannEnvision · Answer 2 · Wed Sep 09 2020 18:52:00 GMT+0800 (China Standard Time)

Sorry, I should start searching old issues more carefully.

If there is still interest in replacing libunwind for the managed code parts, I might take this on for AMD64 and ARM64 on ELF/DWARF-based systems. We have a huge interest in getting the GC as fast as possible on these platforms.

Jan Kotas · Answer 3 · Wed Sep 09 2020 22:05:14 GMT+0800 (China Standard Time)

As @MichalStrehovsky said.

RalfKornmannEnvision · Answer 4 · Mon Sep 14 2020 22:24:32 GMT+0800 (China Standard Time)

I took another look at this and think the best way to implement this is by adding some "Fast Unwind Data" to the LSDA. 64 or even 32 bits should be enough. This will only need a new flag and some additional data at the end of the current LSDA block.

If the code needs to unwind, it can check whether the flag is set and use the fast path; if not, it can fall back to the libunwind solution that is already there.

This way we don't need to touch the DWARF unwind data at all during runtime for most of the managed code. The calculation of the "Fast Unwind Data" could be done in the compiler instead. If it detects that it is not a simple case, the flag and data are just not set.
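A rough sketch of how such a flag plus a 32-bit payload could work is shown below. The bit layout, the flag value `kLsdaHasFastUnwind`, and all function names are assumptions made up for illustration; the post does not define an encoding, and this is not the format CoreRT uses.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical LSDA flag bit signalling that fast unwind data follows.
constexpr std::uint8_t kLsdaHasFastUnwind = 0x80;

// Hypothetical payload: both values in bytes, both multiples of 8.
struct FastUnwindData {
    std::uint32_t frameSize;
    std::uint32_t returnAddressOffset;
};

// Runtime side: decode the 32-bit word (low half = frame size / 8,
// high half = return address offset / 8).
FastUnwindData unpackFastUnwind(std::uint32_t word) {
    return {(word & 0xFFFFu) * 8u, (word >> 16) * 8u};
}

// Compiler side: returns false for anything that is not a "simple case",
// in which case the flag and data are simply not emitted.
bool tryPackFastUnwind(const FastUnwindData& data, std::uint32_t& outWord) {
    const bool aligned =
        data.frameSize % 8 == 0 && data.returnAddressOffset % 8 == 0;
    const bool fits =
        data.frameSize / 8 <= 0xFFFFu && data.returnAddressOffset / 8 <= 0xFFFFu;
    if (!aligned || !fits) {
        return false;
    }
    outWord = (data.frameSize / 8) | ((data.returnAddressOffset / 8) << 16);
    // Sanity check: the encoding must round-trip.
    assert(unpackFastUnwind(outWord).frameSize == data.frameSize);
    return true;
}

enum class UnwindPath { Fast, DwarfFallback };

// Runtime side: pick the fast path only when the flag is present.
UnwindPath choosePath(std::uint8_t lsdaFlags) {
    return (lsdaFlags & kLsdaHasFastUnwind) != 0 ? UnwindPath::Fast
                                                 : UnwindPath::DwarfFallback;
}
```

The design keeps the existing DWARF data as the source of truth: a method without the flag behaves exactly as it does today.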

Jan Kotas · Answer 5 · Mon Sep 14 2020 22:29:13 GMT+0800 (China Standard Time)

I would prefer to avoid having two different variants of unwind data. Stack unwinding is a bug farm for hard to debug crashes. Having two very different paths makes it much worse.

RalfKornmannEnvision · Answer 6 · Tue Sep 15 2020 05:48:54 GMT+0800 (China Standard Time)

I understand your concerns. If we need to handle the unwinding of all managed and unmanaged code with a single code path, the options for optimization will be limited. The only thing I can see is the loop over all registers. If this is done in a processor-specific way, it can be unrolled and the check for the register type is no longer needed. As we have no control over the unwind data that the C compiler generates, we still need the full parse engine for the unwind data.
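As an illustration of the processor-specific idea, the sketch below restores only the ARM64 callee-saved integer registers (x19 to x28, plus x29 and x30) from a bitmask. The mask-plus-consecutive-slots layout, `Arm64Context`, and `restoreCalleeSaved` are hypothetical; only the register set itself follows the ARM64 calling convention.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// x19..x28 plus x29 (frame pointer) and x30 (link register).
constexpr int kFirstCalleeSaved = 19;
constexpr int kCalleeSavedCount = 12;

// Simplified stand-in for a register context (general-purpose registers only).
struct Arm64Context {
    std::array<std::uint64_t, 31> x{};
    std::uint64_t sp = 0;
};

// Hypothetical layout: bit i of savedMask means x(19 + i) was saved, and the
// saved values sit in consecutive 8-byte slots in ascending register order.
// The loop has a fixed, small trip count and no per-register type dispatch.
int restoreCalleeSaved(Arm64Context& ctx, std::uint16_t savedMask,
                       const std::uint64_t* slots) {
    assert(slots != nullptr || savedMask == 0);
    int restored = 0;
    for (int i = 0; i < kCalleeSavedCount; ++i) {
        if ((savedMask & (1u << i)) != 0) {
            ctx.x[kFirstCalleeSaved + i] = slots[restored++];
        }
    }
    return restored;
}
```

Compared with a generic loop over every DWARF register number, this touches at most 12 entries and a compiler can unroll it completely.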

Jan Kotas · Answer 7 · Tue Sep 15 2020 05:54:55 GMT+0800 (China Standard Time)

libunwind in CoreRT is only used to unwind RyuJIT generated code. It just needs to handle unwind codes that RyuJIT is producing.

RalfKornmannEnvision · Answer 8 · Tue Sep 15 2020 06:03:46 GMT+0800 (China Standard Time)

Then I misunderstood the linked old discussion. It sounded like there is a need to walk both code that is generated by RyuJIT and code that comes from the C compiler. If it's just the RyuJIT code, we can solve this with just one code path.

Michal Strehovský · Answer 9 · Tue Sep 15 2020 15:49:07 GMT+0800 (China Standard Time)

Yeah, the conversation in #3784 diverged to CoreCLR in the second post. CoreCLR has a lot of "manually managed" C++ code that the GC needs to operate on. CoreRT avoided all the trouble associated with that by just writing everything in C#.

RalfKornmannEnvision · Answer 10 · Sun Sep 20 2020 18:12:51 GMT+0800 (China Standard Time)

I implemented a custom unwinder for ARM64: #8345
I tested it by letting it run in parallel with the libunwind solution and comparing the regdisplay values produced by both.
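The side-by-side test described above could be structured roughly like this. `RegSnapshot` is a simplified stand-in for the runtime's regdisplay, and `Unwinder` and `unwindAndCompare` are hypothetical names; the actual test code in the PR may look quite different.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Simplified stand-in for the register display an unwind step produces.
struct RegSnapshot {
    std::uint64_t sp = 0;
    std::uint64_t ip = 0;
    std::uint64_t fp = 0;
};

bool operator==(const RegSnapshot& a, const RegSnapshot& b) {
    return a.sp == b.sp && a.ip == b.ip && a.fp == b.fp;
}

// An unwinder steps one frame in place and reports whether it succeeded.
using Unwinder = std::function<bool(RegSnapshot&)>;

enum class CompareResult { Match, Mismatch };

// Runs both unwinders on copies of the same frame and compares the results.
// On a match the frame advances using the reference (libunwind) result.
CompareResult unwindAndCompare(const Unwinder& reference,
                               const Unwinder& candidate,
                               RegSnapshot& frame) {
    assert(reference && candidate);
    RegSnapshot fromReference = frame;
    RegSnapshot fromCandidate = frame;
    const bool referenceOk = reference(fromReference);
    const bool candidateOk = candidate(fromCandidate);
    if (referenceOk != candidateOk ||
        (referenceOk && !(fromReference == fromCandidate))) {
        return CompareResult::Mismatch;
    }
    if (referenceOk) {
        frame = fromReference;
    }
    return CompareResult::Match;
}
```

Running this on every frame of every stack walk during a test pass turns any divergence into an immediate, local failure instead of a hard-to-debug crash later.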