Crash of .NET Core 2 on LX-branded SmartOS

The crash happens whenever the login page of my ASP.NET Core website is visited. It does not happen on .NET Core 1.1 and earlier, or when running on native Linux.

In the csharp folder is a minimal reproduction of the bug in the form of a .NET Core console app.

The c folder contains a C program that can detect the problem.

A potential fix for this problem is on my fork of illumos-joyent.

Root Cause

Consider a program that uses an alternate stack to handle a signal. Inside the signal handler, it changes its stack pointer to a location outside the alternate stack. If the handler then unmasks the signal and triggers it again before returning, Linux starts executing the nested signal handler at the start of the alternate stack. LX notices that the first signal handler never returned and does not use the alternate stack for the nested one.
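The scenario above can be approximated in a few lines of C. This is a rough sketch, not the program in the c folder: it leaves the handler with siglongjmp instead of switching the stack pointer by hand, which likewise moves execution off the alternate stack and unblocks the signal without the handler ever returning. The names second_delivery_uses_alt_stack and deliver_once, the choice of SIGUSR1, and the 64 KiB stack size are all assumptions made for illustration.

```c
#define _XOPEN_SOURCE 700
#include <setjmp.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ALT_STACK_SIZE (64 * 1024)   /* arbitrary size chosen for this sketch */

static char *alt_base;               /* lowest address of the alternate stack */
static sigjmp_buf resume_point;
static volatile sig_atomic_t ran_on_alt;

static void handler(int sig)
{
    char marker;                     /* its address approximates the stack pointer */
    uintptr_t sp = (uintptr_t)&marker;
    uintptr_t base = (uintptr_t)alt_base;

    (void)sig;
    ran_on_alt = (sp >= base && sp < base + ALT_STACK_SIZE);

    /* Leave without returning: this restores the main-stack stack pointer
     * and the saved signal mask, so SIGUSR1 is unblocked again. */
    siglongjmp(resume_point, 1);
}

/* Raises SIGUSR1 once; returns 1 if the handler ran on the alternate stack. */
static int deliver_once(void)
{
    ran_on_alt = -1;
    if (sigsetjmp(resume_point, 1) == 0)
        raise(SIGUSR1);
    return ran_on_alt;
}

/* Returns 1 if the second delivery used the alternate stack, 0 if it ran on
 * the regular stack, and -1 if setup or the first delivery failed. */
int second_delivery_uses_alt_stack(void)
{
    stack_t ss;
    struct sigaction sa;

    alt_base = malloc(ALT_STACK_SIZE);
    if (alt_base == NULL)
        return -1;

    memset(&ss, 0, sizeof ss);
    ss.ss_sp = alt_base;
    ss.ss_size = ALT_STACK_SIZE;
    if (sigaltstack(&ss, NULL) != 0)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = handler;
    sa.sa_flags = SA_ONSTACK;        /* ask for delivery on the alternate stack */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    if (deliver_once() != 1)         /* first delivery should use the alternate stack */
        return -1;
    return deliver_once();           /* the interesting case: the first handler never returned */
}
```

Going by the behavior described above, second_delivery_uses_alt_stack() should return 1 on native Linux and would presumably return 0 on an affected LX zone.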

Why .NET Core crashes

When .NET Core runs on Linux, it uses a SIGSEGV handler to turn SIGSEGVs in managed code into NullReferenceExceptions. When installing the SIGSEGV handler, .NET Core uses sigaltstack(2) to define an alternate stack.

When the SIGSEGV handler runs, it first checks to see if the fault address is near the end of the stack the thread was running on at the time of the fault. If so, it aborts the program after printing a stack-overflow error message.
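The stack-overflow check amounts to a distance comparison. The sketch below only illustrates the idea; it is not CoreCLR's code. The function name, the stack_limit parameter (the lowest valid address of the thread's stack), and the margin are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: treats a fault as a stack overflow when the faulting
 * address lies within `margin` bytes of the end of the thread's stack. */
bool looks_like_stack_overflow(uintptr_t fault_addr, uintptr_t stack_limit,
                               size_t margin)
{
    uintptr_t distance = fault_addr > stack_limit
                             ? fault_addr - stack_limit
                             : stack_limit - fault_addr;
    return distance <= margin;
}
```

With a margin of one page, a null dereference (fault address 0) is far from the stack limit and falls through to normal exception processing, while a fault just past the last stack page is reported as an overflow.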

This is where .NET starts to abuse the signal facility. It switches back to executing on the original stack, albeit a bit below where the fault occurred. The SEHProcessException function that processes the exception may never return. After executing any catch handlers it finds, it unwinds the stack and restores the CPU context to resume execution. The signal handler never returns.

The SmartOS LX signal dispatcher records when it starts executing a signal handler using the alternate stack from sigaltstack(2). If another signal is raised before the first signal handler returns, the alternate stack is not used. This means that if the signal handler never returns, the only time the alternate stack will be used is the first time an exception is dispatched. This differs from the Linux behavior. Linux appears to check whether the stack pointer at the time the fault occurred lies within the alternate stack. If the stack pointer is not contained within the alternate stack, Linux always starts executing the handler at the top of the alternate stack.
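The difference between the two policies can be modeled in plain C. This is a simplified model of the behavior described above, not kernel code from either system; the struct and function names are made up for illustration, and the exact bounds check Linux performs may differ.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct alt_stack {
    uintptr_t base;   /* lowest address of the alternate stack */
    size_t size;
};

/* True if sp lies inside the alternate stack. */
static bool sp_in_alt_stack(const struct alt_stack *as, uintptr_t sp)
{
    return sp >= as->base && sp - as->base < as->size;
}

/* Linux-like policy: decide purely from the interrupted stack pointer. */
bool linux_like_switches(const struct alt_stack *as, uintptr_t sp)
{
    return !sp_in_alt_stack(as, sp);
}

/* LX-like policy: remember that a handler is running on the alternate stack. */
struct lx_like_state {
    bool handler_active;
};

bool lx_like_switches(struct lx_like_state *st)
{
    if (st->handler_active)
        return false;          /* a previous handler has not returned yet */
    st->handler_active = true; /* cleared only by lx_like_handler_returned */
    return true;
}

void lx_like_handler_returned(struct lx_like_state *st)
{
    st->handler_active = false;
}
```

If a handler pivots back to the regular stack and never returns, lx_like_handler_returned is never called. For a second fault with the stack pointer on the regular stack, the Linux-like policy still switches to the alternate stack, while the LX-like policy does not.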

The SmartOS behavior causes problems for .NET Core. The SIGSEGV handler captures the machine context using .NET Core's RtlCaptureContext function and stores this information on the alternate stack. ExecuteHandlerOnOriginalStack passes a pointer to this structure to signal_handler_worker when it pivots to executing on the original stack. Unfortunately, on SmartOS the second time the SIGSEGV handler is executed it runs on the regular stack. When it thinks it is pivoting back to a different stack, it is actually on the same stack, in about the same place that SmartOS picked to set up the stack. signal_handler_worker clobbers the information stored on the stack by the SIGSEGV handler, and eventually the iretq instruction in RtlRestoreContext segfaults when trying to restore the clobbered machine state.

Additional Notes

CoreCLR started using the alternate signal stack in 2.0: dotnet/coreclr#9650.
