DeanoBurrito / northport

Monolithic kernel and support libraries for riscv64 and x86_64.

Mysterious stack corruption

DeanoBurrito opened this issue · comments

commented

Currently unable to reproduce this bug myself; it's known to be present in 04de59a and older. The OS was run on a standard Ubuntu LTS install (qemu 4.2).

Output of readelf -e startup.elf here: https://pastebin.com/NqKK3hsx
Output of addr2line on faulting address:

$ addr2line -e initdisk/apps/startup.elf -a 0x2025f9
0x00000000002025f9
/home/xyz/Documents/Github/northport/libs/np-syscall/SyscallFunctions.cpp:11

Logging output of the kernel from the crash:
[screenshot of kernel log output]

commented

A few things worth noting so far:

  • We're page faulting at address 0, in user code. This is likely a nullptr deref somewhere.
  • There are no active VM ranges, so the crash happens before any MapMemory syscalls, and therefore before the userspace heap is initialized.
  • From the selectors and saved instruction pointer we know it's crashing in the program code, not a syscall.
  • The code at the faulting address is:
    bool LoopbackTest()
    {
        SyscallData data((uint64_t)SyscallId::LoopbackTest, 0, 0, 0, 0);
        DoSyscall(&data);

        return data.id == SyscallSuccess; // <--- this line here
    }
  • This is bizarre, as this syscall is never called; we should never reach this point.
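For reference, a rough sketch of the structures involved (the field layout and trap vector here are assumptions for illustration, not lifted from np-syscall):

    // Rough sketch of the np-syscall wrapper; layout and vector are assumed.
    #include <cstdint>

    struct SyscallData
    {
        uint64_t id; // syscall id on entry, status code on return
        uint64_t arg0, arg1, arg2, arg3;

        SyscallData(uint64_t id, uint64_t a0, uint64_t a1, uint64_t a2, uint64_t a3)
            : id(id), arg0(a0), arg1(a1), arg2(a2), arg3(a3)
        {}
    };

    void DoSyscall(SyscallData* data)
    {
        // Trap into the kernel with the data pointer in rdi. If the thread's
        // saved registers are restored corrupted after the kernel returns,
        // the caller's read of data->id dereferences garbage.
        asm volatile("int $0x80" : "+D"(data) : : "memory");
    }
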
commented

Looks like memory corruption. I think I've been able to reproduce this by issuing lots (100s) of IPC calls when running emulated in qemu (instead of kvm). It seems a process's saved registers are getting corrupted with kernel heap node addresses. Since some nodes may contain nullptr for next/prev, this would explain the nullptr deref above.

It's also happened with idle threads and the other user program.
Not sure yet if this is a buffer overrun, a use-after-free, or something else.
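For reference, the kind of node in question, assuming the heap uses a doubly linked free list (a sketch, not the actual northport heap code):

    // Sketch of a doubly linked free-list node, as used by many kernel heaps.
    // The head's prev and the tail's next hold nullptr, so if one of these
    // headers is smeared over a thread's saved registers, restoring them can
    // load nullptr (or a heap address) into a register and fault at address 0.
    #include <cstddef>

    struct HeapNode
    {
        HeapNode* prev; // nullptr at the list head
        HeapNode* next; // nullptr at the list tail
        size_t length;  // bytes of free space following this header
    };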

Edit: this seems like a good excuse to rewrite the kernel heap, and see if I can figure out the cause along the way.

commented

This was a good excuse to rewrite the kernel heap (finished as of today), and with the help of some new debugging features I can confirm that we're not overrunning any heap-allocated buffers. The heap is much faster now too, so silver linings!
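For the curious, overrun detection along these lines is the general idea (a minimal sketch with placeholder names, assuming a canary word placed past each allocation; not the actual heap API):

    // Sketch: overrun detection via a canary word after each allocation,
    // verified on free. A clobbered canary means the buffer was overrun.
    #include <cstddef>
    #include <cstdint>

    void* RawAlloc(size_t size);              // placeholder: underlying allocator
    void RawFree(void* ptr);                  // placeholder
    [[noreturn]] void Panic(const char* why); // placeholder

    constexpr uint32_t HeapCanary = 0xCAFED00D;

    void* AllocChecked(size_t size)
    {
        uint8_t* buffer = static_cast<uint8_t*>(RawAlloc(size + sizeof(HeapCanary)));
        *reinterpret_cast<uint32_t*>(buffer + size) = HeapCanary;
        return buffer;
    }

    void FreeChecked(void* ptr, size_t size)
    {
        uint8_t* buffer = static_cast<uint8_t*>(ptr);
        if (*reinterpret_cast<uint32_t*>(buffer + size) != HeapCanary)
            Panic("Heap buffer overrun detected");
        RawFree(buffer);
    }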

If it's not a heap buffer, I'm guessing it's a stack overflow haha. I only had a quick look, but it seems we're still only allocating a single page for any spawned threads. Program stacks are also identity mapped or mapped in the hhdm, meaning a kernel stack can very easily touch other nearby physical pages, since there are no unmapped pages around it.

I'm thinking I might start allocating kernel and user stacks at a fixed virtual address, or maybe within a range (so they're not predictable); sketch below. The hhdm is useful, but dangerous to return as 'allocated' memory anywhere. I might do a pass over the kernel so that the hhdm is only used for direct access to physical memory, and all other accesses go via a proper virtual mapping. We could enforce this by marking those pages as non-present when not in use, to help with debugging.
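Roughly what I have in mind, as a sketch (the base address, sizes, and helper names are placeholders):

    // Sketch: give each kernel stack its own virtual slot, with an unmapped
    // guard page either side so any overrun faults immediately instead of
    // silently touching a neighbouring physical page.
    #include <cstddef>
    #include <cstdint>

    constexpr uintptr_t KernelStackBase = 0xFFFF'8800'0000'0000;
    constexpr size_t PageSize = 0x1000;
    constexpr size_t StackPages = 4;

    enum class PageFlags { Write };                                  // placeholder
    uintptr_t AllocPhysicalPage();                                   // placeholder
    void MapPage(uintptr_t vaddr, uintptr_t paddr, PageFlags flags); // placeholder

    uintptr_t AllocKernelStack(size_t slot)
    {
        // Each slot spans guard page + stack pages + guard page.
        const uintptr_t base = KernelStackBase + slot * (StackPages + 2) * PageSize;

        // Map only the middle pages; the first and last stay non-present.
        for (size_t i = 0; i < StackPages; i++)
            MapPage(base + (i + 1) * PageSize, AllocPhysicalPage(), PageFlags::Write);

        // Stacks grow down: return the address just past the highest mapped byte.
        return base + (StackPages + 1) * PageSize;
    }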

commented

After an experiment with allocating each thread's stack at a unique address, surrounded by unmapped pages (so any overrun would cause a page fault), it looks like it's not going to be that simple: we're not exceeding the bounds of any stack anywhere.
This means the issue is not caused by writing outside the bounds of either stack- or heap-allocated buffers.

At this stage I can only seem to reproduce 2 distinct issues:

  • A page fault with kernel cs/ss, but both %rip and %rsp are kernel heap pointers (%rip should point at code). Over various runs it's been a pointer within the pool, or within one of the slabs. Strange.
  • A GP fault during an interrupt return, somehow with kernel selectors and rip, but the stack pointer is the user-allocated one. Very dangerous. It is in fact the user thread's stack, and it's sometimes misaligned by +/- 3 bytes, which is bizarre.

This leaves a few other options I can think of:

  • We're still handing out a pointer into the hhdm somewhere and writing through it, thus writing to effectively arbitrary physical memory.
  • It's corruption within the stack: some stack-allocated buffer being overrun and corrupting things above it. This is plausible, as iret_rip is the first iret field reached when climbing the stack like that (see the frame layout below). Strange that it leaves the iret_cs field intact though.
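For reference, the frame layout being climbed (this part is just the x86_64 architecture, lowest address first):

    // x86_64 iret frame as pushed by the CPU, lowest address first. A linear
    // overrun climbing the stack reaches rip before cs, so a clobbered
    // iret_rip with an intact iret_cs doesn't fit a simple overrun either.
    #include <cstdint>

    struct [[gnu::packed]] IretFrame
    {
        uint64_t rip;    // first field reached when climbing the stack
        uint64_t cs;
        uint64_t rflags;
        uint64_t rsp;
        uint64_t ss;
    };
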
commented

Just did a quick test: the memory corruption (the page fault issue) still occurs when no user programs are loaded.

commented

With SMAP now fully enabled (at least on the dev branch), I've squashed the GP fault with kernel selectors and a user stack. The page fault still remains, however. I have realised that thread kernel stacks are allocated inside the hhdm, which might be worth looking into, although that stack isn't used for much, so it seems an unlikely culprit.

My bet is still on corruption within the stack somewhere.
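For anyone following along, SMAP amounts to one CR4 bit plus stac/clac around intentional user-memory accesses; a minimal sketch (helper names are mine, not the kernel's real API):

    // Sketch: enable SMAP (CR4 bit 21), so stray kernel reads/writes of user
    // pages fault immediately. Requires CPUID to report SMAP support.
    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    void EnableSmap()
    {
        uint64_t cr4;
        asm volatile("mov %%cr4, %0" : "=r"(cr4));
        cr4 |= 1ull << 21;
        asm volatile("mov %0, %%cr4" :: "r"(cr4) : "memory");
    }

    // Deliberate accesses to user memory must be bracketed with stac/clac.
    void CopyFromUser(void* dest, const void* src, size_t count)
    {
        asm volatile("stac" ::: "memory"); // temporarily permit user accesses
        memcpy(dest, src, count);
        asm volatile("clac" ::: "memory"); // re-arm SMAP
    }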

commented

Two more points of interest:

  • The bug is not present when running under VMware or VirtualBox, only qemu. I don't have bochs installed, so I haven't tested there.
  • There have been a few cases of page tables getting corrupted, with the data looking like an iret frame to userspace. The corruption is always in the upper 64 bytes or so, which lines up with where the stack would start if that page were used as one (rough arithmetic below).
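The rough arithmetic, assuming the handler also saves a vector number and error code alongside the hardware frame (an assumption on my part):

    // An iret frame to userspace is 5 qwords (40 bytes); with a vector number
    // and error code saved as well, 7 qwords (56 bytes). A stack built from
    // the top of a 4 KiB page would place exactly that data in the page's
    // upper 64 bytes.
    #include <cstdint>

    constexpr uint64_t IretFrameBytes = 5 * sizeof(uint64_t);                  // 40
    constexpr uint64_t FullFrameBytes = IretFrameBytes + 2 * sizeof(uint64_t); // 56
    static_assert(FullFrameBytes <= 64, "lands in the upper 64 bytes of the page");
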
commented

Tentatively fixed in c8c4386 (currently on dev, not mirrored on master). Leaving this issue open until I can confirm the issue is truly gone.

commented

Seems like there was another stack corruption issue, this one related to the scheduler. It's present even when no userspace programs are run, and occurs randomly over time. Rapid input (mouse movement/key presses) triggers it faster. Again, I've only been able to reproduce this on qemu; VirtualBox and real hardware work as expected.