plasma-umass / Mesh

A memory allocator that automatically reduces the memory footprint of C/C++ applications.

Segfault with simple stress test of global allocator

kevin-vigor opened this issue · comments

The following simple stress test fails rapidly with a segfault:

#include <vector>
#include <cstdlib>

static constexpr size_t kMinAllocSz = 800000;
static constexpr size_t kMaxAllocSz = 900000;
static constexpr unsigned kMaxLiveAlloc = 128;  // keep no more than 128 * kMaxAllocSz memory allocated.

int main(void) {
    std::vector<void *> alloc(kMaxLiveAlloc, nullptr);

    while (1) {
        const size_t ix = rand() % kMaxLiveAlloc;
        const size_t sz = (rand() % (kMaxAllocSz - kMinAllocSz)) + kMinAllocSz;

        free(alloc[ix]);
        alloc[ix] = malloc(sz);
    }
    return 0;
}

Reproduce with:

g++ --std=c++14 -g -Wall -Werror msimple.cpp -o msimple -ldl
LD_PRELOAD=/home/kvigor/mesh/libmesh.so ./msimple

I have attempted to debug this a bit. What I have found is that in GlobalHeap::pageAlignedAlloc(), the pointer returned from mh->mallocAt() equals arenaBegin(), and miniheapForLocked(ptr) does not point back to the original miniheap. When I later free this allocation, it fails in ThreadLocalHeap::free() because the same bogus miniheap pointer is found. Unfortunately I have not yet been able to diagnose further.

This error does not occur if I allocate fixed-size objects; some randomness in the allocation size seems to be required to trigger the issue.

(Minor sidebar: what is the "locked" implied in miniheapForLocked()? free() calls it with no obvious locks held. A comment and/or a rename might help.)

@kevin-vigor thanks for the report + reproducer! I can trigger this locally, and am looking now.

Also, the Locked suffix on miniheapFor was vestigial: no lock is needed to look up a miniheap with that function anymore.

Going to have to continue this tomorrow, but again thanks for such a helpful test case + debugging notes.

I've built mesh locally with ./configure --debug --no-optimize, which enables a lot of default-off debug checks. Additionally, I've checked your test case into the repo, and you can build and run it under GDB with mesh like:

./configure --debug --no-optimize
make -j5
make src/test/global-large-stress
gdb src/test/global-large-stress

There are at least two things going on here. First, we aren't properly coalescing/reusing dirty Spans in MeshableArena in this case (exclusively large-object allocations), which leads to continuously allocating new spans and enlarging the arena.

Second, this error happens right around the point where a new allocation lands 4 GiB from the start of the arena -- it sure feels like some int32_t math has snuck in somewhere and we are overflowing.

@kevin-vigor both issues are fixed, and I can now run the stress test for minutes on my machine. Please shout if you turn up anything else :)

Thanks for the quick response! I will be banging on the allocator more next week; will let you know.