m3db large memory spikes when removing nodes from clusters

Question

BertHartm opened this issue 3 years ago

We've now seen twice that removing a node from the placement on m3db v1.3.0 causes all of the nodes in that isolation group to bootstrap (expected, as the shards move). Then, as that process is completing, memory usage and goroutine counts rise rapidly on other nodes, causing them to run out of memory and crash.

Generally, our cluster runs at about 80% capacity and uses less than half of the machines' RAM before we start the node removal. It has failed 2 out of 2 times since we upgraded the cluster, so it does seem reproducible.

M · Answer 1 · Sat Nov 20 2021 04:17:53 GMT+0800 (China Standard Time)

Some context: a single node was removed from the placement at 15:17.