m3db / m3

M3 monorepo - Distributed TSDB, Aggregator and Query Engine, Prometheus Sidecar, Graphite Compatible, Metrics Platform

Home Page: https://m3db.io/

m3db large memory spikes when removing nodes from clusters

BertHartm opened this issue · comments

We've now seen twice that removing a node from the placement on m3db v1.3.0 causes all of the nodes in that isolation group to bootstrap (expected, as the shards move). Then, as that process is completing, memory usage and goroutine counts rise rapidly on other nodes, causing them to run out of memory and crash.

Generally our cluster runs at about 80% capacity and uses less than half the machine RAM before we start the node removal. It has failed 2 out of 2 times since we upgraded the cluster, so it does seem reproducible.

commented

Some context: a single node was removed from the placement at 15:17.
image