m3db / m3

M3 monorepo - Distributed TSDB, Aggregator and Query Engine, Prometheus Sidecar, Graphite Compatible, Metrics Platform

Home Page: https://m3db.io/

sudden increase in CPU across all nodes in a cluster causing query failure

BertHartm opened this issue

We saw a sudden jump in CPU across all dbnodes in our cluster, which caused queries to time out. Writes continued working. The profiles (as I read them) seem to hint at a locking issue on one query, which would make sense with the behavior observed and the structure we have set up: two clusters on the same query path had issues at the same time, while other clusters on that path did not.
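For reference, a minimal sketch of how CPU profiles can be captured from every dbnode at once during an incident like this (the hostnames and the default 9004 debug port below are assumptions; adjust to your deployment):

```sh
# Sketch only: grab a 30s CPU profile from each dbnode in parallel.
for host in dbnode-01 dbnode-02 dbnode-03; do
  curl -s "http://${host}:9004/debug/pprof/profile?seconds=30" > "pprof.${host}.cpu.pb.gz" &
done
wait
```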

m3db and m3query are both on v1.2.0

[screenshot attached]

Filing M3 Issues

General Issues

General issues are any non-performance related issues (data integrity, ease of use, error messages, configuration, documentation, etc).

Please provide the following information along with a description of the issue that you're experiencing:

  1. What service is experiencing the issue? (M3Coordinator, M3DB, M3Aggregator, etc)

M3DB

  2. What is the configuration of the service? Please include any YAML files, as well as namespace / placement configuration (with any sensitive information anonymized if necessary).

Included in the attached dump files.

  3. How are you using the service? For example, are you performing read/writes to the service via Prometheus, or are you using a custom script?

This would have been remote reads of Prometheus data (a typical remote_read configuration is sketched after this list).

  4. Is there a reliable way to reproduce the behavior? If so, please provide detailed instructions.

Not that we're aware of
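For context on the read path above, remote reads from Prometheus into M3 go through the coordinator's remote read endpoint; a minimal prometheus.yml sketch (the m3coordinator hostname is an assumption here, and 7201 is the coordinator's default port) looks roughly like:

```yaml
# Sketch only: point Prometheus remote_read at the M3 coordinator.
remote_read:
  - url: "http://m3coordinator:7201/api/v1/prom/remote/read"
    read_recent: true
```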

Performance issues

If the issue is performance related, please provide the following information along with a description of the issue that you're experiencing:

  1. What service is experiencing the performance issue? (M3Coordinator, M3DB, M3Aggregator, etc)

M3DB

  2. Approximately how many datapoints per second is the service handling?

One cluster is at ~38k samples/s per node, and the other is around ~122k samples/s per node.

  3. What is the approximate series cardinality that the service is handling in a given time window? I.e., how many unique time series are being measured?

Ticking is fairly stable between 4-6M per node across both clusters

  4. What is the hardware configuration (number of CPU cores, amount of RAM, disk size and type, etc) that the service is running on? Is the service the only process running on the host or is it colocated with other software?

32 cores / 256 GB RAM, not shared. A single 2 TB disk for storage.

  5. What is the configuration of the service? Please include any YAML files, as well as namespace / placement configuration (with any sensitive information anonymized if necessary).
  6. How are you using the service? For example, are you performing read/writes to the service via Prometheus, or are you using a custom script?

In addition to the above information, CPU and heap profiles are always greatly appreciated.

CPU / Heap Profiles

CPU and heap profiles are critical to helping us debug performance issues. All our services run with the net/http/pprof server enabled by default.

Instructions for obtaining CPU / heap profiles for various services are below, please attach these profiles to the issue whenever possible.

M3Coordinator

CPU
curl <HOST_NAME>:<PORT(default 7201)>/debug/pprof/profile?seconds=5 > m3coord_cpu.out

Heap
curl <HOST_NAME>:<PORT(default 7201)>/debug/pprof/heap > m3coord_heap.out

M3DB

CPU
curl <HOST_NAME>:<PORT(default 9004)>/debug/pprof/profile?seconds=5 > m3db_cpu.out

Heap
curl <HOST_NAME>:<PORT(default 9004)>/debug/pprof/heap > m3db_heap.out
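Once collected, the profiles can be inspected locally with the standard Go pprof tooling (assuming a Go toolchain is available; the file names match the commands above):

```sh
# Show the hottest functions in the captured CPU profile.
go tool pprof -top m3db_cpu.out

# Or explore it interactively in a browser (flame graph, call graph, etc).
go tool pprof -http=:8080 m3db_cpu.out
```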

M3DB Grafana Dashboard Screenshots

If the service experiencing performance issues is M3DB and you're monitoring it using Prometheus, any screenshots you could provide using this dashboard would be helpful.

Lots of pprof from one node: [dump.zip](https://github.com/m3db/m3/files/7427847/dump.zip)
CPU from another node (different cluster, same incident): pprof.m3dbnode.samples.cpu.100.pb.gz

@BertHartm I have looked at the CPU profile that you attached. It has the same pattern as the flame graph in #3813 (the fix for that was part of the v1.3.0 release: https://github.com/m3db/m3/releases/tag/v1.3.0).

Thanks, yeah, that seems related. Time to start looking for weird queries that match the pattern.

I'll close this and move to 1.3.