m3db / m3

M3 monorepo - Distributed TSDB, Aggregator and Query Engine, Prometheus Sidecar, Graphite Compatible, Metrics Platform

Home Page: https://m3db.io/

index not updated correctly when removing nodes from placement

BertHartm opened this issue · comments

When scaling down a cluster, we're noticing that some data becomes unavailable for query. It appears in the form of partial results when querying for older data (from before the scale down).

We're also noticing that database_tick_index_num_docs remains flat for each node through the scale down, and then jumps up once the node is restarted. When summing across all nodes in the cluster, the effect is that the metric drops when the old node is removed, and recovers to its prior level once the remaining nodes restart.
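
To make the summed-metric pattern concrete, here is a minimal sketch in Python. The node names and doc counts are invented for illustration and are not real measurements, and `cluster_total` is a hypothetical helper. It only models the behavior described above: per-node gauges stay flat, the sum drops when a node leaves, and the sum recovers once the remaining nodes restart.

```python
def cluster_total(per_node_docs):
    """Sum the index doc-count gauge across the nodes currently reporting."""
    return sum(per_node_docs.values())


# Hypothetical per-node gauge values before the scale down (invented numbers).
before = {"node-a": 90, "node-b": 90, "node-c": 90, "node-d": 90}

# node-d is removed from the placement. The remaining gauges stay flat,
# so the cluster-wide sum drops by exactly node-d's contribution.
after_removal = {name: docs for name, docs in before.items() if name != "node-d"}

# Assumption: after a restart, each remaining node reports an equal share
# of the docs that node-d used to report.
share = before["node-d"] // len(after_removal)
after_restart = {name: docs + share for name, docs in after_removal.items()}

for label, snapshot in [
    ("before", before),
    ("after removal", after_removal),
    ("after restart", after_restart),
]:
    print(f"{label}: {cluster_total(snapshot)}")
```

In this toy model the total goes from 360 to 270 and back to 360, which matches the drop-then-recover shape seen on the dashboard.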

General Issues

What service is experiencing the issue? (M3Coordinator, M3DB, M3Aggregator, etc)

m3db

What is the configuration of the service? Please include any YAML files, as well as namespace / placement configuration (with any sensitive information anonymized if necessary).

can provide if required, but I think this might be general
RF=3

How are you using the service? For example, are you performing read/writes to the service via Prometheus, or are you using a custom script?

issue relates to reads happening via remote read

Is there a reliable way to reproduce the behavior? If so, please provide detailed instructions.

It appears to be consistent when removing nodes from placements. It's more obvious when the clusters are small as more of the index is affected.
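
As a rough illustration of why small clusters make this more visible, the sketch below assumes shard replicas are spread evenly across nodes, so a removed node owned about 1/N of them. This is a simplifying assumption for the arithmetic, not a description of M3's actual placement algorithm, and `fraction_moved` is a hypothetical helper.

```python
from fractions import Fraction


def fraction_moved(num_nodes):
    """Fraction of shard replicas owned by one node, assuming an even spread."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return Fraction(1, num_nodes)


# Cluster sizes chosen arbitrarily for illustration.
for n in (4, 10, 50):
    moved = fraction_moved(n)
    print(f"{n} nodes: removing one moves {moved} of replicas ({float(moved):.0%})")
```

Under this assumption, removing one node from a 4-node cluster moves a quarter of the shard replicas, while removing one from a 50-node cluster moves only 2%.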

Hey @BertHartm - there have been some fixes and tests added to cover this category of bugs. As far as we know there are no outstanding bugs in this space, so perhaps I can take our tests and run them against the version you’re running.

Is this 1.3 or 1.5? The exact SHA would be helpful as we investigate this. Thanks for reporting!

I believe 1.5 is the version in question. We’ll test whether this recent patch (post 1.5 release) fixes what you’ve observed:
#4193

Sorry, yes, this is 1.5.0 as released. The dbnode SHA is e7df2b9.