VictoriaMetrics / VictoriaMetrics

VictoriaMetrics: fast, cost-effective monitoring solution and time series database

Home Page: https://victoriametrics.com/

Weekly High IO on cluster

johnseekins opened this issue

Describe the bug
I've described this bug in other places (#1441), but I feel like I'm stealing thunder from other issues, so I'm breaking this out to a separate issue...
At a high level:
We have a cluster configured to retain data for one year (-retentionPeriod=365d). Every Monday at midnight (00:00 UTC) we see a significant IO spike that directly causes data rerouting. The IO spike outlasts the re-routes, so it doesn't seem directly connected to the re-routing.
Our instances are 8 CPU x 16 GB RAM, running on top of Ceph storage (with 4 TB disks per storage node).

Because we see a significant drop in objects in Ceph itself at the same time, we suspect this might be related to an aggressive number of deletes in the cluster. The associated Ceph graph is also linked below.
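(For anyone who wants to confirm the same pattern on the Ceph side, per-pool object counts and usage can be checked directly with standard Ceph admin commands; which pool backs the VM volumes is not named in this issue, so check all of them.)

# per-pool object counts and space usage, run on a node with Ceph admin credentials
rados df
# more detailed per-pool statistics
ceph df detail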

To Reproduce

  1. Have a large ingest rate (>~800k points/sec) into a large cluster (28 storage nodes)
  2. Wait a week

Expected behavior
No sudden weekly IO spikes.

Logs
No errors discovered in the logs other than messages about query timeouts (which are clearly symptoms of the problem).
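(A hedged aside: assuming the storage nodes run vmstorage as a systemd service named vmstorage - an assumption, not something stated in this issue - the logs around one of the Monday-midnight windows could be pulled with something like the following.)

# 2021-08-02 was a Monday; adjust the unit name and time window as needed
journalctl -u vmstorage --since "2021-08-02 00:00:00 UTC" --until "2021-08-02 03:00:00 UTC" | grep -iE "error|timeout"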

Screenshots
Load profile: (screenshot)
Re-routing of data: (screenshot)
Ceph objects being removed: (screenshot from 2021-08-04)

Version
1.63.0

Used command-line flags

 -search.maxUniqueTimeseries=300000000 -search.maxTagKeys=1000000 -search.maxTagValues=1000000000 -dedup.minScrapeInterval=1s -memory.allowedPercent=75 -storageDataPath=/var/lib/victoriametrics/storage/prod_cluster_1 -retentionPeriod=365d -vminsertAddr :8400 -vmselectAddr :8401 -httpListenAddr :8482
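(For readability, here is the same command line laid out one flag per line, as it would be passed to the vmstorage binary - that these flags belong to vmstorage is inferred from -vminsertAddr/-vmselectAddr, and the binary path below is a placeholder, not taken from this issue.)

/path/to/vmstorage \
  -search.maxUniqueTimeseries=300000000 \
  -search.maxTagKeys=1000000 \
  -search.maxTagValues=1000000000 \
  -dedup.minScrapeInterval=1s \
  -memory.allowedPercent=75 \
  -storageDataPath=/var/lib/victoriametrics/storage/prod_cluster_1 \
  -retentionPeriod=365d \
  -vminsertAddr :8400 \
  -vmselectAddr :8401 \
  -httpListenAddr :8482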

Hi @johnseekins! Thank you for such a detailed issue!
Could you please provide some additional screenshots from our dashboards:

  • storage. LSM parts
  • storage. Disk writes/reads
  • storage. Active merges
  • resource usage. Open FDs
  • overview. Disk space used

Thank you!

Active merges and LSM parts: (screenshot)
Disk r/w and FDs: (screenshot)
Disk used: (screenshot)
And a few other interesting panels: (screenshot)

There is definitely a huge spike in indexdb and small merges during that time, but they seem to be symptoms (as they spike up suddenly during the event and then taper off...)

Hm, the LSM parts graph suggests VM has about 6-8k parts on disk. And "vm_hdd Objects" (if I'm reading it right) suggests roughly 3 million objects were deleted over those 2.5 hours, which does not correlate with the number of parts either merged or deleted by VM.
The spike in ActiveMerges can be explained by the RowsRerouted graph - every time vmstorage receives new time series, it creates new index parts on disk, which triggers subsequent merges...

Is there any chance that some kind of cron job is enabled in the OS and runs every Monday at midnight? I recall a similar issue where a cron job for the fstrim process caused lags on our SSDs every week.
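(If it helps, a quick way to look for anything scheduled weekly on a node is to list systemd timers and the usual cron locations; these are standard commands, nothing VictoriaMetrics-specific.)

# systemd timers, including inactive ones, with their next trigger time
systemctl list-timers --all
# classic cron locations
ls /etc/cron.weekly/ /etc/cron.d/
crontab -l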

We can't find anything... but I'll check again. The one suspicious thing was an MD scan of the hypervisor's OS drive, but that only happens once a month and triggers two hours after the event starts.

But fstrim on the boxes is scheduled for Monday at midnight!

root@store-1:~# systemctl status fstrim.timer
● fstrim.timer - Discard unused blocks once a week
     Loaded: loaded (/lib/systemd/system/fstrim.timer; enabled; vendor preset: enabled)
     Active: active (waiting) since Wed 2021-06-16 14:08:16 UTC; 1 months 19 days ago
    Trigger: Mon 2021-08-09 00:00:00 UTC; 3 days left
   Triggers: ● fstrim.service
       Docs: man:fstrim

Jun 16 14:08:16 store-1 systemd[1]: Started Discard unused blocks once a week.

So drive trimming is causing this? That's strange, given that the data drive is a Ceph block device and the OS drive is an OpenStack block device. But I'm ready to believe it!

I've disabled fstrim on all the storage nodes ('cause whether or not that's the actual problem, there's no reason for it to be running on these hosts).
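(For reference, one way to do that on systemd-based hosts is to stop the timer now and keep it from coming back:)

# stop the weekly trim and prevent it from starting at boot
systemctl disable --now fstrim.timer
# optional: mask it so nothing re-enables it later
systemctl mask fstrim.timer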

This was fstrim. It may be worth documenting somewhere that fstrim can have an adverse effect on the system?

Hm, I'm not sure this is exactly relevant to VM. The case I mentioned happened to me five years ago, and it was a Postgres cluster suffering every Sunday because of fstrim. I was just lucky to recall it now and ask you to check whether there is something similar in your system.
However, here's a commit that mentions fstrim in the Tuning section.

It wasn't directly related to VictoriaMetrics, no. But disabling fstrim on these virtual hosts meant the weekly I/O spikes didn't happen.