VictoriaMetrics / VictoriaMetrics

VictoriaMetrics: fast, cost-effective monitoring solution and time series database

Home Page: https://victoriametrics.com/

Weekly High IO on cluster

johnseekins opened this issue

Describe the bug
I've described this bug in other places (#1441), but I feel like I'm stealing thunder from other issues, so I'm breaking this out to a separate issue...
At a high level:
We have a cluster configured to retain data for one year (-retentionPeriod=365d). Every Monday at midnight (00:00 UTC) we see a significant IO spike that directly causes data rerouting. The IO spike outlasts the re-routes, so it doesn't seem directly connected to the re-routing.
Our instances are 8 CPU x 16 GB RAM, running on top of Ceph storage (with 4 TB disks per storage node).

Because we see a significant drop in objects in Ceph itself at the same time, we suspect this might be related to an aggressive number of deletes in the cluster. The associated Ceph graph is also linked below.
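(For anyone who wants to confirm the same pattern on the Ceph side, per-pool object counts and usage can be checked directly with standard Ceph admin commands; which pool backs the VM volumes is not named in this issue, so check all of them.)

# per-pool object counts and space usage, run on a node with Ceph admin credentials
rados df
# more detailed per-pool statistics
ceph df detail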

To Reproduce

  1. Have a large ingest rate (>~800k points/sec) into a large cluster (28 storage nodes)
  2. Wait a week

Expected behavior
No sudden weekly IO spikes.

Logs
No errors discovered in the logs other than messages about query timeouts (which are clearly symptoms of the problem).
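(A hedged aside: assuming the storage nodes run vmstorage as a systemd service named vmstorage - an assumption, not something stated in this issue - the logs around one of the Monday-midnight windows could be pulled with something like the following.)

# 2021-08-02 was a Monday; adjust the unit name and time window as needed
journalctl -u vmstorage --since "2021-08-02 00:00:00 UTC" --until "2021-08-02 03:00:00 UTC" | grep -iE "error|timeout"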

Screenshots
Load profile: (screenshot)
Re-routing of data: (screenshot)
Ceph objects being removed: (screenshot from 2021-08-04)

Version
1.63.0

Used command-line flags

 -search.maxUniqueTimeseries=300000000 -search.maxTagKeys=1000000 -search.maxTagValues=1000000000 -dedup.minScrapeInterval=1s -memory.allowedPercent=75 -storageDataPath=/var/lib/victoriametrics/storage/prod_cluster_1 -retentionPeriod=365d -vminsertAddr :8400 -vmselectAddr :8401 -httpListenAddr :8482
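(For readability, here is the same command line laid out one flag per line, as it would be passed to the vmstorage binary - that these flags belong to vmstorage is inferred from -vminsertAddr/-vmselectAddr, and the binary path below is a placeholder, not taken from this issue.)

/path/to/vmstorage \
  -search.maxUniqueTimeseries=300000000 \
  -search.maxTagKeys=1000000 \
  -search.maxTagValues=1000000000 \
  -dedup.minScrapeInterval=1s \
  -memory.allowedPercent=75 \
  -storageDataPath=/var/lib/victoriametrics/storage/prod_cluster_1 \
  -retentionPeriod=365d \
  -vminsertAddr :8400 \
  -vmselectAddr :8401 \
  -httpListenAddr :8482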

Hi @johnseekins! Thank you for such a detailed issue!
Could you please provide some additional screenshots from our dashboards:

  • storage. LSM parts
  • storage. Disk writes/reads
  • storage. Active merges
  • resource usage. Open FDs
  • overview. Disk space used

Thank you!

Active merges and LSM parts: (screenshot)
Disk r/w and FDs: (screenshot)
Disk used: (screenshot)
And a few other interesting panels: (screenshot)

There is definitely a huge spike in indexdb and small merges during that time, but they seem to be symptoms (as they spike up suddenly during the event and then taper off...)

Hm, the LSM parts graph suggests VM has about 6-8k parts on disk. And "vm_hdd Objects" (if I'm reading it right) suggests roughly 3 million objects were deleted over those 2.5 hours, which does not correlate with the number of parts either merged or deleted by VM.
The spike in ActiveMerges can be explained by the RowsRerouted graph - every time vmstorage receives new time series, it creates new index parts on disk, which triggers subsequent merges...

Is there any chance that some kind of cron job is enabled in the OS and runs every Monday at midnight? I recall a similar issue where a cron job for the fstrim process caused lags on our SSDs every week.
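(If it helps, a quick way to look for anything scheduled weekly on a node is to list systemd timers and the usual cron locations; these are standard commands, nothing VictoriaMetrics-specific.)

# systemd timers, including inactive ones, with their next trigger time
systemctl list-timers --all
# classic cron locations
ls /etc/cron.weekly/ /etc/cron.d/
crontab -l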

We can't find anything... but I'll check again. The one suspicious thing was an MD scan of the hypervisor's OS drive, but that only happens once a month and triggers two hours after the event starts.

But fstrim on the boxes is scheduled for Monday at midnight!

root@store-1:~# systemctl status fstrim.timer
● fstrim.timer - Discard unused blocks once a week
     Loaded: loaded (/lib/systemd/system/fstrim.timer; enabled; vendor preset: enabled)
     Active: active (waiting) since Wed 2021-06-16 14:08:16 UTC; 1 months 19 days ago
    Trigger: Mon 2021-08-09 00:00:00 UTC; 3 days left
   Triggers: ● fstrim.service
       Docs: man:fstrim

Jun 16 14:08:16 store-1 systemd[1]: Started Discard unused blocks once a week.

So drive trimming is causing this? That's strange, given that the data drive is a Ceph block device and the OS drive is an OpenStack block device. But I'm ready to believe it!

I've disabled fstrim on all the storage nodes ('cause whether or not that's the actual problem, there's no reason for it to be running on these hosts).
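(For reference, one way to do that on systemd-based hosts is to stop the timer now and keep it from coming back:)

# stop the weekly trim and prevent it from starting at boot
systemctl disable --now fstrim.timer
# optional: mask it so nothing re-enables it later
systemctl mask fstrim.timer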

This was fstrim. It may be worth documenting somewhere that fstrim can have an adverse effect on the system?

Hm, I'm not sure this is exactly relevant to VM. The case I mentioned happened to me five years ago, and it was a Postgres cluster suffering every Sunday because of fstrim. I was just lucky to recall it now and ask you to check whether there is something similar in your system.
However, here's a commit that mentions fstrim in the Tuning section.

It wasn't directly related to VictoriaMetrics, no. But disabling fstrim on these virtual hosts meant the weekly I/O spikes didn't happen.