Cosium / zabbix_zfs-on-linux

zabbix template and user parameters to monitor zfs on linux

Monitoring cache/log drives

cstackpole opened this issue

Greetings,
Thank you very much for this Zabbix template! I put it on my ZFS systems a week ago and it has been working great so far. Some of the warnings it has raised pushed me to learn more about ZFS and better optimize my systems, which I appreciate a lot!

I did run into an issue that I would like to monitor going forward, but I am not sure of the best way to do it.

zpool status shows my configuration (cleaned up a bit; the box is idle right now):

config:

	NAME                                 STATE     READ WRITE CKSUM
	vmpool                               ONLINE       0     0     0
	  mirror-0                           ONLINE       0     0     0
	    ata-ST4000DM004                  ONLINE       0     0     0
	    ata-ST4000DM004                  ONLINE       0     0     0
	  mirror-1                           ONLINE       0     0     0
	    ata-ST3000DM007                  ONLINE       0     0     0
	    sda                              ONLINE       0     0     0
	logs	
	  sdb                                ONLINE       0     0     0
	cache
	  sdg                                ONLINE       0     0     0

For the workload on this box, the cache and log drives make a HUGE difference - the difference between usable and barely-but-frustratingly usable. Saturday night the log SSD went completely kaput: it just blinked out of the system (I've had plenty of SSDs report they were running great only to insta-die on me, so drive deaths don't surprise me anymore). I have been running a weekly SMART check (also recorded in Zabbix), so that would eventually have alerted me that a drive was missing. What I actually noticed was that the system was sluggish on Sunday and painfully slow by Monday morning: ZFS had just gone back to using memory, so it was still reporting health OK! I ran out to pick up a replacement SSD, and within minutes of swapping it into the system as the new log device everything was humming along again. That's when I went to Zabbix to figure out how to get alerted faster should this happen again.

I've already changed my SMART checks from weekly to daily, so that will tell me if another SSD insta-dies. However, I was hoping to see some information about the cache and log devices captured by this template. I only see the "CHECKSUM/READ/WRITE error counter" and "total number of errors" items, which (unsurprisingly) never showed anything but zeros for the failed drive.

Any thoughts on what values would be good to monitor for the cache/log drives? I was thinking of something along the lines of "the total number of drives in the ZFS pool just shrank!"; a rough idea of what I mean is sketched below.
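To make that concrete, here is a minimal sketch (not part of the template; the script name and item key are made up) of a user parameter that counts the rows zpool status lists for a pool, so a Zabbix trigger on a drop in that number could catch a log or cache device vanishing from the layout:

    #!/bin/bash
    # zfs_device_count.sh -- hypothetical helper, not part of this template.
    # Example agent config line (also hypothetical):
    #   UserParameter=zfs.pool.device_count[*],/etc/zabbix/scripts/zfs_device_count.sh $1
    # Counts every row in the config section of `zpool status` that carries
    # STATE/READ/WRITE/CKSUM columns (the pool, its vdevs and the leaf devices);
    # if anything drops out of the layout, this number shrinks.
    pool="$1"
    zpool status "$pool" \
      | awk '/^config:/ {in_cfg=1; next}
             /^errors:/ {in_cfg=0}
             in_cfg && NF >= 5 && $1 != "NAME" {count++}
             END {print count + 0}'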

I also thought about trying to scrape data out of zpool iostat -v pool_name. Under heavy load it isn't unusual to see my log SSD's usage grow to tens of GB, but most of the time it sits at a few hundred K at most. The cache drive is very frequently full, though (again, right now the box is pretty idle):

logs                                     -      -      -      -      -      -
  sdb                                 544K   111G      0      2      0   149K
cache                                    -      -      -      -      -      -
  sdg                                 105G  7.10G      0      1  59.7K   219K

I'm not sure capturing the log device usage would be useful, since the heavy loads usually finish quickly and the spike would probably pass before Zabbix could pick it up. The cache drive's usage might be interesting, though, as it fluctuates a lot; a rough sketch of how that could be pulled out follows.
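For the cache usage specifically, the alloc/free capacity columns of zpool iostat -v should be point-in-time values (unlike the ops/bandwidth columns), so a single sample ought to be enough there. A rough sketch, assuming the scripted/parsable flags (-H, -p) behave the same on zfs 0.8.3:

    #!/bin/bash
    # Hypothetical check, not part of the template: print the allocated bytes
    # on the pool's cache (L2ARC) device.
    # -H: tab-separated scripted output, -p: exact byte values, -v: per-device rows.
    pool="vmpool"
    zpool iostat -Hpv "$pool" \
      | awk '$1 == "cache" {in_cache=1; next}
             in_cache {print $2; exit}'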

Hopefully that's not too much of an info dump, but I thought it was worth asking before I start hacking on the template to add new metrics. Any thoughts on the best way to capture this data and alert me faster should the log/cache SSD blink out on me again?

Just in case it is of interest, I am running on the latest kernel for SL 7.7 with zfs-0.8.3-1.

Thanks!

Hey @cstackpole,

I'm surprised that your log device just "disappeared"; its state should have changed to "UNAVAIL".

I'll have to test this particular scenario on a test system.

About zpool iostat -v pool_name: yes, that's something I want to do, but it is surprisingly tricky because the first sample zpool iostat prints is an average since boot rather than current activity; you have to wait for the second sample to get real values.
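A sketch of one way around that (assuming both samples print the same number of rows): ask for two samples one second apart and keep only the second one, which reflects the measured interval instead of the since-boot average.

    #!/bin/bash
    # Sketch: take two zpool iostat samples one second apart and keep the second.
    pool="vmpool"
    out="$(zpool iostat -Hpv "$pool" 1 2)"
    lines=$(wc -l <<< "$out")
    # Both samples have the same row count, so the second half of the output
    # is the fresh interval; this also tolerates a separator line between them.
    tail -n "$((lines / 2))" <<< "$out"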

What I would really like is to get the "real" values from ZoL's low-level metrics, but I haven't had time to look into it yet.
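In the meantime, for the cache device specifically, the L2ARC counters already show up in the ARC kstats, so a user parameter could read them without going through zpool iostat at all. A minimal sketch, assuming the ZoL 0.8 field names in /proc/spl/kstat/zfs/arcstats:

    #!/bin/bash
    # Sketch: dump the L2ARC size and hit/miss counters from the ARC kstats.
    # Column 1 is the stat name, column 3 is its value.
    awk '$1 ~ /^l2_(size|asize|hits|misses)$/ {print $1, $3}' \
        /proc/spl/kstat/zfs/arcstats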

@AceSlash Thank you for following up, and I apologize for the delay. The whole world is going nuts and I got caught up in the absurdity at my job.

I FINALLY got time today to get back to this issue. I think I hit a failure in a very specific corner case: from what I can tell in the logs, the bad SSD just popped out of existence and ZFS never flagged the bad state; the device simply vanished entirely. If I didn't have log files clearly showing that it was configured correctly and working only two days before, I would be tempted to think the drive never actually existed...

I spent some time tinkering and doing all kinds of things to the replacement SSD (including just pulling its power), and in every case the system notes that the drive disappeared, ZFS goes into a degraded state, and your ZFS template alerts me perfectly. No matter what I do, I can't replicate what happened to the previous SSD. 🤷‍♂️

After messing about with it today, I don't think this has anything to do with Zabbix or your template, so I'm closing this issue. If I do manage to replicate it and it looks relevant to the template, I will open a new issue. Otherwise, thank you for your time and effort on this template. It is much appreciated!

OK, that's strange indeed. Keep me posted if you manage to reproduce this!