Cosium / zabbix_zfs-on-linux

zabbix template and user parameters to monitor zfs on linux

Does not alert on cksum errors

killmasta93 opened this issue

Hi,
Zabbix does not raise an alert when a pool has checksum (CKSUM) errors:

root@prometheus26:~# zpool status
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: none requested
config:

NAME        STATE     READ WRITE CKSUM
rpool       ONLINE       0     0     0
  mirror-0  ONLINE       0     0     0
    sda3    ONLINE       0     0     0
    sdb3    ONLINE       0     0     0
  mirror-1  ONLINE       0     0     0
    sdc     ONLINE       0     0     0
    sdd     ONLINE       0     0    33

You are right, this is not currently supported. Usually you will be alerted to disk errors by other metrics outside of ZFS.

Nevertheless, this could be a good improvement.

Thanks for the reply. Is an update planned for this?
Thank you

@killmasta93: I cannot give you a specific date. Currently the template doesn't handle discovery of the vdevs. I took a quick look and didn't see any way to get the list of vdevs other than parsing the output of "zpool status", which is not that easy and seems a little brittle.

I'll get back to you when I have some time to look further.

I have started the implementation. I now get the list of all vdevs with their state and their read, write and checksum error counters.
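To illustrate the kind of parsing discussed above, here is a minimal Python sketch that pulls the name, state and three error counters out of text shaped like the "zpool status" output quoted in this issue. It is not the template's actual userparameter script. The function name `parse_vdevs`, the regular expression and the trimmed sample text are assumptions made for illustration. It only handles rows whose counters are plain integers and skips anything else, which is part of why this approach is brittle.

```python
import re

# Sample text shaped like the "zpool status" output quoted earlier in this
# issue (trimmed; indentation is illustrative).
SAMPLE = """\
  pool: rpool
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda3    ONLINE       0     0     0
            sdb3    ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0    33
"""

# One config row: name, state, then three plain-integer counters.
# Any trailing text after the counters is tolerated and ignored.
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)(?:\s.*)?$")


def parse_vdevs(status_text):
    """Return one dict per config row (pool, mirror and disk rows alike)."""
    vdevs = []
    pool = None
    in_table = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("pool:"):
            pool = stripped.split(":", 1)[1].strip()
            in_table = False
        elif stripped.startswith("NAME"):
            in_table = True       # header row: the config table starts here
        elif in_table and not stripped:
            in_table = False      # a blank line ends the table
        elif in_table:
            m = ROW.match(line)
            if m:                 # rows without integer counters are skipped
                name, state, read, write, cksum = m.groups()
                vdevs.append({
                    "pool": pool, "name": name, "state": state,
                    "read": int(read), "write": int(write),
                    "cksum": int(cksum),
                })
    return vdevs


if __name__ == "__main__":
    for vdev in parse_vdevs(SAMPLE):
        print(vdev)
```

The row for the pool itself and the rows for the mirrors are returned alongside the disks, so a caller that only cares about leaf devices would need to filter them.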

For the alerting, I think I'll raise an alert when any counter is > 0, but only once per vdev. I don't see any value in raising 2 or 3 alerts if a vdev has more than 1 counter > 0.

For example, in your case it will raise an alert saying "vdev /dev/sdd has 33 errors". If you had 5 write errors and 33 checksum errors, it would instead say "vdev /dev/sdd has 38 errors". I want to avoid 2 alerts for the same vdev like "vdev /dev/sdd has 33 checksum errors" and "vdev /dev/sdd has 5 write errors".
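The "one alert per vdev" rule described above comes down to summing the three counters and alerting once when the total is non-zero. A rough Python sketch of that logic follows. The function name `vdev_alert` is hypothetical, and in the real template this would presumably be expressed as a Zabbix trigger rather than as code.

```python
def vdev_alert(name, read, write, cksum):
    """Return a single alert message for a vdev, or None if it has no errors.

    The three counters are summed so that a vdev with several non-zero
    counters still produces only one alert.
    """
    total = read + write + cksum
    if total == 0:
        return None
    return f"vdev {name} has {total} errors"


if __name__ == "__main__":
    print(vdev_alert("/dev/sdd", 0, 0, 33))   # checksum errors only
    print(vdev_alert("/dev/sdd", 0, 5, 33))   # write and checksum errors
    print(vdev_alert("/dev/sdc", 0, 0, 0))    # healthy vdev, no alert
```

The trade-off is that the message no longer says which counter is non-zero, so the operator still has to run "zpool status" to see the breakdown.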

Thanks for the reply. Should I update the script to see the alert? I still have not cleared the cksum error on my pool.

@killmasta93: not yet, I'm still testing it and it's not public yet. It should be done by the end of the week if everything goes well. I'll tell you when it's done.

Thank you again. If I can help in any way, let me know :)

@killmasta93 testing is done and the new userparameters and template have been deployed on my infrastructure. I actually found out that I had an error on one disk with it!

It was a good idea ;-)

Thank you so much, I'm glad the idea helped. Quick question: to update, do I need to download the template.xml and replace it?

Edit: just updated it and got the alert. Thanks very much.

You're welcome.