Cosium / zabbix_zfs-on-linux

zabbix template and user parameters to monitor zfs on linux

items can become unsupported e.g. through division by zero

tvtue opened this issue · comments

commented

Hi,
the calculated item "ZFS ARC Cache Hit Ratio" with the key zfs.arcstats_hit_ratio has become unsupported on one of my monitored hosts. The reason is given as "Cannot evaluate expression: division by zero."

It is calculated with this formula:
100*(last(zfs.arcstats[hits])/(last(zfs.arcstats[hits])+last(zfs.arcstats[misses])))

Would it be worth making this a little more robust so that the divisor can never be zero?
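To make the failure mode concrete, here is a minimal Python sketch of the same arithmetic. It is only an illustration: `hit_ratio` is a hypothetical helper, not part of the template, and returning `None` stands in for Zabbix marking the item unsupported.

```python
def hit_ratio(hits, misses):
    """Mirror of 100*(hits/(hits+misses)) from the calculated item.

    Returns None where the original expression would divide by zero
    (an assumption standing in for the "unsupported" state in Zabbix).
    """
    total = hits + misses
    if total == 0:
        return None
    return 100 * (hits / total)


# Both counters at zero: the divisor is zero, so no ratio can be computed.
print(hit_ratio(0, 0))
# Any non-zero counter makes the expression well defined.
print(hit_ratio(3, 1))
```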

I don't see how this can happen on a system that is in use, since zfs.arcstats[hits] and zfs.arcstats[misses] cannot both be 0 at the same time.

Did this happen on a completely unused system where the arc is not used at all?

commented

It occurs a few moments after applying the template to a host. At that point the items zfs.arcstats[misses] and zfs.arcstats[hits] are both 0 (zero), so the formula divides by zero.

100*(last(zfs.arcstats[hits])/(last(zfs.arcstats[hits])+last(zfs.arcstats[misses])))

I watched the item become supported as soon as zfs.arcstats[hits] received a value.

commented

I must partly take back my last comment. I am still seeing the item as unsupported. I looked into this again and first noticed that SELinux may have been a problem. I saw denials for zabbix-agent doing its thing, so I changed some SELinux rules to give it access. Still, the two items zfs.arcstats[hits] and zfs.arcstats[misses] are zero in the Zabbix frontend (latest data). So I tried to fetch them manually with zabbix_get -s ... -k ..., which works. As they are active agent items, I also raised the log level of the Zabbix agent to see any problems. This is what I am seeing for "misses":

16846:20200609:101858.685 EXECUTE_STR() command:'awk '/^misses/ {printf $3;}' /proc/spl/kstat/zfs/arcstats' len:5 cmd_result:'49282'
16846:20200609:101858.685 for key [zfs.arcstats[misses]] received value [49282]
16846:20200609:101858.685 In process_value() key:'myhost:zfs.arcstats[misses]' lastlogsize:null value:'49282'
16846:20200609:101858.685 In send_buffer() host:'my_zabbix_server_ip' port:10051 entries:14/100
16846:20200609:101858.685 send_buffer() now:1591690738 lastsent:1591690737 now-lastsent:1 BufferSend:5; will not send now
16846:20200609:101858.685 End of send_buffer():SUCCEED
16846:20200609:101858.685 buffer: new element 14
16846:20200609:101858.685 End of process_value():SUCCEED
16846:20200609:101858.685 In need_meta_update() key:zfs.arcstats[misses]
16846:20200609:101858.685 End of need_meta_update():FAIL
16846:20200609:101858.685 In send_buffer() host:'my_zabbix_server_ip' port:10051 entries:15/100
16846:20200609:101858.685 send_buffer() now:1591690738 lastsent:1591690737 now-lastsent:1 BufferSend:5; will not send now
16846:20200609:101858.685 End of send_buffer():SUCCEED

I am not sure what "End of need_meta_update():FAIL" means, but I would assume that it is not relevant to this problem, is it?

Anyway, I don't know why this happens and how I can debug this further.

Do you have an idea or a tip for me?
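For anyone who wants to check the raw counter outside the agent, the awk command in the log above can be approximated in Python. This is a sketch under assumptions: `read_kstat` is a hypothetical helper, the sample text is a trimmed, made-up excerpt in the three-column name/type/data layout, and the match is on the exact field name rather than the prefix match that `/^misses/` performs.

```python
def read_kstat(text, name):
    """Return the 'data' column for the row whose first column equals name."""
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly three columns: name, type, data.
        if len(fields) == 3 and fields[0] == name:
            return int(fields[2])
    return None


# Hypothetical, trimmed sample; a real file has many more rows.
SAMPLE = """\
name                            type data
hits                            4    1000
misses                          4    25
"""

print(read_kstat(SAMPLE, "misses"))

# On a host with ZFS loaded, the real file could be read instead:
# with open("/proc/spl/kstat/zfs/arcstats") as fh:
#     print(read_kstat(fh.read(), "misses"))
```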

Did you use sudo to run the zabbix-agent commands? You can also give the zabbix user a shell to test as the zabbix user. You should have the same result as the agent this way.

sudo -u zabbix zabbix_agentd -t zfs.arcstats[miss]
commented

Hi AceSlash, thank you for your reply.
Here is the output from the sudo command test.

[root@ub31 ~]# sudo -u zabbix zabbix_agentd -t zfs.arcstats[miss]
zfs.arcstats[miss]                            [t|7637468]

Okay, the result is correct. I'm not sure how to debug from here. A quick fix might be to add 1 to the divisor so that (last(zfs.arcstats[hits])+last(zfs.arcstats[misses])+1) can never be 0, even on an unused system.

This is definitely an edge case, but changing the formula to this would prevent any division by 0:

 100*(last(zfs.arcstats[hits])/(last(zfs.arcstats[hits])+last(zfs.arcstats[misses])+1))
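A quick Python sketch of this "+1" variant shows the trade-off. The function name and the counter values are hypothetical; only the arithmetic comes from the formula above. The divisor is never zero, but the result is always slightly lower than the true ratio, which is visible with small counters and negligible with large ones.

```python
def hit_ratio_plus_one(hits, misses):
    """Mirror of 100*(hits/(hits+misses+1)); the divisor is always >= 1."""
    return 100 * (hits / (hits + misses + 1))


# No division by zero on an unused system.
print(hit_ratio_plus_one(0, 0))
# With tiny counters the extra 1 skews the result: one hit and no
# misses is reported as 50 instead of 100.
print(hit_ratio_plus_one(1, 0))
# With large counters the skew becomes very small.
print(hit_ratio_plus_one(1_000_000, 0))
```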
commented

Thank you for your fix. I applied the new formula and the item stayed supported since then. So no division by zero any more. Thank you.

A better way to avoid it and still get correct data is:

 100*(last(zfs.arcstats[hits])/(last(zfs.arcstats[hits])+count(zfs.arcstats[hits],#1,0)+last(zfs.arcstats[misses])+count(zfs.arcstats[misses],#1,0)))

It's an approved solution from the Zabbix team :)

@sharewax : smart! I had to look at the count documentation but for anyone wondering what this does, the count will return 1 if the last value is 0, else it will return 0.

As a result, when the zfs.arcstats[hits] is 0, we will have 1, and same for zfs.arcstats[misses]. Actually, we don't need both, just one will be enough to avoid the division by 0.

This formula avoids the division by zero in the same way but is shorter:

100*(last(zfs.arcstats[hits])/(last(zfs.arcstats[hits])+count(zfs.arcstats[hits],#1,0)+last(zfs.arcstats[misses])))
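The two count-based variants can be compared with a small Python sketch. It assumes, as described above, that count(...,#1,0) yields 1 when the last value is 0 and 0 otherwise; the function names are hypothetical. Both variants avoid the zero divisor. They only differ when misses is 0 while hits is not: there the shorter formula reports exactly 100, whereas the longer one still adds 1 to the divisor.

```python
def count_last_is_zero(value):
    """Stand-in for count(item,#1,0): 1 if the last value is 0, else 0."""
    return 1 if value == 0 else 0


def ratio_both_counts(hits, misses):
    """Variant that adds a count() term for hits and for misses."""
    divisor = (hits + count_last_is_zero(hits)
               + misses + count_last_is_zero(misses))
    return 100 * (hits / divisor)


def ratio_single_count(hits, misses):
    """Shorter variant with a count() term for hits only."""
    divisor = hits + count_last_is_zero(hits) + misses
    return 100 * (hits / divisor)


# Neither variant divides by zero on an unused system.
print(ratio_both_counts(0, 0), ratio_single_count(0, 0))
# Hits without misses: the variants disagree.
print(ratio_both_counts(1, 0), ratio_single_count(1, 0))
# Both counters non-zero: the count() terms vanish and the results match.
print(ratio_both_counts(3, 1), ratio_single_count(3, 1))
```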

I'll make the change to master.