Parse Nagios's performance data

goblin opened this issue 11 years ago · comments

Currently, Nagios's performance data (http://nagiosplug.sourceforge.net/developer-guidelines.html#PLUGOUTPUT) is simply treated as a string and appended to Riemann's "description" field.

It would be nice if this data were parsed and sent as separate metrics.

Brian Hatfield commented 11 years ago

Thanks!

goblin · Answer 1 · Wed Apr 24 2013 23:55:03 GMT+0800 (China Standard Time)

Have you got any thoughts on how best to do it? I think we'd have to send multiple events to Riemann, one for each piece of performance data. This might complicate things a little.

Brian Hatfield · Answer 2 · Thu Apr 25 2013 00:01:12 GMT+0800 (China Standard Time)

I'd love to parse this data! Do you happen to have a link to a plugin that actually outputs this data? In all my years of running various Nagios installs, I've never noticed one output this data. Maybe I just wasn't looking close enough :-)

If I can see what the output looks like, I can improve the NagiosTask to properly parse it.

goblin · Answer 3 · Thu Apr 25 2013 00:02:44 GMT+0800 (China Standard Time)

Sure, for instance the http check:

% ./check_http -H google.com
HTTP OK: HTTP/1.1 301 Moved Permanently - 559 bytes in 0.031 second response time |time=0.031230s;;;0.000000 size=559B;;;0

Or the SSH one:

% ./check_ssh localhost
SSH OK - OpenSSH_6.0p1 Debian-4 (protocol 2.0) | time=0.014560s;;;0.000000;10.000000
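
For reference, both outputs above have the same layout: human-readable text, a `|`, then space-separated items of the form label=value[UOM];warn;crit;min;max. The following is a minimal Python sketch of parsing that layout. It is not the NagiosTask code; `PERF_ITEM` and `parse_perfdata` are hypothetical names, and the sketch assumes plain decimal values.

```python
import re

# Assumed item layout: label=value[UOM];warn;crit;min;max
# (labels may be single-quoted when they contain spaces).
PERF_ITEM = re.compile(
    r"(?P<label>'[^']+'|[^\s=']+)="
    r"(?P<value>-?\d+(?:\.\d+)?)"
    r"(?P<uom>[^\s;]*)"
    r"(?P<rest>(?:;[^\s;]*)*)"
)


def parse_perfdata(output):
    """Split plugin output at the first '|' and parse the perfdata items.

    Returns (text, items), where items is a list of dicts.
    Hypothetical helper for illustration only.
    """
    text, sep, perf = output.partition("|")
    items = []
    if not sep:
        # No performance data present.
        return text.strip(), items
    for m in PERF_ITEM.finditer(perf):
        # Everything after the value: warn, crit, min, max (any may be empty).
        fields = m.group("rest").split(";")[1:]
        fields += [""] * (4 - len(fields))
        warn, crit, lo, hi = fields[:4]
        items.append({
            "label": m.group("label").strip("'"),
            "value": float(m.group("value")),
            "uom": m.group("uom"),
            "warn": warn,
            "crit": crit,
            "min": lo,
            "max": hi,
        })
    return text.strip(), items
```

Fed the check_http line above, this sketch returns two items, labelled time and size, with the unit ("s", "B") split off from the numeric value.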

goblin · Answer 4 · Thu Apr 25 2013 00:03:45 GMT+0800 (China Standard Time)

(they're from nagios-plugins-basic debian sid package version 1.4.16-1)

(edited, it's basic, not standard)

Brian Hatfield · Answer 5 · Thu Apr 25 2013 01:46:58 GMT+0800 (China Standard Time)

Okay, so I am doing some legwork to update python-bernhard to support Riemann 2+'s attributes field, which I think is a good way to send over an n-sized set of metric data for a given event.

I'm not sure that it's a good idea to send many events to get multiple metrics for the same service, because when I think about performing actions on their state, I don't want 5 alerts/pages for one service failure.

I'd still like to pick one of the labels as the canonical 'metric' when there is more than one piece of performance data, but I'm not sure I can come up with a good rule for choosing. I might just go with index 0 of the parsed return string.

Thoughts?
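
To make the proposal concrete, here is a rough Python sketch of the "one event, index 0 as the metric, everything as attributes" idea. It only builds a plain dict; `build_event` is a hypothetical helper, the input is assumed to be a list of (label, value) pairs already parsed from the plugin output, and the exact shape bernhard expects for attributes is not confirmed here.

```python
def build_event(host, service, state, description, perf_items):
    """Build a single event dict from parsed performance data.

    perf_items is an assumed list of (label, value) tuples.
    The first value becomes the event metric; every value is also
    kept as a string attribute so nothing is lost.
    """
    event = {
        "host": host,
        "service": service,
        "state": state,
        "description": description,
    }
    if perf_items:
        # "Index 0" rule: the first perfdata value is the canonical metric.
        event["metric"] = perf_items[0][1]
        event["attributes"] = {label: str(value) for label, value in perf_items}
    return event


event = build_event(
    "web1", "check_http", "ok", "HTTP OK",
    [("time", 0.03123), ("size", 559.0)],
)
```

This keeps one event per service check, so a failure produces one state change rather than one per metric.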

Brian Hatfield · Answer 6 · Thu Apr 25 2013 02:44:12 GMT+0800 (China Standard Time)

Here's the upstream PR to update bernhard to support 'attributes': https://github.com/banjiewen/bernhard/pull/6

Brian Hatfield · Answer 7 · Thu Apr 25 2013 04:09:29 GMT+0800 (China Standard Time)

Also, it appears that attributes whose names collide with event fields are ignored: for example, 'time' is an event field as well as a performance data label. That means we're probably going to need to prefix the attribute names :-/
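
A small Python sketch of the prefixing idea, for illustration. The helper name `prefix_attributes` is hypothetical, and the default prefix `task_` is chosen only because an attribute named `task_time` is mentioned later in this thread.

```python
def prefix_attributes(perf_values, prefix="task_"):
    """Return a copy of the mapping with every key prefixed.

    Prefixing every label (not only known collisions) avoids having to
    maintain a list of reserved event field names.
    """
    return {prefix + label: value for label, value in perf_values.items()}


attrs = prefix_attributes({"time": "0.031230", "size": "559"})
```

Here `attrs` no longer contains a bare `time` key, so it cannot clash with the event's own `time` field.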

Brian Hatfield · Answer 8 · Thu Apr 25 2013 06:06:51 GMT+0800 (China Standard Time)

Added a first pass, but it needs some tuneup and cleanup: df8c956

goblin · Answer 9 · Thu Apr 25 2013 17:56:00 GMT+0800 (China Standard Time)

Whoa, that was quick!

These new attributes look like a great fit for this; I wasn't aware they existed :-)

Had a quick test and it looks pretty good. One minor issue is that my response time of about 0.035 seconds gets rounded, and the attribute ends up as :task_time "0.0".

But wow, many, many thanks for implementing this so quickly :-)

goblin · Answer 10 · Thu Apr 25 2013 18:58:54 GMT+0800 (China Standard Time)

I've fixed the rounding problem with an extra dot in the regex: #3
(pretty minor of course;-)

Brian Hatfield · Answer 11 · Thu Apr 25 2013 22:46:01 GMT+0800 (China Standard Time)

Okay, I tidied up the parsing code a little bit more. I think there's more to be done here to make it really bulletproof (e.g., when performance data is returned but is nonsensical or invalid), but this should be good enough for common use cases.

90c68b6
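
As an illustration of the "bulletproof" direction mentioned above, here is one possible Python sketch that skips malformed performance data items instead of failing. It is not taken from the commits referenced in this thread; `parse_value` and `parse_tokens` are hypothetical names, and the sketch assumes space-separated label=value[UOM];... items without quoted labels.

```python
import string

# Characters stripped from the end of a value as its unit of measurement.
UOM_CHARS = string.ascii_letters + "%"


def parse_value(token):
    """Return (label, float) for 'label=value[UOM];...' or None if malformed."""
    label, sep, rest = token.partition("=")
    if not sep or not label:
        return None  # no '=' at all, or an empty label
    raw = rest.split(";", 1)[0]       # drop warn/crit/min/max
    number = raw.rstrip(UOM_CHARS)    # drop a trailing unit such as 's' or 'B'
    try:
        return label, float(number)
    except ValueError:
        return None  # value was empty or not numeric


def parse_tokens(perf):
    """Parse a perfdata string, silently skipping invalid items."""
    parsed = (parse_value(token) for token in perf.split())
    return [item for item in parsed if item is not None]


good = parse_tokens("time=0.014560s;;;0.000000;10.000000 bogus load=U =3 size=559B")
```

In this example only the time and size items survive; the item without `=`, the non-numeric value, and the empty label are dropped.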