The Prometheus Nagios exporter reads status and performance data from Nagios plugins via the MK Livestatus Nagios plugin and publishes it in a form that can be scraped by Prometheus.
Setup is as simple as installing the Livestatus module and then running the nagios_exporter.py service. First, enable the module in nagios.cfg:
echo 'broker_module=/usr/lib/check_mk/livestatus.o /var/lib/nagios3/rw/livestatus' >> /etc/nagios3/nagios.cfg
echo 'event_broker_options=-1' >> /etc/nagios3/nagios.cfg
Restart Nagios, and start the exporter:
./nagios_exporter.py --path /var/lib/nagios3/rw/livestatus
It should then be possible to visit:
http://localhost:5000/metrics
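For readers curious about what reading from the Livestatus socket involves, here is a minimal sketch of a Livestatus query in Python. It is not the exporter's actual code: the helper names build_query and query_livestatus are hypothetical, and the column list is an assumption about which standard Livestatus "services" columns would be useful here.

```python
import json
import socket

# Standard Livestatus "services" table columns. Which columns the exporter
# really requests is an assumption made for this sketch.
COLUMNS = ["host_name", "description", "check_command", "state",
           "latency", "execution_time", "is_flapping", "acknowledged",
           "perf_data"]


def build_query(columns):
    """Return a Livestatus query for the services table with JSON output."""
    return "GET services\nColumns: {}\nOutputFormat: json\n\n".format(
        " ".join(columns))


def query_livestatus(path, columns=COLUMNS):
    """Send the query to the Livestatus UNIX socket and decode the reply."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.connect(path)
        sock.sendall(build_query(columns).encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # tell Livestatus the request is done
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    finally:
        sock.close()
    rows = json.loads(b"".join(chunks).decode("utf-8"))
    # One dict per service, keyed by column name.
    return [dict(zip(columns, row)) for row in rows]
```

Calling query_livestatus("/var/lib/nagios3/rw/livestatus") should return one dict per service, provided the socket is readable by the calling user.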
Every metric is prefixed with nagios_, following the metric naming best practices. The prefix is followed by the name of the Nagios check command, such as nagios_check_load_. The metric name suffix comes from various Nagios status names. For example, a load service check for localhost would include the following metrics:
nagios_check_load_exec_time{hostname="localhost", service="Load"} 0.011084
nagios_check_load_latency{hostname="localhost", service="Load"} 0.078
nagios_check_load_state{hostname="localhost", service="Load"} 0
nagios_check_load_flapping{hostname="localhost", service="Load"} 0
nagios_check_load_acknowledged{hostname="localhost", service="Load"} 0
Every metric is also labeled with the hostname and service description.
To ease maintenance of Prometheus rulesets, an alternative mechanism is provided if the --command_labels flag is set. The metrics will then contain the command as a label, like so:
nagios_command_exec_time{hostname="localhost", command="check_load", service="Load"} 0.011084
nagios_command_latency{hostname="localhost", command="check_load", service="Load"} 0.078
nagios_command_state{hostname="localhost", command="check_load", service="Load"} 0
nagios_command_flapping{hostname="localhost", command="check_load", service="Load"} 0
nagios_command_acknowledged{hostname="localhost", command="check_load", service="Load"} 0
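The two naming schemes can be compared side by side with a small sketch. The function format_metric below is hypothetical and only reproduces the sample lines shown in this README; it is not taken from the exporter and does no label escaping.

```python
def format_metric(command, field, value, hostname, service,
                  command_labels=False):
    """Render one sample line in either naming scheme (illustrative only)."""
    labels = [("hostname", hostname)]
    if command_labels:
        # Command moves out of the metric name and into a label.
        name = "nagios_command_{}".format(field)
        labels.append(("command", command))
    else:
        # Default: command is part of the metric name.
        name = "nagios_{}_{}".format(command, field)
    labels.append(("service", service))
    label_text = ", ".join('{}="{}"'.format(k, v) for k, v in labels)
    return "{}{{{}}} {}".format(name, label_text, value)


print(format_metric("check_load", "state", 0, "localhost", "Load"))
print(format_metric("check_load", "state", 0, "localhost", "Load",
                    command_labels=True))
```

With the label form, the set of metric names stays fixed no matter how many check commands exist.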
This is useful for aggregation on the Prometheus side. With the command as a label, the only rulesets we'd need are ones for exec_time, latency, state, flapping, acknowledged and possibly perf_data_value.
When new commands are added, no Prometheus changes are necessary.
If we keep the command in the metric name, each new command requires new rulesets in Prometheus (or its graphing consoles, or Grafana). See Issue #13 for discussion.
Performance data is plugin-specific, though there is a common format that most plugins follow. Performance data follows the plugin output, starting with |. Typically, the format is a set of key=value1[;value2]+ strings. For example:
$ check_disk <some args>
DISK OK - free space: / 2400 MB (69% inode=83%);| /=2400MB;48356;54400;0;60445
By default, nagios-exporter parses only the first value of performance data for every key. The default field name is 'value', and the key is always added as a metric label. So, for example, the default metric output for the above performance data would be:
nagios_check_disk_perf_data_value{key="/", ...} 2516582400.0
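As a rough illustration of the default behaviour, the sketch below splits plugin output at | and converts the first value of a single perf data item to a float. The function names and the UNITS table are hypothetical. The 1024-based multipliers are inferred from the example above, where 2400MB becomes 2516582400.0, and may not match every unit the exporter handles.

```python
import re

# Assumption: units are 1024-based, inferred from 2400MB -> 2516582400.0.
UNITS = {"": 1, "B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3,
         "TB": 1024 ** 4}

VALUE_RE = re.compile(r"^(-?[0-9.]+)([A-Za-z%]*)$")


def split_plugin_output(output):
    """Split plugin output into (text, perf_data) at the first '|'."""
    text, _, perf = output.partition("|")
    return text.strip(), perf.strip()


def parse_first_value(item):
    """Parse one 'key=value1[;value2]+' item; return (key, first value)."""
    key, _, raw = item.rpartition("=")
    first = raw.split(";")[0]
    match = VALUE_RE.match(first)
    if match is None:
        raise ValueError("unparseable value: {!r}".format(first))
    number, unit = match.groups()
    # Unknown units (for example '%') are left unscaled.
    return key.strip("'"), float(number) * UNITS.get(unit.upper(), 1)


output = ("DISK OK - free space: / 2400 MB (69% inode=83%);"
          "| /=2400MB;48356;54400;0;60445")
text, perf = split_plugin_output(output)
print(parse_first_value(perf))
```

Real perf data can hold several space-separated items; this sketch handles one item at a time.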
More specific names can be assigned to each value position of particular check plugins using the --data_names flag. For example:
--data_names="check_disks=used;free;;;total"
So, instead of parsing only the first value and using the default name, the metric output for the original example now includes three values, each named after the corresponding position in the raw perf data:
nagios_check_all_disks_perf_data_used{key="/", ...} 2516582400.0
nagios_check_all_disks_perf_data_free{key="/", ...} 50704941056.0
nagios_check_all_disks_perf_data_total{key="/", ...} 63381176320.0
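The mapping from value positions to names can be sketched as follows. The helpers parse_names and named_values are hypothetical, not part of the exporter. The sketch assumes that empty names skip a position and that the unit of the first value (MB, taken as 1024 * 1024 bytes) applies to every position, which is what the sample output above suggests.

```python
import re

MIB = 1024 ** 2  # assumption: "MB" is treated as 1024 * 1024 bytes


def parse_names(spec):
    """Parse 'command=name1;name2;...' into (command, [names])."""
    command, _, names = spec.partition("=")
    return command, names.split(";")


def named_values(item, names, scale=MIB):
    """Map each named position of one perf data item to a scaled float."""
    key, _, raw = item.rpartition("=")
    # Drop any trailing unit suffix such as 'MB' before converting.
    values = [re.sub(r"[A-Za-z%]+$", "", v) for v in raw.split(";")]
    result = {}
    for name, value in zip(names, values):
        if name and value:  # empty names mean "ignore this position"
            result[name] = float(value) * scale
    return key, result


command, names = parse_names("check_disks=used;free;;;total")
print(named_values("/=2400MB;48356;54400;0;60445", names))
```

Positions three and four have empty names, so only used, free and total are emitted for the key.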
./nagios_exporter.py --path /var/lib/nagios3/rw/livestatus \
--perf_data --perf_names="check_disks=used;free;;;total" \
--whitelist nagios_check_all_disks_perf_data