linux-nvme / nvme-cli

NVMe management command line interface.

Home Page: https://nvmexpress.org

Prometheus exporter for SMART and OCP C0 Log Page

jmhands opened this issue · comments

Is there any roadmap for native integration of a Prometheus exporter? I saw some changes coming to the JSON output, and it would be good to align any exporters on a specific format. My suggestion would be to track drives by "sn", with info on "mn" and "fw" from sudo nvme id-ctrl /dev/nvme1n1 -o json, and then have a standard option for exporting statistics to Prometheus from sudo nvme smart-log /dev/nvme1n1 -o json and the OCP log page C0.

A small issue is that the C0 log page isn't available in the older releases, but it should be the most helpful log, along with the normal smart-log, for calculating WAF for workloads using "Physical media units written". Other data in the OCP log will be useful for tracking fleet health across many NVMe SSDs.
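The WAF calculation mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: that "Physical media units written" from the C0 log page is reported in bytes, that the smart-log "data units written" value counts thousands of 512-byte units, and that the dictionary keys `physical_bytes` and `data_units_written` are hypothetical names for values already parsed from the two JSON outputs. Both unit conventions should be checked against the OCP and NVMe specifications before relying on the result.

```python
from typing import Optional

# Assumption: one SMART "data unit" is 1000 units of 512 bytes.
BYTES_PER_DATA_UNIT = 1000 * 512


def host_bytes_written(data_units_written: int) -> int:
    """Convert SMART data units written into bytes (assumed unit size)."""
    return data_units_written * BYTES_PER_DATA_UNIT


def waf(prev: dict, curr: dict) -> Optional[float]:
    """Write amplification between two samples of the same drive.

    Each sample is a hypothetical dict with two keys:
      "physical_bytes"      bytes written to media (OCP C0, assumed bytes)
      "data_units_written"  host writes in SMART data units
    Returns None when no host writes happened between the samples.
    """
    host_delta = host_bytes_written(
        curr["data_units_written"] - prev["data_units_written"]
    )
    media_delta = curr["physical_bytes"] - prev["physical_bytes"]
    if host_delta <= 0:
        return None
    return media_delta / host_delta
```

Using the difference between two samples rather than the lifetime totals gives the WAF of the workload that ran in between, which is what a graph over time needs.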

It works with the latest AppImage provided:
sudo ./nvme-cli-latest-x86_64.AppImage ocp smart-add-log /dev/nvme0n1 -o json

smart-log uses log page 0x02 as defined in the NVMe base specification. ocp smart-add-log reports the content of the vendor-specific log page 0xC0.

If I understand you correctly, you would like to have a command which fuses the output of both log pages. I don't think we should mingle the existing commands.

Though first we need to figure out whether we should solve this at the level of nvme-cli (I suppose this is your question about a roadmap).

Anyway, couldn't this be something on top of nvme-cli which does the right thing? Technically, we could solve this as yet another plugin.

Thoughts?

@arthurshau @keithbusch @hreinecke

I wrote a sample exporter with ChatGPT (not very good, but it works):
https://github.com/jmhands/nvme_exporter/tree/main
There are some third-party attempts, but with one coming from nvme-cli directly I could make some standard Grafana dashboards. The use case I have in mind is graphing WAF over time for various workloads and monitoring the health of an SSD fleet.

Thanks for the Python code, it helps to understand what you really need as input for the integration.

I think we should first define a schema. Could you append a complete JSON-formatted one here?

Sure. For the POC I made pretty much everything that wasn't static a gauge, but we may want some of the SMART log items to be counters, such as error counts or things that reset to zero after a reboot. I think it would be fairly straightforward: use exactly the same names that nvme-cli and NVM Express use for the NVMe SMART log. For the OCP log page, you can see the spec here https://www.opencompute.org/documents/datacenter-nvme-ssd-specification-v2-5-pdf section 4.8.6 for the C0 log page. Most SSD vendors now support this on data center SSDs, and the nvme-cli ocp plug-in already parses it just fine. For SSDs that don't support it, just export the smart-log. Some of the other exporters tried getting SSD info from nvme list, but it is better to get the model, serial, and firmware from nvme id-ctrl, or you could get them from libnvme directly with the identify command.
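To make the gauge versus counter idea concrete, here is a minimal sketch that turns already-parsed id-ctrl and smart-log JSON into Prometheus text exposition lines. Everything specific in it is an assumption for illustration: the `nvme_smart` prefix, the `COUNTER_FIELDS` selection, the `_total` suffix, and the JSON key names (including reading the firmware revision from either "fr" or "fw"). It is not a proposed schema.

```python
# Hypothetical choice of fields to expose as counters; the rest become gauges.
COUNTER_FIELDS = {
    "data_units_read",
    "data_units_written",
    "power_cycles",
    "unsafe_shutdowns",
    "media_errors",
}


def escape_label(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(id_ctrl: dict, smart_log: dict, prefix: str = "nvme_smart") -> str:
    """Render one drive's metrics, keyed by serial number."""
    sn = escape_label(str(id_ctrl["sn"]).strip())
    mn = escape_label(str(id_ctrl.get("mn", "")).strip())
    # Assumption: the firmware revision key may be "fr" or "fw".
    fw = escape_label(str(id_ctrl.get("fr", id_ctrl.get("fw", ""))).strip())

    lines = [
        f"# TYPE {prefix}_info gauge",
        f'{prefix}_info{{sn="{sn}",mn="{mn}",fw="{fw}"}} 1',
    ]
    for key in sorted(smart_log):
        value = smart_log[key]
        # Skip anything that is not a plain number (strings, nested objects).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        is_counter = key in COUNTER_FIELDS
        name = f"{prefix}_{key}" + ("_total" if is_counter else "")
        lines.append(f"# TYPE {name} {'counter' if is_counter else 'gauge'}")
        lines.append(f'{name}{{sn="{sn}"}} {value}')
    return "\n".join(lines) + "\n"
```

Keeping the model and firmware on a separate info metric, and only the serial number on the value metrics, means a firmware update does not start a new time series for every statistic.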

There are several GitHub projects out there based on parsing the nvme-cli command output, and I have been using the Node textfile exporter for some time. I have recently updated it with parallel execution and included a Grafana dashboard. This has its own limitations.

I think that to avoid yet another exporter that is aligned neither with the NVMe specifications nor with Prometheus:

  • NVMe CLI has a golang library for reporting the smart information
  • The NVMe Prometheus exporter is developed under the Node exporter project, for example added as nvme-smart collector.

To calculate things like WAF, which do not directly have an opcode, it could be possible to use Prometheus recording rules.

Node exporter is great and that would be a good place to add it. It would be nice to have the flexibility to enable more log pages (like the OCP one I mentioned) for predictive failure and health monitoring in large deployments, but even the base NVMe smart-log in node exporter would be awesome. I found that the smart exporter for smartmontools doesn't handle SAS properly, for example, which is why I want to do this right from nvme-cli instead of smartmontools.

The NVMe Prometheus exporter is developed under the Node exporter project, for example added as nvme-smart collector.

Just to avoid any confusion, the existing nvme collector in node_exporter exposes information gleaned from sysfs, i.e., that which is exposed by the kernel.

It is against node_exporter policy to call external binaries. The textfile collector mechanism is exempt from this, since the apps which generate the textfile metrics are not called by node_exporter itself.
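For anyone going the textfile route, the part that is easy to get wrong is writing the file so that node_exporter never reads a half-written version. A minimal sketch, assuming the collector only picks up files ending in .prom and that the directory and file name are whatever the deployment uses (`write_textfile` is a hypothetical helper name): write to a temporary file in the same directory, then rename it into place.

```python
import os
import tempfile


def write_textfile(directory: str, name: str, content: str) -> str:
    """Atomically publish metrics text as <directory>/<name>.

    The temporary file lives in the same directory so that the final
    rename stays on one filesystem, and it ends in ".tmp" so that it
    does not match a "*.prom" pattern while it is being written.
    """
    final_path = os.path.join(directory, name)
    fd, tmp_path = tempfile.mkstemp(
        dir=directory, prefix="." + name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(content)
        # mkstemp creates the file as 0600; relax it so another user can read it.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path
```

A script run from cron or a systemd timer could call nvme-cli, build the metrics text, and hand it to a helper like this, which keeps node_exporter itself from ever invoking an external binary.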

I suspected as much. The exporter should probably have some sort of alignment with the nvme-cli release. I wonder whether it should be an exporter or a script for the textfile extension.