SnapRAID Metrics Collector

This script collects metrics from the SnapRAID sync, scrub, and smart operations and outputs them in a format compatible with the Prometheus Node Exporter textfile collector.

You can find this dashboard here

Prerequisites

SnapRAID installed and configured
Node Exporter with textfile collector enabled

Usage

To run the script, use the following command:

sudo ./snapraid_metrics_collector.sh [smart|scrub|sync]

You can specify one or more arguments to execute specific operations. For example:

sudo ./snapraid_metrics_collector.sh smart # to run the smart operation.
sudo ./snapraid_metrics_collector.sh scrub # to run the scrub operation.
sudo ./snapraid_metrics_collector.sh sync # to run the sync operation.
sudo ./snapraid_metrics_collector.sh smart sync # to run both smart and sync operations.

Integration with Prometheus Node Exporter

Place the script in a directory, e.g., /usr/local/bin.

Make it executable: chmod +x /usr/local/bin/snapraid_metrics_collector.sh.

Configure a cron job to run the script periodically and output to a textfile collector directory:

# Run snapraid sync every day at 1 AM
0 1 * * * /usr/local/bin/snapraid_metrics_collector.sh sync > /var/lib/node_exporter/textfile_collector/snapraid_sync.prom
# Run snapraid scrub once a week on Sunday at 3 AM
0 3 * * Sun /usr/local/bin/snapraid_metrics_collector.sh scrub > /var/lib/node_exporter/textfile_collector/snapraid_scrub.prom
# Run snapraid smart every day at 5 AM
0 5 * * * /usr/local/bin/snapraid_metrics_collector.sh smart > /var/lib/node_exporter/textfile_collector/snapraid_smart.prom

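Adjust the cron schedule according to your requirements. The usage examples above run the script with sudo, so these entries would normally go in root's crontab.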
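The cron entries above redirect straight into the .prom file, so Node Exporter could scrape a partially written file while a long SnapRAID run is still in progress. A minimal sketch of one way around this: write to a temporary file in the same directory, then rename it into place. The function name write_prom_atomically is hypothetical and not part of this project.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script): run a command and
# publish its stdout to a .prom file with a single rename.
# The body runs in a subshell "( ... )" so its variables stay local.
write_prom_atomically() (
    target="$1"   # final .prom path
    shift         # the remaining arguments are the command to run

    # Create the temp file next to the target so mv is a same-filesystem rename.
    # Its name does not end in .prom, so the textfile collector ignores it.
    tmp="$(mktemp "${target}.XXXXXX")" || exit 1

    status=0
    "$@" > "$tmp" || status=$?

    # mktemp creates the file with mode 0600; relax it on the assumption that
    # Node Exporter runs as a different user than the cron job.
    chmod 644 "$tmp"

    # Publish whatever was written, even on failure, so that an exit-status
    # metric emitted by the command is not lost.
    mv -f "$tmp" "$target" || { rm -f "$tmp"; exit 1; }
    exit "$status"
)
```

With this helper sourced into a small wrapper script, the sync job could call write_prom_atomically /var/lib/node_exporter/textfile_collector/snapraid_sync.prom /usr/local/bin/snapraid_metrics_collector.sh sync in place of the shell redirection.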

Configure Node Exporter to read metrics from this directory. This is usually done by passing the --collector.textfile.directory flag to Node Exporter with the path to the directory. Modify the Node Exporter service file accordingly.

For example, if you are using a systemd service to manage Node Exporter, edit the service file (typically located at /etc/systemd/system/node_exporter.service or /lib/systemd/system/node_exporter.service) and add the flag to the ExecStart line:

ExecStart=/usr/local/bin/node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile_collector

After modifying the service file, reload the systemd configuration and restart the Node Exporter service:

sudo systemctl daemon-reload
sudo systemctl restart node_exporter
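Once Node Exporter has restarted, it can help to confirm that the SnapRAID metrics are actually being served. The sketch below filters Node Exporter's output down to the relevant samples. It assumes Node Exporter is listening on its default port 9100 on localhost; the function name snapraid_textfile_samples is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read Prometheus exposition text on stdin and keep only
# the SnapRAID samples plus the textfile collector's own health metrics
# (node_textfile_mtime_seconds, node_textfile_scrape_error).
snapraid_textfile_samples() {
    grep -E '^(snapraid_|node_textfile_)'
}

# Usage (assumes the default listen address):
#   curl -s http://localhost:9100/metrics | snapraid_textfile_samples
```

If no snapraid_ lines appear, check that the .prom files exist in the configured directory and are readable by the Node Exporter user.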

Metrics

The script generates the following metrics:

Metric Name	Description
`snapraid_smart_exit_status`	Exit status of the last SnapRAID smart run.
`snapraid_smart_last_ran`	Timestamp of the last SnapRAID smart run.
`snapraid_smart_disk_temperature`	Disk temperature in degrees Celsius.
`snapraid_smart_disk_power_on_days`	Number of days the disk has been powered on.
`snapraid_smart_disk_error_count`	Number of errors reported by the disk.
`snapraid_smart_disk_fail_probability`	Fail probability for individual disks within the next year based on SMART values calculated by SnapRAID.
`snapraid_smart_total_fail_probability`	Fail probability for any disk failing within the next year based on SMART values calculated by SnapRAID.
-	-
`snapraid_scrub_exit_status`	Exit status of the last SnapRAID scrub run.
`snapraid_scrub_last_run`	Timestamp of the last SnapRAID scrub run.
`snapraid_scrub_scan_time_seconds`	Scan time for each item during SnapRAID scrub operation, in seconds.
`snapraid_scrub_file_errors`	Number of file errors found during SnapRAID scrub.
`snapraid_scrub_io_errors`	Number of I/O errors found during SnapRAID scrub.
`snapraid_scrub_data_errors`	Number of data errors found during SnapRAID scrub.
`snapraid_scrub_completion_percent`	Completion percentage of the SnapRAID scrub operation.
`snapraid_scrub_accessed_mb`	Amount of data accessed during the SnapRAID scrub operation, in MB.
-	-
`snapraid_sync_exit_status`	Exit status of the last SnapRAID sync run.
`snapraid_sync_last_run`	Timestamp of the last SnapRAID sync run.
`snapraid_sync_scan_time_seconds`	Scan time for each item during SnapRAID sync operation, in seconds.
`snapraid_sync_file_errors`	Number of file errors found during SnapRAID sync.
`snapraid_sync_io_errors`	Number of I/O errors found during SnapRAID sync.
`snapraid_sync_data_errors`	Number of data errors found during SnapRAID sync.
`snapraid_sync_completion_percent`	Completion percentage of the SnapRAID sync operation.
`snapraid_sync_accessed_mb`	Amount of data accessed during the SnapRAID sync operation, in MB.
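For orientation, the textfile collector expects each metric in the Prometheus text exposition format: optional HELP and TYPE comment lines followed by one sample line per label set. The sketch below shows how a gauge from the table could be written in that format. It is illustrative only and not taken from the script; the helper name emit_gauge, the label name disk, and the sample values are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: print one gauge in Prometheus text exposition format.
#   $1 = metric name, $2 = help text, $3 = value,
#   $4 = optional label set, e.g. 'disk="d1"' (the label name is an assumption)
emit_gauge() {
    printf '# HELP %s %s\n' "$1" "$2"
    printf '# TYPE %s gauge\n' "$1"
    if [ -n "${4:-}" ]; then
        printf '%s{%s} %s\n' "$1" "$4" "$3"
    else
        printf '%s %s\n' "$1" "$3"
    fi
}

# Example calls with made-up values:
emit_gauge snapraid_sync_exit_status "Exit status of the last SnapRAID sync run." 0
emit_gauge snapraid_smart_disk_temperature "Disk temperature in degrees Celsius." 34 'disk="d1"'
```

Per-disk metrics such as temperature or fail probability need a label to tell the disks apart, which is why the helper accepts an optional label set.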

Alerts

- name: Disk Alerts
  rules:
    - alert: Snapraid Disk Failure Probability
      expr: snapraid_smart_disk_fail_probability > 15
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: Snapraid Disk Failure on {{ $labels.instance }} - {{ $labels.job }}
        description: "Snapraid Disk Failure (current value: {{ $value }})"

    - alert: Snapraid Total Failure Probability
      expr: snapraid_smart_total_fail_probability > 40
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: Snapraid Total Failure on {{ $labels.instance }} - {{ $labels.job }}
        description: "Snapraid Total Failure (current value: {{ $value }})"

Logging

The script logs each SnapRAID command to a separate file (smart.log, scrub.log, or sync.log) in the same directory as the script.

ljmerza / snapraid-collector