matusnovak / prometheus-smartctl

HDD S.M.A.R.T exporter for Prometheus written in Python

Do not raise exception in case of non-zero exit code of smartctl

enrico2828 opened this issue

I tried out this project on our servers and noticed that a disk failure is not handled correctly. The smartctl tool detects the failure and exits with code 8. On line 24 of the Python script, this non-zero exit code is turned into an exception and the script stops. As a result, we get no Prometheus metrics at all for the defective disk.
If I delete the statement that raises the exception, the metrics show up in Prometheus correctly and I am able to detect the disk failure.
I checked a similar project, https://github.com/PhilipMay/smart-prom-next/blob/main/smart_prom_next/smart_prom_next.py, and there a non-zero exit code is handled by showing a warning instead of raising an exception.
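
A minimal sketch of the behaviour proposed above: warn on a non-zero exit code, but still parse whatever JSON smartctl printed. The function names `parse_smartctl_output` and `run_smartctl` are hypothetical and are not taken from either project; the sketch assumes smartctl is invoked with `--json` so that stdout is a JSON document.

```python
import json
import logging
import subprocess

logger = logging.getLogger("smartctl_sketch")  # hypothetical logger name


def parse_smartctl_output(returncode, stdout, stderr=""):
    """Return the parsed JSON report, warning (not raising) on a non-zero exit code."""
    if returncode != 0:
        # A failing disk still produces a usable report, so only warn here.
        logger.warning("smartctl returned code %d", returncode)
    try:
        return json.loads(stdout)
    except json.JSONDecodeError as exc:
        # Without parsable output there is nothing to export, so this is a real error.
        raise RuntimeError(
            f"smartctl returned code {returncode} and no parsable JSON. Stderr: {stderr!r}"
        ) from exc


def run_smartctl(args):
    """Run smartctl with the given arguments (hypothetical wrapper)."""
    proc = subprocess.run(
        ["smartctl", *args],
        capture_output=True,
        text=True,
        check=False,  # do not let subprocess raise on a non-zero exit code
    )
    return parse_smartctl_output(proc.returncode, proc.stdout, proc.stderr)
```

With this shape, a call such as `run_smartctl(["-A", "-H", "-d", "scsi", "--json=c", "/dev/sdaq"])` would return the report even when the exit code is 8, and would raise only if stdout could not be parsed.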

Exception: Command returned code 8. Stdout: '{"json_format_version":[1,0],"smartctl":{"version":[7,3],"svn_revision":"5338","platform_info":"x86_64-linux-4.18.0-305.49.1.el8_4.x86_64","build_info":"(local build)","argv":["smartctl","-A","-H","-d","scsi","--json=c","/dev/sdaq"],"exit_status":8},"local_time":{"time_t":1662530983,"asctime":"Wed Sep  7 06:09:43 2022 UTC"},"device":{"name":"/dev/sdaq","info_name":"/dev/sdaq","type":"scsi","protocol":"SCSI"},"smart_status":{"passed":false,"scsi":{"asc":93,"ascq":50,"ie_string":"DATA CHANNEL IMPENDING FAILURE DATA ERROR RATE TOO HIGH"}},"temperature":{"current":38,"drive_trip":60},"power_on_time":{"hours":8585,"minutes":43},"scsi_start_stop_cycle_counter":{"year_of_manufacture":"2021","week_of_manufacture":"25","specified_cycle_count_over_device_lifetime":50000,"accumulated_start_stop_cycles":54,"specified_load_unload_count_over_device_lifetime":600000,"accumulated_load_unload_cycles":402},"scsi_grown_defect_list":29083}' Stderr: ''
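
For context on the code 8 in the message above: the smartctl man page documents the exit status as a bit mask, and bit 3 (value 8) means the SMART status check returned "DISK FAILING". The sketch below decodes such a status; the helper name `decode_exit_status` is hypothetical and the descriptions are paraphrased from the man page.

```python
# Bit number -> meaning, paraphrased from the "EXIT STATUS" section of the smartctl man page.
EXIT_STATUS_BITS = {
    0: "Command line did not parse",
    1: "Device open failed or device did not return identify data",
    2: "Some SMART or other command to the disk failed, or a checksum error occurred",
    3: "SMART status check returned DISK FAILING",
    4: "Prefail attributes found at or below threshold",
    5: "SMART status OK, but attributes were at or below threshold in the past",
    6: "Device error log contains records of errors",
    7: "Device self-test log contains records of errors",
}


def decode_exit_status(status):
    """Return the list of meanings for every bit set in a smartctl exit status."""
    return [
        meaning
        for bit, meaning in EXIT_STATUS_BITS.items()
        if status & (1 << bit)
    ]
```

Decoding the status this way makes it possible to tell a failing disk (bit 3) apart from an invocation problem (bits 0 and 1), which an exporter might reasonably treat differently.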

{
	"json_format_version": [1, 0],
	"smartctl": {
		"version": [7, 3],
		"svn_revision": "5338",
		"platform_info": "x86_64-linux-4.18.0-305.49.1.el8_4.x86_64",
		"build_info": "(local build)",
		"argv": ["smartctl", "-A", "-H", "-d", "scsi", "--json=c", "/dev/sdaq"],
		"exit_status": 8
	},
	"local_time": {
		"time_t": 1662539260,
		"asctime": "Wed Sep  7 08:27:40 2022 UTC"
	},
	"device": {
		"name": "/dev/sdaq",
		"info_name": "/dev/sdaq",
		"type": "scsi",
		"protocol": "SCSI"
	},
	"smart_status": {
		"passed": false,
		"scsi": {
			"asc": 93,
			"ascq": 50,
			"ie_string": "DATA CHANNEL IMPENDING FAILURE DATA ERROR RATE TOO HIGH"
		}
	},
	"temperature": {
		"current": 38,
		"drive_trip": 60
	},
	"power_on_time": {
		"hours": 8588,
		"minutes": 1
	},
	"scsi_start_stop_cycle_counter": {
		"year_of_manufacture": "2021",
		"week_of_manufacture": "25",
		"specified_cycle_count_over_device_lifetime": 50000,
		"accumulated_start_stop_cycles": 54,
		"specified_load_unload_count_over_device_lifetime": 600000,
		"accumulated_load_unload_cycles": 403
	},
	"scsi_grown_defect_list": 29083
}
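
As an illustration of why this report is still worth exporting, the sketch below pulls a few values out of a report shaped like the one above. The function `extract_metrics` and the metric names are hypothetical and do not reflect the names the exporter actually uses; only the JSON keys come from the output shown in this issue.

```python
def extract_metrics(report):
    """Map a smartctl JSON report (as a dict) to a flat dict of numeric values."""
    metrics = {}

    # Overall health: 1.0 if the SMART status check passed, 0.0 if it failed.
    status = report.get("smart_status", {})
    if "passed" in status:
        metrics["smart_status_passed"] = 1.0 if status["passed"] else 0.0

    # Exit status as reported by smartctl itself inside the JSON.
    exit_status = report.get("smartctl", {}).get("exit_status")
    if exit_status is not None:
        metrics["exit_status"] = float(exit_status)

    temperature = report.get("temperature", {}).get("current")
    if temperature is not None:
        metrics["temperature_current"] = float(temperature)

    hours = report.get("power_on_time", {}).get("hours")
    if hours is not None:
        metrics["power_on_hours"] = float(hours)

    defects = report.get("scsi_grown_defect_list")
    if defects is not None:
        metrics["scsi_grown_defect_list"] = float(defects)

    return metrics
```

For the report above, the health value would be 0.0 because `"passed"` is `false`, which is exactly the signal that is lost when the script stops on the exception.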

Do we need better error handling? The change from #43 will spam this long stdout every time a command executes. Or maybe it should only be printed in debug mode.
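
One possible shape for the debug-mode idea, as a sketch only: log a short warning at the normal level and keep the long stdout for the DEBUG level. The helper name `log_nonzero_exit` is hypothetical, and this is not necessarily how the fix was implemented.

```python
import logging


def log_nonzero_exit(logger, returncode, stdout, stderr):
    """Log a short warning, and the full command output only at DEBUG level."""
    # Always visible at the default WARNING level: one short line per command.
    logger.warning("Command returned code %d", returncode)
    # Emitted only when the logger is configured for DEBUG.
    logger.debug("Stdout: %r Stderr: %r", stdout, stderr)
```

With the logger left at its usual level, each failing command produces a single short line; switching the logger to `logging.DEBUG` brings back the full stdout when it is needed for troubleshooting.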

Fixed in 4807aea
v2.1.1