Do not raise exception in case of non-zero exit code of smartctl

Question

enrico2828 opened this issue 2 years ago

I tried out this project on our servers and noticed that a disk failure is not handled correctly. When smartctl detects a failing disk, it exits with code 8. Line 24 of the Python script treats this non-zero exit code as an exception, so the script stops and we get no Prometheus metrics at all for the defective disk.
If I delete the statement that raises the exception, the metrics arrive in Prometheus correctly and I am able to detect the disk failure.
I checked a similar project, https://github.com/PhilipMay/smart-prom-next/blob/main/smart_prom_next/smart_prom_next.py, and there a non-zero exit code is handled by showing a warning instead of raising an exception.

Exception: Command returned code 8. Stdout: '{"json_format_version":[1,0],"smartctl":{"version":[7,3],"svn_revision":"5338","platform_info":"x86_64-linux-4.18.0-305.49.1.el8_4.x86_64","build_info":"(local build)","argv":["smartctl","-A","-H","-d","scsi","--json=c","/dev/sdaq"],"exit_status":8},"local_time":{"time_t":1662530983,"asctime":"Wed Sep  7 06:09:43 2022 UTC"},"device":{"name":"/dev/sdaq","info_name":"/dev/sdaq","type":"scsi","protocol":"SCSI"},"smart_status":{"passed":false,"scsi":{"asc":93,"ascq":50,"ie_string":"DATA CHANNEL IMPENDING FAILURE DATA ERROR RATE TOO HIGH"}},"temperature":{"current":38,"drive_trip":60},"power_on_time":{"hours":8585,"minutes":43},"scsi_start_stop_cycle_counter":{"year_of_manufacture":"2021","week_of_manufacture":"25","specified_cycle_count_over_device_lifetime":50000,"accumulated_start_stop_cycles":54,"specified_load_unload_count_over_device_lifetime":600000,"accumulated_load_unload_cycles":402},"scsi_grown_defect_list":29083}' Stderr: ''

{
	"json_format_version": [1, 0],
	"smartctl": {
		"version": [7, 3],
		"svn_revision": "5338",
		"platform_info": "x86_64-linux-4.18.0-305.49.1.el8_4.x86_64",
		"build_info": "(local build)",
		"argv": ["smartctl", "-A", "-H", "-d", "scsi", "--json=c", "/dev/sdaq"],
		"exit_status": 8
	},
	"local_time": {
		"time_t": 1662539260,
		"asctime": "Wed Sep  7 08:27:40 2022 UTC"
	},
	"device": {
		"name": "/dev/sdaq",
		"info_name": "/dev/sdaq",
		"type": "scsi",
		"protocol": "SCSI"
	},
	"smart_status": {
		"passed": false,
		"scsi": {
			"asc": 93,
			"ascq": 50,
			"ie_string": "DATA CHANNEL IMPENDING FAILURE DATA ERROR RATE TOO HIGH"
		}
	},
	"temperature": {
		"current": 38,
		"drive_trip": 60
	},
	"power_on_time": {
		"hours": 8588,
		"minutes": 1
	},
	"scsi_start_stop_cycle_counter": {
		"year_of_manufacture": "2021",
		"week_of_manufacture": "25",
		"specified_cycle_count_over_device_lifetime": 50000,
		"accumulated_start_stop_cycles": 54,
		"specified_load_unload_count_over_device_lifetime": 600000,
		"accumulated_load_unload_cycles": 403
	},
	"scsi_grown_defect_list": 29083
}
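
For illustration, here is a minimal sketch of the behaviour requested above. It is not the exporter's actual code, and the names `parse_smartctl_output`, `run_smartctl`, and `FATAL_BITS` are hypothetical. It relies on the smartctl man page, which documents the exit status as a bitmask: bit 0 means the command line did not parse, bit 1 means the device could not be opened, and bit 3 (value 8, as in the output above) means the SMART status check returned "DISK FAILING". The sketch assumes that only bits 0 and 1 should abort, and that every other non-zero code should produce a warning while the JSON is still parsed.

```python
import json
import logging
import subprocess

logger = logging.getLogger("smartctl_sketch")

# Exit-status bits per the smartctl man page.
# Assumption of this sketch: only these two bits mean "no usable output".
FATAL_BITS = 0b00000011  # bit 0: bad command line, bit 1: device open failed


def parse_smartctl_output(returncode, stdout):
    """Return the parsed JSON unless the exit code says smartctl could not run."""
    if returncode & FATAL_BITS:
        raise RuntimeError(
            f"smartctl could not query the device (exit code {returncode})"
        )
    if returncode != 0:
        # Non-fatal bits (for example 8, "DISK FAILING") still come with valid JSON.
        logger.warning("smartctl exited with code %d; keeping its output", returncode)
    return json.loads(stdout)


def run_smartctl(args):
    """Hypothetical wrapper: run smartctl and pass the result to the parser."""
    proc = subprocess.run(["smartctl", *args], capture_output=True, text=True)
    return parse_smartctl_output(proc.returncode, proc.stdout)
```

With this approach, a call such as `run_smartctl(["-A", "-H", "-d", "scsi", "--json=c", "/dev/sdaq"])` would return the dictionary with `"passed": false` for the failing disk, so the metric can still be exported.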

Matthias · Answer 1 · Tue Sep 13 2022 01:09:56 GMT+0800 (China Standard Time)

Do we need better error handling? The change from #43 will spam this long stdout every time some command executes. Or maybe we could print it only in debug mode.
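
One way to read the debug-mode idea, sketched with Python's standard `logging` module: emit a single short warning for the non-zero exit code and keep the full stdout and stderr at DEBUG level, so they only appear when debugging is enabled. The function name `report_nonzero_exit` and the logger name are hypothetical and not taken from the project.

```python
import logging

logger = logging.getLogger("smartctl_sketch")


def report_nonzero_exit(command, returncode, stdout, stderr):
    """Log one short warning; the full output is logged only at DEBUG level."""
    logger.warning("%s exited with code %d", " ".join(command), returncode)
    logger.debug("stdout: %s", stdout)
    logger.debug("stderr: %s", stderr)
```

At the default WARNING level this prints one line per failing command. Setting the logger to `logging.DEBUG` restores the full output when it is needed.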

Diego Heras · Answer 2 · Sun Sep 18 2022 05:16:46 GMT+0800 (China Standard Time)

Fixed in 4807aea
v2.1.1