truenas / py-SMART

Wrapper for smartctl (smartmontools)

Some data parsing seems wrong?

QiE2035 opened this issue · comments

Here is the relevant py-SMART code in device.update():

            if 'used endurance' in line:
                pct = int(line.split(':')[1].strip()[:-1])
                self.diagnostics.Life_Left = 100 - pct
            if 'Specified cycle count' in line:
                self.diagnostics.Start_Stop_Spec = int(
                    line.split(':')[1].strip())
            if 'Accumulated start-stop cycles' in line:
                self.diagnostics.Start_Stop_Cycles = int(
                    line.split(':')[1].strip())
                if self.diagnostics.Start_Stop_Spec != 0:
                    self.diagnostics.Start_Stop_Pct_Left = int(round(
                        100 - (self.diagnostics.Start_Stop_Cycles /
                               self.diagnostics.Start_Stop_Spec), 0))
            if 'Specified load-unload count' in line:
                self.diagnostics.Load_Cycle_Spec = int(
                    line.split(':')[1].strip())
            if 'Accumulated load-unload cycles' in line:
                self.diagnostics.Load_Cycle_Count = int(
                    line.split(':')[1].strip())
                if self.diagnostics.Load_Cycle_Spec != 0:
                    self.diagnostics.Load_Cycle_Pct_Left = int(round(
                        100 - (self.diagnostics.Load_Cycle_Count /
                               self.diagnostics.Load_Cycle_Spec), 0))
            if 'Elements in grown defect list' in line:
                self.diagnostics.Reallocated_Sector_Ct = int(
                    line.split(':')[1].strip())
            if 'read:' in line:
                line_ = ' '.join(line.split()).split(' ')
                if line_[1] == '0' and line_[2] == '0' and line_[3] == '0' and line_[4] == '0':
                    self.diagnostics.Corrected_Reads = 0
                elif line_[4] == '0':
                    self.diagnostics.Corrected_Reads = int(
                        line_[1]) + int(line_[2]) + int(line_[3])
                else:
                    self.diagnostics.Corrected_Reads = int(line_[4])
                self.diagnostics.Reads_GB = float(line_[6])
                self.diagnostics.Uncorrected_Reads = int(line_[7])
            if 'write:' in line:
                line_ = ' '.join(line.split()).split(' ')
                if (line_[1] == '0' and line_[2] == '0' and
                        line_[3] == '0' and line_[4] == '0'):
                    self.diagnostics.Corrected_Writes = 0
                elif line_[4] == '0':
                    self.diagnostics.Corrected_Writes = int(
                        line_[1]) + int(line_[2]) + int(line_[3])
                else:
                    self.diagnostics.Corrected_Writes = int(line_[4])
                self.diagnostics.Writes_GB = float(line_[6])
                self.diagnostics.Uncorrected_Writes = int(line_[7])
            if 'verify:' in line:
                line_ = ' '.join(line.split()).split(' ')
                if (line_[1] == '0' and line_[2] == '0' and
                        line_[3] == '0' and line_[4] == '0'):
                    self.diagnostics.Corrected_Verifies = 0
                elif line_[4] == '0':
                    self.diagnostics.Corrected_Verifies = int(
                        line_[1]) + int(line_[2]) + int(line_[3])
                else:
                    self.diagnostics.Corrected_Verifies = int(line_[4])
                self.diagnostics.Verifies_GB = float(line_[6])
                self.diagnostics.Uncorrected_Verifies = int(line_[7])
            if 'non-medium error count' in line:
                self.diagnostics.Non_Medium_Errors = int(
                    line.split(':')[1].strip())
            if 'Accumulated power on time' in line:
                self.diagnostics.Power_On_Hours = int(
                    line.split(':')[1].split(' ')[1])

And here is my smartctl output:

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.18.14-arch1-1] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       WDC PC SN730 SDBPNTY-512G-1101
Serial Number:                      211515805454
Firmware Version:                   11190001
PCI Vendor/Subsystem ID:            0x15b7
IEEE OUI Identifier:                0x001b44
Total NVM Capacity:                 512,110,190,592 [512 GB]
Unallocated NVM Capacity:           0
Controller ID:                      8215
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            001b44 8b48720fd3
Local Time is:                      Wed Jul 27 11:39:39 2022 CST
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x1e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     84 Celsius
Critical Comp. Temp. Threshold:     88 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     5.50W       -        -    0  0  0  0        0       0
 1 +     3.50W       -        -    1  1  1  1        0       0
 2 +     3.00W       -        -    2  2  2  2        0       0
 3 -   0.0700W       -        -    3  3  3  3     4000   10000
 4 -   0.0035W       -        -    4  4  4  4     4000   40000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        46 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    1%
Data Units Read:                    63,787,598 [32.6 TB]
Data Units Written:                 14,028,396 [7.18 TB]
Host Read Commands:                 869,234,784
Host Write Commands:                179,000,138
Controller Busy Time:               1,670
Power Cycles:                       1,059
Power On Hours:                     1,562
Unsafe Shutdowns:                   106
Media and Data Integrity Errors:    0
Error Information Log Entries:      1
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

For example, smartctl reports the value behind Life_Left as "Percentage Used", but py-SMART looks for "used endurance", so the two don't seem to match?

Diagnostic information does not have a stable format across device types in smartctl... It's a real pain, I must say.

It depends on whether the device is an SSD over SATA/SAS, sits behind a RAID card, or is an NVMe drive. Each interface produces very different output.

I don't have an "endurance" entry in my test dataset, but I think it is worth handling "Percentage Used" as well, with Life_Left = 100 - Percentage Used.
I have to check the docs and verify that the value is what it seems to be.
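
As a rough illustration of the mapping proposed above, here is a minimal sketch that reads the NVMe "Percentage Used" line from smartctl's text output and derives a Life_Left-style value. The function names (parse_nvme_health, life_left) are hypothetical and are not part of py-SMART; the sample lines are taken from the output posted earlier in this issue. Since NVMe allows "Percentage Used" to exceed 100, the result is clamped at 0.

```python
def parse_nvme_health(text):
    """Collect 'Key: value' pairs from smartctl's text output (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(':')
        if sep:  # skip lines without a colon, such as section headers
            fields[key.strip()] = value.strip()
    return fields


def life_left(fields):
    """Return 100 - 'Percentage Used', clamped at 0, or None if the field is absent."""
    raw = fields.get('Percentage Used')
    if raw is None:
        return None
    used = int(raw.rstrip('%'))
    return max(0, 100 - used)


# Sample lines copied from the smartctl output in this issue.
SAMPLE = """\
=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Available Spare:                    100%
Percentage Used:                    1%
Power On Hours:                     1,562
"""

fields = parse_nvme_health(SAMPLE)
print(life_left(fields))
# smartctl prints thousands separators, so strip them before converting.
print(int(fields['Power On Hours'].replace(',', '')))
```

Splitting on the first colon keeps values such as timestamps intact, which the existing line.split(':')[1] approach would truncate.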

Also, I'm going to check whether we can add more values from the NVMe log to the diagnostics structure but, in any case, it may be a better idea to plan a refactor.

If you are interested in a subset of those parameters, please tell me so I can prioritize them.
Thank you.

I've added a new field to device called if_attributes.
It stores interface-specific data.
In the future, this will be just a handler, and the device will expose metrics almost transparently with respect to the interface (or at least that's the idea).

Please confirm whether that field fixes your issue and gives you the info you need.

To install with pip, try: pip install git+https://github.com/truenas/py-SMART@develop

What about using the JSON format with the -j flag? It's very simple to parse with pydantic.

My implementation is here, with JSON format and asyncio support. I need more test data because I don't have any SAS or SCSI drives, so those TODOs haven't been implemented yet. To expose the full data of the drive, I added a new attribute, raw_data, which stores the original dict from the process.
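
To make the JSON idea concrete, here is a minimal sketch using only the standard library rather than pydantic. It is not the implementation linked above. The key names (nvme_smart_health_information_log, percentage_used, power_on_hours) are what I understand smartctl 7.x emits for NVMe devices with -j, but treat them as assumptions and check them against your own output. The sample document is hand-written from the values posted in this issue, not captured from a real run, and summarize is a hypothetical name.

```python
import json

# Hand-written sample; key names are assumed to match `smartctl -j -a` for NVMe.
SAMPLE_JSON = """
{
  "device": {"name": "/dev/nvme0", "type": "nvme"},
  "nvme_smart_health_information_log": {
    "percentage_used": 1,
    "power_on_hours": 1562,
    "temperature": 46
  }
}
"""


def summarize(raw_json):
    """Extract a few NVMe health values and keep the original dict as raw_data."""
    data = json.loads(raw_json)
    log = data.get("nvme_smart_health_information_log")
    if log is None:
        # Not an NVMe device, or the log section is missing from the JSON output.
        return {"raw_data": data}
    used = log.get("percentage_used")
    return {
        "life_left": None if used is None else max(0, 100 - used),
        "power_on_hours": log.get("power_on_hours"),
        "raw_data": data,
    }


summary = summarize(SAMPLE_JSON)
print(summary["life_left"], summary["power_on_hours"])
```

Using .get() throughout means a missing section yields None instead of raising, which matters if some smartctl versions omit parts of the JSON output.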

Here is my own test data, from my HDD and SSD.
Do you have more? It would be very helpful for testing.

Thanks for your contribution, @synodriver.

However...
On the one hand, note that this is a very old project; that's the reason for not using JSON in the first place.
On the other hand, the JSON output (as far as I could test) varies as much as the text output. It depends on whether the device is ATA, SCSI/SAS or NVMe, on the version of the interface itself, on the type of disk (SSD vs. HDD), and so on.

Of course it would be easier to migrate everything to JSON parsing and update the output API of pysmart (which is... not very friendly and very dependent on the underlying protocol), BUT we would need tons of tests to release something along those lines, and I don't have every single test case. (In recent months there were many issues related to corner cases, most of them NOT related to text parsing but to regexes and post-processing!)

If you really want to help, we can create a branch and begin working there together. I would upload as many tests as I can and we would do our best. Nonetheless, consider that this may be a long-term update, as we would have to wait until we have enough test data to release it confidently.

JSON output does not always contain everything that the text output gives, especially in older smartctl versions. I work with v7.2 and sometimes see parts of the logs missing from the JSON output. In older versions I've seen it more frequently. Text output is more reliable.

The regexes always cause some weird problems on my computer, so I decided to use the JSON format instead. As for the output data... to be honest, I haven't noticed that.

@synodriver, in that case, please share those issues so we can fix them.

Thank you

I've added your tests to the test folder and haven't found any issues. Please describe the problem you were having with the regexes.