SMART - Seek Error Rate

Question

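The counter resets the way an odometer rolls over once it runs out of digits. Device controllers use different thresholds, but a count of 0 does not mean the drive has had no errors, just as a vehicle with 1,000,010 km on the odometer is not "fresh off the line".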
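To make the odometer analogy concrete, here is a minimal sketch of how a logging script might keep a running total across counter resets. It assumes you sample the raw counter periodically and that any drop in the value means the counter rolled over; the function name `accumulate` is hypothetical, and the actual rollover point is vendor-specific and not known here.

```python
def accumulate(readings):
    """Return a running total for a counter that may roll over to a low value.

    Assumption: a reading lower than the previous one means the counter
    reset, so the new value is counted as fresh events since the reset.
    """
    total = 0
    previous = None
    for value in readings:
        if previous is None:
            total = value                # first sample: take it as-is
        elif value >= previous:
            total += value - previous    # normal case: add the increase
        else:
            total += value               # rollover: count events since reset
        previous = value
    return total
```

The result is a lower bound, since any events that occurred between the last sample before the rollover and the rollover itself are not visible to the script.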

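If you would like to build a chart like the one shown in Figure 2, you can write a small data-collection utility that reads SMART information from the storage device and records it in a database (or wherever you see fit). I generally use the smartmontools package to display storage device information.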

You can install it like this:

1. Open a terminal (if one is not already open)
2. Install the smartmontools package:
```
sudo apt install smartmontools
```

Query the storage device, for example an NVMe device:

```
sudo smartctl --all /dev/nvme0n1
```

This will give you a lot of information:

```
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.11.0-17-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZVLW512HMJP-000L7
Serial Number:                      S359NX0K103156
Firmware Version:                   7L7QCXY7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 512,110,190,592 [512 GB]
Unallocated NVM Capacity:           0
Controller ID:                      2
NVMe Version:                       1.2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Utilization:            81,254,830,080 [81.2 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 b181b5c4a3
Local Time is:                      Thu May 27 21:57:29 2021 JST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Warning  Comp. Temp. Threshold:     69 Celsius
Critical Comp. Temp. Threshold:     72 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.60W       -        -    0  0  0  0        0       0
 1 +     6.00W       -        -    1  1  1  1        0       0
 2 +     5.10W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1500
 4 -   0.0050W       -        -    4  4  4  4     2200    6000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        33 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    1%
Data Units Read:                    20,937,566 [10.7 TB]
Data Units Written:                 26,780,407 [13.7 TB]
Host Read Commands:                 359,002,242
Host Write Commands:                683,010,154
Controller Busy Time:               5,130
Power Cycles:                       1,027
Power On Hours:                     3,812
Unsafe Shutdowns:                   85
Media and Data Integrity Errors:    0
Error Information Log Entries:      719
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               33 Celsius
Temperature Sensor 2:               39 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0        719     0  0x0008  0x4004      -            0     0     -
  1        718     0  0x0008  0x4004      -            0     0     -
  2        717     0  0x0008  0x4004      -            0     0     -
  3        716     0  0x0008  0x4004      -            0     0     -
  4        715     0  0x0008  0x4004      -            0     0     -
  5        714     0  0x0008  0x4004      -            0     0     -
  6        713     0  0x0008  0x4004      -            0     0     -
  7        712     0  0x0008  0x4004      -            0     0     -
  8        711     0  0x0008  0x4004      -            0     0     -
  9        710     0  0x0008  0x4004      -            0     0     -
 10        709     0  0x0008  0x4004      -            0     0     -
 11        708     0  0x0008  0x4004      -            0     0     -
 12        707     0  0x0008  0x4004      -            0     0     -
 13        706     0  0x0008  0x4004      -            0     0     -
 14        705     0  0x0008  0x4004      -            0     0     -
 15        704     0  0x0008  0x4004      -            0     0     -
... (48 entries not read)
```
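For the data-collection utility mentioned earlier, one possible starting point is to parse the `Label: value` lines of this text output and store a few counters in SQLite. This is only a sketch under stated assumptions: the function names, the `smart_log` table, the `smart_log.db` file name, and the choice of fields are all hypothetical, and it relies on the output layout shown above, which may differ between smartctl versions and device types.

```python
import re
import sqlite3
import subprocess
import time

# Labels copied from the smartctl text output above; which ones to track
# is an arbitrary choice for this sketch.
FIELDS = (
    "Media and Data Integrity Errors",
    "Error Information Log Entries",
    "Unsafe Shutdowns",
    "Power On Hours",
)


def parse_health(text):
    """Return {label: int} for each tracked label found in the output."""
    values = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        label = label.strip()
        if sep and label in FIELDS:
            match = re.match(r"\s*([\d,]+)", rest)
            if match:
                # Values use thousands separators, e.g. "3,812".
                values[label] = int(match.group(1).replace(",", ""))
    return values


def record(conn, device, values, timestamp=None):
    """Append one row per field to a (hypothetical) smart_log table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS smart_log "
        "(ts REAL, device TEXT, field TEXT, value INTEGER)"
    )
    ts = time.time() if timestamp is None else timestamp
    conn.executemany(
        "INSERT INTO smart_log VALUES (?, ?, ?, ?)",
        [(ts, device, field, value) for field, value in values.items()],
    )
    conn.commit()


def collect(device="/dev/nvme0n1", db_path="smart_log.db"):
    """Run smartctl (needs root) and log the parsed counters."""
    # smartctl's exit status is a bit mask and can be non-zero even when
    # output was produced, so the return code is not treated as fatal here.
    result = subprocess.run(
        ["smartctl", "--all", device], capture_output=True, text=True
    )
    conn = sqlite3.connect(db_path)
    try:
        record(conn, device, parse_health(result.stdout))
    finally:
        conn.close()
```

Running `collect()` on a schedule (for example from root's crontab) would build up the time series needed for a chart; the rows can then be read back with an ordinary `SELECT` ordered by `ts`.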

That may be more information than you need, so you can request just the error log like this:

```
sudo smartctl -l error /dev/nvme0n1
```

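This command returns the same output as the "Error Information" section of the previous command. Note that, by default, at most 16 entries are returned for an NVMe device. If the NVMe device you are querying has more entries, you can specify the number of entries to return, like so: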

```
sudo smartctl -l error,64 /dev/nvme0n1
```

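On my device the error log holds 64 entries in total (the output above reports "16 of 64 entries"), so I add `,64` to the command as shown. You can display up to 256 entries.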
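If you are scripting this, the entry count can be taken from the log header instead of being typed by hand. The sketch below assumes the header has the form shown in the output above ("16 of 64 entries") and applies the 256-entry limit mentioned here; the function name `error_log_args` is hypothetical.

```python
import re

MAX_ENTRIES = 256  # upper limit as stated in the text above

# Header format assumed from the sample output in this answer.
HEADER = re.compile(
    r"Error Information \(NVMe Log 0x01, (\d+) of (\d+) entries\)"
)


def error_log_args(smartctl_output, device):
    """Build a smartctl argument list that requests every log entry."""
    match = HEADER.search(smartctl_output)
    if match is None:
        # Header not found: fall back to the default entry count.
        return ["smartctl", "-l", "error", device]
    total = min(int(match.group(2)), MAX_ENTRIES)
    return ["smartctl", "-l", f"error,{total}", device]
```

The returned list can be passed to `subprocess.run` (as root) in the same way as any other smartctl invocation.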

Hopefully this gives you plenty of information to work with and track.

Answer 1