NVMe drive failing? SMART Error Information Log Entries increasing rapidly

2024-11-6

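The Error Information Log Entries value that smartctl -a /dev/nvme0n1 reports for my NVMe drive is growing quickly, by about 1 per second. Does this indicate that the drive is failing?

Meanwhile, Media and Data Integrity Errors currently shows 0.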

=== START OF INFORMATION SECTION ===
Model Number:                       KINGSTON SKC3000D4096G
Serial Number:                      xxxxx
Firmware Version:                   EIFK31.6
PCI Vendor/Subsystem ID:            0x2646
IEEE OUI Identifier:                0x0026b7
Total NVM Capacity:                 4,096,805,658,624 [4.09 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          4,096,805,658,624 [4.09 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            0026b7 282b2ba6c5
Local Time is:                      Fri Mar 24 01:33:14 2023 CET
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005d):     Comp DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x08):         Telmtry_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     84 Celsius
Critical Comp. Temp. Threshold:     89 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.80W       -        -    0  0  0  0        0       0
 1 +     7.10W       -        -    1  1  1  1        0       0
 2 +     5.20W       -        -    2  2  2  2        0       0
 3 -   0.0620W       -        -    3  3  3  3     2500    7500
 4 -   0.0620W       -        -    4  4  4  4     2500    7500

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        55 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    8%
Data Units Read:                    213,006,510 [109 TB]
Data Units Written:                 549,370,112 [281 TB]
Host Read Commands:                 11,210,192,197
Host Write Commands:                20,687,602,229
Controller Busy Time:               14,055
Power Cycles:                       39
Power On Hours:                     4,204
Unsafe Shutdowns:                   9
Media and Data Integrity Errors:    0
Error Information Log Entries:      1,479,242
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 2:               75 Celsius
Thermal Temp. 1 Total Time:         58745

Error Information (NVMe Log 0x01, 16 of 63 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0    1479242     0  0x2015  0x4004 0x102c            0     0     -
  1    1479241     0  0x2014  0x4004 0x102c            0     0     -
  2    1479240     0  0xd010  0x4004 0x102c            0     0     -
  3    1479239     0  0xc013  0x4004 0x102c            0     0     -
  4    1479238     0  0xb011  0x4004 0x102c            0     0     -
  5    1479237     0  0x8009  0x4004 0x102c            0     0     -
  6    1479236     0  0x0015  0x4004 0x102c            0     0     -
  7    1479235     0  0x0014  0x4004 0x102c            0     0     -
  8    1479234     0  0xa011  0x4004 0x102c            0     0     -
  9    1479233     0  0xa010  0x4004 0x102c            0     0     -
 10    1479232     0  0x9012  0x4004 0x102c            0     0     -
 11    1479231     0  0x9011  0x4004 0x102c            0     0     -
 12    1479230     0  0x6000  0x4004 0x102c            0     0     -
 13    1479229     0  0x5003  0x4004 0x102c            0     0     -
 14    1479228     0  0x4001  0x4004 0x102c            0     0     -
 15    1479227     0  0x4000  0x4004 0x102c            0     0     -
... (47 entries not read)

I have also uploaded the output of nvme error-log /dev/nvme0n1: https://pastebin.com/SQJM7KhV
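Every entry in the error log above carries the same Status value, 0x4004, so decoding it may help narrow down what kind of command is being rejected. The following is a minimal sketch, assuming that smartctl prints the raw 16-bit Status Field with the phase tag in bit 0 (this may differ between smartmontools versions). The function name `decode_status` and the abbreviated lookup table are illustrative, not part of any tool; the bit layout and the generic status codes follow the NVMe specification.

```python
# Decode an NVMe Status Field value as shown in the "Status" column.
# ASSUMPTION: the value is the raw 16-bit field, with the phase tag in bit 0.
# Layout: bit 0 phase tag, bits 1-8 Status Code (SC),
# bits 9-11 Status Code Type (SCT), bit 14 More, bit 15 Do Not Retry.

# Abbreviated subset of the generic command status codes (SCT = 0).
GENERIC_STATUS = {
    0x00: "Successful Completion",
    0x01: "Invalid Command Opcode",
    0x02: "Invalid Field in Command",
    0x04: "Data Transfer Error",
    0x0B: "Invalid Namespace or Format",
}


def decode_status(raw: int) -> dict:
    """Split a raw Status Field value into its documented sub-fields."""
    status_code = (raw >> 1) & 0xFF
    status_code_type = (raw >> 9) & 0x7
    if status_code_type == 0:
        meaning = GENERIC_STATUS.get(status_code, "unknown generic status")
    else:
        meaning = "non-generic status (not decoded in this sketch)"
    return {
        "phase_tag": raw & 0x1,
        "status_code": status_code,
        "status_code_type": status_code_type,
        "more": bool(raw & 0x4000),
        "do_not_retry": bool(raw & 0x8000),
        "meaning": meaning,
    }


print(decode_status(0x4004))
```

Under that assumption, 0x4004 is a generic status with code 0x02 (Invalid Field in Command), which would point to a host command the controller rejects rather than to a media problem. That is consistent with Media and Data Integrity Errors staying at 0.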

Answer 1

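In my case, this was caused by Node Exporter (Prometheus).

After I stopped the process, the Error Information Log Entries value stopped increasing. It is probably issuing a query that the NVMe drive does not support (this needs more digging).

Update: I edited the hwmon collector code to exclude the faulty sensor: https://github.com/prometheus/node_exporter/issues/2643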
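One way to check whether a particular process is responsible is to sample the counter over a fixed interval, once with the process running and once with it stopped. The sketch below parses the counter from smartctl -a output and computes the growth rate. The helper names (`parse_error_log_entries`, `read_counter`, `growth_per_second`) are hypothetical, and it assumes smartctl is installed and the script has the privileges needed to query the device.

```python
import re
import subprocess
import time

# Matches the counter line as it appears in the smartctl output above,
# e.g. "Error Information Log Entries:      1,479,242".
COUNTER_RE = re.compile(
    r"^Error Information Log Entries:\s+([\d,]+)", re.MULTILINE
)


def parse_error_log_entries(smartctl_output: str) -> int:
    """Extract the Error Information Log Entries counter as an integer."""
    match = COUNTER_RE.search(smartctl_output)
    if match is None:
        raise ValueError("Error Information Log Entries line not found")
    return int(match.group(1).replace(",", ""))


def read_counter(device: str) -> int:
    """Run smartctl -a on the device and return the current counter."""
    # smartctl encodes status bits in its exit code, so a non-zero exit
    # is not treated as a failure here.
    result = subprocess.run(
        ["smartctl", "-a", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_error_log_entries(result.stdout)


def growth_per_second(device: str, interval: float = 30.0) -> float:
    """Sample the counter twice and return new entries per second."""
    before = read_counter(device)
    time.sleep(interval)
    after = read_counter(device)
    return (after - before) / interval
```

Calling growth_per_second("/dev/nvme0n1") with Node Exporter running and again after stopping it should show a rate near 1 per second in the first case and near 0 in the second, if the exporter is indeed the cause.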
