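The value of Error Information Log Entries that smartctl -a /dev/nvme0n1 reports for my NVMe drive is growing quickly, by about 1 every second. Does this indicate a faulty drive?
Meanwhile, Media and Data Integrity Errors currently shows 0.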
=== START OF INFORMATION SECTION ===
Model Number: KINGSTON SKC3000D4096G
Serial Number: xxxxx
Firmware Version: EIFK31.6
PCI Vendor/Subsystem ID: 0x2646
IEEE OUI Identifier: 0x0026b7
Total NVM Capacity: 4,096,805,658,624 [4.09 TB]
Unallocated NVM Capacity: 0
Controller ID: 1
NVMe Version: 1.4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 4,096,805,658,624 [4.09 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 0026b7 282b2ba6c5
Local Time is: Fri Mar 24 01:33:14 2023 CET
Firmware Updates (0x12): 1 Slot, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005d): Comp DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x08): Telmtry_Lg
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 84 Celsius
Critical Comp. Temp. Threshold: 89 Celsius
Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.80W       -        -    0  0  0  0        0       0
 1 +     7.10W       -        -    1  1  1  1        0       0
 2 +     5.20W       -        -    2  2  2  2        0       0
 3 -   0.0620W       -        -    3  3  3  3     2500    7500
 4 -   0.0620W       -        -    4  4  4  4     2500    7500
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 55 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 8%
Data Units Read: 213,006,510 [109 TB]
Data Units Written: 549,370,112 [281 TB]
Host Read Commands: 11,210,192,197
Host Write Commands: 20,687,602,229
Controller Busy Time: 14,055
Power Cycles: 39
Power On Hours: 4,204
Unsafe Shutdowns: 9
Media and Data Integrity Errors: 0
Error Information Log Entries: 1,479,242
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 2: 75 Celsius
Thermal Temp. 1 Total Time: 58745
Error Information (NVMe Log 0x01, 16 of 63 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc  LBA  NSID  VS
  0    1479242     0  0x2015  0x4004  0x102c    0     0   -
  1    1479241     0  0x2014  0x4004  0x102c    0     0   -
  2    1479240     0  0xd010  0x4004  0x102c    0     0   -
  3    1479239     0  0xc013  0x4004  0x102c    0     0   -
  4    1479238     0  0xb011  0x4004  0x102c    0     0   -
  5    1479237     0  0x8009  0x4004  0x102c    0     0   -
  6    1479236     0  0x0015  0x4004  0x102c    0     0   -
  7    1479235     0  0x0014  0x4004  0x102c    0     0   -
  8    1479234     0  0xa011  0x4004  0x102c    0     0   -
  9    1479233     0  0xa010  0x4004  0x102c    0     0   -
 10    1479232     0  0x9012  0x4004  0x102c    0     0   -
 11    1479231     0  0x9011  0x4004  0x102c    0     0   -
 12    1479230     0  0x6000  0x4004  0x102c    0     0   -
 13    1479229     0  0x5003  0x4004  0x102c    0     0   -
 14    1479228     0  0x4001  0x4004  0x102c    0     0   -
 15    1479227     0  0x4000  0x4004  0x102c    0     0   -
... (47 entries not read)
I have also uploaded the output of nvme error-log /dev/nvme0n1:
https://pastebin.com/SQJM7KhV
Answer 1
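In my case, this was caused by Node Exporter (Prometheus).
After I stopped the process, the value of Error Information Log Entries stopped increasing. It is probably issuing queries that the NVMe drive does not support (I would have to dig deeper to confirm this).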
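As a rough cross-check of that guess, the Status value 0x4004 repeated in the question's error log can be decoded by hand. The sketch below assumes smartctl prints the raw 16-bit status word with the phase tag in bit 0, as laid out in the NVMe base specification. The helper name decode_status and the small lookup table are made up for illustration and cover only a few generic status codes.

```python
# Subset of generic command status codes (status code type 0) from the
# NVMe base specification. Illustrative only, not a complete table.
GENERIC_STATUS = {
    0x00: "Successful Completion",
    0x01: "Invalid Command Opcode",
    0x02: "Invalid Field in Command",
}


def decode_status(raw: int) -> dict:
    """Split a 16-bit status word into its fields.

    Assumption: bit 0 is the phase tag, bits 8:1 the status code (SC),
    bits 11:9 the status code type (SCT), bit 14 "More", bit 15 "Do Not Retry".
    """
    sc = (raw >> 1) & 0xFF
    sct = (raw >> 9) & 0x7
    if sct == 0:
        name = GENERIC_STATUS.get(sc, "unknown generic status")
    else:
        name = "non-generic status"
    return {
        "sc": sc,
        "sct": sct,
        "more": bool((raw >> 14) & 1),
        "dnr": bool((raw >> 15) & 1),
        "name": name,
    }


if __name__ == "__main__":
    print(decode_status(0x4004))
```

Under that assumption, 0x4004 decodes to status code 0x02 (Invalid Field in Command) with the More bit set, which would fit a host tool sending a request the drive rejects rather than a media problem.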
Update: I edited the hwmon collector code to exclude the faulty sensor: https://github.com/prometheus/node_exporter/issues/2643
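To check whether a change like this actually stops the counter, it can be sampled twice and compared. This is a minimal sketch, assuming smartctl is installed, the script runs with enough privileges to query the device, and the text output matches the format shown in the question. The function names are hypothetical.

```python
import re
import subprocess
import time

# Matches the counter line as printed in the SMART data section,
# e.g. "Error Information Log Entries: 1,479,242".
COUNTER_RE = re.compile(
    r"^\s*Error Information Log Entries:\s*([\d,]+)", re.MULTILINE
)


def parse_error_log_entries(smartctl_output: str) -> int:
    """Extract the counter from smartctl text output as an integer."""
    match = COUNTER_RE.search(smartctl_output)
    if match is None:
        raise ValueError("counter not found in smartctl output")
    return int(match.group(1).replace(",", ""))


def read_counter(device: str) -> int:
    """Run smartctl -a on the device and return the current counter value."""
    # smartctl uses its exit status as a bit mask, so a non-zero
    # return code is not treated as a failure here.
    result = subprocess.run(
        ["smartctl", "-a", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_error_log_entries(result.stdout)


def growth_per_second(device: str, interval: float = 10.0) -> float:
    """Sample the counter twice and return the average increase per second."""
    first = read_counter(device)
    time.sleep(interval)
    second = read_counter(device)
    return (second - first) / interval


if __name__ == "__main__":
    print(growth_per_second("/dev/nvme0n1"))
```

Running it once with the exporter active and once with it stopped should show whether the growth rate drops to zero.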