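I believe my drive may be dying, but I'm getting contradictory feedback. The drive is an XPG Gammix AGAMMIXS11P-1TT-C S11 Pro 3D NAND PCIe NVMe Gen3x4 M.2 2280 SSD, 1 TB. I'm using Fedora (34 at first, then 35 after I switched while trying to solve the problem).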
For a few weeks now, I have been getting Input/output errors when hashing fairly large (5+ GB) backup files. dmesg gives me entries like this:
BTRFS warning (device dm-0): csum failed root 256 ino 31359 off 70897819648 csum 0xc39e6daf expected csum 0xdd85c8f2 mirror 1
[ 4851.163157] BTRFS error (device dm-0): bdev /dev/mapper/luks-197f7c13-2430-4e53-bc76-2eb5a06a2419 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
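The checksum warning above carries enough information to locate the affected data: the subvolume root, the inode number, and the byte offset. Here is a minimal Python sketch of pulling those fields out of a dmesg line. The function name parse_csum_failure is made up for illustration, and the regular expression assumes the exact message layout quoted above, which may differ between kernel versions.

```python
import re

# Pattern modelled on the warning line quoted above (an assumption:
# other kernel versions may word the message differently).
CSUM_RE = re.compile(
    r"BTRFS warning \(device (?P<device>[^)]+)\): csum failed "
    r"root (?P<root>\d+) ino (?P<ino>\d+) off (?P<off>\d+) "
    r"csum (?P<csum>0x[0-9a-f]+) expected csum (?P<expected>0x[0-9a-f]+) "
    r"mirror (?P<mirror>\d+)"
)


def parse_csum_failure(line):
    """Return the fields of a btrfs csum failure line, or None if it does not match."""
    match = CSUM_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Numeric fields become ints; the checksums stay as hex strings.
    for key in ("root", "ino", "off", "mirror"):
        fields[key] = int(fields[key])
    return fields


if __name__ == "__main__":
    sample = (
        "BTRFS warning (device dm-0): csum failed root 256 ino 31359 "
        "off 70897819648 csum 0xc39e6daf expected csum 0xdd85c8f2 mirror 1"
    )
    print(parse_csum_failure(sample))
```

With the inode number in hand, btrfs inspect-internal inode-resolve (run against a path inside the affected subvolume) should map it back to a file name, which makes it easier to see whether the same files fail every time.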
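That alone is serious, and I have basically been using this computer like a read-only device, but on top of that I eventually got more btrfs errors on some random files (small config or lib files in /usr/lib/), and Firefox stopped working. The rest of the system was fine. I was very worried, so I used nvme-cli to get the SMART log from the drive. The results looked (and still look) fine: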
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0
temperature : 43 C
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 0%
endurance group critical warning summary: 0
data_units_read : 23,088,142
data_units_written : 15,395,166
host_read_commands : 87,911,793
host_write_commands : 133,959,725
controller_busy_time : 2,823
power_cycles : 875
power_on_hours : 3,634
unsafe_shutdowns : 84
media_errors : 0
num_err_log_entries : 0
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 2 : 43 C
Temperature Sensor 3 : 59 C
Temperature Sensor 4 : 43 C
Temperature Sensor 5 : 43 C
Temperature Sensor 6 : 42 C
Thermal Management T1 Trans Count : 44
Thermal Management T2 Trans Count : 14
Thermal Management T1 Total Time : 899
Thermal Management T2 Total Time : 333
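For anyone who wants to check such a log automatically, here is a rough Python sketch that parses the "key : value" layout shown above and flags the fields that usually indicate a worn or failing drive. The function names and the choice of fields are my own assumptions, not part of nvme-cli, and the parser assumes the plain-text format exactly as pasted here.

```python
def parse_smart_log(text):
    """Parse 'key : value' lines of nvme smart-log text into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def to_int(value):
    """Convert values such as '23,088,142', '100%' or '43 C' to an int."""
    return int(value.split()[0].rstrip("%").replace(",", ""))


def health_flags(fields):
    """Return a list of human-readable warnings; an empty list means nothing was flagged."""
    flags = []
    if to_int(fields["critical_warning"]) != 0:
        flags.append("critical_warning is set")
    if to_int(fields["media_errors"]) != 0:
        flags.append("media_errors is non-zero")
    if to_int(fields["available_spare"]) <= to_int(fields["available_spare_threshold"]):
        flags.append("available_spare is at or below its threshold")
    if to_int(fields["percentage_used"]) >= 100:
        flags.append("percentage_used has reached 100%")
    return flags


if __name__ == "__main__":
    # A few of the fields from the log above.
    sample = """critical_warning : 0
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 0%
media_errors : 0"""
    print(health_flags(parse_smart_log(sample)))
```

An empty list only means the controller itself has not recorded a problem. These counters say nothing about corruption introduced elsewhere between the drive and the application.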
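I decided to do a fresh install of Fedora 35, and the installation went well. The system has been stable. Just now I decided to write my backups (~180 GB) back to the drive and try hashing them, and I got another Input/output error. I ran btrfs scrub start /, but the result came back clean: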
UUID: fd4449cc-ab1b-401c-8c62-916bd5e2353c
Scrub started: Sun Jan 9 19:31:55 2022
Status: finished
Duration: 0:00:57
Total to scrub: 182.23GiB
Rate: 3.20GiB/s
Error summary: no errors found
And now the hashing works! (No Input/output error, and the hashes show the files are not corrupted.)
What is going on? Is my drive slowly dying? Are there more tests I can run (besides btrfs scrub and nvme smart-log) to find out?
Edit: I just got these in dmesg -w:
[ 1654.979314] nvme nvme0: I/O 530 QID 12 timeout, aborting
[ 1654.979326] nvme nvme0: I/O 531 QID 12 timeout, aborting
[ 1654.979330] nvme nvme0: I/O 532 QID 12 timeout, aborting
[ 1654.979334] nvme nvme0: I/O 533 QID 12 timeout, aborting
[ 1654.979337] nvme nvme0: I/O 534 QID 12 timeout, aborting
[ 1671.274745] nvme nvme0: Abort status: 0x0
[ 1671.274767] nvme nvme0: Abort status: 0x0
[ 1671.274771] nvme nvme0: Abort status: 0x0
[ 1671.274774] nvme nvme0: Abort status: 0x0
[ 1671.274776] nvme nvme0: Abort status: 0x0
Output of smartctl -a:
=== START OF INFORMATION SECTION ===
Model Number: XPG GAMMIX S11 Pro
Serial Number: xxxxxxxxxxxx
Firmware Version: 32A0T2IA
PCI Vendor/Subsystem ID: 0x1cc1
IEEE OUI Identifier: 0x000000
Controller ID: 1
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,024,209,543,168 [1.02 TB]
Namespace 1 Utilization: 204,128,706,560 [204 GB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Mon Jan 10 12:35:38 2022 EST
Firmware Updates (0x14): 2 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0b): S/H_per_NS Cmd_Eff_Lg Telmtry_Lg
Maximum Data Transfer Size: 64 Pages
Warning Comp. Temp. Threshold: 75 Celsius
Critical Comp. Temp. Threshold: 80 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 9.00W - - 0 0 0 0 0 0
1 + 4.60W - - 1 1 1 1 0 0
2 + 3.80W - - 2 2 2 2 0 0
3 - 0.0450W - - 3 3 3 3 2000 2000
4 - 0.0040W - - 4 4 4 4 15000 15000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 41 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 23,571,578 [12.0 TB]
Data Units Written: 15,420,722 [7.89 TB]
Host Read Commands: 89,012,266
Host Write Commands: 134,091,234
Controller Busy Time: 2,832
Power Cycles: 878
Power On Hours: 3,639
Unsafe Shutdowns: 84
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 2: 41 Celsius
Temperature Sensor 3: 56 Celsius
Temperature Sensor 4: 41 Celsius
Temperature Sensor 5: 41 Celsius
Temperature Sensor 6: 40 Celsius
Thermal Temp. 1 Transition Count: 44
Thermal Temp. 2 Transition Count: 14
Thermal Temp. 1 Total Time: 899
Thermal Temp. 2 Total Time: 333
Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged
Self-test results (I ran two short tests while figuring out how to use the tool):
Device Self Test Log for NVME device:nvme0
Current operation : 0
Current Completion : 0%
Self Test Result[0]:
Operation Result : 0
Self Test Code : 2
Valid Diagnostic Information : 0
Power on hours (POH) : 0xe3c
Vendor Specific : 0 0
Self Test Result[1]:
Operation Result : 0
Self Test Code : 1
Valid Diagnostic Information : 0
Power on hours (POH) : 0xe3c
Vendor Specific : 0 0
Self Test Result[2]:
Operation Result : 0
Self Test Code : 1
Valid Diagnostic Information : 0
Power on hours (POH) : 0xe3c
Vendor Specific : 0 0
Self Test Result[3]:
Operation Result : 0xf
Self Test Result[4]:
Operation Result : 0xf
Self Test Result[5]:
Operation Result : 0xf
Self Test Result[6]:
Operation Result : 0xf
Self Test Result[7]:
Operation Result : 0xf
Self Test Result[8]:
Operation Result : 0xf
Self Test Result[9]:
Operation Result : 0xf
Self Test Result[10]:
Operation Result : 0xf
Self Test Result[11]:
Operation Result : 0xf
Self Test Result[12]:
Operation Result : 0xf
Self Test Result[13]:
Operation Result : 0xf
Self Test Result[14]:
Operation Result : 0xf
Self Test Result[15]:
Operation Result : 0xf
Self Test Result[16]:
Operation Result : 0xf
Self Test Result[17]:
Operation Result : 0xf
Self Test Result[18]:
Operation Result : 0xf
Self Test Result[19]:
Operation Result : 0xf
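The raw numbers in this log are easier to read once decoded. Below is a small Python sketch that parses the layout shown above and prints one line per entry. The code tables reflect my understanding of the NVMe specification (Self Test Code 1 is a short test and 2 an extended test; Operation Result 0 means completed without error and 0xf marks an unused slot), so treat them as an assumption to verify; the function names are illustrative.

```python
import re

# Assumed meanings, taken from my reading of the NVMe Device Self-test log definition.
RESULTS = {0x0: "completed without error", 0xF: "entry not used"}
CODES = {0x1: "short", 0x2: "extended"}


def parse_self_test_log(text):
    """Split the log text into one dict of 'key: value' strings per result entry."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if re.match(r"Self Test Result\[\d+\]:", line):
            current = {}
            entries.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return entries


def describe(entry):
    """Turn one parsed entry into a short human-readable summary."""
    result = int(entry["Operation Result"], 0)  # base 0 accepts '0' and '0xf'
    if result == 0xF:
        return "unused slot"
    code = int(entry["Self Test Code"], 0)
    hours = int(entry["Power on hours (POH)"], 0)
    kind = CODES.get(code, f"code {code}")
    outcome = RESULTS.get(result, f"result {result:#x}")
    return f"{kind} test, {outcome}, at {hours} power-on hours"


if __name__ == "__main__":
    sample = """Self Test Result[0]:
Operation Result : 0
Self Test Code : 2
Valid Diagnostic Information : 0
Power on hours (POH) : 0xe3c
Vendor Specific : 0 0
Self Test Result[3]:
Operation Result : 0xf"""
    for entry in parse_self_test_log(sample):
        print(describe(entry))
```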
If I run dmesg | grep -i nvme after a reboot, I don't get anything of concern:
[ 1.381334] nvme nvme0: pci function 0000:01:00.0
[ 1.392743] nvme nvme0: 15/0/0 default/read/poll queues
[ 1.394601] nvme0n1: p1 p2 p3
[ 19.943676] EXT4-fs (nvme0n1p2): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
Answer 1
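Back in 2022 this question went unanswered for a while and I forgot about it, but I did eventually find out what was wrong with the "drive". So, for anyone who might stumble upon this question: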
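The drive was fine, but one of the system's RAM sticks was faulty. Some memtest86 runs proved it, I got a warranty replacement, and I have not had a checksum failure or any problem with the drive or BTRFS since.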
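More details: I had 2 sticks of 16 GB RAM in that build, which was considerable overkill for the workload it saw. Since the faulty stick was the second one, the system rarely used it. That explains why the system as a whole was stable.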
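I would get errors on almost every large-file hash, but also intermittently on smaller file operations (corrupted files here and there). I think this happened because most small-file operations used the first RAM stick and were only rarely "pushed" onto the second stick, when other applications had used up the first 16 GB. Large-file hashing was the most obvious sign of the problem because it was one of the few use cases on that system that actually made use of more than 16 GB of RAM. A professional video editor, for example, would probably have rejected this system much sooner than I did (they routinely use software that handles large files).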
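In hindsight, a cheap check to try before blaming the disk is to hash the same large file several times and compare the outcomes. Below is a minimal Python sketch of that idea, assuming SHA-256; the function names are illustrative and the path is whatever large file you pass on the command line. Digests that change between passes, or errors that come and go, suggest something flaky in the data path (RAM, for instance) rather than data that is permanently bad on disk.

```python
import hashlib
import sys


def sha256_of(path, chunk_size=1024 * 1024):
    """Stream the file through SHA-256 in 1 MiB chunks and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def repeated_hash(path, passes=3):
    """Hash the file several times; each outcome is a hex digest or an error string."""
    outcomes = []
    for _ in range(passes):
        try:
            outcomes.append(sha256_of(path))
        except OSError as exc:  # includes Input/output error (EIO)
            outcomes.append(f"error: {exc}")
    return outcomes


def is_consistent(outcomes):
    """True only if every pass succeeded and produced the same digest."""
    return len(set(outcomes)) == 1 and not outcomes[0].startswith("error")


if __name__ == "__main__" and len(sys.argv) > 1:
    results = repeated_hash(sys.argv[1])
    for outcome in results:
        print(outcome)
    print("consistent" if is_consistent(results) else "INCONSISTENT")
```

This is not a substitute for memtest86. Repeat reads may be served from the page cache instead of the drive, so a consistent result proves little, while an inconsistent one is a strong hint to test the memory.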