Do BTRFS errors always mean the drive is about to fail?

2024-6-10

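I believe my drive may be dying, but I'm getting contradictory feedback. The drive is an XPG Gammix AGAMMIXS11P-1TT-C S11 Pro 3D NAND PCIe NVMe Gen3x4 M.2 2280 SSD 1TB. I use Fedora (34 at first, then I moved to 35 while trying to fix the problem).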

For a few weeks now I have been getting Input/output errors when hashing fairly large (5+ GB) backup files. dmesg gives me entries like these:

BTRFS warning (device dm-0): csum failed root 256 ino 31359 off 70897819648 csum 0xc39e6daf expected csum 0xdd85c8f2 mirror 1
[ 4851.163157] BTRFS error (device dm-0): bdev /dev/mapper/luks-197f7c13-2430-4e53-bc76-2eb5a06a2419 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
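For anyone trying to read a warning like the one above, here is a minimal Python sketch that pulls the fields out of the "csum failed" line. The regular expression is only matched against the exact wording shown here; other kernel versions may format the message differently, and the function name parse_csum_failure is made up for this example.

```python
import re

# Pattern for the "csum failed" warning exactly as it appears in this post.
# Assumption: other kernel versions may word this line differently.
CSUM_RE = re.compile(
    r"csum failed root (?P<root>\d+) ino (?P<ino>\d+) off (?P<off>\d+) "
    r"csum (?P<got>0x[0-9a-f]+) expected csum (?P<want>0x[0-9a-f]+) "
    r"mirror (?P<mirror>\d+)"
)


def parse_csum_failure(line):
    """Return the fields of a BTRFS csum warning as a dict, or None."""
    match = CSUM_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Numeric fields become ints; the checksums stay as hex strings.
    for key in ("root", "ino", "off", "mirror"):
        fields[key] = int(fields[key])
    return fields


line = ("BTRFS warning (device dm-0): csum failed root 256 ino 31359 "
        "off 70897819648 csum 0xc39e6daf expected csum 0xdd85c8f2 mirror 1")
info = parse_csum_failure(line)
print("inode:", info["ino"])
print("offset (GiB):", info["off"] / 2**30)
print("got", info["got"], "expected", info["want"])
```

The inode number can then be turned back into a file name with btrfs inspect-internal inode-resolve, given the inode and the mount point of that subvolume.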

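That was significant in itself, since I had essentially been using this computer as a read-only device, but on top of that I eventually got more btrfs errors on some random files (small config or lib files in /usr/lib/), and Firefox stopped working. The rest of the system was fine. I was quite worried, so I used nvme-cli to pull the smart log from the drive. The results looked (and still look) fine: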

Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning            : 0
temperature                 : 43 C
available_spare             : 100%
available_spare_threshold   : 10%
percentage_used             : 0%
endurance group critical warning summary: 0
data_units_read             : 23,088,142
data_units_written          : 15,395,166
host_read_commands          : 87,911,793
host_write_commands         : 133,959,725
controller_busy_time        : 2,823
power_cycles                : 875
power_on_hours              : 3,634
unsafe_shutdowns            : 84
media_errors                : 0
num_err_log_entries         : 0
Warning Temperature Time    : 0
Critical Composite Temperature Time : 0
Temperature Sensor 2        : 43 C
Temperature Sensor 3        : 59 C
Temperature Sensor 4        : 43 C
Temperature Sensor 5        : 43 C
Temperature Sensor 6        : 42 C
Thermal Management T1 Trans Count   : 44
Thermal Management T2 Trans Count   : 14
Thermal Management T1 Total Time    : 899
Thermal Management T2 Total Time    : 333

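I decided to do a clean install of Fedora 35, and the installation went well. The system has been stable since. Just now I decided to write my backups (~180 GB) back to the drive and tried hashing them, and I got an Input/output error again. I ran btrfs scrub start / but the result came back clean: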

UUID:             fd4449cc-ab1b-401c-8c62-916bd5e2353c
Scrub started:    Sun Jan  9 19:31:55 2022
Status:           finished
Duration:         0:00:57
Total to scrub:   182.23GiB
Rate:             3.20GiB/s
Error summary:    no errors found

Now the hashing works! (No Input/output error, and the hashes show the files are not corrupted.)

What is going on? Is my drive slowly dying? Are there more tests I can run (besides btrfs scrub and nvme smart-log) to find out?

Edit: I just got these from dmesg -w:

[ 1654.979314] nvme nvme0: I/O 530 QID 12 timeout, aborting
[ 1654.979326] nvme nvme0: I/O 531 QID 12 timeout, aborting
[ 1654.979330] nvme nvme0: I/O 532 QID 12 timeout, aborting
[ 1654.979334] nvme nvme0: I/O 533 QID 12 timeout, aborting
[ 1654.979337] nvme nvme0: I/O 534 QID 12 timeout, aborting
[ 1671.274745] nvme nvme0: Abort status: 0x0
[ 1671.274767] nvme nvme0: Abort status: 0x0
[ 1671.274771] nvme nvme0: Abort status: 0x0
[ 1671.274774] nvme nvme0: Abort status: 0x0
[ 1671.274776] nvme nvme0: Abort status: 0x0

Output of smartctl -a:

=== START OF INFORMATION SECTION ===
Model Number:                       XPG GAMMIX S11 Pro
Serial Number:                      xxxxxxxxxxxx
Firmware Version:                   32A0T2IA
PCI Vendor/Subsystem ID:            0x1cc1
IEEE OUI Identifier:                0x000000
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,024,209,543,168 [1.02 TB]
Namespace 1 Utilization:            204,128,706,560 [204 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Mon Jan 10 12:35:38 2022 EST
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0b):         S/H_per_NS Cmd_Eff_Lg Telmtry_Lg
Maximum Data Transfer Size:         64 Pages
Warning  Comp. Temp. Threshold:     75 Celsius
Critical Comp. Temp. Threshold:     80 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.00W       -        -    0  0  0  0        0       0
 1 +     4.60W       -        -    1  1  1  1        0       0
 2 +     3.80W       -        -    2  2  2  2        0       0
 3 -   0.0450W       -        -    3  3  3  3     2000    2000
 4 -   0.0040W       -        -    4  4  4  4    15000   15000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        41 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    23,571,578 [12.0 TB]
Data Units Written:                 15,420,722 [7.89 TB]
Host Read Commands:                 89,012,266
Host Write Commands:                134,091,234
Controller Busy Time:               2,832
Power Cycles:                       878
Power On Hours:                     3,639
Unsafe Shutdowns:                   84
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 2:               41 Celsius
Temperature Sensor 3:               56 Celsius
Temperature Sensor 4:               41 Celsius
Temperature Sensor 5:               41 Celsius
Temperature Sensor 6:               40 Celsius
Thermal Temp. 1 Transition Count:   44
Thermal Temp. 2 Transition Count:   14
Thermal Temp. 1 Total Time:         899
Thermal Temp. 2 Total Time:         333

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged
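As a sanity check on the wear figures, the raw "Data Units" counters can be converted to bytes by hand. In the NVMe SMART log one data unit is 1000 units of 512 bytes, so the short Python sketch below reproduces roughly the bracketed TB values that smartctl prints (its rounding may differ slightly) and estimates how many times the drive's capacity has been written. The ratio is only a rough indicator and does not by itself prove anything about drive health.

```python
# Per the NVMe specification, one "data unit" in the SMART log is
# 1000 units of 512 bytes, i.e. 512,000 bytes.
BYTES_PER_DATA_UNIT = 1000 * 512


def data_units_to_bytes(units):
    """Convert an NVMe SMART 'data units' counter to bytes."""
    return units * BYTES_PER_DATA_UNIT


# Counters copied from the smartctl output above.
read_bytes = data_units_to_bytes(23_571_578)
written_bytes = data_units_to_bytes(15_420_722)
capacity_bytes = 1_024_209_543_168  # Namespace 1 Size/Capacity

print(f"read:    {read_bytes / 1e12:.2f} TB")
print(f"written: {written_bytes / 1e12:.2f} TB")
print(f"capacity written about {written_bytes / capacity_bytes:.1f} times")
```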

Self-test results (I ran the short test twice while figuring out how to use the tool):

Device Self Test Log for NVME device:nvme0
Current operation  : 0
Current Completion : 0%
Self Test Result[0]:
  Operation Result             : 0
  Self Test Code               : 2
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0xe3c
  Vendor Specific              : 0 0
Self Test Result[1]:
  Operation Result             : 0
  Self Test Code               : 1
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0xe3c
  Vendor Specific              : 0 0
Self Test Result[2]:
  Operation Result             : 0
  Self Test Code               : 1
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0xe3c
  Vendor Specific              : 0 0
Self Test Result[3]:
  Operation Result             : 0xf
Self Test Result[4]:
  Operation Result             : 0xf
Self Test Result[5]:
  Operation Result             : 0xf
Self Test Result[6]:
  Operation Result             : 0xf
Self Test Result[7]:
  Operation Result             : 0xf
Self Test Result[8]:
  Operation Result             : 0xf
Self Test Result[9]:
  Operation Result             : 0xf
Self Test Result[10]:
  Operation Result             : 0xf
Self Test Result[11]:
  Operation Result             : 0xf
Self Test Result[12]:
  Operation Result             : 0xf
Self Test Result[13]:
  Operation Result             : 0xf
Self Test Result[14]:
  Operation Result             : 0xf
Self Test Result[15]:
  Operation Result             : 0xf
Self Test Result[16]:
  Operation Result             : 0xf
Self Test Result[17]:
  Operation Result             : 0xf
Self Test Result[18]:
  Operation Result             : 0xf
Self Test Result[19]:
  Operation Result             : 0xf
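The numeric codes in this log are easier to read once decoded. The Python sketch below maps only the values that appear above, using the meanings from the NVMe Device Self-test log definition: result 0x0 is a test that completed without error, 0xf marks an unused log slot, and self-test codes 1 and 2 are the short and extended tests. The helper name describe_entry is invented for this example.

```python
# Meanings taken from the NVMe Device Self-test log definition.
# Only the values that occur in the log above are listed.
RESULT_MEANING = {
    0x0: "completed without error",
    0xF: "entry not used",
}
CODE_MEANING = {
    0x1: "short self-test",
    0x2: "extended self-test",
}


def describe_entry(result, code=None, poh=None):
    """Build a one-line description of a self-test log entry."""
    text = RESULT_MEANING.get(result, f"other result {result:#x}")
    if code is not None:
        text += ", " + CODE_MEANING.get(code, f"other code {code:#x}")
    if poh is not None:
        text += f", at {poh} power-on hours"
    return text


print(describe_entry(0x0, code=0x2, poh=0xE3C))  # Self Test Result[0]
print(describe_entry(0x0, code=0x1, poh=0xE3C))  # Self Test Result[1] and [2]
print(describe_entry(0xF))                       # Self Test Result[3] to [19]
```

Read this way, the log holds three completed tests (one with the extended code, two with the short code) and seventeen empty slots, so the long run of 0xf values is not a list of failures.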

If I run dmesg | grep -i nvme after a reboot, nothing concerning shows up:

[    1.381334] nvme nvme0: pci function 0000:01:00.0
[    1.392743] nvme nvme0: 15/0/0 default/read/poll queues
[    1.394601]  nvme0n1: p1 p2 p3
[   19.943676] EXT4-fs (nvme0n1p2): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.

Answer 1

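Back in 2022 this question went unanswered for a while and I forgot about it, but I did eventually find out what was wrong with the "drive". So, for anyone who may stumble upon this question: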

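The drive was fine, but one of the system's RAM sticks was faulty. Some memtest86 runs confirmed it, I got a warranty replacement, and I have not had a checksum failure or any other problem with the drive or BTRFS since.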

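More details: that build had 2 sticks of 16 GB RAM, which was considerable overkill for the workload it saw. Since the faulty stick was the second one, the system rarely used it. That explains why the system as a whole was stable.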

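I would get errors on nearly every large-file hash, but also intermittently on smaller file operations (corrupted files here and there). I think this happened because most small file operations used the first RAM stick and were only rarely "pushed" onto the second stick, when other applications had used up the first 16 GB. Large-file hashes were the clearest sign of the problem because they were one of the few use cases on that system that actually made use of more than 16 GB of RAM. A professional video editor, for example, would probably have rejected this system much sooner than I did (they routinely use software that handles large files).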
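For readers who want to reproduce the symptom, here is a minimal Python sketch of the "hash the same file several times" check. It is an illustration under assumptions, not a diagnostic tool: the function names are made up, the demo uses a small temporary file instead of a real multi-GB backup, and repeated reads may be served from the page cache rather than from the drive. Digests that differ between runs point at the data path in general (RAM, controller, cabling), and a memory tester such as memtest86 is still what actually confirmed the fault in my case.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hashes_agree(path, rounds=3):
    """Hash the same file several times; return (all_equal, digests).

    A read that fails checksum verification on BTRFS surfaces here as
    an OSError (Input/output error) raised by f.read().
    """
    digests = [sha256_of(path) for _ in range(rounds)]
    return len(set(digests)) == 1, digests


# Demo on a small throwaway file; a real check would point at a large backup.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(4 * 1024 * 1024))
    demo_path = tmp.name
try:
    ok, digests = hashes_agree(demo_path)
    print("stable" if ok else "MISMATCH", digests[0])
finally:
    os.remove(demo_path)
```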
