ZFS scrub keeps repairing the disks

I have two SSDs in a ZFS RAID 1 (mirror) configuration. They are fairly old (about 10 years, I'd guess), but they have hardly been used over the years. Here is my configuration:

  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Wed Nov 22 15:56:15 2023
    176G scanned at 454M/s, 28.3G issued at 73.2M/s, 176G total
    4.50K repaired, 16.11% done, 00:34:21 to go
config:

    NAME        STATE     READ WRITE CKSUM
    tank        ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sda     ONLINE       0     0     3  (repairing)
        sdb     ONLINE       0     0     6  (repairing)

As you can see, during the scrub it found some checksum inconsistencies and was able to repair them. The strange thing is that even though I haven't written anything new to the disks and have run the scrub twice, one right after the other, it keeps finding new errors on both disks every time.
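
For reference, this is roughly the sequence I mean by "two scrubs one after another" (just a sketch; the pool name tank matches the status output above, and running zpool clear in between is optional):

    # reset the error counters, then start a fresh scrub
    zpool clear tank
    zpool scrub tank

    # once it finishes, check the per-device counters
    zpool status -v tank

    # start a second scrub immediately, without writing anything to the pool
    zpool scrub tank
    zpool status -v tank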

Looking at the output of dmesg, I don't see any disk-related problems (no scary red errors). The only thing I found is this:

[18125.949842] RIP: 0033:0x7f5eb2eeee83
[18125.949849] RSP: 002b:00007f5eb21fc6f8 EFLAGS: 00000293 ORIG_RAX: 00000000000000d9
[18125.949859] RAX: ffffffffffffffda RBX: 00007f5ea40178d0 RCX: 00007f5eb2eeee83
[18125.949865] RDX: 0000000000008000 RSI: 00007f5ea40178d0 RDI: 0000000000000009
[18125.949870] RBP: 00007f5ea40178a4 R08: 0000000000000007 R09: 00007f5ea4007650
[18125.949876] R10: 3ade3c6b4360070e R11: 0000000000000293 R12: ffffffffffffff50
[18125.949882] R13: 0000000000000000 R14: 00007f5ea40178a0 R15: 00007f5eb21fcbf0
[18125.949894]  </TASK>
[18125.949898] INFO: task fish:591217 blocked for more than 120 seconds.
[18125.949906]       Tainted: P           OE      6.1.0-13-amd64 #1 Debian 6.1.55-1
[18125.949914] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[18125.949919] task:fish            state:D stack:0     pid:591217 ppid:584012 flags:0x00000002
[18125.949930] Call Trace:
[18125.949933]  <TASK>
[18125.949939]  __schedule+0x351/0xa20
[18125.949954]  schedule+0x5d/0xe0
[18125.949961]  io_schedule+0x42/0x70
[18125.949969]  cv_wait_common+0xaa/0x130 [spl]
[18125.950003]  ? cpuusage_read+0x10/0x10
[18125.950014]  txg_wait_synced_impl+0xcb/0x110 [zfs]
[18125.950417]  txg_wait_synced+0xc/0x40 [zfs]
[18125.950812]  dmu_tx_wait+0x208/0x430 [zfs]
[18125.951127]  dmu_tx_assign+0x15e/0x510 [zfs]
[18125.951442]  zfs_dirty_inode+0x14d/0x360 [zfs]
[18125.951863]  zpl_dirty_inode+0x25/0x40 [zfs]
[18125.952277]  __mark_inode_dirty+0x53/0x380
[18125.952289]  touch_atime+0x1d1/0x1f0
[18125.952299]  iterate_dir+0xff/0x1c0
[18125.952309]  __x64_sys_getdents64+0x84/0x120
[18125.952318]  ? compat_filldir+0x190/0x190
[18125.952330]  do_syscall_64+0x58/0xc0
[18125.952342]  ? fpregs_assert_state_consistent+0x22/0x50
[18125.952352]  ? exit_to_user_mode_prepare+0x40/0x1d0
[18125.952362]  ? syscall_exit_to_user_mode+0x27/0x40
[18125.952370]  ? do_syscall_64+0x67/0xc0
[18125.952380]  ? do_syscall_64+0x67/0xc0
[18125.952391]  entry_SYSCALL_64_after_hwframe+0x64/0xce

This looks to me like a dump of something ZFS-related (it's very vague), even though the task listed as blocked is fish (my shell). Could this be my problem (I don't think so), or does it simply mean my disks are faulty and about to die?
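
If SMART data from the drives would help, I can grab it with something like the following (assuming smartmontools is installed; /dev/sda and /dev/sdb are the two mirror members from the pool output):

    # overall health, attributes, and error logs for both SSDs
    smartctl -a /dev/sda
    smartctl -a /dev/sdb

    # optionally kick off an extended self-test on a drive
    smartctl -t long /dev/sda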

If it helps, I'm on a Debian 12 Linux machine.
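
For completeness, the kernel and OpenZFS versions can be pulled with something like this (the kernel string should match the 6.1.0-13-amd64 seen in the dmesg output above):

    uname -r         # running kernel
    zfs --version    # OpenZFS userland and kernel module versions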

Thanks in advance for your help ;-)
