Suspicious zpool resilver

2024-6-8

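ZFS on Linux, Ubuntu 16.04 LTS. The pool has a single raidz1 vdev made up of 5x4TB drives.

Yesterday I found that one drive had failed completely (audible noises as the heads tried to reset and recalibrate). It was dead, so I took it offline.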

mrenouf@archive:~$ sudo zpool status
  pool: tank
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Online the device using 'zpool online' or replace the device with
    'zpool replace'.
  scan: resilvered 2.42G in 0h3m with 0 errors on Thu Apr 20 08:04:09 2017
config:

    NAME                                       STATE     READ WRITE CKSUM
    tank                                       DEGRADED     0     0     0
      raidz1-0                                 DEGRADED     0     0     0
        ata-ST4000DM005-2DP166_ZDH0ZH6V-part2  ONLINE       0     0     0
        ata-ST4000DM005-2DP166_ZDH0Z9EG-part2  ONLINE       0     0     0
        ata-ST4000DM005-2DP166_ZDH0ZJZS-part2  ONLINE       0     0     0
        ata-ST4000DM005-2DP166_ZDH0ZDDJ-part2  ONLINE       0     0     0
        ata-ST4000DM005-2DP166_ZDH0ZJDQ-part2  OFFLINE      0     0     0

errors: No known data errors

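Side note: why -part2? That's how FreeNAS does it (which is where this pool was originally created). Each drive has a 2GB swap partition ahead of the data, and I decided it was best to replicate that partition on the replacement drive for symmetry.

So anyway, I popped in a replacement and started the resilver.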
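One way to copy a partition layout onto a replacement drive is sgdisk's replicate option. The sketch below is a dry run that only prints the commands. It assumes the drives use GPT and that sgdisk (from the gdisk package) is installed; the function name and both device paths are hypothetical placeholders, not taken from the post.

```shell
# Dry-run sketch: print (do not execute) the sgdisk commands that would
# copy a GPT partition table from a healthy pool member to a new drive.
# print_clone_commands is a hypothetical helper name.
print_clone_commands() {
    src="$1"   # existing healthy drive (source of the layout)
    dst="$2"   # blank replacement drive (will be overwritten)
    # --replicate takes the TARGET as its argument; the source comes last.
    echo "sgdisk --replicate=$dst $src"
    # Give the copy fresh GUIDs so the two disks do not collide.
    echo "sgdisk --randomize-guids $dst"
}

# Placeholder device names; substitute real /dev/disk/by-id paths.
print_clone_commands /dev/disk/by-id/OLD_DRIVE /dev/disk/by-id/NEW_DRIVE
```

Review the printed commands carefully before running them as root, since swapping source and target would overwrite the healthy drive's partition table.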

zpool replace tank ata-ST4000DM005-2DP166_ZDH0ZJDQ-part2 /dev/disk/by-id/ata-ST4000DM005-2DP166_ZDH15ZE0-part2

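It "finished" in record time... (the estimate had shown 20 hours). I don't have a history of the intermediate steps, but trust me... there is ~4TB allocated in this pool.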
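Since the intermediate steps were not recorded, a small logging helper could capture them next time. This is a minimal sketch: log_snapshot and the log path are hypothetical, and the pool name tank is taken from the output above.

```shell
# Append one timestamped snapshot of a command's output to a log file.
# Usage: log_snapshot LOGFILE COMMAND [ARGS...]
log_snapshot() {
    logfile="$1"
    shift
    {
        printf '=== %s ===\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')"
        "$@"
    } >> "$logfile" 2>&1
}

# Example loop (commented out; stop it with Ctrl-C once the resilver ends):
# while sleep 60; do
#     log_snapshot /tmp/tank-resilver.log zpool status tank
# done
```

Each snapshot is preceded by a UTC timestamp, so the log shows when the reported progress and time estimate changed.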

  pool: tank
 state: ONLINE
  scan: resilvered 2.42G in 0h3m with 0 errors on Thu Apr 20 08:04:09 2017
config:

    NAME                                       STATE     READ WRITE CKSUM
    tank                                       ONLINE       0     0     0
      raidz1-0                                 ONLINE       0     0     0
        ata-ST4000DM005-2DP166_ZDH0ZH6V-part2  ONLINE       0     0     0
        ata-ST4000DM005-2DP166_ZDH0Z9EG-part2  ONLINE       0     0     0
        ata-ST4000DM005-2DP166_ZDH0ZJZS-part2  ONLINE       0     0     0
        ata-ST4000DM005-2DP166_ZDH0ZDDJ-part2  ONLINE       0     0     0
        ata-ST4000DM005-2DP166_ZDH15ZE0-part2  ONLINE       0     0     0

errors: No known data errors

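I don't believe you!

What could cause this? What should I do now? I haven't figured out how to replace a drive with itself (and force another resilver).

Edit:

I ran a scrub on this fresh "resilver":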

  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub in progress since Thu Apr 20 08:39:31 2017
    12.1G scanned out of 4.29T at 87.7M/s, 14h13m to go
    159M repaired, 0.27% done
config:

        NAME                                       STATE     READ WRITE CKSUM
        tank                                       ONLINE       0     0     0
          raidz1-0                                 ONLINE       0     0     0
            ata-ST4000DM005-2DP166_ZDH0ZH6V-part2  ONLINE       0     0     0
            ata-ST4000DM005-2DP166_ZDH0Z9EG-part2  ONLINE       0     0     0
            ata-ST4000DM005-2DP166_ZDH0ZJZS-part2  ONLINE       0     0     0
            ata-ST4000DM005-2DP166_ZDH0ZDDJ-part2  ONLINE       0     0     0
            ata-ST4000DM005-2DP166_ZDH15ZE0-part2  ONLINE       0     0 20.2K  (repairing)

errors: No known data errors

It seemed to stall around the ~12G mark:

12.4G scanned out of 4.29T at 64.2M/s, 19h25m to go

And then it just stopped:

  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 164M in 0h3m with 0 errors on Thu Apr 20 08:42:50 2017
config:

        NAME                                       STATE     READ WRITE CKSUM
        tank                                       ONLINE       0     0     0
          raidz1-0                                 ONLINE       0     0     0
            ata-ST4000DM005-2DP166_ZDH0ZH6V-part2  ONLINE       0     0     0
            ata-ST4000DM005-2DP166_ZDH0Z9EG-part2  ONLINE       0     0     0
            ata-ST4000DM005-2DP166_ZDH0ZJZS-part2  ONLINE       0     0     0
            ata-ST4000DM005-2DP166_ZDH0ZDDJ-part2  ONLINE       0     0     0
            ata-ST4000DM005-2DP166_ZDH15ZE0-part2  ONLINE       0     0 21.5K

errors: No known data errors

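What? It didn't even scan the whole pool. How is that possible? There are no hardware errors, so what on earth is going on? Argh.

Answer 1

You need to upgrade to ZFS on Linux 0.7.0 (my issue was fixed in release candidate 3, which is now available). Follow the build instructions. After installing, check the zfs and spl versions to make sure everything is in order.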
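To put a number on how little was scanned, here is a rough calculation using the figures from the status output above. It assumes zpool reports G and T in binary units (1T = 1024G); scan_percent is a hypothetical helper name.

```shell
# Percentage of the pool scanned, given GiB scanned and TiB total.
scan_percent() {
    awk -v scanned_gib="$1" -v total_tib="$2" \
        'BEGIN { printf "%.1f\n", 100 * scanned_gib / (total_tib * 1024) }'
}

# Figures from the stalled scrub: 12.4G scanned out of 4.29T.
scan_percent 12.4 4.29
```

So the scrub covered only a fraction of one percent of the pool before it reported completion.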

modinfo zfs | grep -iw version
modinfo spl | grep -iw version

I had exactly the same problem, and this worked for me.
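The version check could also be scripted. The sketch below assumes GNU sort (for its -V version ordering); version_at_least is a hypothetical helper. It compares naively, so a pre-release string such as 0.7.0-rc3 is treated as at least 0.7.0.

```shell
# Succeeds if version $1 is greater than or equal to version $2.
# Relies on GNU sort -V; pre-release suffixes are not treated specially.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example (commented out; needs the zfs kernel module installed):
# installed="$(modinfo -F version zfs)"
# if version_at_least "$installed" 0.7.0; then
#     echo "zfs $installed is at or above 0.7.0"
# else
#     echo "zfs $installed is older than 0.7.0"
# fi
```

The same function works for the spl module by swapping the module name in the modinfo call.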
