How to repair a ZFS pool after a spare has kicked in, or how to correct the spare replacement

2024-6-1

I have a ZFS pool in the following state:

[root@SERVER-abc ~]# zpool status -v DATAPOOL
  pool: DATAPOOL
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 18.5M in 00:00:01 with 0 errors on Wed Jan  5 19:10:50 2022
config:

        NAME                                              STATE     READ WRITE CKSUM
        DATAPOOL                                          DEGRADED     0     0     0
          raidz2-0                                        DEGRADED     0     0     0
            gptid/14c707c6-f16c-11e8-b117-0cc47a2ba44e    DEGRADED     0     0    17  too many errors
            spare-1                                       ONLINE       0     0    17
              gptid/168342c5-f16c-11e8-b117-0cc47a2ba44e  ONLINE       0     0     0
              gptid/1bfaa607-f16c-11e8-b117-0cc47a2ba44e  ONLINE       0     0     0
            gptid/1875501a-f16c-11e8-b117-0cc47a2ba44e    ONLINE       0     0    30
            gptid/1a16d37c-f16c-11e8-b117-0cc47a2ba44e    ONLINE       0     0    29
        spares
          gptid/1bfaa607-f16c-11e8-b117-0cc47a2ba44e      INUSE     currently in use

errors: Permanent errors have been detected in the following files:

        DATAPOOL/VMS/ubuntu_1804_LTS_ustrich-m6i87@auto-2022-01-04_11-41:<0x1>
        <0x1080a>:<0x1>
        <0x182a>:<0x1>
        DATAPOOL/VMS/ubuntu_1804_LTS_ustrich-m6i87:<0x1>
        <0x16fa>:<0x1>

This is a zpool with 4 drives plus 1 spare. Something happened, and the spare was suddenly and automatically paired with one of the other drives as spare-1.

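This was unexpected to me, and it raises these questions:

Why didn't the spare replace the degraded drive?
How can I find out why the spare jumped into spare-1?
Is it possible (or even advisable) to get the spare back and then replace the degraded drive?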

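The goal is to rescue the pool without having to restore a large amount of data from backup, but above all I want to understand what happened and why, and how to handle situations like this according to "best practice".

Thanks a lot! :)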

The system is: SuperMicro, TrueNAS-12.0-U4.1, zfs-2.0.4-3

Edit: changed the output from zpool status -x to zpool status -v DATAPOOL

Edit 2: So far I understand that 168342c5 apparently had errors first and the spare (1bfaa607) jumped in. After that, 14c707c6 degraded as well.
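
The pairing described in Edit 2 can be read from the indentation of the config table: devices nested under a spare-N entry are the original drive and the spare attached to it. A minimal sketch, assuming the config rows from the first status output are pasted in as text; the helper name spare_groups is hypothetical and not part of any ZFS tooling:

```python
import re

# Config rows copied from the first `zpool status -v DATAPOOL` output.
CONFIG = """\
        DATAPOOL                                          DEGRADED     0     0     0
          raidz2-0                                        DEGRADED     0     0     0
            gptid/14c707c6-f16c-11e8-b117-0cc47a2ba44e    DEGRADED     0     0    17  too many errors
            spare-1                                       ONLINE       0     0    17
              gptid/168342c5-f16c-11e8-b117-0cc47a2ba44e  ONLINE       0     0     0
              gptid/1bfaa607-f16c-11e8-b117-0cc47a2ba44e  ONLINE       0     0     0
            gptid/1875501a-f16c-11e8-b117-0cc47a2ba44e    ONLINE       0     0    30
            gptid/1a16d37c-f16c-11e8-b117-0cc47a2ba44e    ONLINE       0     0    29
"""


def spare_groups(config_text):
    """Map each spare-N entry to the device names indented beneath it."""
    groups = {}
    current, current_indent = None, -1
    for line in config_text.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        name = line.split()[0]
        if re.fullmatch(r"spare-\d+", name):
            # Start of a spare group; remember how deep it is indented.
            current, current_indent = name, indent
            groups[current] = []
        elif current is not None and indent > current_indent:
            # Deeper indentation means the row belongs to the open group.
            groups[current].append(name)
        else:
            # Same or shallower indentation closes the group.
            current = None
    return groups


groups = spare_groups(CONFIG)
for vdev, members in groups.items():
    print(vdev, "->", ", ".join(members))
```

For the output above this lists 168342c5 and 1bfaa607 under spare-1, which matches the reading that the spare was attached to 168342c5 and not to 14c707c6.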

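Edit 3, additional question: all drives (except the ones in spare-1) seem to have CKSUM errors. What does that indicate? Cabling? The HBA? All drives dying at the same time?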
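
To make the CKSUM picture in Edit 3 easier to compare between runs, the per-device counts can be tallied from the pasted status rows. This is only a sketch: the helper name cksum_by_device is hypothetical, and it assumes plain integer counters in the fifth column (abbreviated values such as "1.2K" are not handled). Errors spread over several drives at once are often read as a hint toward a shared component rather than simultaneous disk failures, but the tally alone does not prove either.

```python
# Leaf-device rows copied from the first `zpool status -v DATAPOOL` output.
CONFIG = """\
gptid/14c707c6-f16c-11e8-b117-0cc47a2ba44e    DEGRADED     0     0    17  too many errors
gptid/168342c5-f16c-11e8-b117-0cc47a2ba44e    ONLINE       0     0     0
gptid/1bfaa607-f16c-11e8-b117-0cc47a2ba44e    ONLINE       0     0     0
gptid/1875501a-f16c-11e8-b117-0cc47a2ba44e    ONLINE       0     0    30
gptid/1a16d37c-f16c-11e8-b117-0cc47a2ba44e    ONLINE       0     0    29
"""


def cksum_by_device(config_text, leaf_prefix="gptid/"):
    """Return {device: CKSUM count} for rows whose name starts with leaf_prefix."""
    counts = {}
    for line in config_text.splitlines():
        fields = line.split()
        # Columns are NAME STATE READ WRITE CKSUM, optionally followed by a note.
        if len(fields) >= 5 and fields[0].startswith(leaf_prefix):
            counts[fields[0]] = int(fields[4])  # assumes a plain integer
    return counts


counts = cksum_by_device(CONFIG)
affected = [dev for dev, n in counts.items() if n > 0]
print(f"{len(affected)} of {len(counts)} leaf devices report checksum errors")
```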

Latest update: after zpool clear and zpool scrub DATAPOOL it seems clear that a lot has happened and that there is no way to rescue the pool:

  pool: DATAPOOL
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Jan  6 16:18:05 2022
        1.82T scanned at 1.55G/s, 204G issued at 174M/s, 7.82T total
        40.8G resilvered, 2.55% done, 12:44:33 to go
config:

        NAME                                              STATE     READ WRITE CKSUM
        DATAPOOL                                          DEGRADED     0     0     0
          raidz2-0                                        DEGRADED     0     0     0
            gptid/14c707c6-f16c-11e8-b117-0cc47a2ba44e    DEGRADED     0     0   156  too many errors
            spare-1                                       DEGRADED     0     0     0
              gptid/168342c5-f16c-11e8-b117-0cc47a2ba44e  DEGRADED     0     0   236  too many errors
              gptid/1bfaa607-f16c-11e8-b117-0cc47a2ba44e  ONLINE       0     0     0  (resilvering)
            gptid/1875501a-f16c-11e8-b117-0cc47a2ba44e    DEGRADED     0     0   182  too many errors
            gptid/1a16d37c-f16c-11e8-b117-0cc47a2ba44e    DEGRADED     0     0   179  too many errors
        spares
          gptid/1bfaa607-f16c-11e8-b117-0cc47a2ba44e      INUSE     currently in use

I am now going to check all the SMART stats.

Answer 1

Is this a 4-disk RAIDZ2?

Did you choose that layout over ZFS mirrors?

Can you show the output of zpool status -v?

Please run zpool clear and track the result/progress.
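
One way to track the progress is to pull the percentage and the remaining time out of the scan lines each time the status is checked. A minimal sketch, assuming the scan format shown in the question (zfs-2.0.4 with an HH:MM:SS estimate; other versions or very long estimates may print it differently); the helper name resilver_progress is hypothetical:

```python
import re

# Scan lines copied from the second status output in the question.
SCAN = """\
  scan: resilver in progress since Thu Jan  6 16:18:05 2022
        1.82T scanned at 1.55G/s, 204G issued at 174M/s, 7.82T total
        40.8G resilvered, 2.55% done, 12:44:33 to go
"""


def resilver_progress(status_text):
    """Return (percent_done, seconds_remaining), or None if no match is found."""
    m = re.search(r"([\d.]+)% done, (\d+):(\d{2}):(\d{2}) to go", status_text)
    if m is None:
        return None
    percent = float(m.group(1))
    hours, minutes, seconds = (int(m.group(i)) for i in (2, 3, 4))
    return percent, hours * 3600 + minutes * 60 + seconds


progress = resilver_progress(SCAN)
print(progress)
```

Logging this pair together with the CKSUM counters at each check shows whether the resilver is advancing while the error counts keep climbing.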
