zpool stuck in a resilver loop

I have the following zpool:

    NAME                        STATE     READ WRITE CKSUM
    zfspool                     ONLINE       0     0     0
      mirror-0                  ONLINE       0     0     0
        wwn-0x5000cca266f3d8ee  ONLINE       0     0     0
        wwn-0x5000cca266f1ae00  ONLINE       0     0     0

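This morning the host experienced an event that I am still digging into. Load was very high and a lot of things were not working, but I could still log in.

On reboot, the host hung during boot, waiting for services that depend on the data in the pool above.

Suspecting a problem with the pool, I removed one of the drives and rebooted again. This time the host came up.

A scrub showed that all data on the remaining disk was fine. Once the scrub finished, I re-inserted the removed drive. The drive started resilvering, but it only gets about 4% of the way through and then restarts.

smartctl shows no problems with either drive (no errors logged, WHEN_FAILED empty).

However, I cannot tell which disk is being resilvered. In fact, the pool looks healthy, as if it does not need a resilver at all.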

    errors: No known data errors
    root@host1:/var/log# zpool status
      pool: zfspool
     state: ONLINE
    status: One or more devices is currently being resilvered.  The pool will
            continue to function, possibly in a degraded state.
    action: Wait for the resilver to complete.
      scan: resilver in progress since Sun Dec  8 12:20:53 2019
            46.7G scanned at 15.6G/s, 45.8G issued at 15.3G/s, 5.11T total
            0B resilvered, 0.87% done, 0 days 00:05:40 to go
    config:

            NAME                        STATE     READ WRITE CKSUM
            zfspool                     ONLINE       0     0     0
              mirror-0                  ONLINE       0     0     0
                wwn-0x5000cca266f3d8ee  ONLINE       0     0     0
                wwn-0x5000cca266f1ae00  ONLINE       0     0     0

    errors: No known data errors

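What is the best way to get out of this resilver loop? Other answers suggest detaching the drive that is being resilvered, but as I said, neither drive appears to be resilvering.

Edit:

zpool events shows roughly 1000 repetitions of the following: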

    Dec  8 2019 13:22:12.493980068 sysevent.fs.zfs.resilver_start
            version = 0x0
            class = "sysevent.fs.zfs.resilver_start"
            pool = "zfspool"
            pool_guid = 0x990e3eff72d0c352
            pool_state = 0x0
            pool_context = 0x0
            time = 0x5ded4d64 0x1d7189a4
            eid = 0xf89

    Dec  8 2019 13:22:12.493980068 sysevent.fs.zfs.history_event
            version = 0x0
            class = "sysevent.fs.zfs.history_event"
            pool = "zfspool"
            pool_guid = 0x990e3eff72d0c352
            pool_state = 0x0
            pool_context = 0x0
            history_hostname = "host1"
            history_internal_str = "func=2 mintxg=7381953 maxtxg=9049388"
            history_internal_name = "scan setup"
            history_txg = 0x8a192e
            history_time = 0x5ded4d64
            time = 0x5ded4d64 0x1d7189a4
            eid = 0xf8a

    Dec  8 2019 13:22:17.485979213 sysevent.fs.zfs.history_event
            version = 0x0
            class = "sysevent.fs.zfs.history_event"
            pool = "zfspool"
            pool_guid = 0x990e3eff72d0c352
            pool_state = 0x0
            pool_context = 0x0
            history_hostname = "host1"
            history_internal_str = "errors=0"
            history_internal_name = "scan aborted, restarting"
            history_txg = 0x8a192f
            history_time = 0x5ded4d69
            time = 0x5ded4d69 0x1cf7744d
            eid = 0xf8b

    Dec  8 2019 13:22:17.733979170 sysevent.fs.zfs.history_event
            version = 0x0
            class = "sysevent.fs.zfs.history_event"
            pool = "zfspool"
            pool_guid = 0x990e3eff72d0c352
            pool_state = 0x0
            pool_context = 0x0
            history_hostname = "host1"
            history_internal_str = "errors=0"
            history_internal_name = "starting deferred resilver"
            history_txg = 0x8a192f
            history_time = 0x5ded4d69
            time = 0x5ded4d69 0x2bbfa222
            eid = 0xf8c

    Dec  8 2019 13:22:17.733979170 sysevent.fs.zfs.resilver_start
            version = 0x0
            class = "sysevent.fs.zfs.resilver_start"
            pool = "zfspool"
            pool_guid = 0x990e3eff72d0c352
            pool_state = 0x0
            pool_context = 0x0
            time = 0x5ded4d69 0x2bbfa222
            eid = 0xf8d

    ...

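Answer 1

This issue is now resolved.

The following issue on GitHub provided the answer:

https://github.com/zfsonlinux/zfs/issues/9551

In this case, the red flag was probably the rapidly cycling "starting deferred resilver" events shown by zpool events -v.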
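
A rough way to confirm this pattern is to count the restart markers in the event log. The sketch below is illustrative only: the function name `count_deferred_restarts` is made up for this example, and it assumes the event text has the same format as the output in the question.

```shell
# Hypothetical helper: count "starting deferred resilver" history events.
# Reads `zpool events -v` text on stdin, so it also works on a saved log.
count_deferred_restarts() {
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c 'history_internal_name = "starting deferred resilver"'
}

# Usage on a ZFS host (typically as root):
#   zpool events -v | count_deferred_restarts
```

A count in the hundreds over a short period, as in the question, would be consistent with the loop described in the linked issue.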

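The first suggestion in the link was to disable the zfs-zed service. In my case it was not enabled to begin with.

The second suggestion was to verify that the zpool has the resilver_defer feature activated. There appears to be a potential problem when a pool is moved to a newer ZFS version without enabling the features that come with that version. This pool has moved between several machines and operating systems over the past two years or so, so it makes sense that it was created under an older version of ZFS and is now running under a newer version on the latest host: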

    root@host1:/# zpool get all | grep feature
    ...
    zfspool  feature@resilver_defer         disabled                       local
    ...

After seeing this, I enabled the feature. The GitHub link seems to imply that this is dangerous, so make sure you have backups.

    root@host1:/# zpool set feature@resilver_defer=enabled zfspool
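
To see which features are still disabled on a pool, before or after a change like this, the property listing can be filtered. This is a minimal sketch, not part of the original fix: `list_disabled_features` is a hypothetical name, and it assumes the four-column NAME PROPERTY VALUE SOURCE layout shown above.

```shell
# Hypothetical helper: print pool/feature pairs whose value is "disabled".
# Expects `zpool get all` text (NAME PROPERTY VALUE SOURCE) on stdin.
list_disabled_features() {
    awk '$2 ~ /^feature@/ && $3 == "disabled" { print $1, $2 }'
}

# Usage on a ZFS host:
#   zpool get all zfspool | list_disabled_features
# A single feature can also be queried directly:
#   zpool get feature@resilver_defer zfspool
```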

After enabling the feature, zpool status shows the resilver getting further than before:

    root@host1:/# zpool status
      pool: zfspool
     state: ONLINE
    status: One or more devices is currently being resilvered.  The pool will
            continue to function, possibly in a degraded state.
    action: Wait for the resilver to complete.
      scan: resilver in progress since Sun Dec  8 13:53:43 2019
            847G scanned at 2.03G/s, 396G issued at 969M/s, 5.11T total
            0B resilvered, 7.56% done, 0 days 01:25:14 to go
    config:

            NAME                        STATE     READ WRITE CKSUM
            zfspool                     ONLINE       0     0     0
              mirror-0                  ONLINE       0     0     0
                wwn-0x5000cca266f3d8ee  ONLINE       0     0     0
                wwn-0x5000cca266f1ae00  ONLINE       0     0     0

    errors: No known data errors
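
To check that the resilver is no longer restarting, the completion percentage can be pulled out of the status output and compared across runs. This is only a sketch: `resilver_percent` is a hypothetical name, and it assumes the "NN.NN% done" wording shown above.

```shell
# Hypothetical helper: extract the "NN.NN%" completion figure from
# `zpool status` text on stdin.
resilver_percent() {
    awk '/% done/ { for (i = 1; i <= NF; i++) if ($i ~ /%$/) print $i }'
}

# Usage on a ZFS host; a value that keeps rising across runs suggests
# the restart loop is gone:
#   zpool status zfspool | resilver_percent
```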
