How did zfs raidz-2 recover from 3 drive failures?

I want to know what happened, how ZFS recovered so completely, and whether my data is in fact still intact.
Coming in last night and seeing this, I felt dismay, then confusion.

$ zpool status
  pool: san
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: resilvered 392K in 0h0m with 0 errors on Tue Jan 21 16:36:41 2020
config:

        NAME                                          STATE     READ WRITE CKSUM
        san                                           DEGRADED     0     0     0
          raidz2-0                                    DEGRADED     0     0     0
            ata-WDC_WD20EZRX-00DC0B0_WD-WMC1T3458346  ONLINE       0     0     0
            ata-ST2000DM001-9YN164_W1E07E0G           DEGRADED     0     0    38  too many errors
            ata-WDC_WD20EZRX-19D8PB0_WD-WCC4M0428332  DEGRADED     0     0    63  too many errors
            ata-ST2000NM0011_Z1P07NVZ                 ONLINE       0     0     0
            ata-WDC_WD20EARX-00PASB0_WD-WCAZAJ490344  ONLINE       0     0     0
            wwn-0x50014ee20949b6f9                    DEGRADED     0     0    75  too many errors

errors: No known data errors 

How can there be no data errors, and how is the entire pool not faulted?

One drive, sdf, failed its SMART self-test in smartctl with a read failure; the other drives have slightly lesser problems: uncorrectable/pending sectors, or UDMA CRC errors.
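
For reference, the per-drive check was along these lines (sdf as the example device; a sketch, not an exact transcript):

    smartctl -t short /dev/sdf    # kick off a short self-test
    smartctl -a /dev/sdf          # full report: look at Current_Pending_Sector,
                                  # Offline_Uncorrectable and UDMA_CRC_Error_Count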

I tried taking each faulted drive offline and bringing it back online, one at a time, but that did not help.
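
The offline/online cycle was roughly this, per drive (device ID taken from the status output):

    zpool offline san ata-WDC_WD20EZRX-19D8PB0_WD-WCC4M0428332
    zpool online san ata-WDC_WD20EZRX-19D8PB0_WD-WCC4M0428332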

    $ zpool status
  pool: san
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: resilvered 392K in 0h0m with 0 errors on Tue Jan 21 16:36:41 2020
config:

        NAME                                          STATE     READ WRITE CKSUM
        san                                           DEGRADED     0     0     0
          raidz2-0                                    DEGRADED     0     0     0
            ata-WDC_WD20EZRX-00DC0B0_WD-WMC1T3458346  ONLINE       0     0     0
            ata-ST2000DM001-9YN164_W1E07E0G           DEGRADED     0     0    38  too many errors
            ata-WDC_WD20EZRX-19D8PB0_WD-WCC4M0428332  OFFLINE      0     0    63
            ata-ST2000NM0011_Z1P07NVZ                 ONLINE       0     0     0
            ata-WDC_WD20EARX-00PASB0_WD-WCAZAJ490344  ONLINE       0     0     0
            wwn-0x50014ee20949b6f9                    DEGRADED     0     0    75  too many errors

So, feeling very lucky, or somewhat confused about whether my data was even still there, and having inspected the drives and identified the worst one, I replaced it with my only spare.
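
The replace was roughly the following: old device first, then the spare by its /dev/disk/by-id path (both names as they appear in the output below):

    zpool replace san ata-ST2000DM001-9YN164_W1E07E0G \
        /dev/disk/by-id/ata-WDC_WD2000FYYZ-01UL1B1_WD-WCC1P1171516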

    $ zpool status
  pool: san
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Jan 21 17:33:15 2020
        467G scanned out of 8.91T at 174M/s, 14h10m to go
        77.6G resilvered, 5.12% done
config:

        NAME                                              STATE     READ WRITE CKSUM
        san                                               DEGRADED     0     0     0
          raidz2-0                                        DEGRADED     0     0     0
            ata-WDC_WD20EZRX-00DC0B0_WD-WMC1T3458346      ONLINE       0     0     0
            replacing-1                                   DEGRADED     0     0     0
              ata-ST2000DM001-9YN164_W1E07E0G             OFFLINE      0     0    38
              ata-WDC_WD2000FYYZ-01UL1B1_WD-WCC1P1171516  ONLINE       0     0     0  (resilvering)
            ata-WDC_WD20EZRX-19D8PB0_WD-WCC4M0428332      DEGRADED     0     0    63  too many errors
            ata-ST2000NM0011_Z1P07NVZ                     ONLINE       0     0     0
            ata-WDC_WD20EARX-00PASB0_WD-WCAZAJ490344      ONLINE       0     0     0
            wwn-0x50014ee20949b6f9                        DEGRADED     0     0    75  too many errors

The resilver did complete successfully.

$ zpool status
  pool: san
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: resilvered 1.48T in 12h5m with 0 errors on Wed Jan 22 05:38:48 2020
config:

        NAME                                            STATE     READ WRITE CKSUM
        san                                             DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            ata-WDC_WD20EZRX-00DC0B0_WD-WMC1T3458346    ONLINE       0     0     0
            ata-WDC_WD2000FYYZ-01UL1B1_WD-WCC1P1171516  ONLINE       0     0     0
            ata-WDC_WD20EZRX-19D8PB0_WD-WCC4M0428332    DEGRADED     0     0    63  too many errors
            ata-ST2000NM0011_Z1P07NVZ                   ONLINE       0     0     0
            ata-WDC_WD20EARX-00PASB0_WD-WCAZAJ490344    ONLINE       0     0     0
            wwn-0x50014ee20949b6f9                      DEGRADED     0     0    75  too many errors

I am now at a crossroads. Normally I would dd zeros over the first 2 MB of the failed drive and replace it with itself, and I'm fine doing that, but if data really was lost I might need the last two drives for recovery.

I have pulled sdf, and it is sitting on my desk now. I figure that, worst case, I can use it to help with recovery.

In the meantime, I think I will now /dev/zero the first few MB of one degraded drive and replace it with itself, per the sketch below; I expect that to work out, then rinse and repeat for the second faulted drive, until I have some replacements on hand.
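
Per drive, that plan looks roughly like this (device ID from the status output above; double-check the target before running dd, since this destroys the front ZFS labels):

    dd if=/dev/zero of=/dev/disk/by-id/ata-WDC_WD20EZRX-19D8PB0_WD-WCC4M0428332 bs=1M count=2
    zpool replace san ata-WDC_WD20EZRX-19D8PB0_WD-WCC4M0428332   # no new device given: replace with itself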

Question: What happened here, how did the pool hang on, and am I actually missing any data (doubtful, given ZFS and its reputation for integrity reporting)?

Could this be down to a lucky ordering of failures, e.g. it not being the top drive of the stack that failed?

Question: This one is just FYI and off-topic. What caused all 3 drives to fail at the same time? I suspect it was a scrub that triggered it. I checked the night before and all drives were online.

Note that there have been cabling problems recently, and the office gets cold at night, but those problems only ever showed up as a transient drive unavailable, never as checksum errors. I don't think it's the cabling; rather it's the drives aging, they're 5 years old. But 3 failures in one day? Come on, that's enough to scare plenty of us!

Answer 1

RAID-Z2 is double parity, with redundancy comparable to RAID 6. Two disks could fail outright and the data would still be recoverable from parity, assuming the rest of the array is healthy.

You did not necessarily have hard I/O failures. DEGRADED means ZFS kept using the disks despite the checksum errors. Perhaps a few bits flipped, but the drives otherwise still work. The errors were scattered across three disks, but as long as no single block had bad copies on more than two disks at once, raidz2 could reconstruct every block, which is why there are no known data errors. Per the link from that output:

Run 'zpool status -x' to determine which pool has experienced errors.

Find the device with a non-zero error count for READ, WRITE, or CKSUM. This indicates that the device has experienced a read I/O error, write I/O error, or checksum validation error. Because the device is part of a mirror or RAID-Z device, ZFS was able to recover from the error and subsequently repair the damaged data.

If these errors persist over a period of time, ZFS may determine the device is faulty and mark it as such. However, these error counts may or may not indicate that the device is unusable.
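
Once you have scrubbed and/or replaced the suspect drives, the counters can be reset and the pool rechecked, as the status output itself suggests:

    zpool clear san      # reset the READ/WRITE/CKSUM counters
    zpool scrub san      # re-verify every block against its checksum
    zpool status -x      # prints 'all pools are healthy' when clean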

Regarding drive health:

Maybe it's the drives aging, they're 5 years old. But 3 failures in one day? Come on, that's enough to scare plenty of us!

Back up and restore-test your important data immediately. Onto different media, not this array.
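
A minimal sketch of a send-based backup, assuming a hypothetical receiving pool named backup (adjust the names to your setup):

    zfs snapshot -r san@backup-2020-01-22
    zfs send -R san@backup-2020-01-22 | zfs receive -du backup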

Replace the drives that keep going DEGRADED. Definitely replace them if the kernel reports I/O errors in syslog. If they're under warranty or a support contract, make use of that. If they're out of warranty, the manufacturer bet they wouldn't last this long, so take that into account.
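
One way to check for kernel-reported I/O errors (exact log tooling varies by distro):

    dmesg | grep -iE 'i/o error|ata[0-9]'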
