ZFS RAIDZ1 root on Gentoo fails to boot after disk swap

I have a 3-disk RAIDZ1 array made of HDDs:

# zpool status
...
config:

    NAME        STATE     READ WRITE CKSUM
    gpool       ONLINE       0     0     0
      raidz1-0  ONLINE       0     0     0
        sdb     ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sda     ONLINE       0     0     0

(I used /dev/disk/by-id paths when I created the pool, but they show up as /dev/sdX.)
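As an aside, when a pool created with by-id paths comes back under /dev/sdX names, re-importing it with an explicit device directory usually restores the stable names. A minimal sketch, assuming it is run from a live/rescue environment, since gpool is the root pool and cannot be exported while in use:

```shell
# Re-import the pool so ZFS records the by-id device names again.
# Must be done while the pool is not the running root (e.g. from a live CD).
zpool export gpool
zpool import -d /dev/disk/by-id gpool
```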

I want to swap all 3 HDDs for SSDs, but gradually. Since I have 6 SATA ports and a spare cable, I plugged in the first new SSD and set it up, using the disk to be replaced as the source:

    # sgdisk --replicate=/dev/disk/by-id/newSSD1 /dev/disk/by-id/oldHDD1
        The operation has completed successfully.
    # sgdisk --randomize-guids /dev/disk/by-id/newSSD1
        The operation has completed successfully.
    # grub-install /dev/disk/by-id/newSSD1
        Installing for i386-pc platform.
        Installation finished. No error reported.

Then fdisk -l /dev/disk/by-id/newSSD1 showed me the same partitions as on the 3 HDDs, namely:

        Disk /dev/disk/by-id/newSSD1: 931.53 GiB, 1000204886016 bytes, 1953525168 sectors
        Disk model: CT1000MX500SSD1 
        Units: sectors of 1 * 512 = 512 bytes
        Sector size (logical/physical): 512 bytes / 4096 bytes
        I/O size (minimum/optimal): 4096 bytes / 4096 bytes
        Disklabel type: gpt
        Disk identifier: EF97564D-490F-4A76-B0F0-4E8C7CAFFBD2

        Device                                                      Start        End    Sectors   Size Type
        /dev/disk/by-id/newSSD1-part1       2048 1953507327 1953505280 931.5G Solaris /usr & Apple ZFS
        /dev/disk/by-id/newSSD1-part2         48       2047       2000  1000K BIOS boot
        /dev/disk/by-id/newSSD1-part9 1953507328 1953523711      16384     8M Solaris reserved 1

        Partition table entries are not in disk order.

Then I went ahead and replaced the disk:

    # zpool offline gpool /dev/sdb
    # zpool status
          pool: gpool
         state: DEGRADED
        status: One or more devices has been taken offline by the administrator.
            Sufficient replicas exist for the pool to continue functioning in a
            degraded state.
        action: Online the device using 'zpool online' or replace the device with
            'zpool replace'.
          scan: scrub repaired 0B in 0 days 00:30:46 with 0 errors on Sat Jun 27 12:29:56 2020
        config:

            NAME        STATE     READ WRITE CKSUM
            gpool       DEGRADED     0     0     0
              raidz1-0  DEGRADED     0     0     0
                sdb     OFFLINE      0     0     0
                sdd     ONLINE       0     0     0
                sda     ONLINE       0     0     0

        errors: No known data errors
    
    # zpool replace gpool /dev/sdb /dev/disk/by-id/newSSD1
    Make sure to wait until resilver is done before rebooting.

    # zpool status
      pool: gpool
     state: DEGRADED
    status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
    action: Wait for the resilver to complete.
      scan: resilver in progress since Thu Jul 16 20:00:58 2020
        427G scanned at 6.67G/s, 792M issued at 12.4M/s, 574G total
        0B resilvered, 0.13% done, 0 days 13:10:03 to go
    config:

        NAME                                    STATE     READ WRITE CKSUM
        gpool                                   DEGRADED     0     0     0
          raidz1-0                              DEGRADED     0     0     0
            replacing-0                         DEGRADED     0     0     0
              sdb                               OFFLINE      0     0     0
              ata-newSSD1                       ONLINE       0     0     0
            sdd                                 ONLINE       0     0     0
            sda                                 ONLINE       0     0     0

    errors: No known data errors

Eventually, it finished resilvering.

    # zpool status
      pool: gpool
     state: ONLINE
    status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
    action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
      scan: resilvered 192G in 0 days 00:27:48 with 0 errors on Thu Jul 16 20:28:46 2020
    config:

        NAME                                  STATE     READ WRITE CKSUM
        gpool                                 ONLINE       0     0     0
          raidz1-0                            ONLINE       0     0     0
            ata-SSD1                          ONLINE       0     0     0
            sdd                               ONLINE       0     0     0
            sda                               ONLINE       0     0     0

    errors: No known data errors

This time with the by-id label. Since I had replicated the partitions and installed GRUB on the new SSD, I didn't expect any trouble.

However, when I booted, GRUB dropped me to the grub rescue> prompt with grub_file_filters not found. I tried booting from the other 2 HDDs and from the SSD, getting the same error every time. Plugging the 3rd HDD back in gave the same result.

Today I booted from the SSD... and everything worked. The zpool is fine, no GRUB errors. I'm writing this from that very system.

At the rescue prompt, ls did show the expected bunch of partitions, but I could only get GRUB to display meaningful information once I ran insmod zfs (or similar). However, trying something like ls (hd0,gpt1)/ROOT/gentoo@/boot resulted in compression algorithm 73 not supported (or 80, too).

I'm running kernel 5.4.28 with an initramfs and a root=ZFS GRUB parameter. I had no ZFS-root boot incidents before I decided to swap the drives. My /etc/default/grub has an entry to find the ZFS root,

GRUB_CMDLINE_LINUX_DEFAULT="dozfs spl.spl_hostid=0xa8c06101 real_root=ZFS=gpool/ROOT/gentoo"

and it does. I'd like to go on and replace the other disks, but first I'd rather understand what happened and how to avoid it.
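For the remaining disks, one way to sidestep whatever relabeling zpool replace is doing might be to hand it the existing ZFS partition instead of the whole disk, so ZFS reuses the replicated table rather than rewriting it. A sketch, with newSSD2/oldHDD2 as placeholder names:

```shell
# Replicate the partition table and make the new disk bootable, as before.
sgdisk --replicate=/dev/disk/by-id/newSSD2 /dev/disk/by-id/oldHDD2
sgdisk --randomize-guids /dev/disk/by-id/newSSD2
grub-install /dev/disk/by-id/newSSD2

# Then point zpool replace at the ZFS partition (-part1), not the bare
# disk, so ZFS should take the partition as-is instead of repartitioning.
zpool replace gpool /dev/sdd /dev/disk/by-id/newSSD2-part1
```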

Edit 1

I noticed something. After running sgdisk --replicate, I get 3 partitions, same as on the original disk:

# fdisk -l ${NEWDISK2}
Disk /dev/disk/by-id/ata-CT1000MX500SSD1_NEWDISK2: 931.53 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: CT1000MX500SSD1 
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 2190C74D-46C8-44AC-81FB-36C3B72A7EA7

Device                                                      Start        End    Sectors   Size Type
/dev/disk/by-id/ata-CT1000MX500SSD1_NEWDISK2-part1       2048 1953507327 1953505280 931.5G Solaris /usr & Apple ZFS
/dev/disk/by-id/ata-CT1000MX500SSD1_NEWDISK2-part2         48       2047       2000  1000K BIOS boot
/dev/disk/by-id/ata-CT1000MX500SSD1_NEWDISK2-part9 1953507328 1953523711      16384     8M Solaris reserved 1

Partition table entries are not in disk order.

...but after running zpool replace, I lose a partition:

# fdisk -l ${NEWDISK2}
Disk /dev/disk/by-id/ata-CT1000MX500SSD1_NEWDISK2: 931.53 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: CT1000MX500SSD1 
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 0FC0A6C0-F9F1-E341-B7BD-99D7B370D685

Device                                                      Start        End    Sectors   Size Type
/dev/disk/by-id/ata-CT1000MX500SSD1_NEWDISK2-part1       2048 1953507327 1953505280 931.5G Solaris /usr & Apple ZFS
/dev/disk/by-id/ata-CT1000MX500SSD1_NEWDISK2-part9 1953507328 1953523711      16384     8M Solaris reserved 1

...namely the boot partition. Which is strange, considering I managed to boot from the new SSD.

I'll keep experimenting. As for the ZFS version:

# zpool version
zfs-0.8.4-r1-gentoo
zfs-kmod-0.8.3-r0-gentoo

Edit 2

This is consistent. When I replicate with sgdisk --replicate, I get 3 partitions, the same as the originals, including the BIOS boot partition. After running zpool replace and resilvering, I lose the boot partition.

I think the system is still bootable because that partition's data is still in place and the MBR still points at it, so the BIOS can boot GRUB.
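If that's right, re-creating the BIOS boot partition in the gap that zpool replace left free and re-embedding GRUB should make the disks reliably bootable again. A sketch, reusing the sector range from the replicated table shown above (sectors 48-2047, GPT type EF02 = BIOS boot):

```shell
# Re-create partition 2 as a BIOS boot partition in the original gap,
# then re-embed GRUB's core image into it.
sgdisk --new=2:48:2047 --typecode=2:EF02 /dev/disk/by-id/newSSD1
grub-install /dev/disk/by-id/newSSD1
```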

The current state is as follows:

# zpool status
  pool: gpool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(5) for details.
  scan: resilvered 192G in 0 days 00:08:04 with 0 errors on Fri Jul 17 21:04:54 2020
config:

    NAME                            STATE     READ WRITE CKSUM
    gpool                           ONLINE       0     0     0
      raidz1-0                      ONLINE       0     0     0
        ata-CT1000MX500SSD1_NEWSSD1 ONLINE       0     0     0
        ata-CT1000MX500SSD1_NEWSSD2 ONLINE       0     0     0
        ata-CT1000MX500SSD1_NEWSSD3 ONLINE       0     0     0

errors: No known data errors
