ext4 filesystem corruption after adding a tune2fs step to set the UUID?

2024-6-10

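We have just rolled out a new build of our Ubuntu 16.04 systems on AWS. Besides apt-get updates and the like, we added an explicit step to our puppet code that uses tune2fs to add a UUID to the ext4 disks. (This is in preparation for moving to amazon c5 instance types, which use nvme device names, and we want to know which disk is which before and after.)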

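But we then needed to restart a large number of systems to resize their instances in AWS (within the same instance family), and about 10% of them failed with filesystem corruption on their data drive (not the root drive).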

grep -i ext4 /var/log/kern.log |grep xvdh
2019-03-14T14:07:39.954930+00:00 ip-10-2-219-30 kernel: [   26.059585] EXT4-fs (xvdh): ext4_check_descriptors: Checksum for group 0 failed (25645!=13919)
2019-03-14T14:07:39.961718+00:00 ip-10-2-219-30 kernel: [   26.064303] EXT4-fs (xvdh): group descriptors corrupted!
2019-03-14T14:07:54.984741+00:00 ip-10-2-219-30 kernel: [   41.090302] EXT4-fs (xvdh): ext4_check_descriptors: Checksum for group 0 failed (25645!=13919)
2019-03-14T14:07:54.984757+00:00 ip-10-2-219-30 kernel: [   41.094897] EXT4-fs (xvdh): group descriptors corrupted!
2019-03-14T14:08:17.138117+00:00 ip-10-2-219-30 kernel: [   63.239655] EXT4-fs (xvdh): ext4_check_descriptors: Checksum for group 0 failed (25645!=13919)
2019-03-14T14:08:17.138141+00:00 ip-10-2-219-30 kernel: [   63.246723] EXT4-fs (xvdh): group descriptors corrupted!
2019-03-14T14:21:30.636962+00:00 redacted1 kernel: [    3.798075] EXT4-fs (xvdh): mounted filesystem with ordered data mode. Opts: (null)
2019-03-14T14:46:07.812220+00:00 redacted2 kernel: [    3.614731] EXT4-fs (xvdh): mounted filesystem with ordered data mode. Opts: (null)
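
For anyone wanting to count how many hosts and devices hit this across a fleet, the checksum lines above can be picked out of kern.log with a short script. This is only a rough sketch: the regular expression and the function names (checksum_failures, failures_per_device) are made up for this post and assume the message format stays exactly as shown in the log above. The two numbers in parentheses are returned in the order they appear, without interpreting which one is the stored checksum.

```python
import re
from collections import Counter

# Pattern written against the ext4_check_descriptors lines quoted above.
# It is an assumption of this sketch, not taken from any existing tool.
CHECKSUM_FAILED = re.compile(
    r"EXT4-fs \((?P<dev>[^)]+)\): ext4_check_descriptors: "
    r"Checksum for group (?P<group>\d+) failed\s*"
    r"\((?P<first>\d+)!=(?P<second>\d+)\)"
)


def checksum_failures(lines):
    """Yield (device, group, first, second) for each checksum failure line."""
    for line in lines:
        match = CHECKSUM_FAILED.search(line)
        if match:
            yield (
                match.group("dev"),
                int(match.group("group")),
                int(match.group("first")),
                int(match.group("second")),
            )


def failures_per_device(lines):
    """Count checksum failure lines per block device name."""
    return Counter(dev for dev, _, _, _ in checksum_failures(lines))
```

Feeding it the lines of /var/log/kern.log (for example via open("/var/log/kern.log")) gives one tuple per failed mount attempt, so repeated attempts on the same boot show up as repeated identical tuples.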

We then had to fsck the drive to recover the system.

The puppet code that makes this change is below. So far we have only used M4/C4 instance types, so it should all be /dev/xvdh.

class our_storage::platforms::aws {

  # This shouldn't run during image generation.
  if $::packer_build != 'yes' {

    # If nvme0n1 is present this means we are using a M5 or C5 instance and then the data volume will be nvme1n1
    # We need to check the disk that are mounted in / because it might take time for the data volume to appear as totally mounted to the instance.
    # xvda    --> xvdh
    # nvme0n1 --> nvme1n1
    if $facts['disks']['nvme0n1'] {
      $st_volume = '/dev/nvme1n1'
    }
    elsif $facts['disks']['xvda'] {
      $st_volume = '/dev/xvdh'
    }
    else {
      fail("Invalid disk configuration ${facts['disks']}")
    }

    $fstype = 'ext4'
    $mount_opts = 'auto,noatime'

    # If /data is not mounted, go ahead and do it.
    if !$facts['mountpoints']['/data'] {

      # Get an unique, constant UUID for this volume.
      $ec2_userdata = parsejson($facts['ec2_userdata'])
      $domain = $ec2_userdata['domain']
      $subdomain = $ec2_userdata['subDomain']
      $st_volume_uuid = fqdn_uuid("${subdomain}.${domain}")

      # we may have to wait for the device to "appear"
      exec { 'Storage: waiting for data volume to be attached':
        path      => '/bin',
        command   => "lsblk -fn ${st_volume}",
        tries     => 60,
        try_sleep => 10,
        unless    => 'mountpoint -q -- "/data"',
        logoutput => true,
      } -> exec { 'Storage: formatting data volume': # WARNING: if we ever change from ext4, this will reformat volumes!
        path      => ['/sbin', '/bin'],
        command   => "mkfs.${fstype} -F ${st_volume}",
        unless    => "blkid ${st_volume} | grep -q 'TYPE=\"${fstype}\"'",
        logoutput => true,
      } -> exec { 'Storage: assign UUID to data volume':
        path      => ['/sbin', '/bin'],
        command   => "tune2fs ${st_volume} -U ${st_volume_uuid}",
        logoutput => true,
      } ~> mount { '/data':
        ensure  => mounted,
        device  => "UUID=${st_volume_uuid}",
        fstype  => $fstype,
        options => $mount_opts,
        require => File['/data'],
        before  => File[$our_storage::data_dirs],
      }
    } else {
      # Need to fetch the current UUID.
      # Cannot be changed if the volume is already mounted!
      $st_volume_uuid = $st_volume ? {
        '/dev/nvme1n1' => get_disk_uuid('/dev/nvme1n1'),
        '/dev/xvdh'    => get_disk_uuid('/dev/xvdh')
      }

      # If data is already mounted, just make sure that everything in fstab is in place.
      # e.g. it is using the UUID as disk identifier.
      mount { '/data':
        ensure  => mounted,
        device  => "UUID=${st_volume_uuid}",
        fstype  => $fstype,
        options => $mount_opts,
        require => File['/data'],
        before  => File[$our_storage::data_dirs],
      }
    }
  }
}
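
As written, the 'Storage: assign UUID to data volume' exec has no 'unless' or 'onlyif' guard, so it runs on every puppet run in which the mountpoints fact does not list /data, even if the UUID is already correct. Whether that has anything to do with the corruption is not established. The sketch below only illustrates the kind of check such a guard could make before calling tune2fs: skip when the device is mounted, and skip when the UUID already matches. The function names (current_uuid, is_mounted, should_set_uuid) are hypothetical and not part of our manifest.

```python
import subprocess


def current_uuid(device):
    """Return the filesystem UUID that blkid reports for device, or ''."""
    result = subprocess.run(
        ["blkid", "-s", "UUID", "-o", "value", device],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout.strip()


def is_mounted(device, mounts_text):
    """True if device is a mount source in /proc/mounts-style text."""
    return any(
        line.split()[0] == device
        for line in mounts_text.splitlines()
        if line.strip()
    )


def should_set_uuid(device, desired, current, mounts_text):
    """Change the UUID only if the device is unmounted and the UUID differs."""
    if is_mounted(device, mounts_text):
        return False
    return current.lower() != desired.lower()
```

On a real host the inputs would come from current_uuid("/dev/xvdh") and the contents of /proc/mounts. Note that /proc/mounts can list a device under a different name than the one passed in (for example a symlink), which this sketch does not handle.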

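We have not been able to work out whether this change is the culprit. It appears to be the only relevant significant change, but we cannot see how it would break things. One thing we can correlate is that the failures seem somewhat biased towards busy systems, where the EBS volumes we have attached have plausibly run out of burst balance and so may be slow.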

We tried to reproduce this on a range of dev systems but could not trigger the same failure.

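I know we could automate the fsck, but that feels like papering over whatever is doing the damage in the first place; what happens if it does more damage than an unattended fsck can repair? We run a large fleet.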

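Is there any known way that running tune2fs on a slow or still-mounting system can corrupt an ext4 filesystem, or is there something else obvious in what we are doing that could cause this corruption? What can we do to determine whether it is the cause? Because the problem is intermittent and not reproducible, and there are other changes in the mix (package updates and so on), we cannot be sure that adding the UUID is the cause, but the timing is certainly suspicious.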
