BTRFS RAID5: disk failed during balance

2024-6-19

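If you followed a link to this thread: thank you!

I have the following setup:

6x 500GB HDDs
1x 32GB NVMe SSD (Intel Optane)

I used bcache to set up the SSD as the caching device, with the six HDDs as backing devices. Once that was all in place, I formatted the six HDDs with btrfs in RAID5. Everything has run fine for the past 7 months.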

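Now I have 6 spare 2TB HDDs and want to replace the old 500GB disks one by one. I started with the first one and removed it from the btrfs filesystem. That worked without any problems. After that, I cleanly detached the now-empty disk from bcache; everything was still fine, so I physically removed it. Here are the commands for this step: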

sudo btrfs device delete /dev/bcacheX /media/raid
cat /sys/block/bcacheX/bcache/state
cat /sys/block/bcacheX/bcache/dirty_data
sudo sh -c "echo 1 > /sys/block/bcacheX/bcache/detach"
cat /sys/block/bcacheX/bcache/state
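
The detach step above relies on reading `state` by eye before and after. As a minimal sketch, the same check could be turned into a guard. This assumes the sysfs layout used in the commands above (`.../bcache/state`) and the state values documented for bcache (`no cache`, `clean`, `dirty`, `inconsistent`). The function name `bcache_is_clean` and its directory argument are hypothetical, made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper (not part of bcache): succeed only when the backing
# device reports a state in which no cached writes are still pending.
# $1 is the device's bcache sysfs directory, e.g. /sys/block/bcacheX/bcache
bcache_is_clean() {
    state_file="$1/state"
    [ -r "$state_file" ] || return 2        # not a bcache device, or wrong path
    state=$(cat "$state_file")
    case "$state" in
        clean|"no cache") return 0 ;;       # nothing left to write back
        *)                return 1 ;;       # dirty or inconsistent: do not detach
    esac
}
```

It could be used as `bcache_is_clean /sys/block/bcacheX/bcache && sudo sh -c "echo 1 > /sys/block/bcacheX/bcache/detach"`, so the detach only happens once the cache has been flushed.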

Then I installed one of the 2TB drives, attached it to bcache, and added it to the RAID. The next step was to balance the data onto the new drive. The commands:

sudo make-bcache -B /dev/sdY
sudo sh -c "echo '60a63f7c-2e68-4503-9f25-71b6b00e47b2' > /sys/block/bcacheY/bcache/attach"
sudo sh -c "echo writeback > /sys/block/bcacheY/bcache/cache_mode"
sudo btrfs device add /dev/bcacheY /media/raid
sudo btrfs fi ba start /media/raid/

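The balance ran fine until about 164GB had been written to the new drive, roughly 50% of the data to be balanced. Then disk write errors suddenly appeared. The RAID slowly became unusable (I was running 3 VMs off the RAID during the balance). I think it kept working for a while because the SSD was absorbing the writes. At some point the balance stalled and all I could do was kill the VMs. I checked the disk I/O, and the SSD was being read at a constant 1.2 GB/s. My guess is that bcache kept handing data to btrfs, which rejected it and requested it again, but that is only a guess. In the end I reset the host, physically disconnected the failed disk, and put in a new one. I created a bcache backing device on it as well and issued the following command to replace the failed disk: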

sudo btrfs replace start -r 7 /dev/bcache5 /media/raid

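The filesystem has to be mounted read/write for this command to run. It is working now, but very slowly, at about 3.5 MB/s. Unfortunately, the syslog reports a lot of messages like these: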

...
scrub_missing_raid56_worker: 62 callbacks suppressed
BTRFS error (device bcache0): failed to rebuild valid logical 4929143865344 for dev (null)
...
BTRFS error (device bcache0): failed to rebuild valid logical 4932249866240 for dev (null)
scrub_missing_raid56_worker: 1 callbacks suppressed
BTRFS error (device bcache0): failed to rebuild valid logical 4933254250496 for dev (null)
....

If I try to read a file from the filesystem, the command fails with a plain I/O error, and the syslog shows entries like this:

BTRFS warning (device bcache0): csum failed root 5 ino 1143 off 7274496 csum 0xccccf554 expected csum 0x6340b527 mirror 2
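
The `csum failed` warning names the subvolume root and the inode, but not the file. A small sketch for pulling the distinct root/inode pairs out of such lines so they can be looked up afterwards. The function name `csum_failed_inodes` is hypothetical, and the message layout is assumed to match the line above:

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and print each distinct
# "<root> <inode>" pair that appears in a "csum failed" warning.
csum_failed_inodes() {
    sed -n 's/.*csum failed root \([0-9][0-9]*\) ino \([0-9][0-9]*\) .*/\1 \2/p' | sort -u
}
```

For example, `dmesg | csum_failed_inodes` lists the pairs. For root 5 (the top-level subvolume), an inode can then be mapped to a path with `sudo btrfs inspect-internal inode-resolve 1143 /media/raid`.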

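So far, so good (or bad). The replace has taken about 6 hours to reach 4.3%. The replace status ("btrfs replace status") reports no read or write errors. I will let it run until it finishes. Before the first 2TB disk failed, 164 GB of data had been written to it according to "btrfs filesystem show". If I check the amount of data written to the new drive, the 4.3% corresponds to about 82 GB (according to /proc/diskstats). I don't know how to interpret that, but anyway.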

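Now, finally, my questions: if the replace command finishes successfully, what should I do next? Scrub? Balance? Another backup? ;-) Do you see anything I did wrong in this process? Do the warnings and errors reported by btrfs mean that data has already been lost? :-(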

Here is some additional information (edit):

$ sudo btrfs fi sh
Total devices 7 FS bytes used 1.56TiB
Label: none  uuid: 9f765025-5354-47e4-afcc-a601b2a52703
devid    0 size 1.82TiB used 164.03GiB path /dev/bcache5
devid    1 size 465.76GiB used 360.03GiB path /dev/bcache4
devid    3 size 465.76GiB used 360.00GiB path /dev/bcache3
devid    4 size 465.76GiB used 359.03GiB path /dev/bcache1
devid    5 size 465.76GiB used 360.00GiB path /dev/bcache0
devid    6 size 465.76GiB used 360.03GiB path /dev/bcache2
*** Some devices missing

$ sudo btrfs dev stats /media/raid/
[/dev/bcache5].write_io_errs    0
[/dev/bcache5].read_io_errs     0
[/dev/bcache5].flush_io_errs    0
[/dev/bcache5].corruption_errs  0
[/dev/bcache5].generation_errs  0
[/dev/bcache4].write_io_errs    0
[/dev/bcache4].read_io_errs     0
[/dev/bcache4].flush_io_errs    0
[/dev/bcache4].corruption_errs  0
[/dev/bcache4].generation_errs  0
[/dev/bcache3].write_io_errs    0
[/dev/bcache3].read_io_errs     0
[/dev/bcache3].flush_io_errs    0
[/dev/bcache3].corruption_errs  0
[/dev/bcache3].generation_errs  0
[/dev/bcache1].write_io_errs    0
[/dev/bcache1].read_io_errs     0
[/dev/bcache1].flush_io_errs    0
[/dev/bcache1].corruption_errs  0
[/dev/bcache1].generation_errs  0
[/dev/bcache0].write_io_errs    0
[/dev/bcache0].read_io_errs     0
[/dev/bcache0].flush_io_errs    0
[/dev/bcache0].corruption_errs  0
[/dev/bcache0].generation_errs  0
[/dev/bcache2].write_io_errs    0
[/dev/bcache2].read_io_errs     0
[/dev/bcache2].flush_io_errs    0
[/dev/bcache2].corruption_errs  0
[/dev/bcache2].generation_errs  0
[devid:7].write_io_errs    9525186
[devid:7].read_io_errs     10136573
[devid:7].flush_io_errs    143
[devid:7].corruption_errs  0
[devid:7].generation_errs  0
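
With this many counters, the non-zero ones are easy to miss. A minimal sketch of a filter that keeps only those, assuming the two-column layout shown above (counter name, then value). The function name `nonzero_dev_stats` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: read `btrfs device stats` output on stdin and keep only
# the counters whose value (second whitespace-separated field) is not zero.
nonzero_dev_stats() {
    awk 'NF == 2 && $2 + 0 != 0 { print $1, $2 }'
}
```

Run as `sudo btrfs dev stats /media/raid/ | nonzero_dev_stats`. On the output above, only the write, read, and flush counters of the missing `devid:7` would remain.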

$ sudo btrfs fi df /media/raid/
Data, RAID5: total=1.56TiB, used=1.55TiB
System, RAID1: total=64.00MiB, used=128.00KiB
Metadata, RAID1: total=4.00GiB, used=2.48GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

$ uname -a
Linux hostname 4.15.0-36-generic #39-Ubuntu SMP Mon Sep 24 16:19:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ btrfs --version
btrfs-progs v4.15.1

$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.1 LTS"

Thanks again for reading; I hope to get your comments/answers!

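Edit 2

The device replace just finished. It ended near the 9% mark, and I think that percentage matches the amount of data written to the drive: 164 GiB out of a total size of 1.82 TiB. So 100% would have meant replacing the full 2TB. Here is some additional output: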

$ btrfs replace status -1 /media/raid/
Started on 30.Oct 08:16:53, finished on 30.Oct 21:05:22, 0 write errs, 0 uncorr. read errs

$ sudo btrfs fi sh
Label: none  uuid: 9f765025-5354-47e4-afcc-a601b2a52703
Total devices 6 FS bytes used 1.56TiB
devid    1 size 465.76GiB used 360.03GiB path /dev/bcache4
devid    3 size 465.76GiB used 360.00GiB path /dev/bcache3
devid    4 size 465.76GiB used 359.03GiB path /dev/bcache1
devid    5 size 465.76GiB used 360.00GiB path /dev/bcache0
devid    6 size 465.76GiB used 360.03GiB path /dev/bcache2
devid    7 size 1.82TiB used 164.03GiB path /dev/bcache5
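
The reading of the progress percentage given above can be checked with a quick calculation. This is only a sketch, under the assumption (the one made in this post, not something confirmed by the btrfs documentation here) that progress is measured against the full size of the target device. The sizes are taken from the "btrfs fi sh" output; the function names are hypothetical:

```shell
#!/bin/sh
# Figures from the `btrfs fi sh` output: the new device is 1.82 TiB in size
# and has 164.03 GiB allocated.
dev_size_gib=$(awk 'BEGIN { printf "%.2f", 1.82 * 1024 }')

# Percentage of the device that is allocated: used / size * 100.
pct_used() {
    awk -v used="$1" -v size="$2" 'BEGIN { printf "%.1f", used / size * 100 }'
}

# GiB that correspond to a given progress percentage of the whole device.
gib_at_pct() {
    awk -v pct="$1" -v size="$2" 'BEGIN { printf "%.1f", pct / 100 * size }'
}

echo "allocated share of device: $(pct_used 164.03 "$dev_size_gib")%"
echo "4.3% of device: $(gib_at_pct 4.3 "$dev_size_gib") GiB"
```

Under that assumption, the allocated share comes out just under 9%, and 4.3% of the device is about 80 GiB. That is in the same range as the roughly 82 GB seen in /proc/diskstats at the 4.3% mark.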

Reading files still aborts with I/O errors, and the syslog still shows:

BTRFS warning (device bcache0): csum failed root 5 ino 1143 off 7274496 csum 0x98f94189 expected csum 0x6340b527 mirror 1
BTRFS warning (device bcache0): csum failed root 5 ino 1143 off 7274496 csum 0xccccf554 expected csum 0x6340b527 mirror 2

So I figured the most harmless action would be a read-only scrub, which I have just started. Errors and warnings are flooding the syslog:

$ sudo btrfs scrub start -BdrR /media/raid # -B no background, -d statistics per device, -r read-only, -R raw statistics per device
$ tail -f /var/log/syslog
BTRFS error (device bcache0): bdev /dev/bcache5 errs: wr 0, rd 0, flush 0, corrupt 2848, gen 0
BTRFS warning (device bcache0): checksum error at logical 4590109331456 on dev /dev/bcache5, physical 2954104832, root 5, inode 418, offset 1030803456, length 4096, links 1 (path: VMs/Virtualbox/Windows 10 Imaging VMs/Windows 10 Imaging/Windows 10 Imaging-fixed.vdi)
BTRFS error (device bcache0): bdev /dev/bcache5 errs: wr 0, rd 0, flush 0, corrupt 2849, gen 0
BTRFS warning (device bcache0): checksum error at logical 4590108811264 on dev /dev/bcache5, physical 2953977856, root 5, inode 1533, offset 93051236352, length 4096, links 1 (path: VMs/Virtualbox/vmrbreb/vmrbreb-fixed.vdi)
BTRFS error (device bcache0): bdev /dev/bcache5 errs: wr 0, rd 0, flush 0, corrupt 2850, gen 0
BTRFS warning (device bcache0): checksum error at logical 4590109335552 on dev /dev/bcache5, physical 2954108928, root 5, inode 418, offset 1030807552, length 4096, links 1 (path: VMs/Virtualbox/Windows 10 Imaging VMs/Windows 10 Imaging/Windows 10 Imaging-fixed.vdi)
BTRFS error (device bcache0): bdev /dev/bcache5 errs: wr 0, rd 0, flush 0, corrupt 2851, gen 0
BTRFS warning (device bcache0): checksum error at logical 4590108815360 on dev /dev/bcache5, physical 2953981952, root 5, inode 621, offset 11864412160, length 4096, links 1 (path: VMs/Virtualbox/Win102016_Alter-Firefox/Win102016_Alter-Firefox-disk1.vdi)
BTRFS error (device bcache0): bdev /dev/bcache5 errs: wr 0, rd 0, flush 0, corrupt 2852, gen 0
BTRFS warning (device bcache0): checksum error at logical 4590109339648 on dev /dev/bcache5, physical 2954113024, root 5, inode 418, offset 1030811648, length 4096, links 1 (path: VMs/Virtualbox/Windows 10 Imaging VMs/Windows 10 Imaging/Windows 10 Imaging-fixed.vdi)
BTRFS error (device bcache0): bdev /dev/bcache5 errs: wr 0, rd 0, flush 0, corrupt 2853, gen 0
BTRFS warning (device bcache0): checksum error at logical 4590109343744 on dev /dev/bcache5, physical 2954117120, root 5, inode 418, offset 1030815744, length 4096, links 1 (path: VMs/Virtualbox/Windows 10 Imaging VMs/Windows 10 Imaging
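
Since the scrub warnings include the file path, they can be condensed into a per-file error count to see which files are hit. A minimal sketch, assuming the message layout shown above. The function name `scrub_errors_per_file` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: read scrub warnings on stdin and print how many
# checksum errors were logged per file, most affected file first.
# Lines whose "(path: ...)" part is cut off before the closing parenthesis
# are skipped.
scrub_errors_per_file() {
    sed -n 's/.*checksum error at .*(path: \(.*\))$/\1/p' | sort | uniq -c | sort -rn
}
```

For example: `grep 'checksum error at' /var/log/syslog | scrub_errors_per_file`. This only summarizes what the kernel logged; it says nothing about whether the affected blocks are repairable.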

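My questions remain: what should I do next? Scrub? Balance? Did I do something wrong? How should I interpret the errors and warnings from the read-only scrub, and can btrfs repair them?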
