btrfs self-destructed without any hardware problem? How do I recover?

2024-6-9

I have Manjaro KDE installed on a LUKS-encrypted SSD formatted with BTRFS. The system is mostly up to date; the last update was maybe a week or two ago.

The filesystem suddenly switched to read-only, and dmesg showed dozens of errors like this:

parent transid verify failed on * wanted * found *

(Very annoying, by the way. Firefox can't load pages without a writable disk, which is how I first noticed the problem.)

I tried rebooting, and the boot process hit the same problem: the filesystem could only be mounted read-only, so I ended up in an emergency root shell.

I am now in a live session booted from USB. I was able to back up all my data (most of it was already backed up anyway, but I wanted to be sure).

I tried to repair the filesystem so that I wouldn't have to reinstall the system, but nothing has worked so far. I tried:

# btrfs check --repair /dev/mapper/luks-455a911f-2d10-4548-a671-e1d4b8295bce 
enabling repair mode
WARNING:

        Do not use --repair unless you are advised to do so by a developer
        or an experienced user, and then only after having accepted that no
        fsck can successfully repair all types of filesystem corruption. Eg.
        some software or hardware bugs can fatally damage a volume.
        The operation will start in 10 seconds.
        Use Ctrl-C to stop it.
10 9 8 7 6 5 4 3 2 1
Starting repair.
Opening filesystem to check...
Checking filesystem on /dev/mapper/luks-455a911f-2d10-4548-a671-e1d4b8295bce
UUID: 4dd8f7e8-5ffb-4405-b3d8-789ea877483d
[1/7] checking root items
parent transid verify failed on 429604962304 wanted 724347 found 724505
parent transid verify failed on 429604962304 wanted 724347 found 724505
Ignoring transid failure
parent transid verify failed on 429597687808 wanted 722745 found 724505
parent transid verify failed on 429597687808 wanted 722745 found 724505
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=449465008128 item=361 parent level=1 child bytenr=429597687808 child level=1
ERROR: failed to repair root items: Input/output error
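To compare the generation numbers in these messages at a glance, the following is a minimal sketch that condenses repeated "parent transid verify failed" lines into one line per block address. In these messages, "wanted" is the generation the parent node recorded for the child block and "found" is the generation actually stored in the block that was read. The function name summarize_transid and the "gap" column are illustrative and not part of btrfs-progs.

```shell
# Summarize "parent transid verify failed" lines read from stdin.
# Prints one line per distinct (bytenr, wanted, found) triple, in the
# order first seen. "gap" is simply found minus wanted.
# summarize_transid is a hypothetical helper name, not a btrfs tool.
summarize_transid() {
    awk '
        /parent transid verify failed on/ {
            bytenr = ""; wanted = ""; found = ""
            for (i = 1; i < NF; i++) {
                if ($i == "failed" && $(i + 1) == "on") bytenr = $(i + 2)
                if ($i == "wanted") wanted = $(i + 1)
                if ($i == "found")  found  = $(i + 1)
            }
            key = bytenr " " wanted " " found
            if (!(key in seen)) {
                seen[key] = 1
                printf "bytenr=%s wanted=%s found=%s gap=%d\n", bytenr, wanted, found, found - wanted
            }
        }
    '
}

# Example usage (run manually): dmesg | summarize_transid
```

Fed the check output above, this would report two blocks, 429604962304 and 429597687808, both found at generation 724505 while their parents expected 724347 and 722745 respectively.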

Also:

# btrfs scrub start /mnt/1/
scrub started on /mnt/1/, fsid 4dd8f7e8-5ffb-4405-b3d8-789ea877483d (pid=16297)
# btrfs scrub status /mnt/1/
UUID:             4dd8f7e8-5ffb-4405-b3d8-789ea877483d
Scrub started:    Tue Sep 21 11:31:18 2021
Status:           aborted
Duration:         0:00:01
Total to scrub:   446.54GiB
Rate:             2.47GiB/s
Error summary:    no errors found

dmesg output for the btrfs scrub:

[Sep21 09:32] BTRFS info (device dm-1): scrub: started on devid 1
[  +1.497704] BTRFS error (device dm-1): parent transid verify failed on 429604962304 wanted 724347 found 724505
[  +0.001083] BTRFS info (device dm-1): scrub: not finished on devid 1 with status: -5

Also:

# btrfs filesystem balance start -dusage=60 -musage=60 /mnt/1
ERROR: error during balancing '/mnt/1': Input/output error
There may be more info in syslog - try dmesg | tail

dmesg during the balance:

[Sep21 09:34] BTRFS info (device dm-1): disk space caching is enabled
[  +0.000011] BTRFS info (device dm-1): has skinny extents
[  +0.175630] BTRFS info (device dm-1): enabling ssd optimizations
[  +6.367553] BTRFS error (device dm-1): csum mismatch on free space cache
[  +0.000004] BTRFS warning (device dm-1): failed to load free space cache for block group 19349372928, rebuilding it now
[  +0.001476] BTRFS error (device dm-1): space cache generation (724505) does not match inode (720863)
[  +0.000003] BTRFS warning (device dm-1): failed to load free space cache for block group 21496856576, rebuilding it now
[  +0.149606] BTRFS info (device dm-1): balance: start -dusage=60 -musage=60 -susage=60
[  +0.000677] BTRFS info (device dm-1): relocating block group 670257119232 flags data
[  +1.210621] BTRFS info (device dm-1): found 7370 extents, stage: move data extents
[  +0.033649] BTRFS error (device dm-1): parent transid verify failed on 429608239104 wanted 724347 found 724505
[  +0.007518] BTRFS info (device dm-1): balance: ended with status: -5

Since none of the dmesg lines looks like a hardware problem, I suspect this is just a btrfs bug that made it self-destruct... :(

Also, SMART looks fine:

# nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                        : 0
temperature                             : 39 C (312 Kelvin)
available_spare                         : 100%
available_spare_threshold               : 10%
percentage_used                         : 0%
endurance group critical warning summary: 0
data_units_read                         : 5,042,443
data_units_written                      : 11,709,865
host_read_commands                      : 34,269,040
host_write_commands                     : 109,286,265
controller_busy_time                    : 284
power_cycles                            : 482
power_on_hours                          : 417
unsafe_shutdowns                        : 13
media_errors                            : 0
num_err_log_entries                     : 1
Warning Temperature Time                : 0
Critical Composite Temperature Time     : 0
Thermal Management T1 Trans Count       : 0
Thermal Management T2 Trans Count       : 0
Thermal Management T1 Total Time        : 0
Thermal Management T2 Total Time        : 0
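As a quick way to scan a log like this, here is a minimal sketch that prints only a few counters when they are non-zero. The choice of counters (critical_warning, media_errors, unsafe_shutdowns, num_err_log_entries) and the function name flag_smart_counters are assumptions made for illustration; a non-zero value is a prompt to look closer, not proof of a cause.

```shell
# Print selected nvme smart-log counters whose value is non-zero.
# Reads the plain-text output of `nvme smart-log` from stdin.
# flag_smart_counters is a hypothetical helper name; the list of
# counters checked here is an illustrative choice.
flag_smart_counters() {
    awk -F':' '
        {
            name = $1
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim field name
            value = $2
            gsub(/[ \t,]/, "", value)           # drop spaces and thousands separators
        }
        (name == "critical_warning" || name == "media_errors" ||
         name == "unsafe_shutdowns" || name == "num_err_log_entries") &&
        value + 0 != 0 {
            printf "%s=%s\n", name, value
        }
    '
}

# Example usage (run manually): nvme smart-log /dev/nvme0 | flag_smart_counters
```

For the log above, this would print unsafe_shutdowns=13 and num_err_log_entries=1. The unsafe_shutdowns counter records power losses that were not preceded by a shutdown notification to the drive.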
