How do I investigate an ext4 filesystem going read-only (with no hardware errors reported)?

2024-12-2

I have two separate systems experiencing what appears to be the same problem:

"Desktop": i7-7700K, ASUS Prime Z270-A, Ubuntu 22.04, kernel 5.15.0-94-generic.
"Server": NUC8i3BEH, Ubuntu 18.04, kernel 4.15.0-213-generic.

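Both machines originally used 250GB Samsung 970 EVO Plus NVMe drives as their system drives, for over four years and without any problems. The server has previously run continuously for months at a time.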

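In December 2023, because of capacity constraints, both machines were upgraded to 4TB Samsung 990 PRO with Heatsink drives (model MZ-V9P4T0GW). Both drives have a production date of 11 February 2023.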

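I used dd (edit: dd if=/dev/nvme0n1 of=/dev/nvme1n1 bs=128M status=progress) to clone the old 250GB drive directly onto the 4TB drive (with neither drive mounted), then expanded the system partition on the new drive to fill the extra space.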
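
For reference, a clone like this can be checked byte-for-byte straight after copying. The following is a minimal sketch, assuming GNU cmp, stat and blockdev are available; verify_clone is a hypothetical helper name, not something I ran at the time. The check only makes sense before the partition is expanded and before the new drive is booted, since both of those change the contents.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare the destination against the source over the
# source's full length. Works on block devices (needs root) or image files.
verify_clone() {
    local src=$1 dst=$2 size
    if [ -b "$src" ]; then
        size=$(blockdev --getsize64 "$src") || return 2   # device size in bytes
    else
        size=$(stat -c %s "$src") || return 2             # regular file size
    fi
    # The destination is larger than the source, so limit the comparison
    # to the number of bytes that dd actually copied.
    if cmp -n "$size" "$src" "$dst"; then
        echo "identical over $size bytes"
    else
        echo "MISMATCH within first $size bytes"
        return 1
    fi
}

# Example (device names taken from the dd command above):
#   sudo bash -c "$(declare -f verify_clone); verify_clone /dev/nvme0n1 /dev/nvme1n1"
```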

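Once the 4TB drives were installed in their respective systems, both booted normally. Since then, however, both machines periodically drop into a read-only filesystem state; the interval varies, but it typically happens after 2-3 days of uptime. After a reboot the machine runs normally again until the next occurrence.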

Both drives are running the latest Samsung firmware and report no SMART errors. This is the smartctl output from the desktop this morning:

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-94-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 990 PRO with Heatsink 4TB
Serial Number:                      S7HRNJ0WB00087A
Firmware Version:                   4B2QJXD7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 4,000,787,030,016 [4.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       2.0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          4,000,787,030,016 [4.00 TB]
Namespace 1 Utilization:            561,545,682,944 [561 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 4b31404e5a
Local Time is:                      Fri Feb  9 10:45:39 2024 GMT
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0055):     Comp DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x2f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg *Other*
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.39W       -        -    0  0  0  0        0       0
 1 +     9.39W       -        -    1  1  1  1        0       0
 2 +     9.39W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3     4200    2700
 4 -   0.0050W       -        -    4  4  4  4      500   21800

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        29 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    64,414,898 [32.9 TB]
Data Units Written:                 4,803,602 [2.45 TB]
Host Read Commands:                 497,378,636
Host Write Commands:                49,098,349
Controller Busy Time:               965
Power Cycles:                       286
Power On Hours:                     425
Unsafe Shutdowns:                   20
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               29 Celsius
Temperature Sensor 2:               31 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
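
To compare the drive's state from one incident to the next, the handful of counters that most directly indicate trouble can be pulled out of that output. This is a minimal sketch that filters the text format shown above; smart_summary is a hypothetical helper name, and the field labels are the ones smartctl 7.2 prints for this drive.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `smartctl -a` text on stdin and prints only the
# health counters of interest, with the column padding collapsed.
smart_summary() {
    grep -E '^(Critical Warning|Available Spare|Percentage Used|Unsafe Shutdowns|Media and Data Integrity Errors|Error Information Log Entries):' \
        | sed -E 's/:[[:space:]]+/: /'
}

# Example (controller device name is an assumption):
#   sudo smartctl -a /dev/nvme0 | smart_summary
```

In the output above, the only counter that is not at its ideal value is Unsafe Shutdowns (20 in 286 power cycles), which is presumably at least partly the forced reboots after each read-only event.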

With the drives installed in a Windows system, an extended SMART test and a full LBA scan in Samsung Magician reported no errors.

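The filesystem appears to go read-only before any explanatory system log entries can be written, so thus far I have been unable to determine what triggers the problem.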
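
One way around the missing logs is to watch for the remount and dump the kernel ring buffer somewhere that is not on the affected disk. The following is a rough sketch, not something confirmed to catch this fault: is_readonly and capture_on_readonly are hypothetical helper names, the output path is an assumption, and dmesg may need root on recent Ubuntu releases.

```shell
#!/usr/bin/env bash
# Succeeds if the given mount point is currently mounted read-only.
# The optional second argument is a table in /proc/mounts format.
is_readonly() {
    local mountpoint=$1 table=${2:-/proc/mounts}
    awk -v mp="$mountpoint" '
        $2 == mp { opts = $4 }                 # keep the last matching entry
        END {
            # Match "ro" only as a whole option, not inside errors=remount-ro
            if (("," opts ",") ~ /,ro,/) exit 0
            exit 1
        }
    ' "$table"
}

# Poll until the mount point turns read-only, then save the kernel log.
capture_on_readonly() {
    local mountpoint=$1 out=$2
    until is_readonly "$mountpoint"; do sleep 5; done
    dmesg > "$out"
}

# Example: write to tmpfs (lost on reboot, so copy it off before restarting):
#   capture_on_readonly / /dev/shm/dmesg-at-readonly.txt &
```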

Last week I booted a live environment and ran e2fsck -fv on each drive. This found and fixed "inode extent tree could be narrower" errors. The problem has continued to recur since then.
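
When repeating these checks, the e2fsck exit status is worth recording alongside the printed output, because it is a bitmask that distinguishes "clean" from "errors fixed" from "errors left uncorrected". The sketch below decodes it using the values documented in the e2fsck man page; describe_e2fsck_status is a hypothetical helper name and the partition name in the example is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: translate an e2fsck exit status (a sum of flag bits)
# into one line of text per flag that is set.
describe_e2fsck_status() {
    local status=$1 bit
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return
    fi
    for bit in 1 2 4 8 16 32 128; do
        [ $(( status & bit )) -ne 0 ] || continue
        case $bit in
            1)   echo "errors corrected" ;;
            2)   echo "errors corrected, reboot needed" ;;
            4)   echo "errors left uncorrected" ;;
            8)   echo "operational error" ;;
            16)  echo "usage or syntax error" ;;
            32)  echo "canceled by user request" ;;
            128) echo "shared library error" ;;
        esac
    done
}

# Example: a forced check that answers "no" to every repair (-n), run on an
# unmounted filesystem from the live environment:
#   sudo e2fsck -fn /dev/nvme0n1p2; describe_e2fsck_status $?
```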

This morning I hit the problem again on the desktop. I booted a live USB and ran e2fsck -fv again, which reported and fixed more inode extent tree issues.

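After that I rebooted and the system ran normally; however, having left the machine for 10 minutes, I came back to find it had failed again (the shortest uptime so far). The console was full of these errors: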

__ext4_find_entry:1682: inode #2 (<process name>): reading lblock 0
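
When the console is flooded like this, it can help to see whether every message comes from the same kernel code path or from several. A small sketch, assuming the messages have been saved to a file or are being piped from dmesg; summarize_ext4_errors is a hypothetical helper name and the pattern is simply modelled on the line above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads kernel log text on stdin and counts how often
# each "__ext4_<function>:<line>" error site appears, most frequent first.
summarize_ext4_errors() {
    awk '
        match($0, /__ext4_[a-z_]+:[0-9]+/) {
            site[substr($0, RSTART, RLENGTH)]++
        }
        END { for (s in site) print site[s], s }
    ' | sort -rn
}

# Example:
#   sudo dmesg | summarize_ext4_errors
```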

I went back to the live USB environment and ran e2fsck -fv once more; this time it reported no errors and made no changes.

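After I rebooted, the system ran normally (and stayed up for the rest of the day). Meanwhile, the server has accumulated just over 2 days of uptime since its last occurrence.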

I'm now unsure what to do. Could you please suggest how to proceed with the investigation? My thoughts so far:

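My first thought was a physical hardware problem. But every subsequent test has reported the drives as healthy, and it now feels more like a filesystem issue.
It seems notable that everything about these two systems is different, except that they use the same model of drive and went through the same drive migration process.
I wonder whether dd caused the corruption, even though no errors were reported at the time. I'm not (yet) sure how to investigate or resolve that.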
