Entire root filesystem becomes read-only

2024-5-16

This problem first appeared while I was downloading a huge data archive (about 500 GB), but I did not keep the output or logs at the time.

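I rebooted the system and it dropped into emergency mode automatically. I ran fsck, which fixed the problem, but a few hours later it happened again. This time I found that the entire root fs, even /tmp, was read-only (someone suggested I check this). This is the last output of dmesg: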

[35761.273361] ata4.00: exception Emask 0x0 SAct 0x1800 SErr 0x0 action 0x0
[35761.273373] ata4.00: irq_stat 0x40000008
[35761.273379] ata4.00: failed command: READ FPDMA QUEUED
[35761.273386] ata4.00: cmd 60/00:58:c0:31:a1/02:00:38:00:00/40 tag 11 ncq dma 262144 in
                        res 41/40:00:f3:31:a1/00:00:38:00:00/40 Emask 0x409 (media error) <F>
[35761.273394] ata4.00: status: { DRDY ERR }
[35761.273398] ata4.00: error: { UNC }
[35761.276060] ata4.00: configured for UDMA/133
[35761.276077] sd 3:0:0:0: [sdb] tag#11 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[35761.276083] sd 3:0:0:0: [sdb] tag#11 Sense Key : Medium Error [current]
[35761.276089] sd 3:0:0:0: [sdb] tag#11 Add. Sense: Unrecovered read error - auto reallocate failed
[35761.276095] sd 3:0:0:0: [sdb] tag#11 CDB: Read(16) 88 00 00 00 00 00 38 a1 31 c0 00 00 02 00 00 00
[35761.276101] print_req_error: I/O error, dev sdb, sector 950088179
[35761.276117] ata4: EH complete
[38523.236782] ata4.00: exception Emask 0x0 SAct 0x18080 SErr 0x0 action 0x0
[38523.236793] ata4.00: irq_stat 0x40000001
[38523.236797] ata4.00: failed command: READ FPDMA QUEUED
[38523.236802] ata4.00: cmd 60/08:38:f0:31:a1/00:00:38:00:00/40 tag 7 ncq dma 4096 in
                        res 41/40:00:f3:31:a1/00:00:38:00:00/40 Emask 0x409 (media error) <F>
[38523.236807] ata4.00: status: { DRDY ERR }
[38523.236810] ata4.00: error: { UNC }
[38523.236813] ata4.00: failed command: WRITE FPDMA QUEUED
[38523.236821] ata4.00: cmd 61/40:78:80:b9:81/09:00:30:00:00/40 tag 15 ncq dma 1212416 ou
                        res 41/40:00:00:00:00/00:00:00:00:00/00 Emask 0x9 (media error)
[38523.236825] ata4.00: status: { DRDY ERR }
[38523.236828] ata4.00: error: { UNC }
[38523.236830] ata4.00: failed command: WRITE FPDMA QUEUED
[38523.236834] ata4.00: cmd 61/70:80:e8:4d:9f/00:00:ea:00:00/40 tag 16 ncq dma 57344 out
                        res 41/40:00:00:00:00/00:00:00:00:00/00 Emask 0x9 (media error)
[38523.236838] ata4.00: status: { DRDY ERR }
[38523.236840] ata4.00: error: { UNC }
[38523.238584] ata4.00: configured for UDMA/133
[38523.238607] sd 3:0:0:0: [sdb] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[38523.238615] sd 3:0:0:0: [sdb] tag#7 Sense Key : Medium Error [current]
[38523.238622] sd 3:0:0:0: [sdb] tag#7 Add. Sense: Unrecovered read error - auto reallocate failed
[38523.238628] sd 3:0:0:0: [sdb] tag#7 CDB: Read(16) 88 00 00 00 00 00 38 a1 31 f0 00 00 00 08 00 00
[38523.238634] print_req_error: I/O error, dev sdb, sector 950088179
[38523.238659] sd 3:0:0:0: [sdb] tag#15 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[38523.238664] sd 3:0:0:0: [sdb] tag#15 Sense Key : Medium Error [current]
[38523.238668] sd 3:0:0:0: [sdb] tag#15 Add. Sense: Unrecovered read error - auto reallocate failed
[38523.238674] sd 3:0:0:0: [sdb] tag#15 CDB: Write(16) 8a 00 00 00 00 00 30 81 b9 80 00 00 09 40 00 00
[38523.238679] print_req_error: I/O error, dev sdb, sector 813808000
[38523.238687] EXT4-fs warning (device sdb3): ext4_end_bio:323: I/O error 10 writing to inode 56511830 (offset 26411008 size 1212416 starting block 101726296)
[38523.238694] Buffer I/O error on device sdb3, logical block 93788464
[38523.238704] Buffer I/O error on device sdb3, logical block 93788465
[38523.238708] Buffer I/O error on device sdb3, logical block 93788466
[38523.238713] Buffer I/O error on device sdb3, logical block 93788467
[38523.238717] Buffer I/O error on device sdb3, logical block 93788468
[38523.238722] Buffer I/O error on device sdb3, logical block 93788469
[38523.238728] Buffer I/O error on device sdb3, logical block 93788470
[38523.238733] Buffer I/O error on device sdb3, logical block 93788471
[38523.238738] Buffer I/O error on device sdb3, logical block 93788472
[38523.238747] Buffer I/O error on device sdb3, logical block 93788473
[38523.238982] JBD2: Detected IO errors while flushing file data on sdb3-8
[38523.238984] sd 3:0:0:0: [sdb] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[38523.238995] sd 3:0:0:0: [sdb] tag#16 Sense Key : Medium Error [current]
[38523.238999] sd 3:0:0:0: [sdb] tag#16 Add. Sense: Unrecovered read error - auto reallocate failed
[38523.239005] sd 3:0:0:0: [sdb] tag#16 CDB: Write(16) 8a 00 00 00 00 00 ea 9f 4d e8 00 00 00 70 00 00
[38523.239010] print_req_error: I/O error, dev sdb, sector 3936308712
[38523.239026] ata4: EH complete
[38523.239032] Aborting journal on device sdb3-8.
[38523.239045] EXT4-fs (sdb3): Delayed block allocation failed for inode 56511830 at logical offset 6748 with max blocks 120 with error 30
[38523.239055] EXT4-fs (sdb3): This should not happen!! Data will be lost

[38523.239643] EXT4-fs error (device sdb3) in ext4_writepages:2906: IO failure
[38523.296445] EXT4-fs (sdb3): Remounting filesystem read-only
[38523.296477] EXT4-fs error (device sdb3): ext4_journal_check_start:61: Detected aborted journal
[38525.832744] ata4.00: exception Emask 0x0 SAct 0x30 SErr 0x0 action 0x0
[38525.833100] ata4.00: irq_stat 0x40000008
[38525.833365] ata4.00: failed command: READ FPDMA QUEUED
[38525.833629] ata4.00: cmd 60/80:20:c0:3b:a1/00:00:38:00:00/40 tag 4 ncq dma 65536 in
                        res 41/40:00:e7:3b:a1/00:00:38:00:00/40 Emask 0x409 (media error) <F>
[38525.834152] ata4.00: status: { DRDY ERR }
[38525.834415] ata4.00: error: { UNC }
[38525.836456] ata4.00: configured for UDMA/133
[38525.836737] sd 3:0:0:0: [sdb] tag#4 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[38525.837001] sd 3:0:0:0: [sdb] tag#4 Sense Key : Medium Error [current]
[38525.837267] sd 3:0:0:0: [sdb] tag#4 Add. Sense: Unrecovered read error - auto reallocate failed
[38525.837531] sd 3:0:0:0: [sdb] tag#4 CDB: Read(16) 88 00 00 00 00 00 38 a1 3b c0 00 00 00 80 00 00
[38525.837796] print_req_error: I/O error, dev sdb, sector 950090727
[38525.838072] ata4: EH complete
[38528.260746] ata4.00: exception Emask 0x0 SAct 0x400 SErr 0x0 action 0x0
[38528.261092] ata4.00: irq_stat 0x40000008
[38528.261357] ata4.00: failed command: READ FPDMA QUEUED
[38528.261623] ata4.00: cmd 60/08:50:e0:3b:a1/00:00:38:00:00/40 tag 10 ncq dma 4096 in
                        res 41/40:00:e7:3b:a1/00:00:38:00:00/40 Emask 0x409 (media error) <F>
[38528.262144] ata4.00: status: { DRDY ERR }
[38528.262405] ata4.00: error: { UNC }
[38528.264870] ata4.00: configured for UDMA/133
[38528.265149] sd 3:0:0:0: [sdb] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[38528.265410] sd 3:0:0:0: [sdb] tag#10 Sense Key : Medium Error [current]
[38528.265668] sd 3:0:0:0: [sdb] tag#10 Add. Sense: Unrecovered read error - auto reallocate failed
[38528.265923] sd 3:0:0:0: [sdb] tag#10 CDB: Read(16) 88 00 00 00 00 00 38 a1 3b e0 00 00 00 08 00 00
[38528.266182] print_req_error: I/O error, dev sdb, sector 950090727
[38528.266459] ata4: EH complete
[54010.452717] EXT4-fs error (device sdb3): ext4_remount:5338: Abort forced by user
[56341.190097] EXT4-fs error (device sdb3): ext4_remount:5338: Abort forced by user
[56572.048951] EXT4-fs error (device sdb3): ext4_remount:5338: Abort forced by user
[56633.963486] EXT4-fs error (device sdb3): ext4_remount:5338: Abort forced by user
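
A log like this is easier to read once the failing sectors are tallied. The following is a minimal Python sketch, not part of the original post, that counts the `I/O error, dev ..., sector ...` lines per sector. The function name `failing_sectors` and the regular expression are assumptions made for illustration; the sample lines are copied from the output above.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   print_req_error: I/O error, dev sdb, sector 950088179
# It deliberately does not match "Buffer I/O error on device ..." lines.
IO_ERROR_RE = re.compile(r"I/O error, dev (\w+), sector (\d+)")


def failing_sectors(dmesg_text):
    """Count how often each (device, sector) pair appears in I/O error lines."""
    counts = Counter()
    for match in IO_ERROR_RE.finditer(dmesg_text):
        device, sector = match.group(1), int(match.group(2))
        counts[(device, sector)] += 1
    return counts


# Sample lines taken from the dmesg output in the question.
dmesg_sample = """\
[35761.276101] print_req_error: I/O error, dev sdb, sector 950088179
[38523.238634] print_req_error: I/O error, dev sdb, sector 950088179
[38523.238679] print_req_error: I/O error, dev sdb, sector 813808000
[38523.239010] print_req_error: I/O error, dev sdb, sector 3936308712
[38525.837796] print_req_error: I/O error, dev sdb, sector 950090727
[38528.266182] print_req_error: I/O error, dev sdb, sector 950090727
"""

for (device, sector), count in sorted(failing_sectors(dmesg_sample).items()):
    print(f"{device} sector {sector}: {count} error(s)")
```

In this log the read errors keep hitting the same two sectors (950088179 and 950090727), which is consistent with a small set of unreadable sectors on the disk rather than a flaky cable or controller, though the tally alone does not prove it.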

After this, no more messages were logged, because the entire rootfs had become read-only. Then I tried:

# mount / -o remount,rw
mount: /: cannot remount /dev/sdb3 read-write, is write-protected.

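It failed, but fortunately the partitions on the other disks were fine, so I was able to upload a locally built smartctl to the server and look at the SMART information: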

# /other/smartmontools-7.2/smartctl -a /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-4.19.0-14-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Gold
Device Model:     WDC WD4002FYYZ-01B7CB0
Serial Number:    WD-N8G6724Y
LU WWN Device Id: 5 0014ee 25f546502
Firmware Version: 01.01K03
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jun 29 23:11:25 2021 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (49440) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 533) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   198   197   051    Pre-fail  Always       -       14
  3 Spin_Up_Time            0x0027   175   149   021    Pre-fail  Always       -       10208
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       29
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   079   079   000    Old_age   Always       -       15352
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       29
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       9
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       19
194 Temperature_Celsius     0x0022   109   098   000    Old_age   Always       -       43
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       10
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       6
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       8

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     15351         788280736
# 2  Short offline       Completed: read failure       90%     15327         788280736
# 3  Short offline       Completed without error       00%        97         -
# 4  Extended offline    Aborted by host               90%        34         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

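I am not entirely sure this is a disk problem, because I cannot make much sense of this information. English is not my native language, so please excuse my wording :/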

Answer 1

According to your SMART log, you have bad sectors; in other words, your drive is dying. How fast? Nobody knows. Always make backups and test them.

197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       10

# 1  Extended offline    Completed: read failure       90%     15351         788280736
# 2  Short offline       Completed: read failure       90%     15327         788280736
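
As a rough illustration of how these lines can be picked out of a full `smartctl -a` report, here is a minimal Python sketch. It is not from the original answer: the helper `suspicious_attributes` and the choice of watched attributes are assumptions for the example, and the sample rows are copied from the question's output.

```python
# Attributes whose non-zero raw value usually points at media problems.
# This selection is an assumption for the sketch, not an official list.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def suspicious_attributes(smartctl_output):
    """Return {attribute_name: raw_value} for watched attributes with raw value > 0."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns: ID, name, flag, value, worst,
        # thresh, type, updated, when_failed, raw value.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            found[name] = int(raw)
    return found


# Rows copied from the smartctl output in the question.
smart_sample = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       10
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       6
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

print(suspicious_attributes(smart_sample))
```

Note that the overall health line in the report still says PASSED. Pending sectors and failed self-tests can show up well before that summary changes, so the individual attributes are worth checking.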

You can use these guides to try to force your drive to reallocate the bad sectors:

In any case, you need to run e2fsck -c on the affected partition.
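
Manual reallocation procedures typically begin by translating the failing LBA into a block number inside the filesystem. The Python sketch below shows that arithmetic only. It is not part of the original answer, and it rests on several assumptions: the function name `lba_to_fs_block` is made up for the example, the partition start sector is a hypothetical placeholder (the real value is in the partition table, for example in `/sys/block/sdb/sdb3/start`), and the 4096-byte filesystem block size is a guess that should be checked against the actual filesystem. The post also does not say whether this LBA lies inside sdb3.

```python
def lba_to_fs_block(lba, partition_start, sector_size=512, block_size=4096):
    """Map an absolute disk LBA to a block number inside the partition's filesystem."""
    if lba < partition_start:
        raise ValueError("LBA lies before the start of the partition")
    # Byte offset inside the partition, divided by the filesystem block size.
    return (lba - partition_start) * sector_size // block_size


# LBA_of_first_error from the self-test log in the question.
failing_lba = 788280736
# HYPOTHETICAL start sector; read the real one from the partition table.
partition_start = 2048

print(lba_to_fs_block(failing_lba, partition_start))
```

The 512-byte sector size matches the "Sector Size" line reported by smartctl in the question; the other two inputs have to come from the real system before the result means anything.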

Wikipedia has a good overview of SMART: https://en.wikipedia.org/wiki/SMART.
