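This problem first appeared while I was downloading a huge data archive (about 500 GB), but I did not keep the output or logs at the time.
I rebooted the system and it automatically entered emergency mode. I ran fsck and that fixed it, but a few hours later it happened again. This time I found that the whole root fs, even /tmp, was read-only (someone suggested I try this). This is the last output of dmesg: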
[35761.273361] ata4.00: exception Emask 0x0 SAct 0x1800 SErr 0x0 action 0x0
[35761.273373] ata4.00: irq_stat 0x40000008
[35761.273379] ata4.00: failed command: READ FPDMA QUEUED
[35761.273386] ata4.00: cmd 60/00:58:c0:31:a1/02:00:38:00:00/40 tag 11 ncq dma 262144 in
res 41/40:00:f3:31:a1/00:00:38:00:00/40 Emask 0x409 (media error) <F>
[35761.273394] ata4.00: status: { DRDY ERR }
[35761.273398] ata4.00: error: { UNC }
[35761.276060] ata4.00: configured for UDMA/133
[35761.276077] sd 3:0:0:0: [sdb] tag#11 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[35761.276083] sd 3:0:0:0: [sdb] tag#11 Sense Key : Medium Error [current]
[35761.276089] sd 3:0:0:0: [sdb] tag#11 Add. Sense: Unrecovered read error - auto reallocate failed
[35761.276095] sd 3:0:0:0: [sdb] tag#11 CDB: Read(16) 88 00 00 00 00 00 38 a1 31 c0 00 00 02 00 00 00
[35761.276101] print_req_error: I/O error, dev sdb, sector 950088179
[35761.276117] ata4: EH complete
[38523.236782] ata4.00: exception Emask 0x0 SAct 0x18080 SErr 0x0 action 0x0
[38523.236793] ata4.00: irq_stat 0x40000001
[38523.236797] ata4.00: failed command: READ FPDMA QUEUED
[38523.236802] ata4.00: cmd 60/08:38:f0:31:a1/00:00:38:00:00/40 tag 7 ncq dma 4096 in
res 41/40:00:f3:31:a1/00:00:38:00:00/40 Emask 0x409 (media error) <F>
[38523.236807] ata4.00: status: { DRDY ERR }
[38523.236810] ata4.00: error: { UNC }
[38523.236813] ata4.00: failed command: WRITE FPDMA QUEUED
[38523.236821] ata4.00: cmd 61/40:78:80:b9:81/09:00:30:00:00/40 tag 15 ncq dma 1212416 ou
res 41/40:00:00:00:00/00:00:00:00:00/00 Emask 0x9 (media error)
[38523.236825] ata4.00: status: { DRDY ERR }
[38523.236828] ata4.00: error: { UNC }
[38523.236830] ata4.00: failed command: WRITE FPDMA QUEUED
[38523.236834] ata4.00: cmd 61/70:80:e8:4d:9f/00:00:ea:00:00/40 tag 16 ncq dma 57344 out
res 41/40:00:00:00:00/00:00:00:00:00/00 Emask 0x9 (media error)
[38523.236838] ata4.00: status: { DRDY ERR }
[38523.236840] ata4.00: error: { UNC }
[38523.238584] ata4.00: configured for UDMA/133
[38523.238607] sd 3:0:0:0: [sdb] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[38523.238615] sd 3:0:0:0: [sdb] tag#7 Sense Key : Medium Error [current]
[38523.238622] sd 3:0:0:0: [sdb] tag#7 Add. Sense: Unrecovered read error - auto reallocate failed
[38523.238628] sd 3:0:0:0: [sdb] tag#7 CDB: Read(16) 88 00 00 00 00 00 38 a1 31 f0 00 00 00 08 00 00
[38523.238634] print_req_error: I/O error, dev sdb, sector 950088179
[38523.238659] sd 3:0:0:0: [sdb] tag#15 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[38523.238664] sd 3:0:0:0: [sdb] tag#15 Sense Key : Medium Error [current]
[38523.238668] sd 3:0:0:0: [sdb] tag#15 Add. Sense: Unrecovered read error - auto reallocate failed
[38523.238674] sd 3:0:0:0: [sdb] tag#15 CDB: Write(16) 8a 00 00 00 00 00 30 81 b9 80 00 00 09 40 00 00
[38523.238679] print_req_error: I/O error, dev sdb, sector 813808000
[38523.238687] EXT4-fs warning (device sdb3): ext4_end_bio:323: I/O error 10 writing to inode 56511830 (offset 26411008 size 1212416 starting block 101726296)
[38523.238694] Buffer I/O error on device sdb3, logical block 93788464
[38523.238704] Buffer I/O error on device sdb3, logical block 93788465
[38523.238708] Buffer I/O error on device sdb3, logical block 93788466
[38523.238713] Buffer I/O error on device sdb3, logical block 93788467
[38523.238717] Buffer I/O error on device sdb3, logical block 93788468
[38523.238722] Buffer I/O error on device sdb3, logical block 93788469
[38523.238728] Buffer I/O error on device sdb3, logical block 93788470
[38523.238733] Buffer I/O error on device sdb3, logical block 93788471
[38523.238738] Buffer I/O error on device sdb3, logical block 93788472
[38523.238747] Buffer I/O error on device sdb3, logical block 93788473
[38523.238982] JBD2: Detected IO errors while flushing file data on sdb3-8
[38523.238984] sd 3:0:0:0: [sdb] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[38523.238995] sd 3:0:0:0: [sdb] tag#16 Sense Key : Medium Error [current]
[38523.238999] sd 3:0:0:0: [sdb] tag#16 Add. Sense: Unrecovered read error - auto reallocate failed
[38523.239005] sd 3:0:0:0: [sdb] tag#16 CDB: Write(16) 8a 00 00 00 00 00 ea 9f 4d e8 00 00 00 70 00 00
[38523.239010] print_req_error: I/O error, dev sdb, sector 3936308712
[38523.239026] ata4: EH complete
[38523.239032] Aborting journal on device sdb3-8.
[38523.239045] EXT4-fs (sdb3): Delayed block allocation failed for inode 56511830 at logical offset 6748 with max blocks 120 with error 30
[38523.239055] EXT4-fs (sdb3): This should not happen!! Data will be lost
[38523.239643] EXT4-fs error (device sdb3) in ext4_writepages:2906: IO failure
[38523.296445] EXT4-fs (sdb3): Remounting filesystem read-only
[38523.296477] EXT4-fs error (device sdb3): ext4_journal_check_start:61: Detected aborted journal
[38525.832744] ata4.00: exception Emask 0x0 SAct 0x30 SErr 0x0 action 0x0
[38525.833100] ata4.00: irq_stat 0x40000008
[38525.833365] ata4.00: failed command: READ FPDMA QUEUED
[38525.833629] ata4.00: cmd 60/80:20:c0:3b:a1/00:00:38:00:00/40 tag 4 ncq dma 65536 in
res 41/40:00:e7:3b:a1/00:00:38:00:00/40 Emask 0x409 (media error) <F>
[38525.834152] ata4.00: status: { DRDY ERR }
[38525.834415] ata4.00: error: { UNC }
[38525.836456] ata4.00: configured for UDMA/133
[38525.836737] sd 3:0:0:0: [sdb] tag#4 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[38525.837001] sd 3:0:0:0: [sdb] tag#4 Sense Key : Medium Error [current]
[38525.837267] sd 3:0:0:0: [sdb] tag#4 Add. Sense: Unrecovered read error - auto reallocate failed
[38525.837531] sd 3:0:0:0: [sdb] tag#4 CDB: Read(16) 88 00 00 00 00 00 38 a1 3b c0 00 00 00 80 00 00
[38525.837796] print_req_error: I/O error, dev sdb, sector 950090727
[38525.838072] ata4: EH complete
[38528.260746] ata4.00: exception Emask 0x0 SAct 0x400 SErr 0x0 action 0x0
[38528.261092] ata4.00: irq_stat 0x40000008
[38528.261357] ata4.00: failed command: READ FPDMA QUEUED
[38528.261623] ata4.00: cmd 60/08:50:e0:3b:a1/00:00:38:00:00/40 tag 10 ncq dma 4096 in
res 41/40:00:e7:3b:a1/00:00:38:00:00/40 Emask 0x409 (media error) <F>
[38528.262144] ata4.00: status: { DRDY ERR }
[38528.262405] ata4.00: error: { UNC }
[38528.264870] ata4.00: configured for UDMA/133
[38528.265149] sd 3:0:0:0: [sdb] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[38528.265410] sd 3:0:0:0: [sdb] tag#10 Sense Key : Medium Error [current]
[38528.265668] sd 3:0:0:0: [sdb] tag#10 Add. Sense: Unrecovered read error - auto reallocate failed
[38528.265923] sd 3:0:0:0: [sdb] tag#10 CDB: Read(16) 88 00 00 00 00 00 38 a1 3b e0 00 00 00 08 00 00
[38528.266182] print_req_error: I/O error, dev sdb, sector 950090727
[38528.266459] ata4: EH complete
[54010.452717] EXT4-fs error (device sdb3): ext4_remount:5338: Abort forced by user
[56341.190097] EXT4-fs error (device sdb3): ext4_remount:5338: Abort forced by user
[56572.048951] EXT4-fs error (device sdb3): ext4_remount:5338: Abort forced by user
[56633.963486] EXT4-fs error (device sdb3): ext4_remount:5338: Abort forced by user
After that, no further messages were logged because the whole rootfs had become read-only. Then I tried:
# mount / -o remount,rw
mount: /: cannot remount /dev/sdb3 read-write, is write-protected.
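It failed, but fortunately the partitions on the other hard drives were fine, so I was able to upload a locally built smartctl to the server and look at the SMART information: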
# /other/smartmontools-7.2/smartctl -a /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-4.19.0-14-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Gold
Device Model: WDC WD4002FYYZ-01B7CB0
Serial Number: WD-N8G6724Y
LU WWN Device Id: 5 0014ee 25f546502
Firmware Version: 01.01K03
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Jun 29 23:11:25 2021 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 121) The previous self-test completed having
the read element of the test failed.
Total time to complete Offline
data collection: (49440) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 533) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x70bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 198 197 051 Pre-fail Always - 14
3 Spin_Up_Time 0x0027 175 149 021 Pre-fail Always - 10208
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 29
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 079 079 000 Old_age Always - 15352
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 29
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 9
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 19
194 Temperature_Celsius 0x0022 109 098 000 Old_age Always - 43
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 10
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 6
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 8
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 15351 788280736
# 2 Short offline Completed: read failure 90% 15327 788280736
# 3 Short offline Completed without error 00% 97 -
# 4 Extended offline Aborted by host 90% 34 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
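I am not entirely sure whether this is a hard drive problem, because I cannot understand this information very well. English is not my native language, so please forgive any awkward wording :/
Answer 1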
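According to your SMART log, you have bad sectors; in other words, your drive is dying. How fast? Nobody knows. Always keep backups and test them. The overall health result still reads PASSED, most likely because no attribute's normalized value has dropped to its threshold; the non-zero pending-sector count and the failed self-tests are what matter here: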
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 10
# 1 Extended offline Completed: read failure 90% 15351 788280736
# 2 Short offline Completed: read failure 90% 15327 788280736
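To pick such lines out of a long `smartctl -a` dump, something like the following Python sketch might help. It assumes the ten-column attribute table layout shown in the question, and the choice of attribute IDs (5, 196, 197, 198) is an illustrative assumption rather than an official smartmontools list. The function name `nonzero_sector_attributes` is made up for this example.

```python
# Attribute IDs often read as bad-sector indicators.
# ASSUMPTION: this selection is illustrative, not an official list.
SECTOR_ATTRIBUTES = {5, 196, 197, 198}


def nonzero_sector_attributes(smartctl_output: str) -> dict:
    """Return {attribute_name: raw_value} for sector attributes with raw > 0."""
    flagged = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # An attribute row has 10 columns and starts with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in SECTOR_ATTRIBUTES and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


# Rows copied from the smartctl output in the question.
SAMPLE = """\
  5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 10
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 6
"""

print(nonzero_sector_attributes(SAMPLE))
```

On the rows from the question this reports Current_Pending_Sector and Offline_Uncorrectable, while both reallocation counters are still zero, which fits a drive that has found unreadable sectors but has not yet been able to remap them.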
You can use these guides to try to force your drive to reallocate the bad sectors:
- https://www.smartmontools.org/wiki/BadBlockHowto
- https://linoxy.com/how-to-fix-repair-bad-blocks-in-linux/
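Guides of this kind usually start by translating the LBA reported in the self-test log into a filesystem block number, roughly b = (L - S) * 512 / B, where L is the failing LBA, S is the first sector of the partition, and B is the filesystem block size. A minimal Python sketch of that arithmetic follows. The partition start (2048) and block size (4096) are hypothetical placeholders, since the question does not show the partition table; the real values would come from something like `fdisk -lu` and `tune2fs -l`. The function name `lba_to_fs_block` is invented for this example.

```python
def lba_to_fs_block(lba: int, partition_start: int, fs_block_size: int,
                    sector_size: int = 512) -> int:
    """Map a whole-disk LBA to a block number inside one partition's filesystem."""
    if lba < partition_start:
        raise ValueError("LBA lies before the start of this partition")
    # Offset in bytes from the partition start, divided by the block size.
    return (lba - partition_start) * sector_size // fs_block_size


FIRST_ERROR_LBA = 788280736  # LBA_of_first_error from the self-test log
PARTITION_START = 2048       # HYPOTHETICAL: not given in the question
FS_BLOCK_SIZE = 4096         # ASSUMED: a common ext4 block size

print(lba_to_fs_block(FIRST_ERROR_LBA, PARTITION_START, FS_BLOCK_SIZE))
```

The result is only meaningful if the LBA actually falls inside the partition in question; with the real partition start it may turn out to belong to a different partition on the same disk.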
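Either way, you will need to run e2fsck -c on the affected partition while it is unmounted (for a root filesystem, that means from a rescue system).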
Wikipedia has a good overview of SMART: https://en.wikipedia.org/wiki/SMART