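I have a small headless server on my LAN running Ubuntu 20.04. Recently I ran into a problem where one of my external drives would intermittently go read-only, and sometimes fail to mount at all after a reboot. It is an older spinning drive, so I assumed it was on its way out. I bought a new Crucial MX500 4TB internal drive, installed it, and started rsyncing the data over.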
After about 60GB, the internal drive started throwing I/O errors. Here are a few snippets from syslog:
Nov 15 20:46:33 hm80 kernel: [ 1138.328403] ata3.00: failed command: WRITE FPDMA QUEUED
Nov 15 20:46:33 hm80 kernel: [ 1138.328409] ata3.00: cmd 61/e8:00:08:0f:04/08:00:52:01:00/40 tag 0 ncq dma 1167360 ou
Nov 15 20:46:33 hm80 kernel: [ 1138.328409] res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 15 20:47:42 hm80 kernel: [ 1206.798072] sd 2:0:0:0: [sdb] tag#18 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=98s
Nov 15 20:47:42 hm80 kernel: [ 1206.798084] sd 2:0:0:0: [sdb] tag#18 CDB: Write(16) 8a 00 00 00 00 00 4c 03 d9 00 00 00 09 c8 00 00
Nov 15 20:47:42 hm80 kernel: [ 1206.798090] I/O error, dev sdb, sector 1275320576 op 0x1:(WRITE) flags 0x4000 phys_seg 44 prio class 2
Nov 15 20:47:42 hm80 kernel: [ 1206.798105] EXT4-fs warning (device sdb1): ext4_end_bio:343: I/O error 10 writing to inode 179055013 starting block 159415072)
Nov 15 20:47:42 hm80 kernel: [ 1206.798123] EXT4-fs warning (device sdb1): ext4_end_bio:343: I/O error 10 writing to inode 179055014 starting block 159415080)
Nov 15 20:47:42 hm80 kernel: [ 1206.798250] sd 2:0:0:0: [sdb] tag#19 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=98s
Nov 15 20:47:42 hm80 kernel: [ 1206.798252] Buffer I/O error on device sdb1, logical block 159414816
Nov 15 20:47:42 hm80 kernel: [ 1206.798257] sd 2:0:0:0: [sdb] tag#19 CDB: Write(16) 8a 00 00 00 00 00 4c 03 d2 d0 00 00 06 20 00 00
Nov 15 20:47:42 hm80 kernel: [ 1206.798264] I/O error, dev sdb, sector 1275318992 op 0x1:(WRITE) flags 0x0 phys_seg 29 prio class 2
Nov 15 20:47:42 hm80 kernel: [ 1206.798286] Buffer I/O error on device sdb1, logical block 159414817
Nov 15 20:47:42 hm80 kernel: [ 1206.798296] Buffer I/O error on device sdb1, logical block 159414818
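As a cross-check on the excerpt above, here is a minimal Python sketch that decodes the two Write(16) CDBs. It assumes the standard WRITE(16) layout (opcode 0x8a, an 8-byte big-endian LBA in bytes 2-9, a 4-byte transfer length in bytes 10-13) and the 512-byte logical sectors that smartctl reports for this drive. The helper name decode_write16 is made up for illustration.

```python
SECTOR_BYTES = 512  # assumption: logical sector size reported by smartctl


def decode_write16(cdb_hex):
    """Return (starting LBA, block count) from a SCSI WRITE(16) CDB hex string."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x8A:
        raise ValueError("not a 16-byte WRITE(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")       # bytes 2-9: logical block address
    blocks = int.from_bytes(cdb[10:14], "big")   # bytes 10-13: transfer length
    return lba, blocks


# The two CDBs quoted in the syslog excerpt (tag#18 and tag#19).
for cdb_hex in (
    "8a 00 00 00 00 00 4c 03 d9 00 00 00 09 c8 00 00",
    "8a 00 00 00 00 00 4c 03 d2 d0 00 00 06 20 00 00",
):
    lba, blocks = decode_write16(cdb_hex)
    print(f"LBA {lba}, {blocks} blocks, {blocks * SECTOR_BYTES} bytes")
```

The decoded LBAs (1275320576 and 1275318992) match the sector numbers in the two "I/O error, dev sdb" lines, so the SCSI and block-layer messages describe the same failed writes rather than separate faults.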
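After a reboot everything seemed fine, and I resumed the sync. It went smoothly for a while. The same drive also hosts a handful of backups synced from other servers, and that job ran through the night. Today, after 12 hours of rsync, the machine was almost unresponsive and only about 400GB had been transferred. The rsync output again showed very slow write speeds. Then some really strange things started happening: I could no longer ssh into the server, and basic commands such as ls would hang.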
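I opened up the machine and checked the cable connections. After the reboot it came up in GRUB, which I had never seen before. It is now running the same rsync routine again, and everything seems fine.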
Since all of this was in response to another, external drive with similar symptoms, I am considering the following possibilities:
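- By pure coincidence, I got a bad drive. Return it and try again.
- Something is wrong with the drive controller or the computer itself (which could also explain the problems with my first drive).
- A bad SATA cable?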
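How do I debug a problem like this when it only shows up intermittently? Is there anything in these syslog entries that points to where the fault lies?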
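One way I could make the syslog easier to reason about is to count which layer reports each error. In the excerpt, the ATA-level timeout is logged about a minute before the block-layer and ext4 messages. The sketch below is a rough Python tally, not a diagnostic tool: the category names are hypothetical, and the patterns are taken only from the lines quoted in this post.

```python
import re
from collections import Counter

# Hypothetical category names; patterns are based on the syslog lines in this post.
PATTERNS = {
    "ata_failed_command": re.compile(r"ata\d+\.\d+: failed command"),
    "ata_timeout": re.compile(r"Emask 0x[0-9a-f]+ \(timeout\)"),
    "scsi_failed": re.compile(r"FAILED Result: hostbyte=\w+"),
    "block_io_error": re.compile(r"I/O error, dev \w+, sector \d+"),
    "ext4_warning": re.compile(r"EXT4-fs (warning|error)"),
}


def tally(lines):
    """Count how many log lines fall into each category."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# A few lines copied from the excerpt above.
sample = [
    "Nov 15 20:46:33 hm80 kernel: [ 1138.328403] ata3.00: failed command: WRITE FPDMA QUEUED",
    "Nov 15 20:46:33 hm80 kernel: [ 1138.328409] res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)",
    "Nov 15 20:47:42 hm80 kernel: [ 1206.798090] I/O error, dev sdb, sector 1275320576 op 0x1:(WRITE) flags 0x4000 phys_seg 44 prio class 2",
    "Nov 15 20:47:42 hm80 kernel: [ 1206.798105] EXT4-fs warning (device sdb1): ext4_end_bio:343: I/O error 10 writing to inode 179055013 starting block 159415072)",
]

for name, count in sorted(tally(sample).items()):
    print(f"{name}: {count}")
```

Run against the full log across several failures, a tally like this would show whether the ATA timeouts always come first or whether errors sometimes start at another layer.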
Update:
$ e2fsck -c /dev/sdb1
e2fsck 1.46.5 (30-Dec-2021)
Checking for bad blocks (read-only test): done
/dev/sdb1: Updating bad block inode.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/sdb1: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sdb1: 4281238/244195328 files (0.2% non-contiguous), 561744117/976754176 blocks
$ smartctl -a /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-6.2.0-36-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: CT4000MX500SSD1
Serial Number: 2320E6D55BEE
LU WWN Device Id: 5 00a075 1e6d55bee
Firmware Version: M3CR046
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Thu Nov 16 19:30:08 2023 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 30) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x0031) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocated_Sector_Ct 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 22
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 14
171 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
172 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
173 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 1
174 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 5
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033 000 000 000 Pre-fail Always - 262
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 068 057 000 Old_age Always - 32 (Min/Max 25/43)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Unknown_SSD_Attribute 0x0030 100 100 001 Old_age Offline - 0
206 Unknown_SSD_Attribute 0x000e 100 100 000 Old_age Always - 0
210 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
246 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 4615089440
247 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 37591628
248 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 11026271
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Completed [00% left] (0-65535)
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
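To pull out the handful of counters I keep looking at, here is a minimal Python sketch that parses attribute rows in the smartctl table format shown above. The function name and the choice of IDs to watch are my own assumptions: 5, 187 and 197 are commonly read as media-related counters, and 199 (UDMA_CRC_Error_Count) is commonly associated with link or cabling trouble.

```python
import re

# One attribute row: ID, name, flag, value, worst, thresh, type, updated,
# when_failed, then the raw value (only its leading integer is captured).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+\d+\s+\d+\s+\d+"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_attributes(lines):
    """Map attribute ID -> (name, raw value) for rows that match the table format."""
    attrs = {}
    for line in lines:
        match = ROW.match(line)
        if match:
            attrs[int(match.group(1))] = (match.group(2), int(match.group(3)))
    return attrs


# Rows copied from the smartctl output above.
rows = [
    "5 Reallocated_Sector_Ct 0x0032 100 100 010 Old_age Always - 0",
    "187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0",
    "194 Temperature_Celsius 0x0022 068 057 000 Old_age Always - 32 (Min/Max 25/43)",
    "197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0",
    "199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0",
]

attrs = parse_attributes(rows)
for attr_id in (5, 187, 197, 199):  # assumption: the counters worth watching
    name, raw = attrs[attr_id]
    print(f"{attr_id:>3} {name}: {raw}")
```

All four counters are zero in the output above. That is consistent with the drive's PASSED self-assessment, but as far as I understand it a zero CRC count does not by itself rule out a cable or controller problem.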