How do I debug disk I/O problems on my server?

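I have a small headless server on my LAN running Ubuntu 20.04. Recently I ran into a problem where one of my external drives intermittently went read-only and sometimes could not be mounted at all after a reboot. It is an older mechanical drive, so I assumed it was dying. I then bought a new Crucial MX500 4TB internal drive, installed it, and started rsyncing the data over.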

After about 60GB, the internal drive started throwing I/O errors. Here are a few excerpts from the syslog:

Nov 15 20:46:33 hm80 kernel: [ 1138.328403] ata3.00: failed command: WRITE FPDMA QUEUED
Nov 15 20:46:33 hm80 kernel: [ 1138.328409] ata3.00: cmd 61/e8:00:08:0f:04/08:00:52:01:00/40 tag 0 ncq dma 1167360 ou
Nov 15 20:46:33 hm80 kernel: [ 1138.328409]          res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Nov 15 20:47:42 hm80 kernel: [ 1206.798072] sd 2:0:0:0: [sdb] tag#18 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=98s
Nov 15 20:47:42 hm80 kernel: [ 1206.798084] sd 2:0:0:0: [sdb] tag#18 CDB: Write(16) 8a 00 00 00 00 00 4c 03 d9 00 00 00 09 c8 00 00
Nov 15 20:47:42 hm80 kernel: [ 1206.798090] I/O error, dev sdb, sector 1275320576 op 0x1:(WRITE) flags 0x4000 phys_seg 44 prio class 2
Nov 15 20:47:42 hm80 kernel: [ 1206.798105] EXT4-fs warning (device sdb1): ext4_end_bio:343: I/O error 10 writing to inode 179055013 starting block 159415072)
Nov 15 20:47:42 hm80 kernel: [ 1206.798123] EXT4-fs warning (device sdb1): ext4_end_bio:343: I/O error 10 writing to inode 179055014 starting block 159415080)

Nov 15 20:47:42 hm80 kernel: [ 1206.798250] sd 2:0:0:0: [sdb] tag#19 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=98s
Nov 15 20:47:42 hm80 kernel: [ 1206.798252] Buffer I/O error on device sdb1, logical block 159414816
Nov 15 20:47:42 hm80 kernel: [ 1206.798257] sd 2:0:0:0: [sdb] tag#19 CDB: Write(16) 8a 00 00 00 00 00 4c 03 d2 d0 00 00 06 20 00 00
Nov 15 20:47:42 hm80 kernel: [ 1206.798264] I/O error, dev sdb, sector 1275318992 op 0x1:(WRITE) flags 0x0 phys_seg 29 prio class 2
Nov 15 20:47:42 hm80 kernel: [ 1206.798286] Buffer I/O error on device sdb1, logical block 159414817
Nov 15 20:47:42 hm80 kernel: [ 1206.798296] Buffer I/O error on device sdb1, logical block 159414818

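After a reboot everything seemed fine, and I resumed the sync. It went smoothly at first. The same drive also hosts a small number of backups synced from other servers. The job ran all night. Today, after 12 hours of rsync, the machine was nearly unresponsive and had transferred only about 400GB. Looking at the rsync output, it was again showing very slow write speeds. Then some really strange things started happening: I could no longer ssh into the server, and basic commands such as ls would hang.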

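I opened up the machine and checked the cable connections. After rebooting, it dropped into GRUB, which I had never seen before. It is now running the same rsync routine again, and everything seems fine.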

Since all of this started as a response to a different, external drive with similar problems, I am considering the following possibilities:

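  • Pure coincidence: I got a bad drive. Return it and try again.
  • There is a problem with the drive controller or the computer itself (which might also be what caused the problems with my first drive).
  • A bad SATA cable?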

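How can I debug this kind of problem when it only shows up intermittently? Is there anything in these syslog entries that points to where the problem lies?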
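
One thing that can be read directly from the log entries is which region of the disk each failed write targeted. The following is a minimal Python sketch, assuming the standard SCSI WRITE(16) layout (opcode 0x8a, an 8-byte big-endian LBA in bytes 2-9, and a 4-byte transfer length in bytes 10-13). The function name decode_write16_cdb is made up for illustration. It decodes the two CDB lines quoted above.

```python
from typing import Tuple


def decode_write16_cdb(cdb_hex: str) -> Tuple[int, int]:
    """Return (lba, sector_count) from a WRITE(16) CDB given as hex bytes.

    Assumes the standard 16-byte layout: opcode 0x8a, flags, 8-byte LBA,
    4-byte transfer length, group number, control.
    """
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is ignored
    if len(cdb) != 16 or cdb[0] != 0x8A:
        raise ValueError("not a 16-byte WRITE(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")
    count = int.from_bytes(cdb[10:14], "big")
    return lba, count


# The two CDBs from the kernel log (tag#18 and tag#19).
first = decode_write16_cdb("8a 00 00 00 00 00 4c 03 d9 00 00 00 09 c8 00 00")
second = decode_write16_cdb("8a 00 00 00 00 00 4c 03 d2 d0 00 00 06 20 00 00")

print(first)   # (1275320576, 2504)
print(second)  # (1275318992, 1568)
```

The decoded LBAs match the sector numbers in the "I/O error, dev sdb" lines, and the two failed writes lie within a few thousand sectors of each other. This only locates the writes; on its own it does not say whether the drive, the cable, or the controller is at fault.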

Update:

$ e2fsck -c /dev/sdb1
e2fsck 1.46.5 (30-Dec-2021)
Checking for bad blocks (read-only test): done                                                 
/dev/sdb1: Updating bad block inode.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/sdb1: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sdb1: 4281238/244195328 files (0.2% non-contiguous), 561744117/976754176 blocks
$ smartctl -a /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-6.2.0-36-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     CT4000MX500SSD1
Serial Number:    2320E6D55BEE
LU WWN Device Id: 5 00a075 1e6d55bee
Firmware Version: M3CR046
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Nov 16 19:30:08 2023 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (    0) seconds.
Offline data collection
capabilities:            (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    (  30) minutes.
Conveyance self-test routine
recommended polling time:    (   2) minutes.
SCT capabilities:          (0x0031) SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       22
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       14
171 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
173 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       1
174 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       5
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   000   000   000    Pre-fail  Always       -       262
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   068   057   000    Old_age   Always       -       32 (Min/Max 25/43)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Unknown_SSD_Attribute   0x0030   100   100   001    Old_age   Offline      -       0
206 Unknown_SSD_Attribute   0x000e   100   100   000    Old_age   Always       -       0
210 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
246 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       4615089440
247 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       37591628
248 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       11026271

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Completed [00% left] (0-65535)
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
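
To avoid eyeballing the attribute table after every incident, the rows can be filtered mechanically. This is a rough Python sketch, assuming the column layout shown in the smartctl -a output above. The list of attribute IDs to watch is my own choice for illustration, not an official set, and the helper names are hypothetical. The sample rows are copied from the table above.

```python
import re

# Attribute IDs picked for this sketch (an assumption, not an official list).
WATCHED_IDS = (5, 187, 196, 197, 198, 199)

ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}"  # ID, name, flag
    r"\s+\d+\s+\d+\s+\d+"                     # VALUE, WORST, THRESH
    r"\s+\S+\s+\S+\s+\S+"                     # TYPE, UPDATED, WHEN_FAILED
    r"\s+(\d+)"                               # leading integer of RAW_VALUE
)


def parse_attributes(text):
    """Map attribute ID -> (name, raw value) for each table row in text."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows[int(m.group(1))] = (m.group(2), int(m.group(3)))
    return rows


def nonzero_watched(rows):
    """Return (id, name, raw) for watched attributes with a non-zero raw value."""
    return [(i, rows[i][0], rows[i][1])
            for i in WATCHED_IDS if i in rows and rows[i][1] > 0]


SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0032   100   100   010    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   068   057   000    Old_age   Always       -       32 (Min/Max 25/43)
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
"""

rows = parse_attributes(SAMPLE)
print(nonzero_watched(rows))  # prints []
```

With the values above nothing is flagged. An empty result only means the drive has not recorded these particular errors; it is not proof that the cable or controller is healthy.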
