Is my disk failing?


I'm troubleshooting a Debian Buster system that occasionally becomes unresponsive. Looking at dmesg, I see some worrying messages:

[Wed Apr 19 19:39:47 2023] ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x4050000 action 0xe frozen                                   
[Wed Apr 19 19:39:47 2023] ata1.00: irq_stat 0x00000040, connection status changed                                                   
[Wed Apr 19 19:39:47 2023] ata1: SError: { PHYRdyChg CommWake DevExch }                                                              
[Wed Apr 19 19:39:47 2023] ata1.00: failed command: WRITE DMA EXT
[Wed Apr 19 19:39:47 2023] ata1.00: cmd 35/00:18:68:02:96/00:00:1d:00:00/e0 tag 19 dma 12288 out                                     
                                    res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)                               
[Wed Apr 19 19:39:47 2023] ata1.00: status: { DRDY }
[Wed Apr 19 19:39:47 2023] ata1: hard resetting link
[Wed Apr 19 19:39:48 2023] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)                                                    
[Wed Apr 19 19:39:48 2023] ata1.00: supports DRM functions and may not be fully accessible                                           
[Wed Apr 19 19:39:48 2023] ata1.00: supports DRM functions and may not be fully accessible                                           
[Wed Apr 19 19:39:48 2023] ata1.00: configured for UDMA/33
[Wed Apr 19 19:39:48 2023] ata1: EH complete
[Wed Apr 19 19:39:48 2023] ata1.00: Enabling discard_zeroes_data

These messages, which keep repeating, seem to indicate that the SATA link has to be reset every few minutes.
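To put a number on "every few minutes", the resets can be counted and timed from the log itself. The following is only a rough sketch: it assumes the human-readable timestamp format shown above (as produced by `dmesg -T`) with English day and month names, and the function names `find_link_resets` and `reset_gaps_seconds` are made up for illustration.

```python
import re
from datetime import datetime

# Matches lines such as:
#   [Wed Apr 19 19:39:47 2023] ata1: hard resetting link
RESET_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<port>ata\d+): hard resetting link"
)

def find_link_resets(lines):
    """Return a list of (port, datetime) for every hard link reset found."""
    resets = []
    for line in lines:
        match = RESET_RE.match(line.strip())
        if match:
            # Assumes the C/English locale for day and month abbreviations.
            when = datetime.strptime(match.group("ts"), "%a %b %d %H:%M:%S %Y")
            resets.append((match.group("port"), when))
    return resets

def reset_gaps_seconds(resets):
    """Return the number of seconds between consecutive resets."""
    times = [when for _, when in resets]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]

# Two lines copied from the excerpt above; only the second one is a reset.
excerpt = [
    "[Wed Apr 19 19:39:47 2023] ata1.00: status: { DRDY }",
    "[Wed Apr 19 19:39:47 2023] ata1: hard resetting link",
]
for port, when in find_link_resets(excerpt):
    print(port, when.isoformat())
```

Feeding it the complete `dmesg -T` output instead of the two-line excerpt would give every reset, and `reset_gaps_seconds` would then show whether the interval is regular or tied to bursts of disk activity.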

I ran an extended SMART self-test on /dev/sda, and it did not detect any faults (full log):

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.9.0-0.bpo.5-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     Samsung SSD 860 PRO 512GB
Serial Number:    S5HTNE0N107136V
LU WWN Device Id: 5 002538 e2014235a
Firmware Version: RVM02B6Q
User Capacity:    512,110,190,592 bytes [512 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Apr 20 08:10:54 2023 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
[...]

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  9 Power_On_Hours          -O--CK   096   096   000    -    16420
 12 Power_Cycle_Count       -O--CK   099   099   000    -    372
177 Wear_Leveling_Count     PO--C-   099   099   000    -    17
179 Used_Rsvd_Blk_Cnt_Tot   PO--C-   100   100   010    -    0
181 Program_Fail_Cnt_Total  -O--CK   100   100   010    -    0
182 Erase_Fail_Count_Total  -O--CK   100   100   010    -    0
183 Runtime_Bad_Block       PO--C-   100   100   010    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
190 Airflow_Temperature_Cel -O--CK   079   045   000    -    21
195 Hardware_ECC_Recovered  -O-RC-   200   200   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   100   100   000    -    0
235 Unknown_Attribute       -O--C-   099   099   000    -    230
241 Total_LBAs_Written      -O--CK   099   099   000    -    9965553603

[...]

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     16411         -
# 2  Short offline       Completed without error       00%     16406         -
# 3  Short offline       Completed without error       00%     16405         -

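I don't think the filesystem is at fault, but I nevertheless tried the kernel command-line option fsck.mode=force. It does not appear to have actually checked anything on the disk other than the EFI partition.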

Does this point to a particular failure mode, such as a failing disk, a bad connection, or filesystem corruption?
