ext4 filesystem corruption: possibly a hardware fault?

About half an hour after I booted the computer, dmesg showed these errors:

 [ 1355.677957] EXT4-fs error (device sda2): htree_dirblock_to_tree: inode #1318420: (comm updatedb.mlocat) bad entry in directory: directory entry across blocks - block=5251700offset=0(0), inode=1802725748, rec_len=179136, name_len=32
 [ 1355.677973] Aborting journal on device sda2-8.
 [ 1355.678101] EXT4-fs (sda2): Remounting filesystem read-only
 [ 1355.690144] EXT4-fs error (device sda2): htree_dirblock_to_tree: inode #1318416: (comm updatedb.mlocat) bad entry in directory: directory entry across blocks - block=5251699offset=0(0), inode=2194783952, rec_len=53280, name_len=152
 [ 1356.864720] EXT4-fs error (device sda2): htree_dirblock_to_tree: inode #1312795: (comm updatedb.mlocat) bad entry in directory: directory entry across blocks - block=5251176offset=1460(13748), inode=1432317541, rec_len=208208, name_len=119
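To see which directories the kernel is complaining about, the inode numbers can be pulled out of the log and mapped back to paths. The following is only a sketch: `ext4_error_inodes` is a hypothetical helper name, and it assumes the message layout shown above (`inode #N:` after the function name).

```shell
# Print the unique directory inode numbers named in EXT4-fs error lines.
# Reads kernel log text on stdin, for example:  dmesg | ext4_error_inodes
ext4_error_inodes() {
    grep 'EXT4-fs error' \
        | sed -n 's/.*: inode #\([0-9][0-9]*\):.*/\1/p' \
        | sort -un
}

# Each number can then be mapped to a path name (as root). debugfs opens
# the filesystem read-only unless -w is given:
#   debugfs -R 'ncheck 1318420' /dev/sda2
```

If the affected directories sit close together on disk, as the neighbouring block numbers in the log suggest, that may help narrow down which region of the device returned bad data.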

/dev/sda is an SSD, and it uses the noop scheduler.

/etc/fstab entry:

UUID=acb4eefa-48ff-4ee1-bb5f-2dccce7d011f / ext4 errors=remount-ro,noatime,discard,user_xattr 0 1

System information:

$ cat /proc/mounts | grep /dev/sd
/dev/sda1 /boot ext2 rw,noatime,errors=continue 0 0
$ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=10.04
DISTRIB_CODENAME=lucid
DISTRIB_DESCRIPTION="Ubuntu 10.04.3 LTS"
$ uname -a
Linux leetpad 2.6.35-30-generic-pae #61~lucid1-Ubuntu SMP Thu Oct 13 21:14:29 UTC 2011 i686 GNU/Linux

Output of smartctl -a:

smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     STT_FTM28GX25H
Serial Number:    P637510-MIBY-706A009
Firmware Version: 1916
User Capacity:    128,035,676,160 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Thu Nov 24 20:53:48 2011 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:         (   0) seconds.
Offline data collection
capabilities:            (0x1d) SMART execute Offline immediate.
                    No Auto Offline data collection support.
                    Abort Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x00) Error logging NOT supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   0) minutes.
Extended self-test routine
recommended polling time:    (   0) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0000   005   000   000    Old_age   Offline  In_the_past 0
  9 Power_On_Hours          0x0000   141   002   000    Old_age   Offline      -       0
 12 Power_Cycle_Count       0x0000   115   002   000    Old_age   Offline      -       0
184 Unknown_Attribute       0x0000   084   000   000    Old_age   Offline  In_the_past 0
195 Hardware_ECC_Recovered  0x0000   000   000   000    Old_age   Offline  FAILING_NOW 0
196 Reallocated_Event_Count 0x0000   000   000   000    Old_age   Offline  FAILING_NOW 0
197 Current_Pending_Sector  0x0000   000   000   000    Old_age   Offline  FAILING_NOW 0
198 Offline_Uncorrectable   0x0000   002   107   000    Old_age   Offline      -       21198
199 UDMA_CRC_Error_Count    0x0000   063   003   000    Old_age   Offline      -       26957
200 Multi_Zone_Error_Rate   0x0000   099   124   000    Old_age   Offline      -       446
201 Soft_Read_Error_Rate    0x0000   024   154   000    Old_age   Offline      -       328
202 TA_Increase_Count       0x0000   115   254   000    Old_age   Offline      -       115
203 Run_Out_Cancel          0x0000   247   245   000    Old_age   Offline      -       83
204 Shock_Count_Write_Opern 0x0000   000   000   000    Old_age   Offline  FAILING_NOW 0
205 Shock_Rate_Write_Opern  0x0000   016   039   000    Old_age   Offline      -       0
206 Flying_Height           0x0000   005   000   000    Old_age   Offline  In_the_past 0
207 Spin_High_Current       0x0000   055   015   000    Old_age   Offline      -       0
208 Spin_Buzz               0x0000   248   001   000    Old_age   Offline      -       0
209 Offline_Seek_Performnce 0x0000   095   000   000    Old_age   Offline  In_the_past 0
211 Unknown_Attribute       0x0000   000   000   000    Old_age   Offline  FAILING_NOW 0
212 Unknown_Attribute       0x0000   000   000   000    Old_age   Offline  FAILING_NOW 0
213 Unknown_Attribute       0x0000   000   000   000    Old_age   Offline  FAILING_NOW 0

Warning: device does not support Error Logging
Warning! SMART ATA Error Log Structure error: invalid SMART checksum.
SMART Error Log Version: 1
No Errors Logged

Warning! SMART Self-Test Log Structure error: invalid SMART checksum.
SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


Device does not support Selective Self Tests/Logging

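I have run memtest for 7 hours and it found no memory errors.

Is there anything obvious that could be going wrong here? The most plausible explanation I can think of is that the SSD is silently dropping some write requests, which eventually leaves the ext4 filesystem inconsistent (but without any disk I/O errors). How could this happen? Are there relevant configuration options that I should make sure are set correctly?

What tools should I use to diagnose the hardware fault? Is it possible to diagnose an SSD fault without overwriting the data?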

Answer 1

198 Offline_Uncorrectable   0x0000   002   107   000    Old_age   Offline      -       21198

The drive is failing. RMA it.
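The same reading of the attribute table can be done mechanically. This is a rough sketch, not a definitive health check: `smart_suspects` is a hypothetical helper, the choice of attribute IDs 5, 196, 197 and 198 as "media" counters is an assumption, and for a drive that is not in the smartctl database the attribute names printed may not match what the vendor actually stores there.

```shell
# Read the attribute table of "smartctl -A" (or "-a") on stdin and print
# rows that look suspicious: WHEN_FAILED is set, or a media-related
# counter has a non-zero raw value.
smart_suspects() {
    awk '
        # Attribute rows start with a numeric ID and have 10 columns.
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            failed = ($9 != "-")
            media  = ($1 == 5 || $1 == 196 || $1 == 197 || $1 == 198) && ($10 + 0 > 0)
            if (failed || media)
                printf "%s %s when_failed=%s raw=%s\n", $1, $2, $9, $10
        }
    '
}

# Usage (as root):
#   smartctl -A /dev/sda | smart_suspects
```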

You may want to run a SMART self-test on it, but with values like these that is only a formality; it is unlikely to fail.

To run the test, use

smartctl -t long /dev/sda

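It will tell you when the test is expected to finish. After that, run smartctl -a /dev/sda again; the result is shown in the self-test section of the output.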
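The two steps can be wrapped as small helpers. This is a minimal sketch: the function names are made up for illustration, and it uses `smartctl -H` and `smartctl -l selftest` to print only the health verdict and the self-test log rather than the full `-a` report.

```shell
# Start an extended (long) self-test on the given device.
# smartctl prints the estimated completion time and returns immediately.
smart_start_long() {
    smartctl -t long "$1"
}

# After the test has finished: show the overall health verdict and the
# self-test log.
smart_show_result() {
    smartctl -H "$1"
    smartctl -l selftest "$1"
}

# Usage (as root):
#   smart_start_long /dev/sda
#   ... wait until the time smartctl reported ...
#   smart_show_result /dev/sda
```

The self-test runs inside the drive and only reads the media, so it should not overwrite any data.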

Answer 2

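First, you may want to run a full fsck on the root disk. I have found that the quick check sometimes misses important errors. You can force a full check by touching a file in the root directory (the details may depend on the Linux distribution); try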

 touch /forcefsck

and reboot, or boot a rescue CD and run fsck on the root filesystem from there. By "full" I mean passing the -f argument to fsck.
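From a rescue CD, the check might look like the sketch below. It assumes the root filesystem is /dev/sda2 (the device named in the kernel messages) and that it is not mounted; the function names are only for illustration.

```shell
# Check only: -f forces a full check even if the filesystem is marked
# clean; -n opens it read-only and answers "no" to every repair prompt.
fsck_check_only() {
    e2fsck -f -n "$1"
}

# Full check that is allowed to repair. Run it only on an unmounted
# filesystem, ideally after taking a backup or an image of the partition.
fsck_repair() {
    e2fsck -f "$1"
}

# Usage (as root, from the rescue environment):
#   fsck_check_only /dev/sda2
#   fsck_repair /dev/sda2
```

Running the read-only variant first shows how much damage there is without writing anything to the disk.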

Second, do your system logs indicate any hardware errors?
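One way to look is to filter the kernel log for messages that usually come from the disk or the SATA link rather than from the filesystem. The patterns below are an assumption about what such messages typically contain, not an exhaustive list, and `hw_error_lines` is a hypothetical helper name.

```shell
# Keep only log lines that hint at a disk or link problem.
# Reads log text on stdin.
hw_error_lines() {
    grep -E -i 'I/O error|ata[0-9]+(\.[0-9]+)?: (exception|failed command|hard resetting link)|medium error'
}

# Usage:
#   dmesg | hw_error_lines
#   hw_error_lines < /var/log/syslog
```

An empty result would match the question's observation that the filesystem became inconsistent without any disk I/O errors being reported.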

As Mr. Kario pointed out, you can check the disk's health with smartctl. However, some disks I have used did not report any useful information.
