HDD bad sectors not reallocated, but no SMART errors

2024-5-31

I have a pair of WD drives, and one of them was recently kicked out of the RAID1 array.

The drive reports I/O errors on some sectors, but all SMART attributes look fine:

root@nas:~# smartctl -a /dev/sdb
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.4.68.x86_64.1] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD20EFRX-68AX9N0
Serial Number:    WD-WMC30xxxxxxxx
LU WWN Device Id: 5 0014ee 602ce8a27
Firmware Version: 80.00A80
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Tue Sep 19 07:50:28 2017 WEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121) The previous self-test completed having
                    the read element of the test failed.
Total time to complete Offline
data collection:        (26940) seconds.
Offline data collection
capabilities:            (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    ( 272) minutes.
Conveyance self-test routine
recommended polling time:    (   5) minutes.
SCT capabilities:          (0x70bd) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   191   191   051    Pre-fail  Always       -       110178
  3 Spin_Up_Time            0x0027   195   170   021    Pre-fail  Always       -       3208
  4 Start_Stop_Count        0x0032   065   065   000    Old_age   Always       -       35326
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   042   042   000    Old_age   Always       -       43024
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       31
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       14
193 Load_Cycle_Count        0x0032   189   189   000    Old_age   Always       -       35311
194 Temperature_Celsius     0x0022   120   103   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 1
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 43000 hours (1791 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 01 49 00 90 e0  Error: UNC at LBA = 0x00900049 = 9437257

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  40 00 01 49 00 90 e0 08  14d+22:24:56.107  READ VERIFY SECTOR(S)

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     42999         9437257
# 2  Short offline       Completed: read failure       40%     42999         9437257
# 3  Extended offline    Completed without error       00%     39200         -
# 4  Extended offline    Completed without error       00%     39033         -
# 5  Extended offline    Completed without error       00%     38864         -
# 6  Extended offline    Completed without error       00%     38708         -
# 7  Extended offline    Completed without error       00%     38540         -
# 8  Extended offline    Completed without error       00%     38396         -
# 9  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

So there is a reproducible error at LBA 9437257, which I can see with dd:

root@nas:~# export i=9437257
root@nas:~# while [ $i -lt 9437280 ]; do echo $i; dd if=/dev/sdb of=/dev/null bs=512 count=1 skip=$i; let i+=1; done
9437257
dd: error reading '/dev/sdb': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 0.325588 s, 0.0 kB/s
9437258
dd: error reading '/dev/sdb': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 0.164007 s, 0.0 kB/s
9437259
dd: error reading '/dev/sdb': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 0.162149 s, 0.0 kB/s
9437260
dd: error reading '/dev/sdb': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 0.161994 s, 0.0 kB/s
9437261
dd: error reading '/dev/sdb': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 0.161854 s, 0.0 kB/s
9437262
dd: error reading '/dev/sdb': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 0.16294 s, 0.0 kB/s
9437263
dd: error reading '/dev/sdb': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 0.161955 s, 0.0 kB/s
9437264
1+0 records in
1+0 records out
512 bytes copied, 0.0212458 s, 24.1 kB/s
9437265
1+0 records in
1+0 records out
512 bytes copied, 0.000336436 s, 1.5 MB/s
9437266
1+0 records in
1+0 records out
512 bytes copied, 0.000300649 s, 1.7 MB/s
9437267
1+0 records in
1+0 records out
512 bytes copied, 0.000284451 s, 1.8 MB/s
9437268
1+0 records in
1+0 records out
512 bytes copied, 0.00031215 s, 1.6 MB/s
9437269
1+0 records in
1+0 records out
512 bytes copied, 0.000287936 s, 1.8 MB/s
9437270
1+0 records in
1+0 records out
512 bytes copied, 0.000302617 s, 1.7 MB/s
9437271
1+0 records in
1+0 records out
512 bytes copied, 0.000294914 s, 1.7 MB/s
9437272
1+0 records in
1+0 records out
512 bytes copied, 0.000713134 s, 718 kB/s
9437273
1+0 records in
1+0 records out
512 bytes copied, 0.000416336 s, 1.2 MB/s
9437274
1+0 records in
1+0 records out
512 bytes copied, 0.000289526 s, 1.8 MB/s
9437275
1+0 records in
1+0 records out
512 bytes copied, 0.000300769 s, 1.7 MB/s
9437276
1+0 records in
1+0 records out
512 bytes copied, 0.000294524 s, 1.7 MB/s
9437277
1+0 records in
1+0 records out
512 bytes copied, 0.000295592 s, 1.7 MB/s
9437278
1+0 records in
1+0 records out
512 bytes copied, 0.00034751 s, 1.5 MB/s
9437279
1+0 records in
1+0 records out
512 bytes copied, 0.000301789 s, 1.7 MB/s
root@nas:~#

I tried writing to those sectors to get them reallocated, but only got more errors:

root@nas:~# dd if=/dev/zero of=/dev/sdb bs=512 count=7 seek=9437257
dd: error writing '/dev/sdb': Input/output error
1+0 records in
0+0 records out
0 bytes copied, 0.168565 s, 0.0 kB/s
root@nas:~#
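
The drive reports 512-byte logical and 4096-byte physical sectors, so the seven failing LBAs (9437257 to 9437263) may all sit inside one physical sector together with LBA 9437256, which the read loop did not test. If so, a 7 x 512-byte write is a partial-sector write that the drive has to service with a read-modify-write, and that internal read could be what fails. This is an assumption drawn from the output, not a confirmed diagnosis. A minimal sketch of the arithmetic, with /dev/sdX as a placeholder device name:

```shell
#!/bin/sh
# Map a failing 512-byte LBA onto its enclosing 4096-byte physical sector.
logical=512        # logical sector size reported by smartctl
physical=4096      # physical sector size reported by smartctl
bad_lba=9437257    # first failing LBA from the self-test log

per_phys=$((physical / logical))            # logical sectors per physical sector
first=$((bad_lba / per_phys * per_phys))    # round down to the physical boundary
last=$((first + per_phys - 1))
echo "physical sector covers LBA $first..$last"

# Hypothetical full-sector overwrite; only printed here, never executed.
echo "dd if=/dev/zero of=/dev/sdX bs=$physical count=1 seek=$((first / per_phys)) oflag=direct"
```

The dd command is only echoed. Running it would destroy the contents of that sector, which is acceptable here only because the RAID1 mirror holds a copy.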

I also tried sg_verify and sg_reassign:

root@nas:~# sg_verify /dev/sdb --lba=9437257
verify (10):
Descriptor format, current; Sense key: Medium Error
Additional sense: Unrecovered read error - auto reallocate failed
  Descriptor type: Information: 0x0000000000900049
VERIFY(10) medium or hardware error, reported lba=0x900049
root@nas:~# sg_reassign --address=9437257 /dev/sdb
REASSIGN BLOCKS: Illegal request, invalid opcode

There is also a lot of noise in journalctl:

Sep 19 07:58:26 nas kernel: ata2.00: configured for UDMA/133
Sep 19 07:58:26 nas kernel: sd 1:0:0:0: [sdb] tag#12 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 19 07:58:26 nas kernel: sd 1:0:0:0: [sdb] tag#12 Sense Key : Medium Error [current] [descriptor]
Sep 19 07:58:26 nas kernel: sd 1:0:0:0: [sdb] tag#12 Add. Sense: Unrecovered read error - auto reallocate failed
Sep 19 07:58:26 nas kernel: sd 1:0:0:0: [sdb] tag#12 CDB: Read(10) 28 00 00 90 00 49 00 00 01 00
Sep 19 07:58:26 nas kernel: blk_update_request: I/O error, dev sdb, sector 9437257
Sep 19 07:58:26 nas kernel: Buffer I/O error on dev sdb3, logical block 9, async page read
Sep 19 07:58:26 nas kernel: ata2: EH complete
Sep 19 07:58:26 nas kernel: ata2.00: exception Emask 0x0 SAct 0x1f800000 SErr 0x0 action 0x0
Sep 19 07:58:26 nas kernel: ata2.00: irq_stat 0x40000008
Sep 19 07:58:26 nas kernel: ata2.00: failed command: READ FPDMA QUEUED
Sep 19 07:58:26 nas kernel: ata2.00: cmd 60/01:e0:4a:00:90/00:00:00:00:00/40 tag 28 ncq 512 in
                                     res 41/40:00:4a:00:90/00:00:00:00:00/40 Emask 0x409 (media error) <F>
Sep 19 07:58:26 nas kernel: ata2.00: status: { DRDY ERR }
Sep 19 07:58:26 nas kernel: ata2.00: error: { UNC }
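
The kernel log gives the failing address twice, once as a decimal sector and once inside the Read(10) CDB. As a small cross-check, the sketch below decodes the CDB bytes by hand, relying on the standard READ(10) layout (bytes 2 to 5 are the big-endian LBA, bytes 7 and 8 the transfer length). The variable names are illustrative only.

```shell
#!/bin/sh
# CDB bytes copied from the journalctl line above.
cdb="28 00 00 90 00 49 00 00 01 00"

# Split the bytes into positional parameters: $1 is byte 0, $3..$6 are bytes 2..5.
set -- $cdb
lba=$((0x$3$4$5$6))    # big-endian 32-bit logical block address
count=$((0x$8$9))      # big-endian 16-bit transfer length in blocks
echo "LBA=$lba blocks=$count"
```

The decoded address agrees with the sector number in the blk_update_request line and with the LBA in the SMART error log.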

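So my question is: what causes the automatic reallocation to fail, and can the drive be recovered? As I said, the data is protected by RAID, so I am not worried about data recovery.

Answer 1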

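smartctl -l selftest /dev/sdb reports the location of the first bad block.

The procedure to apply is described in the smartmontools Bad block HOWTO.
It describes the following steps: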

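1. Find out whether the bad block belongs to a file
2. Restore or repair the file
3. Overwrite the bad block with zeros, which should allow the drive to repair it automatically
4. Run smartctl -t short /dev/sdb and go back to step 1 if a new bad block shows up
5. Run smartctl -t long /dev/sdb and go back to step 1 if a new bad block shows up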

[ The short test takes a few minutes, the long test takes several hours ]
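
Each pass of that loop starts from the LBA_of_first_error column of the self-test log. A minimal sketch of pulling that value out with awk follows; it parses sample lines copied from the log in the question so that it runs without a disk, and on a live system the input would come from smartctl -l selftest /dev/sdb instead. The variable names are illustrative.

```shell
#!/bin/sh
# Sample entries copied from the self-test log shown in the question.
log='# 1  Extended offline    Completed: read failure       90%     42999         9437257
# 2  Short offline       Completed: read failure       40%     42999         9437257
# 3  Extended offline    Completed without error       00%     39200         -'

# Take the last field (LBA_of_first_error) of the newest entry reporting a read failure.
first_bad=$(printf '%s\n' "$log" | awk '/read failure/ { print $NF; exit }')
echo "first bad LBA: $first_bad"
```

An empty result would mean no logged test hit a read failure, which is the exit condition for the loop.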
