I have a pair of WD drives, and one of them was recently kicked out of a RAID1 array.
SMART shows I/O errors on some sectors, but all the SMART attributes look fine:
root@nas:~# smartctl -a /dev/sdb
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.4.68.x86_64.1] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD20EFRX-68AX9N0
Serial Number: WD-WMC30xxxxxxxx
LU WWN Device Id: 5 0014ee 602ce8a27
Firmware Version: 80.00A80
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Tue Sep 19 07:50:28 2017 WEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 121) The previous self-test completed having
the read element of the test failed.
Total time to complete Offline
data collection: (26940) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 272) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x70bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 191 191 051 Pre-fail Always - 110178
3 Spin_Up_Time 0x0027 195 170 021 Pre-fail Always - 3208
4 Start_Stop_Count 0x0032 065 065 000 Old_age Always - 35326
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 042 042 000 Old_age Always - 43024
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 31
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 14
193 Load_Cycle_Count 0x0032 189 189 000 Old_age Always - 35311
194 Temperature_Celsius 0x0022 120 103 000 Old_age Always - 27
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
SMART Error Log Version: 1
ATA Error Count: 1
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 1 occurred at disk power-on lifetime: 43000 hours (1791 days + 16 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 01 49 00 90 e0 Error: UNC at LBA = 0x00900049 = 9437257
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
40 00 01 49 00 90 e0 08 14d+22:24:56.107 READ VERIFY SECTOR(S)
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 42999 9437257
# 2 Short offline Completed: read failure 40% 42999 9437257
# 3 Extended offline Completed without error 00% 39200 -
# 4 Extended offline Completed without error 00% 39033 -
# 5 Extended offline Completed without error 00% 38864 -
# 6 Extended offline Completed without error 00% 38708 -
# 7 Extended offline Completed without error 00% 38540 -
# 8 Extended offline Completed without error 00% 38396 -
# 9 Short offline Completed without error 00% 0 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
So there is a reproducible error at LBA 9437257, which I can confirm with dd:
root@nas:~# export i=9437257
root@nas:~# while [ $i -lt 9437280 ]; do echo $i; dd if=/dev/sdb of=/dev/null bs=512 count=1 skip=$i; let i+=1; done
9437257
dd: error reading '/dev/sdb': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 0.325588 s, 0.0 kB/s
9437258
dd: error reading '/dev/sdb': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 0.164007 s, 0.0 kB/s
9437259
dd: error reading '/dev/sdb': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 0.162149 s, 0.0 kB/s
9437260
dd: error reading '/dev/sdb': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 0.161994 s, 0.0 kB/s
9437261
dd: error reading '/dev/sdb': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 0.161854 s, 0.0 kB/s
9437262
dd: error reading '/dev/sdb': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 0.16294 s, 0.0 kB/s
9437263
dd: error reading '/dev/sdb': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 0.161955 s, 0.0 kB/s
9437264
1+0 records in
1+0 records out
512 bytes copied, 0.0212458 s, 24.1 kB/s
9437265
1+0 records in
1+0 records out
512 bytes copied, 0.000336436 s, 1.5 MB/s
9437266
1+0 records in
1+0 records out
512 bytes copied, 0.000300649 s, 1.7 MB/s
9437267
1+0 records in
1+0 records out
512 bytes copied, 0.000284451 s, 1.8 MB/s
9437268
1+0 records in
1+0 records out
512 bytes copied, 0.00031215 s, 1.6 MB/s
9437269
1+0 records in
1+0 records out
512 bytes copied, 0.000287936 s, 1.8 MB/s
9437270
1+0 records in
1+0 records out
512 bytes copied, 0.000302617 s, 1.7 MB/s
9437271
1+0 records in
1+0 records out
512 bytes copied, 0.000294914 s, 1.7 MB/s
9437272
1+0 records in
1+0 records out
512 bytes copied, 0.000713134 s, 718 kB/s
9437273
1+0 records in
1+0 records out
512 bytes copied, 0.000416336 s, 1.2 MB/s
9437274
1+0 records in
1+0 records out
512 bytes copied, 0.000289526 s, 1.8 MB/s
9437275
1+0 records in
1+0 records out
512 bytes copied, 0.000300769 s, 1.7 MB/s
9437276
1+0 records in
1+0 records out
512 bytes copied, 0.000294524 s, 1.7 MB/s
9437277
1+0 records in
1+0 records out
512 bytes copied, 0.000295592 s, 1.7 MB/s
9437278
1+0 records in
1+0 records out
512 bytes copied, 0.00034751 s, 1.5 MB/s
9437279
1+0 records in
1+0 records out
512 bytes copied, 0.000301789 s, 1.7 MB/s
root@nas:~#
I tried writing to those sectors to force a reallocation, but only got more errors:
root@nas:~# dd if=/dev/zero of=/dev/sdb bs=512 count=7 seek=9437257
dd: error writing '/dev/sdb': Input/output error
1+0 records in
0+0 records out
0 bytes copied, 0.168565 s, 0.0 kB/s
root@nas:~#
I also tried sg_verify and sg_reassign:
root@nas:~# sg_verify /dev/sdb --lba=9437257
verify (10):
Descriptor format, current; Sense key: Medium Error
Additional sense: Unrecovered read error - auto reallocate failed
Descriptor type: Information: 0x0000000000900049
VERIFY(10) medium or hardware error, reported lba=0x900049
root@nas:~# sg_reassign --address=9437257 /dev/sdb
REASSIGN BLOCKS: Illegal request, invalid opcode
There is also a lot of noise in journalctl:
Sep 19 07:58:26 nas kernel: ata2.00: configured for UDMA/133
Sep 19 07:58:26 nas kernel: sd 1:0:0:0: [sdb] tag#12 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 19 07:58:26 nas kernel: sd 1:0:0:0: [sdb] tag#12 Sense Key : Medium Error [current] [descriptor]
Sep 19 07:58:26 nas kernel: sd 1:0:0:0: [sdb] tag#12 Add. Sense: Unrecovered read error - auto reallocate failed
Sep 19 07:58:26 nas kernel: sd 1:0:0:0: [sdb] tag#12 CDB: Read(10) 28 00 00 90 00 49 00 00 01 00
Sep 19 07:58:26 nas kernel: blk_update_request: I/O error, dev sdb, sector 9437257
Sep 19 07:58:26 nas kernel: Buffer I/O error on dev sdb3, logical block 9, async page read
Sep 19 07:58:26 nas kernel: ata2: EH complete
Sep 19 07:58:26 nas kernel: ata2.00: exception Emask 0x0 SAct 0x1f800000 SErr 0x0 action 0x0
Sep 19 07:58:26 nas kernel: ata2.00: irq_stat 0x40000008
Sep 19 07:58:26 nas kernel: ata2.00: failed command: READ FPDMA QUEUED
Sep 19 07:58:26 nas kernel: ata2.00: cmd 60/01:e0:4a:00:90/00:00:00:00:00/40 tag 28 ncq 512 in
res 41/40:00:4a:00:90/00:00:00:00:00/40 Emask 0x409 (media error) <F>
Sep 19 07:58:26 nas kernel: ata2.00: status: { DRDY ERR }
Sep 19 07:58:26 nas kernel: ata2.00: error: { UNC }
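So my question is: why does the auto reallocation fail, and can the drive be recovered? As I said, the data is protected by RAID, so I am not worried about data recovery.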
Answer 1
smartctl -l selftest /dev/sdb
reports the location of the first bad block.
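As a minimal sketch, the LBA can be pulled out of that log with awk. It assumes the log layout shown in the question (entries start with `#`, and LBA_of_first_error is the last column); `first_bad_lba` is a hypothetical helper name, not part of smartmontools.

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -l selftest` output on stdin and print
# the LBA_of_first_error of the first listed (most recent) failed self-test.
# Prints nothing if no entry reports a read failure.
first_bad_lba() {
    awk '/^#/ && /read failure/ { print $NF; exit }'
}

# Usage on a live system (needs root):
#   smartctl -l selftest /dev/sdb | first_bad_lba
```

Fed the self-test log from the question, this would print 9437257.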
Apply the procedure described in the smartmontools bad block HOWTO.
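It describes the following steps:
- Find out whether the bad block belongs to a file
- Restore or repair the file
- Overwrite the bad block with zeros, which should let the drive repair it automatically
- Run
smartctl -t short /dev/sdb
and go back to the first step if a new bad block turns up
- Run
smartctl -t long /dev/sdb
and go back to the first step if a new bad block turns up
[ The short test takes a few minutes; the long test takes several hours. ]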
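One detail worth checking before the overwrite step: the drive in the question reports 512-byte logical and 4096-byte physical sectors, so a write that covers only part of a physical sector forces the drive to read the rest of it first, which may fail on a damaged sector. That is a plausible reading of the failed 7-sector dd, not something confirmed in this thread. The sketch below only does the arithmetic and prints a dd command for review; `phys_span` and `zero_cmd` are hypothetical helper names, and the 8:1 ratio is taken from the smartctl output above.

```shell
#!/bin/sh
# Assumption: 4096-byte physical / 512-byte logical sectors (from smartctl -a).
LOGICAL=512
PHYSICAL=4096
RATIO=$((PHYSICAL / LOGICAL))   # logical sectors per physical sector: 8

# Print the first and last logical LBA of the physical sector containing $1.
phys_span() {
    first=$(( $1 / RATIO * RATIO ))
    echo "$first $((first + RATIO - 1))"
}

# Print (do not run) a dd command that zeroes that whole physical sector.
# $1 = device, $2 = any logical LBA inside the bad physical sector.
zero_cmd() {
    echo "dd if=/dev/zero of=$1 bs=$PHYSICAL count=1 seek=$(( $2 / RATIO )) oflag=direct"
}

phys_span 9437257          # LBA from the self-test log
zero_cmd /dev/sdb 9437257  # review carefully: running it destroys that sector's data
```

For the LBA in the question, the span comes out as 9437256 to 9437263, which lines up with the seven failing reads at 9437257 through 9437263 (9437256 was not covered by the dd loop).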