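My problem description is fairly long, so I'll start with a short summary and then describe the situation precisely.
Short summary: The manufacturer's diagnostic tool found and repaired some errors on my hard drive. As far as I understand the tool's manual, those errors were bad blocks. However, smartctl (the Linux tool for querying SMART data on a drive) shows no reallocated sectors and says the drive is healthy. First question: how is that possible? Repairing a bad block means reallocating the sector, right? So why doesn't smartctl report any reallocated sectors? Second question: I bought this disk a few months ago and it is still under warranty. Should I ask the seller to replace it, or is the disk fine and safe to keep using?
Now the precise description: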
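I have a Western Digital hard drive, model WDC WD5000AAKX-001CA0. Recently I noticed that my computer sometimes hangs for some tens of seconds (about a minute). After such a hang, dmesg shows errors like these: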
knoppix@Microknoppix:~$ dmesg
(...)
[ 504.003363] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 504.003374] ata1.00: failed command: READ DMA EXT
[ 504.003383] ata1.00: cmd 25/00:00:80:07:01/00:02:00:00:00/e0 tag 0 dma 262144 in
[ 504.003385] res 40/00:00:09:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 504.003389] ata1.00: status: { DRDY }
[ 509.016652] ata1: link is slow to respond, please be patient (ready=0)
[ 514.030002] ata1: soft resetting link
[ 514.200386] ata1.00: configured for UDMA/133
[ 514.200420] ata1: EH complete
[ 546.003333] ata1: lost interrupt (Status 0x50)
[ 546.003364] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 546.003371] ata1.00: failed command: READ DMA EXT
[ 546.003380] ata1.00: cmd 25/00:00:80:15:06/00:02:00:00:00/e0 tag 0 dma 262144 in
[ 546.003381] res 40/00:00:09:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 546.003386] ata1.00: status: { DRDY }
[ 546.003401] ata1: soft resetting link
[ 546.181205] ata1.00: configured for UDMA/133
[ 546.181234] ata1: EH complete
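However, smartctl says "SMART overall-health self-assessment test result: PASSED" (I paste the full smartctl output a few paragraphs below). Whenever I tried to run a smartctl self-test (with smartctl -t short or smartctl -t long), the test was reported as aborted by host. So I downloaded the bootable-CD diagnostic tool for my drive, this one: http://support.wdc.com/product/download.asp?groupid=606&sid=2&lang=en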
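First I ran the quick test with this tool, and it reported an error (unfortunately, I don't remember the error code). As far as I understand, the quick test simply runs the SMART short self-test (http://wdc.custhelp.com/app/answers/detail/search/1/a_id/940/c/130/p/227,295 says "QUICK TEST - performs a SMART drive quick self-test to gather and verify the Data Lifeguard information contained on the drive."). Then I ran the extended test. As far as I understand, the extended test looks for bad sectors (http://wdc.custhelp.com/app/answers/detail/search/1/a_id/940/c/130/p/227,295 says "EXTENDED TEST - performs a full media scan to detect bad sectors"). After some time, the tool reported that it had found and repaired some errors.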
Then I booted the machine with Knoppix and ran "smartctl --all". Here is its output:
root@Microknoppix:/home/knoppix# smartctl --all /dev/sda
smartctl 5.43 2012-06-05 r3561 [i686-linux-3.4.9] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Blue Serial ATA
Device Model: WDC WD5000AAKX-001CA0
Serial Number: WD-WMAYUW952768
LU WWN Device Id: 5 0014ee 6ad1d9ef1
Firmware Version: 15.01H15
User Capacity: 500,107,862,016 bytes [500 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Wed Dec 12 03:34:39 2012 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 8160) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 83) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3037) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 486
3 Spin_Up_Time 0x0027 189 141 021 Pre-fail Always - 1525
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 587
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 098 098 000 Old_age Always - 1553
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 578
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 173
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 413
194 Temperature_Celsius 0x0022 097 093 000 Old_age Always - 46
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 5
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 5
SMART Error Log Version: 1
ATA Error Count: 2
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 2 occurred at disk power-on lifetime: 1548 hours (64 days + 12 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 51 01 30 4f c2 a0 Error: ABRT
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
b0 d6 01 be 4f c2 a0 02 00:02:58.316 SMART WRITE LOG
b0 da 01 00 4f c2 a0 02 00:02:58.259 SMART RETURN STATUS
80 44 00 00 44 57 a0 02 00:02:58.259 [VENDOR SPECIFIC]
b0 d6 01 be 4f c2 a0 02 00:02:58.241 SMART WRITE LOG
80 45 00 01 44 57 a0 02 00:02:58.241 [VENDOR SPECIFIC]
Error 1 occurred at disk power-on lifetime: 1515 hours (63 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 51 01 30 4f c2 a0 Error: ABRT
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
b0 d6 01 be 4f c2 a0 02 00:02:21.841 SMART WRITE LOG
b0 da 01 00 4f c2 a0 02 00:02:21.784 SMART RETURN STATUS
80 44 00 00 44 57 a0 02 00:02:21.784 [VENDOR SPECIFIC]
b0 d6 01 be 4f c2 a0 02 00:02:21.768 SMART WRITE LOG
80 45 00 01 44 57 a0 02 00:02:21.768 [VENDOR SPECIFIC]
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Conveyance offline Completed without error 00% 1552 -
# 2 Conveyance offline Completed: read failure 90% 1548 787927349
# 3 Conveyance offline Completed: read failure 90% 1515 883391611
# 4 Short offline Completed without error 00% 1503 -
# 5 Short offline Completed without error 00% 1503 -
# 6 Short offline Aborted by host 80% 1502 -
# 7 Extended offline Completed without error 00% 9 -
# 8 Short offline Completed without error 00% 6 -
# 9 Short offline Aborted by host 90% 6 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
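As you can see, on the one hand, a Conveyance offline self-test completed with a read failure. On the other hand, all the attributes seem fine; for example, Reallocated_Sector_Ct is 0.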
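The read failures in the self-test log above can be pulled out programmatically, which makes it easier to compare the failing LBAs between runs. This is only a sketch: `read_failures` is a hypothetical helper name, the regular expression assumes the log layout exactly as pasted above, and the 512-byte sector size is taken from the "Sector Size" line of the same smartctl output.

```python
import re

SECTOR_SIZE = 512  # bytes; from the "Sector Size" line in the smartctl output

# Matches one self-test log entry, e.g.
#   "# 2 Conveyance offline Completed: read failure 90% 1548 787927349"
# Test description and status are captured together because the columns
# are separated only by whitespace.
ENTRY_RE = re.compile(r"^#\s*(\d+)\s+(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$")


def read_failures(log_text):
    """Return the self-test log entries that ended with a read failure."""
    failures = []
    for line in log_text.splitlines():
        match = ENTRY_RE.match(line.strip())
        if match is None:
            continue  # header line or unrelated output
        num, text, _remaining, hours, lba = match.groups()
        if "read failure" not in text or lba == "-":
            continue  # test passed, was aborted, or recorded no LBA
        failures.append({
            "num": int(num),
            "test": text,
            "lifetime_hours": int(hours),
            "lba": int(lba),
            # Byte position of the first failing sector on the device.
            "byte_offset": int(lba) * SECTOR_SIZE,
        })
    return failures


if __name__ == "__main__":
    sample = """\
# 1 Conveyance offline Completed without error 00% 1552 -
# 2 Conveyance offline Completed: read failure 90% 1548 787927349
# 3 Conveyance offline Completed: read failure 90% 1515 883391611
"""
    for entry in read_failures(sample):
        print(entry)
```

Fed the log above, this reports entries 2 and 3, whose first failing LBAs differ between the two runs.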
I also tried once more to cat the whole disk to /dev/null, and errors showed up in dmesg again:
root@Microknoppix:/home/knoppix# nice -n 20 ionice -c 3 cat /dev/sda > /dev/null
During this cat, dmesg shows errors like these:
knoppix@Microknoppix:~$ dmesg
(...)
[ 504.003363] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 504.003374] ata1.00: failed command: READ DMA EXT
[ 504.003383] ata1.00: cmd 25/00:00:80:07:01/00:02:00:00:00/e0 tag 0 dma 262144 in
[ 504.003385] res 40/00:00:09:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 504.003389] ata1.00: status: { DRDY }
[ 509.016652] ata1: link is slow to respond, please be patient (ready=0)
[ 514.030002] ata1: soft resetting link
[ 514.200386] ata1.00: configured for UDMA/133
[ 514.200420] ata1: EH complete
[ 546.003333] ata1: lost interrupt (Status 0x50)
[ 546.003364] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 546.003371] ata1.00: failed command: READ DMA EXT
[ 546.003380] ata1.00: cmd 25/00:00:80:15:06/00:02:00:00:00/e0 tag 0 dma 262144 in
[ 546.003381] res 40/00:00:09:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 546.003386] ata1.00: status: { DRDY }
[ 546.003401] ata1: soft resetting link
[ 546.181205] ata1.00: configured for UDMA/133
[ 546.181234] ata1: EH complete
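The failed READ DMA EXT lines above can be decoded to see which region of the disk was being read when the timeout occurred. A minimal sketch, assuming the field order libata uses when it prints the taskfile (command/feature:count:LBA low:mid:high, then the high-order bytes, then the device register) and assuming a 48-bit command; `decode_taskfile` is a hypothetical helper name, not part of any existing tool.

```python
import re

# Assumed libata layout of the "cmd" line (every field is one hex byte):
#   CMD/FEATURE:NSECT:LBAL:LBAM:LBAH/HOB_FEATURE:HOB_NSECT:HOB_LBAL:HOB_LBAM:HOB_LBAH/DEVICE
TASKFILE_RE = re.compile(r"cmd\s+((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")

SECTOR_SIZE = 512  # bytes; logical sector size reported by smartctl for this drive


def decode_taskfile(line):
    """Extract the start LBA and transfer size from a libata 'cmd' line."""
    match = TASKFILE_RE.search(line)
    if match is None:
        raise ValueError("no taskfile found in line")
    fields = [int(byte, 16) for byte in re.split(r"[/:]", match.group(1))]
    (command, _feature, nsect, lbal, lbam, lbah,
     _hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah, _device) = fields
    # 48-bit LBA: the "HOB" (high-order) bytes supply bits 24..47.
    lba = (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
           | lbah << 16 | lbam << 8 | lbal)
    # 16-bit sector count; in 48-bit commands a count of 0 means 65536.
    sectors = hob_nsect << 8 | nsect
    if sectors == 0:
        sectors = 65536
    return {
        "command": command,
        "lba": lba,
        "sectors": sectors,
        "bytes": sectors * SECTOR_SIZE,
    }


if __name__ == "__main__":
    line = ("[ 504.003383] ata1.00: cmd 25/00:00:80:07:01/00:02:00:00:00/e0 "
            "tag 0 dma 262144 in")
    print(decode_taskfile(line))
```

For the first failed command the decoded size is 512 sectors, that is 262144 bytes, which agrees with the `dma 262144` printed on the same line and is a useful sanity check on the assumed field order.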
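I thought this might be a fault in the motherboard or in the data cable connecting the disk to the motherboard. So I connected a different disk to the motherboard using the same cable and port and cat'ed it to /dev/null. That succeeded, and dmesg showed no errors.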
Answer 1
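There are no reallocated sectors because the drive failed to reallocate them. Your drive shows 5 Offline_Uncorrectable sectors, which is what happens when the automatic repair fails. Your dmesg output shows clear read failures, and both the SMART error log and the SMART self-tests show read failures as well. As you mention in your question, there are ways to repair these sectors, but in my experience that is a very short-term fix.
Replace the drive.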
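The overall "PASSED" line can coexist with a nonzero raw count such as the Offline_Uncorrectable value of 5 above, so it can help to look at the sector-related raw values directly. Below is a rough sketch that scans smartctl's attribute table for them. The set of watched IDs (5, 196, 197, 198) is my own choice for illustration, not a smartctl feature, and `nonzero_sector_attributes` is a hypothetical helper name. A nonzero raw value is a reason to look closer, not a verdict by itself.

```python
import re

# Sector-health attributes picked for this illustration (an assumption):
# 5 Reallocated_Sector_Ct, 196 Reallocated_Event_Count,
# 197 Current_Pending_Sector, 198 Offline_Uncorrectable
WATCHED_IDS = {5, 196, 197, 198}

# One attribute row as printed by smartctl, e.g.
#   "198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 5"
# Groups: id, name, value, worst, thresh, leading integer of RAW_VALUE.
ATTR_RE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def nonzero_sector_attributes(smartctl_text):
    """Map attribute ID -> (name, raw value) for watched attributes with raw > 0."""
    flagged = {}
    for line in smartctl_text.splitlines():
        match = ATTR_RE.match(line)
        if match is None:
            continue  # not an attribute row
        attr_id = int(match.group(1))
        raw_value = int(match.group(6))
        if attr_id in WATCHED_IDS and raw_value > 0:
            flagged[attr_id] = (match.group(2), raw_value)
    return flagged


if __name__ == "__main__":
    sample = """\
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 5
"""
    print(nonzero_sector_attributes(sample))
```

Run against the attribute table in the question, only attribute 198 is flagged.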