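In short: I want to know whether an HDD with SMART failures can be repaired in any way and, if so, whether it is still reliable enough.
Details: I have a 4-year-old 1TB Western Digital drive (WD10JPVX-08JC3T6) that had not shown any problems before.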
Disk /dev/sda: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: c700a041-8c28-42e8-9adb-24a5f86b961a
Device Start End Sectors Size Type
/dev/sda1 2048 1050623 1048576 512M EFI System
/dev/sda2 1050624 1936945151 1935894528 923.1G Linux filesystem
/dev/sda3 1936945152 1953523711 16578560 7.9G Linux swap
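Suddenly (worth mentioning: possibly under high-humidity conditions) I found that my Debian root partition, sda2, had become read-only. I ran a smartctl long test, which completed with a read failure; Current_Pending_Sector was 109 and Reallocated_Event_Count was 0.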
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.0-29-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Blue Mobile
Device Model: WDC WD10JPVX-08JC3T6
Serial Number: WD-WX31A27AYEN4
LU WWN Device Id: 5 0014ee 65cbcd3de
Firmware Version: 08.01A08
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Sep 23 14:49:11 2020 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (18000) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 202) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x7035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 677
3 Spin_Up_Time 0x0027 189 179 021 Pre-fail Always - 1533
4 Start_Stop_Count 0x0032 091 091 000 Old_age Always - 9943
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
9 Power_On_Hours 0x0032 082 082 000 Old_age Always - 13700
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 1834
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 119
193 Load_Cycle_Count 0x0032 139 139 000 Old_age Always - 185919
194 Temperature_Celsius 0x0022 113 089 000 Old_age Always - 34
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 109
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 1
240 Head_Flying_Hours 0x0032 084 084 000 Old_age Always - 12291
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 40% 13684 1385167592
# 2 Short offline Completed without error 00% 13682 -
# 3 Extended offline Completed: read failure 30% 13678 1385167592
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
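To see where the failing LBA from the self-test log falls, it can be mapped onto the partition table shown earlier. This is a minimal sketch: the partition bounds are copied from the fdisk listing, while the 4096-byte filesystem block size is an assumption (a common ext4 default, not something the reports confirm), and `locate` is a made-up helper name.

```python
SECTOR = 512  # logical sector size reported by fdisk and smartctl

# (name, first LBA, last LBA), copied from the fdisk listing
PARTITIONS = [
    ("sda1", 2048, 1050623),
    ("sda2", 1050624, 1936945151),
    ("sda3", 1936945152, 1953523711),
]

def locate(lba, fs_block_size=4096):
    """Return (partition, sector offset in partition, filesystem block) for a disk LBA.

    fs_block_size is an assumption; the real value lives in the filesystem superblock.
    Returns None when the LBA is outside every listed partition.
    """
    for name, start, end in PARTITIONS:
        if start <= lba <= end:
            offset = lba - start
            return name, offset, offset * SECTOR // fs_block_size
    return None

# LBA_of_first_error from the first self-test log
print(locate(1385167592))
```

Under those assumptions the first error lands inside sda2, which matches the root partition being the one that went read-only.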
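So I unmounted sda2, took a backup, and ran e2fsck -fccky on it (which runs badblocks internally). It complained about IO errors in consecutive block groups, and also fixed the filesystem. Then, hoping this had helped, I ran another smartctl long test, only to find that Current_Pending_Sector had increased to 786 and LBA_of_first_error was now much smaller.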
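Many people consider a drive with SMART failures to be dead (as do many answers here), and I was ready to give up on my drive until I found a shop (not affiliated with WD) that claimed they could 'repair' it with a tool called PC-3000. They did their job and said the drive was now healthy, but I could not confirm that: I ran another smartctl long test, which again ended with a read failure, but all my previous SMART records were gone, and this time Current_Pending_Sector and Reallocated_Event_Count were both 0. I also ran another badblocks on the drive and found the same IO errors. I even dd-ed the reported blocks to confirm that they were unreadable. Their technician ignorantly insisted that I should install Windows on the drive to see whether it worked. Sure that the Windows installer would not even be able to create an NTFS filesystem there, I simply created a small 2M partition around the error spot (about 744 512B blocks) and ran a full (zeroing) mkfs.ntfs on it.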
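To my surprise, the filesystem was created successfully. I mounted it and was able to read from and write to the entire partition. I dd-ed those bad blocks again, and this time they were read successfully. Finally, I ran another smartctl long test, which also passed. (Raw_Read_Error_Rate is still high, though.) Here you can see the results; tests #2 and #1 were performed before and after mkfs.ntfs, respectively.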
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.0-29-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Blue Mobile
Device Model: WDC WD10JPVX-08JC3T6
Serial Number: WD-WX31A27AYEN4
LU WWN Device Id: 5 0014ee 65cbcd3de
Firmware Version: 08.01A08
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Thu Oct 8 05:54:51 2020 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (18000) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 202) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x7035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 410
3 Spin_Up_Time 0x0027 189 184 021 Pre-fail Always - 1550
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 34
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 44
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 14
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 9
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 33
194 Temperature_Celsius 0x0022 108 095 000 Old_age Always - 39
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 2
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
240 Head_Flying_Hours 0x0032 100 100 000 Old_age Always - 34
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 44 -
# 2 Extended offline Completed: read failure 40% 29 1345188144
# 3 Short offline Completed without error 00% 27 -
1 of 1 failed self-tests are outdated by newer successful extended offline self-test # 1
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
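Comparing the two attribute tables by eye is error-prone, so here is a small sketch that pulls the raw values out of a few rows copied from the two reports and lists what changed. It assumes the ten-column layout shown here and plain integer raw values, which holds for these reports but not for every drive; `parse_attributes` and `changed` are made-up helper names.

```python
def parse_attributes(text):
    """Map attribute name -> raw value for rows of a smartctl attribute table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = int(fields[9])  # assumes an integer RAW_VALUE
    return attrs

def changed(before, after):
    """Return {name: (old, new)} for attributes in both reports whose raw value differs."""
    return {n: (before[n], after[n])
            for n in before if n in after and before[n] != after[n]}

# Rows copied from the first and second reports.
BEFORE = """\
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 677
9 Power_On_Hours 0x0032 082 082 000 Old_age Always - 13700
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 109
"""
AFTER = """\
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 410
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 44
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
"""

for name, (old, new) in changed(parse_attributes(BEFORE), parse_attributes(AFTER)).items():
    print(f"{name}: {old} -> {new}")
```

Power_On_Hours dropping from 13700 to 44 is consistent with the previous SMART records having been wiped, though it says nothing about what else was done to the drive.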
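Then I ran another badblocks on the whole drive, and it found no errors.
Update: I have just restored my backup image (which means writing to virtually every usable block on the disk) and ran smartctl -t long successfully again.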
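The per-block read checks described above (dd on individual blocks, a read scan across the drive) boil down to the loop below. It is only a sketch: `unreadable_sectors` and `device_reader` are made-up names, reading a real /dev/sdX needs root, and reads served from the page cache may hide media errors, which is why a direct-I/O read (for example dd with iflag=direct) is the more trustworthy check.

```python
import os

def unreadable_sectors(read_at, first_lba, count, sector=512):
    """Read each sector on its own and return the LBAs that raise an I/O error.

    read_at is any callable taking (byte offset, length) and returning bytes.
    """
    bad = []
    for lba in range(first_lba, first_lba + count):
        try:
            read_at(lba * sector, sector)
        except OSError:  # the kernel reports unreadable sectors as EIO
            bad.append(lba)
    return bad

def device_reader(path):
    """Return a read_at(offset, length) callable backed by os.pread.

    Sketch only: the file descriptor stays open for the life of the process.
    """
    fd = os.open(path, os.O_RDONLY)
    return lambda offset, length: os.pread(fd, length, offset)
```

Passing the reader in as a callable keeps the scanning loop testable without touching a real disk.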
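To summarize the above:
- The drive had SMART failures and IO errors on some blocks, and the number of errors was apparently growing,
- something unknown to me was done to the drive with PC-3000,
- the drive was in the same state, with SMART still failing, but the previous SMART data was gone,
- I ran mkfs.ntfs over the error spot, and
- the errors suddenly disappeared and the SMART test passed.
- Note that I had not explicitly written to the error spot before that, although I guess badblocks would have done so anyway.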
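My questions:
- What really happened, and is there any explanation? Was the drive really damaged? Was it really repaired? If so, how? My naive guesses are:
  A. I was simply mistaken about the SMART test failing. The drive was fine from the beginning.
  B. PC-3000 had done its job, but the drive was just waiting for a write to the error spot to perform the repair (e.g. remapping the blocks).
  I assume mkfs.ntfs did nothing other than write zeros (or possibly filesystem content) to the error spot, right?
- Is my drive reliable now? Can I use it without worry? If so, does that mean an HDD with SMART failures can be repaired?
- What might PC-3000 do? Is it really a 'hardware repair' for a physically damaged drive?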
Answer 1
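Background: modern HDDs have some spare sectors to replace bad ones, and each sector carries extra bits to detect errors and to recover from the simpler ones. When the HDD hits a read error, it does not relocate the sector right away, so that the user can still try to recover the lost data. When a write is done to the failing sector, however, the drive overwrites the bad data, checks whether the data was written correctly, and relocates the failing sector if there is a checksum error. It does not always relocate, because the problem may have been transient (for example, the data may have been corrupted by a power loss during a write).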
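The behaviour described above can be illustrated with a toy model. This is a deliberately simplified sketch, not how any real firmware works: the class, its fields, and the split between "damaged" media and merely "corrupted" data are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ToyDrive:
    """Toy model of pending and reallocated sectors; not real firmware."""
    damaged: set = field(default_factory=set)    # sectors whose media is physically bad
    corrupted: set = field(default_factory=set)  # good media holding unreadable data
    pending: set = field(default_factory=set)    # what Current_Pending_Sector would count
    remapped: set = field(default_factory=set)   # what Reallocated_Sector_Ct would count

    def read(self, lba):
        """Return True if the sector reads back; a failed read marks it pending."""
        if lba in self.remapped:
            return True  # served from a spare sector
        if lba in self.damaged or lba in self.corrupted:
            self.pending.add(lba)  # not relocated yet, so the data could still be rescued
            return False
        return True

    def write(self, lba):
        """Overwrite the sector; relocate it only if the media itself is bad."""
        self.corrupted.discard(lba)  # new data replaces the unreadable contents
        if lba in self.damaged and lba not in self.remapped:
            self.remapped.add(lba)   # write verification failed: swap in a spare
        self.pending.discard(lba)    # either way the sector is no longer pending

drive = ToyDrive(damaged={10}, corrupted={20})
print(drive.read(10), drive.read(20), sorted(drive.pending))
drive.write(10)
drive.write(20)
print(drive.read(10), drive.read(20), sorted(drive.pending), sorted(drive.remapped))
```

In this model a sector that was only transiently corrupted leaves the pending list after a write without ever being counted as reallocated, which is one way a pending count can fall to zero while the reallocation counters stay at zero.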
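Now, to answer your questions:
- A failed SMART test means that there is a sector on the disk with a checksum error that could not be recovered (using the internal recovery algorithms); it does not necessarily mean that the whole disk is bad. Some organizations replace the disk at this point to be on the safe side, but I have seen a batch of problematic HDDs relocate some sectors in their first days and then stay stable and working for years.
  I don't know what this PC-3000 card does, but since you were still getting read errors, it seems that it did not trigger the relocation of all bad sectors.
  When you ran mkfs.ntfs and asked it to zero the whole partition, and all tests on the problem area passed afterwards, it seems that the relocation of the bad sectors was triggered. Reallocated_Event_Count is probably no longer zero.
- Probably not. It looks like the bad sectors have been fixed, but since your Raw_Read_Error_Rate is still high, other sectors will eventually fail too, and if your Current_Pending_Sector reached 786, then either not all sectors were remapped or your drive is about to use up all of its spare sectors. It looks like your drive is nearing the end of its life. It may hold on for a while, but I would not trust it with important data.
- I don't know. Since it is a hardware solution and uses an extra connector, maybe it can tweak the HDD firmware or use debug commands to access the raw data and checksums.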