Has my SMART-failing HDD really been fixed?

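In short: I want to know whether an HDD that fails SMART can be fixed by any means and, if so, whether it is still reliable enough.

Details: I have a 4-year-old 1TB Western Digital drive (WD10JPVX-08JC3T6) that had never given me trouble before.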

Disk /dev/sda: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: C700A041-8C28-42E8-9ADB-24A5F86B961A

Device          Start        End    Sectors   Size Type
/dev/sda1        2048    1050623    1048576   512M EFI System
/dev/sda2     1050624 1936945151 1935894528 923.1G Linux filesystem
/dev/sda3  1936945152 1953523711   16578560   7.9G Linux swap

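Suddenly (worth mentioning: possibly under high-humidity conditions) I found that my Debian root partition, sda2, had gone read-only. I ran a long smartctl test, which completed with read failure; Current_Pending_Sector was 109 and Reallocated_Event_Count was 0.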

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.0-29-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue Mobile
Device Model:     WDC WD10JPVX-08JC3T6
Serial Number:    WD-WX31A27AYEN4
LU WWN Device Id: 5 0014ee 65cbcd3de
Firmware Version: 08.01A08
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Sep 23 14:49:11 2020 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (18000) seconds.
Offline data collection
capabilities:            (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    ( 202) minutes.
Conveyance self-test routine
recommended polling time:    (   5) minutes.
SCT capabilities:          (0x7035) SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       677
  3 Spin_Up_Time            0x0027   189   179   021    Pre-fail  Always       -       1533
  4 Start_Stop_Count        0x0032   091   091   000    Old_age   Always       -       9943
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   082   082   000    Old_age   Always       -       13700
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1834
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       119
193 Load_Cycle_Count        0x0032   139   139   000    Old_age   Always       -       185919
194 Temperature_Celsius     0x0022   113   089   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       109
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1
240 Head_Flying_Hours       0x0032   084   084   000    Old_age   Always       -       12291

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       40%     13684         1385167592
# 2  Short offline       Completed without error       00%     13682         -
# 3  Extended offline    Completed: read failure       30%     13678         1385167592

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

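So I unmounted sda2, took a backup, and ran e2fsck -fccky on it (which runs badblocks internally). It complained about IO errors in consecutive block groups while also repairing the filesystem. Then, hoping this had helped, I ran another long smartctl test, only to find that Current_Pending_Sector had increased to 786 and LBA_of_first_error was now much smaller.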
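As a side calculation, the LBA_of_first_error from the self-test log can be mapped to a filesystem block inside sda2, which helps when checking what lives at the failing spot. This is only a sketch: the sector size and partition bounds come from the fdisk listing above, while the 4096-byte filesystem block size is an assumption (the post does not state it), and the function name `lba_to_fs_block` is made up for illustration.

```python
SECTOR_SIZE = 512      # logical sector size, from the fdisk/smartctl output
FS_BLOCK_SIZE = 4096   # ASSUMED filesystem block size; not stated in the post


def lba_to_fs_block(lba, part_start, part_end):
    """Map an absolute disk LBA to a filesystem block number in a partition."""
    if not part_start <= lba <= part_end:
        raise ValueError("LBA is outside the partition")
    byte_offset = (lba - part_start) * SECTOR_SIZE
    return byte_offset // FS_BLOCK_SIZE


# sda2 bounds from the fdisk listing; LBA from the first self-test log
print(lba_to_fs_block(1385167592, 1050624, 1936945151))  # 173014621
```

With a different block size the result changes accordingly, so the real value should be checked against the filesystem before relying on the number.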

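Many people consider a SMART-failing drive to be dead (as do many answers here), and I was ready to give up on mine until I found a shop (not affiliated with WD) that claimed it could 'repair' my HDD with a tool called PC-3000. They did their work and said the drive was now healthy, but I could not confirm that: I ran another long smartctl test, which again ended in read failure, but all my previous SMART reports were gone and, this time, Current_Pending_Sector and Reallocated_Event_Count were both 0. I also ran another badblocks on the drive, which found the same IO errors. I even dd'ed the reported blocks to confirm that they could not be read.

Their technician ignorantly insisted that I should install Windows on the drive to see that it worked. Confident that Windows Setup could not even create an NTFS filesystem there, I simply created a small 2M partition around the faulty spot (about 744 512B blocks were bad) and ran a full (zeroing) mkfs.ntfs on it. To my surprise, the filesystem was created successfully. I mounted it and was able to read and write the whole partition. I dd'ed those bad blocks again, and this time they read fine. Finally, I ran another long smartctl test, which also passed (although Raw_Read_Error_Rate is still high). Below you can see the results of tests #2 and #1, run before and after mkfs.ntfs respectively.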
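The dd check described above, reading each suspect sector and seeing whether the read fails, could be sketched in Python roughly as follows. This is not what the post ran: the function name `find_unreadable_sectors` is hypothetical, and the reads go through the page cache, so on a real device recently cached sectors may appear readable even if the media is bad.

```python
import os

SECTOR_SIZE = 512  # logical sector size of the drive


def find_unreadable_sectors(path, first_lba, count):
    """Read each sector in [first_lba, first_lba + count) and return the
    LBAs that raise an I/O error or come back shorter than one sector."""
    bad = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for lba in range(first_lba, first_lba + count):
            try:
                data = os.pread(fd, SECTOR_SIZE, lba * SECTOR_SIZE)
            except OSError:
                bad.append(lba)      # the kernel reported a read error
                continue
            if len(data) != SECTOR_SIZE:
                bad.append(lba)      # short read, e.g. past the end
    finally:
        os.close(fd)
    return bad
```

On the drive itself this would be called as root with something like `find_unreadable_sectors("/dev/sda", 1345188144, 744)`, where the start LBA and count are taken from the figures in the post purely as an example.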

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.0-29-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue Mobile
Device Model:     WDC WD10JPVX-08JC3T6
Serial Number:    WD-WX31A27AYEN4
LU WWN Device Id: 5 0014ee 65cbcd3de
Firmware Version: 08.01A08
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Oct  8 05:54:51 2020 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (18000) seconds.
Offline data collection
capabilities:            (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    ( 202) minutes.
Conveyance self-test routine
recommended polling time:    (   5) minutes.
SCT capabilities:          (0x7035) SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       410
  3 Spin_Up_Time            0x0027   189   184   021    Pre-fail  Always       -       1550
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       34
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       44
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       14
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       9
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       33
194 Temperature_Celsius     0x0022   108   095   000    Old_age   Always       -       39
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       2
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
240 Head_Flying_Hours       0x0032   100   100   000    Old_age   Always       -       34

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%        44         -
# 2  Extended offline    Completed: read failure       40%        29         1345188144
# 3  Short offline       Completed without error       00%        27         -
1 of 1 failed self-tests are outdated by newer successful extended offline self-test # 1

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

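Then I ran another badblocks over the whole drive, and it found no errors.

Update: I have just restored my backup image (which means writing to virtually every usable block on the disk) and again ran smartctl -t long successfully.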

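To summarize the above:

  1. The drive was failing SMART and had blocks with IO errors, and the number of errors was clearly growing,
  2. something unknown to me was done to the drive with PC-3000,
  3. the drive was in the same state, still failing SMART, but the previous SMART data was gone,
  4. mkfs.ntfs was run over the faulty spot, and
  5. the errors suddenly disappeared and the SMART test passed.
  • Note that I had not explicitly written to the faulty spot before, although I guess badblocks would have done so anyway.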

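My questions:

  1. What actually happened, and is there an explanation? Was the drive really damaged? Has it really been fixed? If so, how? My naive guesses are:

    A. I was simply wrong in thinking the SMART test had failed. The drive was fine from the start.

    B. PC-3000 did its job, but the drive was waiting for a write to the faulty spot before performing the repair (e.g. remapping the blocks).

    • I assume mkfs.ntfs did nothing other than write zeros (or possibly filesystem contents) to the faulty spot, right?
  2. Is my drive reliable now? Can I use it without worry? If so, does that mean a SMART-failing drive can be fixed?

  3. What might PC-3000 have done? Is it really a 'hardware repair' for a physically damaged drive?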

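Answer 1

Background: modern HDDs keep a number of spare sectors to replace bad ones, and each sector carries extra bits to detect errors and to recover from the simpler ones. When an HDD hits a read error, it does not reallocate the sector right away, so that the user can still try to recover the lost data. When a write is made to the failing sector, however, the drive overwrites the damaged data, checks whether the data was written correctly, and reallocates the sector if there is a checksum error. It does not always reallocate, because the problem may be transient (for example, the data may have been corrupted by a power loss during a write).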
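The behaviour described in this background paragraph can be illustrated with a toy model. It is a deliberately simplified sketch, not real firmware logic: the class `ToyDrive`, its fields, and the `media_ok` flag (standing in for whether the physical sector can hold data) are all invented for illustration.

```python
class ToyDrive:
    """Toy model of pending/reallocated sector bookkeeping (not real firmware)."""

    def __init__(self, spare_sectors):
        self.spares = spare_sectors
        self.pending = set()       # sectors that failed a read, awaiting a write
        self.reallocated = set()   # sectors remapped to a spare

    def read(self, lba, media_ok):
        """A failed read only marks the sector pending; nothing is remapped."""
        if not media_ok and lba not in self.reallocated:
            self.pending.add(lba)
            return False
        return True

    def write(self, lba, media_ok):
        """A write either clears a pending sector or remaps it to a spare."""
        if lba in self.reallocated:
            return True
        self.pending.discard(lba)  # new data replaces the unreadable copy
        if media_ok:
            return True            # transient fault: no remap needed
        if self.spares == 0:
            self.pending.add(lba)  # nothing left to remap to
            return False
        self.spares -= 1
        self.reallocated.add(lba)
        return True
```

In this model a sector whose fault was transient leaves the pending set after a write without any reallocation being counted, while a sector with a persistent fault consumes one spare.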

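Now to your questions:

  1. A failed SMART test means that some sector on the disk has a checksum error that could not be recovered (by the internal recovery algorithms); it does not necessarily mean the whole disk is bad. Some organizations replace the disk at this point to be on the safe side, but I have seen a batch of problematic HDDs reallocate a few sectors in their first days and then stay stable and working for years.

I don't know what this PC-3000 card does, but since you were still getting read errors afterwards, it apparently did not trigger reallocation of all the bad sectors.

When you ran mkfs.ntfs and asked it to zero the whole partition, and every test against the problem area passed afterwards, that appears to have triggered reallocation of the bad sectors. Reallocated_Event_Count is probably no longer zero.

  2. Probably not. It looks like the bad sectors have been fixed, but since your Raw_Read_Error_Rate is still high, other sectors will eventually fail as well, and if your Current_Pending_Sector reached 786, then either not all sectors have been remapped or your drive is about to use up all of its spare sectors. Your drive appears to be nearing the end of its life. It may hold on for a while, but I would not trust it with important data.

  3. I don't know. Since it is a hardware solution and uses an extra connector, perhaps it can tweak the HDD firmware or use debug commands to access the raw data and checksums.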
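To keep an eye on the counters discussed in this answer, the attribute table printed by `smartctl -A` can be parsed and the relevant raw values compared over time. The following is a minimal sketch assuming the ten-column layout shown in the question; the function names and the selection of watched attributes are illustrative choices, not something prescribed by smartmontools.

```python
def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} from a smartctl attribute table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                continue  # not a plain integer raw value; skipped in this sketch
    return attrs


# Attributes related to bad sectors (an illustrative selection)
WATCHED = ("Reallocated_Sector_Ct", "Reallocated_Event_Count",
           "Current_Pending_Sector", "Offline_Uncorrectable")


def nonzero_watched(attrs):
    """Return the watched attributes whose raw value is above zero."""
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) > 0}
```

Fed with the first report in the question, `nonzero_watched` would flag Current_Pending_Sector at 109; with the second report it would return an empty dict, which shows why these counters alone cannot prove the drive is healthy after they have been reset.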
