4 HDDs in RAID 10: one specific disk shows as degraded on every reboot. Is the disk failing? smartmontools output included

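This all started about 6 months ago, when I switched to a new motherboard. The RAID used to be my main boot drive. I upgraded the CPU, which meant a motherboard upgrade, and at the same time I moved to an SSD as my main drive. I didn't do anything special to bring the RAID over: I put the drives in, installed the RAID drivers, and rebooted to turn RAID on in the BIOS. Everything was intact, except that this one drive showed as degraded on the first boot.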

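Once I booted into Windows, the Intel RST software automatically started a repair, which took about 2 hours to complete. Everything was fine after the repair, so I shut down. I then opened the case and rechecked all the cables, connections, drive seating, and so on. Everything looked fine.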

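Then it showed up as degraded again on the next boot, and again repaired without trouble. So I switched the port the drive was on, but the same problem persisted, with the same drive degraded. I moved any really important files off the RAID and have been using it as storage ever since. I haven't noticed any problems between reboots when storing or retrieving files on the RAID. Still, on the rare occasions when I do reboot, the drive reports as degraded again, like clockwork.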

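During the repair the drive is marked with a warning, but there is no other relevant information. After the repair, Intel's RST tool reports Status: Normal for every disk.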

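After a while I considered dismantling the RAID, but before doing that I want to know what is actually wrong with this drive, because everything ran without any problems before I changed the motherboard, which seems like quite a coincidence. The drives are old, but if this one works I'd like to keep it as a spare or for temporary storage.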

Anyway, I couldn't find much information with Intel RST alone, and couldn't find any logs, so I found smartmontools and tried it from the command line:

smartctl -a /dev/csmi0,3

with 0,3 being the failing drive. The results were:

smartctl 6.5 2016-05-07 r4318 [x86_64-w64-mingw32-win7-sp1] (sf-6.5-1)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Maxtor DiamondMax 10 (ATA/133 and SATA/150)
Device Model:     Maxtor 6B300S0
Serial Number:    B60NDB2H
Firmware Version: BANC1B70
User Capacity:    300,090,728,448 bytes [300 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA/ATAPI-7 T13/1532D revision 0
Local Time is:    Mon Jan 16 23:28:51 2017 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                ( 2283) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 118) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0027   183   149   063    Pre-fail  Always       -       28226
  4 Start_Stop_Count        0x0032   250   250   000    Old_age   Always       -       6749
  5 Reallocated_Sector_Ct   0x0033   140   140   063    Pre-fail  Always       -       1149
  6 Read_Channel_Margin     0x0001   253   253   100    Pre-fail  Offline      -       0
  7 Seek_Error_Rate         0x000a   253   247   000    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0027   250   231   187    Pre-fail  Always       -       56890
  9 Power_On_Minutes        0x0032   001   001   000    Old_age   Always       -       851h+19m
 10 Spin_Retry_Count        0x002b   253   209   157    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x002b   253   252   223    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   251   251   000    Old_age   Always       -       1014
192 Power-Off_Retract_Count 0x0032   253   253   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   253   253   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0032   044   253   000    Old_age   Always       -       48
195 Hardware_ECC_Recovered  0x000a   253   212   000    Old_age   Always       -       22159
196 Reallocated_Event_Count 0x0008   253   253   000    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0008   253   249   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0008   253   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0008   199   199   000    Old_age   Offline      -       0
200 Multi_Zone_Error_Rate   0x000a   253   252   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   252   000    Old_age   Always       -       0
202 Data_Address_Mark_Errs  0x000a   253   233   000    Old_age   Always       -       0
203 Run_Out_Cancel          0x000b   253   252   180    Pre-fail  Always       -       1
204 Soft_ECC_Correction     0x000a   253   252   000    Old_age   Always       -       0
205 Thermal_Asperity_Rate   0x000a   253   252   000    Old_age   Always       -       0
207 Spin_High_Current       0x002a   253   209   000    Old_age   Always       -       0
208 Spin_Buzz               0x002a   253   252   000    Old_age   Always       -       0
209 Offline_Seek_Performnce 0x0024   240   240   000    Old_age   Offline      -       161
210 Unknown_Attribute       0x0032   253   252   000    Old_age   Always       -       0
211 Unknown_Attribute       0x0032   253   252   000    Old_age   Always       -       0
212 Unknown_Attribute       0x0032   253   253   000    Old_age   Always       -       0

Read SMART Log Directory failed: CSMI(25) failed with ReturnCode=1

SMART Error Log Version: 1
ATA Error Count: 368 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 368 occurred at disk power-on lifetime: 11897 hours (495 days + 17 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  78 00 01 10 00 00 40

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  00 00 01 10 00 00 40 00      02:43:33.823  NOP [Abort queued commands]
  00 80 40 00 19 d0 40 00      02:43:32.365  NOP [Reserved subcommand] [OBS-ACS-2]
  00 80 38 80 18 d0 40 00      02:43:32.363  NOP [Reserved subcommand] [OBS-ACS-2]
  00 80 38 80 18 d0 40 00      02:43:32.363  NOP [Reserved subcommand] [OBS-ACS-2]
  00 80 30 00 18 d0 40 00      02:43:32.361  NOP [Reserved subcommand] [OBS-ACS-2]

Error 367 occurred at disk power-on lifetime: 11897 hours (495 days + 17 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  78 00 01 10 00 00 40

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  00 00 01 10 00 00 40 00      02:43:30.598  NOP [Abort queued commands]
  00 80 98 00 ff cf 40 00      02:43:26.799  NOP [Reserved subcommand] [OBS-ACS-2]
  00 80 90 80 fe cf 40 00      02:43:26.798  NOP [Reserved subcommand] [OBS-ACS-2]
  00 80 90 80 fe cf 40 00      02:43:26.797  NOP [Reserved subcommand] [OBS-ACS-2]
  00 80 88 00 fe cf 40 00      02:43:26.796  NOP [Reserved subcommand] [OBS-ACS-2]

Error 366 occurred at disk power-on lifetime: 11897 hours (495 days + 17 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  78 00 01 10 00 00 40

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  00 00 01 10 00 00 40 00      02:43:23.357  NOP [Abort queued commands]
  00 80 40 00 fa cf 40 00      02:43:21.905  NOP [Reserved subcommand] [OBS-ACS-2]
  00 80 38 80 f9 cf 40 00      02:43:21.904  NOP [Reserved subcommand] [OBS-ACS-2]
  00 80 38 80 f9 cf 40 00      02:43:21.903  NOP [Reserved subcommand] [OBS-ACS-2]
  00 80 30 00 f9 cf 40 00      02:43:21.902  NOP [Reserved subcommand] [OBS-ACS-2]

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     25973         -
# 2  Short offline       Completed: read failure       60%      8088         3114240

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

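Interestingly, the log originally showed only one short test, and it had a read failure. Manually running another test showed no failure. I analyzed the other drives as well, and none of them show any pre-fail or error conditions, so something is off here.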

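Does anyone with more insight into this think the information points to a failing disk, or to something else? If it is failing, what is the repair process actually doing? And after a repair, is the array really running as RAID 10, with proper striping + mirroring?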

Answer 1

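That Intel software is terrible. I can confirm this from my own experience using it with RAID5. To check properly, get HDTune and run a drive check on each drive. You can also use it to read the relevant SMART information directly (any warnings will be flagged in yellow in the SMART status). In the SMART status, make sure to check for any reallocated sectors. If there are none, and no bad sectors either, the drive is fine.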
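
The reallocated-sector check described above can also be run against the smartctl text already posted in the question. What follows is only a rough sketch: the names `parse_attributes`, `sector_warnings` and `WATCHED` are made up for illustration, the choice of which attributes to watch is an assumption, and the sample rows are copied from the question's output with the console line wrapping undone.

```python
import re

# Attribute rows copied from the smartctl output in the question,
# with the 80-column console wrapping undone.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   140   140   063    Pre-fail  Always       -       1149
196 Reallocated_Event_Count 0x0008   253   253   000    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0008   253   249   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0008   253   252   000    Old_age   Offline      -       0
"""

# Assumption: these are the counters that matter for "bad sector" trouble.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

# ID, name, flag, VALUE, WORST, THRESH, type, updated, when_failed, raw value
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(\S+)\s*$"
)


def parse_attributes(text):
    """Return {name: {"value", "worst", "thresh", "raw"}} for each attribute row."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            _id, name, value, worst, thresh, raw = m.groups()
            attrs[name] = {
                "value": int(value),
                "worst": int(worst),
                "thresh": int(thresh),
                "raw": raw,  # kept as text: some raw values look like "851h+19m"
            }
    return attrs


def sector_warnings(attrs):
    """List the watched attributes whose raw counter is above zero."""
    found = []
    for name in WATCHED:
        raw = attrs.get(name, {}).get("raw", "0")
        if raw.isdigit() and int(raw) > 0:
            found.append((name, int(raw)))
    return found


if __name__ == "__main__":
    for name, count in sector_warnings(parse_attributes(SAMPLE)):
        print(f"{name}: raw value {count}")
```

On the rows from the question, this flags only Reallocated_Sector_Ct, with a raw value of 1149. Its normalized value (140) is still above the threshold (063), which is consistent with the overall-health result reading PASSED even though sectors have been reallocated.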
