RAID1 rebuild fails due to disk errors

2024-5-29

Quick info: Dell R410 with 2x500GB drives in RAID1 on an H700 adapter.

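Recently one of the drives in the server's RAID1 array failed; let's call it drive 0. The RAID controller marked it as failed and took it offline. I replaced the failed disk with a new one (same series and manufacturer, just larger) and configured the new disk as a hot spare.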

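The rebuild from drive 1 started immediately, and 1.5 hours later I got a message that drive 1 had failed. The server became unresponsive (kernel panic) and had to be rebooted. Since the rebuild was at about 40% half an hour before this error, I assumed the new drive was not yet in sync and tried to boot from drive 1 alone.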

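The RAID controller complained that the RAID array was missing, but it found a foreign RAID configuration on drive 1, which I imported. The server booted and is now running (from the degraded RAID).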

Here is the SMART data for the disks. Drive 0 (the one that failed first):

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    1
  3 Spin_Up_Time            POS--K   142   142   021    -    3866
  4 Start_Stop_Count        -O--CK   100   100   000    -    12
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   086   086   000    -    10432
 10 Spin_Retry_Count        -O--CK   100   253   000    -    0
 11 Calibration_Retry_Count -O--CK   100   253   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    11
192 Power-Off_Retract_Count -O--CK   200   200   000    -    10
193 Load_Cycle_Count        -O--CK   200   200   000    -    1
194 Temperature_Celsius     -O---K   112   106   000    -    31
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   200   200   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   200   198   000    -    3

And drive 1 (which the controller reported as healthy until the rebuild was attempted):

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    35
  3 Spin_Up_Time            POS--K   143   143   021    -    3841
  4 Start_Stop_Count        -O--CK   100   100   000    -    12
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   086   086   000    -    10455
 10 Spin_Retry_Count        -O--CK   100   253   000    -    0
 11 Calibration_Retry_Count -O--CK   100   253   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    11
192 Power-Off_Retract_Count -O--CK   200   200   000    -    10
193 Load_Cycle_Count        -O--CK   200   200   000    -    1
194 Temperature_Celsius     -O---K   114   105   000    -    29
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    3
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   100   253   000    -    0
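
To compare the two attribute tables without reading them row by row, a small script can flag the raw values that usually matter. The following is a minimal sketch, not a diagnostic tool: the function names and the `WATCHED` set are hypothetical choices for illustration, and it assumes rows in the eight-column layout shown above.

```python
# Attributes whose non-zero raw value usually deserves a closer look.
# This selection is an assumption for illustration, not a vendor rule.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def parse_smart_table(text):
    """Return {attribute_name: raw_value} from rows laid out as
    ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID and have eight columns.
        if len(fields) >= 8 and fields[0].isdigit():
            attrs[fields[1]] = int(fields[7])
    return attrs


def suspicious(attrs):
    """Keep only watched attributes with a non-zero raw value."""
    return {name: raw for name, raw in attrs.items()
            if name in WATCHED and raw > 0}


# Three rows copied from the drive 1 table above.
drive1_rows = """\
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    3
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
"""

flagged = suspicious(parse_smart_table(drive1_rows))
print(flagged)
```

On the drive 1 rows this flags only Current_Pending_Sector, whose raw value is 3; the same rows for drive 0 would flag nothing.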

In the SMART extended error log I found:

Drive 0 has only one error:

Error 1 [0] occurred at disk power-on lifetime: 10282 hours (428 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  10 -- 51 00 18 00 00 00 6a 24 20 40 00  Error: IDNF at LBA = 0x006a2420 = 6956064

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 60 00 f8 00 00 00 6a 24 20 40 00 17d+20:25:18.105  WRITE FPDMA QUEUED
  61 00 18 00 60 00 00 00 6a 24 00 40 00 17d+20:25:18.105  WRITE FPDMA QUEUED
  61 00 80 00 58 00 00 00 6a 23 80 40 00 17d+20:25:18.105  WRITE FPDMA QUEUED
  61 00 68 00 50 00 00 00 6a 23 18 40 00 17d+20:25:18.105  WRITE FPDMA QUEUED
  61 00 10 00 10 00 00 00 6a 23 00 40 00 17d+20:25:18.104  WRITE FPDMA QUEUED

But drive 1 has 883 errors. I can only see the last few, and all the ones I can see look like this:

Error 883 [18] occurred at disk power-on lifetime: 10454 hours (435 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  01 -- 51 00 80 00 00 39 97 19 c2 40 00  Error: AMNF at LBA = 0x399719c2 = 966203842

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 80 00 00 00 00 39 97 19 80 40 00  1d+00:25:57.802  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 00  1d+00:25:57.779  READ LOG EXT
  60 00 80 00 00 00 00 39 97 19 80 40 00  1d+00:25:55.704  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 00  1d+00:25:55.681  READ LOG EXT
  60 00 80 00 00 00 00 39 97 19 80 40 00  1d+00:25:53.606  READ FPDMA QUEUED
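
The LBA in a log entry can be checked by hand from the register dump. The sketch below is an illustration under two assumptions: the drive uses 512-byte logical sectors, and for FPDMA (NCQ) commands the requested sector count is the value listed under FEATR. The helper name is hypothetical.

```python
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors


def lba_from_bytes(hex_bytes):
    """Join the six LBA_48 register bytes (most significant first)."""
    return int("".join(hex_bytes.split()), 16)


# LBA_48 columns of the "registers were" row of Error 883.
error_lba = lba_from_bytes("00 00 39 97 19 c2")

# LBA_48 columns of the READ FPDMA QUEUED row that preceded it.
read_start = lba_from_bytes("00 00 39 97 19 80")
read_count = 0x80  # FEATR value, taken here as the sector count (assumption)

# Position of the failing sector inside the requested range.
offset_in_read = error_lba - read_start
in_range = read_start <= error_lba < read_start + read_count

# Byte position on the disk, useful for judging where the damage sits.
byte_offset = error_lba * SECTOR_BYTES

print(error_lba, offset_in_read, in_range, byte_offset)
```

Under these assumptions the failing sector lies inside the 128-sector read that was being retried, and its byte offset puts it close to the end of a 500GB disk.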

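Given these errors, is there any way I can rebuild the RAID, or should I take a backup, shut down the server, replace the disks with new ones, and restore? What if I dd the failing disk onto the new disk from a Linux system booted from USB/CD?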

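Also, if anyone has more experience: what could have caused these errors? Is the controller at fault, or the disks? The disks are about 1 year old, and I find it hard to believe that both would fail within such a short time.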

Answer 1

Actually, if the two disks came from the same manufacturing batch, it is not surprising that they failed at around the same time.

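They share the same manufacturing process, environment, and usage pattern. That is why I usually try to order drives of the same model from different vendors.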

My preferred approach would be to contact the manufacturer, get better replacement disks, and then restore from backup.

Using dd is fine too, but I usually need to get the service back up as quickly as possible.
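
For readers unfamiliar with what a dd-style copy of a failing disk involves, here is a minimal Python sketch of the idea: copy block by block and write zeros wherever a block cannot be read, similar in spirit to running dd with conv=noerror,sync. The function name, block size, and demo file names are hypothetical, it relies on `os.pread` (available on Unix-like systems), and it sizes the source with `os.path.getsize`, which works for image files but not for raw block devices. It is an illustration, not a replacement for a proper recovery tool.

```python
import os
import tempfile


def copy_skipping_errors(src_path, dst_path, block_size=64 * 1024):
    """Copy src to dst block by block; unreadable blocks become zeros.

    Returns the byte offsets of the blocks that could not be read.
    """
    bad_offsets = []
    size = os.path.getsize(src_path)  # regular files only (assumption)
    src = os.open(src_path, os.O_RDONLY)
    try:
        with open(dst_path, "wb") as dst:
            offset = 0
            while offset < size:
                want = min(block_size, size - offset)
                try:
                    data = os.pread(src, want, offset)
                except OSError:
                    # Read failed: keep the layout intact by padding with zeros.
                    data = b"\0" * want
                    bad_offsets.append(offset)
                if not data:
                    break  # unexpected end of input
                dst.write(data)
                offset += len(data)
    finally:
        os.close(src)
    return bad_offsets


# Demo on an ordinary file, where no read errors are expected.
with tempfile.TemporaryDirectory() as tmp:
    src_file = os.path.join(tmp, "disk.img")
    dst_file = os.path.join(tmp, "copy.img")
    with open(src_file, "wb") as f:
        f.write(os.urandom(200_000))
    bad = copy_skipping_errors(src_file, dst_file)
    with open(src_file, "rb") as a, open(dst_file, "rb") as b:
        identical = a.read() == b.read()

print(bad, identical)
```

Padding unreadable blocks rather than skipping them keeps every later byte at its original offset, which matters when the copy is meant to stand in for the original disk.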

Back in the days of the IBM Deskstar fiasco, I had an entire set of 8 disks fail within 6 weeks after 4 years of use. I barely managed to save the data.
