Many drives in two RAID6 arrays failed simultaneously; after a reboot they seem to work, except for the SMART long test

2024-6-2

My storage server runs three Linux software RAID arrays. Everything was running fine until it wasn't.

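There are two RAID6 arrays and one RAID5 array, all built from SATA drives and all attached to an HBA9500-16i controller. Suddenly, multiple drives in one RAID6 array and in the RAID5 array started to log the following: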

May 15 01:20:07 xxxstor kernel: [42205.209000] mpt3sas_cm0: log_info(0x3112010c): originator(PL), code(0x12), sub_code(0x010c)
May 15 01:20:07 xxxstor kernel: [42205.309428] sd 8:0:6:0: Power-on or device reset occurred
May 15 01:20:19 xxxstor kernel: [42217.044287] sd 8:0:8:0: [sdk] tag#1591 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
May 15 01:20:19 xxxstor kernel: [42217.044294] sd 8:0:8:0: [sdk] tag#1591 CDB: Read(16) 88 00 00 00 00 01 47 85 00 58 00 00 00 08 00 00
May 15 01:20:19 xxxstor kernel: [42217.044297] print_req_error: I/O error, dev sdk, sector 5494866008
May 15 01:20:19 xxxstor kernel: [42217.044361] mpt3sas_cm0: log_info(0x3112010c): originator(PL), code(0x12), sub_code(0x010c)
May 15 01:20:19 xxxstor kernel: [42217.055768] sd 8:0:8:0: Power-on or device reset occurred
May 15 01:20:20 xxxstor kernel: [42217.758365] mpt3sas_cm0: log_info(0x3112010c): originator(PL), code(0x12), sub_code(0x010c)
May 15 01:20:20 xxxstor kernel: [42217.825959] sd 8:0:8:0: Power-on or device reset occurred
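
The `log_info` word in the lines above can be split into its fields by hand. The following is a minimal sketch, assuming the bit layout that matches what the driver prints here (bus type in the top 4 bits, then a 4-bit originator, an 8-bit code and a 16-bit sub-code). The function name `decode_mpt3sas_log_info` is made up for illustration, and the sketch does not attempt to interpret what code 0x12 or sub-code 0x010c mean.

```python
def decode_mpt3sas_log_info(log_info: int) -> dict:
    """Split a 32-bit mpt3sas log_info word into its bit fields."""
    # Originator names as printed by the driver: IOP, PL (protocol layer), IR.
    originators = {0: "IOP", 1: "PL", 2: "IR"}
    originator = (log_info >> 24) & 0xF
    return {
        "bus_type": (log_info >> 28) & 0xF,
        "originator": originators.get(originator, f"unknown({originator})"),
        "code": (log_info >> 16) & 0xFF,
        "sub_code": log_info & 0xFFFF,
    }


fields = decode_mpt3sas_log_info(0x3112010C)
print(fields["originator"], hex(fields["code"]), hex(fields["sub_code"]))
```

For 0x3112010c this yields originator PL, code 0x12 and sub-code 0x010c, the same values the kernel prints next to it, so every reset in the excerpt was reported by the same layer of the controller firmware with the same code.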

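After that, several drives in these arrays were marked as failed, and the automatic replacement with spare drives kicked in. However, the newly activated spares also started to show I/O errors and were marked as failed, and the recovery stopped. By the time I found out in the morning, most of the drives were marked as failed and the arrays appeared to be unrecoverable. The failed drives show various errors in their SMART logs: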

Error 503 occurred at disk power-on lifetime: 22577 hours (940 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 10 00 ac 3c 40 00      19:33:18.999  WRITE FPDMA QUEUED
  2f 00 01 10 00 00 00 00      19:33:18.999  READ LOG EXT
  61 00 30 00 a4 3c 40 00      19:33:18.996  WRITE FPDMA QUEUED
  61 00 28 00 bc 3c 40 00      19:33:18.994  WRITE FPDMA QUEUED
  61 00 20 00 a0 3c 40 00      19:33:18.994  WRITE FPDMA QUEUED

Or, in the extended output:

0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            0  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            1  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS

The SMART log of another drive reads:

Error 2 occurred at disk power-on lifetime: 19503 hours (812 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 80 00 80 80 00 40 00      18:18:52.230  WRITE FPDMA QUEUED
  2f 00 01 10 00 00 00 00      18:18:52.230  READ LOG EXT
  61 80 08 80 d6 6e 40 00      18:18:52.230  WRITE FPDMA QUEUED
  ef 10 02 00 00 00 00 00      18:18:52.227  SET FEATURES [Enable SATA feature]
  ef 02 00 00 00 00 00 00      18:18:52.224  SET FEATURES [Enable write cache]

The corresponding extended log looks similar to the previous one.
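
Both error entries above report `ER = 84`. As a small sketch of how that byte maps to the `ICRC, ABRT` text that smartctl prints, the snippet below decodes the ATA error register bit by bit. The bit names are the commonly used ATA abbreviations; some bits have command-dependent meanings in the standard, so treat the table as an approximation. The helper name `decode_ata_error_register` is hypothetical.

```python
from typing import List

# Bit position -> conventional abbreviation in the ATA error register.
ATA_ERROR_BITS = {
    7: "ICRC",  # interface CRC error on the link
    6: "UNC",   # uncorrectable data error
    5: "MC",    # media changed
    4: "IDNF",  # ID not found
    3: "MCR",   # media change request
    2: "ABRT",  # command aborted
    1: "NM",    # no media
    0: "AMNF",  # address mark not found
}


def decode_ata_error_register(er: int) -> List[str]:
    """Return the names of all bits set in the error register value."""
    return [name for bit, name in ATA_ERROR_BITS.items() if er & (1 << bit)]


print(decode_ata_error_register(0x84))
```

0x84 has only bits 7 and 2 set, which gives ICRC and ABRT: a write was aborted because of a CRC error on the interface, not because of a media error such as UNC.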

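The only difference I can see between the drives that failed and the ones that did not is SMART attribute 199 (UDMA_CRC_Error_Count): it is non-zero on the failed drives and zero on the drives that are still healthy.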
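
A comparison like this can be scripted. The sketch below pulls the raw value of attribute 199 out of `smartctl -A` text, assuming the usual ten-column attribute table. The `SAMPLE` text and its value of 17 are invented for illustration and are not taken from the drives in question; the function names are hypothetical, and `read_attributes` requires smartmontools and sufficient privileges.

```python
import re
import subprocess
from typing import Optional


def crc_error_count(smartctl_attr_output: str) -> Optional[int]:
    """Return the raw value of SMART attribute 199, or None if it is absent."""
    for line in smartctl_attr_output.splitlines():
        fields = line.split()
        # Attribute rows have ten columns; the raw value is the last one.
        if len(fields) >= 10 and fields[0] == "199":
            match = re.match(r"\d+", fields[9])
            return int(match.group()) if match else None
    return None


def read_attributes(device: str) -> str:
    """Run `smartctl -A` on a device and return its text output."""
    # smartctl uses a bitmask exit status, so a non-zero code is not
    # treated as a failure here.
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return result.stdout


# Invented sample in the usual smartctl layout, for demonstration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   070   070   000    Old_age   Always       -       1000
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       17
"""

print(crc_error_count(SAMPLE))
```

Calling `crc_error_count(read_attributes(dev))` for each member device would give one number per drive, which makes it easy to see whether the non-zero counts cluster on one cable, port group or backplane.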

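After I rebooted the system (there was nothing else I could do with it), the failure flags on all drives were gone, and I was able to reassemble the arrays and start the automatic rebuild.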

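So my question is: did this improbable event really happen, with multiple drives failing at exactly the same time? Or is the HBA controller and/or the backplane faulty, causing so many drives to fail at once?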

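If the controller is faulty, can the drives still be trusted despite what their SMART logs show? Or should I just save the data and discard the drives?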

If the controller is bad, should I replace it, or should I try updating the firmware/BIOS of the controller card or the Linux driver?

I would be very grateful for any hints. The kernel version is 4.19.181 and the mpt3sas driver version is 35.00.00.00. Thanks.

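Edit: In the meantime, I realized that all the HDDs reporting SMART problems (UDMA_CRC, errors in the log, etc.) are on the backplane at the rear of the server. The drives on the front backplane are all fine, with no problems. The same HBA controller drives both backplanes.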
