HP P840 HDD RAID 5 with many strange drive failures


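I have been running a RAID 5 HDD array (8x6TB) on an HP P840 for about 2 years, and it has always had an unusually high number of drive failures. For half a year everything was fine, but now drives are failing in strange ways. For example, 2 new drives failed a few days after being added to the RAID. I have already replaced the RAID controller, and both the motherboard and the RAID controller are on the latest firmware.

I have also tried different drives. The RAID originally used HGST DeskStar 6TB drives; now I replace each failed drive with an HGST UltraStar 6TB. The behavior is the same.

It also looks like (most of) the drives did not really fail: as soon as I replaced the RAID controller, one failed drive was recognized as OK again and started rebuilding.

My hosting provider's support says the problem is that I am using RAID 5 at all, and that I should switch to RAID 10. I find that hard to believe, because I run RAID 5 on other systems without problems (no drive failures for years).

Can anyone give me a hint as to what the culprit might be? Is something wrong with how the RAID controller is configured?

Thank you!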

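Edit:
The server is an HP DL180 G9.
The drive failure reason is always "Write retries failed".

Update: Our hosting provider advised us to replace the hardware completely and switch to RAID 6. We did, and it has been running smoothly for a while now. It was never really investigated, but shodanshok's explanation about a punctured array seems plausible to me, so I will accept that answer. Thanks, everyone!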

  Smart Array P840 in Slot 1                (sn: PDNNF0ARH321GD)


     Port Name: 1I

     Port Name: 2I

     Internal Drive Cage at Port 1I, Box 2, OK

     Internal Drive Cage at Port 1I, Box 2, OK

     Internal Drive Cage at Port 2I, Box 1, OK
     array A (Solid State SATA, Unused Space: 0  MB)


  logicaldrive 1 (447.1 GB, RAID 1+0, OK)

  physicaldrive 2I:1:1 (port 2I:box 1:bay 1, Solid State SATA, 240.0 GB, OK)
  physicaldrive 2I:1:2 (port 2I:box 1:bay 2, Solid State SATA, 240.0 GB, OK)
  physicaldrive 2I:1:3 (port 2I:box 1:bay 3, Solid State SATA, 240.0 GB, OK)
  physicaldrive 2I:1:4 (port 2I:box 1:bay 4, Solid State SATA, 240.0 GB, OK)

     array B (SATA, Unused Space: 0  MB)


  logicaldrive 2 (38.2 TB, RAID 5, Interim Recovery Mode)

  physicaldrive 1I:2:1 (port 1I:box 2:bay 1, SATA, 6001.1 GB, OK)
  physicaldrive 1I:2:2 (port 1I:box 2:bay 2, SATA, 6001.1 GB, OK)
  physicaldrive 1I:2:3 (port 1I:box 2:bay 3, SATA, 6001.1 GB, OK)
  physicaldrive 1I:2:4 (port 1I:box 2:bay 4, SATA, 6001.1 GB, OK)
  physicaldrive 1I:2:5 (port 1I:box 2:bay 5, SATA, 6001.1 GB, Failed)
  physicaldrive 1I:2:6 (port 1I:box 2:bay 6, SATA, 6001.1 GB, OK)
  physicaldrive 1I:2:7 (port 1I:box 2:bay 7, SATA, 6001.1 GB, OK)
  physicaldrive 1I:2:8 (port 1I:box 2:bay 8, SATA, 6001.1 GB, OK)

Details:

     Smart Array P840 in Slot 1
        Bus Interface: PCI
        Slot: 1
        Serial Number: PDNNF0ARH321GD
        Cache Serial Number: PEYFP0BRH323YZ
        RAID 6 (ADG) Status: Enabled
        Controller Status: OK
        Hardware Revision: B
        Firmware Version: 6.60
        Rebuild Priority: High
        Expand Priority: Medium
        Surface Scan Delay: 3 secs
        Surface Scan Mode: Idle
        Parallel Surface Scan Supported: Yes
        Current Parallel Surface Scan Count: 1
        Max Parallel Surface Scan Count: 16
        Queue Depth: Automatic
        Monitor and Performance Delay: 60  min
        Elevator Sort: Enabled
        Degraded Performance Optimization: Disabled
        Inconsistency Repair Policy: Disabled
        Wait for Cache Room: Disabled
        Surface Analysis Inconsistency Notification: Disabled
        Post Prompt Timeout: 15 secs
        Cache Board Present: True
     Cache Status: OK
     Cache Ratio: 10% Read / 90% Write
     Drive Write Cache: Enabled
     Total Cache Size: 4.0 GB
     Total Cache Memory Available: 3.2 GB
     No-Battery Write Cache: Enabled
     SSD Caching RAID5 WriteBack Enabled: True
     SSD Caching Version: 2
     Cache Backup Power Source: Batteries
     Battery/Capacitor Count: 1
     Battery/Capacitor Status: OK
     SATA NCQ Supported: True
     Spare Activation Mode: Activate on physical drive failure (default)
     Controller Temperature (C): 51
     Cache Module Temperature (C): 38
     Number of Ports: 2 Internal only
     Encryption: Disabled
     Express Local Encryption: False
     Driver Name: hpsa
     Driver Version: 3.4.16
     Driver Supports HP SSD Smart Path: True
     PCI Address (Domain:Bus:Device.Function): 0000:06:00.0
     Negotiated PCIe Data Rate: PCIe 3.0 x8 (7880 MB/s)
     Controller Mode: RAID
     Controller Mode Reboot: Not Required
     Latency Scheduler Setting: Disabled
     Current Power Mode: MaxPerformance
     Host Serial Number: CZ270500GM
     Sanitize Erase Supported: False
     Primary Boot Volume: logicaldrive 1 (600508B1001CE0F9FACF3A1358647115)
     Secondary Boot Volume: logicaldrive 1 (600508B1001CE0F9FACF3A1358647115)


     Port Name: 1I
           Port ID: 0
           Port Connection Number: 0
           SAS Address: 5001438038AD05A0
           Port Location: Internal
           Managed Cable Connected: False

     Port Name: 2I
           Port ID: 1
           Port Connection Number: 1
           SAS Address: 5001438038AD05A8
           Port Location: Internal
           Managed Cable Connected: False

     Internal Drive Cage at Port 1I, Box 2, OK
        Power Supply Status: Not Redundant
        Drive Bays: 4
        Port: 1I
        Box: 2
        Location: Internal

     Physical Drives
        physicaldrive 1I:2:1 (port 1I:box 2:bay 1, SATA, 6001.1 GB, OK)
        physicaldrive 1I:2:2 (port 1I:box 2:bay 2, SATA, 6001.1 GB, OK)
        physicaldrive 1I:2:3 (port 1I:box 2:bay 3, SATA, 6001.1 GB, OK)
        physicaldrive 1I:2:4 (port 1I:box 2:bay 4, SATA, 6001.1 GB, OK)
        None attached


     Internal Drive Cage at Port 1I, Box 2, OK
        Power Supply Status: Not Redundant
        Drive Bays: 4
        Port: 1I
        Box: 2
        Location: Internal

     Physical Drives
        physicaldrive 1I:2:1 (port 1I:box 2:bay 1, SATA, 6001.1 GB, OK)
        physicaldrive 1I:2:2 (port 1I:box 2:bay 2, SATA, 6001.1 GB, OK)
        physicaldrive 1I:2:3 (port 1I:box 2:bay 3, SATA, 6001.1 GB, OK)
        physicaldrive 1I:2:4 (port 1I:box 2:bay 4, SATA, 6001.1 GB, OK)
        None attached


     Internal Drive Cage at Port 2I, Box 1, OK
        Power Supply Status: Not Redundant
        Drive Bays: 4
        Port: 2I
        Box: 1
        Location: Internal

     Physical Drives
        physicaldrive 2I:1:1 (port 2I:box 1:bay 1, Solid State SATA, 240.0 GB, OK)
        physicaldrive 2I:1:2 (port 2I:box 1:bay 2, Solid State SATA, 240.0 GB, OK)
        physicaldrive 2I:1:3 (port 2I:box 1:bay 3, Solid State SATA, 240.0 GB, OK)
        physicaldrive 2I:1:4 (port 2I:box 1:bay 4, Solid State SATA, 240.0 GB, OK)
        None attached

     Array: A
        Interface Type: Solid State SATA
        Unused Space: 0  MB (0.0%)
        Used Space: 894.2 GB (100.0%)
        Status: OK
        MultiDomain Status: OK
        Array Type: Data
        HP SSD Smart Path: disable



  Logical Drive: 1
     Size: 447.1 GB
     Fault Tolerance: 1+0
     Heads: 255
     Sectors Per Track: 32
     Cylinders: 65535
     Strip Size: 256 KB
     Full Stripe Size: 512 KB
     Status: OK
     MultiDomain Status: OK
     Caching:  Enabled
     Unique Identifier: 600508B1001CE0F9FACF3A1358647115
     Disk Name: /dev/sda
     Mount Points: / 18.6 GB Partition Number 2
     OS Status: LOCKED
     Logical Drive Label: 0216D6F9PDNNF0ARH502MC7DFA
     Mirror Group 1:
        physicaldrive 2I:1:1 (port 2I:box 1:bay 1, Solid State SATA, 240.0 GB, OK)
        physicaldrive 2I:1:2 (port 2I:box 1:bay 2, Solid State SATA, 240.0 GB, OK)
     Mirror Group 2:
        physicaldrive 2I:1:3 (port 2I:box 1:bay 3, Solid State SATA, 240.0 GB, OK)
        physicaldrive 2I:1:4 (port 2I:box 1:bay 4, Solid State SATA, 240.0 GB, OK)
     Drive Type: Data
     LD Acceleration Method: Controller Cache

  physicaldrive 2I:1:1
     Port: 2I
     Box: 1
     Bay: 1
     Status: OK
     Drive Type: Data Drive
     Interface Type: Solid State SATA
     Size: 240.0 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Firmware Revision: N2010101
     Serial Number: PHDV712004AG240AGN
     Model: ATA     INTEL SSDSC2BB24
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 31
     Maximum Temperature (C): 39
     SSD Smart Trip Wearout: Not Supported
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False

  physicaldrive 2I:1:2
     Port: 2I
     Box: 1
     Bay: 2
     Status: OK
     Drive Type: Data Drive
     Interface Type: Solid State SATA
     Size: 240.0 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Firmware Revision: N2010101
     Serial Number: PHDV706303CH240AGN
     Model: ATA     INTEL SSDSC2BB24
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 29
     Maximum Temperature (C): 36
     SSD Smart Trip Wearout: Not Supported
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False

  physicaldrive 2I:1:3
     Port: 2I
     Box: 1
     Bay: 3
     Status: OK
     Drive Type: Data Drive
     Interface Type: Solid State SATA
     Size: 240.0 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Firmware Revision: N2010101
     Serial Number: PHDV712003V8240AGN
     Model: ATA     INTEL SSDSC2BB24
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 29
     Maximum Temperature (C): 35
     SSD Smart Trip Wearout: Not Supported
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False

  physicaldrive 2I:1:4
     Port: 2I
     Box: 1
     Bay: 4
     Status: OK
     Drive Type: Data Drive
     Interface Type: Solid State SATA
     Size: 240.0 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Firmware Revision: N2010101
     Serial Number: PHDV712004GA240AGN
     Model: ATA     INTEL SSDSC2BB24
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 31
     Maximum Temperature (C): 37
     SSD Smart Trip Wearout: Not Supported
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False


     Array: B
        Interface Type: SATA
        Unused Space: 0  MB (0.0%)
        Used Space: 43.7 TB (100.0%)
        Status: Failed Physical Drive
        MultiDomain Status: OK
        Array Type: Data
        HP SSD Smart Path: disable

        Warning: One of the drives on this array have failed or has been removed.




  Logical Drive: 2
     Size: 38.2 TB
     Fault Tolerance: 5
     Heads: 255
     Sectors Per Track: 32
     Cylinders: 65535
     Strip Size: 256 KB
     Full Stripe Size: 1792 KB
     Status: Interim Recovery Mode
     MultiDomain Status: OK
     Caching:  Enabled
     Parity Initialization Status: Initialization Failed
     Unique Identifier: 600508B1001CF94F84873C91FD89B549
     Disk Name: /dev/sdb
     Mount Points: None
     Logical Drive Label: 04DA1DD6PDNNF0ARH502MC546F
     Drive Type: Data
     LD Acceleration Method: Controller Cache

  physicaldrive 1I:2:1
     Port: 1I
     Box: 2
     Bay: 1
     Status: OK
     Drive Type: Data Drive
     Interface Type: SATA
     Size: 6001.1 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Rotational Speed: 7200
     Firmware Revision: APGNW7JH
     Serial Number: NAHN3UZY
     Model: ATA     HGST HDN726060AL
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 37
     Maximum Temperature (C): 43
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False

  physicaldrive 1I:2:2
     Port: 1I
     Box: 2
     Bay: 2
     Status: OK
     Drive Type: Data Drive
     Interface Type: SATA
     Size: 6001.1 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Rotational Speed: 7200
     Firmware Revision: APGNT517
     Serial Number: NAHLKP0X
     Model: ATA     HGST HDN726060AL
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 37
     Maximum Temperature (C): 56
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False

  physicaldrive 1I:2:3
     Port: 1I
     Box: 2
     Bay: 3
     Status: OK
     Drive Type: Data Drive
     Interface Type: SATA
     Size: 6001.1 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Rotational Speed: 7200
     Firmware Revision: T7MH
     Serial Number: NCH8E81Z
     Model: ATA     HUS726060ALE610
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 33
     Maximum Temperature (C): 41
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False

  physicaldrive 1I:2:4
     Port: 1I
     Box: 2
     Bay: 4
     Status: OK
     Drive Type: Data Drive
     Interface Type: SATA
     Size: 6001.1 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Rotational Speed: 7200
     Firmware Revision: APGNW7JH
     Serial Number: NAHYMAUY
     Model: ATA     HGST HDN726060AL
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 34
     Maximum Temperature (C): 41
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False

  physicaldrive 1I:2:5
     Port: 1I
     Box: 2
     Bay: 5
     Status: Failed
     Last Failure Reason: Write retries failed
     Drive Type: Data Drive
     Interface Type: SATA
     Size: 6001.1 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Rotational Speed: 7200
     Firmware Revision: T7MH
     Serial Number: K1H942MD
     Model: ATA     HUS726060ALE610
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Maximum Temperature (C): 43
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Applicable
     Sanitize Erase Supported: False

  physicaldrive 1I:2:6
     Port: 1I
     Box: 2
     Bay: 6
     Status: OK
     Drive Type: Data Drive
     Interface Type: SATA
     Size: 6001.1 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Rotational Speed: 7200
     Firmware Revision: TDR2
     Serial Number: K8JM5TKN
     Model: ATA     HUS726060ALE610
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 33
     Maximum Temperature (C): 38
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False

  physicaldrive 1I:2:7
     Port: 1I
     Box: 2
     Bay: 7
     Status: OK
     Drive Type: Data Drive
     Interface Type: SATA
     Size: 6001.1 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Rotational Speed: 7200
     Firmware Revision: APGNW7JH
     Serial Number: K8H9BW2N
     Model: ATA     HGST HDN726060AL
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 34
     Maximum Temperature (C): 39
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False

  physicaldrive 1I:2:8
     Port: 1I
     Box: 2
     Bay: 8
     Status: OK
     Drive Type: Data Drive
     Interface Type: SATA
     Size: 6001.1 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Rotational Speed: 7200
     Firmware Revision: T7MH
     Serial Number: K1H623JD
     Model: ATA     HUS726060ALE610
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 35
     Maximum Temperature (C): 40
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False

Answer 1

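You probably have a badly punctured array, which causes replacement disks to be "scheduled to die" early because stripe rebuilds fail. You can read more here and here.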
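To illustrate why a punctured stripe blocks a rebuild, here is a minimal sketch of RAID 5 style XOR parity. It is a simplified model, not the Smart Array firmware's actual logic; the names xor_blocks and rebuild_strip and the tiny two-byte strips are made up for the example. An unreadable strip on a surviving disk is modelled as None.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equally sized byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_strip(stripe, missing):
    """Rebuild the strip at index `missing` from the other strips.

    Returns None when any surviving strip is unreadable (None), because
    single parity cannot recover two missing strips in one stripe.
    """
    survivors = [s for i, s in enumerate(stripe) if i != missing]
    if any(s is None for s in survivors):
        return None
    return xor_blocks(survivors)

# Three data strips plus one parity strip (hypothetical toy values).
data = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
parity = xor_blocks(data)
stripe = data + [parity]

# Healthy stripe: the strip on the replaced disk (index 1) is recoverable.
print(rebuild_strip(stripe, 1) == data[1])  # True

# Punctured stripe: disk 0 also has an unreadable strip, so the rebuild fails.
damaged = [None] + stripe[1:]
print(rebuild_strip(damaged, 1))  # None
```

In this model the replacement disk itself is fine; the rebuild fails because of damage that already exists elsewhere in the stripe, which is consistent with new drives "failing" shortly after being added.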

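The solution is to back up, destroy the array, recreate it, and restore from the backup.

Next time, avoid RAID 5 arrays with drives this large. I strongly recommend RAID 6 or, better, RAID 10.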
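For a rough sense of the capacity trade-off, the sketch below computes usable space for the 8 x 6001.1 GB drives in the question under each RAID level. It assumes the controller reports drive size in decimal GB and logical drive size in binary units, which matches the 38.2 TB shown for the existing RAID 5 logical drive; the function name usable_tib is made up for the example.

```python
DRIVE_GB = 6001.1  # per-drive size as reported in the question (decimal GB)
DRIVES = 8

def usable_tib(level, drives=DRIVES, drive_gb=DRIVE_GB):
    """Usable capacity in TiB for RAID 5, 6, or 10 on equally sized drives."""
    data_drives = {
        "5": drives - 1,    # one drive's worth of parity
        "6": drives - 2,    # two drives' worth of parity
        "10": drives // 2,  # every drive is mirrored
    }[level]
    return data_drives * drive_gb * 1e9 / 2**40

for level in ("5", "6", "10"):
    print(f"RAID {level}: {usable_tib(level):.1f} TiB usable")
```

Under these assumptions RAID 6 gives up one more drive's worth of space than RAID 5 in exchange for tolerating a second failure or unreadable strip during a rebuild, and RAID 10 gives up half of the raw capacity.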

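Answer 2

With the size and type of disks in your system, you should be using RAID 6. However, there is nothing inherently wrong with running RAID 5 on an HP Smart Array RAID controller. I think your problems come from using consumer disks in a setup that is not certified server hardware.

Some details about the server would help, though.

Is this an HPE server, or are you just using an HPE controller?

These do not look like HPE drives or HPE drive carriers. That is a bad sign.

The hpssacli output you provided also shows the reason for the disk failure. If you are not on an HPE server and there are backplane issues or SATA timeouts (I notice you are on SATA disks), you may get false positives.

Example (see the Last Failure Reason line):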

  physicaldrive 2I:2:8
     Port: 2I
     Box: 2
     Bay: 8
     Status: Failed
     Last Failure Reason: Aborted Command
     Drive Type: Data Drive
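If you want to pull these lines out of a long report automatically, a small parser along the following lines may help. It is only a sketch that assumes the text layout shown in this thread (a bare "physicaldrive <id>" header followed by indented "Key: Value" lines, with a blank line ending each block); the function name failed_drives and the sample text are made up for the example.

```python
import re

def failed_drives(report):
    """Return one dict per detailed physical drive block whose Status is not OK."""
    drives = []
    current = None
    for raw in report.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"physicaldrive (\S+)", line)
        if header:
            # Start of a detail block; one-line summary entries do not match.
            current = {"id": header.group(1)}
            drives.append(current)
        elif not line:
            current = None  # a blank line ends the current block
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    return [d for d in drives if "Status" in d and d["Status"] != "OK"]

sample = """\
  physicaldrive 1I:2:5
     Port: 1I
     Box: 2
     Bay: 5
     Status: Failed
     Last Failure Reason: Write retries failed

  physicaldrive 1I:2:6
     Port: 1I
     Status: OK
"""

for drive in failed_drives(sample):
    print(drive["id"], drive["Status"], drive.get("Last Failure Reason", "unknown"))
```

Running it against saved controller output over time would show whether the failure reason stays the same across bays, which points away from individual disks and toward something shared such as the backplane, cabling, or the array itself.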
