HP DL380 G5 - Smart Array P400 - Linux randomly hangs with high load

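For the past 2-3 weeks my main server has been hanging for no apparent reason. Before that it had run for more than 4 months without any problem. Each time, a simple reboot clears the problem.

Current setup:

  • HP DL380 G5, 2 x Xeon 4C 3GHz, 16GB RAM, 6 x 146GB (RAID 0+1)
  • Slackware 14.0

I keep a PuTTY session open to the server, and when it hangs (roughly 1 to 3 times a day) I see a high load, above 60, and none of the network services (HTTP, DNS, SMTP, IMAP, POP3, etc.) respond. When I connect with PuTTY I can log in, but the prompt never appears; the same happens on the local console (keyboard + screen). I also see the green LEDs on the drives blinking in unison at about 0.5 Hz to 1 Hz (normally they blink faster and in random order).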

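I first suspected a DDoS attack or similar, so I added many fail2ban checks, TCP request limits on the external firewall, and so on. After that I checked the firmware versions (including the P400) and upgraded all firmware to the latest release, but the problem persisted. I also synced the root filesystem to another DL380 G5 (identical hardware, but with 4 x 450GB drives) to replace the server, and the same problem occurred again.

I checked with top, iostat, and iotop and still have no clue. When the load is high there is almost no CPU usage (top) and almost no disk activity (iostat).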
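
On Linux the load average counts processes in uninterruptible sleep (state D) as well as runnable ones, which is one way the load can climb while top reports an idle CPU. The following is only a minimal sketch, assuming a readable /proc, for listing D-state processes; the helper names parse_stat_line and list_blocked_tasks are made up for illustration and are not part of any existing tool.

```python
import os

def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left].strip())
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state

def list_blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited, or the file was unreadable
        if state == "D":
            blocked.append((pid, comm))
    return sorted(blocked)
```

Calling list_blocked_tasks() from a session that was already open before the hang would show which processes are stuck waiting, which may be more telling than the CPU and disk counters.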

Now I wonder whether there is a problem with the CCISS driver in the version I am using?

Here is some information that may be useful:

Controller details:

root@hyperion:~# hpacucli

=> ctrl all show status

Smart Array P400 in Slot 1
Controller Status: OK
Cache Status: OK
Battery/Capacitor Status: OK

=> ctrl all show detail

Smart Array P400 in Slot 1
Bus Interface: PCI
Slot: 1
Serial Number: P61620G9SVM38V
Cache Serial Number: PA2270H9SVI198
RAID 6 (ADG) Status: Enabled
Controller Status: OK
Hardware Revision: D
Firmware Version: 6.86
Rebuild Priority: Medium
Expand Priority: Medium
Surface Scan Delay: 15 secs
Surface Scan Mode: Idle
Wait for Cache Room: Disabled
Surface Analysis Inconsistency Notification: Disabled
Post Prompt Timeout: 0 secs
Cache Board Present: True
Cache Status: OK
Cache Ratio: 25% Read / 75% Write
Drive Write Cache: Disabled
Total Cache Size: 512 MB
Total Cache Memory Available: 464 MB
No-Battery Write Cache: Disabled
Cache Backup Power Source: Batteries
Battery/Capacitor Count: 1
Battery/Capacitor Status: OK
SATA NCQ Supported: True

=> ctrl all show config

Smart Array P400 in Slot 1 (sn: P61620G9SVM38V)

array A (SAS, Unused Space: 0 MB)


logicaldrive 1 (838.3 GB, RAID 1+0, OK)

physicaldrive 2I:1:1 (port 2I:box 1:bay 1, SAS, 450 GB, OK)
physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SAS, 450 GB, OK)
physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SAS, 450 GB, OK)
physicaldrive 2I:1:4 (port 2I:box 1:bay 4, SAS, 450 GB, OK)

Driver details:

root@hyperion:~# modinfo cciss
filename: /lib/modules/3.2.29/kernel/drivers/block/cciss.ko
license: GPL
version: 3.6.26
description: Driver for HP Smart Array Controllers
author: Hewlett-Packard Company
srcversion: D553A90CDE37829B37A9C27
alias: pci:v0000103Cd00003230sv0000103Csd0000323Dbc*sc*i*
alias: pci:v0000103Cd00003230sv0000103Csd00003237bc*sc*i*
alias: pci:v0000103Cd00003238sv0000103Csd00003215bc*sc*i*
alias: pci:v0000103Cd00003238sv0000103Csd00003214bc*sc*i*
alias: pci:v0000103Cd00003238sv0000103Csd00003213bc*sc*i*
alias: pci:v0000103Cd00003238sv0000103Csd00003212bc*sc*i*
alias: pci:v0000103Cd00003238sv0000103Csd00003211bc*sc*i*
alias: pci:v0000103Cd00003230sv0000103Csd00003235bc*sc*i*
alias: pci:v0000103Cd00003230sv0000103Csd00003234bc*sc*i*
alias: pci:v0000103Cd00003230sv0000103Csd00003223bc*sc*i*
alias: pci:v0000103Cd00003220sv0000103Csd00003225bc*sc*i*
alias: pci:v00000E11d00000046sv00000E11sd0000409Dbc*sc*i*
alias: pci:v00000E11d00000046sv00000E11sd0000409Cbc*sc*i*
alias: pci:v00000E11d00000046sv00000E11sd0000409Bbc*sc*i*
alias: pci:v00000E11d00000046sv00000E11sd0000409Abc*sc*i*
alias: pci:v00000E11d00000046sv00000E11sd00004091bc*sc*i*
alias: pci:v00000E11d0000B178sv00000E11sd00004083bc*sc*i*
alias: pci:v00000E11d0000B178sv00000E11sd00004082bc*sc*i*
alias: pci:v00000E11d0000B178sv00000E11sd00004080bc*sc*i*
alias: pci:v00000E11d0000B060sv00000E11sd00004070bc*sc*i*
depends:
intree: Y
vermagic: 3.2.29 SMP mod_unload
parm: cciss_tape_cmds:number of commands to allocate for tape devices (default: 6) (int)
parm: cciss_simple_mode:Use 'simple mode' rather than 'performant mode' (int)

top output during a hang:

top - 10:39:45 up 43 min,  2 users,  load average: 24.58, 7.14, 2.88
Tasks: 282 total,   1 running, 281 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  32894436k total, 17964512k used, 14929924k free,    97732k buffers
Swap:        0k total,        0k used,        0k free, 10694424k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3928 root      20   0 37164 2988 2444 S    0  0.0   0:00.41 sshd
 4478 root      20   0 17608 1540 1060 R    0  0.0   0:07.62 top
    1 root      20   0  4316  696  600 S    0  0.0   0:00.98 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.00 kthreadd
    3 root      20   0     0    0    0 S    0  0.0   0:00.01 ksoftirqd/0
    5 root      20   0     0    0    0 S    0  0.0   0:00.02 kworker/u:0
    6 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/0
    7 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/1
    9 root      20   0     0    0    0 S    0  0.0   0:00.00 ksoftirqd/1
   11 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/2
   13 root      20   0     0    0    0 S    0  0.0   0:00.00 ksoftirqd/2
   14 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/3
   16 root      20   0     0    0    0 S    0  0.0   0:00.00 ksoftirqd/3
   17 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/4
   19 root      20   0     0    0    0 S    0  0.0   0:00.01 ksoftirqd/4
   20 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/5
   22 root      20   0     0    0    0 S    0  0.0   0:00.01 ksoftirqd/5
   23 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/6
   25 root      20   0     0    0    0 S    0  0.0   0:00.00 ksoftirqd/6
   26 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/7
   28 root      20   0     0    0    0 S    0  0.0   0:00.00 ksoftirqd/7
   29 root       0 -20     0    0    0 S    0  0.0   0:00.00 cpuset
   30 root       0 -20     0    0    0 S    0  0.0   0:00.00 khelper
   31 root      20   0     0    0    0 S    0  0.0   0:00.00 kdevtmpfs
   32 root       0 -20     0    0    0 S    0  0.0   0:00.00 netns
   33 root      20   0     0    0    0 S    0  0.0   0:00.00 kworker/u:1
  495 root      20   0     0    0    0 D    0  0.0   0:05.24 sync_supers
  497 root      20   0     0    0    0 S    0  0.0   0:00.00 bdi-default
  499 root       0 -20     0    0    0 S    0  0.0   0:00.00 kblockd
  654 root       0 -20     0    0    0 S    0  0.0   0:00.00 ata_sff
  661 root      20   0     0    0    0 S    0  0.0   0:00.00 khubd
  667 root       0 -20     0    0    0 S    0  0.0   0:00.00 md
  676 root      20   0     0    0    0 S    0  0.0   0:00.40 kworker/3:1
  677 root      20   0     0    0    0 S    0  0.0   0:00.12 kworker/4:1
  678 root      20   0     0    0    0 S    0  0.0   0:00.65 kworker/5:1
  679 root      20   0     0    0    0 S    0  0.0   0:00.16 kworker/6:1
  680 root      20   0     0    0    0 S    0  0.0   0:00.21 kworker/7:1
  774 root       0 -20     0    0    0 S    0  0.0   0:00.00 rpciod
  826 root      20   0     0    0    0 S    0  0.0   0:00.00 khungtaskd
  832 root      20   0     0    0    0 S    0  0.0   0:00.00 kswapd0

Migration to a DL380 G6 with a P410i

I also tried moving the drives directly into another HP server, changing /dev/cciss/c0d0* to /dev/sda* in /etc/fstab and /etc/lilo.conf. The problem is still there.
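
The device rename described above can be sketched as a small text rewrite. This is only an illustration, assuming the usual cciss naming where partitions of /dev/cciss/c0d0 appear as /dev/cciss/c0d0p1, /dev/cciss/c0d0p2, and so on, and assuming the logical drive shows up as /dev/sda under hpsa as in the post; the function name cciss_to_sd is hypothetical.

```python
import re

# Matches /dev/cciss/c0d0 with an optional partition suffix such as p1.
CCISS_RE = re.compile(r"/dev/cciss/c0d0(?:p(\d+))?")

def cciss_to_sd(text, target="/dev/sda"):
    """Rewrite cciss device paths in text to the given sd-style device."""
    def repl(match):
        partition = match.group(1)  # None for the whole-disk device
        return target + (partition or "")
    return CCISS_RE.sub(repl, text)

# Sample lines in the style of /etc/fstab and /etc/lilo.conf.
sample = "/dev/cciss/c0d0p1 / ext4 defaults 1 1\nboot = /dev/cciss/c0d0\n"
print(cciss_to_sd(sample), end="")
```

It is worth reviewing the rewritten files by hand before rebooting, and lilo needs to be re-run after /etc/lilo.conf changes for the new boot device to take effect.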

Controller details:

Note: yes, the cache is disabled; I have no battery at all for this server right now.

root@hyperion:~# modprobe sg
root@hyperion:~# hpacucli ctrl all show detail

Smart Array P410i in Slot 0 (Embedded)
   Bus Interface: PCI
   Slot: 0
   Serial Number: 50123456789ABCDE
   Cache Serial Number: PAAVP9VYBAU0
   RAID 6 (ADG) Status: Disabled
   Controller Status: OK
   Hardware Revision: C
   Firmware Version: 6.64
   Rebuild Priority: Medium
   Expand Priority: Medium
   Surface Scan Delay: 15 secs
   Surface Scan Mode: Idle
   Queue Depth: Automatic
   Monitor and Performance Delay: 60  min
   Elevator Sort: Enabled
   Degraded Performance Optimization: Disabled
   Inconsistency Repair Policy: Disabled
   Wait for Cache Room: Disabled
   Surface Analysis Inconsistency Notification: Disabled
   Post Prompt Timeout: 0 secs
   Cache Board Present: True
   Cache Status: OK
   Cache Ratio: 100% Read / 0% Write
   Drive Write Cache: Disabled
   Total Cache Size: 512 MB
   Total Cache Memory Available: 400 MB
   No-Battery Write Cache: Disabled
   Battery/Capacitor Count: 0
   SATA NCQ Supported: True

Driver details:

root@hyperion:~# modinfo hpsa
filename:       /lib/modules/3.2.29/kernel/drivers/scsi/hpsa.ko
license:        GPL
version:        2.0.2-1
description:    Driver for HP Smart Array Controller version 2.0.2-1
author:         Hewlett-Packard Company
srcversion:     624DA19A5286F6BDA1645F3
alias:          pci:v0000103Cd*sv*sd*bc01sc04i*
alias:          pci:v0000103Cd0000323Bsv0000103Csd00003356bc*sc*i*
alias:          pci:v0000103Cd0000323Bsv0000103Csd00003355bc*sc*i*
alias:          pci:v0000103Cd0000323Bsv0000103Csd00003354bc*sc*i*
alias:          pci:v0000103Cd0000323Bsv0000103Csd00003353bc*sc*i*
alias:          pci:v0000103Cd0000323Bsv0000103Csd00003352bc*sc*i*
alias:          pci:v0000103Cd0000323Bsv0000103Csd00003351bc*sc*i*
alias:          pci:v0000103Cd0000323Bsv0000103Csd00003350bc*sc*i*
alias:          pci:v0000103Cd0000323Asv0000103Csd00003233bc*sc*i*
alias:          pci:v0000103Cd0000323Asv0000103Csd0000324Bbc*sc*i*
alias:          pci:v0000103Cd0000323Asv0000103Csd0000324Abc*sc*i*
alias:          pci:v0000103Cd0000323Asv0000103Csd00003249bc*sc*i*
alias:          pci:v0000103Cd0000323Asv0000103Csd00003247bc*sc*i*
alias:          pci:v0000103Cd0000323Asv0000103Csd00003245bc*sc*i*
alias:          pci:v0000103Cd0000323Asv0000103Csd00003243bc*sc*i*
alias:          pci:v0000103Cd0000323Asv0000103Csd00003241bc*sc*i*
depends:
intree:         Y
vermagic:       3.2.29 SMP mod_unload
parm:           hpsa_allow_any:Allow hpsa driver to access unknown HP Smart Array hardware (int)
parm:           hpsa_simple_mode:Use 'simple mode' rather than 'performant mode' (int)

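Possible cause

Yesterday, while testing by disabling different processes one at a time, I stopped postfix and the server no longer hung. When I started it again, the server hung. It looks like a misconfiguration or some suspicious SMTP request.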
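
To check whether the hangs really line up with mail activity, one option is to log timestamped load samples and compare them with the mail log afterwards. The sketch below assumes the standard /proc/loadavg layout (three averages, running/total tasks, last PID); the names parse_loadavg and watch and the threshold of 10 are illustrative choices, not values taken from the post.

```python
import time

def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into a dict."""
    fields = text.split()
    running, total = (int(x) for x in fields[3].split("/"))
    return {
        "load1": float(fields[0]),
        "load5": float(fields[1]),
        "load15": float(fields[2]),
        "running": running,
        "total": total,
    }

def watch(threshold=10.0, interval=5.0, samples=12, path="/proc/loadavg"):
    """Print a timestamped line whenever the 1-minute load reaches threshold."""
    for _ in range(samples):
        with open(path) as fh:
            sample = parse_loadavg(fh.read())
        if sample["load1"] >= threshold:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(stamp, sample, flush=True)
        time.sleep(interval)
```

Because new logins do not get a prompt during a hang, something like watch(samples=720) would have to be started beforehand in a session that stays open.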

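Answer 1

The HP ProLiant G5 server series is quite old equipment and is no longer supported by any reasonable standard. The model was discontinued in 2009.

That said, if you don't mind being unsupported and running a system that is four or more generations old, the server can still do its job.

In your case, you are dealing with a firmware revision bug on the RAID controller. I suggest updating the RAID controller firmware to the most recent release (2012).

Normally you could do this from within the operating system, but HP does not support Slackware at all. If you can find a way to update the firmware, it will most likely resolve the problem.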

