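For the past 2-3 weeks, my main server has been hanging for no apparent reason. Before that, it had run for more than 4 months without any problem. Each time, a simple reboot clears it.
Current setup:
- HP DL380 G5, 2 x Xeon 4C 3GHz, 16GB RAM, 6 x 146GB (RAID 0+1)
- Slackware 14.0
I keep a PuTTY session open to the server. When it hangs (roughly 1 to 3 times a day), I see a very high load, above about 60, and all services (HTTP, DNS, SMTP, IMAP, POP3, etc.) stop responding. When I connect with PuTTY I can log in, but the prompt never appears; the same happens on the local console (keyboard + screen). I also see the green LEDs on the drives blinking in unison at about 0.5Hz - 1Hz (normally they blink faster and in random order).
I first suspected a DDoS attack or similar, so I added a number of fail2ban checks, TCP request limits on the external firewall, and so on. After that, I checked the firmware versions (including the P400) and upgraded all firmware to the latest versions, but the problem persisted. I also synced the root filesystem to another DL380 G5 (same hardware, but with 4 x 450GB drives) to replace the server, and the same problem showed up again.
I checked with top, iostat, and iotop and still have no clue. While the load is high, there is almost no CPU usage (top) and almost no disk activity (iostat).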
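A high load average with an idle CPU is possible because Linux counts tasks in uninterruptible sleep (state `D`, usually waiting on I/O) toward the load, not only tasks that are running. The following is a minimal sketch, not something from the original post, that lists processes currently in state `D` by reading `/proc/<pid>/stat`. The function names `parse_stat_line` and `blocked_tasks` are made up for illustration, and the script looks at processes only, not individual threads.

```python
from pathlib import Path


def parse_stat_line(line):
    """Return (pid, comm, state) from the text of one /proc/<pid>/stat file."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left])
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    blocked = []
    for stat_file in Path(proc_root).glob("[0-9]*/stat"):
        try:
            pid, comm, state = parse_stat_line(stat_file.read_text())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or the file was unreadable
        if state == "D":
            blocked.append((pid, comm))
    return sorted(blocked)


if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

Running this from a root shell that was opened before the hang would show which processes are stuck; if the list grows while `iostat` shows no activity, requests are probably queued somewhere below the filesystem layer.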
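Now I am wondering whether there is a problem with the CCISS driver in the version I am using?
Here is some information that may be useful:
Controller details: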
root@hyperion:~# hpacucli
=> ctrl all show status
Smart Array P400 in Slot 1
Controller Status: OK
Cache Status: OK
Battery/Capacitor Status: OK
=> ctrl all show detail
Smart Array P400 in Slot 1
Bus Interface: PCI
Slot: 1
Serial Number: P61620G9SVM38V
Cache Serial Number: PA2270H9SVI198
RAID 6 (ADG) Status: Enabled
Controller Status: OK
Hardware Revision: D
Firmware Version: 6.86
Rebuild Priority: Medium
Expand Priority: Medium
Surface Scan Delay: 15 secs
Surface Scan Mode: Idle
Wait for Cache Room: Disabled
Surface Analysis Inconsistency Notification: Disabled
Post Prompt Timeout: 0 secs
Cache Board Present: True
Cache Status: OK
Cache Ratio: 25% Read / 75% Write
Drive Write Cache: Disabled
Total Cache Size: 512 MB
Total Cache Memory Available: 464 MB
No-Battery Write Cache: Disabled
Cache Backup Power Source: Batteries
Battery/Capacitor Count: 1
Battery/Capacitor Status: OK
SATA NCQ Supported: True
=> ctrl all show config
Smart Array P400 in Slot 1 (sn: P61620G9SVM38V)
array A (SAS, Unused Space: 0 MB)
logicaldrive 1 (838.3 GB, RAID 1+0, OK)
physicaldrive 2I:1:1 (port 2I:box 1:bay 1, SAS, 450 GB, OK)
physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SAS, 450 GB, OK)
physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SAS, 450 GB, OK)
physicaldrive 2I:1:4 (port 2I:box 1:bay 4, SAS, 450 GB, OK)
Driver details:
root@hyperion:~# modinfo cciss
filename: /lib/modules/3.2.29/kernel/drivers/block/cciss.ko
license: GPL
version: 3.6.26
description: Driver for HP Smart Array Controllers
author: Hewlett-Packard Company
srcversion: D553A90CDE37829B37A9C27
alias: pci:v0000103Cd00003230sv0000103Csd0000323Dbc*sc*i*
alias: pci:v0000103Cd00003230sv0000103Csd00003237bc*sc*i*
alias: pci:v0000103Cd00003238sv0000103Csd00003215bc*sc*i*
alias: pci:v0000103Cd00003238sv0000103Csd00003214bc*sc*i*
alias: pci:v0000103Cd00003238sv0000103Csd00003213bc*sc*i*
alias: pci:v0000103Cd00003238sv0000103Csd00003212bc*sc*i*
alias: pci:v0000103Cd00003238sv0000103Csd00003211bc*sc*i*
alias: pci:v0000103Cd00003230sv0000103Csd00003235bc*sc*i*
alias: pci:v0000103Cd00003230sv0000103Csd00003234bc*sc*i*
alias: pci:v0000103Cd00003230sv0000103Csd00003223bc*sc*i*
alias: pci:v0000103Cd00003220sv0000103Csd00003225bc*sc*i*
alias: pci:v00000E11d00000046sv00000E11sd0000409Dbc*sc*i*
alias: pci:v00000E11d00000046sv00000E11sd0000409Cbc*sc*i*
alias: pci:v00000E11d00000046sv00000E11sd0000409Bbc*sc*i*
alias: pci:v00000E11d00000046sv00000E11sd0000409Abc*sc*i*
alias: pci:v00000E11d00000046sv00000E11sd00004091bc*sc*i*
alias: pci:v00000E11d0000B178sv00000E11sd00004083bc*sc*i*
alias: pci:v00000E11d0000B178sv00000E11sd00004082bc*sc*i*
alias: pci:v00000E11d0000B178sv00000E11sd00004080bc*sc*i*
alias: pci:v00000E11d0000B060sv00000E11sd00004070bc*sc*i*
depends:
intree: Y
vermagic: 3.2.29 SMP mod_unload
parm: cciss_tape_cmds:number of commands to allocate for tape devices (default: 6) (int)
parm: cciss_simple_mode:Use 'simple mode' rather than 'performant mode' (int)
top output during a hang:
top - 10:39:45 up 43 min, 2 users, load average: 24.58, 7.14, 2.88
Tasks: 282 total, 1 running, 281 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32894436k total, 17964512k used, 14929924k free, 97732k buffers
Swap: 0k total, 0k used, 0k free, 10694424k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3928 root 20 0 37164 2988 2444 S 0 0.0 0:00.41 sshd
4478 root 20 0 17608 1540 1060 R 0 0.0 0:07.62 top
1 root 20 0 4316 696 600 S 0 0.0 0:00.98 init
2 root 20 0 0 0 0 S 0 0.0 0:00.00 kthreadd
3 root 20 0 0 0 0 S 0 0.0 0:00.01 ksoftirqd/0
5 root 20 0 0 0 0 S 0 0.0 0:00.02 kworker/u:0
6 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/0
7 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/1
9 root 20 0 0 0 0 S 0 0.0 0:00.00 ksoftirqd/1
11 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/2
13 root 20 0 0 0 0 S 0 0.0 0:00.00 ksoftirqd/2
14 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/3
16 root 20 0 0 0 0 S 0 0.0 0:00.00 ksoftirqd/3
17 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/4
19 root 20 0 0 0 0 S 0 0.0 0:00.01 ksoftirqd/4
20 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/5
22 root 20 0 0 0 0 S 0 0.0 0:00.01 ksoftirqd/5
23 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/6
25 root 20 0 0 0 0 S 0 0.0 0:00.00 ksoftirqd/6
26 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/7
28 root 20 0 0 0 0 S 0 0.0 0:00.00 ksoftirqd/7
29 root 0 -20 0 0 0 S 0 0.0 0:00.00 cpuset
30 root 0 -20 0 0 0 S 0 0.0 0:00.00 khelper
31 root 20 0 0 0 0 S 0 0.0 0:00.00 kdevtmpfs
32 root 0 -20 0 0 0 S 0 0.0 0:00.00 netns
33 root 20 0 0 0 0 S 0 0.0 0:00.00 kworker/u:1
495 root 20 0 0 0 0 D 0 0.0 0:05.24 sync_supers
497 root 20 0 0 0 0 S 0 0.0 0:00.00 bdi-default
499 root 0 -20 0 0 0 S 0 0.0 0:00.00 kblockd
654 root 0 -20 0 0 0 S 0 0.0 0:00.00 ata_sff
661 root 20 0 0 0 0 S 0 0.0 0:00.00 khubd
667 root 0 -20 0 0 0 S 0 0.0 0:00.00 md
676 root 20 0 0 0 0 S 0 0.0 0:00.40 kworker/3:1
677 root 20 0 0 0 0 S 0 0.0 0:00.12 kworker/4:1
678 root 20 0 0 0 0 S 0 0.0 0:00.65 kworker/5:1
679 root 20 0 0 0 0 S 0 0.0 0:00.16 kworker/6:1
680 root 20 0 0 0 0 S 0 0.0 0:00.21 kworker/7:1
774 root 0 -20 0 0 0 S 0 0.0 0:00.00 rpciod
826 root 20 0 0 0 0 S 0 0.0 0:00.00 khungtaskd
832 root 20 0 0 0 0 S 0 0.0 0:00.00 kswapd0
Migration to a DL380 G6 with a P410i
I also tried moving the disks directly into another HP server, changing /dev/cciss/c0d0* to /dev/sda* in /etc/fstab and /etc/lilo.conf, but the problem persists.
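The device rename described above can be sketched as a small text rewrite. This is an illustration only, not the exact procedure used in the post: it assumes the first cciss logical drive (`c0d0`) shows up as `/dev/sda` under the hpsa driver, and that cciss partitions use the `pN` suffix (`c0d0p1` becomes `sda1`). The helper name `cciss_to_sd` is hypothetical.

```python
import re

# Matches /dev/cciss/c0d0, optionally followed by a partition suffix "pN".
CCISS_DEVICE = re.compile(r"/dev/cciss/c0d0(?:p(\d+))?\b")


def cciss_to_sd(text, target="/dev/sda"):
    """Rewrite cciss device paths for c0d0 to the given SCSI disk name."""
    def replace(match):
        partition = match.group(1) or ""  # empty for the whole-disk device
        return target + partition
    return CCISS_DEVICE.sub(replace, text)


sample = "/dev/cciss/c0d0p1 / ext4 defaults 1 1"
print(cciss_to_sd(sample))
```

The same function can be applied to the contents of `/etc/fstab` and `/etc/lilo.conf`; other logical drives such as `c0d1` are deliberately left untouched. After changing `/etc/lilo.conf`, `lilo` has to be run again for the new boot device to take effect.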
Controller details:
Note: yes, the cache is disabled; I do not have a battery for this server at all right now.
root@hyperion:~# modprobe sg
root@hyperion:~# hpacucli ctrl all show detail
Smart Array P410i in Slot 0 (Embedded)
Bus Interface: PCI
Slot: 0
Serial Number: 50123456789ABCDE
Cache Serial Number: PAAVP9VYBAU0
RAID 6 (ADG) Status: Disabled
Controller Status: OK
Hardware Revision: C
Firmware Version: 6.64
Rebuild Priority: Medium
Expand Priority: Medium
Surface Scan Delay: 15 secs
Surface Scan Mode: Idle
Queue Depth: Automatic
Monitor and Performance Delay: 60 min
Elevator Sort: Enabled
Degraded Performance Optimization: Disabled
Inconsistency Repair Policy: Disabled
Wait for Cache Room: Disabled
Surface Analysis Inconsistency Notification: Disabled
Post Prompt Timeout: 0 secs
Cache Board Present: True
Cache Status: OK
Cache Ratio: 100% Read / 0% Write
Drive Write Cache: Disabled
Total Cache Size: 512 MB
Total Cache Memory Available: 400 MB
No-Battery Write Cache: Disabled
Battery/Capacitor Count: 0
SATA NCQ Supported: True
Driver details:
root@hyperion:~# modinfo hpsa
filename: /lib/modules/3.2.29/kernel/drivers/scsi/hpsa.ko
license: GPL
version: 2.0.2-1
description: Driver for HP Smart Array Controller version 2.0.2-1
author: Hewlett-Packard Company
srcversion: 624DA19A5286F6BDA1645F3
alias: pci:v0000103Cd*sv*sd*bc01sc04i*
alias: pci:v0000103Cd0000323Bsv0000103Csd00003356bc*sc*i*
alias: pci:v0000103Cd0000323Bsv0000103Csd00003355bc*sc*i*
alias: pci:v0000103Cd0000323Bsv0000103Csd00003354bc*sc*i*
alias: pci:v0000103Cd0000323Bsv0000103Csd00003353bc*sc*i*
alias: pci:v0000103Cd0000323Bsv0000103Csd00003352bc*sc*i*
alias: pci:v0000103Cd0000323Bsv0000103Csd00003351bc*sc*i*
alias: pci:v0000103Cd0000323Bsv0000103Csd00003350bc*sc*i*
alias: pci:v0000103Cd0000323Asv0000103Csd00003233bc*sc*i*
alias: pci:v0000103Cd0000323Asv0000103Csd0000324Bbc*sc*i*
alias: pci:v0000103Cd0000323Asv0000103Csd0000324Abc*sc*i*
alias: pci:v0000103Cd0000323Asv0000103Csd00003249bc*sc*i*
alias: pci:v0000103Cd0000323Asv0000103Csd00003247bc*sc*i*
alias: pci:v0000103Cd0000323Asv0000103Csd00003245bc*sc*i*
alias: pci:v0000103Cd0000323Asv0000103Csd00003243bc*sc*i*
alias: pci:v0000103Cd0000323Asv0000103Csd00003241bc*sc*i*
depends:
intree: Y
vermagic: 3.2.29 SMP mod_unload
parm: hpsa_allow_any:Allow hpsa driver to access unknown HP Smart Array hardware (int)
parm: hpsa_simple_mode:Use 'simple mode' rather than 'performant mode' (int)
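Possible cause
Yesterday, while testing the different processes one by one, I disabled postfix and the server stopped hanging. When I started it again, the server hung. It looks like either a misconfiguration or some suspicious SMTP requests.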
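One way to check the "suspicious SMTP requests" theory is to count which clients connect to postfix most often shortly before a hang. The sketch below is an assumption-laden illustration: it relies on the usual `postfix/smtpd[PID]: connect from host[ip]` syslog line, and the log path mentioned afterwards (`/var/log/maillog`) is a guess that depends on the local syslog configuration. The name `top_clients` is made up.

```python
import re
from collections import Counter

# Typical smtpd line: "... postfix/smtpd[1234]: connect from unknown[203.0.113.5]"
CONNECT = re.compile(r"postfix/smtpd\[\d+\]: connect from \S+\[([^\]]+)\]")


def top_clients(lines, limit=10):
    """Return the most frequent client addresses as (address, count) pairs."""
    counts = Counter()
    for line in lines:
        match = CONNECT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common(limit)


sample_log = [
    "Jan  1 10:00:00 hyperion postfix/smtpd[1234]: connect from unknown[203.0.113.5]",
    "Jan  1 10:00:01 hyperion postfix/smtpd[1235]: connect from mail.example.org[198.51.100.7]",
    "Jan  1 10:00:02 hyperion postfix/smtpd[1236]: connect from unknown[203.0.113.5]",
]
for address, count in top_clients(sample_log):
    print(address, count)
```

For real data, pass an open log file instead of the sample list, for example `top_clients(open("/var/log/maillog"))`. A single address dominating the list right before each hang would support the suspicion; an even spread would point back toward configuration.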
Answer 1
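The HP ProLiant G5 server line is quite old equipment and, by any reasonable standard, is no longer supported. It was discontinued in 2009.
That said, the server can still do its job, as long as you do not mind running unsupported on a system that is four or more generations old.
In your case, you are fighting a firmware bug on the RAID controller. I suggest updating the RAID controller firmware to the latest release (from 2012).
Normally you could do this from within the operating system, but HP does not support Slackware at all either. If you can find a way to update the firmware, it will most likely fix the problem.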
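Before and after a firmware update, it helps to confirm what each controller actually reports. The following is a minimal sketch that pulls the `Firmware Version` field out of text in the format printed by `hpacucli ctrl all show detail` in the question. It only parses saved output; the function name `firmware_versions` is hypothetical, and nothing here says which version is the latest for a given controller.

```python
def firmware_versions(report):
    """Map each controller header line to its reported firmware version."""
    versions = {}
    controller = None
    for raw in report.splitlines():
        line = raw.strip()
        if line.startswith("Smart Array"):
            controller = line  # e.g. "Smart Array P400 in Slot 1"
        elif line.startswith("Firmware Version:") and controller:
            versions[controller] = line.split(":", 1)[1].strip()
    return versions


sample_report = """Smart Array P400 in Slot 1
Bus Interface: PCI
Hardware Revision: D
Firmware Version: 6.86
"""
for name, version in firmware_versions(sample_report).items():
    print(name, "->", version)
```

The output of the real command could be saved to a file and fed to the function, which makes it easy to compare the two machines mentioned in the question side by side.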