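The server has been up for less than 10 minutes, but top shows extremely high CPU time for every process [1] (more than a million hours used), and this is a 24-core machine. The system eventually crashes within 10-15 minutes. After a power cycle it goes back to normal.
I'm inclined to think there is a faulty piece of hardware that somehow gets initialised correctly by the power cycle.
Any idea what could be wrong?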
[1]
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13 root 20 0 0 0 0 S 100.0 0.0 30019,26 ksoftirqd/2
33 root 20 0 0 0 0 S 100.0 0.0 40025,54 ksoftirqd/7
53 root 20 0 0 0 0 S 100.0 0.0 65042,06 ksoftirqd/12
2842 root 20 0 14.0g 362m 11m S 5500.0 0.3 8206270h java
12830 root 20 0 104m 2400 1532 S 100.0 0.0 5139288h bash
2541 root 39 19 0 0 0 S 1.0 0.0 300194:24 kipmi0
14937 root 20 0 13516 1640 956 R 0.7 0.0 0:00.12 top
160 root 20 0 0 0 0 S 0.3 0.0 20012,57 kblockd/6
1 root 20 0 21444 1548 1240 S 0.0 0.0 4270563h init
2 root 20 0 0 0 0 S 0.0 0.0 785508,31 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
4 root 20 0 0 0 0 S 0.0 0.0 10237405h ksoftirqd/0
5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
6 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
7 root RT 0 0 0 0 R 0.0 0.0 300194:20 migration/1
8 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/1
9 root 20 0 0 0 0 S 0.0 0.0 30019,26 ksoftirqd/1
10 root RT 0 0 0 0 R 0.0 0.0 300194:20 watchdog/1
11 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/2
12 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/2
14 root RT 0 0 0 0 S 0.0 0.0 300194:20 watchdog/2
15 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/3
16 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/3
17 root 20 0 0 0 0 S 0.0 0.0 900583:01 ksoftirqd/3
18 root RT 0 0 0 0 S 0.0 0.0 300194:20 watchdog/
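
As a rough sanity check on why these figures cannot be real, here is a minimal Python sketch. It assumes that a trailing `h` in the TIME+ column means whole hours, which matches the "million hours" reading above. The helper names `max_cpu_hours` and `parse_top_hours` are made up for illustration; the core count, the uptime and the `java` value are taken from the post.

```python
def max_cpu_hours(cores: int, uptime_minutes: float) -> float:
    """Upper bound on CPU time (in hours) one process could accumulate since boot."""
    return cores * uptime_minutes / 60.0


def parse_top_hours(field: str) -> float:
    """Parse a TIME+ value printed in whole hours, e.g. '8206270h' (assumed format)."""
    if not field.endswith("h"):
        raise ValueError(f"not an hours-style TIME+ value: {field!r}")
    return float(field[:-1])


CORES = 24           # from the post
UPTIME_MINUTES = 10  # "less than 10 minutes", so this is already generous

bound = max_cpu_hours(CORES, UPTIME_MINUTES)
reported = parse_top_hours("8206270h")  # TIME+ of the java process (PID 2842)

print(f"physical upper bound: {bound} h")
print(f"reported by top:      {reported} h")
print(f"plausible: {reported <= bound}")
```

Even with all 24 cores busy for the full 10 minutes, no process could have used more than 4 CPU-hours, so the values in the listing look like corrupted time accounting rather than real load.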