HP ML115 G5 powers off automatically shortly after the power button is pressed


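We have a (rather old) AMD-based HP ML115 G5 server that powers itself off 10-15 seconds after the power button is pressed (during the fan test, I think), before the BIOS POST emits its single beep.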

We need some help diagnosing a hardware fault remotely (200 km away). Our hardware specs are as follows:

root@linux:~/# dmidecode -t1
# dmidecode 2.12
SMBIOS 2.5 present.

System Information
        Manufacturer: HP
        Product Name: ProLiant ML115 G5
        Serial Number: CZC94743QJ
        SKU Number: 470064-894

root@linux:~/# head -n 30 dmidecode.txt 
# dmidecode 2.12

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
        Vendor: HP
        Version: O18    
        Release Date: 07/06/2009

At the moment it is running stably. I managed to power it on as follows:

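  • power off the server,
  • unplug the power cord for five minutes,
  • lay the server on its longer side on the floor, with the CPU heatsink facing the ceiling.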

If we put it back in its standard position, it fails to power on exactly as described at the beginning. This is fully reproducible.

The voltage/temperature/fan readings look fine to me:

root@linux:~/# ipmitool sdr
POST Error       | Not Readable      | ns
Memory ECC       | Not Readable      | ns
ACPI State       | 0x01              | ok
PCI Reset        | 0x00              | ok
CPU Fan          | 1048.88 RPM       | ok
Rear Fan         | 2107.04 RPM       | ok
CPU Diode        | 26.50 degrees C   | ok
Front Ambient    | 19 degrees C      | ok
System 12V       | 11.93 Volts       | ok
System 5V        | 5.12 Volts        | ok
System AUX 5V    | 4.98 Volts        | ok
System 3.3V      | 3.39 Volts        | ok
System AUX 3.3V  | 3.33 Volts        | ok
CPU Vcore        | 1.07 Volts        | ok
CPU 12V          | 11.82 Volts       | ok
HT 1.2V          | 1.20 Volts        | ok
Mem Vcore        | 1.81 Volts        | ok
MEM VTT          | 0.90 Volts        | ok
MCP55 1.5V       | 1.50 Volts        | ok
MCP55 1.4V       | 1.40 Volts        | ok
Therm-Trip       | 0x00              | ok
CPU Prochot      | 0x00              | ok
System Reset     | 0x00              | ok
NMI              | 0x00              | ok
PCI Error        | Not Readable      | ns
CPU Socket       | 0x01              | ok
LO100 Present    | 0x00              | ok
Watchdog         | Not Readable      | ns
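
When a listing like this has to be reviewed remotely, a small script can pick out the sensors that did not report `ok`. The following is only a minimal sketch: it assumes the three-column `name | reading | status` layout shown above, the function names `parse_sdr` and `not_ok` are made up for illustration, and the sample is a subset of the real output.

```python
def parse_sdr(text):
    """Parse `ipmitool sdr`-style lines into (name, reading, status) tuples."""
    rows = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Keep only well-formed three-column lines with a sensor name.
        if len(parts) == 3 and parts[0]:
            rows.append(tuple(parts))
    return rows


def not_ok(rows):
    """Return the names of sensors whose status column is not 'ok'."""
    return [name for name, _reading, status in rows if status != "ok"]


# A few lines copied from the listing above.
SDR_SAMPLE = """\
POST Error       | Not Readable      | ns
CPU Fan          | 1048.88 RPM       | ok
Rear Fan         | 2107.04 RPM       | ok
CPU Diode        | 26.50 degrees C   | ok
Watchdog         | Not Readable      | ns
"""

rows = parse_sdr(SDR_SAMPLE)
print(not_ok(rows))
```

In this sample only the `ns` (no reading) sensors are flagged; both fans and the CPU diode report `ok`, which matches the observation that the live readings look normal while the machine is running.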

IPMI events:

  18 | 03/18/2015 | 09:29:46 | Temperature #0x20 | Upper Non-critical going high | Asserted
  30 | 03/18/2015 | 09:30:08 | Temperature #0x20 | Upper Critical going high | Asserted
  48 | 03/18/2015 | 10:38:59 | Temperature #0x20 | Upper Non-critical going high | Asserted
  60 | 03/18/2015 | 10:39:20 | Temperature #0x20 | Upper Critical going high | Asserted
  78 | 03/18/2015 | 10:45:26 | Temperature #0x20 | Upper Non-critical going high | Asserted
  90 | 03/18/2015 | 10:45:30 | Temperature #0x20 | Upper Non-critical going high | Deasserted
  a8 | 03/18/2015 | 10:45:56 | Temperature #0x20 | Upper Non-critical going high | Asserted
  c0 | 03/18/2015 | 10:46:12 | Temperature #0x20 | Upper Critical going high | Asserted
  d8 | 03/18/2015 | 10:48:42 | Temperature #0x20 | Upper Non-critical going high | Asserted
  f0 | 03/18/2015 | 10:48:46 | Temperature #0x20 | Upper Non-critical going high | Deasserted
 108 | 03/18/2015 | 10:49:04 | Temperature #0x20 | Upper Non-critical going high | Asserted
 120 | 03/18/2015 | 10:49:18 | Temperature #0x20 | Upper Critical going high | Asserted
 138 | 03/18/2015 | 10:50:24 | Temperature #0x20 | Upper Non-critical going high | Asserted
 150 | 03/18/2015 | 10:50:25 | Temperature #0x20 | Upper Critical going high | Asserted
 168 | 03/18/2015 | 10:57:53 | Temperature #0x20 | Upper Non-critical going high | Asserted
 180 | 03/18/2015 | 10:57:57 | Temperature #0x20 | Upper Non-critical going high | Deasserted
 198 | 03/18/2015 | 10:58:24 | Temperature #0x20 | Upper Non-critical going high | Asserted
 1b0 | 03/18/2015 | 10:58:41 | Temperature #0x20 | Upper Critical going high | Asserted
 1c8 | 03/18/2015 | 11:14:23 | Temperature #0x20 | Upper Non-critical going high | Asserted
 1e0 | 03/18/2015 | 11:15:06 | Temperature #0x20 | Upper Non-critical going high | Deasserted
 1f8 | 03/18/2015 | 11:16:33 | Temperature #0x20 | Upper Non-critical going high | Asserted
 210 | 03/18/2015 | 11:16:33 | Temperature #0x20 | Upper Critical going high | Asserted
 228 | 03/18/2015 | 11:49:12 | Temperature #0x20 | Upper Non-critical going high | Asserted
 240 | 03/18/2015 | 11:49:18 | Temperature #0x20 | Upper Non-critical going high | Deasserted
 258 | 03/18/2015 | 11:55:45 | Temperature #0x20 | Upper Non-critical going high | Asserted
 270 | 03/18/2015 | 11:55:46 | Temperature #0x20 | Upper Non-critical going high | Deasserted
 288 | 03/18/2015 | 11:56:32 | Temperature #0x20 | Upper Non-critical going high | Asserted
 2a0 | 03/18/2015 | 11:57:06 | Temperature #0x20 | Upper Critical going high | Asserted
 2b8 | 03/18/2015 | 12:00:11 | Temperature #0x20 | Upper Non-critical going high | Asserted
 2d0 | 03/18/2015 | 12:00:14 | Temperature #0x20 | Upper Non-critical going high | Deasserted
 2e8 | 03/18/2015 | 12:00:59 | Temperature #0x20 | Upper Non-critical going high | Asserted
 300 | 03/18/2015 | 12:01:34 | Temperature #0x20 | Upper Critical going high | Asserted
 318 | 07/06/2009 | 00:00:22 | Fan #0x42 | Upper Critical going high | Asserted
 330 | 11/13/2016 | 13:25:47 | Fan #0x41 | Upper Critical going high | Asserted
 348 | 11/13/2016 | 13:33:00 | Fan #0x41 | Upper Critical going high | Asserted
 360 | 11/13/2016 | 13:33:47 | Fan #0x41 | Upper Critical going high | Asserted
 378 | 11/13/2016 | 13:44:58 | Fan #0x41 | Upper Critical going high | Asserted
 390 | 11/13/2016 | 13:45:48 | Fan #0x41 | Upper Critical going high | Asserted
 3a8 | 11/13/2016 | 13:47:45 | Fan #0x41 | Upper Critical going high | Asserted
 3c0 | 12/01/2016 | 17:00:29 | Fan #0x41 | Upper Critical going high | Asserted
 3d8 | 12/01/2016 | 17:01:53 | Fan #0x41 | Upper Critical going high | Asserted
 3f0 | 12/01/2016 | 17:04:02 | Fan #0x41 | Upper Critical going high | Asserted
 408 | 12/01/2016 | 17:31:34 | Fan #0x41 | Upper Critical going high | Asserted
 420 | 12/01/2016 | 17:43:42 | Fan #0x41 | Upper Critical going high | Asserted
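
To see at a glance which sensors dominate a log like this, the asserted events can be counted per sensor and event type. This is a rough sketch that assumes the six-column, pipe-separated layout shown above; `summarize_sel` is a hypothetical helper name, and the sample contains only a handful of the lines from the log.

```python
from collections import Counter


def summarize_sel(text):
    """Count asserted events per (sensor, event) pair in SEL-style text."""
    counts = Counter()
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        _record_id, _date, _time, sensor, event, direction = fields
        if direction == "Asserted":
            counts[(sensor, event)] += 1
    return counts


# A few lines copied from the event log above.
SEL_SAMPLE = """\
  18 | 03/18/2015 | 09:29:46 | Temperature #0x20 | Upper Non-critical going high | Asserted
  30 | 03/18/2015 | 09:30:08 | Temperature #0x20 | Upper Critical going high | Asserted
  90 | 03/18/2015 | 10:45:30 | Temperature #0x20 | Upper Non-critical going high | Deasserted
 318 | 07/06/2009 | 00:00:22 | Fan #0x42 | Upper Critical going high | Asserted
 330 | 11/13/2016 | 13:25:47 | Fan #0x41 | Upper Critical going high | Asserted
 3c0 | 12/01/2016 | 17:00:29 | Fan #0x41 | Upper Critical going high | Asserted
"""

summary = summarize_sel(SEL_SAMPLE)
for (sensor, event), count in summary.most_common():
    print(f"{count:3d}  {sensor}  {event}")
```

Run against the full log, a summary of this kind would separate the 2015 temperature events on sensor `#0x20` from the 2016 fan events on sensor `#0x41`, which are the ones that coincide with the power-on failures.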

I first ran into this on 13 November 2016. I thought the hardware watchdog might be responsible, so we disabled it in the BIOS.

The server has 2x1TB and 2x3TB disks (no optical drive) and a 365 W non-hot-plug, non-redundant power supply.

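For now we are recommending that the box be replaced, but I cannot explain why this is happening (I suspect some kind of mechanical motherboard fault). I would like to know whether you have any other ideas.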

** UPDATE: Mr. Chopper3 asked me what I meant by "but CPU one is not standard". The original heatsink broke as shown here: (image: broken heatsink)

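The cause was age and a poor choice of material: plastic simply does not hold up under pressure. Since that installation I have never seen a plastic bracket in any other box...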

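The server has always been kept in a good environment: it never overheated, was never exposed to direct sunlight, and nobody touched it while it was running.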

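That was about 1.5 years ago. We could no longer find the original HP part on the market, so we replaced it with a cooler about 3 times larger, since socket AM2 was no longer that common by then. I cannot remember now whether its fan has 4 wires (2 signal wires plus VCC and GND), like the stock one shown above. It may have only 3: VCC, GND and the rotation signal. Since then we have had several power outages, and this problem never occurred.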

Answer 1

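I think the motherboard is faulty, for example a broken solder joint or a damaged component near the edge. I have seen a similar failure where simply pressing on the motherboard let the server boot, but as soon as I released the pressure the server shut down with a fan failure or hung with ECC errors.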

Answer 2

Your fan may be failing, and the server is configured to halt on a critical fan failure.
