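I have two HPE ProLiant DL360 Gen10 servers with nearly identical configurations. Both run CentOS 7.5. The only difference is that one of them has newer firmware and a newer kernel, installed in an attempt to fix this problem.
dmesg repeatedly reports the following, and the server's performance suffers.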
[Oct12 11:43] CPU5: Package temperature above threshold, cpu clock throttled (total events = 539077151)
[ +0.000001] CPU1: Package temperature above threshold, cpu clock throttled (total events = 539077144)
[ +0.000003] CPU4: Package temperature above threshold, cpu clock throttled (total events = 539077179)
[ +0.000002] CPU7: Package temperature above threshold, cpu clock throttled (total events = 539077201)
[ +0.000001] CPU3: Package temperature above threshold, cpu clock throttled (total events = 539077211)
[ +0.000004] CPU6: Package temperature above threshold, cpu clock throttled (total events = 539077197)
[ +0.000001] CPU2: Package temperature above threshold, cpu clock throttled (total events = 539077208)
[ +0.000001] CPU0: Package temperature above threshold, cpu clock throttled (total events = 539077122)
[Oct12 11:44] CPU6: Core temperature above threshold, cpu clock throttled (total events = 447115263)
[ +0.000001] CPU2: Core temperature above threshold, cpu clock throttled (total events = 447115267)
[ +0.002025] CPU6: Core temperature/speed normal
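For anyone who wants to see which CPUs are throttling and how fast the counters grow, the messages above can be summarized with a short script. This is only a minimal sketch in Python that assumes the exact message format shown in this post; `summarize_throttle` is a hypothetical helper name, not part of any tool.

```python
import re

# Matches kernel messages in the format quoted above, e.g.
# "CPU5: Package temperature above threshold, cpu clock throttled (total events = 539077151)"
THROTTLE_RE = re.compile(
    r"CPU(?P<cpu>\d+): (?P<scope>Package|Core) temperature above threshold, "
    r"cpu clock throttled \(total events = (?P<events>\d+)\)"
)


def summarize_throttle(log_text):
    """Return {(scope, cpu): highest 'total events' value seen in log_text}."""
    summary = {}
    for line in log_text.splitlines():
        match = THROTTLE_RE.search(line)
        if match is None:
            continue  # e.g. "Core temperature/speed normal" lines
        key = (match.group("scope"), int(match.group("cpu")))
        events = int(match.group("events"))
        summary[key] = max(summary.get(key, 0), events)
    return summary


# Two lines copied from the log above, used as sample input.
sample = (
    "[Oct12 11:43] CPU5: Package temperature above threshold, "
    "cpu clock throttled (total events = 539077151)\n"
    "[Oct12 11:44] CPU6: Core temperature above threshold, "
    "cpu clock throttled (total events = 447115263)\n"
    "[  +0.002025] CPU6: Core temperature/speed normal\n"
)
for (scope, cpu), events in sorted(summarize_throttle(sample).items()):
    print(f"{scope} CPU{cpu}: {events} events")
```

Feeding it the full dmesg text at two different times and comparing the counters gives a rough throttle-event rate per CPU.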
HP iLO reports temperatures about 30°C lower than sensors reports.
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +95.0°C (high = +86.0°C, crit = +96.0°C)
Core 0: +95.0°C (high = +86.0°C, crit = +96.0°C)
Core 2: +95.0°C (high = +86.0°C, crit = +96.0°C)
Core 3: +95.0°C (high = +86.0°C, crit = +96.0°C)
Core 4: +94.0°C (high = +86.0°C, crit = +96.0°C)
At the same moment the sensors data was read, the HPE iLO interface reported a CPU temperature of 55°C.
When I run sensors, I get the following in dmesg:
[Oct12 11:46] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20180313/exfield-393)
[ +0.000726] ACPI Error: Method parse/execution failed \_SB.PMI0._PMM, AE_AML_BUFFER_LIMIT (20180313/psparse-516)
[ +0.000500] ACPI Error: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20180313/power_meter-338)
I updated to the latest kernel (4.18.13-1.el7.elrepo.x86_64) this morning, but that did not help either.
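Answer 1
Open the system's IML (Integrated Management Log) from the iLO web interface and review the events it reports.
This is the authoritative way to check hardware status on HPE servers.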
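Answer 2
I was able to resolve this by updating the kernel in the operating system. I am now running 4.18.13-1.el7.elrepo.x86_64. The reported temperatures still differ from those in the iLO UI, but the ratio of CPU temperature to the "high" threshold is much better and more consistent with the ratio iLO shows.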
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +74.0°C (high = +86.0°C, crit = +96.0°C)
Core 0: +72.0°C (high = +86.0°C, crit = +96.0°C)
Core 2: +72.0°C (high = +86.0°C, crit = +96.0°C)
Core 3: +74.0°C (high = +86.0°C, crit = +96.0°C)
Core 4: +71.0°C (high = +86.0°C, crit = +96.0°C)
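To put a number on that ratio, each reading can be divided by its "high" threshold. The following is a rough sketch in Python, assuming the coretemp line layout shown in this thread; `parse_sensors` is a hypothetical helper and the regular expression is only tested against this format.

```python
import re

# Matches lines such as:
# "Package id 0: +74.0°C (high = +86.0°C, crit = +96.0°C)"
LINE_RE = re.compile(
    r"^(?P<label>[^:]+):\s*\+?(?P<temp>-?[\d.]+)°C\s*"
    r"\(high = \+?(?P<high>-?[\d.]+)°C, crit = \+?(?P<crit>-?[\d.]+)°C\)"
)


def parse_sensors(text):
    """Return one dict per temperature line, including temp/high as 'ratio'."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skips "coretemp-isa-0000", "Adapter: ISA adapter", etc.
        temp = float(match.group("temp"))
        high = float(match.group("high"))
        rows.append({
            "label": match.group("label").strip(),
            "temp": temp,
            "high": high,
            "crit": float(match.group("crit")),
            "ratio": temp / high,  # above 1.0 means past the "high" threshold
        })
    return rows


# Package lines copied from the two sensors outputs in this thread.
before = "Package id 0: +95.0°C (high = +86.0°C, crit = +96.0°C)"
after = "Package id 0: +74.0°C (high = +86.0°C, crit = +96.0°C)"
for name, text in (("before", before), ("after", after)):
    row = parse_sensors(text)[0]
    print(f"{name}: {row['label']} at {row['ratio']:.2f} of high threshold")
```

With the values posted here, the package reading goes from 95/86 (about 1.10, above the "high" threshold) to 74/86 (about 0.86, below it).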
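Answer 3
Intel's thermal monitoring can yield many different "temperatures", depending on which interface or MSR you read. In addition, different processors may have different thresholds set at manufacturing.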
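As one illustration of why interfaces can disagree: to my understanding, the Linux coretemp driver does not read an absolute temperature. It reads a "degrees below TjMax" value from IA32_THERM_STATUS (0x19C) or IA32_PACKAGE_THERM_STATUS (0x1B1) and subtracts it from the TjMax found in MSR_TEMPERATURE_TARGET (0x1A2). The sketch below decodes raw register values that way. The bit positions are as I recall them from the Intel SDM and should be verified for your CPU; `read_msr` assumes the msr kernel module is loaded and the script runs as root, and the sample values are made up.

```python
import os
import struct

# MSR addresses (per the Intel SDM; verify for your processor).
IA32_THERM_STATUS = 0x19C          # per-core status
IA32_PACKAGE_THERM_STATUS = 0x1B1  # per-package status
MSR_TEMPERATURE_TARGET = 0x1A2     # holds TjMax


def tjmax_from_target(target_raw):
    """TjMax in degrees C, assumed to be bits 23:16 of MSR_TEMPERATURE_TARGET."""
    return (target_raw >> 16) & 0xFF


def degrees_below_tjmax(status_raw):
    """Digital readout, assumed to be bits 22:16 of the thermal status MSR."""
    return (status_raw >> 16) & 0x7F


def decode_temperature(status_raw, target_raw):
    """Temperature in degrees C, computed as TjMax minus the digital readout."""
    return tjmax_from_target(target_raw) - degrees_below_tjmax(status_raw)


def read_msr(cpu, reg):
    """Read a 64-bit MSR via /dev/cpu/<cpu>/msr (needs root and 'modprobe msr')."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)


# Made-up raw values for illustration: TjMax = 96, readout = 1 degree below it.
example_target = 96 << 16
example_status = 1 << 16
print(decode_temperature(example_status, example_target))
```

On a real system the two raw values would come from `read_msr(0, IA32_PACKAGE_THERM_STATUS)` and `read_msr(0, MSR_TEMPERATURE_TARGET)`. Because the result is relative to TjMax, a wrong or differently interpreted TjMax shifts every reading by the same amount, which is one possible reason for a constant offset against a BMC such as iLO that uses its own sensors.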
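You may also want to try some of the thermal settings in UEFI. There is a "Maximum Cooling" option that may keep you from reaching the threshold.
Finally, note which option cards you are using and check whether any of them has an effect. An I/O card can throw off thermal monitoring and make the firmware or OS software believe the system is in a thermal fault state.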