Package temperature above threshold, cpu clock throttled

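I have two HPE ProLiant DL360 Gen10 servers with nearly identical configurations. Both run CentOS 7.5. The only difference is that one of them has newer firmware and a newer kernel, installed in an attempt to fix this problem.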

dmesg repeatedly reports the following, and the server's performance suffers.

[Oct12 11:43] CPU5: Package temperature above threshold, cpu clock throttled (total events = 539077151)
[  +0.000001] CPU1: Package temperature above threshold, cpu clock throttled (total events = 539077144)
[  +0.000003] CPU4: Package temperature above threshold, cpu clock throttled (total events = 539077179)
[  +0.000002] CPU7: Package temperature above threshold, cpu clock throttled (total events = 539077201)
[  +0.000001] CPU3: Package temperature above threshold, cpu clock throttled (total events = 539077211)
[  +0.000004] CPU6: Package temperature above threshold, cpu clock throttled (total events = 539077197)
[  +0.000001] CPU2: Package temperature above threshold, cpu clock throttled (total events = 539077208)
[  +0.000001] CPU0: Package temperature above threshold, cpu clock throttled (total events = 539077122)
[Oct12 11:44] CPU6: Core temperature above threshold, cpu clock throttled (total events = 447115263)
[  +0.000001] CPU2: Core temperature above threshold, cpu clock throttled (total events = 447115267)
[  +0.002025] CPU6: Core temperature/speed normal
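
To see how the throttle counters are spread across CPUs, the messages above can be tallied with a short script. This is a minimal sketch, not part of the original post: the function name `summarize_throttling` is made up for illustration, and the regular expression assumes the exact message wording shown in the log above.

```python
import re
from collections import defaultdict

# Matches the kernel thermal-throttle messages in the wording shown above.
THROTTLE_RE = re.compile(
    r"CPU(?P<cpu>\d+): (?P<scope>Package|Core) temperature above threshold, "
    r"cpu clock throttled \(total events = (?P<events>\d+)\)"
)


def summarize_throttling(dmesg_text):
    """Return {scope: {cpu: highest event counter seen}} (hypothetical helper)."""
    summary = defaultdict(dict)
    for match in THROTTLE_RE.finditer(dmesg_text):
        scope = match["scope"].lower()
        cpu = int(match["cpu"])
        events = int(match["events"])
        # Keep the largest counter per CPU, since the counter only grows.
        summary[scope][cpu] = max(events, summary[scope].get(cpu, 0))
    return {scope: dict(cpus) for scope, cpus in summary.items()}


# Two lines copied from the log above, used as sample input.
SAMPLE = """\
[Oct12 11:43] CPU5: Package temperature above threshold, cpu clock throttled (total events = 539077151)
[Oct12 11:44] CPU6: Core temperature above threshold, cpu clock throttled (total events = 447115263)
[  +0.002025] CPU6: Core temperature/speed normal
"""

result = summarize_throttling(SAMPLE)
print(result)
```

In practice the text saved from `dmesg` would be passed in place of `SAMPLE`. Counters in the hundreds of millions, as seen here, suggest the throttle condition is being reported almost continuously rather than in occasional bursts.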

HP iLO reports a temperature about 30C lower than sensors does.

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +95.0°C  (high = +86.0°C, crit = +96.0°C)
Core 0:        +95.0°C  (high = +86.0°C, crit = +96.0°C)
Core 2:        +95.0°C  (high = +86.0°C, crit = +96.0°C)
Core 3:        +95.0°C  (high = +86.0°C, crit = +96.0°C)
Core 4:        +94.0°C  (high = +86.0°C, crit = +96.0°C)

At the same time the sensor data above was read, the HPE iLO interface reported a CPU temperature of 55C.
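
To make the gap between the readings and the thresholds explicit, the `sensors` output can be parsed and compared against its own `high` and `crit` values. The sketch below is illustrative only: `parse_sensors` is a hypothetical helper, and the pattern assumes the coretemp line format shown above.

```python
import re

# One coretemp line, e.g. "Core 0:  +95.0°C  (high = +86.0°C, crit = +96.0°C)".
SENSOR_RE = re.compile(
    r"^(?P<label>[^:\n]+):\s+\+?(?P<temp>-?\d+(?:\.\d+)?)°C\s+"
    r"\(high = \+?(?P<high>-?\d+(?:\.\d+)?)°C, "
    r"crit = \+?(?P<crit>-?\d+(?:\.\d+)?)°C\)",
    re.MULTILINE,
)


def parse_sensors(text):
    """Return one dict per temperature line (hypothetical helper)."""
    readings = []
    for m in SENSOR_RE.finditer(text):
        temp, high, crit = (float(m[k]) for k in ("temp", "high", "crit"))
        readings.append({
            "label": m["label"].strip(),
            "temp": temp,
            "above_high": temp > high,     # past the "high" threshold
            "margin_to_crit": crit - temp, # degrees left before "crit"
        })
    return readings


# Sample input copied from the sensors output above.
SAMPLE = """\
coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +95.0°C  (high = +86.0°C, crit = +96.0°C)
Core 4:        +94.0°C  (high = +86.0°C, crit = +96.0°C)
"""

readings = parse_sensors(SAMPLE)
for r in readings:
    print(r)
```

With the values from the post, every reading is above `high` and within a degree or two of `crit`, which matches the throttle messages in dmesg even though iLO shows 55C.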

When I run sensors, I get the following in dmesg:

[Oct12 11:46] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20180313/exfield-393)
[  +0.000726] ACPI Error: Method parse/execution failed \_SB.PMI0._PMM, AE_AML_BUFFER_LIMIT (20180313/psparse-516)
[  +0.000500] ACPI Error: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20180313/power_meter-338)

I updated to the latest kernel (4.18.13-1.el7.elrepo.x86_64) this morning, but that did not help either.

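Answer 1

Open the system's IML (Integrated Management Log) from the iLO web interface and review the events it reports.

That is the authoritative way to check hardware status on HPE servers.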

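Answer 2

I was able to resolve this by updating the kernel in the OS. I am now on 4.18.13-1.el7.elrepo.x86_64. The reported temperatures still differ from those in the iLO UI, but the ratio between the CPU temperature and the "high" threshold is much better and more consistent with the ratio iLO shows.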

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +74.0°C  (high = +86.0°C, crit = +96.0°C)
Core 0:        +72.0°C  (high = +86.0°C, crit = +96.0°C)
Core 2:        +72.0°C  (high = +86.0°C, crit = +96.0°C)
Core 3:        +74.0°C  (high = +86.0°C, crit = +96.0°C)
Core 4:        +71.0°C  (high = +86.0°C, crit = +96.0°C)

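Answer 3

Intel's thermal monitoring yields many different "temperatures", depending on which interface/MSR you use. In addition, different processors may have different thresholds set at manufacturing.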
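
As a rough illustration of why the interface matters: the OS-side reading is not an absolute temperature but an offset below the processor's TjMax, taken from the thermal status MSRs. The sketch below assumes the MSR addresses and bit fields documented in the Intel SDM (IA32_THERM_STATUS at 0x19C, MSR_TEMPERATURE_TARGET at 0x1A2, IA32_PACKAGE_THERM_STATUS at 0x1B1) and assumes the Linux `msr` module is loaded and the script runs as root. The helper names are made up, and this is not necessarily how iLO or any particular driver derives its number.

```python
import os
import struct

IA32_THERM_STATUS = 0x19C          # per-core thermal status
MSR_TEMPERATURE_TARGET = 0x1A2     # bits 23:16 hold TjMax in degrees C
IA32_PACKAGE_THERM_STATUS = 0x1B1  # per-package thermal status


def read_msr(cpu, register):
    """Read a 64-bit MSR via /dev/cpu/N/msr (needs root and the msr module)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, register))[0]
    finally:
        os.close(fd)


def tjmax_from_target(raw):
    """Extract TjMax (bits 23:16) from MSR_TEMPERATURE_TARGET."""
    return (raw >> 16) & 0xFF


def decode_therm_status(raw, tjmax):
    """Decode a core or package thermal status value (hypothetical helper)."""
    readout = (raw >> 16) & 0x7F  # bits 22:16: degrees below TjMax
    return {
        "throttling_now": bool(raw & 0x1),         # bit 0: status
        "throttled_since_clear": bool(raw & 0x2),  # bit 1: sticky log
        "degrees_below_tjmax": readout,
        "temp_c": tjmax - readout,
    }


# Made-up raw values for illustration: TjMax 100, readout 5, status bit set.
example_tjmax = tjmax_from_target(100 << 16)
example = decode_therm_status((5 << 16) | 0x1, example_tjmax)
print(example)
```

On a live system the raw values would come from `read_msr(0, MSR_TEMPERATURE_TARGET)` and `read_msr(0, IA32_PACKAGE_THERM_STATUS)`. Because the result depends on TjMax, a wrong or differently interpreted TjMax shifts every OS-side temperature by the same amount, which is one possible reason two interfaces can disagree.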

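You may also want to try some of the thermal settings in UEFI. There is a "Maximum Cooling" option that can keep you from reaching the threshold.

Finally, take note of the option cards you are using and see whether any of them has an effect. An IO card can throw off thermal monitoring, so that the FW/OS SW believes the system is in a thermal fault state.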
