"Power limit notification" problems on 12G Dell servers running RHEL 6

Server: PowerEdge R620
OS: RHEL 6.4
Kernel: 2.6.32-358.18.1.el6.x86_64

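I'm getting application alerts in my production environment. Critical CPU-hungry processes are being starved of resources, which causes a processing backlog. The problem occurs on every 12th-generation Dell server (R620) in a recently deployed cluster. As far as I can tell, the incidents coincide with peak CPU utilization and are accompanied by a flood of "power limit notification" messages in dmesg. Here is an excerpt from one of these incidents: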

Nov  7 10:15:15 someserver [.crit] CPU12: Core power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU0: Core power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU6: Core power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU14: Core power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU18: Core power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU2: Core power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU4: Core power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU16: Core power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU0: Package power limit notification (total events = 11)
Nov  7 10:15:15 someserver [.crit] CPU6: Package power limit notification (total events = 13)
Nov  7 10:15:15 someserver [.crit] CPU14: Package power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU18: Package power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU20: Core power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU8: Core power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU2: Package power limit notification (total events = 12)
Nov  7 10:15:15 someserver [.crit] CPU10: Core power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU22: Core power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU4: Package power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU16: Package power limit notification (total events = 13)
Nov  7 10:15:15 someserver [.crit] CPU20: Package power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU8: Package power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU10: Package power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU22: Package power limit notification (total events = 14)
Nov  7 10:15:15 someserver [.crit] CPU15: Core power limit notification (total events = 369)
Nov  7 10:15:15 someserver [.crit] CPU3: Core power limit notification (total events = 369)
Nov  7 10:15:15 someserver [.crit] CPU1: Core power limit notification (total events = 369)
Nov  7 10:15:15 someserver [.crit] CPU5: Core power limit notification (total events = 369)
Nov  7 10:15:15 someserver [.crit] CPU17: Core power limit notification (total events = 369)
Nov  7 10:15:15 someserver [.crit] CPU13: Core power limit notification (total events = 369)
Nov  7 10:15:15 someserver [.crit] CPU15: Package power limit notification (total events = 375)
Nov  7 10:15:15 someserver [.crit] CPU3: Package power limit notification (total events = 374)
Nov  7 10:15:15 someserver [.crit] CPU1: Package power limit notification (total events = 376)
Nov  7 10:15:15 someserver [.crit] CPU5: Package power limit notification (total events = 376)
Nov  7 10:15:15 someserver [.crit] CPU7: Core power limit notification (total events = 369)
Nov  7 10:15:15 someserver [.crit] CPU19: Core power limit notification (total events = 369)
Nov  7 10:15:15 someserver [.crit] CPU17: Package power limit notification (total events = 377)
Nov  7 10:15:15 someserver [.crit] CPU9: Core power limit notification (total events = 369)
Nov  7 10:15:15 someserver [.crit] CPU21: Core power limit notification (total events = 369)
Nov  7 10:15:15 someserver [.crit] CPU23: Core power limit notification (total events = 369)
Nov  7 10:15:15 someserver [.crit] CPU11: Core power limit notification (total events = 369)
Nov  7 10:15:15 someserver [.crit] CPU13: Package power limit notification (total events = 376)
Nov  7 10:15:15 someserver [.crit] CPU7: Package power limit notification (total events = 375)
Nov  7 10:15:15 someserver [.crit] CPU19: Package power limit notification (total events = 375)
Nov  7 10:15:15 someserver [.crit] CPU9: Package power limit notification (total events = 374)
Nov  7 10:15:15 someserver [.crit] CPU21: Package power limit notification (total events = 375)
Nov  7 10:15:15 someserver [.crit] CPU23: Package power limit notification (total events = 374)
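An excerpt like the one above is easier to compare across incidents once it is condensed. The following is only a minimal sketch: the function name `summarize_pln` is made up for illustration, and it assumes the messages have exactly the format shown in the excerpt. It reads log text on stdin and prints, for Core and Package notifications separately, how many lines were seen and the highest "total events" counter.

```shell
#!/bin/sh
# Hypothetical helper: condense power limit notification lines read from stdin.
# Assumes the message format shown in the excerpt above.
summarize_pln() {
    awk '
        /(Core|Package) power limit notification/ {
            # Classify the line as a Core or Package notification.
            kind = ($0 ~ /Package power limit/) ? "Package" : "Core"
            # The last field looks like "14)"; keep only the digits.
            n = $NF
            gsub(/[^0-9]/, "", n)
            lines[kind]++
            if (n + 0 > max[kind]) max[kind] = n + 0
        }
        END {
            for (k in lines)
                printf "%s lines=%d max_total_events=%d\n", k, lines[k], max[k]
        }
    ' | sort
}
```

It could be fed from the kernel ring buffer with `dmesg | summarize_pln`, or from a saved log file with `summarize_pln < /var/log/messages` (the path is an assumption about where syslog writes on the host).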

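Googling suggests that this is usually associated with CPUs running hot or with voltage regulation kicking in. I don't think that is what is happening here, though. The temperature sensors on all servers in the cluster read fine, the Power Cap Policy is disabled in iDRAC, and the System Profile is set to "Performance" on all of these servers: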

# omreport chassis biossetup | grep -A10 'System Profile'
System Profile Settings
------------------------------------------
System Profile                                    : Performance
CPU Power Management                              : Maximum Performance
Memory Frequency                                  : Maximum Performance
Turbo Boost                                       : Enabled
C1E                                               : Disabled
C States                                          : Disabled
Monitor/Mwait                                     : Enabled
Memory Patrol Scrub                               : Standard
Memory Refresh Rate                               : 1x
Memory Operating Voltage                          : Auto
Collaborative CPU Performance Control             : Disabled
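
  • A Dell mailing list post describes the symptoms almost perfectly. Dell suggested that the author try the Performance profile, but that did not help. He ended up applying settings from Dell's guide to configuring servers for low-latency environments, and one of those settings (or a combination of them) seems to have fixed the problem.
  • Kernel.org bug #36182 notes that power-limit interrupt debugging is enabled by default, which degrades performance whenever CPU voltage regulation kicks in.
  • An RHN knowledge base article (RHN login required) mentions a problem affecting PE R620 and R720 servers that are not running the Performance profile, and recommends updating to a kernel released two weeks ago. ...Except that we are running the Performance profile...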

Everything I can find online has me running in circles. What is actually going on?

Answer 1

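It is not the voltage regulation that causes the performance problem, but the debugging kernel interrupts that it triggers.

Despite some misinformation on Red Hat's part, all of the linked pages refer to the same phenomenon. The voltage regulation happens with or without the Performance profile, probably because Turbo Boost is enabled. Whatever the reason, these voltage fluctuations interact poorly with the power limit kernel interrupts that are enabled by default in kernel 2.6.32-358.18.1.el6.x86_64.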

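Confirmed workarounds:

  • Upgrading to the newest released Red Hat kernel (2.6.32-358.23.2.el6) disables this debugging and eliminates the performance problem.
  • Adding the following kernel parameter to grub.conf disables PLNs (power limit notifications): clearcpuid=229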
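
To see which of the two confirmed workarounds is active on a given host, something along these lines might help. It is a rough sketch, not a vetted tool: the function names are invented for illustration, the version comparison relies on GNU `sort -V`, and it assumes kernel release strings of the form `2.6.32-358.23.2.el6.x86_64`. Note that a `clearcpuid=229` entry added to the `kernel` line in grub.conf only appears in `/proc/cmdline` after a reboot.

```shell
#!/bin/sh
# Hypothetical helpers for checking the two confirmed workarounds.

# Succeeds if the kernel command line passed as $1 contains clearcpuid=229.
has_clearcpuid_229() {
    case " $1 " in
        *" clearcpuid=229 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeeds if kernel release $1 is at least as new as release $2.
# Everything from ".el" onward is dropped so only the numeric part is compared.
kernel_at_least() (
    have=${1%%.el*}
    want=${2%%.el*}
    oldest=$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)
    [ "$oldest" = "$want" ]
)

# Prints the status of both workarounds for the running system.
report_pln_workarounds() {
    fixed="2.6.32-358.23.2.el6"   # fixed kernel named in the answer above
    if kernel_at_least "$(uname -r)" "$fixed"; then
        echo "kernel $(uname -r): at or past $fixed"
    else
        echo "kernel $(uname -r): older than $fixed"
    fi
    if has_clearcpuid_229 "$(cat /proc/cmdline)"; then
        echo "clearcpuid=229: present on the running kernel command line"
    else
        echo "clearcpuid=229: not present on the running kernel command line"
    fi
}
```

Running `report_pln_workarounds` on an affected server prints one line per workaround. The two checks are kept as separate functions so they can be reused against a saved command line or a list of kernel releases collected from other hosts.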

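Unreliable workarounds:

  • Setting the System Profile to "Performance". On its own, this was not sufficient to disable PLNs on our servers. Your mileage may vary.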

Bad workarounds:

  • Blacklisting ACPI-related modules. I have seen this suggested in a few forum threads. It is ill-advised.
