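Server: PowerEdge r620
OS: RHEL 6.4
Kernel: 2.6.32-358.18.1.el6.x86_64
I'm getting application alarms in my production environment. Critical CPU-hungry processes are being starved of resources, which is causing a processing backlog. The problem occurs on all of the 12th-generation Dell servers (r620s) in a recently deployed cluster. As far as I can tell, the incidents line up with peak CPU utilization and are accompanied by massive amounts of "power limit notification" spam in dmesg. Here is an excerpt from one of these events: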
Nov 7 10:15:15 someserver [.crit] CPU12: Core power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU0: Core power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU6: Core power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU14: Core power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU18: Core power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU2: Core power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU4: Core power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU16: Core power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU0: Package power limit notification (total events = 11)
Nov 7 10:15:15 someserver [.crit] CPU6: Package power limit notification (total events = 13)
Nov 7 10:15:15 someserver [.crit] CPU14: Package power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU18: Package power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU20: Core power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU8: Core power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU2: Package power limit notification (total events = 12)
Nov 7 10:15:15 someserver [.crit] CPU10: Core power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU22: Core power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU4: Package power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU16: Package power limit notification (total events = 13)
Nov 7 10:15:15 someserver [.crit] CPU20: Package power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU8: Package power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU10: Package power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU22: Package power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU15: Core power limit notification (total events = 369)
Nov 7 10:15:15 someserver [.crit] CPU3: Core power limit notification (total events = 369)
Nov 7 10:15:15 someserver [.crit] CPU1: Core power limit notification (total events = 369)
Nov 7 10:15:15 someserver [.crit] CPU5: Core power limit notification (total events = 369)
Nov 7 10:15:15 someserver [.crit] CPU17: Core power limit notification (total events = 369)
Nov 7 10:15:15 someserver [.crit] CPU13: Core power limit notification (total events = 369)
Nov 7 10:15:15 someserver [.crit] CPU15: Package power limit notification (total events = 375)
Nov 7 10:15:15 someserver [.crit] CPU3: Package power limit notification (total events = 374)
Nov 7 10:15:15 someserver [.crit] CPU1: Package power limit notification (total events = 376)
Nov 7 10:15:15 someserver [.crit] CPU5: Package power limit notification (total events = 376)
Nov 7 10:15:15 someserver [.crit] CPU7: Core power limit notification (total events = 369)
Nov 7 10:15:15 someserver [.crit] CPU19: Core power limit notification (total events = 369)
Nov 7 10:15:15 someserver [.crit] CPU17: Package power limit notification (total events = 377)
Nov 7 10:15:15 someserver [.crit] CPU9: Core power limit notification (total events = 369)
Nov 7 10:15:15 someserver [.crit] CPU21: Core power limit notification (total events = 369)
Nov 7 10:15:15 someserver [.crit] CPU23: Core power limit notification (total events = 369)
Nov 7 10:15:15 someserver [.crit] CPU11: Core power limit notification (total events = 369)
Nov 7 10:15:15 someserver [.crit] CPU13: Package power limit notification (total events = 376)
Nov 7 10:15:15 someserver [.crit] CPU7: Package power limit notification (total events = 375)
Nov 7 10:15:15 someserver [.crit] CPU19: Package power limit notification (total events = 375)
Nov 7 10:15:15 someserver [.crit] CPU9: Package power limit notification (total events = 374)
Nov 7 10:15:15 someserver [.crit] CPU21: Package power limit notification (total events = 375)
Nov 7 10:15:15 someserver [.crit] CPU23: Package power limit notification (total events = 374)
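To get a sense of how much of this spam a host has produced, the lines can be tallied by type. The following is only a rough sketch: the function name `count_pln_events` is hypothetical, and it assumes the messages have the same wording as in the excerpt above.

```shell
# Hypothetical helper (not from the original post): reads log text on stdin
# and counts "Core" vs "Package" power limit notification lines.
count_pln_events() {
    awk '/power limit notification/ {
             # Find the field that says which kind of notification this is.
             for (i = 1; i <= NF; i++)
                 if ($i == "Core" || $i == "Package") n[$i]++
         }
         END { printf "Core=%d Package=%d\n", n["Core"], n["Package"] }'
}
```

It could be fed from the kernel ring buffer, for example `dmesg | count_pln_events`.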
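Some googling suggests that this is normally associated with the CPU running hot or with voltage regulation kicking in. I don't think that is what is happening here, though. The temperature sensors on all servers in the cluster look fine, Power Cap Policy is disabled in the iDRAC, and the system profile is set to "Performance" on all of these servers: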
# omreport chassis biossetup | grep -A10 'System Profile'
System Profile Settings
------------------------------------------
System Profile : Performance
CPU Power Management : Maximum Performance
Memory Frequency : Maximum Performance
Turbo Boost : Enabled
C1E : Disabled
C States : Disabled
Monitor/Mwait : Enabled
Memory Patrol Scrub : Standard
Memory Refresh Rate : 1x
Memory Operating Voltage : Auto
Collaborative CPU Performance Control : Disabled
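- A Dell mailing list post describes the symptoms almost perfectly. Dell suggested that the author try the Performance profile, but that didn't help. He eventually applied the settings from Dell's guide to configuring servers for low-latency environments, and one of those settings (or some combination of them) appears to have fixed the problem.
- Kernel.org bug #36182 notes that power-limit interrupt debugging was enabled by default, which degrades performance whenever CPU voltage regulation kicks in.
- An RHN knowledge base article (RHN login required) mentions a problem affecting PE r620 and r720 servers that are not running the Performance profile, and recommends updating to a kernel released two weeks ago. ...Except that we are running the Performance profile...
Everything I can find online has me running in circles. What on earth is going on?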
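Answer 1
It isn't the voltage regulation that causes the performance problem, but the debugging kernel interrupts that it triggers.
Despite some misinformation on Red Hat's part, all of the linked pages refer to the same phenomenon. The voltage regulation happens with or without the Performance profile, probably because Turbo Boost is enabled. Whatever the reason, these voltage fluctuations interact poorly with the power limit notification (PLN) kernel interrupts that are enabled by default in kernel 2.6.32-358.18.1.el6.x86_64.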
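Confirmed workarounds:
- Upgrading to the most recently released Red Hat kernel (2.6.32-358.23.2.el6) disables this debugging and eliminates the performance problem.
- Adding the following kernel parameter to grub.conf disables PLNs: clearcpuid=229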
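After a reboot, it may be worth confirming that the clearcpuid workaround actually took effect. The sketch below is illustrative only: the function names are hypothetical, and it assumes the running kernel exposes the feature as a `pln` flag in /proc/cpuinfo. On kernels of this era, feature number 229 corresponds to word 7, bit 5 (7 * 32 + 5), which is the PLN feature bit; the numbering differs on later kernels, so do not reuse the value blindly.

```shell
# Hypothetical helpers (not from the original post).

# Succeeds if the given kernel command line contains exactly clearcpuid=229.
has_clearcpuid_229() {
    case " $1 " in
        *" clearcpuid=229 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeeds if the given cpuinfo text lists "pln" on a flags line.
# Assumption: the kernel reports the feature under the flag name "pln".
has_pln_flag() {
    printf '%s\n' "$1" | grep -E '^flags[[:space:]]*:' | grep -qw pln
}

# Prints the state of the running system using the two checks above.
check_pln_status() {
    if has_clearcpuid_229 "$(cat /proc/cmdline)"; then
        echo "clearcpuid=229 is on the kernel command line"
    else
        echo "clearcpuid=229 is NOT on the kernel command line"
    fi
    if has_pln_flag "$(cat /proc/cpuinfo)"; then
        echo "pln flag is still reported by the kernel"
    else
        echo "pln flag is not reported by the kernel"
    fi
}
```

Running `check_pln_status` on a patched host should report the parameter as present and the flag as absent. A comma-separated form of the parameter (several features cleared at once) is not handled by this sketch.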
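Flaky workarounds:
- Setting the system profile to "Performance". On its own, this was not enough to disable PLNs on our servers. Your mileage may vary.
Bad workarounds:
- Blacklisting ACPI-related modules. I've seen this suggested in a few forum threads. It is ill-advised, so don't.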