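I have a PowerEdge R620 server with 2 Intel Xeon CPU E5-2695 v2 @ 2.40GHz processors.
This server ran well for years without any problem. Ten months ago it was reinstalled with a fresh Ubuntu 20.04, and since then it had been running fine, hosting about 15 virtual machines with KVM, without any problem.
One day, without any particular action on our side, the CPU usage of all VMs suddenly increased. At the same time, the power consumption of the server dropped drastically.
At first we thought one specific VM was the culprit, but after moving all VMs to another, similar server, they all ran normally there.
I ran some tests on the server (sysbench and so on), and the CPU performance is really bad: this Xeon E5-2695 v2 is reported as 10 times slower than an E5-2620 v2.
During sysbench, /proc/cpuinfo shows frequencies between 150 MHz (!) and 1 GHz for all cores, which should not be possible according to the CPU specs.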
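For anyone who wants to repeat that observation, here is a minimal sketch that summarizes the per-core frequencies in /proc/cpuinfo. The function name summarize_cpu_mhz is made up for illustration, and the sketch assumes the usual x86 line format "cpu MHz : <value>".

```python
import re


def summarize_cpu_mhz(cpuinfo_text):
    """Return (lowest, highest, count) of the 'cpu MHz' values in cpuinfo text."""
    values = [
        float(match)
        for match in re.findall(r"^cpu MHz\s*:\s*([\d.]+)", cpuinfo_text, re.MULTILINE)
    ]
    if not values:
        raise ValueError("no 'cpu MHz' lines found")
    return min(values), max(values), len(values)
```

On the host itself it could be called as summarize_cpu_mhz(open("/proc/cpuinfo").read()) while the benchmark is running.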
cpufreq-info (for example, the last core, 47):
analyzing CPU 47 :
driver : intel_pstate
CPUs which run at the same hardware frequency: 47
CPUs which need to have their frequency coordinated by software: 47
maximum transition latency: 4294.55 ms.
hardware limits : 1.20 GHz - 3.20 GHz
available cpufreq governors : performance, powersave
current policy : frequency should be within 1.20 GHz and 3.20 GHz.
The governor "performance" may decide which speed to use
within this range.
current CPU frequency: 482 MHz.
It shows the CPU using the performance governor, but the current frequency is only 482 MHz, when it should be between 1.2 GHz and 3.2 GHz.
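The same comparison can be made without cpufreq-info by reading the cpufreq attributes in sysfs, which are expressed in kHz. This is only a sketch: the helper names read_khz and frequency_report are hypothetical, and it assumes the standard scaling_cur_freq, cpuinfo_min_freq and cpuinfo_max_freq files are present for the core.

```python
from pathlib import Path


def read_khz(cpufreq_dir, name):
    """Read one cpufreq sysfs attribute; the kernel reports these in kHz."""
    return int((Path(cpufreq_dir) / name).read_text().strip())


def frequency_report(cpufreq_dir):
    """Compare the current frequency with the advertised hardware limits."""
    cur = read_khz(cpufreq_dir, "scaling_cur_freq")
    low = read_khz(cpufreq_dir, "cpuinfo_min_freq")
    high = read_khz(cpufreq_dir, "cpuinfo_max_freq")
    return {
        "cur_mhz": cur / 1000,
        "min_mhz": low / 1000,
        "max_mhz": high / 1000,
        "below_min": cur < low,  # True means the core runs under its stated floor
    }
```

For core 47 the directory would be /sys/devices/system/cpu/cpu47/cpufreq.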
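I am pretty sure that rebooting the server would fix the problem, but I would like to understand what is happening.
There is nothing in the logs. The CPU temperature is normal (and does not change during a stress test!).
Running intel_reg_pp under load gives the following: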
+----+------------------------------+---------+----------+
| # | MSR Register | Address | Core 0 |
+----+------------------------------+---------+----------+
| 0 | IA32_PERF_CTL | 0x199 | c00 |
| 1 | IA32_CLOCK_MODULATION | 0x19A | 0 |
| 2 | IA32_THERM_INTERRUPT | 0x19B | 3 |
| 3 | IA32_HWP_THERM_STATUS | 0x19C | 88310c00 |
| 4 | IA32_MISC_ENABLE | 0x1A0 | 850089 |
| 5 | IA32_PACKAGE_THERM_MARGIN | 0x1A1 | 1db0 |
| 6 | IA32_TEMPERATURE_TARGET | 0x1A2 | 571000 |
| 7 | IA32_PKG_THERM_STATUS | 0x1B1 | 882d0c00 |
| 8 | MSR_PKG_ENERGY_STATUS | 0x611 | 5f8a8ad6 |
| 9 | MSR_PKG_STATUS | 0x613 | 4d678f3 |
| 10 | MSR_PPERF | 0x64E | N/A |
| 11 | MSR_CORE_PERF_LIMIT_REASONS | 0x690 | N/A |
| 12 | IA32_PM_ENABLE | 0x770 | N/A |
| 13 | IA32_HWP_CAPABILITIES | 0x771 | N/A |
| 14 | IA32_HWP_REQUEST_PKG | 0x772 | N/A |
| 15 | IA32_HWP_INTERRUPT | 0x773 | N/A |
| 16 | IA32_HWP_REQUEST | 0x774 | N/A |
| 17 | IA32_HWP_PECI_REQUEST_INFO | 0x775 | N/A |
| 18 | IA32_HWP_STATUS | 0x777 | N/A |
+----+------------------------------+---------+----------+
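A few of these raw values can be decoded by hand. The sketch below uses the bit layout as I understand it from the Intel SDM, so treat the positions as assumptions to verify: the target ratio in bits 15:8 of IA32_PERF_CTL, the temperature target in bits 23:16 of IA32_TEMPERATURE_TARGET, and, in IA32_PKG_THERM_STATUS, a digital readout in bits 22:16 plus power limitation status and log flags in bits 10 and 11. The function names are mine, not part of any tool.

```python
def perf_ctl_ratio(value):
    """Target ratio requested through IA32_PERF_CTL (assumed bits 15:8)."""
    return (value >> 8) & 0xFF


def temperature_target_c(value):
    """Temperature target in degrees C (assumed bits 23:16)."""
    return (value >> 16) & 0xFF


def decode_pkg_therm_status(value, tj_max_c):
    """Decode selected fields of IA32_PKG_THERM_STATUS (assumed layout)."""
    readout = (value >> 16) & 0x7F  # degrees below the temperature target
    return {
        "temp_c": tj_max_c - readout,
        "thermal_status": bool(value & (1 << 0)),
        "prochot_event": bool(value & (1 << 2)),
        "power_limit_status": bool(value & (1 << 10)),
        "power_limit_log": bool(value & (1 << 11)),
    }


tj_max = temperature_target_c(0x571000)
status = decode_pkg_therm_status(0x882D0C00, tj_max)
print(perf_ctl_ratio(0xC00), tj_max, status)
```

Under those assumptions, 0xc00 corresponds to a requested ratio of 12 (1200 MHz at a 100 MHz bus clock), the temperature target is 87 C, and the package status word decodes to 42 C with both power limitation bits set and no thermal or PROCHOT flag. Whether the power limitation bits are meaningful on this CPU model is something I have not confirmed.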
And with turbostat (PkgWatt and PKG_% are the sums over both processor packages; the two packages behave very similarly):
turbostat --quiet --Summary --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ,PKG_% --interval 2
Busy% Bzy_MHz IRQ PkgTmp PkgWatt PKG_%
9.86 1494 15787 39 59.64 104.12
9.97 1458 12634 41 59.12 102.13
9.83 1487 14591 39 60.29 102.96
8.46 1586 13092 39 59.51 95.38
10.79 1463 14587 40 59.82 96.83
11.27 1438 12438 39 59.48 100.01
turbostat: cpu0 jitter 5352 94208 <- Applying load
turbostat: cpu24 jitter 5104 94992
turbostat: cpu2 jitter 94840 5192
... lots of similar jitter messages
31.74 593 16989 39 54.99 125.91
... jitter message again...
99.20 288 41031 39 66.21 194.88
99.13 403 36165 40 68.07 189.75
99.81 458 32915 40 70.00 190.04
99.21 503 36909 41 72.91 194.20
99.26 528 36361 40 71.29 187.86
99.02 575 39900 41 74.99 193.81
98.34 605 40204 40 73.76 188.68
69.67 684 35305 40 78.59 134.74
13.41 1678 22536 40 66.61 0.97 <- Load Removed
8.77 1617 12158 39 61.00 22.58
8.53 1611 12454 39 60.03 65.16
10.47 1440 14426 40 59.05 92.75
12.00 1387 9389 40 59.17 101.51
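To put a number on the slowdown, the busy rows of that output can be averaged. This is a rough sketch with hypothetical helper names (parse_rows, mean_bzy_mhz); it assumes Busy% and Bzy_MHz are the first two columns, as in the command above, and it skips the header, the jitter messages and the elided lines.

```python
def parse_rows(text):
    """Extract (Busy%, Bzy_MHz) pairs from turbostat summary output."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        try:
            rows.append((float(fields[0]), float(fields[1])))
        except (IndexError, ValueError):
            continue  # header, jitter message, elision line or blank line
    return rows


def mean_bzy_mhz(rows, min_busy=90.0):
    """Average Bzy_MHz over the rows whose Busy% is at least min_busy."""
    loaded = [mhz for busy, mhz in rows if busy >= min_busy]
    return sum(loaded) / len(loaded) if loaded else None
```

Applied to the output above, the seven rows at 98% busy or more average 480 MHz, which is one fifth of the 2400 MHz base frequency.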
turbostat without the --quiet option:
turbostat version 19.08.31 - Len Brown <lenb@kernel.org>
CPUID(0): GenuineIntel 0xd CPUID levels; 0x80000008 xlevels; family:model:stepping 0x6:3e:4 (6:62:4)
CPUID(1): SSE3 MONITOR SMX EIST TM2 TSC MSR ACPI-TM HT TM
CPUID(6): APERF, TURBO, DTS, PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, No-EPB
cpu36: MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST MWAIT PREFETCH TURBO)
CPUID(7): No-SGX
cpu36: MSR_MISC_PWR_MGMT: 0x00400000 (ENable-EIST_Coordination DISable-EPB DISable-OOB)
RAPL: 570 sec. Joule Counter Range, at 115 Watts
cpu36: MSR_PLATFORM_INFO: 0xc10e4811800
12 * 100.0 = 1200.0 MHz max efficiency frequency
24 * 100.0 = 2400.0 MHz base frequency
cpu36: MSR_IA32_POWER_CTL: 0x25000059 (C1E auto-promotion: DISabled)
cpu36: MSR_TURBO_RATIO_LIMIT1: 0x1c1c1c1c1c1c1c1c
28 * 100.0 = 2800.0 MHz max turbo 16 active cores
28 * 100.0 = 2800.0 MHz max turbo 15 active cores
28 * 100.0 = 2800.0 MHz max turbo 14 active cores
28 * 100.0 = 2800.0 MHz max turbo 13 active cores
28 * 100.0 = 2800.0 MHz max turbo 12 active cores
28 * 100.0 = 2800.0 MHz max turbo 11 active cores
28 * 100.0 = 2800.0 MHz max turbo 10 active cores
28 * 100.0 = 2800.0 MHz max turbo 9 active cores
cpu36: MSR_TURBO_RATIO_LIMIT: 0x1c1c1c1c1d1e1f20
28 * 100.0 = 2800.0 MHz max turbo 8 active cores
28 * 100.0 = 2800.0 MHz max turbo 7 active cores
28 * 100.0 = 2800.0 MHz max turbo 6 active cores
28 * 100.0 = 2800.0 MHz max turbo 5 active cores
29 * 100.0 = 2900.0 MHz max turbo 4 active cores
30 * 100.0 = 3000.0 MHz max turbo 3 active cores
31 * 100.0 = 3100.0 MHz max turbo 2 active cores
32 * 100.0 = 3200.0 MHz max turbo 1 active cores
cpu36: MSR_PKG_CST_CONFIG_CONTROL: 0x00008400 (locked, pkg-cstate-limit=0 (pc0))
cpu36: POLL: CPUIDLE CORE POLL IDLE
cpu36: C1: MWAIT 0x00
cpu36: C1E: MWAIT 0x01
cpu36: C3: MWAIT 0x10
cpu36: C6: MWAIT 0x20
cpu36: cpufreq driver: intel_pstate
cpu36: cpufreq governor: performance
cpufreq intel_pstate no_turbo: 0
cpu36: MSR_MISC_FEATURE_CONTROL: 0x00000000 (L2-Prefetch L2-Prefetch-pair L1-Prefetch L1-IP-Prefetch)
cpu0: MSR_RAPL_POWER_UNIT: 0x000a1003 (0.125000 Watts, 0.000015 Joules, 0.000977 sec.)
cpu0: MSR_PKG_POWER_INFO: 0x2f05a002000398 (115 W TDP, RAPL 64 - 180 W, 0.045898 sec.)
cpu0: MSR_PKG_POWER_LIMIT: 0x68450005a8398 (UNlocked)
cpu0: PKG Limit #1: ENabled (115.000000 Watts, 10.000000 sec, clamp DISabled)
cpu0: PKG Limit #2: ENabled (138.000000 Watts, 0.007812* sec, clamp DISabled)
cpu0: MSR_DRAM_POWER_INFO,: 0x2f00fc006800e2 (28 W TDP, RAPL 13 - 32 W, 0.045898 sec.)
cpu0: MSR_DRAM_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: DRAM Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_PP0_POLICY: 0
cpu0: MSR_PP0_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: Cores Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu1: MSR_RAPL_POWER_UNIT: 0x000a1003 (0.125000 Watts, 0.000015 Joules, 0.000977 sec.)
cpu1: MSR_PKG_POWER_INFO: 0x2f05a002000398 (115 W TDP, RAPL 64 - 180 W, 0.045898 sec.)
cpu1: MSR_PKG_POWER_LIMIT: 0x68450005a8398 (UNlocked)
cpu1: PKG Limit #1: ENabled (115.000000 Watts, 10.000000 sec, clamp DISabled)
cpu1: PKG Limit #2: ENabled (138.000000 Watts, 0.007812* sec, clamp DISabled)
cpu1: MSR_DRAM_POWER_INFO,: 0x2f00fc006800e2 (28 W TDP, RAPL 13 - 32 W, 0.045898 sec.)
cpu1: MSR_DRAM_POWER_LIMIT: 0x00000000 (UNlocked)
cpu1: DRAM Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu1: MSR_PP0_POLICY: 0
cpu1: MSR_PP0_POWER_LIMIT: 0x00000000 (UNlocked)
cpu1: Cores Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x00571000 (87 C)
cpu1: MSR_IA32_TEMPERATURE_TARGET: 0x00571000 (87 C)
cpu0: MSR_IA32_PACKAGE_THERM_STATUS: 0x88300c00 (39 C)
cpu0: MSR_IA32_PACKAGE_THERM_INTERRUPT: 0x00000003 (87 C, 87 C)
cpu1: MSR_IA32_PACKAGE_THERM_STATUS: 0x88380c00 (31 C)
cpu1: MSR_IA32_PACKAGE_THERM_INTERRUPT: 0x00000003 (87 C, 87 C)
cpu36: MSR_PKGC3_IRTL: 0x00000000 (NOTvalid, 0 ns)
cpu36: MSR_PKGC6_IRTL: 0x00000000 (NOTvalid, 0 ns)
cpu36: MSR_PKGC7_IRTL: 0x00000000 (NOTvalid, 0 ns)
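The RAPL limits in this dump can be reproduced from the raw MSR values. The sketch below assumes the usual layout: the power unit is 1/2^n W with n in bits 3:0 of MSR_RAPL_POWER_UNIT, and each limit in MSR_PKG_POWER_LIMIT has its power in bits 14:0, an enable bit at 15 and a clamp bit at 16, with limit #2 in the upper 32 bits and the lock in bit 63. The function names are hypothetical.

```python
def power_unit_watts(rapl_power_unit_msr):
    """Power unit in watts: 1 / 2**n, with n taken from bits 3:0."""
    return 1.0 / (1 << (rapl_power_unit_msr & 0xF))


def decode_pkg_power_limit(msr, unit_watts):
    """Decode both package power limits from MSR_PKG_POWER_LIMIT."""
    def one_limit(bits):
        return {
            "watts": (bits & 0x7FFF) * unit_watts,
            "enabled": bool(bits & (1 << 15)),
            "clamp": bool(bits & (1 << 16)),
        }

    return {
        "limit1": one_limit(msr & 0xFFFFFFFF),
        "limit2": one_limit((msr >> 32) & 0xFFFFFFFF),
        "locked": bool((msr >> 63) & 1),
    }


unit = power_unit_watts(0x000A1003)
print(decode_pkg_power_limit(0x68450005A8398, unit))
```

This gives 115 W and 138 W, both enabled, unclamped and unlocked, matching what turbostat prints. The PkgWatt column stays below 80 W for the two packages combined, so as far as I can tell these configured limits are not what is being hit.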
Answer 1
After a reboot the problem was solved, and it has not reappeared in the two years since, with no hardware replaced.