Sudden CPU performance drop on Ubuntu 20.04

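I have a PowerEdge R620 server with 2 Intel Xeon E5-2695 v2 CPUs @ 2.40GHz.

This server ran well for years without any problems. Ten months ago it was reinstalled with a fresh Ubuntu 20.04, and since then it has run well, hosting about 15 KVM virtual machines without any problems.

One day, without any particular action on our part, CPU usage suddenly increased on all VMs. At the same time, the server's power consumption dropped sharply.

At first we thought one specific VM was the culprit, but after moving all the VMs to another, similar server, they all appear to run normally.

I ran some tests on the server (sysbench and so on), and CPU performance is really poor: this Xeon E5-2695 v2 is reported as 10 times slower than an E5-2620 v2.

During sysbench, /proc/cpuinfo shows all cores running between 150MHz(!) and 1GHz... According to the CPU specifications, this should not be possible.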
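
For anyone who wants to repeat this check without reading /proc/cpuinfo by eye, here is a minimal Python sketch. It assumes the usual x86 layout where each logical CPU has a "cpu MHz" line. The function names and the 1200 MHz floor are my own choices for illustration (the floor is the lowest frequency this CPU is expected to run at); none of this comes from a tool.

```python
import re
from pathlib import Path


def core_mhz(cpuinfo_text):
    """Return the 'cpu MHz' value of every logical CPU found in the text."""
    matches = re.findall(r"^cpu MHz\s*:\s*([\d.]+)", cpuinfo_text, re.M)
    return [float(m) for m in matches]


def below_floor(mhz_values, floor_mhz=1200.0):
    """Return the readings that are under the expected minimum frequency."""
    return [mhz for mhz in mhz_values if mhz < floor_mhz]


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():
        readings = core_mhz(path.read_text())
        if readings:  # some architectures do not expose 'cpu MHz' at all
            slow = below_floor(readings)
            print(f"{len(slow)} of {len(readings)} CPUs below 1200 MHz, "
                  f"min {min(readings):.0f} MHz, max {max(readings):.0f} MHz")
```

Run it while the benchmark is active. On a healthy machine under full load, I would not expect any reading to fall below the floor.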

cpufreq-info (example last core 47)
analyzing CPU 47 :
driver : intel_pstate
CPUs which run at the same hardware frequency: 47
CPUs which need to have their frequency coordinated by software: 47
maximum transition latency: 4294.55 ms.
hardware limits : 1.20 GHz - 3.20 GHz
available cpufreq governors : performance, powersave
current policy :  frequency should be within 1.20 GHz and 3.20 GHz.
              The governor "performance" may decide which speed to use
              within this range.
current CPU frequency: 482 MHz.

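It shows the CPU is using the performance governor, but the current frequency is only 482MHz, when it should be between 1.2GHz and 3.2GHz.

I'm fairly sure that rebooting the server would fix the problem, but I would like to understand what is happening.

There is nothing in the logs. CPU temperature is normal (and does not change during a stress test!).

Running intel_reg_pp under load gives the following: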

+----+------------------------------+---------+----------+
|  # | MSR Register                 | Address |   Core 0 |
+----+------------------------------+---------+----------+
|  0 | IA32_PERF_CTL                |   0x199 |      c00 |
|  1 | IA32_CLOCK_MODULATION        |   0x19A |        0 |
|  2 | IA32_THERM_INTERRUPT         |   0x19B |        3 |
|  3 | IA32_HWP_THERM_STATUS        |   0x19C | 88310c00 |
|  4 | IA32_MISC_ENABLE             |   0x1A0 |   850089 |
|  5 | IA32_PACKAGE_THERM_MARGIN    |   0x1A1 |     1db0 |
|  6 | IA32_TEMPERATURE_TARGET      |   0x1A2 |   571000 |
|  7 | IA32_PKG_THERM_STATUS        |   0x1B1 | 882d0c00 |
|  8 | MSR_PKG_ENERGY_STATUS        |   0x611 | 5f8a8ad6 |
|  9 | MSR_PKG_STATUS               |   0x613 |  4d678f3 |
| 10 | MSR_PPERF                    |   0x64E |      N/A |
| 11 | MSR_CORE_PERF_LIMIT_REASONS  |   0x690 |      N/A |
| 12 | IA32_PM_ENABLE               |   0x770 |      N/A |
| 13 | IA32_HWP_CAPABILITIES        |   0x771 |      N/A |
| 14 | IA32_HWP_REQUEST_PKG         |   0x772 |      N/A |
| 15 | IA32_HWP_INTERRUPT           |   0x773 |      N/A |
| 16 | IA32_HWP_REQUEST             |   0x774 |      N/A |
| 17 | IA32_HWP_PECI_REQUEST_INFO   |   0x775 |      N/A |
| 18 | IA32_HWP_STATUS              |   0x777 |      N/A |
+----+------------------------------+---------+----------+
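
The two thermal status values in this table (0x19C and 0x1B1) can be decoded by hand. The sketch below is only my reading of the bit layout in the Intel SDM, so treat the field positions as assumptions: bits 0 and 1 for thermal throttling, bits 2 and 3 for PROCHOT, bits 10 and 11 for power limitation, and bits 22:16 for the digital readout in degrees below the target. The function and key names are made up for illustration.

```python
def bit(value, n):
    """Return True if bit n of value is set."""
    return bool((value >> n) & 1)


def temperature_target_c(value):
    """Target temperature in C from the temperature target MSR (bits 23:16, assumed)."""
    return (value >> 16) & 0xFF


def decode_therm_status(value, target_c):
    """Decode a (package) thermal status MSR value; bit positions are assumptions."""
    readout = (value >> 16) & 0x7F  # degrees C below the target temperature
    return {
        "thermal_throttle_now": bit(value, 0),
        "thermal_throttle_log": bit(value, 1),
        "prochot_now": bit(value, 2),
        "prochot_log": bit(value, 3),
        "power_limit_now": bit(value, 10),
        "power_limit_log": bit(value, 11),
        "temperature_c": target_c - readout,
    }


if __name__ == "__main__":
    target = temperature_target_c(0x571000)  # value of 0x1A2 in the table
    for name, raw in (("0x19C", 0x88310C00), ("0x1B1", 0x882D0C00)):
        print(name, decode_therm_status(raw, target))
```

If this layout is right, neither value has the thermal or PROCHOT bits set, while the power limitation bits are set in both. That would point away from overheating, but it is an interpretation and not something the tool output states.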

And with turbostat (PkgWatt and PKG_% are the sums for the two processor packages; both packages behave very similarly):

turbostat --quiet --Summary --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ,PKG_% --interval 2
Busy%   Bzy_MHz IRQ PkgTmp  PkgWatt PKG_%
9.86    1494    15787   39  59.64   104.12
9.97    1458    12634   41  59.12   102.13
9.83    1487    14591   39  60.29   102.96
8.46    1586    13092   39  59.51   95.38
10.79   1463    14587   40  59.82   96.83
11.27   1438    12438   39  59.48   100.01 
turbostat: cpu0 jitter 5352 94208         <- Applying load
turbostat: cpu24 jitter 5104 94992
turbostat: cpu2 jitter 94840 5192
... lots of similar jitter messages
31.74   593     16989   39  54.99   125.91
... jitter message again...
99.20   288     41031   39  66.21   194.88
99.13   403     36165   40  68.07   189.75
99.81   458     32915   40  70.00   190.04
99.21   503     36909   41  72.91   194.20
99.26   528     36361   40  71.29   187.86
99.02   575     39900   41  74.99   193.81
98.34   605     40204   40  73.76   188.68
69.67   684     35305   40  78.59   134.74
13.41   1678    22536   40  66.61   0.97   <- Load Removed
8.77    1617    12158   39  61.00   22.58
8.53    1611    12454   39  60.03   65.16
10.47   1440    14426   40  59.05   92.75
12.00   1387    9389    40  59.17   101.51
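
To pick the abnormal intervals out of a longer capture, something like the following Python sketch could be used. It assumes the column order of the command above (Busy% first, Bzy_MHz second). The 90% and 1200 MHz thresholds are arbitrary values I chose, the latter being the lowest frequency the CPU is expected to use. SAMPLE holds a few rows copied from the output above.

```python
SAMPLE = """\
Busy%   Bzy_MHz IRQ PkgTmp  PkgWatt PKG_%
9.86    1494    15787   39  59.64   104.12
turbostat: cpu0 jitter 5352 94208
99.20   288     41031   39  66.21   194.88
69.67   684     35305   40  78.59   134.74
13.41   1678    22536   40  66.61   0.97   <- Load Removed
"""


def parse_rows(text):
    """Return (busy_pct, bzy_mhz) for every data row, skipping other lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        try:
            busy, mhz = float(fields[0]), float(fields[1])
        except (IndexError, ValueError):
            continue  # header, jitter message, or blank line
        rows.append((busy, mhz))
    return rows


def throttled(rows, busy_min=90.0, floor_mhz=1200.0):
    """Return rows where the CPUs are busy yet clocked below the floor."""
    return [(busy, mhz) for busy, mhz in rows if busy >= busy_min and mhz < floor_mhz]


if __name__ == "__main__":
    for busy, mhz in throttled(parse_rows(SAMPLE)):
        print(f"busy {busy:.2f}% at only {mhz:.0f} MHz")
```

Applied to the full output above, every interval measured under load would be flagged, since Bzy_MHz stays between 288 and 605 while Busy% is near 99.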

turbostat without the --quiet option:

turbostat version 19.08.31 - Len Brown <[email protected]>
CPUID(0): GenuineIntel 0xd CPUID levels; 0x80000008 xlevels; family:model:stepping 0x6:3e:4 (6:62:4)
CPUID(1): SSE3 MONITOR SMX EIST TM2 TSC MSR ACPI-TM HT TM
CPUID(6): APERF, TURBO, DTS, PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, No-EPB
cpu36: MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST MWAIT PREFETCH TURBO)
CPUID(7): No-SGX
cpu36: MSR_MISC_PWR_MGMT: 0x00400000 (ENable-EIST_Coordination DISable-EPB DISable-OOB)
RAPL: 570 sec. Joule Counter Range, at 115 Watts
cpu36: MSR_PLATFORM_INFO: 0xc10e4811800
12 * 100.0 = 1200.0 MHz max efficiency frequency
24 * 100.0 = 2400.0 MHz base frequency
cpu36: MSR_IA32_POWER_CTL: 0x25000059 (C1E auto-promotion: DISabled)
cpu36: MSR_TURBO_RATIO_LIMIT1: 0x1c1c1c1c1c1c1c1c
28 * 100.0 = 2800.0 MHz max turbo 16 active cores
28 * 100.0 = 2800.0 MHz max turbo 15 active cores
28 * 100.0 = 2800.0 MHz max turbo 14 active cores
28 * 100.0 = 2800.0 MHz max turbo 13 active cores
28 * 100.0 = 2800.0 MHz max turbo 12 active cores
28 * 100.0 = 2800.0 MHz max turbo 11 active cores
28 * 100.0 = 2800.0 MHz max turbo 10 active cores
28 * 100.0 = 2800.0 MHz max turbo 9 active cores
cpu36: MSR_TURBO_RATIO_LIMIT: 0x1c1c1c1c1d1e1f20
28 * 100.0 = 2800.0 MHz max turbo 8 active cores
28 * 100.0 = 2800.0 MHz max turbo 7 active cores
28 * 100.0 = 2800.0 MHz max turbo 6 active cores
28 * 100.0 = 2800.0 MHz max turbo 5 active cores
29 * 100.0 = 2900.0 MHz max turbo 4 active cores
30 * 100.0 = 3000.0 MHz max turbo 3 active cores
31 * 100.0 = 3100.0 MHz max turbo 2 active cores
32 * 100.0 = 3200.0 MHz max turbo 1 active cores
cpu36: MSR_PKG_CST_CONFIG_CONTROL: 0x00008400 (locked, pkg-cstate-limit=0 (pc0))
cpu36: POLL: CPUIDLE CORE POLL IDLE
cpu36: C1: MWAIT 0x00
cpu36: C1E: MWAIT 0x01
cpu36: C3: MWAIT 0x10
cpu36: C6: MWAIT 0x20
cpu36: cpufreq driver: intel_pstate
cpu36: cpufreq governor: performance
cpufreq intel_pstate no_turbo: 0
cpu36: MSR_MISC_FEATURE_CONTROL: 0x00000000 (L2-Prefetch L2-Prefetch-pair L1-Prefetch L1-IP-Prefetch)
cpu0: MSR_RAPL_POWER_UNIT: 0x000a1003 (0.125000 Watts, 0.000015 Joules, 0.000977 sec.)
cpu0: MSR_PKG_POWER_INFO: 0x2f05a002000398 (115 W TDP, RAPL 64 - 180 W, 0.045898 sec.)
cpu0: MSR_PKG_POWER_LIMIT: 0x68450005a8398 (UNlocked)
cpu0: PKG Limit #1: ENabled (115.000000 Watts, 10.000000 sec, clamp DISabled)
cpu0: PKG Limit #2: ENabled (138.000000 Watts, 0.007812* sec, clamp DISabled)
cpu0: MSR_DRAM_POWER_INFO,: 0x2f00fc006800e2 (28 W TDP, RAPL 13 - 32 W, 0.045898 sec.)
cpu0: MSR_DRAM_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: DRAM Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_PP0_POLICY: 0
cpu0: MSR_PP0_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: Cores Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu1: MSR_RAPL_POWER_UNIT: 0x000a1003 (0.125000 Watts, 0.000015 Joules, 0.000977 sec.)
cpu1: MSR_PKG_POWER_INFO: 0x2f05a002000398 (115 W TDP, RAPL 64 - 180 W, 0.045898 sec.)
cpu1: MSR_PKG_POWER_LIMIT: 0x68450005a8398 (UNlocked)
cpu1: PKG Limit #1: ENabled (115.000000 Watts, 10.000000 sec, clamp DISabled)
cpu1: PKG Limit #2: ENabled (138.000000 Watts, 0.007812* sec, clamp DISabled)
cpu1: MSR_DRAM_POWER_INFO,: 0x2f00fc006800e2 (28 W TDP, RAPL 13 - 32 W, 0.045898 sec.)
cpu1: MSR_DRAM_POWER_LIMIT: 0x00000000 (UNlocked)
cpu1: DRAM Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu1: MSR_PP0_POLICY: 0
cpu1: MSR_PP0_POWER_LIMIT: 0x00000000 (UNlocked)
cpu1: Cores Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x00571000 (87 C)
cpu1: MSR_IA32_TEMPERATURE_TARGET: 0x00571000 (87 C)
cpu0: MSR_IA32_PACKAGE_THERM_STATUS: 0x88300c00 (39 C)
cpu0: MSR_IA32_PACKAGE_THERM_INTERRUPT: 0x00000003 (87 C, 87 C)
cpu1: MSR_IA32_PACKAGE_THERM_STATUS: 0x88380c00 (31 C)
cpu1: MSR_IA32_PACKAGE_THERM_INTERRUPT: 0x00000003 (87 C, 87 C)
cpu36: MSR_PKGC3_IRTL: 0x00000000 (NOTvalid, 0 ns)
cpu36: MSR_PKGC6_IRTL: 0x00000000 (NOTvalid, 0 ns)
cpu36: MSR_PKGC7_IRTL: 0x00000000 (NOTvalid, 0 ns)
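
As a cross-check on the power limit lines in this dump, the raw MSR_PKG_POWER_LIMIT value can be decoded with the units from MSR_RAPL_POWER_UNIT. The Python sketch below follows my understanding of the RAPL field layout in the Intel SDM, so the field positions are assumptions: a 15-bit power value, an enable bit, a clamp bit, and a 7-bit time window per limit. The function names are hypothetical.

```python
def decode_rapl_units(unit_msr):
    """Return (power unit in W, time unit in s) from MSR_RAPL_POWER_UNIT."""
    power_unit_w = 1.0 / (1 << (unit_msr & 0xF))
    time_unit_s = 1.0 / (1 << ((unit_msr >> 16) & 0xF))
    return power_unit_w, time_unit_s


def decode_limit(field, power_unit_w, time_unit_s):
    """Decode one 32-bit limit field (layout assumed, see text above)."""
    y = (field >> 17) & 0x1F  # exponent of the time window
    z = (field >> 22) & 0x3   # fractional multiplier of the time window
    return {
        "watts": (field & 0x7FFF) * power_unit_w,
        "enabled": bool((field >> 15) & 1),
        "clamp": bool((field >> 16) & 1),
        "window_s": (1 << y) * (1 + z / 4) * time_unit_s,
    }


def decode_pkg_power_limit(msr, unit_msr):
    """Return (limit 1, limit 2) decoded from a 64-bit MSR_PKG_POWER_LIMIT value."""
    power_unit_w, time_unit_s = decode_rapl_units(unit_msr)
    low = msr & 0xFFFFFFFF
    high = (msr >> 32) & 0xFFFFFFFF
    return (decode_limit(low, power_unit_w, time_unit_s),
            decode_limit(high, power_unit_w, time_unit_s))


if __name__ == "__main__":
    pl1, pl2 = decode_pkg_power_limit(0x68450005A8398, 0x000A1003)
    print("PKG Limit #1:", pl1)
    print("PKG Limit #2:", pl2)
```

With the values from the dump, this gives 115 W over 10 s and 138 W over about 0.0078 s, both enabled with clamping off. That agrees with what turbostat printed, so the configured limits look like stock settings. It does not explain why the package draws only 55 to 75 W under load.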

Answer 1

The problem went away after a reboot, and in the two years since it has not come back, with no hardware replaced.