Why does the same task use a different amount of CPU on Linux kernel 4.9 and 5.4?

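My application is a compute-intensive task (video encoding). When it runs on Linux kernel 4.9 (Ubuntu 16.04), CPU usage is 3300%. When it runs on Linux kernel 5.4 (Ubuntu 20.04), CPU usage is only 2850%. The process is guaranteed to do the same work in both cases.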

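So I would like to know whether the Linux kernel received any CPU scheduling optimizations or related changes between 4.9 and 5.4. Can you suggest how to investigate the cause?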


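  1. I can confirm that the gain comes from Linux kernel 5.4, because performance on Linux kernel 5.3 is the same as on Linux kernel 4.9.
  2. I have confirmed that the gain is unrelated to libc, because performance on Linux kernel 5.10 with libc 2.23 is the same as on Linux kernel 5.4 with libc 2.31.
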
CPU Info:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
Stepping:              7
CPU MHz:               2200.000
BogoMIPS:              4401.69
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              14080K
NUMA node0 CPU(s):     0-9,20-29
NUMA node1 CPU(s):     10-19,30-39
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
Output of perf stat on Linux Kernel 4.9

 Performance counter stats for process id '32504':

    3146297.833447      cpu-clock (msec)          #   32.906 CPUs utilized          
         1,718,778      context-switches          #    0.546 K/sec                  
           574,717      cpu-migrations            #    0.183 K/sec                  
         2,796,706      page-faults               #    0.889 K/sec                  
 6,193,409,215,015      cycles                    #    1.968 GHz                      (30.76%)
 6,948,575,328,419      instructions              #    1.12  insn per cycle           (38.47%)
   540,538,530,660      branches                  #  171.801 M/sec                    (38.47%)
    33,087,740,169      branch-misses             #    6.12% of all branches          (38.50%)
 1,966,141,393,632      L1-dcache-loads           #  624.906 M/sec                    (38.49%)
   184,477,765,497      L1-dcache-load-misses     #    9.38% of all L1-dcache hits    (38.47%)
     8,324,742,443      LLC-loads                 #    2.646 M/sec                    (30.78%)
     3,835,471,095      LLC-load-misses           #   92.15% of all LL-cache hits     (30.76%)
   <not supported>      L1-icache-loads                                             
   187,604,831,388      L1-icache-load-misses                                         (30.78%)
 1,965,198,121,190      dTLB-loads                #  624.607 M/sec                    (30.81%)
       438,496,889      dTLB-load-misses          #    0.02% of all dTLB cache hits   (30.79%)
     7,139,892,384      iTLB-loads                #    2.269 M/sec                    (30.79%)
       260,660,265      iTLB-load-misses          #    3.65% of all iTLB cache hits   (30.77%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

      95.615072142 seconds time elapsed
Output of perf stat on Linux Kernel 5.4

 Performance counter stats for process id '3355137':

      2,718,192.32 msec cpu-clock                 #   29.184 CPUs utilized          
         1,719,910      context-switches          #    0.633 K/sec                  
           448,685      cpu-migrations            #    0.165 K/sec                  
         3,884,586      page-faults               #    0.001 M/sec                  
 5,927,930,305,757      cycles                    #    2.181 GHz                      (30.77%)
 6,848,723,995,972      instructions              #    1.16  insn per cycle           (38.47%)
   536,856,379,853      branches                  #  197.505 M/sec                    (38.47%)
    32,245,288,271      branch-misses             #    6.01% of all branches          (38.48%)
 1,935,640,517,821      L1-dcache-loads           #  712.106 M/sec                    (38.47%)
   177,978,528,204      L1-dcache-load-misses     #    9.19% of all L1-dcache hits    (38.49%)
     8,119,842,688      LLC-loads                 #    2.987 M/sec                    (30.77%)
     3,625,986,107      LLC-load-misses           #   44.66% of all LL-cache hits     (30.75%)
   <not supported>      L1-icache-loads                                             
   184,001,558,310      L1-icache-load-misses                                         (30.76%)
 1,934,701,161,746      dTLB-loads                #  711.760 M/sec                    (30.74%)
       676,618,636      dTLB-load-misses          #    0.03% of all dTLB cache hits   (30.76%)
     6,275,901,454      iTLB-loads                #    2.309 M/sec                    (30.78%)
       391,706,425      iTLB-load-misses          #    6.24% of all iTLB cache hits   (30.78%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

      93.139551411 seconds time elapsed
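
To make the two outputs easier to compare, here is a minimal sketch that recomputes a few derived ratios from the counters above. The numbers are copied from the two `perf stat` outputs. The helper names (`derived`, `pct_change`) are just for illustration and are not part of any tool. Note that `cycles` and `instructions` were multiplexed (about 30% to 38% coverage), so the values are scaled estimates.

```python
# Counter values copied verbatim from the two perf stat outputs above.
runs = {
    "4.9": {
        "cpu_clock_ms": 3146297.833447,
        "cycles": 6_193_409_215_015,
        "instructions": 6_948_575_328_419,
        "elapsed_s": 95.615072142,
    },
    "5.4": {
        "cpu_clock_ms": 2718192.32,
        "cycles": 5_927_930_305_757,
        "instructions": 6_848_723_995_972,
        "elapsed_s": 93.139551411,
    },
}


def derived(run):
    """Recompute the ratios perf prints in its right-hand column."""
    cpu_s = run["cpu_clock_ms"] / 1000.0
    return {
        "cpus_utilized": cpu_s / run["elapsed_s"],       # CPU time / wall time
        "avg_ghz": run["cycles"] / cpu_s / 1e9,          # cycles per CPU-second
        "ipc": run["instructions"] / run["cycles"],      # instructions per cycle
    }


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    for kernel, run in runs.items():
        d = derived(run)
        print(f"{kernel}: {d['cpus_utilized']:.3f} CPUs, "
              f"{d['avg_ghz']:.3f} GHz, {d['ipc']:.2f} IPC")
    for key in ("cpu_clock_ms", "cycles", "instructions"):
        change = pct_change(runs["4.9"][key], runs["5.4"][key])
        print(f"{key}: {change:+.2f}% from 4.9 to 5.4")
```

By this calculation, CPU time drops by roughly 13.6% from 4.9 to 5.4, while cycles drop by only about 4.3% and instructions by about 1.4%. Most of the difference in CPU time therefore lines up with the higher average clock rate that perf reports (1.968 GHz on 4.9 versus 2.181 GHz on 5.4) rather than with less work being done. That is only an observation from these counters, not a confirmed cause.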

