我的应用程序是一项计算密集型任务(即视频编码)。当它在 Linux 内核 4.9(Ubuntu 16.04)上运行时,CPU 使用率为 3300%。但是当它在 Linux 内核 5.4(Ubuntu 20.04)上运行时,CPU 使用率仅为 2850%。保证进程执行相同的工作。
所以我想知道 Linux 内核在 4.9 和 5.4 之间是否进行了一些 CPU 调度优化或相关工作?您能给出一些建议来调查原因吗?
供你参考,
- 可以确认性能提升来自于 Linux 内核 5.4,因为 Linux 内核 5.3 上的性能与 Linux 内核 4.9 相同。
- 确认性能增益与 libc 无关,因为在 libc 为 2.23 的 Linux 内核 5.10 上,性能与 libc 为 2.31 的 Linux 内核 5.4 上的性能相同
CPU Info:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
Stepping: 7
CPU MHz: 2200.000
BogoMIPS: 4401.69
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 14080K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
Output of perf stat on Linux Kernel 4.9
Performance counter stats for process id '32504':
3146297.833447 cpu-clock (msec) # 32.906 CPUs utilized
1,718,778 context-switches # 0.546 K/sec
574,717 cpu-migrations # 0.183 K/sec
2,796,706 page-faults # 0.889 K/sec
6,193,409,215,015 cycles # 1.968 GHz (30.76%)
6,948,575,328,419 instructions # 1.12 insn per cycle (38.47%)
540,538,530,660 branches # 171.801 M/sec (38.47%)
33,087,740,169 branch-misses # 6.12% of all branches (38.50%)
1,966,141,393,632 L1-dcache-loads # 624.906 M/sec (38.49%)
184,477,765,497 L1-dcache-load-misses # 9.38% of all L1-dcache hits (38.47%)
8,324,742,443 LLC-loads # 2.646 M/sec (30.78%)
3,835,471,095 LLC-load-misses # 92.15% of all LL-cache hits (30.76%)
<not supported> L1-icache-loads
187,604,831,388 L1-icache-load-misses (30.78%)
1,965,198,121,190 dTLB-loads # 624.607 M/sec (30.81%)
438,496,889 dTLB-load-misses # 0.02% of all dTLB cache hits (30.79%)
7,139,892,384 iTLB-loads # 2.269 M/sec (30.79%)
260,660,265 iTLB-load-misses # 3.65% of all iTLB cache hits (30.77%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
95.615072142 seconds time elapsed
Output of perf stat on Linux Kernel 5.4
Performance counter stats for process id '3355137':
2,718,192.32 msec cpu-clock # 29.184 CPUs utilized
1,719,910 context-switches # 0.633 K/sec
448,685 cpu-migrations # 0.165 K/sec
3,884,586 page-faults # 0.001 M/sec
5,927,930,305,757 cycles # 2.181 GHz (30.77%)
6,848,723,995,972 instructions # 1.16 insn per cycle (38.47%)
536,856,379,853 branches # 197.505 M/sec (38.47%)
32,245,288,271 branch-misses # 6.01% of all branches (38.48%)
1,935,640,517,821 L1-dcache-loads # 712.106 M/sec (38.47%)
177,978,528,204 L1-dcache-load-misses # 9.19% of all L1-dcache hits (38.49%)
8,119,842,688 LLC-loads # 2.987 M/sec (30.77%)
3,625,986,107 LLC-load-misses # 44.66% of all LL-cache hits (30.75%)
<not supported> L1-icache-loads
184,001,558,310 L1-icache-load-misses (30.76%)
1,934,701,161,746 dTLB-loads # 711.760 M/sec (30.74%)
676,618,636 dTLB-load-misses # 0.03% of all dTLB cache hits (30.76%)
6,275,901,454 iTLB-loads # 2.309 M/sec (30.78%)
391,706,425 iTLB-load-misses # 6.24% of all iTLB cache hits (30.78%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
93.139551411 seconds time elapsed