On an AWS x1.32xlarge instance (128 cores), we receive a large number of interrupts per second.
These are the CPUs with the most interrupts per second:
Interrupts Top CPUs
CPU0: 140838.0
CPU1: 77867.0
CPU4: 66495.0
CPU6: 59941.0
CPU3: 39096.0
CPU2: 31532.0
CPU7: 30861.0
CPU5: 26042.0
CPU8: 4168.0
CPU12: 3026.0
CPU10: 2793.0
These are the individual interrupt sources with the highest per-CPU rates:
Interrupts above 10k/s
HYP [Hypervisor callback interrupts] [CPU0] = 46902.0/sec
49 [xen-percpu-ipi resched0] [CPU0] = 43437.0/sec
RES [Rescheduling interrupts] [CPU0] = 41512.0/sec
HYP [Hypervisor callback interrupts] [CPU2] = 26638.0/sec
HYP [Hypervisor callback interrupts] [CPU8] = 22875.0/sec
HYP [Hypervisor callback interrupts] [CPU12] = 20813.0/sec
55 [xen-percpu-ipi resched1] [CPU2] = 20749.0/sec
RES [Rescheduling interrupts] [CPU2] = 19568.0/sec
73 [xen-percpu-ipi resched4] [CPU8] = 16400.0/sec
RES [Rescheduling interrupts] [CPU8] = 15677.0/sec
HYP [Hypervisor callback interrupts] [CPU6] = 14226.0/sec
85 [xen-percpu-ipi resched6] [CPU12] = 14060.0/sec
RES [Rescheduling interrupts] [CPU12] = 13271.0/sec
HYP [Hypervisor callback interrupts] [CPU14] = 12173.0/sec
HYP [Hypervisor callback interrupts] [CPU4] = 11887.0/sec
HYP [Hypervisor callback interrupts] [CPU10] = 10500.0/sec
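This happens when the application running on the machine is under heavy load. Network traffic is relatively high and there are many threads.
My question is: is 50K/150K interrupts per second too many? How should we interpret this number? Is there an upper limit on the number of interrupts per second?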
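For anyone wanting to reproduce per-CPU rates like the ones listed above, here is a minimal Python sketch. It assumes the counters come from two snapshots of /proc/interrupts taken a known interval apart; the function names (parse_interrupts, rates, sample_rates) are made up for illustration and are not the tool that produced the listing above.

```python
import time


def parse_interrupts(text):
    """Sum the per-CPU counters of /proc/interrupts-style text.

    Returns a dict such as {"CPU0": 110, "CPU1": 220}.
    """
    lines = text.strip().splitlines()
    cpus = lines[0].split()              # header row: CPU0 CPU1 ...
    totals = dict.fromkeys(cpus, 0)
    for line in lines[1:]:
        fields = line.split()
        counts = fields[1:1 + len(cpus)]  # fields[0] is the label, e.g. "RES:"
        # Skip rows without one numeric counter per CPU (e.g. "ERR:", "MIS:").
        if len(counts) != len(cpus) or not all(c.isdigit() for c in counts):
            continue
        for cpu, value in zip(cpus, counts):
            totals[cpu] += int(value)
    return totals


def rates(before, after, seconds):
    """Interrupts per second for each CPU between two parsed snapshots."""
    return {cpu: (after[cpu] - before.get(cpu, 0)) / seconds for cpu in after}


def sample_rates(interval=1.0, path="/proc/interrupts"):
    """Take two live snapshots on a Linux host and return per-CPU rates."""
    with open(path) as f:
        first = parse_interrupts(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_interrupts(f.read())
    return rates(first, second, interval)
```

Calling sample_rates() on the instance and sorting the result by value should give a ranking comparable to the "Interrupts Top CPUs" list.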
Update:
Here is the output of top:
Tasks: 825 total, 3 running, 822 sleeping, 0 stopped, 0 zombie
Cpu(s): 10.6%us, 3.4%sy, 0.0%ni, 83.6%id, 0.0%wa, 0.0%hi, 2.3%si, 0.0%st
Mem: 2014742856k total, 40059184k used, 1974683672k free, 162036k buffers
Swap: 0k total, 0k used, 0k free, 3159112k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
32936 ec2-user 20 0 77.3g 11g 29m S 1759.7 0.6 1780:36 java
32118 ec2-user 20 0 64.2g 10g 26m S 1036.9 0.6 62:31.08 java
3 root 20 0 0 0 0 R 70.4 0.0 14:54.84 ksoftirqd/0
12 root 20 0 0 0 0 S 21.2 0.0 6:06.47 ksoftirqd/1
16 root 20 0 0 0 0 S 15.2 0.0 4:33.28 ksoftirqd/2
20 root 20 0 0 0 0 S 12.2 0.0 3:34.12 ksoftirqd/3
28 root 20 0 0 0 0 S 11.9 0.0 3:24.96 ksoftirqd/5
24 root 20 0 0 0 0 S 11.6 0.0 3:26.54 ksoftirqd/4
32 root 20 0 0 0 0 S 10.2 0.0 3:23.56 ksoftirqd/6
36 root 20 0 0 0 0 S 10.2 0.0 3:28.80 ksoftirqd/7
Answer 1
Most of the interrupts come from the network card queues. You can spread that load across other cores as described here: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-cpu-irq.html
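As a rough illustration of what spreading the load involves: IRQ affinity is set by writing a hexadecimal CPU bitmask to /proc/irq/<IRQ>/smp_affinity. The Python sketch below only computes such a mask. The helper name and the CPU range are made up for illustration, and the comma-separated 32-bit grouping is how the kernel prints masks on machines with more than 32 CPUs, which applies to a 128-core instance.

```python
def cpus_to_affinity_mask(cpus):
    """Build an smp_affinity-style hex mask from an iterable of CPU numbers.

    The mask is emitted as comma-separated 32-bit groups, most significant
    group first, e.g. CPU 32 -> "00000001,00000000".
    """
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu                 # one bit per CPU

    groups = []
    while True:
        groups.append(f"{mask & 0xFFFFFFFF:08x}")
        mask >>= 32
        if mask == 0:
            break
    return ",".join(reversed(groups))


# Hypothetical example: steer one NIC queue to CPUs 8-11.
example_mask = cpus_to_affinity_mask(range(8, 12))
```

The resulting string would be written, as root, to the smp_affinity file of the IRQ in question. The IRQ number has to be looked up in /proc/interrupts first, and whether a given interrupt can be moved at all depends on its type.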
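Answer 2
Without knowing what your application does and what load it generates, it is impossible to say whether your system has "too many interrupts".
You can use top to check the system value (%sy in the Cpu(s) line). If it is high, a large share of your CPU time is being spent in kernel context. That, in turn, can be a symptom of an interrupt storm.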
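The same figures that top shows can be read from the aggregate "cpu" line of /proc/stat. The Python sketch below is one way to compute the share of time spent in system, hardirq and softirq context between two samples. The function names are made up for illustration, and the field order follows the proc(5) man page.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Return the aggregate CPU counters from /proc/stat-style text."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":  # aggregate line, not cpu0, cpu1, ...
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def kernel_share(before, after):
    """Percentage of elapsed CPU time spent in system, irq and softirq."""
    delta = {name: after[name] - before[name] for name in FIELDS}
    total = sum(delta.values())
    keys = ("system", "irq", "softirq")
    if total == 0:
        return {name: 0.0 for name in keys}
    return {name: 100.0 * delta[name] / total for name in keys}
```

To use it, read /proc/stat twice a few seconds apart, pass both texts through parse_cpu_line, and hand the results to kernel_share. In the top output posted in the question, the corresponding values are 3.4%sy, 0.0%hi and 2.3%si.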