How many interrupts is too many?


On an AWS x1.32xlarge instance (128 cores), we are seeing a large number of interrupts per second.

Here are the CPUs with the most interrupts per second:

Interrupts Top CPUs
CPU0: 140838.0
CPU1: 77867.0
CPU4: 66495.0
CPU6: 59941.0
CPU3: 39096.0
CPU2: 31532.0
CPU7: 30861.0
CPU5: 26042.0
CPU8: 4168.0
CPU12: 3026.0
CPU10: 2793.0

Here are the interrupt sources with the highest per-CPU rates:

Interrupts above 10k/s
HYP [Hypervisor callback interrupts] [CPU0] = 46902.0/sec
49 [xen-percpu-ipi resched0] [CPU0] = 43437.0/sec
RES [Rescheduling interrupts] [CPU0] = 41512.0/sec
HYP [Hypervisor callback interrupts] [CPU2] = 26638.0/sec
HYP [Hypervisor callback interrupts] [CPU8] = 22875.0/sec
HYP [Hypervisor callback interrupts] [CPU12] = 20813.0/sec
55 [xen-percpu-ipi resched1] [CPU2] = 20749.0/sec
RES [Rescheduling interrupts] [CPU2] = 19568.0/sec
73 [xen-percpu-ipi resched4] [CPU8] = 16400.0/sec
RES [Rescheduling interrupts] [CPU8] = 15677.0/sec
HYP [Hypervisor callback interrupts] [CPU6] = 14226.0/sec
85 [xen-percpu-ipi resched6] [CPU12] = 14060.0/sec
RES [Rescheduling interrupts] [CPU12] = 13271.0/sec
HYP [Hypervisor callback interrupts] [CPU14] = 12173.0/sec
HYP [Hypervisor callback interrupts] [CPU4] = 11887.0/sec
HYP [Hypervisor callback interrupts] [CPU10] = 10500.0/sec

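This happens when the application running on the machine is under heavy load. Network traffic is relatively high, and there are many threads.

My question is: is 50K to 150K interrupts per second on a single CPU too many? How should we interpret this number? Is there an upper limit on the number of interrupts per second?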
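
For reference, per-CPU and per-source rates like the ones listed above can be derived by reading /proc/interrupts twice and dividing the counter differences by the elapsed time. The following is a minimal sketch, assuming the usual Linux /proc/interrupts layout (a header row of CPU names, then one row per interrupt source). The function names are illustrative, and this is not necessarily the tool that produced the listing above.

```python
import time


def parse_interrupts(text):
    """Map each IRQ label (e.g. '49', 'RES', 'HYP') to its per-CPU counters."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()[:ncpu]
        # Rows such as ERR/MIS hold one global counter, not one per CPU; skip them.
        if len(fields) < ncpu or not all(f.isdigit() for f in fields):
            continue
        table[label.strip()] = [int(f) for f in fields]
    return table


def interrupt_rates(before, after, seconds):
    """Return (total rate per CPU, rate per (label, cpu)) in interrupts/sec."""
    if not after:
        return [], {}
    ncpu = len(next(iter(after.values())))
    per_cpu = [0.0] * ncpu
    per_source = {}
    for label, new in after.items():
        old = before.get(label, [0] * ncpu)
        for cpu, (a, b) in enumerate(zip(old, new)):
            rate = (b - a) / seconds
            per_cpu[cpu] += rate
            per_source[(label, cpu)] = rate
    return per_cpu, per_source


def sample(interval=1.0, path="/proc/interrupts"):
    """Take two snapshots `interval` seconds apart and return the rates."""
    with open(path) as f:
        first = parse_interrupts(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_interrupts(f.read())
    return interrupt_rates(first, second, interval)
```

On a Linux host, `per_cpu, per_source = sample(1.0)` followed by sorting either result in descending order gives the same kind of "top CPUs" and "top sources" view as shown above.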

Update:

Here is the output of top:

Tasks: 825 total,   3 running, 822 sleeping,   0 stopped,   0 zombie
Cpu(s): 10.6%us,  3.4%sy,  0.0%ni, 83.6%id,  0.0%wa,  0.0%hi,  2.3%si,  0.0%st
Mem:  2014742856k total, 40059184k used, 1974683672k free,   162036k buffers
Swap:        0k total,        0k used,        0k free,  3159112k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                              
 32936 ec2-user  20   0 77.3g  11g  29m S 1759.7  0.6   1780:36 java                                                                                                                                               
 32118 ec2-user  20   0 64.2g  10g  26m S 1036.9  0.6  62:31.08 java                                                                                                                                               
     3 root      20   0     0    0    0 R 70.4  0.0  14:54.84 ksoftirqd/0                                                                                                                                          
    12 root      20   0     0    0    0 S 21.2  0.0   6:06.47 ksoftirqd/1                                                                                                                                          
    16 root      20   0     0    0    0 S 15.2  0.0   4:33.28 ksoftirqd/2                                                                                                                                          
    20 root      20   0     0    0    0 S 12.2  0.0   3:34.12 ksoftirqd/3                                                                                                                                          
    28 root      20   0     0    0    0 S 11.9  0.0   3:24.96 ksoftirqd/5                                                                                                                                          
    24 root      20   0     0    0    0 S 11.6  0.0   3:26.54 ksoftirqd/4                                                                                                                                          
    32 root      20   0     0    0    0 S 10.2  0.0   3:23.56 ksoftirqd/6                                                                                                                                          
    36 root      20   0     0    0    0 S 10.2  0.0   3:28.80 ksoftirqd/7  

Update 2: htop output [screenshot not available]

Answer 1

Most of the interrupts come from the network card queues. They can be spread across other cores to balance the load: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-cpu-irq.html
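
On Linux, the affinity of an IRQ is exposed under /proc/irq/<n>/smp_affinity as a hexadecimal CPU bitmask, written as comma-separated 32-bit groups on machines with more than 32 CPUs. The sketch below only shows how a CPU list such as "0-3,8" maps to that mask format. The helper names are illustrative, actually writing the mask requires root, and whether a given IRQ can be moved at all depends on the driver and platform (per-CPU sources such as the rescheduling IPIs in the question are tied to their CPU).

```python
def parse_cpu_list(spec):
    """Expand a CPU list such as '0-3,8' into a sorted list of CPU numbers."""
    cpus = set()
    for part in spec.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)


def cpus_to_mask(cpus):
    """Build a hex bitmask string, most significant 32-bit group first.

    Every group is zero-padded to 8 hex digits for readability.
    """
    value = 0
    for cpu in cpus:
        value |= 1 << cpu
    groups = []
    while True:
        groups.append(format(value & 0xFFFFFFFF, "08x"))
        value >>= 32
        if not value:
            break
    return ",".join(reversed(groups))
```

For example, `cpus_to_mask(parse_cpu_list("0-3"))` yields a mask with the low four bits set, which is the value one would write to smp_affinity to restrict an IRQ to CPUs 0 through 3.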

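Answer 2

Without knowing what your application does and what load it generates, it is impossible to tell whether your system has "too many interrupts".

You can use top to check the system (sy) CPU value. If it is high, a large share of the CPU load is being spent in kernel context. That, in turn, may be a symptom of an interrupt storm.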
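
The same figures that top reports as sy, hi and si can be computed from the aggregate "cpu" line of /proc/stat. The sketch below assumes the standard field order documented in proc(5) (user, nice, system, idle, iowait, irq, softirq, steal) and uses illustrative function names. It takes the file contents as strings so that the arithmetic is easy to check.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Extract the aggregate 'cpu' counters from /proc/stat content."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def cpu_shares(before, after):
    """Percentage of CPU time per category between two snapshots."""
    deltas = {k: after.get(k, 0) - before.get(k, 0) for k in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {k: 0.0 for k in FIELDS}
    return {k: 100.0 * d / total for k, d in deltas.items()}
```

Taking two snapshots of /proc/stat a second apart and passing them through `cpu_shares` gives system, irq and softirq percentages. In the top output from the question these are 3.4% (sy), 0.0% (hi) and 2.3% (si) averaged over all CPUs.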
