用户进程中的软锁定和死锁

用户进程中的软锁定和死锁

我有一个问题,在客户机器上,用户空间进程和 2 个内核进程一起占用了处理器(软锁定),并且转储堆栈跟踪显示所有 3 个进程中的 _ticket_spin_lock 处都有 RIP。

据我所知“如果用户空间进程导致了软锁定,则会记录一行通过其 pid 标识该进程的内容,后跟各种 CPU 寄存器的内容,而没有任何种类的调用跟踪”,但就我而言,我也得到了用户进程的转储堆栈跟踪。

它是来自行为不当的用户空间应用程序吗?它是软锁定的正常功能吗?如果是软锁定的功能,那么如何解决这个问题?

任何帮助都将受到高度赞赏。

它是 x86_64 机器,内核是 3.1.10。我知道所有 3 个进程都在等待 _ticket_spin_lock。参见:-

Aug 26 09:31:58 at-vie01a-cq21b kernel: [115452.492033] BUG: soft lockup - CPU#3 stuck for 22s! [virtio_shm/5/3:7874] 
Aug 26 09:32:00 at-vie01a-cq21b kernel: [115455.404215] BUG: soft lockup - CPU#31 stuck for 23s! [kni_thread:6605] 
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172014] BUG: soft lockup - CPU#0 stuck for 22s! [gis:14145]

这里 gis 是我的用户空间进程,但有调用跟踪。

Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172014] BUG: soft lockup - CPU#0 stuck for 22s! [gis:14145]
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172017] Modules linked in: xt_sharedlimit xt_hashlimit ip_set_hash_ipport ip_set_hash_ipportip xt_NOTRACK ip_set_bitmap_port xt_sctp nf_conntrack_ipv6 nf_defrag_ipv6 xt_CT arpt_mangle ip_set_hash_ipnet xt_NFLOG xt_limit xt_hashcounter ip_set_hash_ipip xt_set ip_set_hash_ip deflate ctr twofish_x86_64 twofish_common camellia serpent blowfish cast5 des_generic cbc xcbc rmd160 crypto_null af_key iptable_mangle ip_set arptable_filter arp_tables iptable_raw iptable_nat nfnetlink_log nfnetlink ipt_ULOG ipt_PORTMAP af_packet zlib zlib_deflate sha512_generic sha256_generic sha1_generic md5 icp_qa_al pcie8120 rte_kni pfe_pep virtio_rte virtio_shm virtio_vtnet virtio_uio igb_uio virtio_ring virtio uio xt_tcpudp xt_state xt_pkttype nf_conntrack_control bonding binfmt_misc iptable_filter ip6table_filter ip6_tables nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables x_tables mperf ipmi_devintf ipmi_si ipmi_msghandler edd nf_conntrack_proto_sctp nf_conntrack sctp 8021q garp stp llc gb_sys usb_storage uas iTCO_wdt ioatdma pcspkr iTCO_vendor_support ixgbe igb wmi i2c_i801 mdio dca sg button container ipv6 autofs4 usbhid ehci_hcd megasr(P) usbcore processor thermal_sys
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172098] CPU 0
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172099] Modules linked in: xt_sharedlimit xt_hashlimit ip_set_hash_ipport ip_set_hash_ipportip xt_NOTRACK ip_set_bitmap_port xt_sctp nf_conntrack_ipv6 nf_defrag_ipv6 xt_CT arpt_mangle ip_set_hash_ipnet xt_NFLOG xt_limit xt_hashcounter ip_set_hash_ipip xt_set ip_set_hash_ip deflate ctr twofish_x86_64 twofish_common camellia serpent blowfish cast5 des_generic cbc xcbc rmd160 crypto_null af_key iptable_mangle ip_set arptable_filter arp_tables iptable_raw iptable_nat nfnetlink_log nfnetlink ipt_ULOG ipt_PORTMAP af_packet zlib zlib_deflate sha512_generic sha256_generic sha1_generic md5 icp_qa_al pcie8120 rte_kni pfe_pep virtio_rte virtio_shm virtio_vtnet virtio_uio igb_uio virtio_ring virtio uio xt_tcpudp xt_state xt_pkttype nf_conntrack_control bonding binfmt_misc iptable_filter ip6table_filter ip6_tables nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables x_tables mperf ipmi_devintf ipmi_si ipmi_msghandler edd nf_conntrack_proto_sctp nf_conntrack sctp 8021q garp stp llc gb_sys usb_storage uas iTCO_wdt ioatdma pcspkr iTCO_vendor_support ixgbe igb wmi i2c_i801 mdio dca sg button container ipv6 autofs4 usbhid ehci_hcd megasr(P) usbcore processor thermal_sys
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172163]
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172166] Pid: 14145, comm: gis Tainted: P 3.1.10-gb20-default #1 Intel Corporation S2600CO/S2600CO
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172170] RIP: 0010:[<ffffffff8102064d>] [<ffffffff8102064d>] __ticket_spin_lock+0x15/0x1b
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172178] RSP: 0000:ffff88043ee03cf0 EFLAGS: 00000293
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172180] RAX: 00000000000069bf RBX: 00000000020110ac RCX: 000000000000000e
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172182] RDX: 00000000000069bc RSI: 000000000000000e RDI: ffff88041e56a484
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172184] RBP: ffff88041e56a484 R08: ffff88041e56a740 R09: ffff8804154a5840
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172187] R10: 00007f0afce77000 R11: 0000000000000000 R12: ffff88043ee03c68
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172189] R13: ffffffff813f831e R14: ffff88041e56a484 R15: ffff88041e568280
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172192] FS: 00007f0afd70b700(0000) GS:ffff88043ee00000(0000) knlGS:0000000000000000
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172194] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172196] CR2: 00007f54f6b88098 CR3: 000000042427e000 CR4: 00000000000406f0
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172199] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172201] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172204] Process gis (pid: 14145, threadinfo ffff88037537e000, task ffff88036a8fe180)
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172205] Stack:
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172207] ffffffff8106b766 ffffffffa05e3a1e 0000000101b72e68 ffff8808260ae680
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172213] 0000002e1e568280 ffff880420450000 ffff88041f887a00 ffff880420450000
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172218] ffffffff8192a870 0000000000000608 0000000000000000 ffffffff81928b00
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172224] Call Trace:
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172233] [<ffffffff8106b766>] do_raw_spin_lock+0x5/0x8
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172240] [<ffffffffa05e3a1e>] packet_rcv+0x254/0x2ab [af_packet]
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172257] [<ffffffff81337bbf>] __netif_receive_skb+0x2e1/0x36b
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172262] [<ffffffff81339722>] netif_receive_skb+0x7e/0x84
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172266] [<ffffffff8133979e>] napi_skb_finish+0x1c/0x31
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172277] [<ffffffffa031adee>] igb_clean_rx_irq+0x30d/0x39e [igb]
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172298] [<ffffffffa031aecd>] igb_poll+0x4e/0x74 [igb]
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172313] [<ffffffff81339c88>] net_rx_action+0x65/0x178
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172319] [<ffffffff81045c73>] __do_softirq+0xb2/0x19d
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172324] [<ffffffff813f9aac>] call_softirq+0x1c/0x30
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172329] [<ffffffff81003931>] do_softirq+0x3c/0x7b
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172333] [<ffffffff81045f98>] irq_exit+0x3c/0xac
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172337] [<ffffffff81003655>] do_IRQ+0x82/0x98
Aug 26 09:32:01 at-vie01a-cq21b kernel: [115456.172342] [<ffffffff813f24ee>] common_interrupt+0x6e/0x6e

答案1

igb根据回溯判断,这听起来更像是网络接口驱动程序或网络数据包驱动程序中的某种错误af_packet。您的gis用户空间进程可能是该计算机上与网络通信的主要进程,因此它看起来是连接的,但软锁定实际上是内核空间错误。

作为第一步,我建议检查更高版本内核中驱动程序的更新日志,看看是否值得进行内核升级。一般来说,3.1.x 是不久前发布的,目前还没有被标记为稳定或长期内核http://www.kernel.org/,因此建议您切换到一个(例如 3.2.x 现在仍然标记为“长期”),除非您付钱让别人为您维护 3.1.x 内核。

如果你找不到明确的升级理由,请在 Linux MAINTAINERS 文件中查找联系信息,网址为https://www.kernel.org/doc/linux/MAINTAINERS并发送错误报告,那里的人会给您更好的建议。

答案2

我已经解决了这个死锁。问题出在 dpdk 源代码中。kni_thread 模块用于 dpdk 将用户应用程序与网络内核堆栈连接起来。

内核:[115455.404215] BUG:软锁定 - CPU#31 卡住 23 秒![kni_thread:6605]。

kni_thread 是由 kthread 调用的,它是用户上下文,它正在调用中断上下文函数。

相关内容