我们的一个由 XCP 驱动的生产集群突然失去响应。重启并进行一些调查后,我们在 dom0 机器 syslog 中发现了这样的日志:
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659040] irq 339: nobody cared (try booting with the "irqpoll" option)
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659058] Pid: 0, comm: swapper/3 Tainted: G C O 3.2.0-24-generic #37-Ubuntu
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659060] Call Trace:
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659062] <IRQ> [<ffffffff810db37d>] __report_bad_irq+0x3d/0xe0
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659071] [<ffffffff810db605>] note_interrupt+0x135/0x190
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659074] [<ffffffff810d8e69>] handle_irq_event_percpu+0xa9/0x220
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659078] [<ffffffff8130ff3b>] ? radix_tree_lookup+0xb/0x10
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659081] [<ffffffff810d9031>] handle_irq_event+0x51/0x80
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659084] [<ffffffff810dc187>] handle_edge_irq+0x87/0x140
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659089] [<ffffffff813a8829>] __xen_evtchn_do_upcall+0x199/0x250
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659092] [<ffffffff813aa96f>] xen_evtchn_do_upcall+0x2f/0x50
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659096] [<ffffffff81666d3e>] xen_do_hypervisor_callback+0x1e/0x30
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659097] <EOI> [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659104] [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659107] [<ffffffff8100a1d0>] ? xen_safe_halt+0x10/0x20
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659110] [<fff
cat /proc/interrupts 中的 IRQ 339:
339: ... xen-pirq-msi-x eth0
其中 eth0 是硬件 NIC。
虽然主机似乎挂起了,但客户机仍在继续工作,因此我们在其中一个虚拟主机上进行的微型内部监控记录了类似这样的内容:
[2012-10-26 20:31:51] [OK......] 200 OK : 113159149 ns
[2012-10-26 20:32:40] [DISASTER] 500 Can't connect to [hostname]:80 (No route to host) : 47763284432 ns
...
[2012-10-26 20:34:40] [DISASTER] 500 Can't connect to [hostname]:80 (No route to host) : 46894835070 ns
[2012-10-26 20:34:57] [DISASTER] 500 Can't connect to [hostname]:80 (Bad hostname) : 16821741955 ns
...
[2012-10-26 20:38:18] [DISASTER] 500 Can't connect to [hostname]:80 (Bad hostname) : 20103298289 ns
[2012-10-26 20:38:37] [DISASTER] 500 Can't connect to [hostname]:80 (Bad hostname) : 17895754943 ns
主机和客户操作系统:Ubuntu 12.04 LTS,
05:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
Subsystem: ASUSTeK Computer Inc. Device 8369
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx+
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 17
Region 0: Memory at fe500000 (32-bit, non-prefetchable) [size=128K]
Region 2: I/O ports at e000 [size=32]
Region 3: Memory at fe520000 (32-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: e1000e
Kernel modules: e1000e
有什么提示可以指导如何调试这个吗?