I am running some SD-WAN applications (SDN switching, routing, and so on) on a GUEST node (an Ubuntu Linux 14.04 VM) hosted in an OpenStack environment.
Host node details:
root@host-node:/var/log/nova# uname -r
4.4.0-71-generic
root@host-node:/var/log/nova# dpkg -l | egrep -i 'qemu|kvm|libvirt'
ii ipxe-qemu 1.0.0+git-20150424.a25a16d-1ubuntu1 all PXE boot firmware - ROM images for qemu
ii libvirt-bin 1.3.1-1ubuntu10.15 amd64 programs for the libvirt library
ii libvirt0:amd64 1.3.1-1ubuntu10.15 amd64 library for interfacing with different virtualization systems
ii python-libvirt 1.3.1-1ubuntu1 amd64 libvirt Python bindings
ii qemu 1:2.5+dfsg-5ubuntu10.14 amd64 fast processor emulator
ii qemu-block-extra:amd64 1:2.5+dfsg-5ubuntu10.16 amd64 extra block backend modules for qemu-system and qemu-utils
ii qemu-slof 20151103+dfsg-1ubuntu1 all Slimline Open Firmware – QEMU PowerPC version
ii qemu-system 1:2.5+dfsg-5ubuntu10.16 amd64 QEMU full system emulation binaries
ii qemu-system-arm 1:2.5+dfsg-5ubuntu10.16 amd64 QEMU full system emulation binaries (arm)
ii qemu-system-common 1:2.5+dfsg-5ubuntu10.16 amd64 QEMU full system emulation binaries (common files)
ii qemu-system-mips 1:2.5+dfsg-5ubuntu10.16 amd64 QEMU full system emulation binaries (mips)
ii qemu-system-misc 1:2.5+dfsg-5ubuntu10.16 amd64 QEMU full system emulation binaries (miscelaneous)
ii qemu-system-ppc 1:2.5+dfsg-5ubuntu10.16 amd64 QEMU full system emulation binaries (ppc)
ii qemu-system-sparc 1:2.5+dfsg-5ubuntu10.16 amd64 QEMU full system emulation binaries (sparc)
ii qemu-system-x86 1:2.5+dfsg-5ubuntu10.16 amd64 QEMU full system emulation binaries (x86)
ii qemu-user 1:2.5+dfsg-5ubuntu10.14 amd64 QEMU user mode emulation binaries
ii qemu-utils 1:2.5+dfsg-5ubuntu10.16 amd64 QEMU utilities
CPU
processor : 47
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
stepping : 2
microcode : 0x38
cpu MHz : 2902.046
cache size : 30720 KB
physical id : 1
siblings : 24
core id : 13
cpu cores : 12
apicid : 59
initial apicid : 59
fpu : yes
fpu_exception : yes
cpuid level : 15
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts
bugs :
bogomips : 5194.87
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
root@host-node:/var/log/nova# free -m
total used free shared buff/cache available
Mem: 773931 186967 490462 4070 96502 576184
Swap: 0 0 0
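Suddenly, the GUEST node (where the SDN applications run) became unreachable by any means (no ping, no SSH). Even the console in the OpenStack Horizon Dashboard froze.
The only remedy is a hard reboot of the GUEST node, after which it comes back up and runs without any problems.
The console log from the GUEST node follows.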
[1263400.052002] BUG: soft lockup - CPU#0 stuck for 22s! [XXXX:1722]
[1263436.850392] BUG: soft lockup - CPU#12 stuck for 31s! [python:2059]
[1263436.855480] BUG: soft lockup - CPU#1 stuck for 57s! [sleep:18861]
[1263436.852476] BUG: soft lockup - CPU#8 stuck for 38s! [monit:1864]
[1263436.850131] [sched_delayed] sched: RT throttling activated
[1263436.855565] Modules linked in:
[1263436.850392] Modules linked in:
[1263436.855924]
[1263436.855924] CPU: 1 PID: 18861 Comm: sleep Tainted: G OE 3.16.0-77-generic #99~14.04.1-Ubuntu
[1263436.852476] CPU: 8 PID: 1864 Comm: monit Tainted: G OE 3.16.0-77-generic #99~14.04.1-Ubuntu
[1263436.850392] CPU: 12 PID: 2059 Comm: python Tainted: G OE 3.16.0-77-generic #99~14.04.1-Ubuntu
[1263436.855924] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[1263436.852476] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[1263436.850392] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[1263436.855924] task: ffff88055f497010 ti: ffff88055bc68000 task.ti: ffff88055bc68000
[1263436.852476] task: ffff880812045180 ti: ffff8800bb840000 task.ti: ffff8800bb840000
[1263436.850392] task: ffff8800bb281460 ti: ffff880811b94000 task.ti: ffff880811b94000
[1263436.852476] RIP: 0010:[<ffffffff81776676>]
[1263436.850392] RIP: 0010:[<ffffffff811b9275>]
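Because the console output above is interleaved across CPUs, it can help to reduce it to one record per soft lockup report. The following is a minimal sketch in Python, assuming the console log has been saved as text; the function names (parse_soft_lockups, longest_stall_per_cpu) and the SAMPLE_LOG constant are hypothetical helpers introduced here, not part of any OpenStack or kernel tooling.

```python
import re

# Matches kernel lines such as:
# [1263436.850392] BUG: soft lockup - CPU#12 stuck for 31s! [python:2059]
# The task name may itself contain ':' (e.g. kworker/0:1), so the name is
# matched greedily and the PID is taken after the last colon.
LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+BUG: soft lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! \[(?P<comm>.+):(?P<pid>\d+)\]"
)


def parse_soft_lockups(log_text):
    """Return one dict per soft lockup report found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = LOCKUP_RE.search(line)
        if match:
            events.append({
                "timestamp": float(match.group("ts")),
                "cpu": int(match.group("cpu")),
                "stuck_seconds": int(match.group("secs")),
                "comm": match.group("comm"),
                "pid": int(match.group("pid")),
            })
    return events


def longest_stall_per_cpu(events):
    """Map each CPU number to the longest stall (in seconds) reported for it."""
    worst = {}
    for event in events:
        cpu = event["cpu"]
        worst[cpu] = max(worst.get(cpu, 0), event["stuck_seconds"])
    return worst


# The four report lines copied from the console log above.
SAMPLE_LOG = """\
[1263400.052002] BUG: soft lockup - CPU#0 stuck for 22s! [XXXX:1722]
[1263436.850392] BUG: soft lockup - CPU#12 stuck for 31s! [python:2059]
[1263436.855480] BUG: soft lockup - CPU#1 stuck for 57s! [sleep:18861]
[1263436.852476] BUG: soft lockup - CPU#8 stuck for 38s! [monit:1864]
"""

if __name__ == "__main__":
    for event in parse_soft_lockups(SAMPLE_LOG):
        print(event)
```

Stalls reported on several vCPUs at nearly the same timestamp, in unrelated processes (python, sleep, monit), suggest the guest as a whole was not being scheduled rather than one misbehaving task, though the log alone does not prove that.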
GUEST node details:
uname -a
Linux Guest-node 3.16.0-77-generic #99~14.04.1-Ubuntu SMP Tue Jun 28 19:17:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
uname -r
3.16.0-77-generic
Note: While the GUEST node is frozen, the HOST compute node in the OpenStack environment shows CPU usage at 100% for all 16 cores:
Threads: 19 total, 16 running, 3 sleeping, 0 stopped, 0 zombie
%Cpu(s): 3.8 us, 6.5 sy, 0.0 ni, 89.5 id, 0.0 wa, 0.0 hi, 0.2 si, 0.0 st
KiB Mem : 79250592+total, 55210470+free, 18854126+used, 51860008 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 59347756+avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20666 libvirt+ 20 0 37.586g 0.013t 26076 R 99.9 1.8 20126:42 qemu-system-x86
20654 libvirt+ 20 0 37.586g 0.013t 26076 R 93.8 1.8 20123:30 qemu-system-x86
20655 libvirt+ 20 0 37.586g 0.013t 26076 R 93.8 1.8 20127:04 qemu-system-x86
20656 libvirt+ 20 0 37.586g 0.013t 26076 R 93.8 1.8 20127:04 qemu-system-x86
20657 libvirt+ 20 0 37.586g 0.013t 26076 R 93.8 1.8 20127:00 qemu-system-x86
20658 libvirt+ 20 0 37.586g 0.013t 26076 R 93.8 1.8 20126:57 qemu-system-x86
20659 libvirt+ 20 0 37.586g 0.013t 26076 R 93.8 1.8 20126:54 qemu-system-x86
20660 libvirt+ 20 0 37.586g 0.013t 26076 R 93.8 1.8 20126:58 qemu-system-x86
20661 libvirt+ 20 0 37.586g 0.013t 26076 R 93.8 1.8 20126:54 qemu-system-x86
20662 libvirt+ 20 0 37.586g 0.013t 26076 R 93.8 1.8 20127:37 qemu-system-x86
20663 libvirt+ 20 0 37.586g 0.013t 26076 R 93.8 1.8 20128:59 qemu-system-x86
20664 libvirt+ 20 0 37.586g 0.013t 26076 R 93.8 1.8 20128:48 qemu-system-x86
20665 libvirt+ 20 0 37.586g 0.013t 26076 R 93.8 1.8 20127:01 qemu-system-x86
20667 libvirt+ 20 0 37.586g 0.013t 26076 R 93.8 1.8 20128:41 qemu-system-x86
20668 libvirt+ 20 0 37.586g 0.013t 26076 R 93.8 1.8 20127:04 qemu-system-x86
20669 libvirt+ 20 0 37.586g 0.013t 26076 R 93.8 1.8 20006:54 qemu-system-x86
20597 libvirt+ 20 0 37.586g 0.013t 26076 S 0.0 1.8 4:38.56 qemu-system-x86
20647 libvirt+ 20 0 37.586g 0.013t 26076 S 0.0 1.8 0:00.03 qemu-system-x86
20671 libvirt+ 20 0 37.586g 0.013t 26076 S 0.0 1.8 0:01.83 qemu-system-x86
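The per-thread listing above can be checked mechanically if it is captured again during a freeze. The sketch below assumes the default top column order shown above (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND); the function name busy_running_threads, its parameters, and the 90% cutoff are assumptions chosen for illustration.

```python
def busy_running_threads(top_text, command_prefix="qemu-system", min_cpu=90.0):
    """Return (pid, cpu_percent) for threads in state R at or above min_cpu."""
    busy = []
    for line in top_text.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a 12-column thread row.
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        state, command = fields[7], fields[11]
        try:
            cpu = float(fields[8])
        except ValueError:
            continue  # unexpected %CPU format, ignore the row
        if state == "R" and cpu >= min_cpu and command.startswith(command_prefix):
            busy.append((int(fields[0]), cpu))
    return busy


# Three rows copied from the top output above: two spinning, one sleeping.
SAMPLE_TOP = """\
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20666 libvirt+ 20 0 37.586g 0.013t 26076 R 99.9 1.8 20126:42 qemu-system-x86
20654 libvirt+ 20 0 37.586g 0.013t 26076 R 93.8 1.8 20123:30 qemu-system-x86
20597 libvirt+ 20 0 37.586g 0.013t 26076 S 0.0 1.8 4:38.56 qemu-system-x86
"""

if __name__ == "__main__":
    threads = busy_running_threads(SAMPLE_TOP)
    print(len(threads), "busy qemu threads:", threads)
```

Run against the full listing, this should count the 16 threads in state R, which matches the "16 running" figure in the Threads summary line.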
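Observations:
Kernel version of the HOST compute node:
root@host-node:/var/log/nova# uname -r
4.4.0-71-generic
So it is not yet clear whether the problem above is related to the Linux kernel version.
Workaround 1: The link below describes a problem close to the one reported. Would applying the same fix in a virtualized environment such as an OpenStack cloud infrastructure cause any problems?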
https://customerhelp.co.za/linux/ubuntu/fix-ubuntu-bug-soft-lockup-cpu-stuck-vmware-server.html
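Note: We applied the solution above, but the problem still persists.
Workaround 2: According to further analysis, there is a known soft lockup issue with this Ubuntu kernel version.
root@host-node:/var/log/nova# uname -r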
4.4.0-71-generic
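For context when reading the "stuck for 22s" messages: the kernel reports a soft lockup once a CPU has gone without scheduling its watchdog thread for twice the value of kernel.watchdog_thresh, which defaults to 10 seconds. The sketch below, meant to be run inside the GUEST, reads that setting from /proc; the helper names are hypothetical, and it only reports the value without changing it.

```python
from pathlib import Path

# Standard procfs location of the kernel.watchdog_thresh sysctl.
WATCHDOG_THRESH = Path("/proc/sys/kernel/watchdog_thresh")


def soft_lockup_threshold_seconds(watchdog_thresh):
    """The soft lockup detector fires after 2 * watchdog_thresh seconds."""
    return 2 * watchdog_thresh


def read_watchdog_thresh(path=WATCHDOG_THRESH):
    """Return the configured watchdog_thresh, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    thresh = read_watchdog_thresh()
    if thresh is None:
        print("kernel.watchdog_thresh is not readable on this system")
    else:
        print("watchdog_thresh:", thresh, "seconds")
        print("soft lockup reported after:",
              soft_lockup_threshold_seconds(thresh), "seconds")
```

Raising the threshold would only delay the message; with stalls of up to 57 seconds in the log above, it would not explain or remove the underlying freeze.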
I would appreciate any information anyone can provide about the issue above. Thank you.