kvm - CPU cores are being disabled and enabled in a loop

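On our virtualization server running KVM, the CPU cores get disabled and re-enabled in a loop after 10 minutes (each disable makes all virtual machines hang for 15 seconds).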

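It first happened a week ago during a thunderstorm, when all virtual servers went down because of data disk errors (the system disk was fine). So we replaced the data disk. Next, we tried upgrading the host system from Ubuntu natty (kernel 2.6) to Ubuntu precise (kernel 3.2), with no change.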

I found only one forum thread about it, and it has no solution: http://ubuntuforums.org/showthread.php?p=12071553

I tried turning on KVM debugging via

/sys/kernel/debug/tracing/trace_pipe

and located the exact spot in syslog by kernel time, but I don't understand the trace and didn't find any significant difference.

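I think the motherboard may be sending bad signals. It may have been damaged by whatever caused the disk errors, but I don't know how to check that.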

Here is the part of syslog covering one disable/enable cycle:

 Jul 14 15:36:44 node-01 kernel: [56713.568733] kvm: disabling virtualization on CPU1
 Jul 14 15:36:44 node-01 kernel: [56713.668842] CPU 1 is now offline
 Jul 14 15:36:44 node-01 kernel: [56713.670835] CPU 3 MCA banks CMCI:2 CMCI:3 CMCI:5
 Jul 14 15:36:44 node-01 kernel: [56713.673771] kvm: disabling virtualization on CPU2
 Jul 14 15:36:44 node-01 kernel: [56713.674492] CPU 2 is now offline
 Jul 14 15:36:44 node-01 kernel: [56713.680172] kvm: disabling virtualization on CPU3
 Jul 14 15:36:44 node-01 kernel: [56713.681114] CPU 3 is now offline
 Jul 14 15:36:44 node-01 kernel: [56713.681119] SMP alternatives: switching to UP code
 Jul 14 15:36:44 node-01 kernel: [56713.701971] init: anacron main process (3613) killed by TERM signal
 Jul 14 15:36:44 node-01 kernel: [56713.709803] r8169 0000:01:00.0: eth0: link down
 Jul 14 15:36:44 node-01 kernel: [56713.710421] br0: port 1(eth0) entering forwarding state
 Jul 14 15:36:47 node-01 kernel: [56716.675313] r8169 0000:01:00.0: eth0: link up
 Jul 14 15:36:47 node-01 kernel: [56716.676438] br0: port 1(eth0) entering forwarding state
 Jul 14 15:36:47 node-01 kernel: [56716.676454] br0: port 1(eth0) entering forwarding state
 Jul 14 15:36:56 node-01 kernel: [56725.666787] br0: port 1(eth0) entering forwarding state
 Jul 14 15:37:02 node-01 kernel: [56730.815937] SMP alternatives: switching to SMP code
 Jul 14 15:37:02 node-01 kernel: [56730.825021] Booting Node 0 Processor 1 APIC 0x4
 Jul 14 15:37:02 node-01 kernel: [56730.825025] smpboot cpu 1: start_ip = 9a000
 Jul 14 15:37:02 node-01 kernel: [56730.836033] Calibrating delay loop (skipped) already calibrated this CPU
 Jul 14 15:37:02 node-01 kernel: [56730.837012] kvm: enabling virtualization on CPU1
 Jul 14 15:37:02 node-01 kernel: [56730.858555] NMI watchdog enabled, takes one hw-pmu counter.
 Jul 14 15:37:02 node-01 kernel: [56730.862547] Booting Node 0 Processor 2 APIC 0x1
 Jul 14 15:37:02 node-01 kernel: [56730.862551] smpboot cpu 2: start_ip = 9a000
 Jul 14 15:37:02 node-01 kernel: [56730.873460] Calibrating delay loop (skipped) already calibrated this CPU
 Jul 14 15:37:02 node-01 kernel: [56730.874453] kvm: enabling virtualization on CPU2
 Jul 14 15:37:02 node-01 kernel: [56730.896371] NMI watchdog enabled, takes one hw-pmu counter.
 Jul 14 15:37:02 node-01 kernel: [56730.898581] Booting Node 0 Processor 3 APIC 0x5
 Jul 14 15:37:02 node-01 kernel: [56730.898586] smpboot cpu 3: start_ip = 9a000
 Jul 14 15:37:02 node-01 kernel: [56730.909496] Calibrating delay loop (skipped) already calibrated this CPU
 Jul 14 15:37:02 node-01 kernel: [56730.910227] kvm: enabling virtualization on CPU3
 Jul 14 15:37:02 node-01 kernel: [56730.930644] NMI watchdog enabled, takes one hw-pmu counter.
 Jul 14 15:37:02 node-01 kernel: [56730.963737] r8169 0000:01:00.0: eth0: link down
 Jul 14 15:37:02 node-01 kernel: [56730.964069] br0: port 1(eth0) entering forwarding state
 Jul 14 15:37:04 node-01 kernel: [56733.432535] r8169 0000:01:00.0: eth0: link up
 Jul 14 15:37:04 node-01 kernel: [56733.433808] br0: port 1(eth0) entering forwarding state
 Jul 14 15:37:04 node-01 kernel: [56733.433823] br0: port 1(eth0) entering forwarding state
 Jul 14 15:37:13 node-01 kernel: [56742.424751] br0: port 1(eth0) entering forwarding state
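
To see how long each core stays offline per cycle, the kernel timestamps in an excerpt like the one above can be paired up. This is only a rough sketch: the function name `offline_windows` is made up for illustration, and it assumes the log uses exactly the "CPU N is now offline" and "kvm: enabling virtualization on CPUN" messages shown here.

```python
import re

# "[56713.668842] CPU 1 is now offline"
OFFLINE_RE = re.compile(r"\[\s*(\d+\.\d+)\] CPU (\d+) is now offline")
# "[56730.837012] kvm: enabling virtualization on CPU1"
ONLINE_RE = re.compile(r"\[\s*(\d+\.\d+)\] kvm: enabling virtualization on CPU(\d+)")


def offline_windows(lines):
    """Return (cpu, offline_at, online_at, seconds_offline) tuples.

    Hypothetical helper: pairs each "is now offline" message with the next
    "enabling virtualization" message for the same CPU number.
    """
    pending = {}   # cpu number -> kernel time it went offline
    windows = []
    for line in lines:
        m = OFFLINE_RE.search(line)
        if m:
            pending[int(m.group(2))] = float(m.group(1))
            continue
        m = ONLINE_RE.search(line)
        if m:
            cpu = int(m.group(2))
            if cpu in pending:
                start = pending.pop(cpu)
                end = float(m.group(1))
                windows.append((cpu, start, end, round(end - start, 3)))
    return windows


# A few lines copied from the excerpt above.
SAMPLE = [
    "Jul 14 15:36:44 node-01 kernel: [56713.668842] CPU 1 is now offline",
    "Jul 14 15:36:44 node-01 kernel: [56713.674492] CPU 2 is now offline",
    "Jul 14 15:37:02 node-01 kernel: [56730.837012] kvm: enabling virtualization on CPU1",
    "Jul 14 15:37:02 node-01 kernel: [56730.874453] kvm: enabling virtualization on CPU2",
]

for cpu, start, end, secs in offline_windows(SAMPLE):
    print(f"CPU{cpu}: offline for {secs:.3f}s ({start} -> {end})")
```

Feeding the whole syslog through the same function (for example by passing an open file object) would show whether the cycles are evenly spaced, which may help tell a timer-driven cause from a random hardware fault.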

Thanks for any hints on how to track down the fault.

Answer 1

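In our case, this behavior started after the disk errors (which followed a thunderstorm or power surge). So I don't know whether the motherboard was sending bad signals about frequency/power/sleep and so on, or whether pm-utils was misconfigured.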

Uninstalling the pm-utils package solved the problem.

Before that, we tried upgrading the distribution from Ubuntu natty (kernel 2.6) to Ubuntu precise (kernel 3.2), without success.

Another thing I tried was to remove the ability to enable/disable CPU cores at all (which is done through the /sys/devices/system/cpu/cpu*/online files).
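
For reference, the current state of those files can be read with a few lines of Python. This is a minimal sketch, not part of the original troubleshooting: `cpu_online_states` is a hypothetical helper, and the `base` parameter exists only so the function can be pointed at a different directory. On many x86 systems cpu0 has no `online` file, so it may simply be absent from the result.

```python
import glob
import os
import re


def cpu_online_states(base="/sys/devices/system/cpu"):
    """Map CPU number -> True/False for every cpuN/online file under base.

    Hypothetical helper; CPUs without an 'online' file are left out.
    """
    states = {}
    for path in glob.glob(os.path.join(base, "cpu[0-9]*", "online")):
        m = re.search(r"cpu(\d+)$", os.path.dirname(path))
        if not m:
            continue  # skip anything that is not a plain cpuN directory
        with open(path) as f:
            states[int(m.group(1))] = f.read().strip() == "1"
    return states


# Prints an empty list on systems that do not expose these files.
print(sorted(cpu_online_states().items()))
```

Running it in a loop while the problem occurs would show the cores flipping between online and offline without having to wait for syslog.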

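There is a kernel option nr_cpus=, which can be set to the number of processors (cores) in use. Setting it should disable CPU hotplug. But in my case, setting it as a grub boot parameter had no effect on the problem (apart from the /sys/devices/system/cpu/cpu*/online files going missing).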

 nr_cpus=    [SMP] Maximum number of processors that an SMP kernel
             could support.  nr_cpus=n : n >= 1 limits the kernel to
             supporting 'n' processors. Later in runtime you can not
             use hotplug cpu feature to put more cpu back to online.
             just like you compile the kernel NR_CPUS=n
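
To confirm that the option actually reached the running kernel, the boot command line in /proc/cmdline can be checked. The snippet below is just a sketch under that assumption; `parse_nr_cpus` is a hypothetical helper and the example command line is invented for illustration.

```python
def parse_nr_cpus(cmdline):
    """Return the integer value of nr_cpus= from a kernel command line, or None."""
    for token in cmdline.split():
        if token.startswith("nr_cpus="):
            value = token.split("=", 1)[1]
            return int(value) if value.isdigit() else None
    return None


# Invented example command line, for illustration only.
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet nr_cpus=4"
print("example:", parse_nr_cpus(example))

# On a Linux host, check what the running kernel was booted with.
try:
    with open("/proc/cmdline") as f:
        print("running kernel:", parse_nr_cpus(f.read()))
except OSError:
    print("running kernel: /proc/cmdline not available")
```

If the running kernel reports None, the parameter was probably not picked up by grub (for example because the grub configuration was not regenerated after editing).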
