Troubleshooting VMware ESXi 5 spontaneous reboots

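About 3 weeks ago I moved my ESXi 5.0 server to a colocation facility, and since then it has been spontaneously powering off and rebooting. Before that, the server sat at home for almost a month while I was on vacation, and it did not go down once during that time. The only differences (that I know of) are: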

  1. New physical location
  2. A Dell PERC5i RAID card was installed
  3. It now actually hosts a few websites, though nothing really taxing in terms of traffic or CPU

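The reason this is somewhat urgent is that on one occasion, when ESXi and the guests rebooted, one of the VMs came up with a corrupted file system and went into read-only (RO) mode. I rebooted that guest, ran fsck, and everything went back to normal. I am trying to find out what caused this reboot, and I would appreciate it if experienced ESXi users could spot anything abnormal in my logs. I don't see anything that looks like a kernel panic or a memory dump. Below are excerpts from the logs that I think are relevant around the reboot event... let me know if I should add any other logs.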

VMkernel summary log (vmksummary)

2012-08-07T17:00:01Z heartbeat: up 2d18h42m11s, 3 VMs; [[3406 vmx 2092436kB] [3453 vmx 2095768kB] [3373 vmx 2300420kB]] [[3531 sfcb-hhrc 2%max] [3432 sfcb-vmware_bas 5%max] [3420 sfcb-pycim 16%max]]
2012-08-07T18:00:01Z heartbeat: up 2d19h42m11s, 3 VMs; [[3406 vmx 2092488kB] [3453 vmx 2095640kB] [3373 vmx 2301544kB]] [[3531 sfcb-hhrc 2%max] [3432 sfcb-vmware_bas 5%max] [3420 sfcb-pycim 16%max]]
2012-08-07T18:58:42Z bootstop: Host has booted
2012-08-07T19:00:01Z heartbeat: up 0d0h2m10s, 3 VMs; [[3405 vmx 464780kB] [3451 vmx 815008kB] [3373 vmx 1086716kB]] [[3501 sfcb-CIMXML-Pro 1%max] [3432 sfcb-vmware_bas 2%max] [3420 sfcb-pycim 5%max]]
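The hourly heartbeat lines above make it possible to bracket when a reboot happened: whenever the reported uptime goes down between two heartbeats, the host restarted somewhere in between. Below is a minimal sketch of that check, assuming the line format shown in this excerpt. The names `parse_heartbeat` and `find_reboots` are made up for illustration, and the sample lines are abbreviated copies of the ones above.

```python
import re
from datetime import datetime, timedelta

# Matches the start of a heartbeat line as it appears in the excerpt above.
HEARTBEAT = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z heartbeat: "
    r"up (?P<d>\d+)d(?P<h>\d+)h(?P<m>\d+)m(?P<s>\d+)s"
)


def parse_heartbeat(line):
    """Return (timestamp, uptime) for a heartbeat line, or None for other lines."""
    match = HEARTBEAT.match(line)
    if match is None:
        return None
    ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S")
    uptime = timedelta(
        days=int(match.group("d")),
        hours=int(match.group("h")),
        minutes=int(match.group("m")),
        seconds=int(match.group("s")),
    )
    return ts, uptime


def find_reboots(lines):
    """Return (last_heartbeat_before, estimated_boot_time) for each uptime reset."""
    reboots = []
    previous = None
    for line in lines:
        parsed = parse_heartbeat(line)
        if parsed is None:
            continue  # e.g. "bootstop: Host has booted"
        ts, uptime = parsed
        # Uptime going backwards means the host restarted since the last heartbeat.
        if previous is not None and uptime < previous[1]:
            reboots.append((previous[0], ts - uptime))
        previous = (ts, uptime)
    return reboots


# Abbreviated copies of the log lines above (per-VM details trimmed).
SAMPLE = [
    "2012-08-07T17:00:01Z heartbeat: up 2d18h42m11s, 3 VMs;",
    "2012-08-07T18:00:01Z heartbeat: up 2d19h42m11s, 3 VMs;",
    "2012-08-07T18:58:42Z bootstop: Host has booted",
    "2012-08-07T19:00:01Z heartbeat: up 0d0h2m10s, 3 VMs;",
]

for last_seen, booted in find_reboots(SAMPLE):
    print(f"host went down after {last_seen} and was booting again by about {booted}")
```

This only narrows the window to within an hour. It says nothing about the cause, so the timestamps still have to be matched against the other logs and against whatever the colocation facility reports.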

System log (syslog)

2012-08-04T20:00:01Z crond[2702]: USER root pid 97212 cmd /usr/lib/vmware/vmksummary/log-heartbeat.py
2012-08-04T20:01:01Z crond[2702]: USER root pid 97329 cmd /sbin/auto-backup.sh
2012-08-04T21:00:01Z crond[2702]: USER root pid 99638 cmd /usr/lib/vmware/vmksummary/log-heartbeat.py
2012-08-04T21:01:01Z crond[2702]: USER root pid 99745 cmd /sbin/auto-backup.sh
2012-08-04T22:00:01Z crond[2702]: USER root pid 102014 cmd /usr/lib/vmware/vmksummary/log-heartbeat.py
2012-08-04T22:01:01Z crond[2702]: USER root pid 102081 cmd /sbin/auto-backup.sh
2012-08-04T22:17:54Z jumpstart: dependencies for plugin 'restore-host-cache' not met (missing: vcfs)
2012-08-04T22:17:54Z vmkmicrocode: Warning: Line size is greater than expected size 242
2012-08-04T22:17:54Z vmkmicrocode: File microcode_amd_0x100fa0.bin does not contain a valid microcode update for any of the processors
2012-08-04T22:17:54Z vmkmicrocode: File m4010676860C0001.dat does not contain a valid microcode update for any of the processors
2012-08-04T22:17:54Z vmkmicrocode: File m03106a5.dat does not contain a valid microcode update for any of the processors
2012-08-04T22:17:54Z vmkmicrocode: cpu0 with revision (a07) can use the update in file microcode-1027.dat
2012-08-04T22:17:54Z vmkmicrocode: update number 25 version(1), revision(2571), date(0x9282010), size(2048)
2012-08-04T22:17:54Z vmkmicrocode: cpu1 with revision (a07) can use the update in file microcode-1027.dat
2012-08-04T22:17:54Z vmkmicrocode: update number 25 version(1), revision(2571), date(0x9282010), size(2048)
2012-08-04T22:17:54Z vmkmicrocode: cpu2 with revision (a07) can use the update in file microcode-1027.dat
2012-08-04T22:17:54Z vmkmicrocode: update number 25 version(1), revision(2571), date(0x9282010), size(2048)

VMkernel log

2012-08-04T02:59:59.509Z cpu4:2655)<6>megasas_hotplug_work[6]: aen event code 0x0027
2012-08-04T15:57:19.630Z cpu5:2655)<6>megasas_hotplug_work[6]: aen event code 0x005e
2012-08-04T16:03:35.776Z cpu4:2649)<6>megasas_hotplug_work[6]: aen event code 0x005e
TSC: 0 cpu0:0)Boot: 167: Parsing boot option module /useropts.gz
TSC: 14715 cpu0:0)Boot: 173: Parsing command line boot options
TSC: 86415 cpu0:0)BootConfig: 38: coresPerPkg = 0
TSC: 90368 cpu0:0)BootConfig: 41: useNUMAInfo = TRUE
TSC: 93878 cpu0:0)BootConfig: 44: numaLatencyLoops = 20
...
PRESUMABLY MORE BOOT STUFF
...
0:00:00:03.667 cpu0:2048)IDT: 991: 0x30 <keyboard> exclusive, flags 0x3
0:00:00:03.667 cpu0:2048)IDT: 991: 0x58 <mouse> exclusive, flags 0x3
0:00:00:03.667 cpu0:2048)IOAPIC: 1335: 0x58 retriggerred
0:00:00:03.667 cpu0:2048)IOAPIC: 1335: 0x30 retriggerred
0:00:00:03.667 cpu0:2048)GlobalTimer: 78: GlobalTimer service not available
0:00:00:03.667 cpu0:2048)Initializing Power Management ...
0:00:00:03.670 cpu0:2048)Power: 2568: No supported CPU power management technology detected
0:00:00:03.671 cpu0:2048)MCE: 616: Fixed 10 MCE bank/CPU-package ownership settings
0:00:00:03.672 cpu0:2048)CpuSched: 11824: Reset scheduler statistics
0:00:00:03.672 cpu0:2048)Init: 892: Vmkernel initialization done. Returning to console.
0:00:00:03.672 cpu0:2048)VMKernel loaded successfully.
2012-08-04T22:17:52.152Z cpu6:2059)ScsiCore: 129: Starting taskMgmt watchdog world 2059
2012-08-04T22:17:52.152Z cpu4:2060)ScsiCore: 129: Starting taskMgmt watchdog world 2060
2012-08-04T22:17:52.152Z cpu5:2141)VSCSI: 2520: Starting reset handler world 2141/1
2012-08-04T22:17:52.152Z cpu3:2177)ScsiCore: 63: Starting taskmgmt handler world 2177/1
2012-08-04T22:17:52.152Z cpu2:2178)ScsiCore: 63: Starting taskmgmt handler world 2178/1
2012-08-04T22:17:52.152Z cpu5:2142)VSCSI: 2709: Starting reset watchdog world 2142

Host log (hostd)

2012-08-04T22:13:54.996Z [FFEA7AC0 info 'Vmomi'] Activation [N5Vmomi10ActivationE:0x33f7abc0] : Invoke done [waitForUpdates] on [vmodl.query.PropertyCollector:ha-property-collector]
2012-08-04T22:13:54.996Z [FFEA7AC0 verbose 'Vmomi'] Arg version:
--> "46"
2012-08-04T22:13:54.996Z [FFEA7AC0 info 'Vmomi'] Throw vmodl.fault.RequestCanceled
2012-08-04T22:13:54.996Z [FFEA7AC0 info 'Vmomi'] Result:
--> (vmodl.fault.RequestCanceled) {
--> dynamicType = <unset>,
--> faultCause = (vmodl.MethodFault) null,
--> msg = "",
--> }
2012-08-04T22:13:54.997Z [34759B90 error 'SoapAdapter.HTTPService'] HTTP Transaction failed on stream TCP(local=127.0.0.1:0, peer=127.0.0.1:58492) with error N7Vmacore15SystemExceptionE(Connection reset by p
2012-08-04T22:14:13.998Z [340C2B90 verbose 'Proxysvc Req01482'] New proxy client TCP(local=66.196.32.10:80, peer=223.4.119.245:43890)
2012-08-04T22:14:44.561Z [348FBB90 verbose 'vm:/vmfs/volumes/4ffd026d-a15e589f-c6e3-003048d37c09/REDACTED/REDACTED.vmx'] Actual VM overhead: 119980032 bytes
2012-08-04T22:14:44.562Z [348FBB90 verbose 'Vmsvc'] RefreshVms updated overhead for 1 VM
2012-08-04T22:15:07.104Z [34718B90 verbose 'Cimsvc'] Ticket issued for CIMOM version 1.0, user root
Section for VMware ESX, pid=2790, version=5.0.0, build=build-623860, option=Release
------ In-memory logs start --------
2012-08-04T22:18:21.746Z [FFC7CAC0 info 'Default'] Supported VMs 87
2012-08-04T22:18:21.746Z [FFC7CAC0 info 'Handle checker'] Setting system limit of 2222
2012-08-04T22:18:21.746Z [FFC7CAC0 info 'Handle checker'] Set system limit to 2222
2012-08-04T22:18:21.746Z [FFC7CAC0 info 'Default'] Setting malloc mmap threshold to 32 k
2012-08-04T22:18:21.746Z [FFC7CAC0 info 'Default'] getrlimit(RLIMIT_NPROC): curr=64 max=128, return code = Success
2012-08-04T22:18:21.746Z [FFC7CAC0 info 'Default'] setrlimit(RLIMIT_NPROC): curr=128 max=128, return code = Success
------ In-memory logs end --------
2012-08-04T22:18:21.747Z [FFC7CAC0 info 'Default'] Initialized channel manager

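I have ruled out:

  • The VM's file system going R/O as the cause - my understanding is that a single VM crashing cannot bring down ESXi
  • A spike in network traffic - the only website on that VM does not get much traffic around 10:30 PM, and the guest's Apache logs and other data support this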

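My guesses:

  • A problem with the Dell RAID card I installed - the server had run for more than 3 weeks before this card was installed; I will install diagnostics over the next few days so that I can monitor it
  • Possibly a throughput problem with the RAID card that makes the VMs' requests respond slowly, so that a VM thinks its file system has a problem. That would not explain the reboots, though: Linux should cope by just marking the FS R/O and carrying on until you fix it, and as mentioned above, the system should not be under load
  • Does VMware perform automatic updates that require a reboot? I have not installed VMware Tools on any of the guests, so this might cause the guest VMs to restart uncleanly.
  • A power failure at the colocation facility - the morning after I moved the server there, I had to ask them to restart my machine... I suspect someone switched off a power strip or something, because the reply I got from them was very generic: "We had a power problem". Also, we had a big thunderstorm a few hours ago and the server rebooted at least 3 times within 20 minutes, with no file system corruption, but that should not happen in a data center that is supposedly backed by UPS + generator
  • Anything else you can think of?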

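Answer 1

A thunderstorm can cause any number of problems. Depending on the tier/quality of the data center facility, it could have an impact.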

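  • The most useful logs will show up in the "Events" tab of the vSphere Client.
  • Do you have out-of-band management? Perhaps a DRAC? It can give you information about the state of the physical hardware.
  • Is this actually a Dell server? Which model/generation? If so, you should install the Dell CIM agents for ESXi 5
  • Does your PERC/5i controller have cache memory and a battery-backed cache unit (BBWC)? Not having them hurts write performance
  • A standalone VMware ESXi system does not have any automatic update feature.
  • You should install VMware Tools on the guest systems
  • Does your server have dual power supplies, and is A/B power available? If this is a single-PSU system, that could be the culprit.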
