硬件:类型:HP Proliant ML350 G5 RAM 22GB CPU 1 x Intel Xenon E5405 2.00GHz
OP:ESXi 5.5 刚刚从 5.1 更新,以尝试修复在同一硬件上 ESXi 5.1 上发生的崩溃。
我正在尝试查找我们的一台服务器崩溃的原因,它已经在 24 小时内锁定了两次。前面的内部错误指示灯闪烁红色,内部只有“#5 和 #6 第 76 页手册”的“处理器 2”指示灯呈“琥珀色”亮起,而“电源”指示灯呈“绿色”亮起。
在日志中,我唯一能在相关时间范围内看到的错误是在日志下。这是原因吗?或者我还能做其他什么来尝试记录/定位错误。
来自 zcat syslog.6.gz | less
2014-05-26T11:55:47Z sfcbd[35064]: Error opening socket pair for getProviderContext: Too many open files
2014-05-26T11:55:47Z sfcbd[35064]: Failed to set recv timeout (30) for socket -1. Errno = 9
2014-05-26T11:55:47Z sfcbd[35064]: Failed to set timeout for local socket (e.g. provider)
2014-05-26T11:55:47Z sfcbd[35064]: spGetMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:55:47Z sfcbd[35064]: rcvMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:47Z sfcbd[35064]: Error opening socket pair for getProviderContext: Too many open files
2014-05-26T11:55:47Z sfcbd[35064]: Failed to set recv timeout (30) for socket -1. Errno = 9
2014-05-26T11:55:47Z sfcbd[35064]: Failed to set timeout for local socket (e.g. provider)
2014-05-26T11:55:47Z sfcbd[35064]: spGetMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:55:47Z sfcbd[35064]: rcvMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:53Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:57Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:01Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:04Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:15Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcbd[35064]: Error opening socket pair for getProviderContext: Too many open files
2014-05-26T11:56:17Z sfcbd[35064]: Failed to set recv timeout (30) for socket -1. Errno = 9
2014-05-26T11:56:17Z sfcbd[35064]: Failed to set timeout for local socket (e.g. provider)
2014-05-26T11:56:17Z sfcbd[35064]: spGetMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:56:17Z sfcbd[35064]: rcvMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcbd[35064]: Error opening socket pair for getProviderContext: Too many open files
2014-05-26T11:56:17Z sfcbd[35064]: Failed to set recv timeout (30) for socket -1. Errno = 9
2014-05-26T11:56:17Z sfcbd[35064]: Failed to set timeout for local socket (e.g. provider)
2014-05-26T11:56:17Z sfcbd[35064]: spGetMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:56:17Z sfcbd[35064]: rcvMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:23Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:27Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:31Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:34Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:34Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:34Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:34Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:34Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:44Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:44Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:44Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:44Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:46Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:48Z sfcbd[35064]: Error opening socket pair for getProviderContext: Too many open files
更新
设置 iLO 2 并访问日志确实显示了一些进展,我收到了很多电源已移除的消息。所以我开始怀疑电源,移除 UPS 后,服务器已经稳定了 5 天。
Informational
iLO 2
05/29/2014 20:31
05/29/2014 20:31
1
Server power restored.
Informational
iLO 2
05/29/2014 20:31
05/29/2014 20:31
1
Server power removed.
Informational
iLO 2
05/29/2014 16:57
05/29/2014 16:57
1
Server power restored.
Informational
iLO 2
05/29/2014 16:57
05/29/2014 16:57
1
Server power removed.
Informational
iLO 2
05/29/2014 15:39
05/29/2014 15:39
1
Server power restored.
Informational
iLO 2
05/29/2014 15:39
05/29/2014 15:39
1
Server power removed.
更新 2
仍然不稳定,24 小时内又崩溃了 2 次
日志中也一样
Informational
iLO 2
06/13/2014 05:21
06/13/2014 05:21
2
Server power removed.
Informational
iLO 2
06/13/2014 05:21
06/13/2014 05:21
3
Server power restored.
发生这种情况后,iLO 界面保持打开状态。IML 日志为空,不显示任何内容
更新 3
Status Summary
Server Name: esx01.xx.xx; ProLiant ML350 G5
UUID: 32393534-3937-5A43-4A38-353130393248
Server Serial Number / Product ID: CZJ851092H / 459279-425
System ROM: D21 11/02/2008; backup system ROM: 11/02/2008
System Health: Ok
Internal Health LED: Ok
Server Power:
ON
UID Light:
OFF
Last Used Remote Console:
Remote Console
Latest IML Entry: IML Cleared (iLO 2 user:xxx)
iLO 2 Name: ILOCZJ851092H
License Type: iLO 2 Standard
iLO 2 Firmware Version: 1.61 08/31/2008
IP address: 192.168.2.2
Active Sessions: iLO 2 user:xxx
Latest iLO 2 Event Log Entry: Browser login: xxx - 172.20.1.105(DNS name not found).
iLO 2 Date/Time: 06/13/2014 23:22:52
答案1
你可能遇到了硬件问题。这是不是VMware ESXi 的一个问题。
- 您使用的是哪个版本的 ESXi?
- 服务器硬件/BIOS 的固件版本是什么?
- 您提到的其他 ESXi 主机是否由相同的硬件组成?
你最好的选择是检查HP 集成管理日志(IML)服务器。您可以通过国际劳工组织 2界面。
- 登录 ILO,检查硬件系统状态选项卡。主摘要屏幕可能会告诉您哪里出了问题。
- 此外,请查看“系统状态”选项卡下的 IML 选项。这将告诉您服务器崩溃的原因。
就这样。您可能遇到了 RAM、CPU 或系统板问题。
编辑:更新主机的固件,请!!- 不要成为统计!
下载当前可启动固件 DVD适用于您系统的更新已在此处。请使用它启动您的系统并让其更新所有组件。该服务器上的所有内容看起来都像 2008 年的。在使用 HP 服务器硬件时,这是大忌。