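After a few hours of working correctly, our ProLiant server stopped computing and system health LED 12 started blinking. According to the documentation (http://h20628.www2.hp.com/km-ext/kmcsdirect/emr_na-c01706108-8.pdf), this indicates "Critical system failure detected (processor, memory, regulator, thermal event, fan, NMI)" (page 96).
Then SSH was lost. We can reboot and get SSH back (I am not on site), but I don't know what to check. Are there log files where I could find some information?
I found this guide: http://denis.herve.free.fr/trsfrt/HProliant.pdf but it seems too hardcore for me.
My colleague thinks that a RAM + swap overload may have brought the whole server down. I don't quite agree, because as far as I know a memory overload would not cause a critical system failure. Do you have an opinion on this?
I wonder whether this is related to my previous post: Linux server swaps before memory is completely full.
We are running Ubuntu 14.04.
PS: the server is in a basement, and there may be a little condensation in the morning...
Edit: Following @Hennes's comment, we moved the server back into the living room. But after a night of computing it started blinking red again :-(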
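Now I am trying to make sense of the log files. We rebooted the server this morning at around 09:44; here are the most recently modified files:
Where should I search, and for what, to get some error information?
I tried: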
romain@pl:/var/log$ cat syslog | grep error
Dec 27 12:00:23 pl kernel: [ 1.053210] [Firmware Warn]: GHES: Poll interval is 0 for generic hardware error source: 1, disabled.
Dec 27 12:00:23 pl kernel: [ 6.740763] ata3.00: failed to enable AA (error_mask=0x1)
Dec 27 12:00:23 pl kernel: [ 6.741967] ata3.00: failed to enable AA (error_mask=0x1)
Dec 27 12:00:23 pl kernel: [ 7.082169] ata4.00: failed to enable AA (error_mask=0x1)
Dec 27 12:00:23 pl kernel: [ 7.112776] ata4.00: failed to enable AA (error_mask=0x1)
Dec 27 12:00:23 pl kernel: [ 9.905224] EXT4-fs (dm-0): re-mounted. Opts: errors=remount-ro
Dec 27 11:52:18 pl kernel: [ 1.053048] [Firmware Warn]: GHES: Poll interval is 0 for generic hardware error source: 1, disabled.
Dec 27 11:52:18 pl kernel: [ 6.364768] ata3.00: failed to enable AA (error_mask=0x1)
Dec 27 11:52:18 pl kernel: [ 6.365903] ata3.00: failed to enable AA (error_mask=0x1)
Dec 27 11:52:18 pl kernel: [ 6.684685] ata4.00: failed to enable AA (error_mask=0x1)
Dec 27 11:52:18 pl kernel: [ 6.686080] ata4.00: failed to enable AA (error_mask=0x1)
Dec 27 11:52:18 pl kernel: [ 11.211120] EXT4-fs (dm-0): re-mounted. Opts: errors=remount-ro
Dec 28 09:46:55 pl kernel: [ 1.051638] [Firmware Warn]: GHES: Poll interval is 0 for generic hardware error source: 1, disabled.
Dec 28 09:46:55 pl kernel: [ 6.348693] ata3.00: failed to enable AA (error_mask=0x1)
Dec 28 09:46:55 pl kernel: [ 6.349786] ata3.00: failed to enable AA (error_mask=0x1)
Dec 28 09:46:55 pl kernel: [ 6.699099] ata4.00: failed to enable AA (error_mask=0x1)
Dec 28 09:46:55 pl kernel: [ 6.731027] ata4.00: failed to enable AA (error_mask=0x1)
Dec 28 09:46:55 pl kernel: [ 8.959211] EXT4-fs (dm-0): re-mounted. Opts: errors=remount-ro
and:
romain@pl:/var/log$ cat dmesg | grep error
[ 1.051638] [Firmware Warn]: GHES: Poll interval is 0 for generic hardware error source: 1, disabled.
[ 6.348693] ata3.00: failed to enable AA (error_mask=0x1)
[ 6.349786] ata3.00: failed to enable AA (error_mask=0x1)
[ 6.699099] ata4.00: failed to enable AA (error_mask=0x1)
[ 6.731027] ata4.00: failed to enable AA (error_mask=0x1)
[ 8.959211] EXT4-fs (dm-0): re-mounted. Opts: errors=remount-ro
-> Here I don't quite understand what the value in the first column is, e.g. [ 6.731027]: is it the number of seconds since boot?
I checked
romain@pl:/var/log$ cat syslog | grep memory
Dec 27 12:00:23 pl kernel: [ 0.000000] Scanning 1 areas for low memory corruption
Dec 27 12:00:23 pl kernel: [ 0.000000] Base memory trampoline at [ffff880000094000] 94000 size 24576
[...]
Dec 27 12:00:23 pl kernel: [ 0.000000] init_memory_mapping: [mem 0x100000000-0x61fffffff]
Dec 27 12:00:23 pl kernel: [ 0.000000] Early memory node ranges
Dec 27 12:00:23 pl kernel: [ 0.000000] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
[...]
Dec 27 12:00:23 pl kernel: [ 0.000000] PM: Registered nosave memory: [mem 0xffc00000-0xffffffff]
Dec 27 12:00:23 pl kernel: [ 0.019764] Initializing cgroup subsys memory
Dec 27 12:00:23 pl kernel: [ 0.019992] Freeing SMP alternatives memory: 32K (ffffffff81e88000 - ffffffff81e90000)
Dec 27 12:00:23 pl kernel: [ 0.971501] Freeing initrd memory: 20288K (ffff880035850000 - ffff880036c20000)
Dec 27 12:00:23 pl kernel: [ 0.972518] Scanning for low memory corruption every 60 seconds
Dec 27 12:00:23 pl kernel: [ 6.154807] memory memory67: hash matches
Dec 27 12:00:23 pl kernel: [ 6.205519] Freeing unused kernel memory: 1412K (ffffffff81d27000 - ffffffff81e88000)
Dec 27 12:00:23 pl kernel: [ 6.234958] Freeing unused kernel memory: 232K (ffff8800017c6000 - ffff880001800000)
Dec 27 12:00:23 pl kernel: [ 6.254602] Freeing unused kernel memory: 336K (ffff880001bac000 - ffff880001c00000)
Dec 27 12:00:23 pl kernel: [ 9.739558] EDAC i7core: Driver loaded, 2 memory controller(s) found.
Dec 27 12:00:32 pl kernel: [ 20.152332] cgroup: docker-runc (2183) created nested cgroup for controller "memory" which has incomplete hierarchy support. Nested cgroups may change behavior in the future.
Dec 27 12:00:32 pl kernel: [ 20.152335] cgroup: "memory" requires setting use_hierarchy to 1 on the root
Dec 27 11:52:18 pl kernel: [ 0.000000] Scanning 1 areas for low memory corruption
Dec 27 11:52:18 pl kernel: [ 0.000000] Base memory trampoline at [ffff880000094000] 94000 size 24576
Dec 27 11:52:18 pl kernel: [ 0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
[...]
Dec 27 11:52:18 pl kernel: [ 0.000000] init_memory_mapping: [mem 0x100000000-0x61fffffff]
Dec 27 11:52:18 pl kernel: [ 0.000000] Early memory node ranges
Dec 27 11:52:18 pl kernel: [ 0.000000] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
[...]
Dec 27 11:52:18 pl kernel: [ 0.000000] PM: Registered nosave memory: [mem 0xffc00000-0xffffffff]
Dec 27 11:52:18 pl kernel: [ 0.019779] Initializing cgroup subsys memory
Dec 27 11:52:18 pl kernel: [ 0.020005] Freeing SMP alternatives memory: 32K (ffffffff81e88000 - ffffffff81e90000)
Dec 27 11:52:18 pl kernel: [ 0.970708] Freeing initrd memory: 20288K (ffff880035850000 - ffff880036c20000)
Dec 27 11:52:18 pl kernel: [ 0.971734] Scanning for low memory corruption every 60 seconds
Dec 27 11:52:18 pl kernel: [ 5.854654] Freeing unused kernel memory: 1412K (ffffffff81d27000 - ffffffff81e88000)
Dec 27 11:52:18 pl kernel: [ 5.883624] Freeing unused kernel memory: 232K (ffff8800017c6000 - ffff880001800000)
Dec 27 11:52:18 pl kernel: [ 5.902731] Freeing unused kernel memory: 336K (ffff880001bac000 - ffff880001c00000)
Dec 27 11:52:18 pl kernel: [ 10.983190] EDAC i7core: Driver loaded, 2 memory controller(s) found.
Dec 27 11:52:25 pl kernel: [ 19.933483] cgroup: docker-runc (2140) created nested cgroup for controller "memory" which has incomplete hierarchy support. Nested cgroups may change behavior in the future.
Dec 27 11:52:25 pl kernel: [ 19.933486] cgroup: "memory" requires setting use_hierarchy to 1 on the root
Dec 28 09:46:55 pl kernel: [ 0.000000] Scanning 1 areas for low memory corruption
Dec 28 09:46:55 pl kernel: [ 0.000000] Base memory trampoline at [ffff880000094000] 94000 size 24576
Dec 28 09:46:55 pl kernel: [ 0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
[...]
Dec 28 09:46:55 pl kernel: [ 0.000000] init_memory_mapping: [mem 0x100000000-0x51fffffff]
Dec 28 09:46:55 pl kernel: [ 0.000000] Early memory node ranges
Dec 28 09:46:55 pl kernel: [ 0.000000] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
[...]
Dec 28 09:46:55 pl kernel: [ 0.000000] PM: Registered nosave memory: [mem 0xffc00000-0xffffffff]
Dec 28 09:46:55 pl kernel: [ 0.020007] Initializing cgroup subsys memory
Dec 28 09:46:55 pl kernel: [ 0.020233] Freeing SMP alternatives memory: 32K (ffffffff81e88000 - ffffffff81e90000)
Dec 28 09:46:55 pl kernel: [ 0.970821] Freeing initrd memory: 20288K (ffff880035850000 - ffff880036c20000)
Dec 28 09:46:55 pl kernel: [ 0.971834] Scanning for low memory corruption every 60 seconds
Dec 28 09:46:55 pl kernel: [ 5.824432] Freeing unused kernel memory: 1412K (ffffffff81d27000 - ffffffff81e88000)
Dec 28 09:46:55 pl kernel: [ 5.853109] Freeing unused kernel memory: 232K (ffff8800017c6000 - ffff880001800000)
Dec 28 09:46:55 pl kernel: [ 5.871990] Freeing unused kernel memory: 336K (ffff880001bac000 - ffff880001c00000)
Dec 28 09:46:55 pl kernel: [ 8.826997] EDAC i7core: Driver loaded, 2 memory controller(s) found.
Dec 28 09:47:04 pl kernel: [ 19.154325] cgroup: docker-runc (2171) created nested cgroup for controller "memory" which has incomplete hierarchy support. Nested cgroups may change behavior in the future.
Dec 28 09:47:04 pl kernel: [ 19.154328] cgroup: "memory" requires setting use_hierarchy to 1 on the root
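I also searched the syslog file for 'fan', 'nmi' and 'critical', with no output at all.
I remember some Stack Overflow questions where people copied and pasted whole files to an external log-hosting site - I can't remember the name - and I am ready to put the files online if anyone is interested.
Any hint on where to search and for which keywords is welcome.
We use the server with Docker and RStudio Server for ML computation. I really doubt that this usage could be the root of the problem, but in IT you never know, so I mention it explicitly ;)
Thanks for any ideas.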
Answer 1
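Assuming your system is the ML150 G6 mentioned in the document you linked, I strongly recommend that you set up and use the Lights-Out 100 management feature on your system.
A basic how-to can be found here. Once you have access to Lights-Out 100 management (I suggest using the web interface until you are more familiar with what LO100 offers and how to use it), see pages 28-32 of the same document; they show how to view the system's live sensor and event information. Typically, if a hardware problem caused the reset, it will be listed in the System Event Log, and finding it there will tell you what happened to your machine. The System Event Log should have been capturing its data whether or not you have ever touched LO100, so once you get in there it should have something interesting to tell you.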
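As a possible command-line alternative to the web interface, the System Event Log can often be read with ipmitool. This is only a sketch under assumptions: that LO100 on this machine is reachable over IPMI, that the ipmitool package is installed, and that the helper name `dump_sel` (hypothetical, not part of any HP tool) is acceptable.

```shell
# dump_sel: hypothetical helper that prints the System Event Log via ipmitool.
# Any arguments are passed through to ipmitool ahead of the "sel elist" subcommand.
dump_sel() {
    if ! command -v ipmitool >/dev/null 2>&1; then
        echo "ipmitool not found; install it first" >&2
        return 127
    fi
    ipmitool "$@" sel elist
}
```

Run locally as root (with the kernel IPMI drivers loaded) this would simply be `dump_sel`; from another machine it would be something like `dump_sel -I lanplus -H <lo100-address> -U <user>`, where the address and user are placeholders and the interface may need to be `lan` instead of `lanplus`.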
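Most of the same information is available from the operating system you are running, either through /var/log/messages (which you have already tried without success) or through HP's Insight tools, which can be installed on some Linux distributions (http://downloads.linux.hp.com/SDR/project/mcp/ is a good starting point for getting these tools). Unfortunately, not all events show up in the system log, because they are hardware-specific and it is the HP agents, not the kernel itself, that detect them.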
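That said, you can also check whether mcelog is installed and running; it can capture some hardware events, and it usually logs something to the messages log when it catches one. It usually also records the event information in a separate log, or keeps it in memory so that you can query it with the mcelog command. It is worth looking for mcelog in your messages log, or checking whether you have a recently updated /var/log/mcelog file.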
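The searches in the question were case-sensitive (`grep error`, and 'nmi' in lower case), so a kernel line containing "NMI" or "Machine Check" would not have matched. A minimal sketch of a broader, case-insensitive scan follows; the keyword list and the names `HW_PATTERN` and `scan_hw_events` are illustrative assumptions, not an exhaustive or official set.

```shell
# Illustrative keyword list (an assumption): extend or trim it as needed.
HW_PATTERN='error|fail|mce|machine check|nmi|thermal|fan|edac|out of memory|oom'

# scan_hw_events FILE...: print every line that matches any keyword.
scan_hw_events() {
    # -i ignores case, -E enables the alternation syntax used in HW_PATTERN
    grep -iE "$HW_PATTERN" "$@"
}
```

On the Ubuntu machine from the question this could be run as `scan_hw_events /var/log/syslog /var/log/syslog.1`, adding /var/log/mcelog if that file exists. The lines just before the previous shutdown are usually more interesting than the boot messages that follow it.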