大家好,提前谢谢大家!我和我的团队一直被我们用于科学研究的集群的稳定性所困扰。我们有很多科学和软件工程经验,但运行集群的经验并不多。我会尽量简短地讲。
我们运行一个由大约 10 台机器组成的 HPC 集群,每台机器配备 4 到 8 个 NVIDIA GeForce 1080 GTX GPU,用于科学计算。机器本身是 Supermicro GPU SuperServer(我们有几种不同的型号)。每块主板都有两个通用 NIC,其中只有一个连接到我们的网络。此外,机器还有一个独立的管理 (IPMI) NIC,它们也连接到(同一个网络)。注意:所有 NIC 都连接到同一个子网。网络由 Meraki MX84 路由器运行,24 端口 Netgear 路由器位于路由器和机器之间。
另外还有两台特殊机器;一台运行 MAAS,这是我们用来管理集群的。另一台有 RAID 控制器和几 TB 的 RAID5 阵列。所有机器都通过 NFS 连接到这台机器。
所有机器都运行 Ubuntu Server 16.04
这些机器位于距离我们办公室约一小时车程的托管中心。我们有两种方式连接这些机器:1) 通过 Meraki 提供的 VPN 接入网络;2) 通过反向隧道 ssh 连接到我们在云中运行的另一台机器。
正常情况下,我们在 GPU 机器上运行 CPU 和 GPU 密集型作业,这些作业从 NFS 安装的 RAID 阵列加载必要的数据。
问题:系统不稳定!这些机器最多只能运行几天,然后一切就会变得很糟糕。以下是糟糕透顶的症状:
- 大部分机器都无法连接(SSH 和 VPN 都无法连接)。
- 无法访问的机器也无法通过 IPMI 访问
- 有些机器可以连接,但 shell 速度很慢(我的意思是你可以输入命令,但按键和响应之间有明显的滞后;感觉很像网络问题)
- 我们可以进入的那些机器似乎已经中断了出站 Internet 连接。具体来说,
ping google.com
导致 DNS 解析问题:unknown host google.com
- 仅软重启机器是不够的;要恢复功能,我们必须通过远程 PDU 进行电源循环。
我们的调查显示,我们根本无法进入的机器实际上仍然处于活动状态;是网络问题阻止了我们对它们的访问。文章底部是我从重新启动后其中一台“死机”的机器中提取的日志。您看到的是正常的 DHCP 活动定期发生,直到凌晨 3:00 左右,此时 DHCPDISCOVER 广播开始失败。当然,此时 ssh 隧道(使用 运行autossh
)开始失败。
我最初的想法是,罪魁祸首是 MAAS,因为我们使用的是它的 DHCP 服务器,而不是 Meraki 路由器提供的服务器。为了验证这个理论,我重新安装了 MAAS,重建了集群,这次使用的是 Meraki 的 DHCP 服务,而不是 MAAS 的 DHCP 服务。两天后,系统以标准方式失败,所以我认为我已经排除了 MAAS(至少就 DHCP 而言)。
我们团队中的一些人直觉地认为 NFS 是罪魁祸首。理论上,NFS 出现故障,然后其他一切都会崩溃。我们知道,当 NFS 崩溃时,客户端文件系统很难恢复,但不清楚这会如何影响网络。
任何有关此问题的帮助都将非常有帮助。正如我所说;我们都没有太多运行集群的经验,因此,最好能给出一些关于从哪里开始查找的建议。如果能给出一些关于具体问题是什么以及如何解决它的建议就更好了。
提前致谢!
日志示例:
Apr 11 02:02:31 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 252 seconds.
Apr 11 02:06:43 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 02:06:44 cluster9 dhclient[1558]: DHCPACK of 192.168.128.120 from 192.168.128.101
Apr 11 02:06:44 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 291 seconds.
Apr 11 02:11:35 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 02:11:35 cluster9 dhclient[1558]: DHCPACK of 192.168.128.120 from 192.168.128.101
Apr 11 02:11:35 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 239 seconds.
Apr 11 02:15:35 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 02:15:35 cluster9 dhclient[1558]: DHCPACK of 192.168.128.120 from 192.168.128.101
Apr 11 02:15:35 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 275 seconds.
Apr 11 02:17:01 cluster9 CRON[7877]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Apr 11 02:20:11 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 02:20:11 cluster9 dhclient[1558]: DHCPACK of 192.168.128.120 from 192.168.128.101
Apr 11 02:20:11 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 250 seconds.
Apr 11 02:24:21 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 02:24:22 cluster9 dhclient[1558]: DHCPACK of 192.168.128.120 from 192.168.128.101
Apr 11 02:24:22 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 279 seconds.
Apr 11 02:29:01 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 02:29:01 cluster9 dhclient[1558]: DHCPACK of 192.168.128.120 from 192.168.128.101
Apr 11 02:29:01 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 288 seconds.
Apr 11 02:33:49 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 02:33:49 cluster9 dhclient[1558]: DHCPACK of 192.168.128.120 from 192.168.128.101
Apr 11 02:33:49 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 281 seconds.
Apr 11 02:38:30 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 02:38:30 cluster9 dhclient[1558]: DHCPACK of 192.168.128.120 from 192.168.128.101
Apr 11 02:38:30 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 296 seconds.
Apr 11 02:43:26 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 02:43:26 cluster9 dhclient[1558]: DHCPACK of 192.168.128.120 from 192.168.128.101
Apr 11 02:43:26 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 270 seconds.
Apr 11 02:47:56 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 02:47:56 cluster9 dhclient[1558]: DHCPACK of 192.168.128.120 from 192.168.128.101
Apr 11 02:47:56 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 260 seconds.
Apr 11 02:52:16 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 02:52:16 cluster9 dhclient[1558]: DHCPACK of 192.168.128.120 from 192.168.128.101
Apr 11 02:52:16 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 276 seconds.
Apr 11 02:56:52 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 02:56:52 cluster9 dhclient[1558]: DHCPACK of 192.168.128.120 from 192.168.128.101
Apr 11 02:56:52 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 254 seconds.
Apr 11 03:01:06 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 03:01:06 cluster9 dhclient[1558]: DHCPACK of 192.168.128.120 from 192.168.128.101
Apr 11 03:01:06 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 241 seconds.
Apr 11 03:01:30 cluster9 systemd[1]: Started Session 488 of user ubuntu.
Apr 11 03:05:07 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 03:05:07 cluster9 dhclient[1558]: DHCPACK of 192.168.128.120 from 192.168.128.101
Apr 11 03:05:07 cluster9 dhclient[1558]: bound to 192.168.128.120 -- renewal in 290 seconds.
Apr 11 03:09:57 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)
Apr 11 03:13:51 cluster9 dhclient[1558]: message repeated 18 times: [ DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 192.168.128.101 port 67 (xid=0x5f5d62a8)]
Apr 11 03:14:04 cluster9 dhclient[1558]: DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 255.255.255.255 port 67 (xid=0x5f5d62a8)
Apr 11 03:15:05 cluster9 dhclient[1558]: message repeated 5 times: [ DHCPREQUEST of 192.168.128.120 on enp3s0f0 to 255.255.255.255 port 67 (xid=0x5f5d62a8)]
Apr 11 03:15:08 cluster9 avahi-daemon[1465]: Withdrawing address record for 192.168.128.120 on enp3s0f0.
Apr 11 03:15:08 cluster9 avahi-daemon[1465]: Leaving mDNS multicast group on interface enp3s0f0.IPv4 with address 192.168.128.120.
Apr 11 03:15:08 cluster9 avahi-daemon[1465]: Interface enp3s0f0.IPv4 no longer relevant for mDNS.
Apr 11 03:15:08 cluster9 systemd[1]: Stopping Network Time Synchronization...
Apr 11 03:15:08 cluster9 systemd[1]: Stopped Network Time Synchronization.
Apr 11 03:15:08 cluster9 systemd[1]: Starting Network Time Synchronization...
Apr 11 03:15:08 cluster9 systemd[1]: Started Network Time Synchronization.
Apr 11 03:15:08 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 3 (xid=0x3bb49111)
Apr 11 03:15:11 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 5 (xid=0x3bb49111)
Apr 11 03:15:16 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 11 (xid=0x3bb49111)
Apr 11 03:15:27 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 16 (xid=0x3bb49111)
Apr 11 03:15:43 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 20 (xid=0x3bb49111)
Apr 11 03:16:03 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 16 (xid=0x3bb49111)
Apr 11 03:16:19 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 20 (xid=0x3bb49111)
Apr 11 03:16:36 cluster9 autossh[13532]: timeout polling to accept read connection
Apr 11 03:16:36 cluster9 autossh[13532]: port down, restarting ssh
Apr 11 03:16:36 cluster9 autossh[13532]: starting ssh (count 476)
Apr 11 03:16:36 cluster9 autossh[13532]: ssh child pid is 8161
Apr 11 03:16:36 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:16:36 cluster9 autossh[13532]: starting ssh (count 477)
Apr 11 03:16:36 cluster9 autossh[13532]: ssh child pid is 8162
Apr 11 03:16:36 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:16:36 cluster9 autossh[13532]: starting ssh (count 478)
Apr 11 03:16:36 cluster9 autossh[13532]: ssh child pid is 8163
Apr 11 03:16:36 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:16:36 cluster9 autossh[13532]: starting ssh (count 479)
Apr 11 03:16:36 cluster9 autossh[13532]: ssh child pid is 8164
Apr 11 03:16:36 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:16:36 cluster9 autossh[13532]: starting ssh (count 480)
Apr 11 03:16:36 cluster9 autossh[13532]: ssh child pid is 8165
Apr 11 03:16:36 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:16:36 cluster9 autossh[13532]: starting ssh (count 481)
Apr 11 03:16:36 cluster9 autossh[13532]: ssh child pid is 8166
Apr 11 03:16:36 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:16:38 cluster9 autossh[13532]: starting ssh (count 482)
Apr 11 03:16:38 cluster9 autossh[13532]: ssh child pid is 8167
Apr 11 03:16:38 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:16:39 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 19 (xid=0x3bb49111)
Apr 11 03:16:46 cluster9 autossh[13532]: starting ssh (count 483)
Apr 11 03:16:46 cluster9 autossh[13532]: ssh child pid is 8168
Apr 11 03:16:46 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:16:58 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 19 (xid=0x3bb49111)
Apr 11 03:17:01 cluster9 CRON[8170]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Apr 11 03:17:04 cluster9 autossh[13532]: starting ssh (count 484)
Apr 11 03:17:04 cluster9 autossh[13532]: ssh child pid is 8172
Apr 11 03:17:04 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:17:17 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 11 (xid=0x3bb49111)
Apr 11 03:17:28 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 11 (xid=0x3bb49111)
Apr 11 03:17:36 cluster9 autossh[13532]: starting ssh (count 485)
Apr 11 03:17:36 cluster9 autossh[13532]: ssh child pid is 8173
Apr 11 03:17:36 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:17:39 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 12 (xid=0x3bb49111)
Apr 11 03:17:51 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 17 (xid=0x3bb49111)
Apr 11 03:18:08 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 11 (xid=0x3bb49111)
Apr 11 03:18:19 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 18 (xid=0x3bb49111)
Apr 11 03:18:26 cluster9 autossh[13532]: starting ssh (count 486)
Apr 11 03:18:26 cluster9 autossh[13532]: ssh child pid is 8174
Apr 11 03:18:26 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:18:37 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 10 (xid=0x3bb49111)
Apr 11 03:18:47 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 14 (xid=0x3bb49111)
Apr 11 03:19:01 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 21 (xid=0x3bb49111)
Apr 11 03:19:22 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 9 (xid=0x3bb49111)
Apr 11 03:19:31 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 13 (xid=0x3bb49111)
Apr 11 03:19:38 cluster9 autossh[13532]: starting ssh (count 487)
Apr 11 03:19:38 cluster9 autossh[13532]: ssh child pid is 8175
Apr 11 03:19:38 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:19:44 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 7 (xid=0x3bb49111)
Apr 11 03:19:51 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 8 (xid=0x3bb49111)
Apr 11 03:19:59 cluster9 dhclient[1558]: DHCPDISCOVER on enp3s0f0 to 255.255.255.255 port 67 interval 10 (xid=0x3bb49111)
Apr 11 03:20:09 cluster9 dhclient[1558]: No DHCPOFFERS received.
Apr 11 03:20:09 cluster9 dhclient[1558]: No working leases in persistent database - sleeping.
Apr 11 03:21:16 cluster9 autossh[13532]: starting ssh (count 488)
Apr 11 03:21:16 cluster9 autossh[13532]: ssh child pid is 8182
Apr 11 03:21:16 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:23:24 cluster9 autossh[13532]: starting ssh (count 489)
Apr 11 03:23:24 cluster9 autossh[13532]: ssh child pid is 8183
Apr 11 03:23:24 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:26:06 cluster9 autossh[13532]: starting ssh (count 490)
Apr 11 03:26:06 cluster9 autossh[13532]: ssh child pid is 8185
Apr 11 03:26:06 cluster9 autossh[13532]: ssh exited with error status 255; restarting ssh
Apr 11 03:26:36 cluster9 autossh[13532]: starting ssh (count 491)