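While answering "How to ssh to 2000+ nodes?" I found the following:
Running 500 ssh sessions in parallel to 500 servers is no problem if any one of the following holds:
- The servers are not on the same LAN (i.e. they are routed through different routers).
- The servers are Docker containers on the same machine (localhost).
- Jobs are started no faster than 1 per 30 ms.
So these all work: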
head -n 500 ext.ipaddr | parallel -j 500 ssh {} uptime
head -n 500 localhost.docker.ipaddr | parallel -j 500 ssh {} uptime
head -n 500 local.lan.docker.ipaddr | parallel --delay 0.03 -j 500 ssh {} uptime
What I do not understand is why this does not work:
head -n 500 local.lan.docker.ipaddr | parallel -j 500 ssh {} uptime
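I.e. ssh to 500 Docker containers on a server on the local LAN with no delay (it sometimes gives problems with as few as 5 Docker containers).
When I do this I get a lot of "No route to host" errors.
I have come to the conclusion that it has to do with ARP.
In the working cases I get output like this: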
06:15:06.605997 ARP, Request who-has 172.24.0.113 tell 172.24.254.254, length 28
06:15:06.617110 ARP, Reply 172.24.0.113 is-at 02:42:ac:18:00:71, length 46
06:15:06.636660 ARP, Request who-has 172.24.0.115 tell 172.24.254.254, length 28
06:15:06.648457 ARP, Reply 172.24.0.115 is-at 02:42:ac:18:00:73, length 46
06:15:06.660832 ARP, Request who-has 172.22.0.116 tell 172.22.254.254, length 28
06:15:06.672328 ARP, Reply 172.22.0.116 is-at 02:42:ac:16:00:74, length 46
06:15:06.692116 ARP, Request who-has 172.21.0.117 tell 172.21.254.254, length 28
06:15:06.703215 ARP, Reply 172.21.0.117 is-at 02:42:ac:15:00:75, length 46
06:15:06.717891 ARP, Request who-has 172.23.0.117 tell 172.23.254.254, length 28
06:15:06.729403 ARP, Reply 172.23.0.117 is-at 02:42:ac:17:00:75, length 46
06:15:06.752089 ARP, Request who-has 172.24.0.114 tell 172.24.254.254, length 28
06:15:06.764744 ARP, Reply 172.24.0.114 is-at 02:42:ac:18:00:72, length 46
06:15:06.783677 ARP, Request who-has 172.24.0.116 tell 172.24.254.254, length 28
06:15:06.795258 ARP, Reply 172.24.0.116 is-at 02:42:ac:18:00:74, length 46
06:15:06.809392 ARP, Request who-has 172.23.0.118 tell 172.23.254.254, length 28
06:15:06.820770 ARP, Reply 172.23.0.118 is-at 02:42:ac:17:00:76, length 46
06:15:06.842422 ARP, Request who-has 172.21.0.118 tell 172.21.254.254, length 28
06:15:06.853491 ARP, Reply 172.21.0.118 is-at 02:42:ac:15:00:76, length 46
06:15:06.871436 ARP, Request who-has 172.22.0.117 tell 172.22.254.254, length 28
06:15:06.882957 ARP, Reply 172.22.0.117 is-at 02:42:ac:16:00:75, length 46
06:15:06.902872 ARP, Request who-has 172.23.0.120 tell 172.23.254.254, length 28
06:15:06.913643 ARP, Reply 172.23.0.120 is-at 02:42:ac:17:00:78, length 46
06:15:06.932819 ARP, Request who-has 172.21.0.119 tell 172.21.254.254, length 28
06:15:06.944045 ARP, Reply 172.21.0.119 is-at 02:42:ac:15:00:77, length 46
So each request is immediately followed by a reply.
In the failing case I get:
06:17:35.764287 ARP, Request who-has 172.21.0.169 tell 172.21.254.254, length 28
06:17:35.768654 ARP, Request who-has 172.22.0.169 tell 172.22.254.254, length 28
06:17:35.771642 ARP, Request who-has 172.24.0.169 tell 172.24.254.254, length 28
06:17:35.772369 ARP, Request who-has 172.24.0.109 tell 172.24.254.254, length 28
06:17:35.772384 ARP, Request who-has 172.23.0.110 tell 172.23.254.254, length 28
06:17:35.772387 ARP, Request who-has 172.21.0.111 tell 172.21.254.254, length 28
06:17:35.772388 ARP, Request who-has 172.22.0.109 tell 172.22.254.254, length 28
06:17:35.772395 ARP, Request who-has 172.23.0.107 tell 172.23.254.254, length 28
06:17:35.776378 ARP, Request who-has 172.22.0.108 tell 172.22.254.254, length 28
06:17:35.776398 ARP, Request who-has 172.24.0.108 tell 172.24.254.254, length 28
06:17:35.776401 ARP, Request who-has 172.23.0.106 tell 172.23.254.254, length 28
06:17:35.776408 ARP, Request who-has 172.21.0.109 tell 172.21.254.254, length 28
06:17:35.777417 ARP, Request who-has 172.21.0.170 tell 172.21.254.254, length 28
06:17:35.783320 ARP, Request who-has 172.24.0.170 tell 172.24.254.254, length 28
06:17:35.789594 ARP, Request who-has 172.21.0.171 tell 172.21.254.254, length 28
06:17:35.792286 ARP, Request who-has 172.22.0.171 tell 172.22.254.254, length 28
06:17:35.798649 ARP, Request who-has 172.24.0.171 tell 172.24.254.254, length 28
06:17:35.803277 ARP, Request who-has 172.23.0.173 tell 172.23.254.254, length 28
06:17:35.804366 ARP, Request who-has 172.23.0.112 tell 172.23.254.254, length 28
06:17:35.804383 ARP, Request who-has 172.23.0.113 tell 172.23.254.254, length 28
06:17:35.804385 ARP, Request who-has 172.24.0.110 tell 172.24.254.254, length 28
06:17:35.804387 ARP, Request who-has 172.21.0.112 tell 172.21.254.254, length 28
06:17:35.804388 ARP, Request who-has 172.22.0.112 tell 172.22.254.254, length 28
06:17:35.804389 ARP, Request who-has 172.21.0.114 tell 172.21.254.254, length 28
06:17:35.804390 ARP, Request who-has 172.22.0.111 tell 172.22.254.254, length 28
06:17:35.804391 ARP, Request who-has 172.23.0.109 tell 172.23.254.254, length 28
06:17:35.804393 ARP, Request who-has 172.23.0.108 tell 172.23.254.254, length 28
06:17:35.806772 ARP, Request who-has 172.22.0.170 tell 172.22.254.254, length 28
06:17:35.811874 ARP, Request who-has 172.22.0.172 tell 172.22.254.254, length 28
06:17:35.816238 ARP, Request who-has 172.21.0.172 tell 172.21.254.254, length 28
06:17:35.820150 ARP, Request who-has 172.23.0.174 tell 172.23.254.254, length 28
06:17:35.826595 ARP, Request who-has 172.23.0.175 tell 172.23.254.254, length 28
06:17:35.832707 ARP, Request who-has 172.21.0.173 tell 172.21.254.254, length 28
06:17:35.835588 ARP, Request who-has 172.23.0.176 tell 172.23.254.254, length 28
06:17:35.836369 ARP, Request who-has 172.23.0.114 tell 172.23.254.254, length 28
06:17:35.836384 ARP, Request who-has 172.24.0.112 tell 172.24.254.254, length 28
06:17:35.836392 ARP, Request who-has 172.21.0.113 tell 172.21.254.254, length 28
06:17:35.840372 ARP, Request who-has 172.21.0.115 tell 172.21.254.254, length 28
06:17:35.840394 ARP, Request who-has 172.22.0.110 tell 172.22.254.254, length 28
06:17:35.840397 ARP, Request who-has 172.23.0.111 tell 172.23.254.254, length 28
06:17:35.840400 ARP, Request who-has 172.24.0.111 tell 172.24.254.254, length 28
06:17:35.840408 ARP, Request who-has 172.22.0.113 tell 172.22.254.254, length 28
06:17:35.842467 ARP, Request who-has 172.24.0.172 tell 172.24.254.254, length 28
06:17:35.844844 ARP, Request who-has 172.22.0.173 tell 172.22.254.254, length 28
06:17:35.853446 ARP, Request who-has 172.21.0.174 tell 172.21.254.254, length 28
06:17:35.855394 ARP, Request who-has 172.24.0.173 tell 172.24.254.254, length 28
06:17:35.860520 ARP, Request who-has 172.23.0.178 tell 172.23.254.254, length 28
06:17:35.865012 ARP, Request who-has 172.21.0.175 tell 172.21.254.254, length 28
06:17:35.868369 ARP, Request who-has 172.22.0.116 tell 172.22.254.254, length 28
06:17:35.868391 ARP, Request who-has 172.23.0.116 tell 172.23.254.254, length 28
06:17:35.868394 ARP, Request who-has 172.22.0.115 tell 172.22.254.254, length 28
06:17:35.868395 ARP, Request who-has 172.21.0.117 tell 172.21.254.254, length 28
06:17:35.868397 ARP, Request who-has 172.24.0.113 tell 172.24.254.254, length 28
06:17:35.868398 ARP, Request who-has 172.23.0.115 tell 172.23.254.254, length 28
So there are lots of requests but no replies. That explains the "No route to host". But it raises a new question: why are there no replies?
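To put numbers on the request/reply imbalance instead of eyeballing the dump, the captured text can be summarised. The following is a minimal sketch, assuming the capture is in the same text format as the dumps above; the function name arp_summary, the file name arp.log, and the interface name eth0 are made up for illustration.

```shell
#!/bin/sh
# Summarise tcpdump ARP text output read from stdin.
# Expected line shapes (taken from the dumps above):
#   <time> ARP, Request who-has <ip> tell <ip>, length 28
#   <time> ARP, Reply <ip> is-at <mac>, length 46
# Hypothetical capture step (interface name is an assumption):
#   tcpdump -n -i eth0 arp > arp.log
#   arp_summary < arp.log
arp_summary() {
    awk '
        # Field 5 of a request is the IP being asked for.
        $2 == "ARP," && $3 == "Request" { requests++; asked[$5] = 1 }
        # Field 4 of a reply is the IP that answered.
        $2 == "ARP," && $3 == "Reply"   { replies++;  answered[$4] = 1 }
        END {
            # Count IPs that were asked for but never answered.
            for (ip in asked) if (!(ip in answered)) unanswered++
            printf "requests=%d replies=%d unanswered_ips=%d\n", requests, replies, unanswered
        }
    '
}
```

Comparing the summary of a run with --delay 0.03 against a run without it should show whether replies disappear entirely or are only delayed.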
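dmesg and syslog on the container server show nothing when running the above: no "Server is being ARP flooded, stopping answering ARP requests" or similar.
The LAN is 1 Gbit/s with no other traffic. The traffic is on the order of 1 Mbit/s.
The container server is 90% idle, but top shows ksoftirqd maxing out a few cores: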
top - 06:38:38 up 6:33, 4 users, load average: 3.94, 5.16, 4.34
Tasks: 17106 total, 7 running, 17098 sleeping, 1 stopped, 0 zombie
%Cpu(s): 0.8 us, 1.1 sy, 0.0 ni, 91.7 id, 0.0 wa, 0.0 hi, 6.4 si, 0.0 st
GiB Mem : 503.9 total, 162.4 free, 303.6 used, 37.9 buff/cache
GiB Swap: 200.0 total, 200.0 free, 0.0 used. 199.6 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
25 root 20 0 0 0 0 R 100.0 0.0 24:13.88 ksoftirqd/2
31 root 20 0 0 0 0 R 100.0 0.0 1:39.16 ksoftirqd/3
37 root 20 0 0 0 0 R 99.8 0.0 14:31.28 ksoftirqd/4
49 root 20 0 0 0 0 R 99.8 0.0 22:29.80 ksoftirqd/6
2899 root 20 0 20.5g 1.3g 51704 S 29.2 0.3 171:22.30 dockerd
3230170 root 20 0 29912 23504 3428 R 25.7 0.0 0:39.58 top
37 root 20 0 0 0 0 R 100.0 0.0 14:23.61 ksoftirqd/4
This maxing out happens exactly when running the ssh commands without a delay. With a delay of 30 ms the cores are not maxed out, but run at 30%.
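top only shows that ksoftirqd is busy, not which softirq it is servicing. One way to narrow that down is to look at the per-CPU counters in /proc/softirqs. The sketch below assumes that the network receive softirq (the NET_RX row) is the relevant one, which I have not verified; the function name net_rx_counts is made up for illustration.

```shell
#!/bin/sh
# Print the NET_RX softirq counter for each CPU, one "CPUn count" pair per line.
# Reads /proc/softirqs by default; another file can be given as $1.
# Assumption: NET_RX is the softirq of interest here.
# Possible use: take two samples a second apart and compare them:
#   net_rx_counts > before; sleep 1; net_rx_counts > after
#   paste before after
net_rx_counts() {
    awk '
        # The first line holds the CPU column names (CPU0 CPU1 ...).
        NR == 1 { for (i = 1; i <= NF; i++) cpu[i] = $i }
        # Data rows start with a label, so column i belongs to cpu[i - 1].
        $1 == "NET_RX:" { for (i = 2; i <= NF; i++) print cpu[i - 1], $i }
    ' "${1:-/proc/softirqs}"
}
```

If the counters for the CPUs listed in the top output (2, 3, 4 and 6) grow much faster than the rest during the run without delay, that would point at receive processing being concentrated on those cores.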
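So a possible explanation is that ksoftirqd starts servicing an ARP request but is interrupted by a new request before it has finished answering. In that case it looks like bad design: it could be used to DoS containers. It would be better if new ARP requests were simply ignored while an ARP request is being processed.
Is that the explanation? Or is the cause something different? Is there a workaround (apart from the delay)?
Both server and client run Ubuntu 20.04.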
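For anyone trying to reproduce this, one further data point is the state of the neighbour table on the machine that sends the ARP requests (in the dumps above, the host owning the 172.2x.254.254 addresses). The sketch below tallies the states reported by `ip neigh show`; it assumes the state is the last word on each line, and the function name neigh_states is made up for illustration.

```shell
#!/bin/sh
# Tally neighbour-table entries by state (REACHABLE, STALE, FAILED, ...).
# Reads `ip neigh show` output from stdin and assumes the state is the
# last field of each line.
# Possible use, right after a failing run:
#   ip neigh show | neigh_states
neigh_states() {
    awk 'NF { count[$NF]++ } END { for (s in count) print s, count[s] }' | sort
}
```

A large number of FAILED or INCOMPLETE entries right after a failing run would be consistent with the unanswered ARP requests seen in the capture.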