SSH connection count increases and blocks data?

2024-5-29

We have a client-server setup in which the client sets up an SSH tunnel and uses port forwarding to send data to the server:

ssh -N -L 5000:localhost:5500 user@serveraddress

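The normal number of SSH connections on the server is ~150, and when everything is working the server software handles incoming connections very quickly (a few seconds at most).

Recently, however, we have noticed the number of SSH connections climb to 900+. At that point the server software sees connections being made to it and accepts them, but no data comes in.

Has anyone run into this symptom with SSH? Any ideas on what the problem might be?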

Server OS: Red Hat Linux 5.5
Firewall: Disabled
Key Exchange: Tested

Edit: adding some log data from /var/log/secure on the server side

There seem to be a lot of the following entries in the log file.

Apr 10 00:07:33 myserver sshd[15038]: fatal: Write failed: Connection timed out
Apr 10 00:12:01 myserver sshd[5259]: fatal: Read from socket failed: Connection reset by peer
Apr 10 00:44:48 myserver sshd[17026]: fatal: Write failed: No route to host
Apr 10 02:09:16 myserver sshd[10398]: fatal: Read from socket failed: Connection reset by peer
Apr 10 02:22:47 myserver sshd[24581]: fatal: Read from socket failed: Connection reset by peer
Apr 10 03:05:57 myserver sshd[12003]: fatal: Read from socket failed: Connection reset by peer
Apr 10 03:23:19 myserver sshd[22421]: fatal: Write failed: Connection timed out
Apr 10 08:13:43 myserver sshd[31993]: fatal: Read from socket failed: Connection reset by peer
Apr 10 08:36:39 myserver sshd[7759]: fatal: Read from socket failed: Connection reset by peer
Apr 10 09:02:32 myserver sshd[12470]: fatal: Write failed: Broken pipe
Apr 10 12:08:05 myserver sshd[728]: fatal: Write failed: Connection reset by peer
Apr 10 12:35:53 myserver sshd[6184]: fatal: Read from socket failed: Connection reset by peer
Apr 10 12:43:14 myserver sshd[2663]: fatal: Write failed: Connection timed out
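
To see which failure dominates, the fatal lines can be grouped by reason. The following is a minimal sketch, assuming Python 3 is available and that the log lines look like the excerpt above; `tally_fatal` and `FATAL_RE` are hypothetical names, not part of any existing tool.

```python
import re
from collections import Counter

# Matches lines shaped like the excerpt above, e.g.
#   ... sshd[15038]: fatal: Write failed: Connection timed out
# The pattern is an assumption based only on that excerpt.
FATAL_RE = re.compile(r"sshd\[\d+\]: fatal: (?P<op>[^:]+): (?P<reason>.+)$")


def tally_fatal(lines):
    """Count sshd 'fatal' lines by (operation, reason); other lines are ignored."""
    counts = Counter()
    for line in lines:
        match = FATAL_RE.search(line.strip())
        if match:
            counts[(match.group("op"), match.group("reason"))] += 1
    return counts


# Small demonstration using two lines copied from the excerpt above.
demo = [
    "Apr 10 00:07:33 myserver sshd[15038]: fatal: Write failed: Connection timed out",
    "Apr 10 00:12:01 myserver sshd[5259]: fatal: Read from socket failed: Connection reset by peer",
]
for (op, reason), n in tally_fatal(demo).most_common():
    print(n, op + ":", reason)
```

To run it against the real file, pass an open file object instead of the list, for example `tally_fatal(open("/var/log/secure"))`; reading that file usually requires root.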

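Note: with 900+ connections, after roughly 10-15 minutes the system recovers on its own - the connection count drops back to the normal range and the server starts getting data again. This sounds like a DoS/DDoS, but this is on an internal network.

Addendum: checked the connection states per @kranteg's question. We just had another outage, and here are the results from a script I wrote that covers all incoming SSH connections: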

===                                                        
Tue Apr 15 12:22:07 EDT 2014 -> Total SSH connections: 996 
===                                                        
0 SYN_SENT                                             
1 SYN_RECV                                             
0 FIN_WAIT1                                            
0 FIN_WAIT2                                            
15 TIME_WAIT                                            
0 CLOSED                                               
760 CLOSE_WAIT                                           
143 ESTABLISHED                                          
77 LAST_ACK                                             
0 LISTEN                                               
0 CLOSING                                              
0 UNKNOWN                                              
===                                                        
===
Tue Apr 15 12:22:17 EDT 2014 -> Total SSH connections: 977
===
0 SYN_SENT
2 SYN_RECV
1 FIN_WAIT1
0 FIN_WAIT2
15 TIME_WAIT
0 CLOSED
756 CLOSE_WAIT
127 ESTABLISHED
76 LAST_ACK
0 LISTEN
0 CLOSING
0 UNKNOWN
===
===
Tue Apr 15 12:22:26 EDT 2014 -> Total SSH connections: 979
===
0 SYN_SENT
2 SYN_RECV
1 FIN_WAIT1
0 FIN_WAIT2
12 TIME_WAIT
0 CLOSED
739 CLOSE_WAIT
148 ESTABLISHED
77 LAST_ACK
0 LISTEN
0 CLOSING
0 UNKNOWN
===

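It looks like there is a jump in the number of connections in CLOSE_WAIT. During "normal" operation, the number of connections in CLOSE_WAIT is 0 or very close to it.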
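
The script that produced the counts above is not shown. A per-state tally along those lines could be sketched as follows, assuming Python 3, sshd listening on port 22, and input text in the column layout printed by `netstat -tan` on Linux; `count_ssh_states` is a hypothetical helper, not the author's script.

```python
from collections import Counter


def count_ssh_states(netstat_output, port=22):
    """Tally TCP states of connections whose local port equals `port`.

    Assumes the column layout of `netstat -tan` on Linux:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a TCP row with a state column.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local_addr, state = fields[3], fields[5]
        # The listening socket is not an incoming connection, so leave it out.
        if state == "LISTEN":
            continue
        # The port is whatever follows the last ':' (also works for IPv6 rows).
        if local_addr.rsplit(":", 1)[-1] == str(port):
            counts[state] += 1
    return counts
```

One way to feed it live data is `subprocess.run(["netstat", "-tan"], capture_output=True, text=True).stdout`. A CLOSE_WAIT count that keeps growing means the remote side has already closed those connections while the local process still holds its sockets open.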

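Answer 1

I don't know if this is the right solution, but it worked for us. Hopefully it will at least point others in the right direction, even if it doesn't solve the problem completely.

We noticed that every time an outage occurred, processor usage was close to 100%. That in turn was caused by another application batch-processing some files and taking up most of the CPU. We shut that process down and have not had a single outage since. Honestly, I don't know whether this was the root cause, but it helped us.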

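Answer 2

It sounds like the client application that starts the tunnel may not be closing the connection properly once it has finished writing.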
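
If the client application is the suspect, one thing to check is whether it closes its socket to the local end of the tunnel (port 5000 in the question's ssh command) after each write. The client code is not shown, so this is only a sketch of what a clean close can look like, assuming a Python 3 client; `send_and_close` is a hypothetical function.

```python
import socket


def send_and_close(payload, host="localhost", port=5000, timeout=10.0):
    """Send `payload` to the local end of the tunnel, then close cleanly.

    The default port 5000 is the local forwarding port from the question's
    ssh command; the rest is an assumption for illustration.
    """
    # The with-block closes the socket even if sendall() raises.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
        # Tell the peer that no more data will follow before closing.
        sock.shutdown(socket.SHUT_WR)
```

The point of the pattern is that the socket is released on every code path, including errors, so connections are not left half-open by the client.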
