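We have a Windows Server 2012 R2 machine that hosts our website on IIS. We also have an Ubuntu 16.04 server running Nginx 1.10.3, which proxies incoming requests to the backend Windows server. Both servers run as virtual machines on ESXi.
We have noticed that our Windows server sometimes takes a very long time to send a SYN-ACK in response to an incoming SYN.
Below is the WinDump output captured on the Windows server (as you can see, Windows sends the corresponding SYN-ACK only after 63 seconds and 7 SYNs):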
11:26:59.080471 IP 192.168.20.129.41784 > 192.168.20.2.80: Flags [S], seq 3338047317, win 29200, options [mss 1460,sackOK,TS val 60011765 ecr 0,nop,wscale 7], length 0
11:27:00.075553 IP 192.168.20.129.41784 > 192.168.20.2.80: Flags [S], seq 3338047317, win 29200, options [mss 1460,sackOK,TS val 60012015 ecr 0,nop,wscale 7], length 0
11:27:02.078881 IP 192.168.20.129.41784 > 192.168.20.2.80: Flags [S], seq 3338047317, win 29200, options [mss 1460,sackOK,TS val 60012516 ecr 0,nop,wscale 7], length 0
11:27:06.086875 IP 192.168.20.129.41784 > 192.168.20.2.80: Flags [S], seq 3338047317, win 29200, options [mss 1460,sackOK,TS val 60013518 ecr 0,nop,wscale 7], length 0
11:27:14.094838 IP 192.168.20.129.41784 > 192.168.20.2.80: Flags [S], seq 3338047317, win 29200, options [mss 1460,sackOK,TS val 60015520 ecr 0,nop,wscale 7], length 0
11:27:30.126966 IP 192.168.20.129.41784 > 192.168.20.2.80: Flags [S], seq 3338047317, win 29200, options [mss 1460,sackOK,TS val 60019528 ecr 0,nop,wscale 7], length 0
11:28:02.224731 IP 192.168.20.129.41784 > 192.168.20.2.80: Flags [S], seq 3338047317, win 29200, options [mss 1460,sackOK,TS val 60027552 ecr 0,nop,wscale 7], length 0
11:28:02.224789 IP 192.168.20.2.80 > 192.168.20.129.41784: Flags [S.], seq 2819099122, ack 3338047318, win 8192, options [mss 1460,nop,wscale 8,sackOK,TS val 215763098 ecr 60027552], length 0
11:28:02.225363 IP 192.168.20.129.41784 > 192.168.20.2.80: Flags [.], ack 1, win 229, options [nop,nop,TS val 60027552 ecr 215763098], length 0
11:28:02.225900 IP 192.168.20.129.41784 > 192.168.20.2.80: Flags [P.], seq 1:76, ack 1, win 229, options [nop,nop,TS val 60027552 ecr 215763098], length 75: HTTP: GET /ping?id=141 HTTP/1.1[!http]
11:28:02.248577 IP 192.168.20.2.80 > 192.168.20.129.41784: Flags [FP.], seq 1:224, ack 76, win 260, options [nop,nop,TS val 215763100 ecr 60027552], length 223: HTTP: HTTP/1.1 200 OK
11:28:02.253096 IP 192.168.20.129.41784 > 192.168.20.2.80: Flags [F.], seq 76, ack 225, win 237, options [nop,nop,TS val 60027559 ecr 215763100], length 0
11:28:02.253144 IP 192.168.20.2.80 > 192.168.20.129.41784: Flags [.], ack 77, win 260, options [nop,nop,TS val 215763101 ecr 60027559], length 0
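To make the timing in the capture explicit, here is a minimal Python sketch that computes the gaps between the seven SYNs. The timestamps are copied from the trace above; the helper name `syn_gaps` is only illustrative. The gaps roughly double each time, which is consistent with the sender retransmitting its SYN with exponential backoff while the first six SYNs go unanswered. That reading is an interpretation of the trace, not something the capture proves on its own.

```python
from datetime import datetime

# SYN timestamps copied from the capture (192.168.20.129.41784 > 192.168.20.2.80).
syn_times = [
    "11:26:59.080471",
    "11:27:00.075553",
    "11:27:02.078881",
    "11:27:06.086875",
    "11:27:14.094838",
    "11:27:30.126966",
    "11:28:02.224731",
]


def syn_gaps(timestamps):
    """Return the delay in seconds between each pair of consecutive timestamps."""
    parsed = [datetime.strptime(t, "%H:%M:%S.%f") for t in timestamps]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(parsed, parsed[1:])]


gaps = syn_gaps(syn_times)
total = sum(gaps)

for number, gap in enumerate(gaps, start=1):
    print(f"retransmission {number}: {gap:.3f} s after the previous SYN")
print(f"first SYN to answered SYN: {total:.3f} s")
```

Rounded to whole seconds, the gaps should come out as 1, 2, 4, 8, 16 and 32, which add up to the 63 seconds mentioned above.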
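Strangely, if we change the source IP (via Nginx's proxy_bind) or the destination port (in IIS), response times improve dramatically.
How can we find out what is causing this behavior?
Update 1: We changed the TcpTimedWaitDelay parameter to 30 seconds. Things are much better now, but the problem still occurs.
Update 2: Here are the connection counts per state, as reported by netstat: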
64 CLOSE_WAIT
1371 ESTABLISHED
1 FIN_WAIT_1
51 LISTENING
3188 TIME_WAIT
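For anyone who wants to reproduce the tally above, here is a rough Python sketch that counts TCP states from saved `netstat -an` output. It assumes the usual Windows column layout (Proto, Local Address, Foreign Address, State); the sample rows and the function name `count_tcp_states` are made up for illustration and are not taken from our servers.

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Count TCP connections per state in Windows `netstat -an` style text."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # TCP rows look like: Proto, Local Address, Foreign Address, State.
        # UDP rows have no State column, and header lines do not start with "TCP".
        if len(fields) >= 4 and fields[0] == "TCP":
            counts[fields[3]] += 1
    return counts


# Hypothetical sample rows, only to show the expected input shape.
sample = """
Active Connections

  Proto  Local Address          Foreign Address        State
  TCP    0.0.0.0:80             0.0.0.0:0              LISTENING
  TCP    192.168.20.2:80        192.168.20.129:41784   TIME_WAIT
  TCP    192.168.20.2:80        192.168.20.129:41790   TIME_WAIT
  TCP    192.168.20.2:80        192.168.20.129:41802   ESTABLISHED
  UDP    0.0.0.0:123            *:*
"""

state_counts = count_tcp_states(sample)
for state, count in sorted(state_counts.items()):
    print(count, state)
```

Running this periodically against real output would show whether the TIME_WAIT count keeps climbing between the slow handshakes.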