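A quick bit of background: we keep getting reports from users about intermittent connection problems. Several times a day, users have to reload the page they are on because of a connection failure or an SSL handshake problem (which I believe is caused by the connection problems). It happens too quickly for me to collect any data during these events. It tends to go away by itself and then come back later, usually during peak traffic hours.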
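About our setup: we have three virtual IPs in round-robin DNS, managed by keepalived for our pool of app servers. nginx accepts the SSL connections and proxies them upstream to haproxy, which distributes them to the other app servers. Since these problems started I have updated all the software on the servers (including an upgrade from CentOS 5 to CentOS 6), but that did not help. I have posted information about our nginx configuration here, and it appears to be fine. It is mostly based on Mozilla's nginx configuration generator for SSL best practices.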
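Someone suggested that I look at the TCP statistics counters, but I am not sure how to interpret them. Here is the netstat -s output from an app server that was rebooted yesterday (so the counters were presumably at 0 yesterday):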
Ip:
1021579809 total packets received
4875 forwarded
0 incoming packets discarded
1021562810 incoming packets delivered
1033056732 requests sent out
1 outgoing packets dropped
76648 dropped because of missing route
2 fragments dropped after timeout
7072 reassemblies required
2020 packets reassembled ok
2 packet reassembles failed
1514 fragments received ok
6056 fragments created
Icmp:
20522423 ICMP messages received
533 input ICMP message failed.
ICMP input histogram:
destination unreachable: 20503410
timeout in transit: 2013
wrong parameters: 2
source quenches: 10
redirects: 8264
echo requests: 8256
echo replies: 2
20497056 ICMP messages sent
0 ICMP messages failed
ICMP output histogram:
destination unreachable: 20488798
time exceeded: 1
echo request: 1
echo replies: 8256
IcmpMsg:
InType0: 2
InType3: 20503410
InType4: 10
InType5: 8264
InType8: 8256
InType11: 2013
InType12: 2
OutType0: 8256
OutType3: 20488798
OutType8: 1
OutType11: 1
Tcp:
46263582 active connections openings
30767670 passive connection openings
104167 failed connection attempts
2769710 connection resets received
6428 connections established
979651572 segments received
989059642 segments send out
2386512 segments retransmited
1454 bad segments received.
4277435 resets sent
Udp:
32926 packets received
21204463 packets to unknown port received.
0 packet receive errors
21033739 packets sent
UdpLite:
TcpExt:
624791 invalid SYN cookies received
96083 resets received for embryonic SYN_RECV sockets
367 packets pruned from receive queue because of socket buffer overrun
54 ICMP packets dropped because they were out-of-window
21204114 TCP sockets finished time wait in fast timer
57674 packets rejects in established connections because of timestamp
38714053 delayed acks sent
12521 delayed acks further delayed because of locked socket
Quick ack mode was activated 6563499 times
62 times the listen queue of a socket overflowed
62 SYNs to LISTEN sockets ignored
74285057 packets directly queued to recvmsg prequeue.
554544554 packets directly received from backlog
34503032789 packets directly received from prequeue
336811743 packets header predicted
75957393 packets header predicted and directly queued to user
210355614 acknowledgments not containing data received
318977957 predicted acknowledgments
1663 times recovered from packet loss due to fast retransmit
181338 times recovered from packet loss due to SACK data
898 bad SACKs received
Detected reordering 1847 times using FACK
Detected reordering 3512 times using SACK
Detected reordering 40 times using reno fast retransmit
Detected reordering 16201 times using time stamp
46565 congestion windows fully recovered
49940 congestion windows partially recovered using Hoe heuristic
TCPDSACKUndo: 196240
204108 congestion windows recovered after partial ack
63640 TCP data loss events
TCPLostRetransmit: 4150
747 timeouts after reno fast retransmit
40359 timeouts after SACK recovery
24399 timeouts in loss state
286482 fast retransmits
71966 forward retransmits
317608 retransmits in slow start
802284 other TCP timeouts
TCPRenoRecoveryFail: 324
14820 sack retransmits failed
22966 packets collapsed in receive queue due to low socket buffer
6453991 DSACKs sent for old packets
1781 DSACKs sent for out of order packets
649408 DSACKs received
3047 DSACKs for out of order packets received
1733842 connections reset due to unexpected data
100890 connections reset due to early user close
98451 connections aborted due to timeout
TCPSACKDiscard: 446
TCPDSACKIgnoredOld: 4660
TCPDSACKIgnoredNoUndo: 161136
TCPSpuriousRTOs: 15474
TCPSackShifted: 296768
TCPSackMerged: 495277
TCPSackShiftFallback: 944017
TCPChallengeACK: 82577
TCPSYNChallenge: 287
TCPFromZeroWindowAdv: 5279
TCPToZeroWindowAdv: 5279
TCPWantZeroWindowAdv: 35932
IpExt:
InMcastPkts: 151429
OutMcastPkts: 151501
InOctets: 947588463980
OutOctets: 692622505019
InMcastOctets: 6360064
OutMcastOctets: 6060040
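Since the raw totals are hard to read on their own, one rough way to approach them is to turn a few into ratios. The sketch below does only that arithmetic on values copied from the Tcp: and TcpExt: sections above. The dictionary keys and the `percent` helper are names made up for this illustration, and the choice of which ratios to compute is an assumption, not an established diagnostic. Because the counters accumulate since boot, these are averages over roughly a day and will not show short bursts.

```python
# Values copied from the Tcp: and TcpExt: sections of the netstat -s output above.
# The key names are illustrative labels, not kernel counter names.
counters = {
    "active_opens": 46263582,            # active connections openings
    "passive_opens": 30767670,           # passive connection openings
    "failed_attempts": 104167,           # failed connection attempts
    "segments_sent": 989059642,          # segments send out
    "segments_retransmitted": 2386512,   # segments retransmited
    "listen_overflows": 62,              # times the listen queue of a socket overflowed
}


def percent(part, whole):
    """Return part/whole as a percentage, or 0.0 when whole is zero."""
    return 100.0 * part / whole if whole else 0.0


# Share of outgoing segments that were retransmissions.
retransmit_pct = percent(counters["segments_retransmitted"],
                         counters["segments_sent"])

# Failed attempts cover failures from both SYN-SENT and SYN-RECV,
# so they are compared against active plus passive opens.
total_opens = counters["active_opens"] + counters["passive_opens"]
failed_pct = percent(counters["failed_attempts"], total_opens)

# Listen queue overflows per million accepted (passive) connections.
overflow_per_million = 1e6 * counters["listen_overflows"] / counters["passive_opens"]

print(f"retransmitted segments: {retransmit_pct:.3f}% of segments sent")
print(f"failed attempts: {failed_pct:.3f}% of all connection openings")
print(f"listen queue overflows: {overflow_per_million:.2f} per million passive opens")
```

Ratios like these only give a baseline. Whether any of them is abnormal for this workload would have to be judged against a period when users were not seeing failures.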
Besides netstat -s, are there other tools I can use to get a better idea of what is going on here?
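Because the failures pass too quickly to catch by hand, one possible approach is to sample the kernel counters at a short interval and log only what changed. The sketch below reads /proc/net/snmp and /proc/net/netstat, the files that back the netstat -s totals on Linux, and prints per-interval deltas. The function names, the one-second interval, and the default choice of counters to watch are assumptions for illustration.

```python
import time


def parse_proc_net(text):
    """Parse /proc/net/snmp or /proc/net/netstat text into {"Proto.Name": int}."""
    counters = {}
    lines = text.strip().splitlines()
    # These files consist of line pairs: a row of counter names, then a row of values.
    for header, values in zip(lines[0::2], lines[1::2]):
        proto, names = header.split(":", 1)
        _, nums = values.split(":", 1)
        for name, num in zip(names.split(), nums.split()):
            counters[f"{proto}.{name}"] = int(num)
    return counters


def delta(before, after):
    """Return only the counters whose value changed between two snapshots."""
    return {
        key: after[key] - before[key]
        for key in after
        if key in before and after[key] != before[key]
    }


def snapshot(paths=("/proc/net/snmp", "/proc/net/netstat")):
    """Read and merge the current counters from the given proc files."""
    merged = {}
    for path in paths:
        with open(path) as fh:
            merged.update(parse_proc_net(fh.read()))
    return merged


def watch(interval=1.0,
          names=("Tcp.RetransSegs", "TcpExt.ListenOverflows", "TcpExt.ListenDrops")):
    """Print a timestamped line whenever one of the watched counters moves."""
    prev = snapshot()
    while True:
        time.sleep(interval)
        cur = snapshot()
        changed = delta(prev, cur)
        row = {name: changed[name] for name in names if name in changed}
        if row:
            print(time.strftime("%H:%M:%S"), row)
        prev = cur
```

Calling `watch()` on an app server and redirecting its output to a file would leave a timestamped record that can be lined up against user reports and the traffic graphs afterwards.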
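I have made a few changes recently (after these problems appeared). All other parameters are at their defaults, since I have not changed them.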
/proc/sys/net/ipv4/tcp_max_syn_backlog: 4096
/proc/sys/net/ipv4/tcp_fin_timeout: 30
/proc/sys/net/ipv4/ip_local_port_range: 15000 65000
/proc/sys/net/netfilter/nf_conntrack_max: 500000
Updated with the relevant graphs from Cacti: