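The nginx monitoring script (ZTC) intermittently fails to load the nginx test page, mostly when nginx is at peak load (about 2000 rps, running as a proxy). This produces errors such as "nginx is down" in zabbix, and one second later everything appears to be fine again.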
[NginxStatus] 2015-12-16 20:24:55,289 - ERROR: failed to load test page
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ztc/nginx/__init__.py", line 56, in _read_status
    u = urllib2.urlopen(url, None, 1)
  File "/usr/lib64/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib64/python2.6/urllib2.py", line 391, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.6/urllib2.py", line 409, in _open
    '_open', req)
  File "/usr/lib64/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 1190, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib64/python2.6/urllib2.py", line 1165, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
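The traceback shows the check calling urllib2.urlopen(url, None, 1), that is, with a 1-second timeout. To see how close to that limit the status page really is under load, the same probe can be repeated with timing outside of the monitoring script. This is only a minimal sketch in Python 3 (the original script runs on Python 2.6), and the probe() helper and its fetch parameter are hypothetical names introduced for illustration, not part of ZTC.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=1.0, attempts=3, fetch=None):
    """Fetch `url` up to `attempts` times.

    Returns a list of (ok, elapsed_seconds, error_text) tuples, one per attempt.
    `fetch` is a hypothetical hook so the network call can be replaced in tests.
    """
    if fetch is None:
        def fetch(u, t):
            # Same call shape as in the traceback: urlopen(url, None, timeout).
            with urllib.request.urlopen(u, None, t) as resp:
                return resp.read()

    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            fetch(url, timeout)
            results.append((True, time.monotonic() - start, None))
        except (urllib.error.URLError, OSError) as exc:
            # Connect and read timeouts both end up here.
            results.append((False, time.monotonic() - start, str(exc)))
    return results
```

Calling probe() against the status URL (whatever address the stub_status page is served on, which is not shown in this post) with a larger timeout, such as 5 seconds, would show whether the request is merely slow or never completes at all.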
Since it only happens at peak load (about 2000 rps), I suspect that some kernel parameters are causing it.
Here is the nginx config:
user nginx;
worker_processes 4;
timer_resolution 100ms;
worker_priority -15;
worker_rlimit_nofile 200000;
error_log /var/log/nginx/error.log;
pid /var/run/nginx.pid;
events {
worker_connections 65536;
use epoll;
multi_accept on;
}
http {
include /etc/nginx/mime.types;
default_type application/octet-stream;
server_tokens off;
access_log /var/log/nginx/access.log;
sendfile on;
tcp_nopush on;
tcp_nodelay on;
# keepalive_requests 120;
# keepalive_timeout 65;
gzip on;
gzip_http_version 1.0;
gzip_comp_level 2;
gzip_proxied any;
gzip_vary off;
gzip_types text/plain text/css application/x-javascript text/xml application/xml application/rss+xml application/atom+xml text/javascript application/javascript application/json text/mathml;
gzip_min_length 1000;
gzip_disable "MSIE [1-6]\.";
variables_hash_max_size 1024;
variables_hash_bucket_size 64;
server_names_hash_bucket_size 64;
types_hash_max_size 2048;
types_hash_bucket_size 64;
include /etc/nginx/conf.d/*.conf;
include /etc/nginx/sites-enabled/*;
}
Here is sysctl.conf:
net.ipv4.conf.all.secure_redirects=0
net.ipv4.conf.all.send_redirects=0
net.ipv4.tcp_max_syn_backlog=20480
net.ipv4.tcp_synack_retries=2
net.ipv4.tcp_rmem=4096 87380 16777216
net.ipv4.tcp_wmem=4096 65536 16777216
net.netfilter.nf_conntrack_max=1048576
net.nf_conntrack_max=1048576
net.ipv4.tcp_no_metrics_save=1
net.ipv4.tcp_tw_reuse=1
net.core.somaxconn=15000
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.ipv4.tcp_keepalive_time=60
net.ipv4.tcp_keepalive_intvl=15
net.ipv4.tcp_keepalive_probes=5
net.ipv4.tcp_max_tw_buckets=720000
net.ipv4.tcp_tw_recycle=1
net.ipv4.tcp_timestamps=1
net.ipv4.tcp_fin_timeout=30
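Before reasoning about individual values, it may be worth ruling out that the file and the running kernel disagree. The sketch below parses sysctl.conf-style text and compares each key with the value under /proc/sys. The helper names (parse_sysctl, proc_path, diff_live) are hypothetical, and the key-to-path mapping assumes no dots inside interface names, which holds for the keys listed above.

```python
def parse_sysctl(text):
    """Parse `key=value` lines, skipping blanks and comments, into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue
        # Collapse whitespace so "4096 87380 16777216" compares equal to
        # the tab-separated form that /proc/sys reports.
        settings[key.strip()] = " ".join(value.split())
    return settings


def proc_path(key):
    """Map a sysctl key to its /proc/sys path (dots become slashes)."""
    return "/proc/sys/" + key.replace(".", "/")


def _read_proc(key):
    try:
        with open(proc_path(key)) as handle:
            return " ".join(handle.read().split())
    except OSError:
        return None  # key not present, e.g. the module is not loaded


def diff_live(settings, read=_read_proc):
    """Return {key: (wanted, live)} for every key whose live value differs."""
    mismatches = {}
    for key, wanted in settings.items():
        live = read(key)
        if live != wanted:
            mismatches[key] = (wanted, live)
    return mismatches
```

Running diff_live(parse_sysctl(open("/etc/sysctl.conf").read())) on the proxy host should return an empty dict if every setting above is actually in effect.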
netstat output:
netstat -an | grep -e :80 -e :443 |awk '/^tcp/ {A[$(NF)]++} END {for (I in A) {printf "%5d %s\n", A[I], I}}'
18525 TIME_WAIT
1 CLOSE_WAIT
499 FIN_WAIT1
1544 FIN_WAIT2
33311 ESTABLISHED
563 SYN_RECV
7 CLOSING
294 LAST_ACK
3 LISTEN
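For sampling these counts repeatedly (for example once per second around the time of a failed check), the grep/awk pipeline above can be mirrored in Python. This is a minimal sketch: count_states is a hypothetical helper, and it deliberately keeps the same loose substring match as `grep -e :80 -e :443`, so ports such as :8080 are also counted.

```python
from collections import Counter


def count_states(netstat_output, ports=(":80", ":443")):
    """Count TCP connection states in `netstat -an` text.

    Mirrors the pipeline above: keep lines that contain one of `ports`,
    keep only lines starting with "tcp", and tally the last field.
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        if not line.startswith("tcp"):
            continue
        if not any(port in line for port in ports):
            continue
        fields = line.split()
        if fields:
            counts[fields[-1]] += 1
    return counts
```

The function takes the text of `netstat -an` as input, for instance as captured with subprocess.check_output(["netstat", "-an"], text=True), and returns a Counter keyed by state name.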
What is the root cause of this? Are these netstat numbers abnormal for 2000 rps? Is there a mistake in my sysctl.conf that is causing the problem?