我们的 nginx(1.1.19 ubuntu 12.04 lts)每天都会停顿几秒(迄今为止最长的一次是 53 秒)并等待提供数据。客户端没有错误,请求只是需要更长的停顿时间。这适用于全部在此时间段内发出请求(cgi 或状态模块),并且所有请求将在停顿结束后立即得到处理。
我在服务器上有一个屏幕会话,每秒卷动一次状态页面:
{"time":"2016-09-02T10:10:21+02:00","host":"app1","data":"nginx","reading":1,"writing":4,"waiting":0}
{"time":"2016-09-02T10:10:22+02:00","host":"app1","data":"nginx","reading":1,"writing":4,"waiting":0}
{"time":"2016-09-02T10:10:50+02:00","host":"app1","data":"nginx","reading":3,"writing":9,"waiting":0}
{"time":"2016-09-02T10:11:43+02:00","host":"app1","data":"nginx","reading":5,"writing":98,"waiting":0}
{"time":"2016-09-02T10:11:44+02:00","host":"app1","data":"nginx","reading":0,"writing":25,"waiting":0}
{"time":"2016-09-02T10:11:45+02:00","host":"app1","data":"nginx","reading":3,"writing":7,"waiting":0}
间隙不是由于错误造成的,而是请求的外部持续时间比正常情况要长。访问日志中没有记录任何错误请求。您只能注意到间隙。
127.0.0.1 - - [02/Sep/2016:10:10:17 +0200] "GET /basic_status HTTP/1.1" 200 121 "-" "curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3"
127.0.0.1 - - [02/Sep/2016:10:10:18 +0200] "GET /basic_status HTTP/1.1" 200 121 "-" "curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3"
127.0.0.1 - - [02/Sep/2016:10:10:20 +0200] "GET /basic_status HTTP/1.1" 200 121 "-" "curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3"
127.0.0.1 - - [02/Sep/2016:10:10:21 +0200] "GET /basic_status HTTP/1.1" 200 121 "-" "curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3"
127.0.0.1 - - [02/Sep/2016:10:10:22 +0200] "GET /basic_status HTTP/1.1" 200 121 "-" "curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3"
我检查了 nginx 错误日志和 fpm 日志等,但此时没有错误。
user www-data;
worker_processes 4;
pid /var/run/nginx.pid;
events {
worker_connections 768;
}
http {
sendfile on;
tcp_nopush on;
tcp_nodelay on;
keepalive_timeout 65;
types_hash_max_size 2048;
include /etc/nginx/mime.types;
default_type application/octet-stream;
large_client_header_buffers 4 80k;
access_log /var/log/nginx/access.log;
error_log /var/log/nginx/error.log;
gzip on;
gzip_disable "msie6";
include /etc/nginx/conf.d/*.conf;
include /etc/nginx/sites-enabled/*;
}
server {
listen 80;
server_name app1;
access_log /var/log/nginx/access.log;
error_log /var/log/nginx/error.log;
root /var/www;
location /basic_status {
stub_status on;
}
}
我也记录了 TIME_WAIT-Quadruples 的数量,但也没有记录很高的数字:
{"time":"2016-09-02T10:10:54+02:00","host":"app1","data":"time_wait","httpFromLB":1153,"httpFromLocal":604,"mysqlToDb":1988,"memcacheToLocal":250, "cgiToLocal":1527}
{"time":"2016-09-02T10:10:55+02:00","host":"app1","data":"time_wait","httpFromLB":1153,"httpFromLocal":604,"mysqlToDb":1991,"memcacheToLocal":251, "cgiToLocal":1527}
{"time":"2016-09-02T10:10:56+02:00","host":"app1","data":"time_wait","httpFromLB":1153,"httpFromLocal":604,"mysqlToDb":1992,"memcacheToLocal":252, "cgiToLocal":1527}
{"time":"2016-09-02T10:10:57+02:00","host":"app1","data":"time_wait","httpFromLB":902,"httpFromLocal":496,"mysqlToDb":1628,"memcacheToLocal":213, "cgiToLocal":1236}
{"time":"2016-09-02T10:10:58+02:00","host":"app1","data":"time_wait","httpFromLB":902,"httpFromLocal":496,"mysqlToDb":1629,"memcacheToLocal":214, "cgiToLocal":1236}
{"time":"2016-09-02T10:10:59+02:00","host":"app1","data":"time_wait","httpFromLB":902,"httpFromLocal":496,"mysqlToDb":1631,"memcacheToLocal":215, "cgiToLocal":1236}
{"time":"2016-09-02T10:11:00+02:00","host":"app1","data":"time_wait","httpFromLB":902,"httpFromLocal":496,"mysqlToDb":1632,"memcacheToLocal":216, "cgiToLocal":1236}
{"time":"2016-09-02T10:11:01+02:00","host":"app1","data":"time_wait","httpFromLB":902,"httpFromLocal":496,"mysqlToDb":1633,"memcacheToLocal":217, "cgiToLocal":1236}
{"time":"2016-09-02T10:11:03+02:00","host":"app1","data":"time_wait","httpFromLB":902,"httpFromLocal":496,"mysqlToDb":1636,"memcacheToLocal":218, "cgiToLocal":1236}
{"time":"2016-09-02T10:11:04+02:00","host":"app1","data":"time_wait","httpFromLB":902,"httpFromLocal":496,"mysqlToDb":1637,"memcacheToLocal":219, "cgiToLocal":1236}
我将应用程序本身排除在调查之外,因为第一行代码将记录时间戳,并且它是停顿结束时的时间。
我不知道该去哪里进一步调查。有什么想法吗?