I have a web server container running in App Engine that serves a REST API. I went with a fairly standard setup: nginx + PHP-FPM over a TCP socket (for some reason I could not get a unix socket to work). The DB connection is also a TCP socket, routed through a Google Cloud VPN.
Problem
My API availability is around 25%. Requests usually end in a 504 Gateway Timeout after the maximum time (App Engine's nginx proxy is set to 60 seconds). A 502 Bad Gateway shows up occasionally when PHP-FPM itself times out (request_terminate_timeout).
I'm trying to work out whether this comes from a badly configured App Engine nginx proxy, my own nginx, or my PHP-FPM configuration. Nginx should be closing the upstream sockets or reusing them, but it doesn't seem to be doing either.
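Since there are three timeouts stacked here (the App Engine proxy at 60 seconds, nginx's own fastcgi_read_timeout, which defaults to 60 seconds, and PHP-FPM's request_terminate_timeout = 45), a single timed request is a quick way to see which layer is answering. A rough probe (hostname taken from the nginx logs below, scheme assumed):
# prints the status code and total time: a 504 at ~60 s points at a proxy/nginx timeout,
# a 502 at ~45-55 s points at PHP-FPM killing the worker mid-request
curl -o /dev/null -s -w '%{http_code} %{time_total}\n' https://api.example.com/path/to/rest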
Client side
When I run siege against any given endpoint (25 users) for a few minutes, I see:
HTTP/1.1 504 60.88 secs: 176 bytes ==> GET /path/to/rest
...15 lines...
HTTP/1.1 504 61.23 secs: 176 bytes ==> GET /path/to/rest
HTTP/1.1 200 57.54 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 57.68 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 504 60.76 secs: 176 bytes ==> GET /path/to/rest
...15 lines...
HTTP/1.1 504 61.06 secs: 176 bytes ==> GET /path/to/rest
HTTP/1.1 200 33.35 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 32.97 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 36.61 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 39.00 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 42.47 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 48.51 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 56.45 secs: 13143 bytes ==> GET /path/to/rest
# Another run
HTTP/1.1 200 7.65 secs: 13143 bytes ==> GET /path/to/rest
...10 lines...
HTTP/1.1 200 8.20 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 47.15 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 47.15 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 8.30 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 504 61.15 secs: 176 bytes ==> GET /path/to/rest
HTTP/1.1 502 54.46 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 54.33 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 54.25 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 53.63 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 48.40 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.31 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 6.97 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.27 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.26 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 54.99 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 60.08 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 60.56 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 6.83 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 60.85 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 59.99 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 58.99 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 52.40 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 52.21 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 59.61 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 52.65 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.13 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 6.96 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.48 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.81 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 6.89 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 59.26 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 6.80 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 59.44 secs: 166 bytes ==> GET /path/to/rest
The same thing happens with just 1 user:
HTTP/1.1 502 55.43 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.71 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.54 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 59.21 secs: 166 bytes ==> GET /path/to/rest
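For reference, the load above was generated with something along these lines (flags are approximate: 25 concurrent users for a couple of minutes, and -c 1 for the single-user run):
siege -c 25 -t 2M https://api.example.com/path/to/rest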
Server side
Nginx logs for each case:
# 200
Normal logging i.e. [notice] GET /path/to/rest (param1, param2) ...
# 502
[error] 1059#0: *1395 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.3, server: gaeapp, request: "GET /path/to/rest HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "api.example.com"
# 504
[error] 34#0: *326 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 172.18.0.3, server: gaeapp, request: "GET /path/to/rest HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "api.example.com"
netstat -t looks like this:
# Before starting
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:33971 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34072 ESTABLISHED
# During the siege
tcp 0 0 localhost:56144 localhost:9000 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34565 ESTABLISHED
tcp 0 0 5c2ad0938ce9:53073 192.168.2.29:postgresql ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:33971 ESTABLISHED
tcp 0 0 localhost:56148 localhost:9000 ESTABLISHED
tcp 0 0 5c2ad0938ce9:53071 192.168.2.29:postgresql ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34580 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34584 ESTABLISHED
tcp 0 0 localhost:56106 localhost:9000 ESTABLISHED
tcp 0 0 localhost:56191 localhost:9000 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34566 ESTABLISHED
tcp 0 0 localhost:56113 localhost:9000 ESTABLISHED
tcp 0 0 localhost:56150 localhost:9000 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34591 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34574 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34072 ESTABLISHED
tcp 0 0 5c2ad0938ce9:53102 192.168.2.29:postgresql ESTABLISHED
tcp 0 0 5c2ad0938ce9:53051 192.168.2.29:postgresql ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34572 ESTABLISHED
tcp 8 0 localhost:9000 localhost:56146 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56117 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56179 ESTABLISHED
tcp 8 0 localhost:9000 localhost:56160 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56168 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56170 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56111 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56115 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56123 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56109 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56113 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56140 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56181 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56121 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56191 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56119 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56142 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56106 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56110 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56144 ESTABLISHED
tcp 8 0 localhost:9000 localhost:56148 ESTABLISHED
tcp 8 0 localhost:9000 localhost:56150 ESTABLISHED
# A minute or so after ending the siege
tcp 0 0 5c2ad0938ce9:53319 192.168.2.29:postgresql ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34578 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34576 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34570 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34565 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:33971 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34580 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34584 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34566 ESTABLISHED
tcp 0 0 localhost:56396 localhost:9000 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34591 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34574 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34072 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34572 ESTABLISHED
tcp 8 0 localhost:9000 localhost:56396 ESTABLISHED
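The lines with a non-zero Recv-Q on the FPM side (tcp 8 0 localhost:9000 ...) are connections that PHP-FPM has accepted but whose request bytes it never reads, which is consistent with all ondemand workers being busy or stuck. A quick way to watch this distribution during a siege (using ss, which reports the same sockets as the netstat output above):
# count connections to the PHP-FPM port grouped by TCP state
ss -tan | grep ':9000' | awk '{print $1}' | sort | uniq -c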
Configuration
nginx.conf
user www-data;
worker_processes auto;
worker_cpu_affinity auto;

events {
    worker_connections 512;
}

http {
    server_tokens off;
    fastcgi_ignore_client_abort off;
    keepalive_timeout 650;
    keepalive_requests 10000;
    gzip on;
    ..more gzip settings..

    server {
        charset utf-8;
        client_max_body_size 512M;
        listen 8080;
        rewrite_log on;
        root /app/web;
        index index.php;

        location / {
            try_files $uri /index.php?$args;
        }

        location ~ \.php$ {
            fastcgi_pass 127.0.0.1:9000;
            include /etc/nginx/fastcgi_params;
            fastcgi_keep_conn off;
            fastcgi_param SCRIPT_FILENAME $document_root/$fastcgi_script_name;
        }
    }

    include /etc/nginx/conf.d/*.conf;  # There are no extra conf files
}
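The config above sets no FastCGI timeouts, so nginx is running with the defaults; the 504 entries in the log are its fastcgi_read_timeout (60 s by default) expiring alongside the 60-second App Engine proxy limit. If the goal were to fail before the outer limit, the relevant directives would go inside the PHP location block, something like this (values are illustrative, not a recommendation):
fastcgi_connect_timeout 5s;   # give up quickly if PHP-FPM stops accepting connections
fastcgi_read_timeout 50s;     # return an error before the 60 s App Engine proxy limit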
php-fpm (www.conf)
[www]
user = www-data
group = www-data
listen = 127.0.0.1:9000
pm = ondemand
pm.process_idle_timeout = 10s
request_terminate_timeout = 45
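One thing worth noting: with pm = ondemand the pool size is capped by pm.max_children, which is not shown here, and workers are only spawned when traffic arrives. To test the stuck-worker theory, a more observable pool could look like the following (numbers are illustrative and the slowlog path is an assumption):
pm = dynamic
pm.max_children = 16
pm.start_servers = 4
pm.min_spare_servers = 2
pm.max_spare_servers = 8
; log a backtrace for any request running longer than 10 s, to see where workers hang
slowlog = /var/log/php-fpm/www-slow.log
request_slowlog_timeout = 10s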
More information
Disabling keepalive is not a good idea, because App Engine constantly polls the container with health checks, which creates a pile of dead TIME_WAIT sockets (I tried it).
Before request_terminate_timeout was set there were lots of CLOSE_WAIT sockets instead of TIME_WAIT. Setting request_terminate_timeout = 45 does help in a sense: the worker processes get killed and serve 200s again once they respawn. A lower terminate timeout just produces more 502s and fewer 504s.
process_idle_timeout is ignored, since the sockets are technically not idle.
Setting fastcgi_keep_conn on had no measurable effect on nginx's behaviour.
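For what it's worth, the TIME_WAIT and CLOSE_WAIT observations above are easy to quantify with a one-liner against the same netstat output:
# count TCP connections per state to watch TIME_WAIT / CLOSE_WAIT build-up
netstat -tan | awk 'NR > 2 {print $6}' | sort | uniq -c | sort -rn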
Answer 1
It turned out that the problem was with the container configuration, not the application. After setting the MTU to a value that fits the Google Cloud network (dropping it from 1500 to 1430), querying the application no longer showed any problems.
This was found by isolating the problem to requests that open a socket to the database over the Google Cloud VPN (see the postgresql entries in the netstat output). We happened to have a VPN routed to a second VPN whose database connections worked fine, because only the first hop carried the oversized (high-MTU) traffic.
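For anyone hitting the same symptom, here are two common ways to apply the lower MTU, plus a verification ping. These are sketches and depend on how much control you have over the host and container; the interface name eth0 and the daemon.json path are assumptions:
# Option 1: on a Docker host, set the MTU for container networks in /etc/docker/daemon.json
{ "mtu": 1430 }
# Option 2: set it directly on the interface (requires the right privileges inside the container/VM)
ip link set dev eth0 mtu 1430
# Verify the usable path MTU towards the DB: 1402 bytes of ICMP payload + 28 bytes of headers = 1430
ping -M do -s 1402 -c 3 192.168.2.29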