PHP-FPM socket reuse causing errors

I have a web server container running in App Engine that serves a REST API. I went with a fairly standard setup: nginx + PHP-FPM over a TCP socket (for some reason I couldn't get a unix socket to work). The DB connection is also a TCP socket, running through a Google Cloud VPN.
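(For the record, the unix-socket variant that didn't work for me would look roughly like this; the socket path is an assumption:)

# php-fpm www.conf: listen on a unix socket instead of TCP
listen = /var/run/php-fpm.sock
listen.owner = www-data
listen.group = www-data

# nginx: point fastcgi_pass at the same socket
fastcgi_pass unix:/var/run/php-fpm.sock;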

The problem

My API availability is around 25%. Typically a request finishes with a 504 Gateway Timeout after the maximum time (App Engine's nginx proxy is set to 60 seconds). Sometimes a 502 Bad Gateway occurs instead, when PHP-FPM hits its request_terminate_timeout.

I am trying to figure out whether this is caused by a misconfigured App Engine nginx, by my own nginx, or by my PHP-FPM configuration. Nginx should either close the sockets or reuse them, but it appears to do neither.

Client side

When I run siege against any given endpoint (25 users) for a few minutes, I see:

HTTP/1.1 504    60.88 secs:     176 bytes ==> GET  /path/to/rest
                       ...15 lines...
HTTP/1.1 504    61.23 secs:     176 bytes ==> GET  /path/to/rest

HTTP/1.1 200    57.54 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200    57.68 secs:   13143 bytes ==> GET  /path/to/rest

HTTP/1.1 504    60.76 secs:     176 bytes ==> GET  /path/to/rest
                       ...15 lines...
HTTP/1.1 504    61.06 secs:     176 bytes ==> GET  /path/to/rest

HTTP/1.1 200    33.35 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200    32.97 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200    36.61 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200    39.00 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200    42.47 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200    48.51 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200    56.45 secs:   13143 bytes ==> GET  /path/to/rest

# Another run
HTTP/1.1 200     7.65 secs:   13143 bytes ==> GET  /path/to/rest
                       ...10 lines...
HTTP/1.1 200     8.20 secs:   13143 bytes ==> GET  /path/to/rest

HTTP/1.1 502    47.15 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    47.15 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 200     8.30 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 504    61.15 secs:     176 bytes ==> GET  /path/to/rest
HTTP/1.1 502    54.46 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    54.33 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    54.25 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    53.63 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    48.40 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 200     7.31 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200     6.97 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200     7.27 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200     7.26 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 502    54.99 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    60.08 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    60.56 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 200     6.83 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 502    60.85 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    59.99 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    58.99 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    52.40 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    52.21 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    59.61 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    52.65 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 200     7.13 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200     6.96 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200     7.48 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200     7.81 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200     6.89 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 502    59.26 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 200     6.80 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 502    59.44 secs:     166 bytes ==> GET  /path/to/rest

This also happens with just 1 user:

HTTP/1.1 502    55.43 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 200     7.71 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200     7.54 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 502    59.21 secs:     166 bytes ==> GET  /path/to/rest
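For reference, the runs above come from invocations along these lines (a sketch; the exact siege flags aren't shown here):

# 25 concurrent users, run for a few minutes
siege -c 25 -t 3M https://api.example.com/path/to/rest

# the single-user run
siege -c 1 -t 3M https://api.example.com/path/to/rest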

Server side

Nginx logs for each case:

# 200
Normal logging i.e. [notice] GET /path/to/rest (param1, param2) ...

# 502
[error] 1059#0: *1395 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.3, server: gaeapp, request: "GET /path/to/rest HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "api.example.com"

# 504
[error] 34#0: *326 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 172.18.0.3, server: gaeapp, request: "GET /path/to/rest HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "api.example.com"

Here is what netstat -t looks like:

# Before starting
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:33971        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34072        ESTABLISHED 

# During the siege    
tcp        0      0 localhost:56144         localhost:9000          ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34565        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:53073      192.168.2.29:postgresql ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:33971        ESTABLISHED 
tcp        0      0 localhost:56148         localhost:9000          ESTABLISHED 
tcp        0      0 5c2ad0938ce9:53071      192.168.2.29:postgresql ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34580        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34584        ESTABLISHED 
tcp        0      0 localhost:56106         localhost:9000          ESTABLISHED 
tcp        0      0 localhost:56191         localhost:9000          ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34566        ESTABLISHED 
tcp        0      0 localhost:56113         localhost:9000          ESTABLISHED 
tcp        0      0 localhost:56150         localhost:9000          ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34591        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34574        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34072        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:53102      192.168.2.29:postgresql ESTABLISHED 
tcp        0      0 5c2ad0938ce9:53051      192.168.2.29:postgresql ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34572        ESTABLISHED 
tcp        8      0 localhost:9000          localhost:56146         ESTABLISHED 
tcp        0      0 localhost:9000          localhost:56117         TIME_WAIT   
tcp        8      0 localhost:9000          localhost:56179         ESTABLISHED 
tcp        8      0 localhost:9000          localhost:56160         ESTABLISHED 
tcp        0      0 localhost:9000          localhost:56168         TIME_WAIT   
tcp        0      0 localhost:9000          localhost:56170         TIME_WAIT   
tcp        0      0 localhost:9000          localhost:56111         TIME_WAIT   
tcp        0      0 localhost:9000          localhost:56115         TIME_WAIT   
tcp        8      0 localhost:9000          localhost:56123         ESTABLISHED 
tcp        0      0 localhost:9000          localhost:56109         TIME_WAIT   
tcp        8      0 localhost:9000          localhost:56113         ESTABLISHED 
tcp        0      0 localhost:9000          localhost:56140         TIME_WAIT   
tcp        0      0 localhost:9000          localhost:56181         TIME_WAIT   
tcp        0      0 localhost:9000          localhost:56121         TIME_WAIT   
tcp        8      0 localhost:9000          localhost:56191         ESTABLISHED 
tcp        0      0 localhost:9000          localhost:56119         TIME_WAIT   
tcp        0      0 localhost:9000          localhost:56142         TIME_WAIT   
tcp        8      0 localhost:9000          localhost:56106         ESTABLISHED 
tcp        0      0 localhost:9000          localhost:56110         TIME_WAIT   
tcp        8      0 localhost:9000          localhost:56144         ESTABLISHED 
tcp        8      0 localhost:9000          localhost:56148         ESTABLISHED 
tcp        8      0 localhost:9000          localhost:56150         ESTABLISHED 

# A minute or so after ending the siege
tcp        0      0 5c2ad0938ce9:53319      192.168.2.29:postgresql ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34578        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34576        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34570        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34565        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:33971        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34580        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34584        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34566        ESTABLISHED 
tcp        0      0 localhost:56396         localhost:9000          ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34591        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34574        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34072        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34572        ESTABLISHED 
tcp        8      0 localhost:9000          localhost:56396         ESTABLISHED 
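Note the Recv-Q of 8 on several of the nginx-to-FPM connections above: bytes have reached the FPM side of the socket, but no worker has ever read them, which matches connections that were accepted and then never served. One way to watch just those sockets (a sketch using ss, which reports the same Recv-Q/Send-Q columns as netstat):

# all TCP sockets involving the FPM port, with their queue depths
ss -tan '( sport = :9000 or dport = :9000 )'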

Configuration

nginx.conf

user  www-data;
worker_processes  auto;
worker_cpu_affinity auto;
events {
    worker_connections  512;
}
http {
    server_tokens   off;
    fastcgi_ignore_client_abort off;
    keepalive_timeout  650;
    keepalive_requests 10000;
    gzip on;
    ..more gzip settings..
    server {
        charset utf-8;
        client_max_body_size 512M;
        listen 8080;
        rewrite_log on;
        root        /app/web;
        index       index.php;
        location / {
            try_files $uri /index.php?$args;
        }
        location ~ \.php$ {
            fastcgi_pass 127.0.0.1:9000;
            include /etc/nginx/fastcgi_params;
            fastcgi_keep_conn off;
            fastcgi_param SCRIPT_FILENAME $document_root/$fastcgi_script_name;
        }
    }
    include /etc/nginx/conf.d/*.conf; # There are no extra conf files
}
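While debugging, it also helps to make the FastCGI timeouts explicit. These directives are stock nginx; the values below are assumptions, chosen to fire before App Engine's 60-second proxy timeout so that nginx, not the proxy, reports the failure:

location ~ \.php$ {
    fastcgi_pass 127.0.0.1:9000;
    fastcgi_connect_timeout 5s;   # fail fast if FPM never accepts the connection
    fastcgi_read_timeout    55s;  # give up before the 60 s proxy timeout
    fastcgi_send_timeout    55s;
}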

php-fpm (www.conf)

[www]
user = www-data
group = www-data
listen = 127.0.0.1:9000
pm = ondemand
pm.process_idle_timeout = 10s
request_terminate_timeout = 45
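Given the symptoms (workers accepting connections and then hanging until they are killed), a fixed pool is one variant worth trying in place of pm = ondemand; the pool size below is an assumption:

[www]
user = www-data
group = www-data
listen = 127.0.0.1:9000
pm = static
pm.max_children = 10    ; fixed worker count, sized to the container's RAM
request_terminate_timeout = 45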

More information

Disabling keepalive is not a good idea, because App Engine constantly polls the container with health checks, which piles up dead TIME_WAIT sockets (I tried it).
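A quick way to see how fast those pile up (a sketch; http-alt in the netstat output above is port 8080):

# count TIME_WAIT sockets left behind on the health-check port
ss -tn state time-wait '( sport = :8080 )' | wc -l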

Before I set request_terminate_timeout, there were lots of CLOSE_WAIT sockets instead of TIME_WAIT. Setting request_terminate_timeout = 45 does help in a sense: the worker gets killed and serves 200s again after respawning. A lower terminate timeout just produces more 502s and fewer 504s.

pm.process_idle_timeout is ignored, since the sockets are not technically idle.

Setting fastcgi_keep_conn on had no measurable effect on nginx's behaviour.
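That is expected in hindsight: fastcgi_keep_conn on only reuses upstream connections when fastcgi_pass points at an upstream block that has a keepalive directive; with a bare address, nginx still opens a fresh connection per request. A sketch of the pooled variant (the pool size is an assumption):

upstream php_fpm {
    server 127.0.0.1:9000;
    keepalive 8;              # idle connections nginx keeps open to FPM
}

server {
    ...
    location ~ \.php$ {
        fastcgi_pass php_fpm;
        fastcgi_keep_conn on; # required, or nginx closes the connection anyway
        include /etc/nginx/fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root/$fastcgi_script_name;
    }
}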

Answer 1

It turned out the problem was related to the container configuration, not the application. After setting the MTU to a value suited to Google Cloud's network (lowering it from 1500 to 1430), there were no longer any problems querying the application.

This was discovered by isolating the problem to requests that open a socket to the database through the Google Cloud VPN (see the postgresql entries in the netstat output). We happened to have one VPN routed to a second VPN whose database connections worked fine, because only the first hop carried the high-MTU traffic.
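In concrete terms the fix is just lowering the MTU on the container's network; two sketches (the interface name and the Docker-level setting are assumptions about this particular setup):

# one-off, on the container's interface
ip link set dev eth0 mtu 1430

# or persistently for Docker-managed networks, in /etc/docker/daemon.json
{
  "mtu": 1430
}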
