Nginx as a load balancer: frequent "upstream timed out (110: Connection timed out) while connecting to upstream"

2024-5-30

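I am trying to use nginx on a CentOS 7 virtual machine as a load balancer to replace an aging Coyote Point hardware appliance. However, for one of our web applications we are seeing frequent and persistent upstream timeout errors in the logs, and clients are reporting session problems while using the system.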

Here are the relevant parts of nginx.conf:

user  nginx;
worker_processes  4;

error_log  /var/log/nginx/error.log warn;
pid        /var/run/nginx.pid;


events {
    worker_connections  1024;
}

upstream farm {
    ip_hash;

    server www1.domain.com:8080;
    server www2.domain.com:8080 down;
    server www3.domain.com:8080;
    server www4.domain.com:8080;
}

server {
    listen 192.168.1.87:80;
    server_name host.domain.com;
    return 301 https://$server_name$request_uri;
}

server {
    listen 192.168.1.87:443 ssl;
    server_name host.domain.com;

    ## Compression
    gzip              on;
    gzip_buffers      16 8k;
    gzip_comp_level   4;
    gzip_http_version 1.0;
    gzip_min_length   1280;
    gzip_types        text/plain text/css application/x-javascript text/xml application/xml application/xml+rss text/javascript image/x-icon image/bmp;
    gzip_vary         on;

    tcp_nodelay on;
    tcp_nopush on;
    sendfile off;

    location / {
        proxy_connect_timeout   10;
        proxy_send_timeout      180;
        proxy_read_timeout      180; # to allow for large managers reports
        proxy_buffering         off;
        proxy_buffer_size       128k;
        proxy_buffers           4 256k;
        proxy_busy_buffers_size 256k;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass http://farm;

        location ~* \.(css|jpg|gif|ico|js)$ {
            proxy_cache mypms_cache;
            add_header X-Proxy-Cache $upstream_cache_status;
            proxy_cache_valid 200 60m;
            expires 60m;
            proxy_pass http://farm;
        }
    }

    location /basic_status {
        stub_status;
    }

    error_page 502 502 = /maintenance.html;
    location = /maintenance.html {
        root /www/;
    }
}

In the logs I frequently see entries like these:

2015/03/13 15:22:58 [error] 4482#0: *557390 upstream timed out (110: Connection timed out) while connecting to upstream, client: 72.160.92.101, server: host.domain.com, request: "GET /tapechart.php HTTP/1.1", upstream: "http://192.168.1.50:8080/tapechart.php", host: "host.domain.com", referrer: "https://host.domain.com/tapechart.php"
2015/03/13 15:23:14 [error] 4481#0: *557663 upstream timed out (110: Connection timed out) while connecting to upstream, client: 174.53.144.4, server: host.domain.com, request: "GET /bkgtabs.php?bookingID=3105543&show=0 HTTP/1.1", upstream: "http://192.168.1.50:8080/bkgtabs.php?bookingID=3105543&show=0", host: "host.domain.com", referrer: "https://host.domain.com/bkgtabs.php?bookingID=3105543&show=0"
2015/03/13 15:23:19 [error] 4481#0: *557550 upstream timed out (110: Connection timed out) while connecting to upstream, client: 50.134.133.213, server: host.domain.com, request: "GET /tbltapechart.php?numNights=30&startDate=1-Aug-2015&roomTypeID=-1&hideNav=N&bookingID=&roomFilter=-1 HTTP/1.1", upstream: "http://192.168.1.50:8080/tbltapechart.php?numNights=30&startDate=1-Aug-2015&roomTypeID=-1&hideNav=N&bookingID=&roomFilter=-1", host: "host.domain.com", referrer: "https://host.domain.com/tapechart.php"
2015/03/13 15:23:37 [error] 4483#0: *561705 upstream timed out (110: Connection timed out) while connecting to upstream, client: 74.223.167.14, server: host.domain.com, request: "GET /js/multiselect/jquery.multiselect.filter.css HTTP/1.1", upstream: "http://192.168.1.55:8080/js/multiselect/jquery.multiselect.filter.css", host: "host.domain.com", referrer: "https://host.domain.com/fdhome.php"
2015/03/13 15:23:40 [error] 4481#0: *561099 upstream timed out (110: Connection timed out) while connecting to upstream, client: 74.223.167.14, server: host.domain.com, request: "GET /img/tabs_left_bc.jpg HTTP/1.1", upstream: "http://192.168.1.55:8080/img/tabs_left_bc.jpg", host: "host.domain.com", referrer: "https://host.domain.com/fdhome.php"
2015/03/13 15:23:45 [error] 4481#0: *557214 upstream timed out (110: Connection timed out) while connecting to upstream, client: 75.37.141.182, server: host.domain.com, request: "GET /tapechart.php HTTP/1.1", upstream: "http://192.168.1.50:8080/tapechart.php", host: "host.domain.com", referrer: "https://host.domain.com/tapechart.php"
2015/03/13 15:23:52 [error] 4482#0: *557330 upstream timed out (110: Connection timed out) while connecting to upstream, client: 173.164.149.18, server: host.domain.com, request: "GET /bkgtabs.php?bookingID=658108460B&show=1&toFolioID=3361434 HTTP/1.1", upstream: "http://192.168.1.50:8080/bkgtabs.php?bookingID=658108460B&show=1&toFolioID=3361434", host: "host.domain.com", referrer: "https://host.domain.com/bkgtabs.php?bookingID=658108460B&show=1&toFolioID=3361434"
2015/03/13 15:24:14 [error] 4481#0: *557663 upstream timed out (110: Connection timed out) while connecting to upstream, client: 174.53.144.4, server: host.domain.com, request: "GET /bkgtabs.php?bookingID=3105543&show=0 HTTP/1.1", upstream: "http://192.168.1.50:8080/bkgtabs.php?bookingID=3105543&show=0", host: "host.domain.com", referrer: "https://host.domain.com/bkgtabs.php?bookingID=3105543&show=0"
2015/03/13 15:24:15 [error] 4481#0: *557752 upstream timed out (110: Connection timed out) while connecting to upstream, client: 24.158.4.70, server: host.domain.com, request: "GET /bkgtabs.php?bookingID=2070569 HTTP/1.1", upstream: "http://192.168.1.50:8080/bkgtabs.php?bookingID=2070569", host: "host.domain.com", referrer: "https://host.domain.com/tapechart.php"
2015/03/13 15:24:15 [error] 4482#0: *558613 upstream timed out (110: Connection timed out) while connecting to upstream, client: 199.102.121.3, server: host.domain.com, request: "GET /rptlanding.php HTTP/1.1", upstream: "http://192.168.1.50:8080/rptlanding.php", host: "host.domain.com", referrer: "https://host.domain.com/tapechart.php"
2015/03/13 15:24:17 [error] 4482#0: *557353 upstream timed out (110: Connection timed out) while connecting to upstream, client: 174.53.144.4, server: host.domain.com, request: "GET /js/multiselect/demo/assets/prettify.js HTTP/1.1", upstream: "http://192.168.1.50:8080/js/multiselect/demo/assets/prettify.js", host: "host.domain.com", referrer: "https://host.domain.com/bkgtabs.php?bookingID=3186044"
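
One way to check whether the connect timeouts cluster on a particular backend is to tally them per upstream address. In the excerpt above, 9 of the 11 entries point at 192.168.1.50:8080 and the other 2 at 192.168.1.55:8080. The following is a minimal Python sketch of that tally, not a definitive tool. The names `count_connect_timeouts` and `SAMPLE_LINES` are made up for illustration, and the sample lines are shortened copies of entries from the excerpt.

```python
import re
from collections import Counter

# Captures the upstream host:port from nginx error-log lines that report a
# timeout during the connect phase.
CONNECT_TIMEOUT = re.compile(
    r'upstream timed out \(110: Connection timed out\) '
    r'while connecting to upstream.*?upstream: "https?://([^/"]+)'
)

def count_connect_timeouts(lines):
    """Return a Counter mapping upstream host:port to connect-timeout count."""
    counts = Counter()
    for line in lines:
        match = CONNECT_TIMEOUT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Shortened copies of three entries from the excerpt (illustration only).
SAMPLE_LINES = [
    '2015/03/13 15:22:58 [error] 4482#0: *557390 upstream timed out '
    '(110: Connection timed out) while connecting to upstream, '
    'client: 72.160.92.101, request: "GET /tapechart.php HTTP/1.1", '
    'upstream: "http://192.168.1.50:8080/tapechart.php"',
    '2015/03/13 15:23:37 [error] 4483#0: *561705 upstream timed out '
    '(110: Connection timed out) while connecting to upstream, '
    'client: 74.223.167.14, request: "GET /img/tabs_left_bc.jpg HTTP/1.1", '
    'upstream: "http://192.168.1.55:8080/img/tabs_left_bc.jpg"',
    '2015/03/13 15:24:15 [error] 4482#0: *558613 upstream timed out '
    '(110: Connection timed out) while connecting to upstream, '
    'client: 199.102.121.3, request: "GET /rptlanding.php HTTP/1.1", '
    'upstream: "http://192.168.1.50:8080/rptlanding.php"',
]

# Print the upstreams, most frequently timed-out first.
for upstream, hits in count_connect_timeouts(SAMPLE_LINES).most_common():
    print(hits, upstream)
```

To run it against the real log, pass an open file object for /var/log/nginx/error.log (the path from the error_log directive) instead of `SAMPLE_LINES`. A heavily skewed count would suggest looking at that one backend first.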

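I originally found that I had to set proxy_read_timeout this high because we have one very large report that takes at least 20 seconds to render fully for users with a medium-sized data set. For users with the largest data sets, the report can take up to 2 minutes to render. However, it is run rarely, typically less than once a day, and it has never been the URL in the GET strings in the log.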

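All four backend servers are identical Apache servers running httpd 2.2.29 and php 5.5.22, both built from source, and all on the same fully updated version of CentOS. Because I was initially seeing MaxClients being hit in the logs, I defined the following on each Apache host: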

<IfModule mpm_prefork_module>
    StartServers          10
    MinSpareServers       10
    MaxSpareServers      20
    MaxClients          200
    MaxRequestsPerChild   300
</IfModule>

The nginx server and the apache servers are all in the same datacenter, on the same subnet and vlan, but I see nothing in the error_log on the apache side that would explain the timeouts.

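Other things we have tried in order to resolve this include:

Increasing proxy_read_timeout to 300.
Removing the gzip settings.
Removing the location block that caches css, images and javascript.
Enabling proxy_buffering. It had been disabled because of the large report, so that nginx could start serving the report as it rendered (including the javascript progress indicator for report building) instead of showing a blank page for 20 - 120 seconds.
Adding keepalive 8 / 16 / 32 / 64 to the upstream.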

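At this point I suspect either a network issue or a backend issue. I have moved the web application back to the Coyote Point load balancer, and the complaints have died down.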

I would really like to get to the bottom of this, but I am at a loss as to how to proceed. Any suggestions?

Answer 1

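I ran into something similar in an nginx<->apache2 setup. It turned out that MySQL was bogged down, which made apache take too long under load. To find out how long apache was taking, I changed its log format to: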

LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\" %DµSEC" timed

and the nginx log format to:

log_format timed_combined '$remote_addr - $remote_user [$time_local]  '

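It was then much easier to see that although apache completed all of the requests, it was handing the data back to nginx very late (several seconds late).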
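
With the Apache "timed" format above, every access-log line ends with the %D value (request duration in microseconds) followed by the literal text "µSEC", so slow requests can be picked out mechanically. The following is a rough Python sketch under those assumptions. The function name `slow_requests`, the 5-second threshold, and the two sample lines (including their addresses and timings) are invented for illustration and are not taken from real logs.

```python
import re

# The "timed" LogFormat ends each line with %D (microseconds) plus "µSEC".
DURATION = re.compile(r'(\d+)µSEC\s*$')
# The request line is the first quoted field that starts with an HTTP method.
REQUEST = re.compile(r'"([A-Z]+ [^"]*)"')

def slow_requests(lines, threshold_seconds=5.0):
    """Yield (seconds, request_line) for requests at or above the threshold."""
    for line in lines:
        duration = DURATION.search(line)
        if duration is None:
            continue  # line was not written with the "timed" format
        seconds = int(duration.group(1)) / 1_000_000
        if seconds >= threshold_seconds:
            request = REQUEST.search(line)
            request_line = request.group(1) if request else "?"
            yield (seconds, request_line)

# Hypothetical access-log lines in the "timed" format (illustration only).
SAMPLE_LINES = [
    '203.0.113.7 - - [13/Mar/2015:15:22:48 -0600] '
    '"GET /tapechart.php HTTP/1.1" 200 5120 "-" "Mozilla/5.0" 12500000µSEC',
    '203.0.113.8 - - [13/Mar/2015:15:22:50 -0600] '
    '"GET /img/logo.jpg HTTP/1.1" 200 900 "-" "Mozilla/5.0" 3500µSEC',
]

for seconds, request_line in slow_requests(SAMPLE_LINES):
    print(f"{seconds:8.2f}s  {request_line}")
```

Comparing the timestamps of the slow entries with the timestamps of the nginx timeout errors is one way to see whether the backend was already struggling when nginx gave up.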

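I am not sure why haproxy would fix your problem, unless one of the Apache servers is much slower than the others. I have also seen this happen when one machine was having recoverable disk errors. Those errors should show up in the system log.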