无法弄清楚 NGINX/Puma 超时 110:连接超时

无法弄清楚 NGINX/Puma 超时 110:连接超时

我的 Rails 服务器时不时会停止响应。我查看日志,什么都没有 — 只有一堆请求在最后一个请求完成后才开始启动...

使用

  • Ubuntu
  • 美洲豹 2.15.3
  • Ruby 2.2.1p85
  • NGINX 1.4.6
  • Supervisor(用于运行/重生 rails 服务器和 sidekiq)

Rails 日志...

Completed 200 OK in 179ms (Views: 6.2ms | ActiveRecord: 165.7ms)
Started OPTIONS "/api/v2/users/me" for 209.217.218.34 at 2015-12-16 21:43:47 +0000
Started OPTIONS "/api/v2/users/me" for 209.217.218.34 at 2015-12-16 21:44:21 +0000
Started GET "/" for 209.217.218.34 at 2015-12-16 21:45:11 +0000
Started GET "/" for 209.217.218.34 at 2015-12-16 21:45:27 +0000
Started GET "/" for 209.217.218.34 at 2015-12-16 21:48:13 +0000
Started OPTIONS "/api/v2/users/me" for 209.217.218.34 at 2015-12-16 21:49:19 +0000
Started GET "/" for 209.217.218.34 at 2015-12-16 22:11:58 +0000

NGINX 日志显示读取响应头时出现明显超时

2015/12/16 21:48:43 [error] 1129#0: *1777 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 209.217.218.34, server: example.com, request: "GET / HTTP/1.1", upstream: "http://unix:/example_site/shared/tmp/sockets/puma.staging.sock/", host: "example.com"
2015/12/16 21:49:49 [error] 1129#0: *1783 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 209.217.218.34, server: example.com, request: "OPTIONS /api/v2/users/me HTTP/1.1", upstream: "http://unix:/example_site/shared/tmp/sockets/puma.staging.sock/api/v2/users/me", host: "example.com", referrer: "http://example.com/"
2015/12/16 22:12:28 [error] 1129#0: *1785 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 209.217.218.34, server: example.com, request: "GET / HTTP/1.1", upstream: "http://unix:/example_site/shared/tmp/sockets/puma.staging.sock/", host: "example.com"

NGINX 服务器代理传递给 puma 服务器,该服务器有 2 个工作程序和 8 个线程。

我在其中一个线程上运行了 strace 命令。然后看到了这个...

[pid 32463] futex(0x7f0314c27a78, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000042>
[pid 32463] clock_gettime(CLOCK_MONOTONIC, {454145, 535389460}) = 0 <0.000034>
[pid 32463] write(16, "p32428\n", 7)    = 7 <0.000036>
[pid 32463] clock_gettime(CLOCK_MONOTONIC, {454145, 535541983}) = 0 <0.000013>
[pid 32463] clock_gettime(CLOCK_MONOTONIC, {454145, 535578664}) = 0 <0.000012>
[pid 32463] futex(0x7f0314c27a04, FUTEX_WAIT_BITSET_PRIVATE, 5195, {454150, 535578664}, ffffffff <unfinished ...>
[pid 32471] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out) <1.000163>
[pid 32471] futex(0x7f03148dc858, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000010>
[pid 32471] clock_gettime(CLOCK_MONOTONIC, {454146, 119592840}) = 0 <0.000005>
[pid 32471] clock_gettime(CLOCK_MONOTONIC, {454146, 119637474}) = 0 <0.000006>
[pid 32471] clock_gettime(CLOCK_MONOTONIC, {454146, 119655894}) = 0 <0.000005>
[pid 32471] futex(0x7f03148dc7e4, FUTEX_WAIT_BITSET_PRIVATE, 25969, {454147, 119655894}, ffffffff <unfinished ...>
[pid 32470] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out) <1.000142>
[pid 32470] futex(0x7f0314dc94d8, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000013>
[pid 32470] clock_gettime(CLOCK_MONOTONIC, {454146, 531287803}) = 0 <0.000006>
[pid 32470] clock_gettime(CLOCK_MONOTONIC, {454146, 531415055}) = 0 <0.000049>
[pid 32470] clock_gettime(CLOCK_MONOTONIC, {454146, 531502589}) = 0 <0.000037>
[pid 32470] futex(0x7f0314dc9464, FUTEX_WAIT_BITSET_PRIVATE, 25969, {454147, 531502589}, ffffffff <unfinished ...>
[pid 32471] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out) <1.000166>
[pid 32471] futex(0x7f03148dc858, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000009>
[pid 32471] clock_gettime(CLOCK_MONOTONIC, {454147, 119925263}) = 0 <0.000005>
[pid 32471] clock_gettime(CLOCK_MONOTONIC, {454147, 119965172}) = 0 <0.000014>
[pid 32471] clock_gettime(CLOCK_MONOTONIC, {454147, 119996627}) = 0 <0.000005>
[pid 32471] futex(0x7f03148dc7e4, FUTEX_WAIT_BITSET_PRIVATE, 25971, {454148, 119996627}, ffffffff <unfinished ...>
[pid 32470] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out) <1.000142>
[pid 32470] futex(0x7f0314dc94d8, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000046>
[pid 32470] clock_gettime(CLOCK_MONOTONIC, {454147, 531936925}) = 0 <0.000033>
[pid 32470] clock_gettime(CLOCK_MONOTONIC, {454147, 532077784}) = 0 <0.000042>
[pid 32470] clock_gettime(CLOCK_MONOTONIC, {454147, 532160473}) = 0 <0.000036>
[pid 32470] futex(0x7f0314dc9464, FUTEX_WAIT_BITSET_PRIVATE, 25971, {454148, 532160473}, ffffffff <unfinished ...>
[pid 32471] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out) <1.000143>
[pid 32471] futex(0x7f03148dc858, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000010>
[pid 32471] clock_gettime(CLOCK_MONOTONIC, {454148, 120250641}) = 0 <0.000004>
[pid 32471] clock_gettime(CLOCK_MONOTONIC, {454148, 120296848}) = 0 <0.000014>
[pid 32471] clock_gettime(CLOCK_MONOTONIC, {454148, 120329244}) = 0 <0.000005>
[pid 32471] futex(0x7f03148dc7e4, FUTEX_WAIT_BITSET_PRIVATE, 25973, {454149, 120329244}, ffffffff <unfinished ...>
[pid 32470] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out) <1.000130>
[pid 32470] futex(0x7f0314dc94d8, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000010>
[pid 32470] clock_gettime(CLOCK_MONOTONIC, {454148, 532433899}) = 0 <0.000005>
[pid 32470] clock_gettime(CLOCK_MONOTONIC, {454148, 532527808}) = 0 <0.000015>
[pid 32470] clock_gettime(CLOCK_MONOTONIC, {454148, 532562815}) = 0 <0.000006>
[pid 32470] futex(0x7f0314dc9464, FUTEX_WAIT_BITSET_PRIVATE, 25973, {454149, 532562815}, ffffffff <unfinished ...>
[pid 32471] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out) <1.000163>
[pid 32471] futex(0x7f03148dc858, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000013>
[pid 32471] clock_gettime(CLOCK_MONOTONIC, {454149, 120641924}) = 0 <0.000006>
[pid 32471] clock_gettime(CLOCK_MONOTONIC, {454149, 120692457}) = 0 <0.000010>
[pid 32471] clock_gettime(CLOCK_MONOTONIC, {454149, 120715154}) = 0 <0.000005>
[pid 32471] futex(0x7f03148dc7e4, FUTEX_WAIT_BITSET_PRIVATE, 25975, {454150, 120715154}, ffffffff <unfinished ...>
[pid 32470] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out) <1.000141>
[pid 32470] futex(0x7f0314dc94d8, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000009>

nginx.conf

user ubuntu;
worker_processes auto;
pid /run/nginx.pid;

env PATH;

events {
  worker_connections 768;
}

http {
  sendfile on;
  tcp_nopush on;
  tcp_nodelay on;
  keepalive_timeout 70;
  types_hash_max_size 2048;    

  ssl_session_cache   shared:SSL:10m;
  ssl_session_timeout 10m;

  include /etc/nginx/mime.types;
  default_type application/octet-stream;

  access_log /var/log/nginx/access.log;
  error_log /var/log/nginx/error.log;

  gzip on;
  gzip_disable "msie6";

  include /etc/nginx/conf.d/*.conf;
  include /etc/nginx/sites-enabled/*;
}

站点配置文件

upstream app {
  server unix:/example_site/shared/tmp/sockets/puma.production.sock fail_timeout=0;
}

server {
  listen 80;
  listen 443 ssl;

  sendfile off;

  server_name example.com;

  ssl_certificate /etc/nginx/ssl/example.com.crt;
  ssl_certificate_key /etc/nginx/ssl/example.com.key;
  ssl_protocols  TLSv1 TLSv1.1 TLSv1.2;  # don’t use SSLv3 ref: POODLE

  root /example_site/current/public;

  try_files $uri/index.html $uri @app;

  location @app {
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header Host $http_host;
    proxy_read_timeout 30;
    proxy_connect_timeout 60;
    proxy_redirect off;
    proxy_pass http://app;
  }

  error_page 500 502 503 504 /500.html;
  client_max_body_size 4G;
  keepalive_timeout 10;
}

我对使用 strace 还很陌生,不太确定我在这里看到的是什么,以及这是否有帮助解决问题。

有人知道发生了什么事吗?我还能做些什么来弄清楚发生了什么超时?

答案1

您看到的 nginx 错误是由于 nginx 因为花费的时间太长而关闭连接,或者上游服务器没有像应该的那样接受连接,或者您的代理没有转发上游服务器进行连接所需的某些标头。

如果没有看到你的代理块和 nginx 的相关设置,这里有一些需要快速检查的事情。

您可以快速执行的操作:

  • 仔细检查 /etc/nginx/nginx.conf 中的 keepalive_timeout。
  • 在您的本地机器上为上游服务器添加一个 hosts 文件条目,并确保每个连接都完成并且不会超时。
  • 确保您将 nginx 代理上的正确标头转发回上游服务器。

以下是 nginx 代理传递指令的链接:Nginx http_proxy_module

更新:

尝试运行nginx -t,看看它是否告诉你你的配置是否有问题,因为现在我再看一遍,发现有一件事很突出,那就是尝试keepalive_timeout改变你的keepalive_timeout 70;配置keepalive_timeout 70s;

答案2

如果您的 nginx 代理和 puma 服务器位于不同的设备上,您确实需要使用 tcpdump 进行一些数据包捕获,并在 wireshark 中检查它们以找出发生了什么。如果 nginx 没有执行类似一通进行连接池,那么你可能会开始遇到问题最少 2 个并发用户 - 但实际上该配置最多只能支持 4 个。tcpdump 将通过允许您计算三次握手的次数并查看第 9 次握手是否重置来确认该问题。

长话短说 - 尝试增加工作者和线程的数量。

相关内容