How do I debug memc/memcached timeouts with Nginx?

Problem

The memcached backends in Nginx regularly time out (see the example errors below).

What has already been tested:

  • It happens with both local and remote memcached servers, so the network does not appear to be the problem
  • It happens with both the memc and memcached backends (a rough configuration sketch follows this list)
  • It happens with and without the upstream keepalive module
  • It occurs on nginx 0.8, 1.0.x and 1.1.x
  • Buffering the log seems to reduce the number of errors slightly, but they still occur, so that may just be a coincidence
  • Logging over UDP (i.e. non-blocking) to another host makes no difference
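
The question does not include the actual nginx configuration, so purely as a point of reference, here is a minimal sketch of what such a setup might look like when using the stock nginx memcached module together with a keepalive upstream; every name and value below is an assumption for illustration only, not the poster's configuration:

# Hypothetical illustration only -- not the poster's actual configuration
upstream memcached_backend {
    server 127.0.0.1:11211;   # same upstream address as in the error log
    keepalive 32;             # only when the upstream keepalive module is used
}

location /memc_get {
    internal;                          # used as a subrequest, as in the error log
    set $memcached_key $arg_key;       # key supplied by the parent location
    memcached_pass memcached_backend;
    memcached_connect_timeout 60s;     # defaults shown explicitly; these are the timers
    memcached_read_timeout    60s;     # behind the "while connecting to" and "while
    memcached_send_timeout    60s;     # reading response header" messages in the log
}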

Other information:

  • The memcached server handles about 500 requests per second from this client and other (remote) clients
  • There are no errors/reports in the memcached log
  • memcached is started with the following settings (a sketch of a possible command line follows this list):
    • Both TCP and UDP are enabled (only TCP is used at the moment)
    • 16 GB of memory
    • 16 threads (on a 32-core machine)
    • Limited to 8192 connections
    • Started in non-daemon mode via a supervisor
  • The server runs FreeBSD 8.2
  • The timeouts happen during both low and high load (currently only about 50 requests per second)
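
Roughly, the settings above would correspond to a memcached command line along these lines; the exact invocation is not given in the question, so the flags below are only an assumed reconstruction:

# Hypothetical reconstruction of the startup flags described above
memcached -m 16384 -t 16 -c 8192 -p 11211 -U 11211 -u nobody
# -m 16384 : 16 GB of memory
# -t 16    : 16 worker threads
# -c 8192  : limit of 8192 connections
# -p / -U  : TCP and UDP both enabled
# (no -d)  : stays in the foreground for the supervisor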

Question

So my question is... how do I debug these issues? Does anyone have an idea where the root cause might lie? The errors do not occur continuously, but frequently enough to show up in the log every few minutes.


Example errors from the log:

2011/08/30 17:23:34 [error] 13921#0: *38602 upstream timed out (60: Operation timed out) while reading response header from upstream, client: x.x.x.x, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
2011/08/30 17:23:34 [error] 13921#0: *38591 upstream timed out (60: Operation timed out) while reading response header from upstream, client: x.x.x.x, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
2011/08/30 17:23:34 [error] 13921#0: *38601 upstream timed out (60: Operation timed out) while reading response header from upstream, client: x.x.x.x, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
2011/08/30 17:23:34 [error] 13921#0: *38918 upstream timed out (60: Operation timed out) while reading response header from upstream, client: x.x.x.x, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
2011/08/30 17:23:35 [error] 13921#0: *38595 upstream timed out (60: Operation timed out) while connecting to upstream, client: x.x.x.x, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
2011/08/30 17:24:32 [error] 13921#0: *41768 upstream timed out (60: Operation timed out) while connecting to upstream, client: x.x.x.x, server: some.host.com, request: "GET /some_random_file HTTP/1.1", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
2011/08/30 17:24:36 [error] 13921#0: *38599 upstream timed out (60: Operation timed out) while connecting to upstream, client: x.x.x.x, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
2011/08/30 17:25:12 [error] 13921#0: *42489 upstream timed out (60: Operation timed out) while connecting to upstream, client: x.x.x.x, server: some.host.com, request: "GET /some_random_file HTTP/1.1", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
2011/08/30 17:25:30 [error] 13922#0: *39444 upstream timed out (60: Operation timed out) while reading response header from upstream, client: x.x.x.x, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
2011/08/30 17:25:30 [error] 13922#0: *39452 upstream timed out (60: Operation timed out) while reading response header from upstream, client: x.x.x.x, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"

Output from netstat -m:

# netstat -m
3404/25531/28935 mbufs in use (current/cache/total)
3318/25004/28322/229376 mbuf clusters in use (current/cache/total/max)
1161/20599 mbuf+clusters out of packet secondary zone in use (current/cache)
26/4420/4446/192000 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
7591K/74070K/81661K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
4925075 requests for I/O initiated by sendfile
0 calls to protocol drain routines

Answer 1

Judging by the error log output, NGiNX seems to be waiting on network communication, both while trying to establish new connections and while reading data back from existing ones. You mention that it happens both locally and remotely, which is why you say you have ruled out the network. But even the local setup still connects over a TCP socket, right? Have you checked whether you are running out of mbufs on your BSD box:

# netstat -m
16387/4613/21000 mbufs in use (current/cache/total)
16385/3957/20342/25600 mbuf clusters in use (current/cache/total/max)
16384/2176 mbuf+clusters out of packet secondary zone in use (current/cache)
0/403/403/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
36866K/10679K/47546K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines

If you find that mbufs are the problem, then you need to update your loader.conf and reboot (unfortunately it is not tunable at runtime). In /boot/loader.conf:

kern.ipc.nmbclusters="128000"
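
After the reboot you can verify that the new limit is active and compare it against the current/max columns, for example:

# FreeBSD: check the configured cluster limit and current usage
sysctl kern.ipc.nmbclusters
netstat -m | grep 'mbuf clusters'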

If you do not see anything that suggests total == max, then you should probably first confirm that nginx is not the culprit. I would do that by tcpdumping the connections from nginx to memcached and verifying, when nginx claims it timed out waiting, that it actually sent the connection setup or the request. If it did, then I would start running ktrace on the memcached process for a few minutes while the errors are occurring, and then look at the kdump output for failing system calls (such as send or accept).
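
A rough sketch of those two steps, assuming the local memcached on port 11211 from the error log (the interface name and PID are placeholders you would need to adjust):

# Capture the nginx <-> memcached traffic on the loopback interface
tcpdump -i lo0 -s 0 -w memc.pcap 'tcp port 11211'

# Attach to the running memcached process and trace its system calls
ktrace -p <memcached-pid> -f ktrace.out
# ...let the errors occur for a few minutes, then stop tracing
ktrace -C

# Look through the dump for failing syscalls such as sendto or accept
kdump -f ktrace.out | grep errno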
