达到 MaxClients 数量时 Apache 超时/死锁

达到 MaxClients 数量时 Apache 超时/死锁

我在 Debian Squeeze 上运行 Apache 2.2,预制 MPM,使用 mod_php(PHP 5.3),监听多个端口(在同一台机器上运行 memcached)。在过去 6 周左右的时间里,我遇到了所有 Apache 连接在大型组中超时/死锁的情况。重新启动 Apache 可以解决问题,大约需要几小时到几天的时间。

服务器上运行的 PHP 应用程序包括面向用户的 Web UI,以及在同一台机器上单独端口上运行的后端 REST API。在http://127.0.0.1:8888/大多数页面加载时,Web UI 应用程序会通过向后端 API 发出 1 到 3 个连续的 REST 请求。这些超时错误首先出现在 PHP 的错误日志中,即 30 秒后 REST 客户端的超时错误。在 Apache 的错误日志中也可以看到相应的超时。它们会突发出现,所有请求都会超时,然后一批请求会(似乎)快速且成功地处理,然后是另一批超时。在几分钟的时间窗口内,8-25 秒内只会超时,然后可能是一秒钟的请求正在处理,然后是另一个 8-25 秒的超时。如果“请求突发”之前的“超时窗口”较短,则其后的超时窗口会更长。出现此问题时,任一应用程序中的 PHP 脚本的curl http://localhost/server-status请求curl都会超时并失败,或者有时会在 25 到 60 秒之间返回。在此期间,CPU 可用,RAM 可用,IO 似乎正常(虽然这不是我最擅长监控的一件事),并且处于SYN_RECV状态的套接字数量远低于系统somaxconn和 Apache 的数量ListenBacklog(但我确实看到套接字在积压中等待)。

在调试和研究过程中,我注意到很多事情都不合理。首先,在我上面的curl http://localhost/server-status例子中,curl连接并成功发送数据,但从未得到响应。即使经过了 45 分钟以上的时间,Apache 似乎也没有重置连接或执行任何其他操作,curl只是挂起并等待。例如:

~# curl -vv http://localhost/server-status
* About to connect() to localhost port 80 (#0)
*   Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 80 (#0)
> GET /server-status HTTP/1.1
> User-Agent: curl/7.21.0 (x86_64-pc-linux-gnu) libcurl/7.21.0 OpenSSL/0.9.8o zlib/1.2.3.4 libidn/1.15 libssh2/1.2.6
> Host: localhost
> Accept: */*
>

如果我重新启动 Apache,一切就会恢复正常。

机器总共有 4GB;MemCache 使用了 768M)、4 个 CPU 核心,每个 Apache 进程的常驻内存大小在 20MB 到 30MB 之间。MaxClients已设置为100,根据我的计算和观察,它在资源限制内运行。我的第一个困难是重现问题,因为我只看到它在 6 周内实际出现过 9 次左右。凭直觉,我将 降低MaxClients10(消除资源限制)以查看是否可以触发问题,并且成功重现了 - “突发”超时模式立即开始发生。 TCP 积压填充到容量的 10% 左右,并且尝试发出上述请求永远不会返回(即使突发请求通过并完成;但我的curl请求似乎从未通过突发)。Timeout设置为500KeepAliveTimeout设置为5sys.fs.file-max/sys.fs.file-nr显示 <1% 的利用率。nofileulimit 远高于每个进程使用的 FD。感觉卡住了并且缺乏可见性,我开始strace设置我的 Apache 进程。经过仔细分析,我发现没有什么特别突出的地方,没有错误,也没有泄漏 FD。然而,在“超时期间”,我确实看到了很多:

[17284] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)

在 cadeit 中可能会有所帮助,这里有一些strace跨所有正在运行的进程的精炼输出的“循环”:

ps auxw | grep sbin/apache | awk '{print"-p " $2}' | xargs strace -e poll,select,connect,recvfrom,sendto,flock

生成:

[17293] flock(46, LOCK_EX <unfinished ...>
[17292] flock(46, LOCK_EX <unfinished ...>
[17291] flock(46, LOCK_EX <unfinished ...>
[17290] flock(46, LOCK_EX <unfinished ...>
[17288] flock(46, LOCK_EX <unfinished ...>
[17285] flock(46, LOCK_EX <unfinished ...>
[17283] flock(46, LOCK_EX <unfinished ...>
[17276] select(0, NULL, NULL, NULL, {0, 618967} <unfinished ...>
[17287] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
[17287] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 1000 <unfinished ...>
[17276] <... select resumed> )      = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17289] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
[17289] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 1000 <unfinished ...>
[17284] poll([{fd=47, [...] (Timeout)
[17284] poll([{fd=47, [...] <unfinished ...>
[17287] <... poll resumed> )        = 0 (Timeout)
[17287] poll([{fd=47, [...] (Timeout)
[17287] poll([{fd=47, [...] <unfinished ...>
[17276] <... select resumed> )      = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17289] <... poll resumed> )        = 0 (Timeout)
[17289] poll([{fd=47, [...] (Timeout)
[17289] poll([{fd=47, [...] <unfinished ...>
[17284] <... poll resumed> )        = 0 (Timeout)
[17284] poll([{fd=47, [...] (Timeout)
[17284] poll([{fd=47, [...] <unfinished ...>
[17287] <... poll resumed> )        = 0 (Timeout)
[17287] poll([{fd=47, [...] (Timeout)
[17287] poll([{fd=47, [...] <unfinished ...>
[17276] <... select resumed> )      = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17289] <... poll resumed> )        = 0 (Timeout)
[17289] poll([{fd=47, [...] (Timeout)
[17289] poll([{fd=47, [...] <unfinished ...>
[17284] <... poll resumed> )        = 0 (Timeout)
[17284] poll([{fd=47, [...] (Timeout)
[17284] poll([{fd=47, [...] <unfinished ...>
[17287] <... poll resumed> )        = 0 (Timeout)
[17287] poll([{fd=47, [...] (Timeout)
[17287] poll([{fd=47, [...] <unfinished ...>
[17276] <... select resumed> )      = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17289] <... poll resumed> )        = 0 (Timeout)
[17289] poll([{fd=47, [...] (Timeout)
[17289] poll([{fd=47, [...] <unfinished ...>
[17284] <... poll resumed> )        = 0 (Timeout)
[17284] poll([{fd=47, [...] (Timeout)
[17284] poll([{fd=47, [...] <unfinished ...>
[17287] <... poll resumed> )        = 0 (Timeout)
[17287] poll([{fd=47, [...] (Timeout)
[17287] poll([{fd=47, [...] <unfinished ...>
[17276] <... select resumed> )      = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17289] <... poll resumed> )        = 0 (Timeout)
[17289] poll([{fd=47, [...] (Timeout)
[17289] poll([{fd=47, [...] <unfinished ...>
[17284] <... poll resumed> )        = 0 (Timeout)
[17284] poll([{fd=47, [...] (Timeout)
[17284] poll([{fd=47, [...] <unfinished ...>
[17287] <... poll resumed> )        = 0 (Timeout)
[17287] poll([{fd=47, [...] (Timeout)
[17287] poll([{fd=47, [...] <unfinished ...>
[17276] <... select resumed> )      = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17289] <... poll resumed> )        = 0 (Timeout)
[17289] poll([{fd=47, [...] (Timeout)
[17289] poll([{fd=47, [...] <unfinished ...>
[17284] <... poll resumed> )        = 0 (Timeout)
[17284] poll([{fd=47, [...] (Timeout)
[17284] poll([{fd=47, [...] <unfinished ...>
[17287] <... poll resumed> )        = 0 (Timeout)
[17287] poll([{fd=47, [...] (Timeout)
[17287] poll([{fd=47, [...] <unfinished ...>
[17276] <... select resumed> )      = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17289] <... poll resumed> )        = 0 (Timeout)
[17289] poll([{fd=47, [...] (Timeout)
[17289] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 964 <unfinished ...>
[17284] <... poll resumed> )        = 0 (Timeout)
[17284] poll([{fd=47, [...] (Timeout)
[17284] poll([{fd=47, [...] <unfinished ...>
[17287] <... poll resumed> )        = 0 (Timeout)
[17287] poll([{fd=47, [...] (Timeout)
[17287] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 1000 <unfinished ...>
[17289] <... poll resumed> )        = 0 (Timeout)
[17289] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
[17289] --- SIGCHLD (Child exited) @ 0 (0) ---
[17289] sendto(47, "\24\0...", 20, ...) = 20
[17289] connect(47, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17289] poll([{fd=47, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 1000) = 1 ([{fd=47, revents=POLLOUT}])
[17289] select(48, [47], [47], NULL, {1, 0}) = 1 (out [47], left {0, 999997})
[17289] sendto(47, "<MC>"..., 35, MSG_NOSIGNAL, NULL, 0) = 35
[17289] select(48, [47], [], NULL, {1, 0}) = 1 (in [47], left {0, 999997})
[17289] recvfrom(47, "<MC>"..., 32768, 0, NULL, NULL) = 7280
[17289] connect(47, ...(8888)...("127.0.0.1")...) = -1 EINPROGRESS (...)
[17289] poll([{fd=47, events=POLLOUT}], 1, 1000) = 1 ([{fd=47, revents=POLLOUT}])
[17289] sendto(47, "GET /<URL>"..., 390, MSG_NOSIGNAL, NULL, 0) = 390
[17289] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
[17289] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
[17289] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 1000 <unfinished ...>
[17276] <... select resumed> )      = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17284] <... poll resumed> )        = 0 (Timeout)
[17284] poll([{fd=47, [...] (Timeout)
[17284] poll([{fd=47, [...] <unfinished ...>
[17287] <... poll resumed> )        = 0 (Timeout)
[17287] poll([{fd=47, [...] (Timeout)
[17287] poll([{fd=47, [...] <unfinished ...>
[17289] <... poll resumed> )        = 0 (Timeout)
[17289] poll([{fd=47, [...] (Timeout)
[17289] poll([{fd=47, [...] <unfinished ...>
[17276] <... select resumed> )      = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17284] <... poll resumed> )        = 0 (Timeout)
[17284] poll([{fd=47, [...] (Timeout)
[17284] poll([{fd=47, [...] <unfinished ...>
[17287] <... poll resumed> )        = 0 (Timeout)
[17287] poll([{fd=47, [...] (Timeout)
[17287] poll([{fd=47, [...] <unfinished ...>
[17289] <... poll resumed> )        = 0 (Timeout)
[17289] poll([{fd=47, [...] (Timeout)
[17289] poll([{fd=47, [...] <unfinished ...>
[17276] <... select resumed> )      = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17284] <... poll resumed> )        = 0 (Timeout)
[17284] poll([{fd=47, [...] (Timeout)
[17284] poll([{fd=47, [...] <unfinished ...>
[17287] <... poll resumed> )        = 0 (Timeout)
[17287] poll([{fd=47, [...] (Timeout)
[17287] poll([{fd=47, [...] <unfinished ...>
[17289] <... poll resumed> )        = 0 (Timeout)
[17289] poll([{fd=47, [...] (Timeout)
[17289] poll([{fd=47, [...] <unfinished ...>
[17276] <... select resumed> )      = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17284] <... poll resumed> )        = 0 (Timeout)
[17284] poll([{fd=47, [...] (Timeout)
[17284] poll([{fd=47, [...] <unfinished ...>
[17287] <... poll resumed> )        = 0 (Timeout)
[17287] poll([{fd=47, [...] (Timeout)
[17287] poll([{fd=47, [...] <unfinished ...>
[17289] <... poll resumed> )        = 0 (Timeout)
[17289] poll([{fd=47, [...] (Timeout)
[17289] poll([{fd=47, [...] <unfinished ...>
[17276] <... select resumed> )      = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17284] <... poll resumed> )        = 0 (Timeout)
[17284] poll([{fd=47, [...] (Timeout)
[17284] poll([{fd=47, [...] <unfinished ...>
[17287] <... poll resumed> )        = 0 (Timeout)
[17287] poll([{fd=47, [...] (Timeout)
[17287] poll([{fd=47, [...] <unfinished ...>
[17289] <... poll resumed> )        = 0 (Timeout)
[17289] poll([{fd=47, [...] (Timeout)
[17289] poll([{fd=47, [...] <unfinished ...>
[17276] <... select resumed> )      = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17284] <... poll resumed> )        = 0 (Timeout)
[17284] poll([{fd=47, [...] (Timeout)
[17284] poll([{fd=47, [...] <unfinished ...>
[17287] <... poll resumed> )        = 0 (Timeout)
[17287] poll([{fd=47, [...] (Timeout)
[17287] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 963 <unfinished ...>
[17289] <... poll resumed> )        = 0 (Timeout)
[17289] poll([{fd=47, [...] (Timeout)
[17289] poll([{fd=47, [...] <unfinished ...>
[17276] <... select resumed> )      = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17284] <... poll resumed> )        = 0 (Timeout)
[17284] poll([{fd=47, [...] (Timeout)
[17284] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 1000 <unfinished ...>
[17287] <... poll resumed> )        = 0 (Timeout)
[17287] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
[17287] --- SIGCHLD (Child exited) @ 0 (0) ---
[17292] <... flock resumed> )       = 0
[17292] sendto(47, "\24\0...", 20, ...) = 20
[17292] connect(47, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17292] poll([{fd=47, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 1000) = 1 ([{fd=47, revents=POLLOUT}])
[17292] select(48, [47], [47], NULL, {1, 0}) = 1 (out [47], left {0, 999997})
[17292] sendto(47, "<MC>"..., 37, MSG_NOSIGNAL, NULL, 0) = 37
[17292] select(48, [47], [], NULL, {1, 0}) = 1 (in [47], left {0, 999997})
[17292] recvfrom(47, "<MC>"..., 32768, 0, NULL, NULL) = 22479
[17285] <... flock resumed> )       = 0
[17285] sendto(47, "\24\0...", 20, ...) = 20
[17285] connect(47, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17285] poll([{fd=47, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 1000) = 1 ([{fd=47, revents=POLLOUT}])
[17285] select(48, [47], [47], NULL, {1, 0}) = 1 (out [47], left {0, 999997})
[17285] sendto(47, "<MC>"..., 37, MSG_NOSIGNAL, NULL, 0) = 37
[17285] select(48, [47], [], NULL, {1, 0}) = 1 (in [47], left {0, 999997})
[17285] recvfrom(47, "<MC>"..., 32768, 0, NULL, NULL) = 22479
[17285] sendto(47, "\24\0...", 20, ...) = 20
[17285] connect(47, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17285] poll([{fd=47, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 1000) = 1 ([{fd=47, revents=POLLOUT}])
[17285] select(48, [47], [47], NULL, {1, 0}) = 1 (out [47], left {0, 999997})
[17285] sendto(47, "<MC>"..., 35, MSG_NOSIGNAL, NULL, 0) = 35
[17285] select(48, [47], [], NULL, {1, 0}) = 1 (in [47], left {0, 999997})
[17285] recvfrom(47, "<MC>"..., 32768, 0, NULL, NULL) = 7280
[17287] poll([{fd=44, events=POLLIN}], 1, 5000 <unfinished ...>
[17292] poll([{fd=44, events=POLLIN}], 1, 5000 <unfinished ...>
[17287] <... poll resumed> )        = 1 ([{fd=44, revents=POLLIN}])
[17285] connect(47, ...(8888)...("127.0.0.1")...) = -1 EINPROGRESS (...)
[17285] poll([{fd=47, events=POLLOUT}], 1, 1000) = 1 ([{fd=47, revents=POLLOUT}])
[17285] sendto(47, "GET /<URL>"..., 380, MSG_NOSIGNAL, NULL, 0) = 380
[17285] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
[17285] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
[17285] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 1000 <unfinished ...>
[17292] <... poll resumed> )        = 1 ([{fd=44, revents=POLLIN}])
[17292] sendto(46, "\24\0...", 20, ...) = 20
[17292] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17292] poll([{fd=46, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 1000) = 1 ([{fd=46, revents=POLLOUT}])
[17292] select(47, [46], [46], NULL, {1, 0}) = 1 (out [46], left {0, 999996})
[17292] sendto(46, "<MC>"..., 35, MSG_NOSIGNAL, NULL, 0) = 35
[17292] select(47, [46], [], NULL, {1, 0}) = 1 (in [46], left {0, 999997})
[17292] recvfrom(46, "<MC>"..., 32768, 0, NULL, NULL) = 7280
[17292] sendto(46, "\24\0...", 20, ...) = 20
[17292] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17292] poll([{fd=46, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 1000) = 1 ([{fd=46, revents=POLLOUT}])
[17292] select(47, [46], [46], NULL, {1, 0}) = 1 (out [46], left {0, 999997})
[17292] sendto(46, "<MC>"..., 37, MSG_NOSIGNAL, NULL, 0) = 37
[17292] select(47, [46], [], NULL, {1, 0}) = 1 (in [46], left {0, 999997})
[17292] recvfrom(46, "<MC>"..., 32768, 0, NULL, NULL) = 22479
[17287] sendto(46, "\24\0...", 20, ...) = 20
[17287] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17287] poll([{fd=46, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 1000) = 1 ([{fd=46, revents=POLLOUT}])
[17292] sendto(46, "\24\0...", 20, ...) = 20
[17287] select(47, [46], [46], NULL, {1, 0}) = 1 (out [46], left {0, 999997})
[17287] sendto(46, "<MC>"..., 35, MSG_NOSIGNAL, NULL, 0) = 35
[17287] select(47, [46], [], NULL, {1, 0}) = 1 (in [46], left {0, 999997})
[17287] recvfrom(46, "<MC>"..., 32768, 0, NULL, NULL) = 7280
[17292] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17292] poll([{fd=46, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 1000) = 1 ([{fd=46, revents=POLLOUT}])
[17292] select(47, [46], [46], NULL, {1, 0}) = 1 (out [46], left {0, 999997})
[17287] sendto(46, "\24\0...", 20, ... <unfinished ...>
[17292] sendto(46, "<MC>"..., 35, MSG_NOSIGNAL, NULL, 0 <unfinished ...>
[17287] <... sendto resumed> )      = 20
[17292] <... sendto resumed> )      = 35
[17292] select(47, [46], [], NULL, {1, 0}) = 1 (in [46], left {0, 999997})
[17292] recvfrom(46, "<MC>"..., 32768, 0, NULL, NULL) = 7280
[17287] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17287] poll([{fd=46, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 1000) = 1 ([{fd=46, revents=POLLOUT}])
[17287] select(47, [46], [46], NULL, {1, 0}) = 1 (out [46], left {0, 999997})
[17287] sendto(46, "<MC>"..., 37, MSG_NOSIGNAL, NULL, 0) = 37
[17287] select(47, [46], [], NULL, {1, 0}) = 1 (in [46], left {0, 999996})
[17287] recvfrom(46, "<MC>"..., 32768, 0, NULL, NULL) = 22479
[17292] sendto(46, "\24\0...", 20, ...) = 20
[17292] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17292] poll([{fd=46, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 1000) = 1 ([{fd=46, revents=POLLOUT}])
[17292] select(47, [46], [46], NULL, {1, 0}) = 1 (out [46], left {0, 999997})
[17292] sendto(46, "<MC>"..., 35, MSG_NOSIGNAL, NULL, 0) = 35
[17292] select(47, [46], [], NULL, {1, 0}) = 1 (in [46], left {0, 999997})
[17292] recvfrom(46, "<MC>"..., 32768, 0, NULL, NULL) = 7280
[17292] flock(46, LOCK_EX <unfinished ...>
[17287] sendto(46, "\24\0...", 20, ...) = 20
[17287] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17287] poll([{fd=46, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 1000) = 1 ([{fd=46, revents=POLLOUT}])
[17287] select(47, [46], [46], NULL, {1, 0}) = 1 (out [46], left {0, 999997})
[17287] sendto(46, "<MC>"..., 35, MSG_NOSIGNAL, NULL, 0) = 35
[17287] select(47, [46], [], NULL, {1, 0}) = 1 (in [46], left {0, 999997})
[17287] recvfrom(46, "<MC>"..., 32768, 0, NULL, NULL) = 7280
[17287] sendto(46, "\24\0...", 20, ...) = 20
[17287] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17287] poll([{fd=46, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 1000) = 1 ([{fd=46, revents=POLLOUT}])
[17287] select(47, [46], [46], NULL, {1, 0}) = 1 (out [46], left {0, 999997})
[17287] sendto(46, "<MC>"..., 35, MSG_NOSIGNAL, NULL, 0) = 35
[17287] select(47, [46], [], NULL, {1, 0}) = 1 (in [46], left {0, 999997})
[17287] recvfrom(46, "<MC>"..., 32768, 0, NULL, NULL) = 7280
[17287] flock(46, LOCK_EX <unfinished ...>
[17289] <... poll resumed> )        = 0 (Timeout)
[17289] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
[17289] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 1000 <unfinished ...>
[17276] <... select resumed> )      = 0 (Timeout)
[17276] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[17284] <... poll resumed> )        = 0 (Timeout)
[17284] poll([{fd=47, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
[17284] poll([{fd=47, [...] <unfinished ...>
...

不确定poll()超时是否与此有关。

为了进一步生成流量模式,并指出与SIGCHLD通知的奇怪关联,这里strace 仅使用connect(), 。为了清楚起见,在 5 秒内没有看到任何输出后,我按了几次回车键(尽管大多数输出​​都是连续爆发的(尽管有时在和其余输出recvfrom()之间有几秒钟))。GET /...

[17289] sendto(47, "GET /[BACKEND CALL]"..., 390, MSG_NOSIGNAL, NULL, 0) = 390


[17283] --- SIGCHLD (Child exited) @ 0 (0) ---
[17292] sendto(47, "\24\0...", 20, ...) = 20
[17292] connect(47, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17292] sendto(47, "[MC]"..., 37, MSG_NOSIGNAL, NULL, 0) = 37
[17292] recvfrom(47, "[MC]"..., 32768, 0, NULL, NULL) = 22479
[17288] sendto(47, "\24\0...", 20, ...) = 20
[17288] connect(47, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17288] sendto(47, "[MC]"..., 37, ...) = 37
[17288] recvfrom(47, "[MC]"..., ...) = 22479
[17288] sendto(47, "\24\0...", 20, ...) = 20
[17288] connect(47, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17288] sendto(47, "[MC]"..., 35, ...) = 35
[17288] recvfrom(47, "[MC]"..., ...) = 7280
[17288] sendto(47, "\24\0...", 20, ...) = 20
[17288] connect(47, ...(8888)...("127.0.0.1")...) = -1 EINPROGRESS (...)
[17288] sendto(47, "GET /[BACKEND CALL]"..., 380, ...) = 380
[17292] sendto(46, "\24\0...", 20, ...) = 20
[17292] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17292] sendto(46, "[MC]"..., 35, ...) = 35
[17292] recvfrom(46, "[MC]"..., ...) = 7280
[17292] sendto(46, "\24\0...", 20, ...) = 20
[17292] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17292] sendto(46, "[MC]"..., 37, ...) = 37
[17292] recvfrom(46, "[MC]"..., ...) = 22479
[17283] sendto(46, "\24\0...", 20, ...) = 20
[17283] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17292] sendto(46, "\24\0...", 20, ...) = 20
[17283] sendto(46, "[MC]"..., 35, ...) = 35
[17283] recvfrom(46, "[MC]"..., ...) = 7280
[17292] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17292] sendto(46, "[MC]"..., 35, ...) = 35
[17292] recvfrom(46, "[MC]"..., ...) = 7280
[17283] sendto(46, "\24\0...", 20, ...) = 20
[17283] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17283] sendto(46, "[MC]"..., 37, ...) = 37
[17283] recvfrom(46, "[MC]"..., ...) = 22479
[17292] sendto(46, "\24\0...", 20, ...) = 20
[17292] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17292] sendto(46, "[MC]"..., 35, ...) = 35
[17292] recvfrom(46, "[MC]"..., ...) = 7280
[17283] sendto(46, "\24\0...", 20, ...) = 20
[17283] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17283] sendto(46, "[MC]"..., 35, ...) = 35
[17283] recvfrom(46, "[MC]"..., ...) = 7280
[17283] sendto(46, "\24\0...", 20, ...) = 20
[17283] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17283] sendto(46, "[MC]"..., 35, ...) = 35
[17283] recvfrom(46, "[MC]"..., ...) = 7280


<Long Wait>


[17284] --- SIGCHLD (Child exited) @ 0 (0) ---
[17284] sendto(46, "\24\0...", 20, ...) = 20
[17284] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17284] sendto(46, "[MC]"..., 35, ...) = 35
[17284] recvfrom(46, "[MC]"..., ...) = 7280
[17284] sendto(46, "\24\0...", 20, ...) = 20
[17284] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17284] sendto(46, "[MC]"..., 37, ...) = 37
[17284] recvfrom(46, "[MC]"..., ...) = 22479
[17284] sendto(46, "\24\0...", 20, ...) = 20
[17284] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17284] sendto(46, "[MC]"..., 35, ...) = 35
[17284] recvfrom(46, "[MC]"..., ...) = 7280
[17284] sendto(46, "\24\0...", 20, ...) = 20
[17284] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17284] sendto(46, "[MC]"..., 35, ...) = 35
[17284] recvfrom(46, "[MC]"..., ...) = 7280
[17284] sendto(47, "\24\0...", 20, ...) = 20
[17284] connect(47, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17284] sendto(47, "[MC]"..., 37, ...) = 37
[17284] recvfrom(47, "[MC]"..., ...) = 22479
[17284] connect(47, ...(8888)...("127.0.0.1")...) = -1 EINPROGRESS (...)
[17284] sendto(47, "POST /[BACKEND CALL]"..., 412, ...) = 412


<Long Wait>


[17289] --- SIGCHLD (Child exited) @ 0 (0) ---
[17293] sendto(47, "\24\0...", 20, ...) = 20
[17293] connect(47, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17293] sendto(47, "[MC]"..., 37, ...) = 37
[17293] recvfrom(47, "[MC]"..., ...) = 22479
[17290] sendto(47, "\24\0...", 20, ...) = 20
[17290] connect(47, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17290] sendto(47, "[MC]"..., 37, ...) = 37
[17290] recvfrom(47, "[MC]"..., ...) = 22479
[17290] sendto(47, "\24\0...", 20, ...) = 20
[17290] connect(47, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17290] sendto(47, "[MC]"..., 35, ...) = 35
[17290] recvfrom(47, "[MC]"..., ...) = 7280
[17290] connect(47, ...(8888)...("127.0.0.1")...) = -1 EINPROGRESS (...)
[17290] sendto(47, "POST /[BACKEND CALL]"..., 427, ...) = 427
[17289] sendto(46, "\24\0...", 20, ...) = 20
[17289] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17289] sendto(46, "[MC]"..., 35, ...) = 35
[17289] recvfrom(46, "[MC]"..., ...) = 7280
[17289] sendto(46, "\24\0...", 20, ...) = 20
[17289] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17289] sendto(46, "[MC]"..., 37, ...) = 37
[17289] recvfrom(46, "[MC]"..., ...) = 22479
[17289] sendto(46, "\24\0...", 20, ...) = 20
[17289] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17289] sendto(46, "[MC]"..., 35, ...) = 35
[17289] recvfrom(46, "[MC]"..., ...) = 7280
[17289] sendto(46, "\24\0...", 20, ...) = 20
[17289] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17289] sendto(46, "[MC]"..., 35, ...) = 35
[17289] recvfrom(46, "[MC]"..., ...) = 7280
[17293] sendto(46, "\24\0...", 20, ...) = 20
[17293] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17293] sendto(46, "[MC]"..., 35, ...) = 35
[17293] recvfrom(46, "[MC]"..., ...) = 7280
[17293] sendto(46, "\24\0...", 20, ...) = 20
[17293] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17293] sendto(46, "[MC]"..., 37, ...) = 37
[17293] recvfrom(46, "[MC]"..., ...) = 22479
[17293] sendto(46, "\24\0...", 20, ...) = 20
[17293] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17293] sendto(46, "[MC]"..., 35, ...) = 35
[17293] recvfrom(46, "[MC]"..., ...) = 7280
[17293] sendto(46, "\24\0...", 20, ...) = 20
[17293] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17293] sendto(46, "[MC]"..., 35, ...) = 35
[17293] recvfrom(46, "[MC]"..., ...) = 7280


<Long Wait>


[17288] --- SIGCHLD (Child exited) @ 0 (0) ---
[17291] sendto(47, "\24\0...", 20, ...) = 20
[17291] connect(47, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17291] sendto(47, "[MC]"..., 37, ...) = 37
[17291] recvfrom(47, "[MC]"..., ...) = 22479
[17291] sendto(47, "\24\0...", 20, ...) = 20
[17291] connect(47, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17291] sendto(47, "[MC]"..., 35, ...) = 35
[17291] recvfrom(47, "[MC]"..., ...) = 7280
[17291] sendto(47, "\24\0...", 20, ...) = 20
[17291] connect(47, ...(8888)...("127.0.0.1")...) = -1 EINPROGRESS (...)
[17291] sendto(47, "GET /[BACKEND CALL]"..., 380, ...) = 380
[17288] sendto(46, "\24\0...", 20, ...) = 20
[17288] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17288] sendto(46, "[MC]"..., 35, ...) = 35
[17288] recvfrom(46, "[MC]"..., ...) = 7280
[17288] sendto(46, "\24\0...", 20, ...) = 20
[17288] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17288] sendto(46, "[MC]"..., 37, ...) = 37
[17288] recvfrom(46, "[MC]"..., ...) = 22479
[17288] sendto(46, "\24\0...", 20, ...) = 20
[17288] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17288] sendto(46, "[MC]"..., 35, ...) = 35
[17288] recvfrom(46, "[MC]"..., ...) = 7280
[17288] sendto(46, "\24\0...", 20, ...) = 20
[17288] connect(46, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17288] sendto(46, "[MC]"..., 35, ...) = 35
[17288] recvfrom(46, "[MC]"..., ...) = 7280


<Long Wait>


[17284] --- SIGCHLD (Child exited) @ 0 (0) ---
[17284] sendto(47, "\24\0...", 20, ...) = 20
[17284] connect(47, ...11211...127.0.0.1 = -1 EINPROGRESS (...)
[17284] sendto(47, "[MC]"..., 35, ...) = 35
[17284] recvfrom(47, "[MC]"..., ...) = 7280
[17284] connect(47, ...(8888)...("127.0.0.1")...) = -1 EINPROGRESS (...)
[17284] sendto(47, "GET /[BACKEND CALL]"..., 347, ...) = 347


<Long Wait>
...

请注意,这些请求中没有一个是“Web UI 请求”。有些请求直接到达后端,因此这并不一定意味着前端请求永远不会通过。我发现这个帖子这听起来与我遇到的问题非常相似,但解决方案(不在请求中进行 HTTP 调用)实际上并不适用。长时间观察上述输出时,每 100 多个“后端”请求中我只会看到一个“Web UI”请求,因此我没有感觉到这是在等待后端调用返回时发生的死锁,但只有 10 个MaxClients,可能至少有 1 个在等待(不确定;无法看到/server-status此问题何时出现)。

我现在很困惑;以前从未见过这样的事情。上面的信息可能不足以正确回答这个问题,所以如果有人能想到任何其他输出或测试来帮助调试这个问题,请告诉我,我会进行编辑。任何指导都将不胜感激。

相关内容