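I've run into a very strange problem with Apache. I have a large number of virtual hosts configured, about 501.
The problem starts after vhost number 493. The first 493 vhosts work as expected, but as soon as I add vhost number 494, PHP stops talking to memcache and every read/write access times out.
I'm actually using memcache as the session storage backend, so the PHP function: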
session_start();
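times out after 30 seconds.
If I remove any one of the 494 vhosts at random and restart Apache, it starts working again.
I set ulimit very high (65k), which didn't help. I also tried removing the ulimit entirely, which didn't help either.
Do you have any idea what else I could try?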
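For anyone checking the same thing, here is a minimal sketch of how the descriptor limit and the descriptors actually in use can be inspected. It assumes Linux (it reads `/proc/<pid>/fd`), and the function name `fd_report` is made up for illustration. It reports the limit for the process running the script, so it is mainly useful for confirming what a given ulimit setting really allows and how high the descriptor numbers climb.

```python
import os
import resource


def fd_report(pid="self"):
    """Summarise the open-file limit and the descriptors in use.

    Assumes Linux: /proc/<pid>/fd holds one entry per open descriptor.
    Reading another process's entry needs matching privileges.
    """
    # RLIMIT_NOFILE is what `ulimit -n` controls; this is the limit of the
    # *current* process, not of `pid`.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # The listing may include the temporary descriptor used to read the directory.
    open_fds = sorted(int(name) for name in os.listdir(f"/proc/{pid}/fd"))
    return {
        "soft_limit": soft,
        "hard_limit": hard,
        "open_count": len(open_fds),
        "highest_fd": open_fds[-1],
    }


if __name__ == "__main__":
    for key, value in fd_report().items():
        print(f"{key}: {value}")
```

Passing an httpd child's PID (as root) instead of `"self"` shows how many descriptors that child holds and how high their numbers go.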
I tried running strace on the httpd process I was connected to. The 30-second wait starts after I hit Enter in the browser.
Here is the strace output:
select(1170, [1024 1169], [], NULL, {1, 0}) = 2 (in [1024 1169], left {0, 999998})
select(1170, [1024 1169], [], NULL, {1, 0}) = 2 (in [1024 1169], left {0, 999998})
select(1170, [1024 1169], [], NULL, {1, 0}) = 2 (in [1024 1169], left {0, 999998})
select(1170, [1024 1169], [], NULL, {1, 0}) = 2 (in [1024 1169], left {0, 999998})
select(1170, [1024 1169], [], NULL, {1, 0}) = 2 (in [1024 1169], left {0, 999998})
So basically Apache sits in select() and that's it: it repeats the select() system call indefinitely.
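One detail worth pulling out of that trace is the descriptor numbers being passed to select(). Below is a small sketch that parses a strace select() line and lists the read descriptors at or above FD_SETSIZE. The helper name `fds_beyond_fd_set` is hypothetical, and the value 1024 is an assumption (the usual default on Linux), so treat this as a way of reading the trace rather than a diagnosis.

```python
import re

FD_SETSIZE = 1024  # assumed default; valid fd_set slots are 0 .. FD_SETSIZE - 1

# Matches the start of a strace line such as: select(1170, [1024 1169], ...
SELECT_RE = re.compile(r"select\((\d+), \[([\d ]*)\]")


def fds_beyond_fd_set(strace_line, fd_setsize=FD_SETSIZE):
    """Return the read-set descriptors that do not fit in an fd_set of the given size."""
    match = SELECT_RE.match(strace_line)
    if match is None:
        return []  # not a select() line in the expected format
    read_fds = [int(token) for token in match.group(2).split()]
    return [fd for fd in read_fds if fd >= fd_setsize]


if __name__ == "__main__":
    line = "select(1170, [1024 1169], [], NULL, {1, 0}) = 2 (in [1024 1169], left {0, 999998})"
    print(fds_beyond_fd_set(line))  # [1024, 1169]
```

Both descriptors in the trace above are flagged, since the highest slot a 1024-entry fd_set can hold is 1023.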
The next thing I thought of was tcpdump, to see whether the packets were actually leaving Apache, and they were:
22:11:28.366677 IP6 ::1.51404 > ::1.11914: Flags [S], seq 2899674987, win 32752, options [mss 16376,sackOK,TS val 1384759049 ecr 0,nop,wscale 9], length 0
22:11:28.366697 IP6 ::1.11914 > ::1.51404: Flags [S.], seq 2034630080, ack 2899674988, win 32728, options [mss 16376,sackOK,TS val 1384759049 ecr 1384759049,nop,wscale 9], length 0
22:11:28.366709 IP6 ::1.51404 > ::1.11914: Flags [.], ack 1, win 64, options [nop,nop,TS val 1384759049 ecr 1384759049], length 0
22:11:28.366752 IP6 ::1.51404 > ::1.11914: Flags [P.], seq 1:41, ack 1, win 64, options [nop,nop,TS val 1384759049 ecr 1384759049], length 40
22:11:28.366758 IP6 ::1.11914 > ::1.51404: Flags [.], ack 41, win 64, options [nop,nop,TS val 1384759049 ecr 1384759049], length 0
22:11:28.366768 IP6 ::1.51404 > ::1.11914: Flags [P.], seq 41:90, ack 1, win 64, options [nop,nop,TS val 1384759050 ecr 1384759049], length 49
22:11:28.366772 IP6 ::1.11914 > ::1.51404: Flags [.], ack 90, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 0
22:11:28.366779 IP6 ::1.51404 > ::1.11914: Flags [P.], seq 90:122, ack 1, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 32
22:11:28.366783 IP6 ::1.11914 > ::1.51404: Flags [.], ack 122, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 0
22:11:28.367063 IP6 ::1.11914 > ::1.51404: Flags [P.], seq 1:12, ack 122, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 11
22:11:28.367070 IP6 ::1.51404 > ::1.11914: Flags [.], ack 12, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 0
22:11:28.367266 IP6 ::1.11914 > ::1.51404: Flags [P.], seq 12:20, ack 122, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 8
22:11:28.367275 IP6 ::1.51404 > ::1.11914: Flags [.], ack 20, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 0
22:11:28.367477 IP6 ::1.11914 > ::1.51404: Flags [P.], seq 20:25, ack 122, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 5
22:11:28.367489 IP6 ::1.51404 > ::1.11914: Flags [.], ack 25, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 0
22:11:28.367629 IP6 ::1.51404 > ::1.11914: Flags [P.], seq 122:181, ack 25, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 59
22:11:28.367859 IP6 ::1.11914 > ::1.51404: Flags [P.], seq 25:33, ack 181, win 64, options [nop,nop,TS val 1384759051 ecr 1384759050], length 8
22:11:28.367869 IP6 ::1.51404 > ::1.11914: Flags [P.], seq 181:230, ack 33, win 64, options [nop,nop,TS val 1384759051 ecr 1384759051], length 49
22:11:28.368102 IP6 ::1.11914 > ::1.51404: Flags [P.], seq 33:41, ack 230, win 64, options [nop,nop,TS val 1384759051 ecr 1384759051], length 8
22:11:28.368138 IP6 ::1.51404 > ::1.11914: Flags [F.], seq 230, ack 41, win 64, options [nop,nop,TS val 1384759051 ecr 1384759051], length 0
22:11:28.368195 IP6 ::1.11914 > ::1.51404: Flags [F.], seq 41, ack 231, win 64, options [nop,nop,TS val 1384759051 ecr 1384759051], length 0
22:11:28.368206 IP6 ::1.51404 > ::1.11914: Flags [.], ack 42, win 64, options [nop,nop,TS val 1384759051 ecr 1384759051], length 0
Next, I used GDB to step through the Apache process while issuing a curl request to a page containing session_start(). Here is the output:
232 *(*new)->local_addr = *sock->local_addr;
241 if (sock->local_addr->sa.sin.sin_family == AF_INET) {
238 (*new)->local_addr->pool = connection_context;
241 if (sock->local_addr->sa.sin.sin_family == AF_INET) {
238 (*new)->local_addr->pool = connection_context;
241 if (sock->local_addr->sa.sin.sin_family == AF_INET) {
245 else if (sock->local_addr->sa.sin.sin_family == AF_INET6) {
246 (*new)->local_addr->ipaddr_ptr = &(*new)->local_addr->sa.sin6.sin6_addr;
249 (*new)->remote_addr->port = ntohs((*new)->remote_addr->sa.sin.sin_port);
250 if (sock->local_port_unknown) {
256 if (apr_is_option_set(sock, APR_TCP_NODELAY) == 1) {
257 apr_set_option(*new, APR_TCP_NODELAY, 1);
266 if (sock->local_interface_unknown ||
267 !memcmp(sock->local_addr->ipaddr_ptr,
266 if (sock->local_interface_unknown ||
276 (*new)->local_interface_unknown = 1;
293 apr_pool_cleanup_register((*new)->pool, (void *)(*new), socket_cleanup,
292 (*new)->inherit = 0;
293 apr_pool_cleanup_register((*new)->pool, (void *)(*new), socket_cleanup,
296 }
unixd_accept (accepted=0x7fff14ecddf0, lr=0x7fe93a905aa8, ptrans=<value optimized out>) at /usr/src/debug/httpd-2.2.15/os/unix/unixd.c:507
507 if (status == APR_SUCCESS) {
508 *accepted = csd;
649 }
child_main (child_num_arg=<value optimized out>) at /usr/src/debug/httpd-2.2.15/server/mpm/prefork/prefork.c:650
650 SAFE_ACCEPT(accept_mutex_off()); /* unlock after "accept" */
652 if (status == APR_EGENERAL) {
656 else if (status != APR_SUCCESS) {
665 current_conn = ap_run_create_connection(ptrans, ap_server_conf, csd, my_child_num, sbh, bucket_alloc);
666 if (current_conn) {
667 ap_process_connection(current_conn, csd);
At this point there is a long pause (about 30 seconds) until PHP times out. After that, I got the following:
668 ap_lingering_close(current_conn);
676 if (ap_mpm_pod_check(pod) == APR_SUCCESS) { /* selected as idle? */
680 ap_scoreboard_image->global->running_generation) { /* restart? */
679 else if (ap_my_generation !=
680 ap_scoreboard_image->global->running_generation) { /* restart? */
679 else if (ap_my_generation !=
551 while (!die_now && !shutdown_pending) {
559 apr_pool_clear(ptrans);
562 && requests_this_child++ >= ap_max_requests_per_child)) {
561 if ((ap_max_requests_per_child > 0
562 && requests_this_child++ >= ap_max_requests_per_child)) {
561 if ((ap_max_requests_per_child > 0
562 && requests_this_child++ >= ap_max_requests_per_child)) {
561 if ((ap_max_requests_per_child > 0
566 (void) ap_update_child_status(sbh, SERVER_READY, (request_rec *) NULL);
573 SAFE_ACCEPT(accept_mutex_on());
575 if (num_listensocks == 1) {
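The strangest part is that I can't reproduce this on another machine: same OS, same packages, same configuration (Puppet), same kernel, different hardware.
Answer 1
After weeks of debugging and hunting for the problem, I finally stumbled across this message: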
You MUST recompile PHP with a larger value of FD_SETSIZE.
It is set to 1024, but you have descriptors numbered at least as high as 1073.
--enable-fd-setsize=2048 is recommended, but you may want to set it to equal
the maximum number of open files supported by your system, in order to avoid
seeing this error again at a later date.
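To make the message concrete, here is a toy model of an fd_set: a fixed-size bitmap with one bit per descriptor number, sized at compile time by FD_SETSIZE. The class name `FdSet` is made up for illustration, and this is not PHP's or glibc's code. One deliberate difference: the model raises an error for an out-of-range descriptor, whereas in C using FD_SET with a descriptor at or above FD_SETSIZE is undefined behaviour and typically writes past the end of the bitmap.

```python
class FdSet:
    """Toy model of a C fd_set: one bit per descriptor, fixed capacity."""

    def __init__(self, fd_setsize=1024):
        self.fd_setsize = fd_setsize
        self.bits = bytearray(fd_setsize // 8)

    def add(self, fd):
        # The real FD_SET macro performs no such check; the model makes the
        # overflow visible instead of silently corrupting memory.
        if not 0 <= fd < self.fd_setsize:
            raise ValueError(f"fd {fd} does not fit in an fd_set of {self.fd_setsize} bits")
        self.bits[fd // 8] |= 1 << (fd % 8)

    def __contains__(self, fd):
        if not 0 <= fd < self.fd_setsize:
            return False
        return bool(self.bits[fd // 8] & (1 << (fd % 8)))


if __name__ == "__main__":
    default_set = FdSet()          # 1024 bits, as in the message above
    default_set.add(1023)          # highest descriptor that fits
    try:
        default_set.add(1073)      # descriptor number quoted in the message
    except ValueError as exc:
        print("rejected:", exc)
    FdSet(2048).add(1073)          # fits once the set is rebuilt with 2048 bits
```

This is also why raising ulimit alone makes no difference here: the process may open more files, but the bitmap handed to select() is no larger.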
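I'll try this fix, but seriously, why would the PHP developers do this? It's ugly; hard-coding the nofile limit is simply the wrong design. On top of that, if this really is the solution, having to recompile PHP for every minor release and security patch and maintain my own packages is a real pain.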
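For comparison, poll() takes a list of descriptors rather than a fixed-size bitmap, so it has no FD_SETSIZE ceiling. The sketch below waits for a socket to become readable using poll(). It is only an illustration of the alternative interface, not what PHP or the memcache extension does, and the helper name `wait_readable` is hypothetical.

```python
import select
import socket


def wait_readable(sock, timeout_ms):
    """Return True if `sock` becomes readable within timeout_ms milliseconds.

    Uses poll(), which accepts arbitrary descriptor numbers, unlike select()
    with its FD_SETSIZE-bit bitmap.
    """
    poller = select.poll()
    poller.register(sock.fileno(), select.POLLIN)
    events = poller.poll(timeout_ms)
    return any(fd == sock.fileno() and (mask & select.POLLIN) for fd, mask in events)


if __name__ == "__main__":
    reader, writer = socket.socketpair()
    print(wait_readable(reader, 0))      # nothing sent yet
    writer.sendall(b"ping")
    print(wait_readable(reader, 1000))   # data is waiting
    reader.close()
    writer.close()
```

The same call works unchanged whether the socket's descriptor number is 5 or 1073.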
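Edit: After more extensive debugging, it seems that it's not only PHP that is "badly designed"; the memcache extension itself has a number of problems of its own.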
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=629896
https://bugs.php.net/bug.php?id=59876
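These bugs have been open for a while, but nothing has happened. I guess I should drop the memcache extension and find a solution that doesn't depend on it :-/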