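I have Varnish configured to listen on port 80 and nginx on port 8080. After running normally for about 24 hours, my site went down and stayed down for 22 hours. When I checked, I found that Varnish was no longer listening on port 80.
When the site is up: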
abc@abc:~$ sudo netstat -anp --tcp --udp | grep LISTEN
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN 571/varnishd
tcp 0 0 127.0.0.1:8080 0.0.0.0:* LISTEN 376/nginx
tcp 0 0 0.0.0.0:9171 0.0.0.0:* LISTEN 376/nginx
tcp 0 0 publicip:6082 0.0.0.0:* LISTEN 569/varnishd
tcp6 0 0 :::80 :::* LISTEN 376/nginx
tcp6 0 0 ::1:6082 :::* LISTEN 569/varnishd
When the site is down:
abc@abc:~$ sudo netstat -anp --tcp --udp | grep LISTEN
tcp 0 0 127.0.0.1:8080 0.0.0.0:* LISTEN 376/nginx
tcp 0 0 0.0.0.0:9171 0.0.0.0:* LISTEN 376/nginx
tcp 0 0 publicip:6082 0.0.0.0:* LISTEN 745/varnishd
tcp6 0 0 :::80 :::* LISTEN 376/nginx
tcp6 0 0 ::1:6082 :::* LISTEN 745/varnishd
This is my /etc/default/varnish:
## Alternative 2, Configuration with VCL
#
# Listen on port 6081, administration on localhost:6082, and forward to
# one content server selected by the vcl file, based on the request. Use a 1GB
# fixed-size cache file.
#
DAEMON_OPTS="-a :80 \
-T localhost:6082 \
-f /etc/varnish/default.vcl \
-S /etc/varnish/secret \
-s malloc,96m"
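Is there a specific reason why Varnish is not listening on port 80 in the second case? I could probably just check whether Varnish is running and restart it if it is not, but that would still mean a few minutes of downtime.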
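A minimal sketch of such a check, assuming it is fed the same netstat output shown above. The function name varnish_listening is hypothetical, and the restart command in the comment assumes the stock Ubuntu init script; this only papers over the symptom and does not explain why the child dies.

```shell
#!/bin/sh
# Hypothetical helper: reads `netstat -anp` style output on stdin and
# succeeds only if a varnishd process has a LISTEN socket on port 80.
varnish_listening() {
    # ":80" must be followed by whitespace so that ":8080" does not match.
    grep -Eq ':80[[:space:]].*LISTEN[[:space:]].*varnishd'
}

# Intended use from root's crontab (left as a comment so that nothing is
# restarted just by loading this file):
#   netstat -anp --tcp | varnish_listening || service varnish restart
```

Run from cron once a minute, this would bound the downtime to roughly the cron interval plus the restart time.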
My varnish.vcl file: http://pastebin.com/UH2c8KdH. I am on Ubuntu 12.04 x86.
It happened again after about 2 hours. This is what I found in syslog:
Feb 14 18:16:00 abc varnishd[745]: Child (749) not responding to CLI, killing it.
Feb 14 18:16:51 abc varnishd[745]: Child (749) not responding to CLI, killing it.
Feb 14 18:17:49 abc varnishd[745]: Child (749) not responding to CLI, killing it.
Feb 14 18:18:06 abc varnishd[745]: Child (749) not responding to CLI, killing it.
Feb 14 18:19:33 abc varnishd[745]: Child (749) not responding to CLI, killing it.
Feb 14 18:21:25 abc varnishd[745]: Child (749) not responding to CLI, killing it.
Feb 14 18:22:34 abc varnishd[745]: Child (749) not responding to CLI, killing it.
Feb 14 18:28:28 abc varnishd[745]: Child (749) not responding to CLI, killing it.
Feb 14 18:29:41 abc varnishd[745]: Child (749) not responding to CLI, killing it.
Feb 14 18:29:48 abc last message repeated 2 times
Feb 14 18:29:48 abc varnishd[745]: Child (749) died signal=3
Feb 14 18:29:49 abc varnishd[745]: Child cleanup complete
Feb 14 18:29:55 abc varnishd[745]: child (1380) Started
Feb 14 18:29:58 abc varnishd[745]: Pushing vcls failed: CLI communication error (hdr)
Feb 14 18:29:58 abc varnishd[745]: Stopping Child
Feb 14 18:29:58 abc varnishd[745]: Child (1380) said Child starts
Feb 14 18:29:59 abc varnishd[745]: Child (1380) said Child dies
Feb 14 18:30:02 abc varnishd[745]: Child (1380) died status=1
Feb 14 18:30:04 abc varnishd[745]: Child cleanup complete
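I am not sure why the process IDs differ from the ones I posted earlier. Perhaps I restarted it while troubleshooting. I really can't tell much from these logs. Any help is much appreciated.
Adding more logs.
Details from /var/log/messages:
First stop: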
Feb 13 17:40:44 dragon75 varnishd[581]: Child (583) died signal=3
Feb 13 17:41:09 dragon75 varnishd[581]: child (2682) Started
Feb 13 17:42:31 dragon75 varnishd[581]: Child (2682) said Child starts
Feb 13 17:51:48 dragon75 varnishd[581]: Child (2682) died status=1
Feb 13 17:51:48 dragon75 varnishd[581]: Child (-1) said Child dies
Second stop:
Feb 14 18:29:48 dragon75 varnishd[745]: Child (749) died signal=3
Feb 14 18:29:55 dragon75 varnishd[745]: child (1380) Started
Feb 14 18:29:58 dragon75 varnishd[745]: Child (1380) said Child starts
Feb 14 18:29:59 dragon75 varnishd[745]: Child (1380) said Child dies
Feb 14 18:30:02 dragon75 varnishd[745]: Child (1380) died status=1
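According to messages, Varnish started at 16:31, then there are 5 MARK messages in /var/log/messages, followed by the Varnish "child died" message at 18:29. There is nothing in between.
I don't think resources are the bottleneck. This is a new site that is still in testing, and I haven't really put anything on it yet. Apart from my uptime script on another server (which only checks the home page), there is no traffic. This is my first time using Varnish.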
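Answer 1
Increase the cli_timeout parameter in Varnish to 60 seconds.
This controls how long the monitoring parent process waits for the child process to respond to its health checks. If the operating system is busy paging data to or from disk, the default of 10 seconds may be too low. Increase it to 1 minute (which is the default from 4.0 onward) and see whether the problem goes away.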
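As a sketch, assuming the /etc/default/varnish from the question is otherwise unchanged, the parameter can be set at startup with varnishd's -p option:

```shell
# /etc/default/varnish (sourced as shell by the init script)
# Same options as in the question, plus a longer CLI timeout.
DAEMON_OPTS="-a :80 \
-T localhost:6082 \
-f /etc/varnish/default.vcl \
-S /etc/varnish/secret \
-p cli_timeout=60 \
-s malloc,96m"
```

After restarting Varnish, running varnishadm param.show cli_timeout should report the new value. The parameter can also be changed on a running instance with varnishadm param.set cli_timeout 60, but that change is lost on the next restart.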
If that doesn't help, my next guess would be overeager log rotation scripts that kill more processes than they should.