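EDIT 1: I tried the suggested settings in pre-production, without success. Since I can reproduce the problem in another environment, I looked for similarities between the two.
I disabled logrotate for Varnish, but it had no effect. The only thing in common is cron.hourly. It is empty.
Here is what I have in /var/log/cron: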
Feb 3 14:01:01 SPRX0032 CROND[32006]: (root) CMD (run-parts /etc/cron.hourly)
Feb 3 14:01:01 SPRX0032 run-parts(/etc/cron.hourly)[32006]: starting 0anacron
Feb 3 14:01:01 SPRX0032 run-parts(/etc/cron.hourly)[32020]: finished 0anacron
And at the exact same moment, my probes report:
03-02-14_14:01:01 - - - /!\ WARNING /!\|HTTP code 503| website1 SERVER IS DOWN booh! /!\ WARNING /!\
03-02-14_14:01:01 - - - /!\ WARNING /!\|HTTP code 503| website2 SERVER IS DOWN booh! /!\ WARNING /!\
03-02-14_14:01:01 - - - /!\ WARNING /!\|HTTP code 503| website3 SERVER IS DOWN booh! /!\ WARNING /!\
03-02-14_14:01:01 - - - /!\ WARNING /!\|HTTP code 503| website4 SERVER IS DOWN booh! /!\ WARNING /!\
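To check whether the 503s line up with cron runs more than once, the two logs can be compared mechanically. Below is a minimal Python sketch, assuming the log formats shown above. The function names and the 5-second tolerance are hypothetical choices for illustration, and the year has to be supplied because syslog lines do not carry one.

```python
import re
from datetime import datetime

# "Feb 3 14:01:01 SPRX0032 CROND[32006]: ..." (syslog style, no year)
CRON_RE = re.compile(r"^(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s")
# "03-02-14_14:01:01 - - - ...|HTTP code 503| ..." (probe log, dd-mm-yy)
PROBE_RE = re.compile(r"^(\d{2}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2}) .*HTTP code (\d{3})")


def cron_times(lines, year):
    """Return the set of timestamps found in /var/log/cron lines."""
    times = set()
    for line in lines:
        m = CRON_RE.match(line)
        if m:
            month, day, clock = m.groups()
            times.add(datetime.strptime(f"{year} {month} {day} {clock}",
                                        "%Y %b %d %H:%M:%S"))
    return times


def failures_near_cron(probe_lines, cron_lines, year, tolerance_s=5):
    """Return probe lines with a non-200 status within tolerance_s of a cron entry."""
    crons = cron_times(cron_lines, year)
    hits = []
    for line in probe_lines:
        m = PROBE_RE.match(line)
        if not m or m.group(2) == "200":
            continue
        when = datetime.strptime(m.group(1), "%d-%m-%y_%H:%M:%S")
        if any(abs((when - c).total_seconds()) <= tolerance_s for c in crons):
            hits.append(line)
    return hits
```

If most failures fall inside the tolerance window, something started from cron is a likely suspect; if only a few do, the match above may be a coincidence.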
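Original post:
I have a cluster of 2 Varnish (3.0.4) servers with a failover director in front of 3 Apache (2.2.15) servers, and I occasionally get 503 errors. They happen completely at random, on many pages, and never last more than a minute. To find the culprit, I set up 3 probes, each a curl script that fetches a page's HTTP status code every minute. The probes fetch a page from:
- A personal connection (external).
- The LAN, hitting Varnish.
- The LAN, hitting Apache.
The probe hitting Apache shows no errors. Both the external probe and the probe hitting Varnish report errors.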
29-01-14_06:17:01 - - - |HTTP code 200||response time:0.254|website1 is UP woot!
29-01-14_06:17:02 - - - |HTTP code 200||response time:0.264|website2 is UP woot!
29-01-14_06:17:02 - - - |HTTP code 200||response time:0.477|website3 is UP woot!
29-01-14_06:17:03 - - - |HTTP code 200||response time:0.283|website4 is UP woot!
29-01-14_06:17:04 - - - |HTTP code 200||response time:0.782|website5 is UP woot!
29-01-14_06:18:28 - - - |HTTP code 200||response time:0.167|website1 is UP woot!
29-01-14_06:18:28 - - - /!\ WARNING /!\|HTTP code 503| website2 SERVER IS DOWN booh! /!\ WARNING /!\
29-01-14_06:18:28 - - - /!\ WARNING /!\|HTTP code 503| website3 SERVER IS DOWN booh! /!\ WARNING /!\
29-01-14_06:18:29 - - - /!\ WARNING /!\|HTTP code 503| website4 SERVER IS DOWN booh! /!\ WARNING /!\
29-01-14_06:18:29 - - - /!\ WARNING /!\|HTTP code 503| website5 SERVER IS DOWN booh! /!\ WARNING /!\
29-01-14_06:19:01 - - - |HTTP code 200||response time:0.243|website1 is UP woot!
29-01-14_06:19:02 - - - |HTTP code 200||response time:0.313|website2 is UP woot!
29-01-14_06:19:03 - - - |HTTP code 200||response time:0.552|website3 is UP woot!
29-01-14_06:19:03 - - - |HTTP code 200||response time:0.348|website4 is UP woot!
29-01-14_06:19:05 - - - |HTTP code 200||response time:0.704|website5 is UP woot!
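For anyone who wants to reproduce the probes, here is a rough Python equivalent. It is a stand-in for the curl script, not the script itself: the names `format_line` and `probe` are hypothetical, and a status of 0 is my own convention for "no HTTP response at all". The output mimics the log lines above.

```python
import time
import urllib.error
import urllib.request
from datetime import datetime

WARN = "/!\\ WARNING /!\\"


def format_line(when, status, elapsed, name):
    """Build one log line in the same layout as the probe log above."""
    stamp = when.strftime("%d-%m-%y_%H:%M:%S")
    if status == 200:
        return (f"{stamp} - - - |HTTP code {status}|"
                f"|response time:{elapsed:.3f}|{name} is UP woot!")
    return (f"{stamp} - - - {WARN}|HTTP code {status}| "
            f"{name} SERVER IS DOWN booh! {WARN}")


def probe(name, url, timeout=10):
    """Fetch url once and return a formatted log line."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code          # e.g. 503 from Varnish
    except OSError:
        status = 0                 # connection refused, timeout, DNS failure
    elapsed = time.monotonic() - start
    return format_line(datetime.now(), status, elapsed, name)
```

Calling `probe("website1", url)` once a minute from cron, with one URL per vantage point, gives the three probes described above.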
Here are my backends:
backend srv1 {
    .host = "srv1";
    .port = "80";
    .first_byte_timeout = 300s;
    .connect_timeout = 5s;
    .between_bytes_timeout = 60s;
    .probe = {
        .url = "/";
        .interval = 5s;
        .timeout = 2s;
        .window = 5;
        .threshold = 3;
    }
}
backend srv2 {
    .host = "srv2";
    .port = "80";
    .first_byte_timeout = 300s;
    .connect_timeout = 5s;
    .between_bytes_timeout = 60s;
    .probe = {
        .url = "/";
        .interval = 5s;
        .timeout = 2s;
        .window = 5;
        .threshold = 3;
    }
}
backend srv3 {
    .host = "srv3";
    .port = "80";
    .first_byte_timeout = 300s;
    .connect_timeout = 5s;
    .between_bytes_timeout = 60s;
    .probe = {
        .url = "/";
        .interval = 5s;
        .timeout = 2s;
        .window = 5;
        .threshold = 3;
    }
}
Varnish says my backends are healthy at the time of the 503s:
varnishadm debug.health
Backend srv1 is Healthy
Current states good: 5 threshold: 3 window: 5
Average responsetime of good probes: 0.070728
Oldest Newest
================================================================
4444444444444444444444444444444444444444444444444444444444444444 Good IPv4
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Good Xmit
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR Good Recv
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH Happy
Backend srv2 is Healthy
Current states good: 5 threshold: 3 window: 5
Average responsetime of good probes: 0.089797
Oldest Newest
================================================================
4444444444444444444444444444444444444444444444444444444444444444 Good IPv4
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Good Xmit
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR Good Recv
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH Happy
Backend srv3 is Healthy
Current states good: 5 threshold: 3 window: 5
Average responsetime of good probes: 0.068935
Oldest Newest
================================================================
4444444444444444444444444444444444444444444444444444444444444444 Good IPv4
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Good Xmit
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR Good Recv
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH Happy
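One thing worth keeping in mind when reading this output: as I understand the probe semantics, a backend counts as healthy as long as at least .threshold of the last .window probes succeeded. The small Python simulation below (the function name is hypothetical) counts how many consecutive failed probes it takes for a fully healthy backend to be marked sick under that assumption.

```python
from collections import deque


def probes_until_sick(window=5, threshold=3):
    """Count consecutive failed probes needed to mark a healthy backend sick.

    Assumes the backend is healthy while at least `threshold` of the
    last `window` probe results are good.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    recent = deque([True] * window, maxlen=window)  # start fully healthy
    failed = 0
    while sum(recent) >= threshold:
        recent.append(False)   # one more failed probe pushes out the oldest result
        failed += 1
    return failed
```

With window 5 and threshold 3 this returns 3, so at a 5 s interval a backend has to fail for roughly 10 to 15 seconds before it is marked sick. A shorter hiccup can produce 503s for real requests while debug.health still shows the backend as healthy.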
I run many websites (around 200 domains), with no other problems.
varnishstat reports some backend connection failures:
Varnish #1
13+23:09:28
Hitrate ratio: 3 3 3
Hitrate avg: 0.7669 0.7669 0.7669
[...]
8 0.00 0.00 backend_fail - Backend conn. failures
Varnish #2
13+05:56:44
Hitrate ratio: 4 4 4
Hitrate avg: 0.8783 0.8783 0.8783
[...]
16 0.00 0.00 backend_fail - Backend conn. failures
The error can occur on both POST and GET requests.
POST with FetchError c backend write error: 0 (Success):
23 SessionOpen c xxx 40518 :80
23 ReqStart c xxx 40518 1750308108
23 RxRequest c POST
23 RxURL c /index.php?option=com_jce&task=plugin&plugin=imgmanager&file=imgmanager&method=form&cid=20&6bc427c8a7981f4fe1f5ac65c1246b5f=cf6dd3cf1923c950586d0dd595c8e20b
23 RxProtocol c HTTP/1.1
23 RxHeader c Reverse-Via: xxxx
23 RxHeader c Host: xxxxx
23 RxHeader c Content-Type: multipart/form-data; boundary=---------------------------41184676334
23 RxHeader c User-Agent: BOT/0.1 (BOT for JCE)
23 RxHeader c Connection: Keep-Alive
23 RxHeader c Content-Length: 5000
23 VCL_call c recv pass
23 VCL_call c hash
23 Hash c /index.php?option=com_jce&task=plugin&plugin=imgmanager&file=imgmanager&method=form&cid=20&6bc427c8a7981f4fe1f5ac65c1246b5f=cf6dd3cf1923c950586d0dd595c8e20b
23 Hash c xxxx
23 VCL_return c hash
23 VCL_call c pass pass
23 Backend c 46 cluster srv2
23 FetchError c backend write error: 0 (Success)
23 VCL_call c error deliver
23 VCL_call c deliver deliver
23 TxProtocol c HTTP/1.1
23 TxStatus c 503
23 TxResponse c Service Unavailable
23 TxHeader c Server: Varnish
23 TxHeader c Content-Type: text/html; charset=utf-8
23 TxHeader c Retry-After: 5
23 TxHeader c Content-Length: 419
23 TxHeader c Accept-Ranges: bytes
23 TxHeader c Date: Tue, 28 Jan 2014 22:18:53 GMT
23 TxHeader c X-Varnish: 1750308108
23 TxHeader c Age: 0
23 TxHeader c Via: 1.1 varnish
23 TxHeader c Connection: close
23 TxHeader c X-Cache: MISS from Varnish
23 Length c 419
23 ReqEnd c 1750308108 1390947533.416723251 1390947533.822463989 0.000115156 0.405680656 0.000060081
GET with FetchError c no backend connection:
52 ReqStart c xxx 57491 1750328476
52 RxRequest c GET
52 RxURL c /filestore/retriever.flv
52 RxProtocol c HTTP/1.1
52 RxHeader c Reverse-Via: xxx
52 RxHeader c Host: xxxx
52 RxHeader c Cookie: ISAWPLB{2722D8D3-4E35-4741-92E0-22ADE0EA8C7F}={FD7F277E-829D-4B70-8BCA-9DCB3390060F}
52 RxHeader c Referer: xxxx
52 RxHeader c User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; MAARJS; rv:11.0) like Gecko
52 RxHeader c Accept: */*
52 RxHeader c Accept-Language: nb-NO
52 RxHeader c x-flash-version: 12,0,0,38
52 RxHeader c Cache-Control: no-cache
52 RxHeader c Connection: Keep-Alive
52 VCL_call c recv lookup
52 VCL_call c hash
52 Hash c /filestore/retriever.flv
52 Hash c xxx
52 VCL_return c hash
52 VCL_call c miss fetch
52 FetchError c no backend connection
52 VCL_call c error deliver
52 VCL_call c deliver deliver
52 TxProtocol c HTTP/1.1
52 TxStatus c 503
52 TxResponse c Service Unavailable
52 TxHeader c Server: Varnish
52 TxHeader c Content-Type: text/html; charset=utf-8
52 TxHeader c Retry-After: 5
52 TxHeader c Content-Length: 419
52 TxHeader c Accept-Ranges: bytes
52 TxHeader c Date: Tue, 28 Jan 2014 23:00:53 GMT
52 TxHeader c X-Varnish: 1750328476
52 TxHeader c Age: 0
52 TxHeader c Via: 1.1 varnish
52 TxHeader c Connection: close
52 TxHeader c X-Cache: MISS from Varnish
52 Length c 419
52 ReqEnd c 1750328476 1390950053.502151966 1390950053.502464533 1.025068998 0.000263214 0.000049353
I have more entries, such as:
FetchError c http first read error: -1 0 (Success)
FetchError c Could not get storage
FetchError c http first read error: -1 11 (Resource temporarily unavailable)
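To see which of these reasons dominates, the FetchError lines can be tallied from saved varnishlog output. A minimal Python sketch, assuming the line layout shown in the transactions above (an optional session number, the tag, the `c` marker, then the message); the function name is hypothetical.

```python
import re
from collections import Counter

# Matches e.g. "23 FetchError c backend write error: 0 (Success)"
# and also lines where the leading session number is missing.
FETCH_ERROR_RE = re.compile(r"^\s*(?:\d+\s+)?FetchError\s+c\s+(.*\S)\s*$")


def count_fetch_errors(lines):
    """Return a Counter mapping each FetchError message to its number of occurrences."""
    counts = Counter()
    for line in lines:
        m = FETCH_ERROR_RE.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

Running it over a file of captured log lines, for example `count_fetch_errors(open("varnish.log"))` where `varnish.log` is a placeholder name, and then calling `.most_common()` on the result shows at a glance whether the 503s are mostly connection failures, write errors, or storage problems.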
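The Apache timeout is set to 300 seconds, and keep-alive is disabled.
How can Varnish decide that a server is in error while the probes do not?
I hope I have been as clear as possible. If you need more information, let me know.
Thanks.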