Server suddenly stops responding, then recovers an hour later

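My FreeBSD server has run flawlessly for more than 2 years with no major changes to the system. I recently installed an SSL certificate using Apache's mod_ssl, and after 10 days of running the server suddenly started crashing.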

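When the server crashes:

  • HTTPS and SSH stop responding immediately
  • Ping slows to thousands of milliseconds, then stops responding

After 15 to 60 minutes of being unreachable:

  • The server suddenly recovers and runs at full speed, as if nothing had happened
  • Then, within 15 to 60 minutes, it crashes again and the cycle repeats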

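What I have checked:

  • Rebooting the server changes nothing; it remains unreachable
  • CPU / RAM / HDD usage is normal (< 50%, including peak hours)
  • Traffic has no effect; it happens at any time of day, including 4 a.m.
  • Disabling the firewall does not help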

In httpd-error.log I found:

[notice] Digest: generating secret for digest authentication ...
[notice] Digest: done
[notice] Apache/2.2.23 (FreeBSD) mod_ssl/2.2.23 OpenSSL/0.9.8q DAV/2 configured -- resuming normal operations
[error] server reached MaxClients setting, consider raising the MaxClients setting

I tried enabling KeepAlive and increasing MaxClients substantially (4x), but that did not solve the problem:

Timeout 120
KeepAlive On
KeepAliveTimeout 5
MaxKeepAliveRequests 1000

<IfModule mpm_prefork_module>
    StartServers          50
    MinSpareServers       128
    MaxSpareServers      1024
    ServerLimit      1024
    MaxClients          1024
    MaxRequestsPerChild   1000
</IfModule>

Before the first crash, I found this in /var/log/messages:

kernel: mfi0: 228755 (454057919s/0x0008/FATAL) - Battery needs replacement - SOH Bad
kernel: mfi0: 228756 (454057984s/0x0008/FATAL) - Battery needs replacement - SOH Bad
kernel: mfi0: 228757 (454058049s/0x0008/FATAL) - Battery needs replacement - SOH Bad
kernel: arp: 176.31.237.254 moved from 00:07:b4:00:00:01 to 00:07:b4:00:00:03 on ix0
kernel: arp: 176.31.237.251 moved from 00:25:90:02:08:fc to 00:07:b4:00:00:01 on ix0
kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:01 to 00:07:b4:00:00:03 on ix0
kernel: mfi0: 228758 (454058114s/0x0008/FATAL) - Battery needs replacement - SOH Bad
kernel: mfi0: 228759 (454058179s/0x0008/FATAL) - Battery needs replacement - SOH Bad

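After the first reboot, the "Battery needs replacement" warnings disappeared, but the arp messages keep appearing in the log, at roughly the same intervals, whenever the server crashes: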

May 23 05:00:00 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:03 to 00:07:b4:00:00:01 on ix0
May 23 05:00:02 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:01 to 00:25:90:02:08:fc on ix0
May 23 05:20:00 ns228407 kernel: arp: 176.31.237.251 moved from 00:25:90:02:08:fc to 00:07:b4:00:00:01 on ix0
May 23 05:20:00 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:01 to 00:07:b4:00:00:03 on ix0
May 23 05:32:44 ns228407 kernel: arp: 176.31.237.254 moved from 00:07:b4:00:00:03 to 00:07:b4:00:00:01 on ix0
May 23 05:40:01 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:03 to 00:25:90:02:08:fc on ix0
May 23 05:40:01 ns228407 kernel: arp: 176.31.237.251 moved from 00:25:90:02:08:fc to 00:07:b4:00:00:01 on ix0
May 23 05:40:01 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:01 to 00:07:b4:00:00:03 on ix0
May 23 05:52:40 ns228407 kernel: arp: 176.31.237.254 moved from 00:07:b4:00:00:01 to 00:07:b4:00:00:03 on ix0
May 23 06:00:00 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:03 to 00:25:90:02:08:fc on ix0
May 23 06:00:00 ns228407 kernel: arp: 176.31.237.251 moved from 00:25:90:02:08:fc to 00:07:b4:00:00:01 on ix0
May 23 06:00:00 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:01 to 00:07:b4:00:00:03 on ix0
May 23 06:00:02 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:03 to 00:25:90:02:08:fc on ix0
May 23 06:20:01 ns228407 kernel: arp: 176.31.237.251 moved from 00:25:90:02:08:fc to 00:07:b4:00:00:03 on ix0
May 23 06:20:01 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:03 to 00:07:b4:00:00:01 on ix0
May 23 06:30:02 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:01 to 00:25:90:02:08:fc on ix0
May 23 06:32:36 ns228407 kernel: arp: 176.31.237.254 moved from 00:07:b4:00:00:03 to 00:07:b4:00:00:01 on ix0
May 23 06:50:01 ns228407 kernel: arp: 176.31.237.251 moved from 00:25:90:02:08:fc to 00:07:b4:00:00:01 on ix0
May 23 06:50:01 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:01 to 00:07:b4:00:00:03 on ix0
May 23 07:00:02 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:03 to 00:25:90:02:08:fc on ix0
May 23 07:12:28 ns228407 kernel: arp: 176.31.237.254 moved from 00:07:b4:00:00:01 to 00:07:b4:00:00:03 on ix0
May 23 07:20:00 ns228407 kernel: arp: 176.31.237.251 moved from 00:25:90:02:08:fc to 00:07:b4:00:00:01 on ix0
May 23 07:20:00 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:01 to 00:07:b4:00:00:03 on ix0 

What should I do next to find and fix the problem?

Answer 1

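The last thing you should do right now is increase MaxClients.

It's hard to say. The slowdown and the MaxClients warning suggest that the demand on your server is more than it can cope with. Unless you are running a lot of AJAX/COMET content on the server, you really should reduce the keepalive timeout (say, to 2 initially).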
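To make the MaxClients point concrete, here is a rough sizing sketch in Python. It uses the common rule of thumb that a prefork MaxClients should not exceed the memory you can give Apache divided by the average resident size of one child process. The function name and the figures (8192 MB for Apache, 40 MB per child) are hypothetical illustrations, not measurements from this server.

```python
def max_clients_estimate(ram_for_apache_mb, avg_child_rss_mb):
    """Rule-of-thumb ceiling for prefork MaxClients.

    ram_for_apache_mb: memory you are willing to give Apache (assumed value).
    avg_child_rss_mb:  average resident size of one httpd child (assumed value).
    """
    if ram_for_apache_mb < 0 or avg_child_rss_mb <= 0:
        raise ValueError("memory figures must be positive")
    return int(ram_for_apache_mb // avg_child_rss_mb)


# Hypothetical figures: 8192 MB set aside for Apache, 40 MB per child.
print(max_clients_estimate(8192, 40))  # → 204
```

With numbers like these, a setting of 1024 would be roughly five times what memory can back, so raising it further would tend to push the machine into swap rather than serve more clients.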

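"Battery needs replacement" is not just a reminder to do some maintenance. On a BBWC (battery-backed write cache), it means the controller is no longer attempting to cache writes, and if your system is set up correctly, then your OS and disks are not caching writes either.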

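Both point to very poor performance on your system, but the first thing you report is that it appears to be unavailable; in fact, you do not mention performance at all. Working out how to measure performance and capture data should be your first priority.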
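As one possible starting point for capturing data, the sketch below appends the system load averages to a CSV file at a fixed interval, so there is something to look at after the next outage. It assumes a Unix-like host where Python's `os.getloadavg()` is available; the function names, the file path in the usage note, and the sampling interval are all hypothetical choices.

```python
import csv
import os
import time


def sample_load(writer):
    """Write one row: UNIX timestamp plus the 1-, 5- and 15-minute load averages."""
    load1, load5, load15 = os.getloadavg()
    writer.writerow([int(time.time()), load1, load5, load15])


def record(path, interval_s=10, samples=6):
    """Append `samples` rows to `path`, sleeping `interval_s` seconds after each."""
    with open(path, "a", newline="") as handle:
        writer = csv.writer(handle)
        for _ in range(samples):
            sample_load(writer)
            handle.flush()  # push each row out of Python's buffer right away
            time.sleep(interval_s)
```

For example, `record("/var/tmp/load.csv", interval_s=10, samples=360)` would cover about an hour. Load average is only one signal; disk latency and Apache's own status output would be natural additions.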

I'm not sure why the addresses keep moving (I assume these are local interfaces); this could be a result of load elsewhere.
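To get a clearer picture of the address moves, one option is to tally, for each IP address, how often it moved and which MAC addresses it was seen at. A minimal Python sketch follows, assuming the log line format shown in the question; the function name `summarize_arp_moves` is hypothetical.

```python
import re
from collections import defaultdict

# Matches kernel messages of the form quoted in the question:
#   arp: <ip> moved from <old mac> to <new mac> on <interface>
ARP_MOVE = re.compile(
    r"arp: (?P<ip>\d+\.\d+\.\d+\.\d+) moved from "
    r"(?P<old>[0-9a-fA-F:]{17}) to (?P<new>[0-9a-fA-F:]{17}) on (?P<iface>\S+)"
)


def summarize_arp_moves(lines):
    """Return {ip: {"moves": count, "macs": set of MAC addresses seen}}."""
    summary = defaultdict(lambda: {"moves": 0, "macs": set()})
    for line in lines:
        match = ARP_MOVE.search(line)
        if match is None:
            continue  # not an arp "moved" message
        entry = summary[match["ip"]]
        entry["moves"] += 1
        entry["macs"].update((match["old"], match["new"]))
    return dict(summary)


# Three lines copied from the log in the question.
sample = [
    "May 23 05:00:00 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:03 to 00:07:b4:00:00:01 on ix0",
    "May 23 05:00:02 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:01 to 00:25:90:02:08:fc on ix0",
    "May 23 05:32:44 ns228407 kernel: arp: 176.31.237.254 moved from 00:07:b4:00:00:03 to 00:07:b4:00:00:01 on ix0",
]
for ip, info in sorted(summarize_arp_moves(sample).items()):
    print(ip, info["moves"], sorted(info["macs"]))
```

Feeding the whole of /var/log/messages through the same function would show whether each IP alternates among a small, fixed set of MAC addresses, which is worth knowing before asking the network operator about it.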

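This is one sick puppy. You will have to start addressing one problem at a time until you have a clearer picture of what is going wrong.

Start by replacing the battery, tuning the Apache installation, and recording performance metrics.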
