Sporadic network timeouts detected by 1 or 2 Nagios instances make no sense

I'm the Linux and DB sysadmin at a small company. Since this is mostly commodity hardware, I figured I'd try Super User first. We occasionally run into strange network problems, and every time I try to pin down the cause, something contradicts my theory.

Most often we get timeouts in Nagios longer than 10 seconds that don't correlate with any real problem. They are intermittent and seem random, but they keep happening on certain machines/internet connections. I installed a second copy of Nagios on a second development server. I set up both machines myself and they see little use: CentOS minimal installs, with most packages added via yum as needed.

Sometimes both Nagios instances detect a problem, sometimes only the primary does. Some checks run every 5 minutes and some every minute, so I know there is overlap, but it still doesn't add up.

We have rebooted the machines and the network gear. We have 4 different internet connections; Nagios sits behind a business FIOS line. One problem is the connection between Nagios and machines in a rack at a real data center. Those machines report no network trouble at all during the timeouts; only our Nagios has trouble reaching them. Fortunately those machines can currently reach our other locations, but we need them reachable from the location where Nagios runs, for disaster recovery.

Another location has both FIOS and Comcast cable, and Nagios sometimes times out against both. Finally, it reaches a server at my home over consumer Comcast cable, which also times out occasionally, and there I can check the connection myself during an outage.

So, from the top: I run 2 Nagios instances at location A on FIOS, performing identical checks against servers at the other 3 locations. Sometimes both instances time out, so those look like "real" problems; other times only one starts failing, which makes no sense because the two servers sit right next to each other both physically and on the network. I just rebooted the primary Nagios again.

I'll wait for the next incident and report back what both instances detect. What should I be looking for to troubleshoot this? I've gone through all the logs, enabled logging on our routers, and checked the network while a problem was in progress, but I can't find the cause.
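One way to catch the next incident in the act is a persistent probe that timestamps every failure and grabs a route snapshot the moment one happens. A minimal sketch, assuming a target name like `hostDBFios` from the setup above and that `tracepath` is installed (both the name and the log path are placeholders to adjust):

```shell
#!/bin/sh
# Continuous probe sketch: "hostDBFios" and the log path are placeholders.
TARGET="${TARGET:-hostDBFios}"
LOG="${LOG:-/tmp/probe-$TARGET.log}"

stamp() {   # prefix a message with a UTC timestamp
    printf '%s %s\n' "$(date -u +%FT%TZ)" "$1"
}

probe_once() {
    if ping -c 1 -W 2 "$TARGET" >/dev/null 2>&1; then
        stamp "OK   $TARGET"
    else
        stamp "FAIL $TARGET"               # record exactly when it failed...
        tracepath "$TARGET" 2>&1 || true   # ...and where the route died
    fi
}

# Loop only when explicitly requested, so the functions can be sourced safely.
if [ "${RUN_PROBE:-0}" = 1 ]; then
    while :; do probe_once >> "$LOG"; sleep 10; done
fi
```

Run with `RUN_PROBE=1` in a screen/tmux session on both Nagios boxes; correlating the FAIL timestamps between the two logs shows whether an outage was seen by one prober or both.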

Thanks for any help!


Latest update

I caught one of the problems, and it is unrelated to the ssh attack or any other server-side issue. We had a 5-minute outage during which I couldn't reach certain locations from the Nagios site. Pings stopped, yet I could ping those locations/servers from other sites. I ran traces during and after the outage, which kept recurring intermittently over those 5 minutes, and found that the route stopped at

   G0-3-4-4.PHLAPA-LCR-22.verizon-gni.net (130.81.180.248)

Any ideas or help? Here is a tracepath taken while the problem was happening, with the names on our end changed:

    [root@nagiosServer ~]# tracepath datacenterServer
     1:  ourdomain.com (192.168.1.55)                               0.076ms pmtu 1500
     1:  ourRouter (192.168.1.1)                                  0.297ms
     1:  ourRouter (192.168.1.1)                                  0.258ms
     2:  L300.PHLAPA-VFTTP-164.verizon-gni.net (72.94.203.1)    4.817ms asymm  3
     3:  G0-3-4-4.PHLAPA-LCR-22.verizon-gni.net (130.81.180.248)   5.696ms
     4:  no reply
     5:  no reply
     6:  no reply
     7:  no reply
     8:  no reply
     9:  no reply
    10:  no reply
    11:  no reply
    12:  no reply
    13:  no reply
    14:  no reply
    15:  no reply
    16:  no reply
    17:  no reply
    18:  no reply
    19:  no reply
    20:  no reply
    21:  no reply
    22:  no reply
    23:  no reply
    24:  no reply
    25:  no reply
    26:  no reply
    27:  no reply
    28:  no reply
    29:  no reply
    30:  no reply
    31:  no reply
         Too many hops: pmtu 1500
         Resume: pmtu 1500
    [root@nagiosServer ~]# date
    Fri Apr  4 12:04:30 EDT 2014
    [root@nagiosServer ~]#

After the problem cleared:

    [root@nagiosServer ~]# date
    Fri Apr  4 12:04:51 EDT 2014
    [root@nagiosServer ~]# tracepath datacenterServer
     1:  ourdomain.com (192.168.1.55)                               0.081ms pmtu 1500
     1:  ourRouter (192.168.1.1)                                  0.253ms
     1:  ourRouter (192.168.1.1)                                  0.295ms
     2:  L300.PHLAPA-VFTTP-164.verizon-gni.net (72.94.203.1)    2.631ms asymm  3
     3:  G0-3-4-4.PHLAPA-LCR-22.verizon-gni.net (130.81.180.248)   6.390ms
     4:  so-3-1-0-0.PHIL-BB-RTR2.verizon-gni.net (130.81.22.60)  20.953ms asymm  5
     5:  0.xe-2-1-0.BR2.IAD8.ALTER.NET (152.63.5.245)          13.855ms asymm  7
     6:  ae-20.r04.asbnva02.us.bb.gin.ntt.net (129.250.8.33)   13.123ms asymm  5
     7:  ge-100-0-0-20.r04.asbnva02.us.ce.gin.ntt.net (168.143.97.190)  14.057ms
     8:  core1-ten-2-1.nwrk1.hostmysite.net (67.59.145.33)     12.873ms asymm 15
     9:  ae5-dist1.nwk01.hosting.com (67.59.145.89)            12.912ms asymm 15
    10:  no reply
    11:  no reply
    12:  no reply
    13:  no reply
    14:  no reply
    15:  no reply
    16:  no reply
    17:  no reply
    18:  no reply
    19:  no reply
    20:  no reply
    21:  no reply
    22:  no reply
    23:  no reply
    24:  no reply
    25:  no reply
    26:  no reply
    27:  no reply
    28:  no reply
    29:  no reply
    30:  no reply
    31:  no reply
         Too many hops: pmtu 1500
         Resume: pmtu 1500
    [root@nagiosServer ~]#

Update: the server hosting Nagios1 was flapping up and down (SSH check failing) for 3 hours, which struck me as very odd because Nagios was checking the very machine it runs on. I went through the logs and found that the secure log showed a brute-force attempt at exactly that time, from an IP in China. Obviously we need to lock down ssh and stop exposing root, but my guess is that if sshd allows at most 10 concurrent connections, the attack exceeded that and caused the occasional Nagios ssh check failures during that window. Hopefully locking down ssh (firewall and/or keys) will eliminate most of the timeouts.

    ...
    Host started flapping[03-22-2014 11:44:43] HOST FLAPPING ALERT: Nagios1;STARTED; Host appears to have started flapping (23.7% change > 20.0% threshold)
    Host Up[03-22-2014 11:44:43] HOST ALERT: Nagios1;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
    Host Down[03-22-2014 11:44:23] HOST ALERT: Nagios1;DOWN;HARD;2;Server answer:
    Host Down[03-22-2014 11:43:33] HOST ALERT: Nagios1;DOWN;SOFT;1;Server answer:
    Host Up[03-22-2014 11:43:23] HOST ALERT: Nagios1;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
    Host Down[03-22-2014 11:42:53] HOST ALERT: Nagios1;DOWN;HARD;2;Server answer:
    Host Down[03-22-2014 11:42:23] HOST ALERT: Nagios1;DOWN;SOFT;1;Server answer:

The secure log:

    ...
    Mar 22 11:43:02 Nagios1 sshd[12004]: Failed password for root from 59.63.167.224 port 60767 ssh2
    Mar 22 11:43:02 Nagios1 sshd[11941]: Failed password for root from 59.63.167.224 port 58905 ssh2
    Mar 22 11:43:02 Nagios1 sshd[11942]: Disconnecting: Too many authentication failures for root
    Mar 22 11:43:02 Nagios1 sshd[11941]: PAM 5 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=59.63.167.224  user=root
    Mar 22 11:43:02 Nagios1 sshd[11941]: PAM service(sshd) ignoring max retries; 6 > 3
    Mar 22 11:43:02 Nagios1 sshd[11995]: Failed password for root from 59.63.167.224 port 60545 ssh2
    Mar 22 11:43:02 Nagios1 sshd[12009]: Failed password for root from 59.63.167.224 port 60919 ssh2
    Mar 22 11:43:02 Nagios1 sshd[11997]: Failed password for root from 59.63.167.224 port 60632 ssh2
    Mar 22 11:43:02 Nagios1 sshd[11952]: Failed password for root from 59.63.167.224 port 59362 ssh2
    Mar 22 11:43:02 Nagios1 sshd[11960]: Failed password for root from 59.63.167.224 port 59716 ssh2
    Mar 22 11:43:03 Nagios1 sshd[11943]: Failed password for root from 59.63.167.224 port 59237 ssh2
    Mar 22 11:43:03 Nagios1 sshd[11988]: Failed password for root from 59.63.167.224 port 60277 ssh2
    Mar 22 11:43:04 Nagios1 sshd[12004]: Failed password for root from 59.63.167.224 port 60767 ssh2
    Mar 22 11:43:04 Nagios1 sshd[12001]: Failed password for root from 59.63.167.224 port 60672 ssh2
    Mar 22 11:43:04 Nagios1 sshd[11995]: Failed password for root from 59.63.167.224 port 60545 ssh2
    Mar 22 11:43:04 Nagios1 sshd[12009]: Failed password for root from 59.63.167.224 port 60919 ssh2
    Mar 22 11:43:04 Nagios1 sshd[11997]: Failed password for root from 59.63.167.224 port 60632 ssh2
    Mar 22 11:43:04 Nagios1 sshd[11952]: Failed password for root from 59.63.167.224 port 59362 ssh2

Update: only one new timeout, but this time it was against the DB host over the FIOS connection rather than cable. Nagios1 caught it and Nagios2 didn't; it was brief, so Nagios2 may simply have missed it.

Nagios1:

    Service Ok[03-20-2014 21:34:53] SERVICE ALERT: hostDBFios;PG BACKENDS;OK;SOFT;2;POSTGRES_BACKENDS OK: DB "postgres" 9 of 100 connections (9%)
    Service Critical[03-20-2014 21:34:03] SERVICE ALERT: hostDBFios;PG BACKENDS;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.

On the DB host, this appeared in /var/log/messages:

    Mar 20 21:34:01 hostdb nrpe[28248]: Could not read request from client, bailing out...
    Mar 20 21:34:01 hostdb nrpe[28248]: INFO: SSL Socket Shutdown.

I haven't figured out what this could be. I searched for the error, but most hits are about persistent rather than intermittent problems; maybe it's related to SSL/SSH?
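For what it's worth, the `Socket timeout after 10 seconds` text is check_nrpe's client-side default. While chasing the root cause, raising it can distinguish a brief stall from a hard outage. A sketch as a Nagios command definition (the command name is hypothetical, and the `$USER1$` plugin-path macro follows common Nagios layouts; adjust to yours):

```
# Hypothetical command: identical to a stock check_nrpe command except for
# the -t flag, which raises the client timeout from the 10 s default to 30 s.
define command {
    command_name    check_nrpe_long
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -t 30
}
```

If checks pass at 30 s that fail at 10 s, the problem is latency/loss rather than the NRPE daemon refusing connections.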


Update: new timeouts for problem 2 are below.

Problem 2

Before I posted problem 1 below, one of our internet connections did have an outage at [03-17-2014 19:37:50]. This is a Comcast business connection. It happens occasionally and predates my start over a year ago; it shows up as a DNS alert, which the owner noticed and which both Nagios1 and Nagios2 caught. We can probably set this outage aside (Comcast never responded about it), but at least it was pinned to a single connection and was a complete outage that recovered quickly. It would be nice to troubleshoot it too, though it is less important. Both Nagios instances reported the same thing for this outage. We have one server on this connection with many alerts, plus another server here called hostDBCable that has only a host alert, because it is dual-WAN with FIOS and I monitor the rest of its services over that connection:

Nagios 2:

    [03-17-2014 19:45:40] SERVICE ALERT: host81;DISK SPACE;OK;HARD;1;DISK OK - free space: / 410086 MB (94% inode=99%): /boot 81 MB (87% inode=99%): /dev/shm 3994 MB (100% inode=99%):
    Service Ok[03-17-2014 19:45:10] SERVICE ALERT: host81;PGB BACKENDS;OK;HARD;1;POSTGRES_PGBOUNCER_BACKENDS OK: DB "pgbouncer" 1 of 1000 connections (1%)
    Service Ok[03-17-2014 19:45:10] SERVICE ALERT: host81;PGB MAXWAIT;OK;HARD;1;POSTGRES_PGB_POOL_MAXWAIT OK: DB "pgbouncer" pgbouncer=0 * phoneworks=0 * phoneworksNew=0
    Service Ok[03-17-2014 19:44:40] SERVICE ALERT: host81;MEMORY;OK;HARD;1;OK - 89.0% (7284288 kB) free.
    Service Ok[03-17-2014 19:44:30] SERVICE ALERT: host81;MEMORY SWAP;OK;HARD;1;SWAP OK - 100% free (1983 MB out of 1983 MB)
    Service Ok[03-17-2014 19:43:30] SERVICE ALERT: host81;CPU LOAD;OK;HARD;1;OK - load average: 0.14, 0.19, 0.18
    Host Up[03-17-2014 19:41:50] HOST ALERT: host81;UP;HARD;1;SSH OK - OpenSSH_4.3 (protocol 2.0)
    Service Ok[03-17-2014 19:41:40] SERVICE ALERT: host81;PGB MAX BACKEND;OK;HARD;2;OK - 2 connections
    Service Ok[03-17-2014 19:41:40] SERVICE ALERT: host81;PGB MAXMAXWAIT;OK;HARD;2;OK - queries waiting 0.00 seconds
    Service Ok[03-17-2014 19:41:40] SERVICE ALERT: host81;IVR LONG QUERY;OK;HARD;2;OK - No flag file
    Service Ok[03-17-2014 19:41:40] SERVICE ALERT: host81;ERROR ASTERISK REBOOT;OK;HARD;2;OK - No ERROR found
    Host Up[03-17-2014 19:41:10] HOST ALERT: hostDBCable;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
    Service Critical[03-17-2014 19:40:40] SERVICE ALERT: host81;DISK SPACE;CRITICAL;HARD;1;Connection refused or timed out
    Service Critical[03-17-2014 19:40:20] SERVICE ALERT: host81;PGB MAXWAIT;CRITICAL;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
    Service Critical[03-17-2014 19:40:10] SERVICE ALERT: host81;PGB BACKENDS;CRITICAL;HARD;1;Connection refused or timed out
    Host Down[03-17-2014 19:40:10] HOST ALERT: hostDBCable;DOWN;HARD;2;CRITICAL - Socket timeout after 10 seconds
    Service Critical[03-17-2014 19:39:50] SERVICE ALERT: host81;MEMORY;CRITICAL;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
    Service Critical[03-17-2014 19:39:30] SERVICE ALERT: host81;MEMORY SWAP;CRITICAL;HARD;1;Connection refused or timed out
    Host Down[03-17-2014 19:39:00] HOST ALERT: host81;DOWN;HARD;2;CRITICAL - Socket timeout after 10 seconds
    Host Down[03-17-2014 19:38:50] HOST ALERT: hostDBCable;DOWN;SOFT;1;No route to host
    Service Critical[03-17-2014 19:38:50] SERVICE ALERT: host81;PGB MAXMAXWAIT;CRITICAL;HARD;2;CHECK_NRPE: Socket timeout after 10 seconds.
    Service Critical[03-17-2014 19:38:50] SERVICE ALERT: host81;ERROR ASTERISK REBOOT;CRITICAL;HARD;2;CHECK_NRPE: Socket timeout after 10 seconds.
    Service Critical[03-17-2014 19:38:40] SERVICE ALERT: host81;PGB MAX BACKEND;CRITICAL;HARD;2;Connection refused or timed out
    Service Critical[03-17-2014 19:38:40] SERVICE ALERT: host81;IVR LONG QUERY;CRITICAL;HARD;2;Connection refused or timed out
    Service Critical[03-17-2014 19:38:30] SERVICE ALERT: host81;CPU LOAD;CRITICAL;HARD;1;Connection refused or timed out
    Host Down[03-17-2014 19:38:00] HOST ALERT: host81;DOWN;SOFT;1;CRITICAL - Socket timeout after 10 seconds
    Service Critical[03-17-2014 19:37:50] SERVICE ALERT: host81;PGB MAXMAXWAIT;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
    Service Critical[03-17-2014 19:37:50] SERVICE ALERT: host81;PGB MAX BACKEND;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
    Service Critical[03-17-2014 19:37:50] SERVICE ALERT: host81;IVR LONG QUERY;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
    Service Critical[03-17-2014 19:37:50] SERVICE ALERT: host81;ERROR ASTERISK REBOOT;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.

Nagios 2:

    Host Up[03-19-2014 11:18:10] HOST ALERT: hostDBCable;UP;SOFT;2;SSH OK - OpenSSH_5.3 (protocol 2.0)
    Host Down[03-19-2014 11:17:00] HOST ALERT: hostDBCable;DOWN;SOFT;1;CRITICAL - Socket timeout after 10 seconds

Problem 1

In reply to the comments:

Some of these problems are hard to catch, or I can't find the cause even when one happens while I'm actively troubleshooting. The most recent timeout was yesterday, and it was actually a host down rather than a service timeout like the earlier ones. The host check is check_ssh.

When the problem occurred I happened to be working at the location where hostbw lives. I had no internet trouble and could reach all the locations from hostbw. Nagios1 could not ssh to hostbw, but Nagios2 could. It looked like a DNS resolution issue: the Nagios1 server could find the correct IP with nslookup hostbw (it's a DynDNS name) but could not ssh, I think because it couldn't resolve the hostname, while Nagios2 could SSH fine. I compared both servers and they are configured identically. Both use the router for DNS and have nearly identical /etc/hosts and /etc/resolv.conf. Any ideas?
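One detail worth checking when `nslookup` succeeds but ssh fails: `nslookup` queries the DNS server directly, while ssh resolves through `getaddrinfo` (the nsswitch `hosts:` order, including `/etc/hosts` and any caching daemon), so the two can disagree. A sketch to run on both Nagios boxes during the next failure (`hostbw` is the DynDNS name from above; substitute your own):

```shell
#!/bin/sh
# Compare the resolver path ssh actually uses against a direct DNS query.
HOST="${1:-hostbw}"                 # placeholder name from the question

getent hosts "$HOST" || echo "getent: no answer (this is what ssh sees)"
command -v nslookup >/dev/null 2>&1 && nslookup "$HOST" || true
grep -i "$HOST" /etc/hosts || echo "no static /etc/hosts override"
```

If `getent` fails where `nslookup` succeeds on only one box, the difference is in that box's nsswitch/hosts/caching layer rather than in the DNS server itself.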

Nagios1:

    Host stopped flapping[03-17-2014 19:23:50] HOST FLAPPING ALERT: hostbw;STOPPED; Host appears to have stopped flapping (3.9% change < 5.0% threshold)
    Host Up[03-17-2014 15:39:00] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
    Host Down[03-17-2014 15:36:20] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
    Host Down[03-17-2014 15:35:10] HOST ALERT: hostbw;DOWN;SOFT;1;Usage:
    Host Up[03-17-2014 15:29:00] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
    Host Down[03-17-2014 15:19:00] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
    Host Down[03-17-2014 15:18:30] HOST ALERT: hostbw;DOWN;SOFT;1;Usage:
    Host Up[03-17-2014 14:59:00] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
    Host Down[03-17-2014 14:57:50] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
    Host Down[03-17-2014 14:56:40] HOST ALERT: hostbw;DOWN;SOFT;1;Usage:
    Host Up[03-17-2014 14:54:00] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
    Host Down[03-17-2014 14:51:30] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
    Host started flapping[03-17-2014 14:50:20] HOST FLAPPING ALERT: hostbw;STARTED; Host appears to have started flapping (23.9% change > 20.0% threshold)
    Host Down[03-17-2014 14:50:20] HOST ALERT: hostbw;DOWN;SOFT;1;Usage:
    Host Up[03-17-2014 14:47:00] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
    Host Down[03-17-2014 14:45:10] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
    Host Down[03-17-2014 14:44:00] HOST ALERT: hostbw;DOWN;SOFT;1;Usage:
    Host Up[03-17-2014 14:40:20] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
    Host Down[03-17-2014 11:03:30] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
    Host Down[03-17-2014 11:02:20] HOST ALERT: hostbw;DOWN;SOFT;1;Usage:

Nagios2:

    Host stopped flapping[03-17-2014 16:36:40] HOST FLAPPING ALERT: hostbw;STOPPED; Host appears to have stopped flapping (4.7% change < 5.0% threshold)
    Host started flapping[03-17-2014 15:39:00] HOST FLAPPING ALERT: hostbw;STARTED; Host appears to have started flapping (20.1% change > 20.0% threshold)
    Host Up[03-17-2014 15:39:00] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
    Host Down[03-17-2014 15:34:00] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
    Host Down[03-17-2014 15:33:30] HOST ALERT: hostbw;DOWN;SOFT;1;Usage:
    Host Up[03-17-2014 15:27:40] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
    Host Down[03-17-2014 15:22:40] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
    Host Down[03-17-2014 15:22:00] HOST ALERT: hostbw;DOWN;SOFT;1;Usage:
    Host Up[03-17-2014 12:46:00] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
    Host Down[03-17-2014 12:16:00] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
    Host Down[03-17-2014 12:15:40] HOST ALERT: hostbw;DOWN;SOFT;1;Usage:
    Host Up[03-17-2014 11:22:00] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
    Host Down[03-17-2014 11:02:40] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
    Host Down[03-17-2014 11:02:10] HOST ALERT: hostbw;DOWN;SOFT;1;Usage:

Answer 1

I opened a ticket with Verizon support, and after a few tries someone kept the ticket open and worked on understanding the problem.

I set up traceroutes (Linux tracepath) running continuously from the Nagios server to the data center we were timing out against.

After we sent in one traceroute showing the route stopping at a Verizon router at the same time we experienced another outage (detected in Nagios), the problem did not happen again.

I kept sending timeout reports to the Verizon tech for over a month, but they were rare, none were as bad as what we had been seeing, and none appeared related to the Verizon network.

He said "it looks like the problem has been resolved." Since we had been having 5-10 minute outages for over a year and they have never recurred, I believe something was indeed fixed.
