Troubleshooting Nagios alerts, a.k.a. why aren't my alerts firing?

2024-5-28

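I'm trying to add email alerts to an existing Nagios installation. For several months I've been using the web interface to monitor some non-critical systems, and it has worked well; warning and critical problems are detected without any trouble.

My next step is to enable alerting, but despite hours of fiddling I can't get even the simplest alert to fire. I have no idea what could be wrong. It's almost certainly something simple that I've failed to notice, so hopefully one of you can spot it easily.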

The command I'm testing with is very simple. Initially I'm just trying to write to a file:

define command{
        command_name    alerter
        command_line    echo "Alerter command fired by Nagios" >> /usr/local/nagios/var/alerter.log
}

I've tested that the nagios user can execute this command using sudo. Everything seems fine.

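The hosts and services all reference the "admins" contact group. These are the templates they use, and none of the hosts or services override any of these settings.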

define host{
        name                            generic-host
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        failure_prediction_enabled      1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        check_period                    24x7
        check_interval                  1
        retry_interval                  1
        max_check_attempts              10
        check_command                   check-host-alive
        notification_period             24x7
        notification_interval           120
        notification_options            d,u,r,s,f
        contact_groups                  admins
        register                        0
}
define service{
        name                            generic-service
        active_checks_enabled           1
        passive_checks_enabled          1
        parallelize_check               1
        obsess_over_service             1
        check_freshness                 0
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        failure_prediction_enabled      1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           1
        retry_check_interval            1
        contact_groups                  admins
        notification_options            w,u,c,r
        notification_interval           120
        notification_period             24x7
        register                        0
}

The contact and contact group are configured as follows:

define contact{
        name                            generic-contact
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,u,c,r,f,s
        host_notification_options       d,u,r,f,s
        service_notification_commands   alerter
        host_notification_commands      alerter
        register                        0
}
define contact{
        contact_name            nagiosadmin
        use                     generic-contact
        alias                   Nagios Admin
        email                   [email protected]
}
define contactgroup{
        contactgroup_name       admins
        alias                   Nagios Administrators
        members                 nagiosadmin
}

When I cause an outage, Nagios notices it and logs it like this...

[1315210448] SERVICE ALERT: ifs.aleph;Test service;CRITICAL;HARD;3;HTTP CRITICAL: HTTP/1.1 400 Bad Request - string 'Blah blah' not found on 'http://aleph.tekretic.com.au:80/' - 168 bytes in 0.369 second response time
[1315210653] SERVICE ALERT: ifs.aleph;Test service;OK;HARD;3;HTTP OK: HTTP/1.1 200 OK - 416 bytes in 0.364 second response time

...but nothing is ever written to my "alerter.log" file. It's as if the alerter command is never fired.

What am I missing?

Answer 1

Make sure you have the following in nagios.cfg:

log_notifications=1
enable_notifications=1
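
Notifications can also be switched off per contact, independently of nagios.cfg. As a sketch, assuming Nagios 3 or later, the contact template from the question could set the per-contact switches explicitly. They are normally enabled by default, so this mostly serves to rule that setting out:

```
define contact{
        name                            generic-contact
        host_notifications_enabled      1       ; per-contact switch for host notifications
        service_notifications_enabled   1       ; per-contact switch for service notifications
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,u,c,r,f,s
        host_notification_options       d,u,r,f,s
        service_notification_commands   alerter
        host_notification_commands      alerter
        register                        0
}
```

Everything except the two `*_notifications_enabled` lines is copied unchanged from the question.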

You can also try setting debug_level to 32 (notifications) to see what is going on:

debug_level=32
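
For reference, a slightly fuller debug setup might look like the sketch below. The debug file path and the size limit are assumptions, chosen to match the /usr/local/nagios/var directory used in the question; adjust them to your installation.

```
# nagios.cfg: log notification logic in detail
# 32 = notification information
debug_level=32
# 0 = brief, 1 = more detail, 2 = very detailed
debug_verbosity=2
# assumed path, next to alerter.log from the question
debug_file=/usr/local/nagios/var/nagios.debug
# size limit in bytes (assumed value)
max_debug_file_size=1000000
```

After restarting Nagios and causing another outage, the debug file should show the notification checks for that service, including which filter, if any, blocked the notification. Set debug_level back to 0 afterwards, since the file grows quickly.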
