Nagios behaves strangely when sending notification emails

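I'm having a hard time getting nagios3 to do what I want. There are too many configuration files, and I can't tell where the problem is because everything looks correct.

At first it sent notifications for down hosts and critical services. Then I tried to configure it to send notifications on recovery as well, and now it sends only the recovery notifications instead of all of them.

I want to use generic-service as a template and then configure the remaining details per service as needed, but it isn't working. Here are my configuration files; let me know if you spot any mistakes.

What I want is simple: send an email when a host goes down, when a service goes critical, and when either recovers. That's all!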

---- File contacts.cfg ----

define contact{
        contact_name                    admin
        alias                           administrator
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    u,c,r
        host_notification_options       d,u,r
        service_notification_commands   notify-service-by-email
        host_notification_commands      notify-host-by-email
        email                           [email protected]
        }


define contactgroup{
        contactgroup_name       admins
        alias                   Nagios Administrators
        members                 admin
        }

---------------------EOF-----------

------ File generic-service.cfg ------

define service{
        name                            generic-service ; The 'name' of this service template
        active_checks_enabled           1       ; Active service checks are enabled
        passive_checks_enabled          1       ; Passive service checks are enabled/accepted
        parallelize_check               1       ; Active service checks should be parallelized (disabling this can lead to major performance problems)
        obsess_over_service             1       ; We should obsess over this service (if necessary)
        check_freshness                 0       ; Default is to NOT check service 'freshness'
        notifications_enabled           1       ; Service notifications are enabled
        event_handler_enabled           1       ; Service event handler is enabled
        flap_detection_enabled          1       ; Flap detection is enabled
        failure_prediction_enabled      1       ; Failure prediction is enabled
        process_perf_data               1       ; Process performance data
        retain_status_information       1       ; Retain status information across program restarts
        retain_nonstatus_information    1       ; Retain non-status information across program restarts
        notification_interval           0       ; Only send notifications on status change by default.
        is_volatile                     0
        check_period                    24x7
        normal_check_interval           1
        retry_check_interval            1
        max_check_attempts              4
        notification_period             24x7
        notification_options            w,u,c,r
        contact_groups                  admins
        register                        0       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        }

---------------EOF--------

---- File generic-host.cfg ----

define host{
        name                            generic-host    ; The name of this host template
        notifications_enabled           1       ; Host notifications are enabled
        event_handler_enabled           1       ; Host event handler is enabled
        flap_detection_enabled          1       ; Flap detection is enabled
        failure_prediction_enabled      1       ; Failure prediction is enabled
        process_perf_data               1       ; Process performance data
        retain_status_information       1       ; Retain status information across program restarts
        retain_nonstatus_information    1       ; Retain non-status information across program restarts
#       check_command                   check-host-alive
        check_command                   check_tcp_alive
        max_check_attempts              10
        notification_interval           0
        notification_period             24x7
        notification_options            d,u,r
        contact_groups                  admins
        register                        0       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
        }

---- Excerpt from servicegroups.cfg ----

define service {
        hostgroup_name                  Live, inhouse
        service_description             USERS
        check_command                   check_nrpe_1arg!check_users
        use                             generic-service
        normal_check_interval           10
        retry_check_interval            10
        contact_groups                  admins
        notification_interval           0 ; set > 0 if you want to be renotified
        }

# check the LOAD
define service {
        hostgroup_name                  Live, inhouse
        service_description             LOAD
        check_command                   check_nrpe_1arg!check_load
        use                             generic-service
        normal_check_interval           5
        retry_check_interval            1
        notification_interval           0 ; set > 0 if you want to be renotified
        }

# check the HDD
define service {
        hostgroup_name                  Live, inhouse
        service_description             HDD
        check_command                   check_nrpe_1arg!check_all_disks
        use                             generic-service
        normal_check_interval           600
        retry_check_interval            30
        notification_interval           0 ; set > 0 if you want to be renotified
        }

-----EOF-----

---- Excerpt from Hostgroups.cfg ----

define hostgroup {
        hostgroup_name  http-servers
        alias           HTTP servers
        members         *
        }

-----EOF-----

Answer 1

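Your configuration looks a bit off to me. When a check returns a non-OK result, Nagios rechecks every 'retry_check_interval' (the time between rechecks) until 'max_check_attempts' (the number of consecutive failures) is reached, and only then sends an alert saying something is broken. For the "HDD" check, this means the disk has to stay in a non-OK state for roughly 2 hours before you are notified. If the check returns to OK while those retries are still in progress, no failure notification is sent. You will, however, get a recovery notification. This is quite likely to happen with the "LOAD" check, even though its retry_check_interval is much smaller, because system load is usually very dynamic.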
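To make the timing above concrete, here is a rough sketch of the HDD service with a shorter retry interval. The value 5 is an assumption chosen for illustration, not a recommendation from the original post, and the delay figures follow the estimate used above (retry interval times max attempts). Because the first failed check already counts as attempt 1, the actual wait may be closer to three retry intervals, so treat the numbers as approximate.

```cfg
# Sketch only: intervals are in minutes, assuming the default interval_length of 60.
# Estimated delay before a notification = retry_check_interval x max_check_attempts.
#   original:  30 x 4 = 120 minutes
#   below:      5 x 4 =  20 minutes
define service {
        use                             generic-service
        hostgroup_name                  Live, inhouse
        service_description             HDD
        check_command                   check_nrpe_1arg!check_all_disks
        normal_check_interval           600     ; unchanged: regular check while the state is OK
        retry_check_interval            5       ; assumption: shortened from 30 for illustration
        max_check_attempts              4       ; same value the template already provides
        }
```

A shorter retry interval narrows the window in which a problem can appear and clear again before any failure notification goes out.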

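Also, I would not set the notification interval to '0'. I consider it a very bad practice that leads to missed alerts, especially on the generic-* templates. I set the interval to '60' minutes in my templates, and use '240' minutes on the few checks I don't want to hear from as often.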
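As a minimal sketch of that approach, the template carries the 60-minute default and an individual service overrides it. Applying the 240-minute value to the LOAD check is an assumption made for illustration; the answer does not say which checks deserve the longer interval.

```cfg
# In the generic-service template, the existing notification_interval line would become:
#         notification_interval           60      ; re-notify hourly while the problem persists

# A check that should nag less often overrides the inherited value:
define service {
        use                             generic-service
        hostgroup_name                  Live, inhouse
        service_description             LOAD
        check_command                   check_nrpe_1arg!check_load
        notification_interval           240     ; re-notify every 4 hours
        }
```

Directives set on the service take precedence over those inherited through 'use', so only the checks that need a different interval have to mention it.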

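You should also double-check your 'hostgroup.cfg' file. The host groups referenced in your checks ('Live' and 'inhouse') are not listed in the host group configuration file in your example.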
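If those groups really are missing, definitions along these lines would be needed. This is only a sketch: the aliases and the member host names (web01, web02, office01) are hypothetical placeholders, and each member must match a host defined elsewhere in the configuration.

```cfg
define hostgroup {
        hostgroup_name  Live
        alias           Live servers            ; alias text is an assumption
        members         web01, web02            ; hypothetical host names
        }

define hostgroup {
        hostgroup_name  inhouse
        alias           In-house servers        ; alias text is an assumption
        members         office01                ; hypothetical host name
        }
```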

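In Nagios 3 and later:

'retry_check_interval' was renamed to 'retry_interval'

'normal_check_interval' was renamed to 'check_interval'

That said, all four names are still supported for backward compatibility with older configuration files, even in Nagios version 4.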
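For illustration, the USERS service from the question might look like this with the newer directive names. The values are copied from the question; only the names change.

```cfg
define service {
        use                             generic-service
        hostgroup_name                  Live, inhouse
        service_description             USERS
        check_command                   check_nrpe_1arg!check_users
        check_interval                  10      ; formerly normal_check_interval
        retry_interval                  10      ; formerly retry_check_interval
        }
```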
