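I'm having a hard time getting nagios3 to do what I want. There are too many config files, and I can't tell where the problem is because everything looks correct.
At first it sent notifications for down hosts and critical services. Then I tried to configure it to also send notifications on recovery, and now it sends only the recovery notifications instead of all of them.
I want to use generic-service as a template and then configure the other details as needed, but it isn't working. Here are my config files; see if you can spot any errors:
What I want is simple: send an email when a host goes down, when a service goes critical, and when either recovers. That's it!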
----file contacts.cfg ---
define contact{
contact_name admin
alias administrator
service_notification_period 24x7
host_notification_period 24x7
service_notification_options u,c,r
host_notification_options d,u,r
service_notification_commands notify-service-by-email
host_notification_commands notify-host-by-email
email [email protected]
}
define contactgroup{
contactgroup_name admins
alias Nagios Administrators
members admin
}
---------------------EOF-----------
------file generic-service.cfg ---------
define service{
name generic-service ; The 'name' of this service template
active_checks_enabled 1 ; Active service checks are enabled
passive_checks_enabled 1 ; Passive service checks are enabled/accepted
parallelize_check 1 ; Active service checks should be parallelized (disabling this can lead to major performance problems)
obsess_over_service 1 ; We should obsess over this service (if necessary)
check_freshness 0 ; Default is to NOT check service 'freshness'
notifications_enabled 1 ; Service notifications are enabled
event_handler_enabled 1 ; Service event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
failure_prediction_enabled 1 ; Failure prediction is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
notification_interval 0 ; Only send notifications on status change by default.
is_volatile 0
check_period 24x7
normal_check_interval 1
retry_check_interval 1
max_check_attempts 4
notification_period 24x7
notification_options w,u,c,r
contact_groups admins
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}
---------------EOF--------
----file generic-host.cfg----
define host{
name generic-host ; The name of this host template
notifications_enabled 1 ; Host notifications are enabled
event_handler_enabled 1 ; Host event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
failure_prediction_enabled 1 ; Failure prediction is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
# check_command check-host-alive
check_command check_tcp_alive
max_check_attempts 10
notification_interval 0
notification_period 24x7
notification_options d,u,r
contact_groups admins
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
}
----excerpt from servicegroups.cfg-----
define service {
hostgroup_name Live, inhouse
service_description USERS
check_command check_nrpe_1arg!check_users
use generic-service
normal_check_interval 10
retry_check_interval 10
contact_groups admins
notification_interval 0 ; set > 0 if you want to be renotified
}
# check the LOAD
define service {
hostgroup_name Live, inhouse
service_description LOAD
check_command check_nrpe_1arg!check_load
use generic-service
normal_check_interval 5
retry_check_interval 1
notification_interval 0 ; set > 0 if you want to be renotified
}
# check the HDD
define service {
hostgroup_name Live, inhouse
service_description HDD
check_command check_nrpe_1arg!check_all_disks
use generic-service
normal_check_interval 600
retry_check_interval 30
notification_interval 0 ; set > 0 if you want to be renotified
}
-----EOF-----
--- excerpt from Hostgroups.cfg----
define hostgroup {
hostgroup_name http-servers
alias HTTP servers
members *
}
-----EOF-----
Answer 1
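Your configuration looks a bit off to me. When a check fails, Nagios rechecks every 'retry_check_interval' (the time between retries) for 'max_check_attempts' (the number of consecutive failures) before it alerts that something is broken. For the 'HDD' check, this means the disk has to stay in a non-OK state for 2 hours before you are notified. If the check returns to an OK state before those conditions are met, no failure notification is sent. You will, however, receive a recovery notification. This is quite likely to happen with the 'LOAD' check as well, even though its retry_check_interval is much smaller, because system load is usually very dynamic.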
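As a rough illustration of that timing, here is a sketch of the HDD service with a shorter retry interval. The values 60 and 5 are assumptions chosen for the example, not settings from the original post. The sketch also assumes the default interval_length of 60 seconds, so the values are minutes, and max_check_attempts 4 inherited from generic-service.

```
# Sketch only: the interval values below are illustrative assumptions.
# By the rule of thumb retry_check_interval x max_check_attempts:
#   original settings: 30 x 4 = 120 minutes non-OK before a notification
#   this sketch:        5 x 4 =  20 minutes non-OK before a notification
define service {
        hostgroup_name          Live, inhouse
        service_description     HDD
        check_command           check_nrpe_1arg!check_all_disks
        use                     generic-service
        normal_check_interval   60      ; assumed value
        retry_check_interval    5       ; assumed value
}
```

The shorter the retry interval, the less likely a brief problem is to clear before the notification threshold is reached.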
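Also, I'm not a fan of setting the notification interval to '0'. I consider it very bad practice that leads to missed alerts, especially on the generic-* templates. I set the interval to '60' minutes in my templates and use '240' minutes on the few checks I don't want to hear about as often.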
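A minimal sketch of that arrangement, assuming the default interval_length of 60 seconds so that the values are minutes. Only the changed directive is shown in the template, and using LOAD as the less frequent check is an assumption made for illustration.

```
# Template: re-notify every 60 minutes while a problem persists.
define service {
        name                    generic-service
        # ... other template directives unchanged ...
        notification_interval   60
        register                0
}

# A check that should re-notify less often overrides the template value.
define service {
        hostgroup_name          Live, inhouse
        service_description     LOAD
        check_command           check_nrpe_1arg!check_load
        use                     generic-service
        notification_interval   240
}
```

A directive set in the service definition takes precedence over the one inherited through `use`, so the explicit `notification_interval 0` lines in the original services would need to be removed or changed for the template value to apply.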
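You should also double-check your 'hostgroup.cfg' file. The hostgroups referenced in your checks (Live, inhouse) are not listed in the hostgroup configuration file in your example.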
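For the service definitions to resolve, hostgroups with those names would need to exist. A sketch of what that might look like follows; the aliases and member host names are hypothetical placeholders, not taken from the original post.

```
define hostgroup {
        hostgroup_name  Live
        alias           Live servers            ; hypothetical alias
        members         web01, web02            ; hypothetical host names
}

define hostgroup {
        hostgroup_name  inhouse
        alias           In-house servers        ; hypothetical alias
        members         office01                ; hypothetical host name
}
```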
In Nagios 3 and later:
'retry_check_interval' was renamed to 'retry_interval'
'normal_check_interval' was renamed to 'check_interval'
That said, all four names are still supported for backward compatibility with older config files, even in Nagios version 4.
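For example, the LOAD service from the question could be written with the newer directive names as sketched below. The values are copied from the original definition; only the directive names differ.

```
define service {
        hostgroup_name          Live, inhouse
        service_description     LOAD
        check_command           check_nrpe_1arg!check_load
        use                     generic-service
        check_interval          5       ; formerly normal_check_interval
        retry_interval          1       ; formerly retry_check_interval
        notification_interval   0       ; set > 0 if you want to be renotified
}
```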