How do I run a script when a kernel error (oops) occurs?

Question

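My answer is that you can install rsyslog and run the script through its Shell Execute action (^program-to-execute;template). However, this may not work: after a kernel oops the system will almost certainly be unresponsive and will not run the custom script.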

I would therefore suggest running the script on another server when the kernel oops occurs. For example:

On the server that may eventually produce the kernel oops, use the netconsole module.

# /etc/modprobe.d/netconsole.conf
# This example assumes 10.0.0.1 as the "bad" server and 10.0.0.2 as the "monitor" server
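# Format: netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]; an empty src-port uses the kernel default
options netconsole netconsole=@10.0.0.1/eth0,30514@10.0.0.2/01:23:45:67:89:AB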
options netconsole oops_only=1


# /etc/modules-load.d/netconsole.conf
# Tells 'systemd-modules-load' to load 'netconsole' automatically at boot
netconsole
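
The netconsole parameter packs six fields into one string, which is easy to get wrong. As a minimal sketch, the hypothetical helper below (`netconsole_param` is a name introduced here, not part of any tool) assembles the string from its parts, using the example addresses assumed above and the UDP port that the monitor server listens on. The source port is left empty so the kernel default applies.

```shell
#!/bin/bash
# Hypothetical helper: build the netconsole module parameter from its fields.
# Field order: src-port, src-ip, device, target-port, target-ip, target-mac
netconsole_param() {
    local src_port="$1" src_ip="$2" dev="$3"
    local tgt_port="$4" tgt_ip="$5" tgt_mac="$6"
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' \
        "${src_port}" "${src_ip}" "${dev}" \
        "${tgt_port}" "${tgt_ip}" "${tgt_mac}"
}

# Example values taken from the assumptions in this answer
netconsole_param "" 10.0.0.1 eth0 30514 10.0.0.2 01:23:45:67:89:AB
```

The resulting string could also be passed to modprobe directly to try the setup before making it persistent, for example: modprobe netconsole "$(netconsole_param "" 10.0.0.1 eth0 30514 10.0.0.2 01:23:45:67:89:AB)" oops_only=1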

On the monitor server (the one that receives the kernel oops), run the custom script via rsyslog.

# /etc/rsyslog.d/kernel-oops-handler.conf

module(load="imudp")

input(type="imudp" 
    port="30514"
    ruleset="KernelOopsRuleSet")

# This aims to supply the IP address of the "bad" server in command line
template(name="KernelOopsArgs"
    type="string"
    string="%fromhost-ip%")

ruleset(name="KernelOopsRuleSet") {
    # This assumes that the '--[ cut here ]--' string is evidence of a kernel oops
    if ($msg contains "------------[ cut here ]------------") then {
        kern.crit ^/path/to/custom/script.sh;KernelOopsArgs
    }
}
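
The rule can be exercised without crashing a machine. The following is a rough sketch, assuming the util-linux version of `logger` is installed on a test host; `send_test_oops` and `DRY_RUN` are hypothetical names introduced here. It sends the marker line over UDP with priority kern.crit to the monitor server, and with DRY_RUN=1 it only prints the command it would run.

```shell
#!/bin/bash
# Hypothetical test helper: send a fake "cut here" line to the monitor server.
# Usage: send_test_oops <monitor-host> [port]
send_test_oops() {
    local host="$1" port="${2:-30514}"
    local marker='------------[ cut here ]------------'
    # '--' stops option parsing, since the marker itself starts with dashes
    local cmd=(logger --udp --server "${host}" --port "${port}"
               --priority kern.crit -- "${marker}")
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of sending anything
        printf '%s\n' "${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}

# Show what would be sent to the example monitor server
DRY_RUN=1 send_test_oops 10.0.0.2
```

Keep in mind that a real send triggers the handler for the sending host's IP address, so it is safer to point the rule at a harmless script while testing.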

The custom script can reboot the machine through an out-of-band management interface (iDRAC on Dell servers):

#!/bin/bash
# /path/to/custom/script.sh
# A successful SSH to the host indicates the server is still responsive
sleep 3
server="${1}"
if ! ssh -n -o ConnectTimeout=10 -o ControlPath=none "${server}" true; then
    # Let me suppose 10.100.0.1 is the iDRAC IP address of a server whose IP is 10.0.0.1
    idrac="$(echo "${server}" | sed 's/^10\.0\./10.100./')"
    # Trigger a forced reboot using 'ipmitool'
    ipmitool -H "${idrac}" -U root -P root chassis power reset
    # Notify administrators
    mail -s "Server '${server}' was restarted!" [email protected] < /dev/null
fi
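
One caveat with the address mapping: if the sender's IP does not start with 10.0., sed leaves it unchanged and ipmitool would be pointed at the server address itself. As a possible refinement, still under the same assumed 10.0.x.y to 10.100.x.y convention, the hypothetical function `idrac_for` below performs the mapping and fails for addresses outside that range.

```shell
#!/bin/bash
# Hypothetical helper: map a server IP to its iDRAC IP (10.0.x.y -> 10.100.x.y).
# Returns non-zero, printing nothing, for addresses outside 10.0.0.0/16.
idrac_for() {
    local server="$1"
    case "${server}" in
        10.0.*.*) printf '10.100.%s\n' "${server#10.0.}" ;;
        *) return 1 ;;
    esac
}

# Example with the addresses assumed in this answer
idrac_for 10.0.0.1
```

In the script, this could be used as: idrac="$(idrac_for "${server}")" || exit 1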

Answer 1