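I recently bought a SuperMicro X10SLL-F motherboard, which has an integrated BMC (Aspeed AST2400 chip). I want to use the built-in watchdog controller while running Linux (Gentoo Hardened) on the server.
I enabled the watchdog in the BIOS and then switched the motherboard jumper from hard reset to NMI (the watchdog timeout action; this is for testing, to avoid reboots). On the software side, I installed the watchdog daemon (sys-apps/watchdog) and added it to the default runlevel. It is configured to ping the watchdog device (/dev/watchdog, which exists) every 10 seconds. The watchdog timeout is set to 250 seconds.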
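For reference, the keepalive that a watchdog daemon performs can be sketched roughly as follows. This is only an illustration of the Linux watchdog device convention (any write resets the countdown; writing the character V just before closing asks the driver to stop the timer, provided the driver supports this "magic close" and nowayout is not set). It is not the actual code of sys-apps/watchdog, and the function name pet_watchdog and its parameters are made up for this example.

```python
import time


def pet_watchdog(path="/dev/watchdog", interval=10.0, pings=3, magic_close=True):
    """Write keepalives to a watchdog device, then close it.

    Hypothetical helper for illustration; not taken from sys-apps/watchdog.
    Opening the real device starts the timer, so only point this at
    /dev/watchdog on a machine you are prepared to see reset.
    """
    sent = 0
    # buffering=0 so that each write reaches the driver immediately.
    with open(path, "wb", buffering=0) as dev:
        for i in range(pings):
            dev.write(b"\0")  # any byte counts as a keepalive
            sent += 1
            if i < pings - 1:
                time.sleep(interval)
        if magic_close:
            # "Magic close": asks the driver to disarm the timer on close,
            # if the driver supports it and nowayout is not set.
            dev.write(b"V")
    return sent
```

With interval=10 this mirrors the 10 second ping period described above; a real daemon would loop indefinitely rather than stop after a fixed number of pings.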
Programs evidently see the watchdog hardware. ipmitool (with openipmi enabled):
# ipmitool mc watchdog get
Watchdog Timer Use: SMS/OS (0x44)
Watchdog Timer Is: Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x10
Initial Countdown: 254 sec
Present Countdown: 253 sec
freeipmi:
# bmc-watchdog --get
Timer Use: SMS/OS
Timer: Running
Logging: Enabled
Timeout Action: Hard Reset
Pre-Timeout Interrupt: None
Pre-Timeout Interval: 0 seconds
Timer Use BIOS FRB2 Flag: Clear
Timer Use BIOS POST Flag: Clear
Timer Use BIOS OS Load Flag: Clear
Timer Use BIOS SMS/OS Flag: Set
Timer Use BIOS OEM Flag: Clear
Initial Countdown: 254 seconds
Current Countdown: 253 seconds
However, after some time I get the following (the programs above were reporting sane "Current Countdown" values):
[ 294.107534] Uhhuh. NMI received for unknown reason 21 on CPU 0.
[ 294.107998] Do you have a strange power saving mode enabled?
[ 294.108437] Dazed and confused, but trying to continue
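This is the NMI, evidently caused by a watchdog timeout. Less than a minute later the machine is hard reset.
What is the problem? Where should I dig?
EDIT: Kernel messages related to IPMI: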
[ 0.353090] ipmi message handler version 39.2
[ 0.353353] ipmi device interface
[ 0.353623] IPMI System Interface driver.
[ 0.353898] ipmi_si: probing via ACPI
[ 0.354172] ipmi_si 00:08: [io 0x0ca2] regsize 1 spacing 1 irq 0
[ 0.354444] ipmi_si: Adding ACPI-specified kcs state machine
[ 0.354790] ipmi_si: probing via SMBIOS
[ 0.355051] ipmi_si: SMBIOS: io 0xca2 regsize 1 spacing 1 irq 0
[ 0.355317] ipmi_si: Adding SMBIOS-specified kcs state machine duplicate interface
[ 0.355836] ipmi_si: probing via SPMI
[ 0.356095] ipmi_si: SPMI: io 0xca2 regsize 1 spacing 1 irq 0
[ 0.356362] ipmi_si: Adding SPMI-specified kcs state machine duplicate interface
[ 0.356906] ipmi_si: Trying ACPI-specified kcs state machine at i/o address 0xca2, slave address 0x0, irq 0
[ 0.390536] ipmi_si: The BMC does not support clearing the recv irq bit, compensating, but the BMC needs to be fixed.
[ 0.418476] ipmi_si 00:08: Found new BMC (man_id: 0x002a7c, prod_id: 0x0801, dev_id: 0x20)
[ 0.419004] ipmi_si 00:08: IPMI kcs interface initialized
[ 0.419272] IPMI SSIF Interface driver
[ 0.420350] IPMI Watchdog: driver initialized
[ 0.420635] Copyright (C) 2004 MontaVista Software - IPMI Powerdown via sys_reboot.
[ 0.421444] IPMI poweroff: ATCA Detect mfg 0x2A7C prod 0x801
[ 0.421710] IPMI poweroff: Found a chassis style poweroff function
EDIT: I tried bmc-watchdog configured with "-u 4 -p 2 -a 0 -F -P -L -O -i 300 -e 10". So the timer use is SMS/OS only, the pre-timeout interrupt is set to NMI, and the timeout action is set to none:
# bmc-watchdog --get
Timer Use: SMS/OS
Timer: Running
Logging: Enabled
Timeout Action: None
Pre-Timeout Interrupt: NMI / Diagnostic Interrupt
Pre-Timeout Interval: 0 seconds
Timer Use BIOS FRB2 Flag: Clear
Timer Use BIOS POST Flag: Clear
Timer Use BIOS OS Load Flag: Clear
Timer Use BIOS SMS/OS Flag: Set
Timer Use BIOS OEM Flag: Clear
Initial Countdown: 300 seconds
Current Countdown: 290 seconds
But that did not change anything.
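To compare watchdog settings between runs without reading them by eye, the "Key: Value" output can be parsed into a dictionary. The sketch below assumes the output layout shown above; the helper names parse_watchdog_status and seconds are invented for this example, and SAMPLE is an excerpt of the output pasted above.

```python
SAMPLE = """\
# bmc-watchdog --get
Timer Use: SMS/OS
Timer: Running
Timeout Action: None
Pre-Timeout Interrupt: NMI / Diagnostic Interrupt
Pre-Timeout Interval: 0 seconds
Initial Countdown: 300 seconds
Current Countdown: 290 seconds
"""


def parse_watchdog_status(text):
    """Turn 'Key: Value' lines into a dict, skipping prompt and blank lines."""
    status = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so values may contain '/' or spaces.
        key, sep, value = line.partition(":")
        if sep:
            status[key.strip()] = value.strip()
    return status


def seconds(value):
    """Extract the leading integer from a value such as '300 seconds'."""
    return int(value.split()[0])


if __name__ == "__main__":
    status = parse_watchdog_status(SAMPLE)
    print(status["Timeout Action"], status["Pre-Timeout Interrupt"])
    print(seconds(status["Pre-Timeout Interval"]), seconds(status["Initial Countdown"]))
```

One thing this makes easy to spot is that the pre-timeout interval stays at 0 seconds even though an NMI pre-timeout interrupt is selected. Whether that plays a role here is not established by anything in this post.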
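EDIT: Also, when I trigger the watchdog timer by echoing \0x00 to /dev/watchdog and then leave it alone, the system reboots correctly after the default 10 second timeout. So the watchdog itself works, yet the system still reboots 350 seconds after boot.
EDIT: I checked the BMC system event log (SEL) and found the following after the reboot: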
Sensor #202 | Watchdog 2 | Assertion Event | Timer interrupt ; Timer use at expiration = SMS/OS ; Interrupt type = none
Sensor #202 | Watchdog 2 | Assertion Event | Timer expired, status only ; Timer use at expiration = SMS/OS ; Interrupt type = none
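Interestingly, the expiration event is marked "status only", yet the system reboots anyway. When I trigger a watchdog timeout deliberately, the log is different: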
Sensor #202 | Watchdog 2 | Assertion Event | Timer interrupt ; Timer use at expiration = SMS/OS ; Interrupt type = none
Sensor #202 | Watchdog 2 | Assertion Event | Hard Reset ; Timer use at expiration = SMS/OS ; Interrupt type = none
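The difference between the two logs can be extracted mechanically. The following is a small sketch that assumes the SEL line layout shown above (fields separated by "|", details separated by ";"); the function names parse_sel_line and expiry_events are hypothetical and belong to no IPMI tool.

```python
def parse_sel_line(line):
    """Split one SEL line into its sensor fields and event details."""
    fields = [f.strip() for f in line.split("|")]
    sensor, name, direction, detail = fields[:4]
    parts = [p.strip() for p in detail.split(";")]
    attrs = {}
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        if sep:
            attrs[key.strip()] = value.strip()
    return {
        "sensor": sensor,
        "name": name,
        "direction": direction,
        "event": parts[0],  # e.g. "Hard Reset"
        "attrs": attrs,     # e.g. {"Interrupt type": "none"}
    }


def expiry_events(lines):
    """Return the event text of every entry other than 'Timer interrupt'."""
    events = [parse_sel_line(line)["event"] for line in lines]
    return [e for e in events if e != "Timer interrupt"]


PREFIX = "Sensor #202 | Watchdog 2 | Assertion Event | "
SUFFIX = " ; Timer use at expiration = SMS/OS ; Interrupt type = none"
UNEXPECTED = [PREFIX + "Timer interrupt" + SUFFIX,
              PREFIX + "Timer expired, status only" + SUFFIX]
DELIBERATE = [PREFIX + "Timer interrupt" + SUFFIX,
              PREFIX + "Hard Reset" + SUFFIX]

if __name__ == "__main__":
    print(expiry_events(UNEXPECTED))
    print(expiry_events(DELIBERATE))
```

Both cases log a "Timer interrupt" entry first; only the second entry differs, which is the field worth watching when testing other jumper or timeout-action combinations.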
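Answer 1
In the end I found a slightly odd solution: simply leave the watchdog jumper (JWD1) open (selecting neither NMI nor hard reset), and enable the watchdog in the BIOS setup.
With this setup the watchdog works as expected: the system stays up for 25 minutes while bmc-watchdog is running, and reboots after the watchdog program is killed.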