Watchdog daemon is unable to reset the hardware watchdog timer on a Supermicro X9DR3-F motherboard


I have a Supermicro X9DR3-F motherboard where pins 1 and 2 of the JWD jumper are shorted and the watchdog function is enabled in UEFI.

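This means that the system is reset after roughly 5 minutes unless the hardware watchdog timer is reset in the meantime. I installed the watchdog daemon and configured it to use the iTCO_wdt driver: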

$ cat /etc/default/watchdog 
# Start watchdog at boot time? 0 or 1
run_watchdog=1
# Start wd_keepalive after stopping watchdog? 0 or 1
run_wd_keepalive=1
# Load module before starting watchdog
watchdog_module="iTCO_wdt"
# Specify additional watchdog options here (see manpage).
$ 

When the watchdog daemon starts, the driver loads without any problems:

$ sudo dmesg | grep iTCO_wdt
[   17.435620] iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11
[   17.435667] iTCO_wdt: Found a Patsburg TCO device (Version=2, TCOBASE=0x0460)
[   17.435761] iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
$ 

In addition, the /dev/watchdog file is present:

$ ls -l /dev/watchdog
crw------- 1 root root 10, 130 Dec  8 22:36 /dev/watchdog
$ 

The watchdog-device option in the watchdog daemon configuration points to this file:

$ grep -v ^# /etc/watchdog.conf 



watchdog-device    = /dev/watchdog
watchdog-timeout   = 60


interval           = 5
log-dir            = /var/log/watchdog
verbose            = yes
realtime           = yes
priority           = 1

heartbeat-file     = /var/log/watchdog/heartbeat
heartbeat-stamps   = 1000
$ 
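
The configuration above sends a keepalive every 5 seconds against a 60-second timeout, so the daemon settings themselves look sane. A minimal sketch of that sanity check follows. The helper names conf_value and check_interval are made up for illustration and are not part of the watchdog package; the sketch assumes the simple "key = value" layout shown above.

```shell
#!/bin/sh
# Sketch: read one "key = value" setting from a watchdog.conf-style file.
# conf_value and check_interval are hypothetical helpers for illustration.
conf_value() {
    # $1 = key, $2 = config file
    awk -F= -v key="$1" '
        /^[ \t]*#/ { next }                  # skip comment lines
        {
            k = $1; v = $2
            gsub(/[ \t]/, "", k)             # strip whitespace from the key
            gsub(/^[ \t]+|[ \t]+$/, "", v)   # trim whitespace around the value
            if (k == key) { print v; exit }
        }' "$2"
}

# Succeeds only when both settings exist and interval < watchdog-timeout.
check_interval() {
    # $1 = config file
    interval=$(conf_value interval "$1")
    timeout=$(conf_value watchdog-timeout "$1")
    [ -n "$interval" ] && [ -n "$timeout" ] && [ "$interval" -lt "$timeout" ]
}
```

Calling check_interval /etc/watchdog.conf returns success for the values shown above (5 and 60), which suggests the resets are not caused by a too-long keepalive interval.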

To debug the writes to the watchdog device, I enabled the heartbeat-file option. As seen below, keepalive messages are sent to /dev/watchdog:

$ tail /var/log/watchdog/heartbeat
 1575830728
 1575830728
 1575830728
 1575830733
 1575830733
 1575830733
 1575830733
 1575830733
 1575830733
 1575830733
$ 
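
Rather than reading the timestamps by eye, the largest gap between consecutive heartbeats can be computed. This is only a sketch: max_heartbeat_gap is a hypothetical helper name, and it assumes the file holds one Unix timestamp per line, as in the output above.

```shell
#!/bin/sh
# Sketch: print the largest gap, in seconds, between consecutive timestamps.
# max_heartbeat_gap is a hypothetical helper, not part of the watchdog package.
max_heartbeat_gap() {
    # $1 = heartbeat file, one Unix timestamp per line
    awk 'NF {                                # ignore blank lines
             t = $1 + 0
             if (seen && t - prev > max) max = t - prev
             prev = t; seen = 1
         }
         END { print max + 0 }' "$1"
}
```

Running max_heartbeat_gap /var/log/watchdog/heartbeat on the ten lines shown above prints 5, which matches the configured interval and is far below the roughly 300-second reset period.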

Despite this, the server still resets itself at intervals of roughly five minutes.

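My next thought was that perhaps the iTCO_wdt driver controls the watchdog in the C606 chipset, while the watchdog that resets the server is part of IPMI. So I made sure that the iTCO_wdt driver was not loaded during boot and rebooted the server. Sure enough, /dev/watchdog no longer existed. I then loaded the ipmi_watchdog module: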

$ ls -l /dev/watchdog
ls: cannot access '/dev/watchdog': No such file or directory
$ sudo modprobe ipmi_watchdog
$ sudo dmesg -T | tail -1
[Tue Dec 10 21:12:48 2019] IPMI Watchdog: driver initialized
$ ls -l /dev/watchdog
crw------- 1 root root 10, 130 Dec 10 21:12 /dev/watchdog
$ 

..and finally started the watchdog daemon, which, judging by the /var/log/watchdog/heartbeat file, writes to /dev/watchdog at 5-second intervals. This can also be confirmed with strace:

$ ps -p 2296 -f
UID        PID  PPID  C STIME TTY          TIME CMD
root      2296     1  0 01:28 ?        00:00:00 /usr/sbin/watchdog
$ sudo strace -y -p 2296
strace: Process 2296 attached
restart_syscall(<... resuming interrupted nanosleep ...>) = 0
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
open("/proc/uptime", O_RDONLY)          = 2</proc/uptime>
close(2</proc/uptime>)                  = 0
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
nanosleep({5, 0}, NULL)                 = 0
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
open("/proc/uptime", O_RDONLY)          = 2</proc/uptime>
close(2</proc/uptime>)                  = 0
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
nanosleep({5, 0}, NULL)                 = 0
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
open("/proc/uptime", O_RDONLY)          = 2</proc/uptime>
close(2</proc/uptime>)                  = 0
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
nanosleep({5, 0}, ^Cstrace: Process 2296 detached
 <detached ...>
$

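The watchdog daemon with PID 2296 shown above was started with the heartbeat-file option in /etc/watchdog.conf commented out, in order to reduce the number of write system calls in the strace output.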

However, the server still reboots at intervals of roughly 300 seconds.

Why is the watchdog daemon unable to reset the hardware watchdog timer on the Supermicro X9DR3-F motherboard?

Answer 1

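The watchdog daemon could not reset the hardware watchdog timer on the Supermicro X9DR3-F motherboard because the watchdog function in UEFI controls a third watchdog, located in the Winbond Super I/O 83527 chip. In other words, neither iTCO_wdt nor ipmi_watchdog is the right driver for that watchdog chip.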
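
One way to see which driver is actually behind a watchdog device node is to list what the kernel's watchdog core has registered. The sketch below assumes a kernel that exposes /sys/class/watchdog with an identity attribute (this depends on the kernel configuration). A driver that does not register through the watchdog core may not appear there at all. list_watchdogs is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: list watchdog devices registered with the kernel watchdog core.
# list_watchdogs is a hypothetical helper; the sysfs layout is an assumption
# that depends on the running kernel's configuration.
list_watchdogs() {
    # $1 = sysfs directory (defaults to /sys/class/watchdog)
    dir=${1:-/sys/class/watchdog}
    for dev in "$dir"/watchdog*; do
        [ -d "$dev" ] || continue            # glob matched nothing
        identity=unknown
        if [ -r "$dev/identity" ]; then
            identity=$(cat "$dev/identity")
        fi
        printf '%s: %s\n' "${dev##*/}" "$identity"
    done
}
```

If list_watchdogs prints only a driver that belongs to a different chip, or prints nothing, the daemon is most likely feeding a timer other than the one armed by the firmware.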

Answer 2

On an A2SDi-4C-HLN4F, I had to use bmc_watchdog (from freeipmi) to get it working.
