Stop error "0x0000009E" when a cluster node crashes in Windows Server 2012 R2

Full memory dump: https://pastebin.com/spkLeVYL

The crash message is:

USER_MODE_HEALTH_MONITOR (9e)

One or more critical user mode components failed to satisfy a health check.
Hardware mechanisms such as watchdog timers can detect that basic kernel
services are not executing. However, resource starvation issues, including
memory leaks, lock contention, and scheduling priority misconfiguration,
may block critical user mode components without blocking DPCs or
draining the nonpaged pool.

Kernel components can extend watchdog timer functionality to user mode
by periodically monitoring critical applications. This bugcheck indicates
that a user mode health check failed in a manner such that graceful
shutdown is unlikely to succeed. It restores critical services by
rebooting and/or allowing application failover to other servers.

Arguments:

Arg1: ffffe00026e00780, Process that failed to satisfy a health check within the configured timeout

Arg2: 000000000000003c, Health monitoring timeout (seconds)

Arg3: 000000000000000a, WatchdogSourceClussvcIsAlive
    Cluster service sends heartbeat to netft every 500 milliseconds.
    By default netft expects at least 1 heartbeat per second.
    If this watchdog was triggered that means clussvc is not getting
    CPU to send heartbeats.
Arg4: 0000000000000000
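
To make the arguments easier to read, here is a minimal Python sketch that decodes the hex values shown above. The names `BUGCHECK_ARGS`, `WATCHDOG_SOURCES`, and `decode_9e` are made up for illustration, and the source table contains only the one value named in this dump. The heartbeat count simply assumes the 500 ms interval quoted in the dump text holds across the whole timeout window.

```python
# Raw bugcheck arguments copied from the dump above (hex strings).
BUGCHECK_ARGS = {
    "Arg1": "ffffe00026e00780",  # process that failed the health check
    "Arg2": "000000000000003c",  # health monitoring timeout, in seconds
    "Arg3": "000000000000000a",  # watchdog source
    "Arg4": "0000000000000000",
}

# Only the source value named in this dump; other values are not listed here.
WATCHDOG_SOURCES = {0x0A: "WatchdogSourceClussvcIsAlive"}

# Per the dump text: clussvc sends a heartbeat to netft every 500 ms.
HEARTBEAT_INTERVAL_MS = 500


def decode_9e(args):
    """Turn the hex argument strings of a 0x9E bugcheck into readable values."""
    timeout_s = int(args["Arg2"], 16)
    source = int(args["Arg3"], 16)
    return {
        "process_address": "0x" + args["Arg1"],
        "timeout_seconds": timeout_s,
        "watchdog_source": WATCHDOG_SOURCES.get(source, "unknown (0x%x)" % source),
        # Heartbeats that would have been due during the timeout window.
        "heartbeats_due_in_timeout": timeout_s * 1000 // HEARTBEAT_INTERVAL_MS,
    }


if __name__ == "__main__":
    for key, value in decode_9e(BUGCHECK_ARGS).items():
        print(f"{key}: {value}")
```

Read this way, Arg2 is a 60-second timeout, so under the stated assumption the cluster service went a full minute (about 120 heartbeat intervals) without getting through to netft before the node was bugchecked.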

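Something in user mode caused the Failover Cluster service to become unresponsive, so the problem lies with a user-mode process and calls for ordinary hang debugging. The cluster has a health check between the user-mode service and the kernel-mode NetFT driver. If user mode becomes unresponsive, the cluster bugchecks the node to force a failover. A STOP 0x9E is therefore expected cluster behavior: a STOP 0x9E in netft.sys is a deliberate bugcheck raised by the cluster service after a deadlock condition was recognized.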

I found this in an article, and I am wondering whether I should change the recovery action, HangRecoveryAction.

This property controls the action taken when the user-mode process stops responding. HangRecoveryAction actually has 4 different settings, with 3 being the default.

0 = Disables the heartbeat and monitoring mechanism.
1 = Logs an event in the system log of the Event Viewer.
2 = Terminates the Cluster Service.
3 = Causes a Stop error (Bugcheck) on the cluster node.  <<– default for 2008
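
For interpreting a value read back from a cluster, the list above can be written as a small lookup. This is only an illustrative sketch: `HANG_RECOVERY_ACTIONS` and `describe_hang_recovery_action` are hypothetical names, and the descriptions are copied from the list, which marks 3 as the default for 2008 only.

```python
# HangRecoveryAction values as described in the quoted article.
HANG_RECOVERY_ACTIONS = {
    0: "Disables the heartbeat and monitoring mechanism",
    1: "Logs an event in the system log of the Event Viewer",
    2: "Terminates the Cluster Service",
    3: "Causes a Stop error (bugcheck) on the cluster node",
}


def describe_hang_recovery_action(value):
    """Return the documented meaning of a HangRecoveryAction value."""
    try:
        return HANG_RECOVERY_ACTIONS[value]
    except KeyError:
        # Values outside 0-3 are not covered by the list above.
        raise ValueError(f"unsupported HangRecoveryAction value: {value!r}") from None
```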

The server is 2012 R2.