Pacemaker/Corosync cluster: monitor operation keeps failing (Wildfly)

2024-6-19

cluster

I have a 2-node cluster without STONITH, based on the following software: Ubuntu 18.04.1 LTS, Pacemaker 1.1.18, and Corosync Cluster Engine version '2.4.3'.

This is not the first cluster I have built, but it is the first one based on Ubuntu 18.04 (until now I have been using 16.04).

The following resources are configured: DRBD storage, a virtual IP, a database (postgres), Apache, and a Wildfly server.

Everything works as expected, except that the Wildfly service is restarted fairly often (once every 1 to 5 days), and I see this error message in crm_mon:

* Node test-node2:
res_wildfly: migration-threshold=1000000 fail-count=16 last-failure='Sat Jun 8 06:55:20 2019'
* Node test-node1:

Failed Actions:
* res_wildfly_monitor_30000 on test-node2 'unknown error' (1): call=306, status=complete, exitreason='',
last-rc-change='Sat Jun 8 06:55:20 2019', queued=0ms, exec=0ms

The corosync log doesn't reveal much more:
Jun 08 06:55:20 [882] test-node2 crmd: info: process_lrm_event: Result of monitor operation for res_wildfly on test-node2: 1 (unknown error) | call=306 key=res_wildfly_monitor_30000 confirmed=false cib-update=2815
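
To line these failures up with whatever else was happening on the node at the same moment, the failed operation results can be pulled out of the log with a short script. This is only a minimal sketch: the regular expression is an assumption derived from the single log line above, the names `failed_operations` and `SAMPLE` are made up for illustration, and treating every non-zero return code as a failure is a simplification.

```python
import re

# Matches crmd "process_lrm_event" result lines in the format shown above.
# Assumption: the pattern is inferred from one sample line only and may
# need adjusting for other Pacemaker versions or log formats.
LRM_EVENT = re.compile(
    r"^(?P<timestamp>\w{3} \d{2} \d{2}:\d{2}:\d{2}) .*?"
    r"process_lrm_event:\s+Result of (?P<op>\w+) operation for "
    r"(?P<resource>\S+) on (?P<node>\S+): (?P<rc>\d+) \((?P<reason>[^)]*)\)"
)

# The log line quoted above, used here as sample input.
SAMPLE = (
    "Jun 08 06:55:20 [882] test-node2 crmd: info: process_lrm_event: "
    "Result of monitor operation for res_wildfly on test-node2: "
    "1 (unknown error) | call=306 key=res_wildfly_monitor_30000 "
    "confirmed=false cib-update=2815"
)


def failed_operations(lines, resource=None):
    """Yield one dict per operation result whose return code is non-zero.

    Simplification: any non-zero code is reported, even where Pacemaker
    would consider it an expected result for that operation.
    """
    for line in lines:
        match = LRM_EVENT.search(line)
        if match is None:
            continue  # not an operation result line
        event = match.groupdict()
        event["rc"] = int(event["rc"])
        if event["rc"] == 0:
            continue  # operation succeeded
        if resource is not None and event["resource"] != resource:
            continue  # result belongs to a different resource
        yield event


if __name__ == "__main__":
    for event in failed_operations([SAMPLE], resource="res_wildfly"):
        print(event["timestamp"], event["node"], event["op"],
              event["rc"], event["reason"])
```

The resulting timestamps can then be compared against the systemd journal for the Wildfly unit around the same seconds.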

The resource is configured with class systemd.

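Has anyone run into this problem, or does anyone know what might cause it? Wildfly itself would run stably, but because some monitor operation apparently fails, it receives a restart signal from the OS (triggered by the cluster manager). Disabling monitoring is not an option, because then we would not notice if the Wildfly service became unavailable.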

If any additional logs or configuration details would help you understand my situation better, please let me know.

Thanks in advance.
