Corosync/Pacemaker/DRBD resilience tuning

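I have a DRBD cluster in which one node was shut down for a few days. The single remaining node ran fine, without any problems. When I powered the other node back on, I ended up in a situation where all resources were stopped, and one DRBD volume was Secondary while the others were Primary, because the cluster apparently tried to perform a role swap to the node that had just been powered on. (ha1 was the active node; to make sense of the logs, I powered on ha2 at 08:06.)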

My questions:

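  • Can anyone help me figure out what happened here? (If this question is too much work, I am willing to consider paid consulting to get the configuration right.)
  • As a side question, if the situation resolves itself, is there a way to have pcs clean up the resources on its own? With a Linux-HA cluster, no intervention was needed once the failure condition cleared after a failover, so either I have been spoiled or I don't know how to achieve this.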

Below is all the potentially useful information I can imagine someone asking for.

bash-5.1# cat /proc/drbd 
version: 8.4.11 (api:1/proto:86-101)
srcversion: 60F610B702CC05315B04B50 
 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:109798092 nr:90528 dw:373317496 dr:353811713 al:558387 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:415010252 nr:188601628 dw:1396698240 dr:1032339078 al:1387347 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:27957772 nr:21354732 dw:97210572 dr:100798651 al:5283 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

The cluster status ended up as

bash-5.1# pcs status
Cluster name: HA
Status of pacemakerd: 'Pacemaker is running' (last updated 2023-08-10 08:38:40Z)
Cluster Summary:
  * Stack: corosync
  * Current DC: ha2.local (version 2.1.4-5.el9_1.2-dc6eb4362e) - partition with quorum
  * Last updated: Thu Aug 10 08:38:40 2023
  * Last change:  Mon Jul 10 06:49:08 2023 by hacluster via crmd on ha1.local
  * 2 nodes configured
  * 14 resource instances configured

Node List:
  * Online: [ ha1.local ha2.local ]

Full List of Resources:
  * Clone Set: LV_BLOB-clone [LV_BLOB] (promotable):
    * Promoted: [ ha2.local ]
    * Unpromoted: [ ha1.local ]
  * Resource Group: nsdrbd:
    * LV_BLOBFS (ocf:heartbeat:Filesystem):  Started ha2.local
    * LV_POSTGRESFS (ocf:heartbeat:Filesystem):  Stopped
    * LV_HOMEFS (ocf:heartbeat:Filesystem):  Stopped
    * ClusterIP (ocf:heartbeat:IPaddr2):     Stopped
  * Clone Set: LV_POSTGRES-clone [LV_POSTGRES] (promotable):
    * Promoted: [ ha1.local ]
    * Unpromoted: [ ha2.local ]
  * postgresql  (systemd:postgresql):    Stopped
  * Clone Set: LV_HOME-clone [LV_HOME] (promotable):
    * Promoted: [ ha1.local ]
    * Unpromoted: [ ha2.local ]
  * ns_mhswdog  (lsb:mhswdog):   Stopped
  * Clone Set: pingd-clone [pingd]:
    * Started: [ ha1.local ha2.local ]

Failed Resource Actions:
  * LV_POSTGRES promote on ha2.local could not be executed (Timed Out: Resource agent did not complete within 1m30s) at Thu Aug 10 08:19:27 2023 after 1m30.003s
  * LV_BLOB promote on ha2.local could not be executed (Timed Out: Resource agent did not complete within 1m30s) at Thu Aug 10 08:15:38 2023 after 1m30.001s

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

I am attaching the logs from both nodes

Aug 10 08:07:00 [1032387] ha1.local corosync info    [KNET  ] rx: host: 2 link: 0 is up
Aug 10 08:07:00 [1032387] ha1.local corosync info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:07:00 [1032387] ha1.local corosync info    [KNET  ] pmtud: Global data MTU changed to: 1397
Aug 10 08:07:00 [1032387] ha1.local corosync notice  [QUORUM] Sync members[2]: 1 2
Aug 10 08:07:00 [1032387] ha1.local corosync notice  [QUORUM] Sync joined[1]: 2
Aug 10 08:07:00 [1032387] ha1.local corosync notice  [TOTEM ] A new membership (1.12d) was formed. Members joined: 2
Aug 10 08:07:00 [1032387] ha1.local corosync notice  [QUORUM] Members[2]: 1 2
Aug 10 08:07:00 [1032387] ha1.local corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
Aug 10 08:07:07 [1032387] ha1.local corosync info    [KNET  ] rx: host: 2 link: 1 is up
Aug 10 08:07:07 [1032387] ha1.local corosync info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:11:48 [1032387] ha1.local corosync info    [KNET  ] link: host: 2 link: 1 is down
Aug 10 08:11:48 [1032387] ha1.local corosync info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:11:50 [1032387] ha1.local corosync info    [KNET  ] rx: host: 2 link: 1 is up
Aug 10 08:11:50 [1032387] ha1.local corosync info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:12:22 [1032387] ha1.local corosync info    [KNET  ] link: host: 2 link: 1 is down
Aug 10 08:12:22 [1032387] ha1.local corosync info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:12:23 [1032387] ha1.local corosync info    [KNET  ] rx: host: 2 link: 1 is up
Aug 10 08:12:23 [1032387] ha1.local corosync info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)

Aug 10 08:06:55 [1128] ha2.local corosync notice  [MAIN  ] Corosync Cluster Engine 3.1.5 starting up
Aug 10 08:06:55 [1128] ha2.local corosync info    [MAIN  ] Corosync built-in features: dbus systemd xmlconf vqsim nozzle snmp pie relro bindnow
Aug 10 08:06:56 [1128] ha2.local corosync notice  [TOTEM ] Initializing transport (Kronosnet).
Aug 10 08:06:57 [1128] ha2.local corosync info    [TOTEM ] totemknet initialized
Aug 10 08:06:57 [1128] ha2.local corosync info    [KNET  ] common: crypto_nss.so has been loaded from /usr/lib64/kronosnet/crypto_nss.so
Aug 10 08:06:57 [1128] ha2.local corosync notice  [SERV  ] Service engine loaded: corosync configuration map access [0]
Aug 10 08:06:57 [1128] ha2.local corosync info    [QB    ] server name: cmap
Aug 10 08:06:57 [1128] ha2.local corosync notice  [SERV  ] Service engine loaded: corosync configuration service [1]
Aug 10 08:06:57 [1128] ha2.local corosync info    [QB    ] server name: cfg
Aug 10 08:06:57 [1128] ha2.local corosync notice  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Aug 10 08:06:57 [1128] ha2.local corosync info    [QB    ] server name: cpg
Aug 10 08:06:57 [1128] ha2.local corosync notice  [SERV  ] Service engine loaded: corosync profile loading service [4]
Aug 10 08:06:57 [1128] ha2.local corosync notice  [QUORUM] Using quorum provider corosync_votequorum
Aug 10 08:06:57 [1128] ha2.local corosync notice  [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Aug 10 08:06:57 [1128] ha2.local corosync info    [QB    ] server name: votequorum
Aug 10 08:06:57 [1128] ha2.local corosync notice  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Aug 10 08:06:57 [1128] ha2.local corosync info    [QB    ] server name: quorum
Aug 10 08:06:57 [1128] ha2.local corosync info    [TOTEM ] Configuring link 0
Aug 10 08:06:57 [1128] ha2.local corosync info    [TOTEM ] Configured link number 0: local addr: 192.168.51.216, port=5405
Aug 10 08:06:57 [1128] ha2.local corosync info    [TOTEM ] Configuring link 1
Aug 10 08:06:57 [1128] ha2.local corosync info    [TOTEM ] Configured link number 1: local addr: 10.0.0.2, port=5406
Aug 10 08:06:57 [1128] ha2.local corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET  ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET  ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET  ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET  ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET  ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET  ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync notice  [QUORUM] Sync members[1]: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice  [QUORUM] Sync joined[1]: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice  [TOTEM ] A new membership (2.126) was formed. Members joined: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice  [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice  [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice  [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice  [QUORUM] Members[1]: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
Aug 10 08:07:00 [1128] ha2.local corosync info    [KNET  ] rx: host: 1 link: 0 is up
Aug 10 08:07:00 [1128] ha2.local corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:07:00 [1128] ha2.local corosync info    [KNET  ] pmtud: Global data MTU changed to: 469
Aug 10 08:07:00 [1128] ha2.local corosync notice  [QUORUM] Sync members[2]: 1 2
Aug 10 08:07:00 [1128] ha2.local corosync notice  [QUORUM] Sync joined[1]: 1
Aug 10 08:07:00 [1128] ha2.local corosync notice  [TOTEM ] A new membership (1.12d) was formed. Members joined: 1
Aug 10 08:07:00 [1128] ha2.local corosync notice  [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:07:00 [1128] ha2.local corosync notice  [QUORUM] This node is within the primary component and will provide service.
Aug 10 08:07:00 [1128] ha2.local corosync notice  [QUORUM] Members[2]: 1 2
Aug 10 08:07:00 [1128] ha2.local corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
Aug 10 08:07:05 [1128] ha2.local corosync info    [KNET  ] rx: host: 1 link: 1 is up
Aug 10 08:07:05 [1128] ha2.local corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:07:08 [1128] ha2.local corosync info    [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Aug 10 08:07:08 [1128] ha2.local corosync info    [KNET  ] pmtud: PMTUD link change for host: 1 link: 1 from 469 to 8885
Aug 10 08:07:08 [1128] ha2.local corosync info    [KNET  ] pmtud: Global data MTU changed to: 1397
Aug 10 08:14:13 [1128] ha2.local corosync info    [KNET  ] link: host: 1 link: 1 is down
Aug 10 08:14:13 [1128] ha2.local corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:14:15 [1128] ha2.local corosync info    [KNET  ] rx: host: 1 link: 1 is up
Aug 10 08:14:15 [1128] ha2.local corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:19:53 [1128] ha2.local corosync info    [KNET  ] link: host: 1 link: 1 is down
Aug 10 08:19:53 [1128] ha2.local corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:19:54 [1128] ha2.local corosync info    [KNET  ] rx: host: 1 link: 1 is up
Aug 10 08:19:54 [1128] ha2.local corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:23:18 [1128] ha2.local corosync info    [KNET  ] link: host: 1 link: 1 is down
Aug 10 08:23:18 [1128] ha2.local corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:23:19 [1128] ha2.local corosync info    [KNET  ] rx: host: 1 link: 1 is up
Aug 10 08:23:19 [1128] ha2.local corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)

I am also attaching the output of pcs config show, and I can provide pcs cluster cib on request.

Cluster Name: HA
Corosync Nodes:
 ha1.local ha2.local
Pacemaker Nodes:
 ha1.local ha2.local

Resources:
  Resource: postgresql (class=systemd type=postgresql)
    Operations:
      monitor: postgresql-monitor-interval-60s
        interval=60s
      start: postgresql-start-interval-0s
        interval=0s
        timeout=100
      stop: postgresql-stop-interval-0s
        interval=0s
        timeout=100
  Resource: ns_mhswdog (class=lsb type=mhswdog)
    Operations:
      force-reload: ns_mhswdog-force-reload-interval-0s
        interval=0s
        timeout=15
      monitor: ns_mhswdog-monitor-interval-60s
        interval=60s
        timeout=10s
        on-fail=standby
      restart: ns_mhswdog-restart-interval-0s
        interval=0s
        timeout=140s
      start: ns_mhswdog-start-interval-0s
        interval=0s
        timeout=80s
      stop: ns_mhswdog-stop-interval-0s
        interval=0s
        timeout=80s
  Group: nsdrbd
    Resource: LV_BLOBFS (class=ocf provider=heartbeat type=Filesystem)
      Attributes: LV_BLOBFS-instance_attributes
        device=/dev/drbd0
        directory=/data
        fstype=ext4
      Operations:
        monitor: LV_BLOBFS-monitor-interval-20s
          interval=20s
          timeout=40s
        start: LV_BLOBFS-start-interval-0s
          interval=0s
          timeout=60s
        stop: LV_BLOBFS-stop-interval-0s
          interval=0s
          timeout=60s
    Resource: LV_POSTGRESFS (class=ocf provider=heartbeat type=Filesystem)
      Attributes: LV_POSTGRESFS-instance_attributes
        device=/dev/drbd1
        directory=/var/lib/pgsql
        fstype=ext4
      Operations:
        monitor: LV_POSTGRESFS-monitor-interval-20s
          interval=20s
          timeout=40s
        start: LV_POSTGRESFS-start-interval-0s
          interval=0s
          timeout=60s
        stop: LV_POSTGRESFS-stop-interval-0s
          interval=0s
          timeout=60s
    Resource: LV_HOMEFS (class=ocf provider=heartbeat type=Filesystem)
      Attributes: LV_HOMEFS-instance_attributes
        device=/dev/drbd2
        directory=/home
        fstype=ext4
      Operations:
        monitor: LV_HOMEFS-monitor-interval-20s
          interval=20s
          timeout=40s
        start: LV_HOMEFS-start-interval-0s
          interval=0s
          timeout=60s
        stop: LV_HOMEFS-stop-interval-0s
          interval=0s
          timeout=60s
    Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
      Attributes: ClusterIP-instance_attributes
        cidr_netmask=32
        ip=192.168.51.75
      Operations:
        monitor: ClusterIP-monitor-interval-60s
          interval=60s
        start: ClusterIP-start-interval-0s
          interval=0s
          timeout=20s
        stop: ClusterIP-stop-interval-0s
          interval=0s
          timeout=20s
  Clone: LV_BLOB-clone
    Meta Attributes: LV_BLOB-clone-meta_attributes
      clone-max=2
      clone-node-max=1
      notify=true
      promotable=true
      promoted-max=1
      promoted-node-max=1
    Resource: LV_BLOB (class=ocf provider=linbit type=drbd)
      Attributes: LV_BLOB-instance_attributes
        drbd_resource=lv_blob
      Operations:
        demote: LV_BLOB-demote-interval-0s
          interval=0s
          timeout=90
        monitor: LV_BLOB-monitor-interval-60s
          interval=60s
          role=Promoted
        monitor: LV_BLOB-monitor-interval-63s
          interval=63s
          role=Unpromoted
        notify: LV_BLOB-notify-interval-0s
          interval=0s
          timeout=90
        promote: LV_BLOB-promote-interval-0s
          interval=0s
          timeout=90
        reload: LV_BLOB-reload-interval-0s
          interval=0s
          timeout=30
        start: LV_BLOB-start-interval-0s
          interval=0s
          timeout=240
        stop: LV_BLOB-stop-interval-0s
          interval=0s
          timeout=100
  Clone: LV_POSTGRES-clone
    Meta Attributes: LV_POSTGRES-clone-meta_attributes
      clone-max=2
      clone-node-max=1
      notify=true
      promotable=true
      promoted-max=1
      promoted-node-max=1
    Resource: LV_POSTGRES (class=ocf provider=linbit type=drbd)
      Attributes: LV_POSTGRES-instance_attributes
        drbd_resource=lv_postgres
      Operations:
        demote: LV_POSTGRES-demote-interval-0s
          interval=0s
          timeout=90
        monitor: LV_POSTGRES-monitor-interval-60s
          interval=60s
          role=Promoted
        monitor: LV_POSTGRES-monitor-interval-63s
          interval=63s
          role=Unpromoted
        notify: LV_POSTGRES-notify-interval-0s
          interval=0s
          timeout=90
        promote: LV_POSTGRES-promote-interval-0s
          interval=0s
          timeout=90
        reload: LV_POSTGRES-reload-interval-0s
          interval=0s
          timeout=30
        start: LV_POSTGRES-start-interval-0s
          interval=0s
          timeout=240
        stop: LV_POSTGRES-stop-interval-0s
          interval=0s
          timeout=100
  Clone: LV_HOME-clone
    Meta Attributes: LV_HOME-clone-meta_attributes
      clone-max=2
      clone-node-max=1
      notify=true
      promotable=true
      promoted-max=1
      promoted-node-max=1
    Resource: LV_HOME (class=ocf provider=linbit type=drbd)
      Attributes: LV_HOME-instance_attributes
        drbd_resource=lv_home
      Operations:
        demote: LV_HOME-demote-interval-0s
          interval=0s
          timeout=90
        monitor: LV_HOME-monitor-interval-60s
          interval=60s
          role=Promoted
        monitor: LV_HOME-monitor-interval-63s
          interval=63s
          role=Unpromoted
        notify: LV_HOME-notify-interval-0s
          interval=0s
          timeout=90
        promote: LV_HOME-promote-interval-0s
          interval=0s
          timeout=90
        reload: LV_HOME-reload-interval-0s
          interval=0s
          timeout=30
        start: LV_HOME-start-interval-0s
          interval=0s
          timeout=240
        stop: LV_HOME-stop-interval-0s
          interval=0s
          timeout=100
  Clone: pingd-clone
    Resource: pingd (class=ocf provider=pacemaker type=ping)
      Attributes: pingd-instance_attributes
        dampen=6s
        host_list=192.168.51.251
        multiplier=1000
      Operations:
        monitor: pingd-monitor-interval-10s
          interval=10s
          timeout=60s
        reload-agent: pingd-reload-agent-interval-0s
          interval=0s
          timeout=20s
        start: pingd-start-interval-0s
          interval=0s
          timeout=60s
        stop: pingd-stop-interval-0s
          interval=0s
          timeout=20s

Stonith Devices:
Fencing Levels:

Location Constraints:
  Resource: ClusterIP
    Constraint: location-ClusterIP
      Rule: boolean-op=or score=-INFINITY (id:location-ClusterIP-rule)
        Expression: pingd lt 1 (id:location-ClusterIP-rule-expr)
        Expression: not_defined pingd (id:location-ClusterIP-rule-expr-1)
Ordering Constraints:
  promote LV_BLOB-clone then start LV_BLOBFS (kind:Mandatory) (id:order-LV_BLOB-clone-LV_BLOBFS-mandatory)
  promote LV_POSTGRES-clone then start LV_POSTGRESFS (kind:Mandatory) (id:order-LV_POSTGRES-clone-LV_POSTGRESFS-mandatory)
  start LV_POSTGRESFS then start postgresql (kind:Mandatory) (id:order-LV_POSTGRESFS-postgresql-mandatory)
  promote LV_HOME-clone then start LV_HOMEFS (kind:Mandatory) (id:order-LV_HOME-clone-LV_HOMEFS-mandatory)
  start LV_HOMEFS then start ns_mhswdog (kind:Mandatory) (id:order-LV_HOMEFS-ns_mhswdog-mandatory)
  start LV_BLOBFS then start ns_mhswdog (kind:Mandatory) (id:order-LV_BLOBFS-ns_mhswdog-mandatory)
  start postgresql then start ns_mhswdog (kind:Mandatory) (id:order-postgresql-ns_mhswdog-mandatory)
  start ns_mhswdog then start ClusterIP (kind:Mandatory) (id:order-ns_mhswdog-ClusterIP-mandatory)
Colocation Constraints:
  LV_BLOBFS with LV_BLOB-clone (score:INFINITY) (with-rsc-role:Promoted) (id:colocation-LV_BLOBFS-LV_BLOB-clone-INFINITY)
  LV_POSTGRESFS with LV_POSTGRES-clone (score:INFINITY) (with-rsc-role:Promoted) (id:colocation-LV_POSTGRESFS-LV_POSTGRES-clone-INFINITY)
  postgresql with LV_POSTGRESFS (score:INFINITY) (id:colocation-postgresql-LV_POSTGRESFS-INFINITY)
  LV_HOMEFS with LV_HOME-clone (score:INFINITY) (with-rsc-role:Promoted) (id:colocation-LV_HOMEFS-LV_HOME-clone-INFINITY)
  ns_mhswdog with LV_HOMEFS (score:INFINITY) (id:colocation-ns_mhswdog-LV_HOMEFS-INFINITY)
  ns_mhswdog with LV_BLOBFS (score:INFINITY) (id:colocation-ns_mhswdog-LV_BLOBFS-INFINITY)
  ns_mhswdog with postgresql (score:INFINITY) (id:colocation-ns_mhswdog-postgresql-INFINITY)
  ClusterIP with ns_mhswdog (score:INFINITY) (id:colocation-ClusterIP-ns_mhswdog-INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
  Meta Attrs: build-resource-defaults
    resource-stickiness=INFINITY
Operations Defaults:
  Meta Attrs: op_defaults-meta_attributes
    timeout=240s

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: HA
 dc-version: 2.1.4-5.el9_1.2-dc6eb4362e
 have-watchdog: false
 last-lrm-refresh: 1688971748
 maintenance-mode: false
 no-quorum-policy: ignore
 stonith-enabled: false

Tags:
 No tags defined

Quorum:
  Options:

Answer 1

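Clusters like Pacemaker are highly configurable, and they call for care, because they may move block devices (your data) between nodes. Testing how the cluster behaves in a variety of situations is a must. Ideally, this results in a runbook of what to do in each situation.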

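Read the manuals. Downstream distributions that support Pacemaker, such as the RHEL HA guide, explain many of these scenarios and provide reference material.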

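You can configure resources to prefer the node they are currently running on. A non-zero cluster-wide value may be a good idea, for example: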

pcs resource defaults update resource-stickiness=1

Location constraints with a score higher than this value can still move resources.
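
Before changing the default, it may help to check what is already set. The following is a minimal sketch that pulls the resource-stickiness value out of saved pcs config output. The function name stickiness_from_config is hypothetical, and the parsing assumes the resource-stickiness=VALUE layout shown in the dump posted in the question.

```shell
#!/bin/sh
# Hypothetical helper: print every resource-stickiness value found in
# `pcs config` style text read from stdin. Assumes one
# "resource-stickiness=VALUE" entry per line, as in the question's dump.
stickiness_from_config() {
    sed -n 's/^[[:space:]]*resource-stickiness=\(.*\)$/\1/p'
}

# Usage on a cluster node (not executed here):
#   pcs config | stickiness_from_config
```

For the configuration posted in the question this prints INFINITY, so stickiness is already at its maximum there and lowering it would be a deliberate choice rather than a missing setting.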

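The "Performing cluster maintenance" topic has reference material on moving, stopping, and starting resources. Planned maintenance of a node should probably be wrapped in pcs node standby, then pcs node unstandby.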
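
The standby/unstandby bracket can be sketched as a small wrapper. This is only an illustration: the function name with_standby and the PCS override variable are hypothetical, and the sketch does not wait for resources to finish moving, so checking pcs status before starting real work is still advisable.

```shell
#!/bin/sh
# PCS can be overridden (for example PCS=echo) to preview the commands
# without touching the cluster. This variable is an assumption of the sketch.
PCS="${PCS:-pcs}"

# Hypothetical wrapper: put a node in standby, run a maintenance command,
# then take the node out of standby again.
#   $1        = node name
#   remaining = maintenance command and its arguments
with_standby() {
    node="$1"; shift
    $PCS node standby "$node" || return 1
    "$@"                      # the maintenance work itself
    rc=$?
    $PCS node unstandby "$node" || return 1
    return "$rc"              # report the maintenance command's result
}

# Usage (not executed here):
#   with_standby ha2.local your-maintenance-command
```

Setting PCS=echo prints the two pcs invocations instead of running them, which makes it possible to rehearse the sequence on a machine that is not part of the cluster.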

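Try pcs resource move to put the resource group on the same node. Watch what happens after the move constraint goes away. If the resources move back and you don't want that, troubleshoot stickiness, constraints, dependencies, and other rules.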
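
As a rough sketch of that experiment, the steps can be kept as three small functions and run one at a time, checking pcs status in between. The function names and the PCS override are hypothetical. Some pcs versions may remove the move constraint by themselves, in which case the second step should have nothing left to do. The third step uses pcs resource cleanup, which is my addition here for the failed promote actions listed in the question.

```shell
#!/bin/sh
# PCS can be overridden (for example PCS=echo) to preview the commands.
PCS="${PCS:-pcs}"

# Step 1: ask the cluster to move a resource or group to a given node.
move_to() { $PCS resource move "$1" "$2"; }

# Step 2: remove any constraint that the move left behind, so the cluster
# is free to place the resource again.
release() { $PCS resource clear "$1"; }

# Step 3: clear recorded failures for a resource so they no longer
# influence placement.
forget_failures() { $PCS resource cleanup "$1"; }

# Usage (not executed here; names taken from the question's config):
#   move_to nsdrbd ha1.local
#   pcs status              # wait until the group has settled
#   release nsdrbd
#   forget_failures LV_BLOB
```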

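A DRBD disk state of UpToDate indicates that your volumes are healthy. Use the cluster tools to move things around, and confirm that the volumes mount successfully.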
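
To read the roles and disk states at a glance, the /proc/drbd text can be condensed with awk. This is a minimal sketch that assumes the DRBD 8.4 line layout shown in the question (a device line with cs:, ro:, and ds: fields); the function name drbd_summary is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: summarize /proc/drbd style text from stdin as
# "minor local-role local-disk-state", one line per DRBD device.
drbd_summary() {
    awk '/^ *[0-9]+: cs:/ {
        minor = $1; sub(/:$/, "", minor)       # "0:" -> "0"
        role = ""; disk = ""
        for (i = 2; i <= NF; i++) {
            # ro:Local/Peer and ds:Local/Peer; keep the local half
            if ($i ~ /^ro:/) { split(substr($i, 4), r, "/"); role = r[1] }
            if ($i ~ /^ds:/) { split(substr($i, 4), d, "/"); disk = d[1] }
        }
        print minor, role, disk
    }'
}

# Usage on a node (not executed here):
#   drbd_summary < /proc/drbd
```

Fed the output pasted in the question, this reports device 0 as Secondary and devices 1 and 2 as Primary, all UpToDate, which matches the mixed roles described there.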

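Two-node clusters are actually hard to get right. When they partition, there is no good way to choose the primary node. Consider adding another node for quorum purposes, even if it has no disks and cannot host this resource group.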
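
The arithmetic behind this advice is short enough to sketch. The functions below assume one vote per node and a strict-majority rule; they do not model any special two-node handling a cluster stack may offer, and the function names are hypothetical.

```shell
#!/bin/sh
# Votes needed for a strict majority among n equal-weight nodes.
quorum_votes() { echo $(( $1 / 2 + 1 )); }

# Number of nodes that may be lost while the rest still hold a majority.
tolerated_failures() { echo $(( $1 - ($1 / 2 + 1) )); }

# Usage (not executed here):
#   quorum_votes 2; tolerated_failures 2
#   quorum_votes 3; tolerated_failures 3
```

With two nodes a majority needs both votes, so losing either node loses quorum. With three nodes a majority still needs only two votes, so one node can fail, which is why a diskless third member helps.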

Answer 2

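This is a misconfigured fencing or quorum pick-up issue. It is very typical DRBD behavior: it works while everything is healthy, but at some point it fails with a split brain or with a replicated volume stuck in the wrong state, for no real reason. I suggest not trying to fix DRBD, because it is unfixable by design, but doing your best to abandon it and use something more reliable off the shelf. I highly recommend Ceph, as it is stable, performs well, and has strong community support.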
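
Whether fencing and quorum are actually configured can be checked from the pcs config dump. The following is a minimal sketch that flags the two cluster properties visible in the question (stonith-enabled: false and no-quorum-policy: ignore). The function name check_fencing and the warning wording are hypothetical, and the sketch only pattern-matches text; it does not prove that a configured fence device works.

```shell
#!/bin/sh
# Hypothetical check: read `pcs config` style text on stdin and warn about
# property values that disable fencing or quorum enforcement.
# Exit status is 1 if anything was flagged, 0 otherwise.
check_fencing() {
    awk '
        /^[[:space:]]*stonith-enabled:[[:space:]]*false/ {
            print "WARN: fencing (STONITH) is disabled"; bad = 1
        }
        /^[[:space:]]*no-quorum-policy:[[:space:]]*ignore/ {
            print "WARN: loss of quorum is ignored"; bad = 1
        }
        END { exit bad + 0 }
    '
}

# Usage on a cluster node (not executed here):
#   pcs config | check_fencing
```

Against the configuration posted in the question both warnings fire, which is consistent with the fencing and quorum diagnosis given in this answer.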
