I built a two-node NFS cluster with Pacemaker, DRBD and corosync. Everything was working fine, but while I was testing different failover scenarios my cluster broke completely: I can no longer fail over to the master node, only the second node works, so when I stop the services on the secondary node my service goes down. I tried to re-sync the disks and to re-create the volume on the master, but Pacemaker stops my service group because it cannot mount the volume.
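For reference, this is roughly how the DRBD replication state can be checked on each node before attempting a resync (standard drbd-utils commands; with the 8.x kernel module /proc/drbd shows the same information):

# connection, disk and role state of the DRBD resource
drbdadm cstate res1
drbdadm dstate res1
drbdadm role res1
# or, with the 8.x kernel module, the full status:
cat /proc/drbd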
Here is my corosync configuration:
logging {
    debug: off
    to_syslog: yes
}

nodelist {
    node {
        name: nfs01-master
        nodeid: 1
        quorum_votes: 1
        ring0_addr: 10.x.x.150
    }
    node {
        name: nfs02-slave
        nodeid: 2
        quorum_votes: 1
        ring0_addr: 10.x.x.151
    }
}

quorum {
    provider: corosync_votequorum
}

totem {
    cluster_name: nfs-cluster-ha
    config_version: 3
    ip_version: ipv4
    secauth: on
    version: 2
    interface {
        bindnetaddr: 10.x.x.0
        ringnumber: 0
    }
}
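For reference, ring membership and quorum can be sanity-checked on each node with the standard corosync tools (not part of the configuration itself):

# ring status of the local node
corosync-cfgtool -s
# membership and vote information
corosync-quorumtool -s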
DRBD configuration on both nodes:
resource res1 {
    startup {
        wfc-timeout 30;
        degr-wfc-timeout 15;
    }
    disk {
        on-io-error detach;
        no-disk-flushes;
        no-disk-barrier;
        c-plan-ahead 0;
        c-fill-target 24M;
        c-min-rate 80M;
        c-max-rate 720M;
    }
    net {
        max-buffers 36k;
        sndbuf-size 1024k;
        rcvbuf-size 2048k;
    }
    syncer {
        rate 1000M;
    }
    on nfs01-master {
        device /dev/drbd0;
        disk /dev/nfs01-master-vg/data;
        address 10.x.x.150:7788;
        meta-disk internal;
    }
    on nfs02-slave {
        device /dev/drbd0;
        disk /dev/nfs02-slave-vg/data;
        address 10.x.x.151:7788;
        meta-disk internal;
    }
}
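If the failover tests left the two nodes in split brain (both sides StandAlone), the usual recovery is to discard the data on one side and reconnect; this is only a sketch, it throws away the changes on the discarded node, and the option placement differs slightly between drbd-utils versions:

# on the node whose changes will be discarded:
drbdadm secondary res1
drbdadm connect --discard-my-data res1
# on the surviving node:
drbdadm connect res1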
When a failover happens, Pacemaker cannot mount /dev/drbd0 on nfs01-master and the DRBD resource stays stuck in the Secondary role. But when I stop all the services except DRBD and promote it to Primary myself, I can mount the partition.
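The manual workaround mentioned above looks roughly like this (a sketch only; the device and mount point are the ones used by fs_res1 below):

# stop the service group so Pacemaker no longer tries to mount anything
crm resource stop services
# on nfs01-master, promote DRBD and mount by hand
# (drbdadm may refuse if the local disk is Outdated/Inconsistent)
drbdadm primary res1
mount /dev/drbd0 /data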
The Pacemaker configuration is as follows:
node 1: nfs01-master \
    attributes standby=off
node 2: nfs02-slave
primitive drbd_res1 ocf:linbit:drbd \
    params drbd_resource=res1 \
    op monitor interval=20s
primitive fs_res1 Filesystem \
    params device="/dev/drbd0" directory="/data" fstype=ext4
primitive nfs-common lsb:nfs-common
primitive nfs-kernel-server lsb:nfs-kernel-server
primitive virtual_ip_ens192 IPaddr2 \
    params ip=10.x.x.153 cidr_netmask=24 nic="ens192:1" \
    op start interval=0s timeout=60s \
    op monitor interval=5s timeout=20s \
    op stop interval=0s timeout=60s \
    meta failure-timeout=5s
group services fs_res1 virtual_ip_ens192 nfs-kernel-server nfs-common
ms ms_drbd_res1 drbd_res1 \
    meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
location drbd-fence-by-handler-res1-ms_drbd_res1 ms_drbd_res1 \
    rule $role=Master -inf: #uname ne nfs02-slave
location location_on_nfs01-master ms_drbd_res1 100: nfs01-master
order services_after_drbd inf: ms_drbd_res1:promote services:start
colocation services_on_drbd inf: services ms_drbd_res1:Master
property cib-bootstrap-options: \
    have-watchdog=false \
    dc-version=2.0.1-9e909a5bdd \
    cluster-infrastructure=corosync \
    cluster-name=debian \
    stonith-enabled=false \
    no-quorum-policy=ignore \
    last-lrm-refresh=1577978640 \
    stop-all-resources=false
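For reference, failed actions and fail counts can be inspected and cleared with the same crm shell that produced the configuration above (a sketch of the usual check/cleanup cycle):

# one-shot status including inactive resources and fail counts
crm_mon -1rf
# clear the failure history so the resources are retried
crm resource cleanup fs_res1
crm resource cleanup ms_drbd_res1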
And my nfs-common settings:
STATDOPTS="-n 10.x.x.153 --port 32765 --outgoing-port 32766"
NEED_IDMAPD=yes
NEED_GSSD=no
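The pinned statd port can be double-checked after restarting nfs-common (just a sanity check; the output format varies by distribution):

# rpc.statd should be registered on the pinned port 32765
rpcinfo -p | grep status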
As I said, the services work because the secondary node is active, but failover does not work properly: the master node should have priority for the services, and they should only switch to the secondary when the master goes down.
On my master node:
Stack: corosync
Current DC: nfs01-master (version 2.0.1-9e909a5bdd) - partition WITHOUT quorum
Last updated: Thu Jan 9 15:21:03 2020
Last change: Thu Jan 9 11:58:28 2020 by root via cibadmin on nfs02-slave
2 nodes configured
6 resources configured
Online: [ nfs01-master ]
OFFLINE: [ nfs02-slave ]
Full list of resources:
 Resource Group: services
     fs_res1            (ocf::heartbeat:Filesystem):    Stopped
     virtual_ip_ens192  (ocf::heartbeat:IPaddr2):       Stopped
     nfs-kernel-server  (lsb:nfs-kernel-server):        Stopped
     nfs-common         (lsb:nfs-common):               Stopped
 Clone Set: ms_drbd_res1 [drbd_res1] (promotable)
     Slaves: [ nfs01-master ]
     Stopped: [ nfs02-slave ]
I would be very grateful if anyone can help.
Thanks.
Answer 1
I was able to solve my problem: there was a fencing constraint in my configuration that blocked the cluster switchover when a node was lost.
I removed these lines:
location drbd-fence-by-handler-res1-ms_drbd_res1 ms_drbd_res1 \
    rule $role=Master -inf: #uname ne nfs02-slave
I still have to learn more about fencing, so that a node is not promoted to Primary while its disk is out of sync and the peer is still being written to.
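If it helps anyone else: as far as I understand, constraints named drbd-fence-by-handler-<resource> are created by DRBD's crm-fence-peer.sh handler (the fencing / fence-peer options in the DRBD configuration) and are supposed to be removed again by crm-unfence-peer.sh once the peer has resynced. Removing a stale constraint by hand with crmsh is roughly:

# list any fencing constraints left behind by the handler
crm configure show | grep drbd-fence
# drop the stale constraint by its id
crm configure delete drbd-fence-by-handler-res1-ms_drbd_res1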