I have a Proxmox cluster with two nodes (s1 and s2). On s2, listing a certain directory hangs forever (similar to this question):
$> strace -vf ls -l /etc/pve/nodes/s2
[...]
open("/etc/pve/nodes/s2", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
fstat(3, {st_dev=makedev(0, 48), st_ino=5, st_mode=S_IFDIR|0755, st_nlink=2, st_uid=0, st_gid=33, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2017-06-19T18:59:35+0300, st_mtime=2017-06-19T18:59:35+0300, st_ctime=2017-06-19T18:59:35+0300}) = 0
getdents(3,
find hangs as well:
$> cd /etc/pve/nodes/s2
$> strace -vf find .
[...]
openat(AT_FDCWD, ".", O_RDONLY|O_NOCTTY|O_NONBLOCK|O_DIRECTORY|O_NOFOLLOW) = 4
fcntl(4, F_GETFD) = 0
fcntl(4, F_SETFD, FD_CLOEXEC) = 0
fstat(4, {st_dev=makedev(0, 48), st_ino=5, st_mode=S_IFDIR|0755, st_nlink=2, st_uid=0, st_gid=33, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2017-06-19T18:59:35+0300, st_mtime=2017-06-19T18:59:35+0300, st_ctime=2017-06-19T18:59:35+0300}) = 0
fcntl(4, F_GETFL) = 0x38800 (flags O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_NOFOLLOW)
fcntl(4, F_SETFD, FD_CLOEXEC) = 0
newfstatat(AT_FDCWD, ".", {st_dev=makedev(0, 48), st_ino=5, st_mode=S_IFDIR|0755, st_nlink=2, st_uid=0, st_gid=33, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2017-06-19T18:59:35+0300, st_mtime=2017-06-19T18:59:35+0300, st_ctime=2017-06-19T18:59:35+0300}, AT_SYMLINK_NOFOLLOW) = 0
fcntl(4, F_DUPFD, 3) = 5
fcntl(5, F_GETFD) = 0
fcntl(5, F_SETFD, FD_CLOEXEC) = 0
getdents(4,
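To reproduce the hang without losing the terminal, the command can be wrapped in coreutils `timeout`. This is a minimal sketch: the function name `probe` and the time limits are my own choices, not something from the post. A process stuck in uninterruptible sleep may not react to the signals, so treat the result as a hint only.

```shell
# probe SECONDS COMMAND [ARGS...]
# Runs COMMAND under coreutils `timeout` and prints one word:
#   ok     - command finished successfully within the limit
#   hung   - command was still running when the limit expired
#   failed - command finished on its own with a non-zero status
probe() {
    limit=$1
    shift
    # -k 2: send SIGKILL if the command ignores SIGTERM for 2 more seconds
    timeout -k 2 "$limit" "$@" >/dev/null 2>&1
    status=$?
    case $status in
        0)       echo ok ;;
        124|137) echo hung ;;   # 124: timed out, 137: ended by SIGKILL
        *)       echo failed ;;
    esac
}

# Example with a hypothetical 5 second limit:
# probe 5 ls -l /etc/pve/nodes/s2
```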
The following part about LVM turned out to be irrelevant.
I have an LVM physical volume:
$> pvdisplay
--- Physical volume ---
PV Name /dev/sda3
VG Name pve
PV Size 1.82 TiB / not usable 3.07 MiB
Allocatable yes
PE Size 4.00 MiB
Total PE 476859
Free PE 4039
Allocated PE 472820
PV UUID fcuPa5-Wscw-wQI2-YXjI-SoMc-nQPe-1orltO
It is part of the pve volume group:
$> pvs
PV VG Fmt Attr PSize PFree
/dev/sda3 pve lvm2 a-- 1.82t 15.78g
which has some logical volumes:
$> lvscan
ACTIVE '/dev/pve/swap' [8.00 GiB] inherit
ACTIVE '/dev/pve/root' [96.00 GiB] inherit
ACTIVE '/dev/pve/data' [1.70 TiB] inherit
ACTIVE '/dev/pve/vm-401-disk-1' [4.00 GiB] inherit
[...]
(End of the irrelevant LVM part.)
mount reports that /dev/fuse is mounted on /etc/pve:
$> df /etc/pve/nodes/s2
/dev/fuse 30720 36 30684 1% /etc/pve
In dmesg I see some errors like this:
[ 483.990347] INFO: task lxc-pve-prestar:4588 blocked for more than 120 seconds.
[ 483.990554] Tainted: P IO 4.15.18-16-pve #1
[ 483.990721] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 483.990943] lxc-pve-prestar D 0 4588 4587 0x00000000
[ 483.990945] Call Trace:
[ 483.990947] __schedule+0x3e0/0x870
[ 483.990949] ? path_parentat+0x3e/0x80
[ 483.990951] schedule+0x36/0x80
[ 483.990953] rwsem_down_write_failed+0x208/0x390
[ 483.990955] call_rwsem_down_write_failed+0x17/0x30
[ 483.990957] ? call_rwsem_down_write_failed+0x17/0x30
[ 483.990959] down_write+0x2d/0x40
[ 483.990961] filename_create+0x7e/0x160
[ 483.990963] SyS_mkdir+0x51/0x100
[ 483.990965] do_syscall_64+0x73/0x130
[ 483.990967] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 483.990968] RIP: 0033:0x7ff84077a687
[ 483.990969] RSP: 002b:00007fff343b4a98 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[ 483.990971] RAX: ffffffffffffffda RBX: 000055ab07c8d010 RCX: 00007ff84077a687
[ 483.990972] RDX: 0000000000000014 RSI: 00000000000001ff RDI: 000055ab0b26de70
[ 483.990973] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
[ 483.990974] R10: 000055ab0b0e1f38 R11: 0000000000000246 R12: 000055ab084ced58
[ 483.990975] R13: 000055ab0b222fd0 R14: 000055ab0b26de70 R15: 00000000000001ff
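In the trace above the blocked task is inside mkdir (SyS_mkdir, filename_create, down_write), that is, it is waiting for a write lock while creating a directory. When dmesg contains many such reports, a small filter helps to see which tasks are affected. A sketch only; `hung_tasks` is a name I made up.

```shell
# hung_tasks: reads dmesg text on stdin and prints
# "<task> <pid> <seconds>" for every hung-task report.
hung_tasks() {
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than \([0-9][0-9]*\) seconds.*/\1 \2 \3/p'
}

# Example:
# dmesg | hung_tasks | sort | uniq -c
```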
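Apparently Proxmox uses the Proxmox Cluster File System (pmxcfs), which is mounted on /etc/pve, so this is most likely a network (cluster communication) problem. I can ping each node from the other.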
root@s1:~# pvecm status
Quorum information
------------------
Date: Sun Jun 23 07:11:24 2019
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 1/267728
Quorate: Yes
Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 4
Quorum: 3
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 4 10.0.0.5 (local)
root@s2:~# pvecm status
Quorum information
------------------
Date: Sun Jun 23 07:14:11 2019
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000002
Ring ID: 2/192400
Quorate: No
Votequorum information
----------------------
Expected votes: 1
Highest expected: 1
Total votes: 1
Quorum: 2 Activity blocked
Flags:
Membership information
----------------------
Nodeid Votes Name
0x00000002 1 10.0.0.6 (local)
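Both outputs above report Nodes: 1 and list only the local node under Membership information, so neither node sees the other, and s2 is not quorate. To compare the relevant fields on both nodes at a glance, the output can be condensed to one line. A sketch, assuming the field labels shown above; `quorum_summary` is a hypothetical helper name.

```shell
# quorum_summary: reads `pvecm status` output on stdin and prints
# "quorate=<Yes|No> nodes=<n> expected=<n> total=<n>".
quorum_summary() {
    awk '
        # val: text after the first colon, without surrounding blanks
        function val(s) {
            sub(/^[^:]*:[ \t]*/, "", s)
            sub(/[ \t]+$/, "", s)
            return s
        }
        /^Quorate:/        { q = val($0) }
        /^Nodes:/          { n = val($0) }
        /^Expected votes:/ { e = val($0) }
        /^Total votes:/    { t = val($0) }
        END { printf "quorate=%s nodes=%s expected=%s total=%s\n", q, n, e, t }
    '
}

# Example:
# pvecm status | quorum_summary
```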
root@s1:~# pveversion --verbose
proxmox-ve: 5.4-1 (running kernel: 4.15.18-16-pve)
pve-manager: 5.4-6 (running version: 5.4-6/aa7856c5)
pve-kernel-4.15: 5.4-4
pve-kernel-4.15.18-16-pve: 4.15.18-41
pve-kernel-4.10.15-1-pve: 4.10.15-15
pve-kernel-4.10.11-1-pve: 4.10.11-9
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-10
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-52
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-43
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-28
pve-cluster: 5.0-37
pve-container: 2.0-39
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-22
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-2
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-52
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2
I tested multicast connectivity between the two nodes with omping. The results are below, so I think we can conclude that multicast works.
root@s1:~# omping -m 239.192.109.7 -c 600 -i 1 -F -q s2 s1
s2 : waiting for response msg
s2 : waiting for response msg
s2 : joined (S,G) = (*, 239.192.109.7), pinging
s2 : given amount of query messages was sent
s2 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.185/0.265/0.387/0.018
s2 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.192/0.273/0.400/0.019
root@s2:~# omping -m 239.192.109.7 -c 600 -i 1 -F -q s2 s1
s1 : waiting for response msg
s1 : joined (S,G) = (*, 239.192.109.7), pinging
s1 : given amount of query messages was sent
s1 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.164/0.345/0.390/0.020
s1 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.183/0.369/0.410/0.020
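The figure that matters in the omping summary is the loss percentage on the unicast and multicast lines. A sketch that pulls it out of the summary lines in the format shown above; `omping_loss` is a name I chose for illustration.

```shell
# omping_loss: reads omping summary lines on stdin and prints
# "<host> <unicast|multicast> <loss%>" for each summary line.
omping_loss() {
    awk '
        $3 == "unicast," || $3 == "multicast," {
            kind = $3
            sub(/,$/, "", kind)
            split($6, f, "/")        # f[1]=sent f[2]=received f[3]=loss
            loss = f[3]
            sub(/%,?$/, "", loss)
            print $1, kind, loss
        }
    '
}

# Example:
# omping -m 239.192.109.7 -c 600 -i 1 -F -q s2 s1 | omping_loss
```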
The hosts files are as follows:
root@s1:~# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
10.0.0.5 s1 pvelocalhost
10.0.0.6 s2
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
and
root@s2:~# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
10.0.0.6 s2 pvelocalhost
10.0.0.5 s1
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
The corosync service is running (the same goes for s2):
root@s1:~# journalctl -u corosync.service --no-pager
-- Logs begin at Sat 2019-06-22 17:05:48 EEST, end at Sat 2019-06-22 17:47:20 EEST. --
Jun 22 17:05:53 s1 systemd[1]: Starting Corosync Cluster Engine...
Jun 22 17:05:53 s1 corosync[2713]: [MAIN ] Corosync Cluster Engine ('2.4.4-dirty'): started and ready to provide service.
Jun 22 17:05:53 s1 corosync[2713]: [MAIN ] Corosync built-in features: dbus rdma monitoring watchdog systemd xmlconf qdevices qnetd snmp pie relro bindnow
Jun 22 17:05:53 s1 corosync[2713]: notice [MAIN ] Corosync Cluster Engine ('2.4.4-dirty'): started and ready to provide service.
Jun 22 17:05:53 s1 corosync[2713]: info [MAIN ] Corosync built-in features: dbus rdma monitoring watchdog systemd xmlconf qdevices qnetd snmp pie relro bindnow
Jun 22 17:05:54 s1 corosync[2713]: [MAIN ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
Jun 22 17:05:54 s1 corosync[2713]: warning [MAIN ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
Jun 22 17:05:54 s1 corosync[2713]: warning [MAIN ] Please migrate config file to nodelist.
Jun 22 17:05:54 s1 corosync[2713]: [MAIN ] Please migrate config file to nodelist.
Jun 22 17:05:54 s1 corosync[2713]: notice [TOTEM ] Initializing transport (UDP/IP Multicast).
Jun 22 17:05:54 s1 corosync[2713]: notice [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Jun 22 17:05:54 s1 corosync[2713]: [TOTEM ] Initializing transport (UDP/IP Multicast).
Jun 22 17:05:54 s1 corosync[2713]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Jun 22 17:05:54 s1 corosync[2713]: notice [TOTEM ] The network interface [10.0.0.5] is now up.
Jun 22 17:05:54 s1 corosync[2713]: notice [SERV ] Service engine loaded: corosync configuration map access [0]
Jun 22 17:05:54 s1 corosync[2713]: info [QB ] server name: cmap
Jun 22 17:05:54 s1 corosync[2713]: notice [SERV ] Service engine loaded: corosync configuration service [1]
Jun 22 17:05:54 s1 corosync[2713]: info [QB ] server name: cfg
Jun 22 17:05:54 s1 corosync[2713]: notice [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jun 22 17:05:54 s1 corosync[2713]: info [QB ] server name: cpg
Jun 22 17:05:54 s1 corosync[2713]: notice [SERV ] Service engine loaded: corosync profile loading service [4]
Jun 22 17:05:54 s1 corosync[2713]: [TOTEM ] The network interface [10.0.0.5] is now up.
Jun 22 17:05:54 s1 corosync[2713]: notice [SERV ] Service engine loaded: corosync resource monitoring service [6]
Jun 22 17:05:54 s1 corosync[2713]: warning [WD ] Watchdog not enabled by configuration
Jun 22 17:05:54 s1 corosync[2713]: warning [WD ] resource load_15min missing a recovery key.
Jun 22 17:05:54 s1 corosync[2713]: warning [WD ] resource memory_used missing a recovery key.
Jun 22 17:05:54 s1 corosync[2713]: info [WD ] no resources configured.
Jun 22 17:05:54 s1 corosync[2713]: notice [SERV ] Service engine loaded: corosync watchdog service [7]
Jun 22 17:05:54 s1 corosync[2713]: notice [QUORUM] Using quorum provider corosync_votequorum
Jun 22 17:05:54 s1 corosync[2713]: notice [QUORUM] This node is within the primary component and will provide service.
Jun 22 17:05:54 s1 corosync[2713]: notice [QUORUM] Members[0]:
Jun 22 17:05:54 s1 corosync[2713]: notice [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jun 22 17:05:54 s1 corosync[2713]: info [QB ] server name: votequorum
Jun 22 17:05:54 s1 corosync[2713]: notice [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jun 22 17:05:54 s1 corosync[2713]: info [QB ] server name: quorum
Jun 22 17:05:54 s1 corosync[2713]: notice [TOTEM ] A new membership (10.0.0.5:182116) was formed. Members joined: 1
Jun 22 17:05:54 s1 corosync[2713]: [SERV ] Service engine loaded: corosync configuration map access [0]
Jun 22 17:05:54 s1 systemd[1]: Started Corosync Cluster Engine.
Jun 22 17:05:54 s1 corosync[2713]: warning [CPG ] downlist left_list: 0 received
Jun 22 17:05:54 s1 corosync[2713]: notice [QUORUM] Members[1]: 1
Jun 22 17:05:54 s1 corosync[2713]: notice [MAIN ] Completed service synchronization, ready to provide service.
Jun 22 17:05:54 s1 corosync[2713]: [QB ] server name: cmap
Jun 22 17:05:54 s1 corosync[2713]: [SERV ] Service engine loaded: corosync configuration service [1]
Jun 22 17:05:54 s1 corosync[2713]: [QB ] server name: cfg
Jun 22 17:05:54 s1 corosync[2713]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jun 22 17:05:54 s1 corosync[2713]: [QB ] server name: cpg
Jun 22 17:05:54 s1 corosync[2713]: [SERV ] Service engine loaded: corosync profile loading service [4]
Jun 22 17:05:54 s1 corosync[2713]: [SERV ] Service engine loaded: corosync resource monitoring service [6]
Jun 22 17:05:54 s1 corosync[2713]: [WD ] Watchdog not enabled by configuration
Jun 22 17:05:54 s1 corosync[2713]: [WD ] resource load_15min missing a recovery key.
Jun 22 17:05:54 s1 corosync[2713]: [WD ] resource memory_used missing a recovery key.
Jun 22 17:05:54 s1 corosync[2713]: [WD ] no resources configured.
Jun 22 17:05:54 s1 corosync[2713]: [SERV ] Service engine loaded: corosync watchdog service [7]
Jun 22 17:05:54 s1 corosync[2713]: [QUORUM] Using quorum provider corosync_votequorum
Jun 22 17:05:54 s1 corosync[2713]: [QUORUM] This node is within the primary component and will provide service.
Jun 22 17:05:54 s1 corosync[2713]: [QUORUM] Members[0]:
Jun 22 17:05:54 s1 corosync[2713]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jun 22 17:05:54 s1 corosync[2713]: [QB ] server name: votequorum
Jun 22 17:05:54 s1 corosync[2713]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jun 22 17:05:54 s1 corosync[2713]: [QB ] server name: quorum
Jun 22 17:05:54 s1 corosync[2713]: [TOTEM ] A new membership (10.0.0.5:182116) was formed. Members joined: 1
Jun 22 17:05:54 s1 corosync[2713]: [CPG ] downlist left_list: 0 received
Jun 22 17:05:54 s1 corosync[2713]: [QUORUM] Members[1]: 1
Jun 22 17:05:54 s1 corosync[2713]: [MAIN ] Completed service synchronization, ready to provide service.
Jun 22 17:26:40 s1 corosync[2713]: notice [TOTEM ] A new membership (10.0.0.5:184780) was formed. Members
Jun 22 17:26:40 s1 corosync[2713]: [TOTEM ] A new membership (10.0.0.5:184780) was formed. Members
Jun 22 17:26:40 s1 corosync[2713]: warning [CPG ] downlist left_list: 0 received
Jun 22 17:26:40 s1 corosync[2713]: notice [QUORUM] Members[1]: 1
Jun 22 17:26:40 s1 corosync[2713]: notice [MAIN ] Completed service synchronization, ready to provide service.
Jun 22 17:26:40 s1 corosync[2713]: [CPG ] downlist left_list: 0 received
Jun 22 17:26:40 s1 corosync[2713]: [QUORUM] Members[1]: 1
Jun 22 17:26:40 s1 corosync[2713]: [MAIN ] Completed service synchronization, ready to provide service.
tcpdump shows activity on port 5404, so I conclude that the two nodes are communicating:
root@s1:~# tcpdump port 5404 | grep -v "192\.168\.0\.7"
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vmbr0, link-type EN10MB (Ethernet), capture size 262144 bytes
17:54:05.306075 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:54:05.609111 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:54:05.912145 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:54:06.014427 IP s2.5404 > 239.192.109.7.5405: UDP, length 296
17:54:06.215173 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:54:06.518208 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:54:06.821242 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:54:07.124277 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:54:07.427312 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:54:07.730347 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:54:07.875423 IP s1.5404 > 239.192.109.7.5405: UDP, length 88
17:54:08.076147 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:54:08.316885 IP s2.5404 > 239.192.109.7.5405: UDP, length 296
17:54:08.379755 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:54:08.682792 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:54:08.985856 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:54:09.288923 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
^C121 packets captured
133 packets received by filter
0 packets dropped by kernel
root@s2:~# tcpdump port 5404 | grep -v "192\.168\.0\.7"
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on enp2s0f0, link-type EN10MB (Ethernet), capture size 262144 bytes
17:53:31.114024 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:53:31.413210 IP s2.5404 > 239.192.109.7.5405: UDP, length 296
17:53:31.417049 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:53:31.720082 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:53:32.023114 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:53:32.326150 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:53:32.629171 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:53:32.883822 IP s1.5404 > 239.192.109.7.5405: UDP, length 88
^C86 packets captured
110 packets received by filter
0 packets dropped by kernel
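Counting packets per sender makes it easier to confirm that both hosts show up in a capture. Keep in mind that packets on the wire only show that traffic arrives. They do not show that corosync accepts it, and the membership lists earlier still contain a single node each. A sketch; `packets_per_source` is a hypothetical helper name.

```shell
# packets_per_source: reads tcpdump text output on stdin and prints
# "<count> <source>" per sending host, busiest first.
packets_per_source() {
    awk '
        $2 == "IP" {
            src = $3
            sub(/\.[0-9]+$/, "", src)   # drop the trailing port number
            n[src]++
        }
        END { for (s in n) print n[s], s }
    ' | sort -rn
}

# Example (capture 100 packets, then summarise):
# tcpdump -c 100 port 5404 | packets_per_source
```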
The pve-cluster service shows some errors on s2:
root@s1:~# journalctl -u pve-cluster --no-pager
-- Logs begin at Sat 2019-06-22 17:05:48 EEST, end at Sat 2019-06-22 18:00:20 EEST. --
Jun 22 17:05:51 s1 systemd[1]: Starting The Proxmox VE cluster filesystem...
Jun 22 17:05:51 s1 pmxcfs[2637]: [quorum] crit: quorum_initialize failed: 2
Jun 22 17:05:51 s1 pmxcfs[2637]: [quorum] crit: can't initialize service
Jun 22 17:05:51 s1 pmxcfs[2637]: [confdb] crit: cmap_initialize failed: 2
Jun 22 17:05:51 s1 pmxcfs[2637]: [confdb] crit: can't initialize service
Jun 22 17:05:51 s1 pmxcfs[2637]: [dcdb] crit: cpg_initialize failed: 2
Jun 22 17:05:51 s1 pmxcfs[2637]: [dcdb] crit: can't initialize service
Jun 22 17:05:51 s1 pmxcfs[2637]: [status] crit: cpg_initialize failed: 2
Jun 22 17:05:51 s1 pmxcfs[2637]: [status] crit: can't initialize service
Jun 22 17:05:53 s1 systemd[1]: Started The Proxmox VE cluster filesystem.
Jun 22 17:05:57 s1 pmxcfs[2637]: [status] notice: update cluster info (cluster name AdvaitaCluster1, version = 8)
Jun 22 17:05:57 s1 pmxcfs[2637]: [status] notice: node has quorum
Jun 22 17:05:57 s1 pmxcfs[2637]: [dcdb] notice: members: 1/2637
Jun 22 17:05:57 s1 pmxcfs[2637]: [dcdb] notice: all data is up to date
Jun 22 17:05:57 s1 pmxcfs[2637]: [status] notice: members: 1/2637
Jun 22 17:05:57 s1 pmxcfs[2637]: [status] notice: all data is up to date
root@s2:~# journalctl -u pve-cluster --no-pager
[...]
Jun 22 18:01:46 s2 pmxcfs[15830]: [status] crit: cpg_send_message failed: 6
Jun 22 18:01:47 s2 pmxcfs[15830]: [status] notice: cpg_send_message retry 10
Jun 22 18:01:48 s2 pmxcfs[15830]: [status] notice: cpg_send_message retry 20
Jun 22 18:01:49 s2 pmxcfs[15830]: [status] notice: cpg_send_message retry 30
Jun 22 18:01:50 s2 pmxcfs[15830]: [status] notice: cpg_send_message retry 40
Jun 22 18:01:51 s2 pmxcfs[15830]: [status] notice: cpg_send_message retry 50
Jun 22 18:01:52 s2 pmxcfs[15830]: [status] notice: cpg_send_message retry 60
Jun 22 18:01:53 s2 pmxcfs[15830]: [status] notice: cpg_send_message retry 70
Jun 22 18:01:54 s2 pmxcfs[15830]: [status] notice: cpg_send_message retry 80
Jun 22 18:01:54 s2 systemd[1]: Stopping The Proxmox VE cluster filesystem...
Jun 22 18:01:54 s2 pmxcfs[15830]: [main] notice: teardown filesystem
Jun 22 18:01:55 s2 pmxcfs[15830]: [status] notice: cpg_send_message retry 90
Jun 22 18:01:56 s2 pmxcfs[15830]: [status] notice: cpg_send_message retry 100
Jun 22 18:01:56 s2 pmxcfs[15830]: [status] notice: cpg_send_message retried 100 times
Jun 22 18:01:56 s2 pmxcfs[15830]: [status] crit: cpg_send_message failed: 6
Jun 22 18:02:04 s2 systemd[1]: pve-cluster.service: State 'stop-sigterm' timed out. Killing.
Jun 22 18:02:04 s2 systemd[1]: pve-cluster.service: Killing process 15830 (pmxcfs) with signal SIGKILL.
Jun 22 18:02:04 s2 systemd[1]: pve-cluster.service: Main process exited, code=killed, status=9/KILL
Jun 22 18:02:04 s2 systemd[1]: Stopped The Proxmox VE cluster filesystem.
Jun 22 18:02:04 s2 systemd[1]: pve-cluster.service: Unit entered failed state.
Jun 22 18:02:04 s2 systemd[1]: pve-cluster.service: Failed with result 'timeout'.
Jun 22 18:02:04 s2 systemd[1]: Starting The Proxmox VE cluster filesystem...
Jun 22 18:02:04 s2 pmxcfs[30809]: [status] notice: update cluster info (cluster name AdvaitaCluster1, version = 7)
Jun 22 18:02:06 s2 systemd[1]: Started The Proxmox VE cluster filesystem.
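The two logs above report different cluster configuration versions (version = 8 on s1, version = 7 on s2). Whether that difference is the cause is not established here, but it is easy to check on each node. A sketch that extracts the number from the "update cluster info" lines; `cluster_version` is a name I made up.

```shell
# cluster_version: reads pmxcfs journal lines on stdin and prints the
# version from the last "update cluster info" line, if there is one.
cluster_version() {
    sed -n 's/.*update cluster info (cluster name [^,]*, version = \([0-9][0-9]*\)).*/\1/p' |
        tail -n 1
}

# Example (run on each node and compare the numbers):
# journalctl -u pve-cluster --no-pager | cluster_version
```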
pve-firewall is not enabled.
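Answer 1
This is what I did to get things working. There must be a better way.
1. Get rid of the old cluster
I followed the instructions here for removing a node.
Stop the services: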
systemctl stop corosync
systemctl stop pve-cluster
Start the cluster file system in local mode:
pmxcfs -l
Create a backup folder and move into it the things the instructions say to delete, on both nodes:
cd ~
mkdir backup-pve-2019-06-23-07-34
mv /etc/pve/corosync.conf backup-pve-2019-06-23-07-34/
mkdir backup-pve-2019-06-23-07-34/etc/corosync -p
mv /etc/corosync/* backup-pve-2019-06-23-07-34/etc/corosync/
mkdir backup-pve-2019-06-23-07-34/var/lib/corosync/ -p
mv /var/lib/corosync/* backup-pve-2019-06-23-07-34/var/lib/corosync/
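A more cautious variant is to copy first and delete only after checking the copy. The sketch below is not what I ran; `backup_dir_name` and `backup_path` are hypothetical helpers, and the ROOT argument exists only so the function can be tried on a scratch directory instead of `/`.

```shell
# backup_dir_name: prints a name in the same style as used above,
# e.g. backup-pve-2019-06-23-07-34, based on the current time.
backup_dir_name() {
    echo "backup-pve-$(date +%Y-%m-%d-%H-%M)"
}

# backup_path ROOT DEST PATH
# Copies ROOT/PATH to DEST/PATH, creating parent directories, and leaves
# the original in place (unlike mv). Returns 1 if ROOT/PATH is missing.
backup_path() {
    root=$1
    dest=$2
    rel=$3
    [ -e "$root/$rel" ] || return 1
    mkdir -p "$dest/$(dirname "$rel")" && cp -a "$root/$rel" "$dest/$rel"
}

# Example (DEST must be a fresh directory):
# dest=~/$(backup_dir_name)
# backup_path / "$dest" etc/corosync
# backup_path / "$dest" var/lib/corosync
```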
For the next step, /etc/pve needs to be mounted:
killall pmxcfs
systemctl start pve-cluster
pvecm expected 1
mkdir backup-pve-2019-06-23-07-34/etc/pve/nodes -p
mv /etc/pve/nodes/s1 backup-pve-2019-06-23-07-34/etc/pve/nodes/
You cannot add a node to a cluster if that node has containers on it, so back up and destroy all containers on one of the nodes (for example, s2).
root@s2:~# vzdump 100 101 ...
root@s2:~# pct destroy 100
root@s2:~# pct destroy 101
root@s2:~# ...
2. Create a new cluster
Create the cluster on one of the nodes (the one that kept its containers):
root@s1:~# pvecm create NewClusterName
Add the other one:
root@s2:~# pvecm add 10.0.0.5
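Being the fine piece of software that it is, it gets stuck at waiting for quorum...
So press CTRL+C to exit and reboot both nodes.
Check the storage status to see which storage has enough space: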
root@s2:~# pvesm status
Now restore the containers (replace local with the storage you decided on earlier; the file names will differ, of course):
root@s2:~# pct restore 100 /var/lib/vz/dump/vzdump-lxc-100-2019_06_23-07_51_29.tar -storage local
root@s2:~# pct restore 101 /var/lib/vz/dump/vzdump-lxc-101-2019_06_23-07_51_29.tar -storage local
root@s2:~# ...
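With many containers, typing one restore line per dump gets tedious. The container ID is part of the dump file name, so the commands can be generated. A sketch that only prints the commands and never runs them; `vmid_from_dump` and `restore_commands` are names I made up, and the file name pattern is assumed from the two examples above.

```shell
# vmid_from_dump FILE: prints the container ID embedded in a
# vzdump-lxc-<vmid>-<date>-<time>.tar file name, or returns 1.
vmid_from_dump() {
    name=${1##*/}                     # strip the directory part
    case $name in
        vzdump-lxc-[0-9]*-*) ;;
        *) return 1 ;;
    esac
    rest=${name#vzdump-lxc-}          # e.g. 100-2019_06_23-07_51_29.tar
    id=${rest%%-*}                    # e.g. 100
    case $id in
        *[!0-9]*) return 1 ;;         # reject IDs that are not purely numeric
    esac
    echo "$id"
}

# restore_commands STORAGE FILE...: prints one pct restore command per
# dump file; files that do not match the pattern are skipped.
restore_commands() {
    storage=$1
    shift
    for f in "$@"; do
        vmid=$(vmid_from_dump "$f") || continue
        echo "pct restore $vmid $f -storage $storage"
    done
}

# Example (review the output before running any of it):
# restore_commands local /var/lib/vz/dump/vzdump-lxc-*.tar
```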