mount.ocfs2: Transport endpoint is not connected while mounting…?

I have replaced a dead node in a dual-primary DRBD setup running OCFS2. All the steps worked:

/proc/drbd

version: 8.3.13 (api:88/proto:86-96)
GIT-hash: 83ca112086600faacab2f157bc5a9324f7bd7f77 build by [email protected], 2012-05-07 11:56:36

 1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:81 nr:407832 dw:106657970 dr:266340 al:179 bm:6551 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

until I tried to mount the volume:

mount -t ocfs2 /dev/drbd1 /data/webroot/
mount.ocfs2: Transport endpoint is not connected while mounting /dev/drbd1 on /data/webroot/. Check 'dmesg' for more information on this error.

/var/log/kern.log

kernel: (o2net,11427,1):o2net_connect_expired:1664 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.
kernel: (mount.ocfs2,12037,1):dlm_request_join:1036 ERROR: status = -107
kernel: (mount.ocfs2,12037,1):dlm_try_to_join_domain:1210 ERROR: status = -107
kernel: (mount.ocfs2,12037,1):dlm_join_domain:1488 ERROR: status = -107
kernel: (mount.ocfs2,12037,1):dlm_register_domain:1754 ERROR: status = -107
kernel: (mount.ocfs2,12037,1):ocfs2_dlm_init:2808 ERROR: status = -107
kernel: (mount.ocfs2,12037,1):ocfs2_mount_volume:1447 ERROR: status = -107
kernel: ocfs2: Unmounting device (147,1) on (node 1)
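
Status -107 is -ENOTCONN, the same "Transport endpoint is not connected" that mount.ocfs2 reports. A quick way to confirm the mapping (a one-liner sketch, assuming perl is available, as it is on a stock EL5 box):

# perl -e '$! = 107; print "$!\n"'
Transport endpoint is not connected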

And here is the kernel log on node 0 (192.168.3.145):

kernel: : (swapper,0,7):o2net_listen_data_ready:1894 bytes: 0
kernel: : (o2net,4024,3):o2net_accept_one:1800 attempt to connect from unknown node at 192.168.2.93:43868
kernel: : (o2net,4024,3):o2net_connect_expired:1664 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.
kernel: : (o2net,4024,3):o2net_set_nn_state:478 node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err 0 -> -107

I am sure that /etc/ocfs2/cluster.conf is identical on both nodes:

/etc/ocfs2/cluster.conf

node:
    ip_port = 7777
    ip_address = 192.168.3.145
    number = 0
    name = SVR233NTC-3145.localdomain
    cluster = cpc

node:
    ip_port = 7777
    ip_address = 192.168.2.93
    number = 1
    name = SVR022-293.localdomain
    cluster = cpc

cluster:
    node_count = 2
    name = cpc
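
Since the file has to match on every node, it is safer to verify that byte-for-byte instead of by eye, e.g.:

# md5sum /etc/ocfs2/cluster.conf
# ssh 192.168.2.93 md5sum /etc/ocfs2/cluster.conf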

They can also connect to each other just fine:

# nc -z 192.168.3.145 7777
Connection to 192.168.3.145 7777 port [tcp/cbt] succeeded!

But the O2CB heartbeat is not active on the new node (192.168.2.93):

# /etc/init.d/o2cb status

Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster cpc: Online
Heartbeat dead threshold = 31
  Network idle timeout: 30000
  Network keepalive delay: 2000
  Network reconnect delay: 2000
Checking O2CB heartbeat: Not active
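
One thing worth ruling out: the o2net handshake carries the heartbeat threshold and the three network timeouts shown above, and a node drops connections whose values differ from its own. Comparing the settings on both nodes is cheap (variable names as in the stock /etc/sysconfig/o2cb on EL5):

# grep -E 'O2CB_(HEARTBEAT_THRESHOLD|IDLE_TIMEOUT_MS|KEEPALIVE_DELAY_MS|RECONNECT_DELAY_MS)' /etc/sysconfig/o2cb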

Here is the tcpdump output captured on node 0 while starting ocfs2 on node 1:

  1   0.000000 192.168.2.93 -> 192.168.3.145 TCP 70 55274 > cbt [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSval=690432180 TSecr=0
  2   0.000008 192.168.3.145 -> 192.168.2.93 TCP 70 cbt > 55274 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 TSval=707657223 TSecr=690432180
  3   0.000223 192.168.2.93 -> 192.168.3.145 TCP 66 55274 > cbt [ACK] Seq=1 Ack=1 Win=5840 Len=0 TSval=690432181 TSecr=707657223
  4   0.000286 192.168.2.93 -> 192.168.3.145 TCP 98 55274 > cbt [PSH, ACK] Seq=1 Ack=1 Win=5840 Len=32 TSval=690432181 TSecr=707657223
  5   0.000292 192.168.3.145 -> 192.168.2.93 TCP 66 cbt > 55274 [ACK] Seq=1 Ack=33 Win=5792 Len=0 TSval=707657223 TSecr=690432181
  6   0.000324 192.168.3.145 -> 192.168.2.93 TCP 66 cbt > 55274 [RST, ACK] Seq=1 Ack=33 Win=5792 Len=0 TSval=707657223 TSecr=690432181

The RST is sent as the 6th packet of every connection attempt. The 32-byte PSH payload in packet 4 is the o2net handshake, and node 0 resets the connection right after receiving it, which matches the "attempt to connect from unknown node" message in its kernel log.
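
For reference, a capture like the one above can be reproduced on node 0 with something like the following (the interface name is an assumption):

# tcpdump -nn -i eth0 -s 0 'tcp port 7777'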

What else can I do to debug this case?

PS:

OCFS2 versions on node 0:

  • ocfs2-tools-1.4.4-1.el5
  • ocfs2-2.6.18-274.12.1.el5-1.4.7-1.el5

OCFS2 versions on node 1:

  • ocfs2-tools-1.4.4-1.el5
  • ocfs2-2.6.18-308.el5-1.4.7-1.el5

UPDATE 1 - Sun Dec 23 18:15:07 ICT 2012

Are both nodes on the same LAN segment, with no routers in between?

No, they are 2 VMware servers on different subnets.

Oh, while I remember: are the hostnames/DNS set up and working correctly?

Sure, I added the hostname and IP address of each node to /etc/hosts:

192.168.2.93    SVR022-293.localdomain
192.168.3.145   SVR233NTC-3145.localdomain

and they can connect to each other by hostname:

# nc -z SVR022-293.localdomain 7777
Connection to SVR022-293.localdomain 7777 port [tcp/cbt] succeeded!

# nc -z SVR233NTC-3145.localdomain 7777
Connection to SVR233NTC-3145.localdomain 7777 port [tcp/cbt] succeeded!

UPDATE 2 - Mon Dec 24 18:32:15 ICT 2012

Found a clue: my colleague manually edited /etc/ocfs2/cluster.conf while the cluster was running. So the dead node's information is still kept in /sys/kernel/config/cluster/:

# ls -l /sys/kernel/config/cluster/cpc/node/
total 0
drwxr-xr-x 2 root root 0 Dec 24 18:21 SVR150-4107.localdomain
drwxr-xr-x 2 root root 0 Dec 24 18:21 SVR233NTC-3145.localdomain

(the dead node being SVR150-4107.localdomain in this case)
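
o2cb node entries are plain configfs directories, so in principle the stale one can be dropped with rmdir; this only succeeds once nothing in the kernel still references the node (a sketch, expected to fail with "Device or resource busy" otherwise):

# rmdir /sys/kernel/config/cluster/cpc/node/SVR150-4107.localdomain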

I tried to stop the cluster in order to delete the dead node, but got the following error:

# /etc/init.d/o2cb stop
Stopping O2CB cluster cpc: Failed
Unable to stop cluster as heartbeat region still active

I am sure the ocfs2 service has already been stopped:

# mounted.ocfs2 -f
Device                FS     Nodes
/dev/sdb              ocfs2  Not mounted
/dev/drbd1            ocfs2  Not mounted

and there are no references left:

# ocfs2_hb_ctl -I -u 12963EAF4E16484DB81ECB0251177C26
12963EAF4E16484DB81ECB0251177C26: 0 refs

I also unloaded the ocfs2 kernel module to be sure:

# ps -ef | grep [o]cfs2
root     12513    43  0 18:25 ?        00:00:00 [ocfs2_wq]

# modprobe -r ocfs2
# ps -ef | grep [o]cfs2
# lsof | grep ocfs2
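
For what it is worth, modprobe -r ocfs2 only removes the filesystem module; the o2cb cluster modules (ocfs2_nodemanager, ocfs2_dlm, and friends) stay loaded while the cluster is online, as lsmod will show:

# lsmod | grep -E 'ocfs2|configfs'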

But nothing changed:

# /etc/init.d/o2cb offline
Stopping O2CB cluster cpc: Failed
Unable to stop cluster as heartbeat region still active

So the final question is: how do I remove the dead node information without rebooting?


UPDATE 3 - Mon Dec 24 22:41:51 ICT 2012

Here are all the running heartbeat threads:

# ls -l /sys/kernel/config/cluster/cpc/heartbeat/ | grep '^d'
drwxr-xr-x 2 root root    0 Dec 24 22:18 72EF09EA3D0D4F51BDC00B47432B1EB2

Reference count of this heartbeat region:

# ocfs2_hb_ctl -I -u 72EF09EA3D0D4F51BDC00B47432B1EB2
72EF09EA3D0D4F51BDC00B47432B1EB2: 7 refs

Trying to kill it:

# ocfs2_hb_ctl -K -u 72EF09EA3D0D4F51BDC00B47432B1EB2
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat

Any ideas?

Answer 1

Oh yeah! Problem solved.

Pay attention to the UUIDs:

# mounted.ocfs2 -d
Device                FS     Stack  UUID                              Label
/dev/sdb              ocfs2  o2cb   12963EAF4E16484DB81ECB0251177C26  ocfs2_drbd1
/dev/drbd1            ocfs2  o2cb   12963EAF4E16484DB81ECB0251177C26  ocfs2_drbd1

But the heartbeat region registered in the kernel still carries a different UUID:

# ls -l /sys/kernel/config/cluster/cpc/heartbeat/
drwxr-xr-x 2 root root    0 Dec 24 22:53 72EF09EA3D0D4F51BDC00B47432B1EB2

This is probably because I "accidentally" force-reformatted the OCFS2 volume. The problem I faced is similar to a thread on the Ocfs2-users mailing list.

This is also the cause of the following error:

ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat

because ocfs2_hb_ctl cannot find a device with UUID 72EF09EA3D0D4F51BDC00B47432B1EB2 in /proc/partitions.

Then an idea came to mind: can I change the UUID of an OCFS2 volume?

Looking at the tunefs.ocfs2 man page:

Usage: tunefs.ocfs2 [options] <device> [new-size]
       tunefs.ocfs2 -h|--help
       tunefs.ocfs2 -V|--version
[options] can be any mix of:
        -U|--uuid-reset[=new-uuid]

So I ran the following command:

# tunefs.ocfs2 --uuid-reset=72EF09EA3D0D4F51BDC00B47432B1EB2 /dev/drbd1
WARNING!!! OCFS2 uses the UUID to uniquely identify a file system. 
Having two OCFS2 file systems with the same UUID could, in the least, 
cause erratic behavior, and if unlucky, cause file system damage. 
Please choose the UUID with care.
Update the UUID ?yes

Verify:

# tunefs.ocfs2 -Q "%U\n" /dev/drbd1 
72EF09EA3D0D4F51BDC00B47432B1EB2

Trying to kill the heartbeat region again to see what happens:

# ocfs2_hb_ctl -K -u 72EF09EA3D0D4F51BDC00B47432B1EB2
# ocfs2_hb_ctl -I -u 72EF09EA3D0D4F51BDC00B47432B1EB2
72EF09EA3D0D4F51BDC00B47432B1EB2: 6 refs
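
Rather than repeating the kill/check cycle by hand, it can be scripted (a minimal sketch using the same UUID):

# while ! ocfs2_hb_ctl -I -u 72EF09EA3D0D4F51BDC00B47432B1EB2 | grep -q ': 0 refs'; do
>   ocfs2_hb_ctl -K -u 72EF09EA3D0D4F51BDC00B47432B1EB2
> done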

Keep killing it until I see 0 refs, then take the cluster offline:

# /etc/init.d/o2cb offline cpc
Stopping O2CB cluster cpc: OK

and stop it:

# /etc/init.d/o2cb stop
Stopping O2CB cluster cpc: OK
Unloading module "ocfs2": OK
Unmounting ocfs2_dlmfs filesystem: OK
Unloading module "ocfs2_dlmfs": OK
Unmounting configfs filesystem: OK
Unloading module "configfs": OK

Start it again to see if the node information has been updated:

# /etc/init.d/o2cb start
Loading filesystem "configfs": OK
Mounting configfs filesystem at /sys/kernel/config: OK
Loading filesystem "ocfs2_dlmfs": OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Starting O2CB cluster cpc: OK

# ls -l /sys/kernel/config/cluster/cpc/node/
total 0
drwxr-xr-x 2 root root 0 Dec 26 19:02 SVR022-293.localdomain
drwxr-xr-x 2 root root 0 Dec 26 19:02 SVR233NTC-3145.localdomain

OK, on the peer node (192.168.2.93), try to start OCFS2:

# /etc/init.d/ocfs2 start
Starting Oracle Cluster File System (OCFS2)                [  OK  ]

Thanks to Sunil Mushran, because his mailing-list thread helped me solve this problem.

The lessons learned:

  1. IP address, port, etc. can only be changed when the cluster is offline. See the OCFS2 FAQ, and the o2cb_ctl sketch after this list.
  2. Never force-reformat an OCFS2 volume.
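
For lesson 1: if a node really has to be added while the cluster is online, the supported path is o2cb_ctl rather than hand-editing cluster.conf, so that both the file and the in-kernel registry get updated. A sketch based on the recipe in the OCFS2 FAQ (NEWNODE, NUM and IP are placeholders):

# o2cb_ctl -C -i -n NEWNODE -t node -a number=NUM -a ip_address=IP -a ip_port=7777 -a cluster=cpc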
