Ganeti 磁盘降级 drbd cs:NetworkFailure

Ganeti 磁盘降级 drbd cs:NetworkFailure

我在 Ganeti 上有一个实例(有 2 个磁盘),两个磁盘都降级了(可能是由于连接问题?)。这个实例多年来一直运行正常,直到今天早上。

在我的主人身上

$ gnt-instance info myinstance
...
   -disk/0
      on primary:   /dev/drbd4 (147:4) in sync, status *DEGRADED*
      on secondary: /dev/drbd4 (147:4) in sync, status *DEGRADED*
      child devices:
        - child 0: lvm, size 20.0G
          logical_id:   kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data
          on primary:   /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data (254:10)
          on secondary: /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data (254:8)
        - child 1: lvm, size 128M
          logical_id:   kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta
          on primary:   /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta (254:11)
          on secondary: /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta (254:9)

...

在主节点上

$ cat /proc/drbd
 4: cs:NetworkFailure ro:Primary/Unknown ds:UpToDate/DUnknown C r----
    ns:678399926 nr:0 dw:678315292 dr:25942012 al:22230 bm:16189 lo:0 pe:196 ua:0 ap:195 ep:1 wo:b oos:0

在辅助节点上

$ cat /proc/drbd
 4: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r----
    ns:0 nr:678340009 dw:678340009 dr:0 al:0 bm:14884 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

我无法重新启动或关闭实例(操作超时)。

我认为这不是大脑分裂问题,因为没有“独立”,并且在主节点上它是“主/未知”,而在辅助节点上它是“次要/未知”。

我尝试在辅助节点上运行“drbdadm connect all”,但没有任何反应。

我尝试更换磁盘但失败了:

gnt-instance replace-disks -s myinstance
Thu Jun  2 11:32:00 2016 Replacing disk(s) 0, 1 for myinstancel
Thu Jun  2 11:36:00 2016  - WARNING: Could not prepare block device disk/1 on node primaryNode (is_primary=False, pass=1): Error while assembling disk: drbd5: cannot activate, unknown or unhandled reason
Thu Jun  2 11:38:01 2016  - WARNING: Could not prepare block device disk/0 on node primaryNode (is_primary=True, pass=2): Error while assembling disk: drbd4: cannot activate, unknown or unhandled reason
Thu Jun  2 11:40:02 2016  - WARNING: Could not prepare block device disk/1 on node primaryNode (is_primary=True, pass=2): Error while assembling disk: drbd5: cannot activate, unknown or unhandled reason
Failure: command execution error:
Disk consistency error

现在看起来像这样:

$ gnt-instance info myinstance
...
    -disk/0 
      on primary:   /dev/drbd4 (147:4) in sync, status *DEGRADED*
      (no more secondary)
      child devices:
        - child 0: lvm, size 20.0G
          logical_id:   kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data
          on primary:   /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data (254:10)
          on secondary: /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data (254:8)
        - child 1: lvm, size 128M
          logical_id:   kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta
          on primary:   /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta (254:11)
          on secondary: /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta (254:9)

在主节点上

$ cat /proc/drbd
 4: cs:NetworkFailure ro:Primary/Unknown ds:UpToDate/DUnknown C r----
    ns:678399926 nr:0 dw:678315292 dr:25942012 al:22230 bm:16189 lo:0 pe:196 ua:0 ap:195 ep:1 wo:b oos:0

在辅助节点上:

$ cat /proc/drbd
...
4: cs:Unconfigured
5: cs:Unconfigured

知道如何解决这个问题吗?

DRBD 版本:8.3.7

Ganeti版本:2.4.5

操作系统:Debian 6.0

答案1

经过进一步调查后,我发现主节点上存在 kvm 僵尸进程:

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                      
17520 root    20   0     0    0    0 Z  613  0.0  13922:24 kvm <defunct> 

我不确定如何正确地摆脱它。

我尝试从此节点迁移所有主实例(我只有 2 个),但失败了(与 bdrm 相关的错误)。我重新启动了节点。关闭机器时,由于 drbd,它卡住了。消息如下:

No response from the DRBD driver! Is the module loaded?

于是我按下了关闭机器的按钮。机器重新启动(没有任何错误),几分钟后 Ganeti 实例自动启动。

在主节点上我运行了:

$ gnt-instance info myinstance
...
     on primary:   /dev/drbd4 (147:4) *RECOVERING* 12.80%, ETA 288s, status *DEGRADED*
     on secondary: /dev/drbd4 (147:4) *RECOVERING* 12.80%, ETA 275s, status *DEGRADED* *UNCERTAIN STATE*
....

等待几分钟后,恢复完成,现在已同步。

结论:现在一切都正常,但我希望不必重新启动节点。

感谢 gf_ 的帮助。

相关内容