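Suppose I have two DRBD devices configured. When the second node connects, it syncs data from the first (primary/master) node.
During this sync, the primary node loses power.
With the Primary gone, the former Secondary is the only node available, and it is left as Secondary in the Inconsistent/DUnknown state.
Is there any way to recover from this automatically?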
version: 8.4.7 (api:1/proto:86-101)
srcversion: 0904DF2CCF7283ACE07D07A
1: cs:WFConnection ro:Secondary/Unknown ds:Inconsistent/DUnknown C r-----
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:390452
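For readers unfamiliar with the status line above: cs is the connection state, ro is the local/peer role, and ds is the local/peer disk state. The following is only a minimal sketch of pulling those fields out of such a line; the function name drbd_field is made up for illustration and is not part of DRBD.

```shell
#!/bin/sh
# Hypothetical helper (not part of DRBD): extract one field (cs, ro or ds)
# from a /proc/drbd style status line.
drbd_field() {
    # $1 = field name, $2 = status line
    printf '%s\n' "$2" | tr ' ' '\n' | sed -n "s/^$1://p" | head -n 1
}

# Status line copied from the output above.
line='1: cs:WFConnection ro:Secondary/Unknown ds:Inconsistent/DUnknown C r-----'

drbd_field cs "$line"   # connection state
drbd_field ds "$line"   # local/peer disk state
```

A monitoring script could compare the ds value against Inconsistent/DUnknown to detect the situation described here.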
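I can recover from this manually by running drbdadm promote --force <resource-name> followed by pcs resource cleanup (this is in a Pacemaker cluster), but I am looking for a way to trigger this recovery automatically.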
Full log for the example
[ 20.233788] drbd: initialized. Version: 8.4.7 (api:1/proto:86-101)
[ 20.234905] drbd: srcversion: 0904DF2CCF7283ACE07D07A
[ 20.235791] drbd: registered as block device major 147
[ 22.402786] drbd shareddata: Starting worker thread (from drbdsetup-84 [1406])
[ 22.406433] block drbd1: disk( Diskless -> Attaching )
[ 22.407422] drbd shareddata: Method to ensure write ordering: flush
[ 22.408478] block drbd1: max BIO size = 4096
[ 22.409211] block drbd1: drbd_bm_resize called with capacity == 2097016
[ 22.410317] block drbd1: resync bitmap: bits=262127 words=4096 pages=8
[ 22.411492] block drbd1: size = 1024 MB (1048508 KB)
[ 22.413787] block drbd1: recounting of set bits took additional 0 jiffies
[ 22.414922] block drbd1: 1024 MB (262127 bits) marked out-of-sync by on disk bit-map.
[ 22.416189] block drbd1: Suspended AL updates
[ 22.416942] block drbd1: disk( Attaching -> UpToDate )
[ 22.418403] block drbd1: attached to UUIDs 9FB19F9A9D6573A9:0000000000000004:0000000000000000:0000000000000000
[ 22.460721] drbd shareddata: conn( StandAlone -> Unconnected )
[ 22.462303] drbd shareddata: Starting receiver thread (from drbd_w_sharedda [1407])
[ 22.467153] drbd shareddata: receiver (re)started
[ 22.468715] drbd shareddata: conn( Unconnected -> WFConnection )
[ 23.000120] drbd shareddata: Handshake successful: Agreed network protocol version 101
[ 23.003987] drbd shareddata: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
[ 23.008195] drbd shareddata: conn( WFConnection -> WFReportParams )
[ 23.010706] drbd shareddata: Starting ack_recv thread (from drbd_r_sharedda [1467])
[ 23.067880] block drbd1: max BIO size = 1048576
[ 23.069557] block drbd1: drbd_sync_handshake:
[ 23.070869] block drbd1: self 9FB19F9A9D6573A8:0000000000000004:0000000000000000:0000000000000000 bits:262127 flags:0
[ 23.073539] block drbd1: peer 3B5A831140811725:0000000000000004:0000000000000000:0000000000000000 bits:262127 flags:0
[ 23.076210] block drbd1: uuid_compare()=100 by rule 90
[ 23.077596] block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1
[ 23.081505] block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1 exit code 0 (0x0)
[ 23.084035] block drbd1: Split-Brain detected, 1 primaries, automatically solved. Sync from peer node
[ 23.086539] block drbd1: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) disk( UpToDate -> Outdated ) pdsk( DUnknown -> UpToDate )
[ 23.089588] block drbd1: Resumed AL updates
[ 23.103227] block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 21(1), total 21; compression: 100.0%
[ 23.105986] block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 21(1), total 21; compression: 100.0%
[ 23.108662] block drbd1: conn( WFBitMapT -> WFSyncUUID )
[ 23.127823] block drbd1: updated sync uuid 68A55F3E62EDE97C:0000000000000000:0000000000000000:0000000000000000
[ 23.136222] block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1
[ 23.140260] block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1 exit code 0 (0x0)
[ 23.142823] block drbd1: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )
[ 23.145214] block drbd1: Began resync as SyncTarget (will sync 1048508 KB [262127 bits set]).
[ 61.912243] drbd shareddata: PingAck did not arrive in time.
[ 61.914470] drbd shareddata: peer( Primary -> Unknown ) conn( SyncTarget -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
[ 61.919882] drbd shareddata: ack_receiver terminated
[ 61.921491] drbd shareddata: Terminating drbd_a_sharedda
[ 61.968612] drbd shareddata: Connection closed
[ 61.970170] drbd shareddata: conn( NetworkFailure -> Unconnected )
[ 61.971855] drbd shareddata: receiver terminated
[ 61.973304] drbd shareddata: Restarting receiver thread
[ 61.974743] drbd shareddata: receiver (re)started
[ 61.976187] drbd shareddata: conn( Unconnected -> WFConnection )
[ 62.008237] block drbd1: State change failed: Need access to UpToDate data
[ 62.010446] block drbd1: state = { cs:WFConnection ro:Secondary/Unknown ds:Inconsistent/DUnknown r----- }
[ 62.013170] block drbd1: wanted = { cs:WFConnection ro:Primary/Unknown ds:Inconsistent/DUnknown r----- }
[ 76.334863] drbd shareddata: conn( WFConnection -> Disconnecting )
[ 76.336529] drbd shareddata: Discarding network configuration.
[ 76.338082] drbd shareddata: Connection closed
[ 76.339375] drbd shareddata: conn( Disconnecting -> StandAlone )
[ 76.340898] drbd shareddata: receiver terminated
[ 76.342203] drbd shareddata: Terminating drbd_r_sharedda
[ 76.343712] block drbd1: disk( Inconsistent -> Failed )
[ 76.364417] block drbd1: 560 MB (143363 bits) marked out-of-sync by on disk bit-map.
[ 76.366742] block drbd1: disk( Failed -> Diskless )
[ 76.404579] drbd shareddata: Terminating drbd_w_sharedda
Answer 1
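If you don't care about the data, why replicate it in the first place? ;)
Since this is the initial sync, your secondary will have Inconsistent data until the sync completes. Until then, you will always have to force the secondary to become primary, which is not a good thing.
Why not skip the initial sync, and then use LVM snapshots with DRBD's before-resync-target handler to prevent this from happening?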
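To skip the initial sync, once the new device is up on both nodes and they are cs:Connected and ds:Inconsistent/Inconsistent, clear the bitmap so that the current state is treated as "consistent" (run this from one node, not both):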
# drbdadm new-current-uuid --clear-bitmap all
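As a rough sketch of the precondition just described, the command could be wrapped in a guard that checks the connection and disk states first. The function names and the resource name shareddata are assumptions for illustration, and the check assumes a single-volume resource, where drbdadm cstate and drbdadm dstate each print one line.

```shell
#!/bin/sh
# Succeeds only for the state in which skipping the initial sync makes sense.
# $1 = output of `drbdadm cstate <res>`, $2 = output of `drbdadm dstate <res>`
can_skip_initial_sync() {
    [ "$1" = "Connected" ] && [ "$2" = "Inconsistent/Inconsistent" ]
}

# Hypothetical wrapper: clear the bitmap only if the guard passes.
skip_initial_sync() {
    res=$1
    if can_skip_initial_sync "$(drbdadm cstate "$res")" "$(drbdadm dstate "$res")"; then
        drbdadm new-current-uuid --clear-bitmap "$res"
    else
        echo "refusing: $res is not Connected with Inconsistent/Inconsistent disks" >&2
        return 1
    fi
}

# Usage, on one node only (resource name is an assumption):
#   skip_initial_sync shareddata
```

Keeping the state comparison in its own function makes it possible to exercise the logic without a running DRBD resource.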
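Then use DRBD's before-resync-target/after-resync-target handlers to take/remove a snapshot of the backing LVM device before/after each resync, so that you always have a consistent dataset if a failure occurs during the resync: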
resource <resource> {
  ...
  handlers {
    before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh";
    after-resync-target "/usr/lib/drbd/unsnapshot-resync-target-lvm.sh";
  }
}
You can then revert to that snapshot with lvconvert, just as you would with any other LVM snapshot.
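A minimal sketch of such a rollback, assuming the resource is taken down first so the merge is not deferred. The resource, volume group and snapshot names below are placeholders (check lvs for the name the snapshot script actually created), and the function prints the commands by default rather than running them.

```shell
#!/bin/sh
# Print (DRY_RUN=1, the default here) or run the commands that roll the
# backing LV back to the snapshot taken before the resync.
# All three arguments are placeholders; check `lvs` for the real names.
rollback_to_snapshot() {
    res=$1 vg=$2 snap=$3
    run=${DRY_RUN:-1}
    for cmd in "drbdadm down $res" "lvconvert --merge $vg/$snap" "drbdadm up $res"; do
        if [ "$run" = 1 ]; then
            echo "$cmd"
        else
            $cmd || return 1
        fi
    done
}

# Dry run with assumed names; set DRY_RUN=0 to actually execute.
rollback_to_snapshot shareddata vg0 shareddata-before-resync
```

The dry-run default is deliberate: merging a snapshot discards everything written to the origin volume since the snapshot was taken.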