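I have a Consul cluster in production that is integrated with Terraform, Ansible, Nomad, Docker, and Vault. Consul keeps looking for a dead node that was part of the initial setup, even after I removed its entry from Raft's peers.json. Here are my Consul logs.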
==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Consul agent running!
Node name: 'ip-10-10-2-49'
Datacenter: 'us-east-1'
Server: true (bootstrap: false)
Client Addr: 0.0.0.0 (HTTP: 8500, HTTPS: -1, DNS: 8600, RPC: 8400)
Cluster Addr: 10.10.2.49 (LAN: 8301, WAN: 8302)
Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false
Atlas: <disabled>
==> Log data will now stream in as it occurs:
2017/01/24 13:52:59 [INFO] raft: Restored from snapshot 1596-4460452-1485234581011
2017/01/24 13:52:59 [INFO] raft: Node at 10.10.2.49:8300 [Follower] entering Follower state
2017/01/24 13:52:59 [INFO] serf: EventMemberJoin: ip-10-10-2-49 10.10.2.49
2017/01/24 13:52:59 [INFO] serf: Attempting re-join to previously known node: ip-10-0-1-206: 10.0.1.206:8301
2017/01/24 13:52:59 [INFO] consul: adding LAN server ip-10-10-2-49 (Addr: 10.10.2.49:8300) (DC: us-east-1)
2017/01/24 13:52:59 [INFO] serf: EventMemberJoin: ip-10-10-4-149 10.10.4.149
2017/01/24 13:52:59 [INFO] serf: EventMemberJoin: ip-10-10-1-10 10.10.1.10
2017/01/24 13:52:59 [INFO] serf: EventMemberJoin: ip-10-0-3-119 10.0.3.119
2017/01/24 13:52:59 [WARN] memberlist: Refuting an alive message
2017/01/24 13:52:59 [INFO] serf: EventMemberJoin: ip-10-10-3-84 10.10.3.84
2017/01/24 13:52:59 [INFO] serf: EventMemberJoin: ip-10-0-1-206 10.0.1.206
2017/01/24 13:52:59 [INFO] serf: EventMemberJoin: ip-10-10-1-252 10.10.1.252
2017/01/24 13:52:59 [INFO] serf: Re-joined to previously known node: ip-10-0-1-206: 10.0.1.206:8301
2017/01/24 13:52:59 [INFO] consul: adding LAN server ip-10-10-1-10 (Addr: 10.10.1.10:8300) (DC: us-east-1)
2017/01/24 13:52:59 [INFO] consul: adding LAN server ip-10-10-3-84 (Addr: 10.10.3.84:8300) (DC: us-east-1)
2017/01/24 13:52:59 [INFO] serf: EventMemberJoin: ip-10-10-2-49.us-east-1 10.10.2.49
2017/01/24 13:52:59 [INFO] consul: adding WAN server ip-10-10-2-49.us-east-1 (Addr: 10.10.2.49:8300) (DC: us-east-1)
2017/01/24 13:52:59 [WARN] serf: Failed to re-join any previously known node
2017/01/24 13:52:59 [INFO] agent: Joining cluster...
2017/01/24 13:52:59 [ERR] agent: failed to sync remote state: No cluster leader
2017/01/24 13:52:59 [INFO] agent: (LAN) joining: [consul-1.example-private.com consul-2.example-private.com consul-3.example-private.com]
2017/01/24 13:52:59 [INFO] agent: (LAN) joined: 3 Err: <nil>
2017/01/24 13:52:59 [INFO] agent: Join completed. Synced with 3 initial agents
2017/01/24 13:53:01 [WARN] raft: Heartbeat timeout reached, starting election
2017/01/24 13:53:01 [INFO] raft: Node at 10.10.2.49:8300 [Candidate] entering Candidate state
2017/01/24 13:53:01 [INFO] raft: Election won. Tally: 3
2017/01/24 13:53:01 [INFO] raft: Node at 10.10.2.49:8300 [Leader] entering Leader state
2017/01/24 13:53:01 [INFO] consul: cluster leadership acquired
2017/01/24 13:53:01 [INFO] consul: New leader elected: ip-10-10-2-49
2017/01/24 13:53:01 [INFO] raft: pipelining replication to peer 10.10.3.84:8300
2017/01/24 13:53:01 [INFO] raft: pipelining replication to peer 10.10.1.10:8300
2017/01/24 13:53:01 [WARN] raft: Failed to contact 10.10.1.23:8300 in 501.633573ms
2017/01/24 13:53:02 [INFO] agent: Synced node info
2017/01/24 13:53:02 [WARN] raft: Failed to contact 10.10.1.23:8300 in 961.388392ms
2017/01/24 13:53:02 [WARN] raft: Failed to contact 10.10.1.23:8300 in 1.42262185s
2017/01/24 13:53:11 [ERR] raft: Failed to make RequestVote RPC to 10.10.1.23:8300: dial tcp 10.10.1.23:8300: i/o timeout
2017/01/24 13:53:11 [ERR] raft: Failed to AppendEntries to 10.10.1.23:8300: dial tcp 10.10.1.23:8300: i/o timeout
2017/01/24 13:53:11 [ERR] raft: Failed to heartbeat to 10.10.1.23:8300: dial tcp 10.10.1.23:8300: i/o timeout
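I tried removing the dead node's entry, 10.10.1.23, from peers.json and then restarted Consul, but it keeps looking for the same dead node. Can someone guide me on how to kick out the dead node? I tried all the basic commands outlined in the Consul documentation for removing this particular node, but after the service restarts the node shows up in the logs again.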
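For reference, the peers.json edit described above can be sketched roughly as follows. This is only an illustration: it assumes the older peers.json layout, a flat JSON array of "ip:port" strings, and the helper name drop_peer is hypothetical. The addresses are the three servers and the dead peer seen in the log; the real file contents are not shown in this post.

```python
import json

def drop_peer(peers_json_text, dead_addr):
    """Return peers.json text without dead_addr.

    Assumes the older layout: a JSON array of "ip:port" strings.
    """
    peers = json.loads(peers_json_text)
    kept = [peer for peer in peers if peer != dead_addr]
    return json.dumps(kept)

# Illustrative contents built from the addresses in the log above.
original = json.dumps([
    "10.10.2.49:8300",
    "10.10.3.84:8300",
    "10.10.1.10:8300",
    "10.10.1.23:8300",  # the dead peer
])

cleaned = drop_peer(original, "10.10.1.23:8300")
print(cleaned)
```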
Answer 1
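To remove a dead node that is still in the cluster state, you can run consul force-leave <node name> from a live node. This puts the node into the "left" state in the cluster, which is the state it would have entered had it left the cluster gracefully.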
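Before running force-leave it can help to confirm which peer Raft is actually failing to reach. The following is a minimal sketch, not part of Consul, that counts the addresses appearing in the raft "Failed to ..." lines of a log like the one in the question. The function name unreachable_peers and the regular expression are assumptions made for illustration.

```python
import re
from collections import Counter

# Matches raft warnings/errors such as
# "[WARN] raft: Failed to contact 10.10.1.23:8300 in ..."
FAILED_PEER = re.compile(
    r"\[(?:WARN|ERR)\] raft: Failed to .*? (\d+\.\d+\.\d+\.\d+):\d+"
)

def unreachable_peers(log_text):
    """Count how often each peer IP appears in raft failure lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILED_PEER.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# A few lines copied from the log in the question.
sample_log = """\
2017/01/24 13:52:59 [WARN] serf: Failed to re-join any previously known node
2017/01/24 13:53:01 [INFO] raft: pipelining replication to peer 10.10.3.84:8300
2017/01/24 13:53:01 [WARN] raft: Failed to contact 10.10.1.23:8300 in 501.633573ms
2017/01/24 13:53:11 [ERR] raft: Failed to make RequestVote RPC to 10.10.1.23:8300: dial tcp 10.10.1.23:8300: i/o timeout
2017/01/24 13:53:11 [ERR] raft: Failed to heartbeat to 10.10.1.23:8300: dial tcp 10.10.1.23:8300: i/o timeout
"""

failures = unreachable_peers(sample_log)
for address, count in failures.items():
    print(address, count)
```

The address reported here is an IP, while force-leave takes a node name, so the name still has to be looked up (for example in the output of consul members) before running the command.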