I have set up a Kubernetes cluster with 2 control-plane nodes (cp01 192.168.1.42, cp02 192.168.1.46) and 4 worker nodes, with haproxy and keepalived running as static pods in the cluster and a stacked (internal) etcd cluster. For some silly reason I accidentally ran kubeadm reset -f on cp01. Now I am trying to rejoin it with the kubeadm join command, but I keep getting dial tcp 192.168.1.49:8443: connect: connection refused, where 192.168.1.49 is the load-balancer IP. Please help! The current configuration is below.
/etc/haproxy/haproxy.cfg on cp02
defaults
    timeout connect 10s
    timeout client 30s
    timeout server 30s

frontend apiserver
    bind *:8443
    mode tcp
    option tcplog
    default_backend apiserver

backend apiserver
    option httpchk GET /healthz
    http-check expect status 200
    mode tcp
    option ssl-hello-chk
    balance roundrobin
    default-server inter 10s downinter 5s rise 2 fall 2 slowstart 60s maxconn 250 maxqueue 256 weight 100
    #server master01 192.168.1.42:6443 check   *** the one I accidentally reset
    server master02 192.168.1.46:6443 check
/etc/keepalived/keepalived.conf on cp02
global_defs {
    router_id LVS_DEVEL
    script_user root
    enable_script_security
    dynamic_interfaces
}

vrrp_script check_apiserver {
    script "/etc/keepalived/check_apiserver.sh"
    interval 3
    weight -2
    fall 10
    rise 2
}

vrrp_instance VI_l {
    state BACKUP
    interface ens192
    virtual_router_id 51
    priority 101
    authentication {
        auth_type PASS
        auth_pass ***
    }
    virtual_ipaddress {
        192.168.1.49/24
    }
    track_script {
        check_apiserver
    }
}
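The /etc/keepalived/check_apiserver.sh referenced above isn't shown; for reference, a typical version follows the pattern from the kubeadm high-availability guide, adapted here to this VIP and haproxy port (an assumed sketch, not necessarily the exact script in use):
#!/bin/sh
# Fail the VRRP health check if the local apiserver frontend stops answering,
# or if this node holds the VIP but the VIP endpoint does not answer.
errorExit() {
    echo "*** $*" 1>&2
    exit 1
}
curl --silent --max-time 2 --insecure https://localhost:8443/ -o /dev/null || errorExit "Error GET https://localhost:8443/"
if ip addr | grep -q 192.168.1.49; then
    curl --silent --max-time 2 --insecure https://192.168.1.49:8443/ -o /dev/null || errorExit "Error GET https://192.168.1.49:8443/"
fi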
Cluster kubeadm-config
apiVersion: v1
data:
  ClusterConfiguration: |
    apiServer:
      extraArgs:
        authorization-mode: Node,RBAC
      timeoutForControlPlane: 4m0s
    apiVersion: kubeadm.k8s.io/v1beta2
    certificatesDir: /etc/kubernetes/pki
    clusterName: kubernetes
    controlPlaneEndpoint: 192.168.1.49:8443
    controllerManager: {}
    dns:
      type: CoreDNS
    etcd:
      local:
        dataDir: /var/lib/etcd
    imageRepository: k8s.gcr.io
    kind: ClusterConfiguration
    kubernetesVersion: v1.19.2
    networking:
      dnsDomain: cluster.local
      podSubnet: 10.244.0.0/16
      serviceSubnet: 10.96.0.0/12
    scheduler: {}
  ClusterStatus: |
    apiEndpoints:
      cp02:
        advertiseAddress: 192.168.1.46
        bindPort: 6443
    apiVersion: kubeadm.k8s.io/v1beta2
    kind: ClusterStatus
...
kubectl cluster-info
Kubernetes master is running at https://192.168.1.49:8443
KubeDNS is running at https://192.168.1.49:8443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
More info
The cluster was initialized on cp01 with --upload-certs.
I drained and deleted cp01 from the cluster.
kubeadm join --token ... --discovery-token-ca-cert-hash ... --control-plane --certificate-key ...
The command returns: error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Get "https://192.168.1.49:8443/api/v1/namespaces/kube-system/configmaps/kubeadm-config?timeout=10s": dial tcp 192.168.1.49:8443: connect: connection refused
kubectl exec -n kube-system -it etcd-cp02 -- etcdctl --endpoints=https://192.168.1.46:2379 --key=/etc/kubernetes/pki/etcd/peer.key --cert=/etc/kubernetes/pki/etcd/peer.crt --cacert=/etc/kubernetes/pki/etcd/ca.crt member list
Returns: ..., started, cp02, https://192.168.1.46:2380, https://192.168.1.46:2379, false
kubectl describe pod/etcd-cp02 -n kube-system
Returns: ... Container ID: docker://... Image: k8s.gcr.io/etcd:3.4.13-0 Image ID: docker://... Port: <none> Host Port: <none> Command: etcd --advertise-client-urls=https://192.168.1.46:2379 --cert-file=/etc/kubernetes/pki/etcd/server.crt --client-cert-auth=true --data-dir=/var/lib/etcd --initial-advertise-peer-urls=https://192.168.1.46:2380 --initial-cluster=cp01=https://192.168.1.42:2380,cp02=https://192.168.1.46:2380 --initial-cluster-state=existing --key-file=/etc/kubernetes/pki/etcd/server.key --listen-client-urls=https://127.0.0.1:2379,https://192.168.1.46:2379 --listen-metrics-urls=http://127.0.0.1:2381 --listen-peer-urls=https://192.168.1.46:2380 --name=cp02 --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt --peer-client-cert-auth=true --peer-key-file=/etc/kubernetes/pki/etcd/peer.key --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt --snapshot-count=10000 --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt ...
Tried copying the certificates over to cp01:/etc/kubernetes/pki before running kubeadm join 192.168.1.49:8443 --token ... --discovery-token-ca-cert-hash ..., but it returns the same error.
# files copied over to cp01: ca.crt ca.key sa.key sa.pub front-proxy-ca.crt front-proxy-ca.key etcd/ca.crt etcd/ca.key
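For reference, copying that list of files from cp02 can be done roughly like this (an illustrative sketch assuming root SSH access from cp02 to cp01; not necessarily the exact commands used):
# Push the CA material listed above from cp02 to cp01
ssh root@192.168.1.42 "mkdir -p /etc/kubernetes/pki/etcd"
scp /etc/kubernetes/pki/{ca.crt,ca.key,sa.key,sa.pub,front-proxy-ca.crt,front-proxy-ca.key} root@192.168.1.42:/etc/kubernetes/pki/
scp /etc/kubernetes/pki/etcd/{ca.crt,ca.key} root@192.168.1.42:/etc/kubernetes/pki/etcd/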
Ruling out network issues
Able to ping 192.168.1.49 from cp01.
nc -v 192.168.1.49 8443
On cp01 returns: Ncat: Connection refused.
curl -k https://192.168.1.49:8443/api/v1...
Works on cp02 and the worker nodes (returns code 403, which should be normal).
/etc/cni/net.d/ was deleted on cp01.
Manually cleared the iptables rules containing "KUBE" or "cali" on cp01.
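For reference, that clean-up can be done roughly like this (an illustrative sketch, not necessarily the exact commands used):
# Reload the ruleset minus every rule or chain whose name contains KUBE or cali
iptables-save | grep -v -e KUBE -e cali | iptables-restore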
firewalld is disabled on both cp01 and cp02.
I tried joining a new server, cp03 (192.168.1.48), and hit the same dial tcp 192.168.1.49:8443: connect: connection refused error.
netstat -tlnp | grep 8443
On cp02 returns: tcp 0 0 0.0.0.0:8443 0.0.0.0:* LISTEN 27316/haproxy
nc -v 192.168.1.46 6443
On cp01 and cp03 returns: Ncat: Connected to 192.168.1.46:6443
Any advice/guidance would be greatly appreciated, as I'm at a loss here. I suspect it might be due to networking rules on cp02, but I really don't know how to check. Thanks!!
Answer 1
Figured out what the problem was when I typed ip a: ens192 on cp01 still carried the secondary IP address 192.168.1.49. A simple ip addr del 192.168.1.49/24 dev ens192 followed by kubeadm join ... and cp01 was able to rejoin the cluster successfully. Can't believe I missed that...
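In other words, with the VIP still assigned on cp01, connections to 192.168.1.49:8443 from cp01 (and, given the duplicate address, likely from cp03 as well) were answered locally by cp01, where nothing listens on 8443, hence the connection refused. A quick way to check for and clear a stale VIP before rejoining (commands as in the answer; the grep is just an illustrative check):
# On the node being rejoined: does it still hold the VIP?
ip -4 addr show dev ens192 | grep 192.168.1.49
# If so, remove the stale secondary address, then rejoin
ip addr del 192.168.1.49/24 dev ens192
kubeadm join 192.168.1.49:8443 --token ... --discovery-token-ca-cert-hash ... --control-plane --certificate-key ...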