Cluster information:
Kubernetes version: v1.28.2
Cloud being used: Virtualbox
Installation method: Kubernetes Cluster VirtualBox
Host OS: Ubuntu 22.04.3 LTS
CNI and version: calico
CRI and version: containerd://1.7.2
The cluster contains 1 master node and 2 worker nodes. For a short while after startup (1-2 minutes), everything looks fine:
lab@master:~$ kubectl -nkube-system get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
calico-kube-controllers-7ddc4f45bc-4qx7l 1/1 Running 12 (2m11s ago) 13d 10.10.219.98 master <none> <none>
calico-node-bqlnm 1/1 Running 3 (2m11s ago) 4d2h 192.168.1.164 master <none> <none>
calico-node-mrd86 1/1 Running 105 (2d20h ago) 4d2h 192.168.1.165 worker01 <none> <none>
calico-node-r6w9s 1/1 Running 110 (2d20h ago) 4d2h 192.168.1.166 worker02 <none> <none>
coredns-5dd5756b68-njtpf 1/1 Running 11 (2m11s ago) 13d 10.10.219.100 master <none> <none>
coredns-5dd5756b68-pxn8l 1/1 Running 10 (2m11s ago) 13d 10.10.219.99 master <none> <none>
etcd-master 1/1 Running 67 (2m11s ago) 13d 192.168.1.164 master <none> <none>
kube-apiserver-master 1/1 Running 43 (2m11s ago) 13d 192.168.1.164 master <none> <none>
kube-controller-manager-master 1/1 Running 47 (2m11s ago) 13d 192.168.1.164 master <none> <none>
kube-proxy-ffnzb 1/1 Running 122 (95s ago) 12d 192.168.1.165 worker01 <none> <none>
kube-proxy-hf4mx 1/1 Running 108 (78s ago) 12d 192.168.1.166 worker02 <none> <none>
kube-proxy-ql576 1/1 Running 15 (2m11s ago) 13d 192.168.1.164 master <none> <none>
kube-scheduler-master 1/1 Running 46 (2m11s ago) 13d 192.168.1.164 master <none> <none>
metrics-server-54cb77cffd-q292x 0/1 CrashLoopBackOff 68 (18s ago) 3d21h 10.10.30.94 worker02 <none> <none>
However, after a few minutes, pods in the kube-system namespace start flapping/crashing:
lab@master:~$ kubectl -nkube-system get po
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-7ddc4f45bc-4qx7l 1/1 Running 12 (19m ago) 13d
calico-node-bqlnm 0/1 Running 3 (19m ago) 4d2h
calico-node-mrd86 0/1 CrashLoopBackOff 111 (2m28s ago) 4d2h
calico-node-r6w9s 0/1 CrashLoopBackOff 116 (2m15s ago) 4d2h
coredns-5dd5756b68-njtpf 1/1 Running 11 (19m ago) 13d
coredns-5dd5756b68-pxn8l 1/1 Running 10 (19m ago) 13d
etcd-master 1/1 Running 67 (19m ago) 13d
kube-apiserver-master 1/1 Running 43 (19m ago) 13d
kube-controller-manager-master 1/1 Running 47 (19m ago) 13d
kube-proxy-ffnzb 0/1 CrashLoopBackOff 127 (42s ago) 12d
kube-proxy-hf4mx 0/1 CrashLoopBackOff 113 (2m17s ago) 12d
kube-proxy-ql576 1/1 Running 15 (19m ago) 13d
kube-scheduler-master 1/1 Running 46 (19m ago) 13d
metrics-server-54cb77cffd-q292x 0/1 CrashLoopBackOff 73 (64s ago) 3d22h
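With this many pods in the listing, it can help to filter for the ones that are not fully ready. The following is only a minimal sketch: the function name `not_ready_pods` is made up for illustration, and it assumes the default column layout of `kubectl get po --no-headers` (NAME, READY, STATUS, RESTARTS, ...).

```shell
#!/bin/sh
# Hypothetical helper: print pods whose READY column (e.g. "0/1") shows
# fewer ready containers than the pod defines.
# Expects `kubectl get po --no-headers` output on stdin.
not_ready_pods() {
  awk '{
    split($2, r, "/")                 # r[1] = ready count, r[2] = total
    if (r[1] != r[2]) print $1, $2, $3, $4
  }'
}

# Usage, on a machine with access to the cluster:
#   kubectl -n kube-system get po --no-headers | not_ready_pods
```

For the second listing above, this would leave only the calico-node, kube-proxy and metrics-server pods, which makes it easier to see that the two worker nodes are the ones affected most.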
I have no idea what is wrong. Checking the pod description, I see these recurring events:
lab@master:~$ kubectl -nkube-system describe po kube-proxy-ffnzb
.
.
.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Killing 2d20h (x50 over 3d1h) kubelet Stopping container kube-proxy
Warning BackOff 2d20h (x1146 over 3d1h) kubelet Back-off restarting failed container kube-proxy in pod kube-proxy-ffnzb_kube-system(79f808ba-f450-4103-80a9-0e75af2e77cf)
Normal Pulled 8m11s (x3 over 10m) kubelet Container image "registry.k8s.io/kube-proxy:v1.28.6" already present on machine
Normal Created 8m10s (x3 over 10m) kubelet Created container kube-proxy
Normal Started 8m10s (x3 over 10m) kubelet Started container kube-proxy
Normal SandboxChanged 6m56s (x4 over 10m) kubelet Pod sandbox changed, it will be killed and re-created.
Normal Killing 4m41s (x4 over 10m) kubelet Stopping container kube-proxy
Warning BackOff 12s (x28 over 10m) kubelet Back-off restarting failed container kube-proxy in pod kube-proxy-ffnzb_kube-system(79f808ba-f450-4103-80a9-0e75af2e77cf)
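Note: this situation does not prevent me from deploying sample workloads (nginx); they seem to run stably. However, when I tried to add metrics-server, it crashed (possibly related to the CrashLoopBackOff pods in the kube-system namespace).
Any ideas what could be wrong, or where else I could look to troubleshoot?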
Answer 1
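I was given a hint to check the SystemdCgroup setting in the containerd configuration file, following this link.
In my case, it turned out that /etc/containerd/config.toml was missing on the master node.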
- Generate it:
sudo containerd config default | sudo tee /etc/containerd/config.toml
- Next, set SystemdCgroup = true in /etc/containerd/config.toml
- Restart the containerd service: sudo systemctl restart containerd
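The edit in the second step can also be scripted instead of done by hand. This is a rough sketch, not a tested procedure: `set_systemd_cgroup` is a made-up helper name, it assumes GNU sed (as shipped with Ubuntu), and it assumes the file contains a single `SystemdCgroup = ...` line, as the default containerd 1.7 configuration does.

```shell
#!/bin/sh
# Hypothetical helper: set_systemd_cgroup FILE VALUE
# Rewrites the value of the SystemdCgroup key in FILE, keeping the
# line's original indentation. Relies on GNU sed's -i (in-place) option.
set_systemd_cgroup() {
  file=$1
  value=$2
  sed -i -E "s/^([[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*).*/\1${value}/" "$file"
}

# Usage, from a root shell, because the file is owned by root:
#   set_systemd_cgroup /etc/containerd/config.toml true
#   grep -n SystemdCgroup /etc/containerd/config.toml   # confirm the change
#   systemctl restart containerd
```

Working on a copy of the file first makes it easy to revert the change.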
However, after this change the cluster ended up in the following state:
lab@master:~$ kubectl -nkube-system get po
The connection to the server master:6443 was refused - did you specify the right host or port?
lab@master:~$ kubectl get nodes
The connection to the server master:6443 was refused - did you specify the right host or port?
I reverted it to false on the master and restarted containerd. On the worker nodes, however, I left it set to true.
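Since the same value behaved differently on the master and on the workers, it may be worth comparing the containerd setting with the cgroup driver that kubelet uses on each node; the Kubernetes documentation states that kubelet and the container runtime need to use the same cgroup driver. The sketch below is only an assumption about where to look: `show_cgroup_drivers` is a made-up name, and the default path /var/lib/kubelet/config.yaml is where kubeadm-based installs usually keep the kubelet configuration, which may not match this cluster.

```shell
#!/bin/sh
# Hypothetical helper: print the kubelet and containerd cgroup settings
# side by side. Both paths are assumed defaults and can be overridden.
show_cgroup_drivers() {
  kubelet_cfg=${1:-/var/lib/kubelet/config.yaml}
  containerd_cfg=${2:-/etc/containerd/config.toml}
  # Top-level "cgroupDriver: <value>" key in the kubelet YAML config
  kubelet_driver=$(awk -F': *' '$1 == "cgroupDriver" {print $2}' "$kubelet_cfg")
  # Indented "SystemdCgroup = <value>" key in the containerd TOML config
  systemd_cgroup=$(awk -F'= *' '/^[[:space:]]*SystemdCgroup/ {print $2}' "$containerd_cfg")
  echo "kubelet cgroupDriver: ${kubelet_driver:-unset}"
  echo "containerd SystemdCgroup: ${systemd_cgroup:-unset}"
}

# Usage on each node (root may be needed to read the files):
#   show_cgroup_drivers
```

A node where kubelet reports `systemd` while containerd has `SystemdCgroup = false` (or `cgroupfs` together with `true`) would be a candidate for the mismatch.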
This resolved the problem:
lab@master:~$ kubectl -nkube-system get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
calico-kube-controllers-7ddc4f45bc-4qx7l 1/1 Running 8 (18m ago) 14d 10.10.219.86 master <none> <none>
calico-node-c4rxp 1/1 Running 7 (14m ago) 89m 192.168.1.166 worker02 <none> <none>
calico-node-dhzr8 1/1 Running 7 (18m ago) 14d 192.168.1.164 master <none> <none>
calico-node-wqv8w 1/1 Running 1 (14m ago) 27m 192.168.1.165 worker01 <none> <none>
coredns-5dd5756b68-njtpf 1/1 Running 7 (18m ago) 14d 10.10.219.88 master <none> <none>
coredns-5dd5756b68-pxn8l 1/1 Running 6 (18m ago) 14d 10.10.219.87 master <none> <none>
etcd-master 1/1 Running 62 (18m ago) 14d 192.168.1.164 master <none> <none>
kube-apiserver-master 1/1 Running 38 (18m ago) 14d 192.168.1.164 master <none> <none>
kube-controller-manager-master 1/1 Running 42 (18m ago) 14d 192.168.1.164 master <none> <none>
kube-proxy-mgsdr 1/1 Running 7 (14m ago) 89m 192.168.1.166 worker02 <none> <none>
kube-proxy-ql576 1/1 Running 10 (18m ago) 14d 192.168.1.164 master <none> <none>
kube-proxy-zl68t 1/1 Running 8 (14m ago) 106m 192.168.1.165 worker01 <none> <none>
kube-scheduler-master 1/1 Running 41 (18m ago) 14d 192.168.1.164 master <none> <none>
metrics-server-98bc7f888-xtdxd 1/1 Running 7 (14m ago) 99m 10.10.5.8 worker01 <none> <none>
Side note: I also disabled AppArmor (on the master and the workers):
sudo systemctl stop apparmor && sudo systemctl disable apparmor