Pods in the kube-system namespace in CrashLoopBackOff; however, the cluster seems to work

Cluster information:

Kubernetes version: v1.28.2
Cloud being used: Virtualbox
Installation method: Kubernetes Cluster VirtualBox
Host OS: Ubuntu 22.04.3 LTS
CNI and version: calico
CRI and version: containerd://1.7.2

The cluster consists of 1 master node and 2 worker nodes. For a short while after startup (1-2 minutes after boot) everything looks fine:

lab@master:~$ kubectl -nkube-system get po -o wide
NAME                                       READY   STATUS             RESTARTS          AGE     IP              NODE       NOMINATED NODE   READINESS GATES
calico-kube-controllers-7ddc4f45bc-4qx7l   1/1     Running            12 (2m11s ago)    13d     10.10.219.98    master     <none>           <none>
calico-node-bqlnm                          1/1     Running            3 (2m11s ago)     4d2h    192.168.1.164   master     <none>           <none>
calico-node-mrd86                          1/1     Running            105 (2d20h ago)   4d2h    192.168.1.165   worker01   <none>           <none>
calico-node-r6w9s                          1/1     Running            110 (2d20h ago)   4d2h    192.168.1.166   worker02   <none>           <none>
coredns-5dd5756b68-njtpf                   1/1     Running            11 (2m11s ago)    13d     10.10.219.100   master     <none>           <none>
coredns-5dd5756b68-pxn8l                   1/1     Running            10 (2m11s ago)    13d     10.10.219.99    master     <none>           <none>
etcd-master                                1/1     Running            67 (2m11s ago)    13d     192.168.1.164   master     <none>           <none>
kube-apiserver-master                      1/1     Running            43 (2m11s ago)    13d     192.168.1.164   master     <none>           <none>
kube-controller-manager-master             1/1     Running            47 (2m11s ago)    13d     192.168.1.164   master     <none>           <none>
kube-proxy-ffnzb                           1/1     Running            122 (95s ago)     12d     192.168.1.165   worker01   <none>           <none>
kube-proxy-hf4mx                           1/1     Running            108 (78s ago)     12d     192.168.1.166   worker02   <none>           <none>
kube-proxy-ql576                           1/1     Running            15 (2m11s ago)    13d     192.168.1.164   master     <none>           <none>
kube-scheduler-master                      1/1     Running            46 (2m11s ago)    13d     192.168.1.164   master     <none>           <none>
metrics-server-54cb77cffd-q292x            0/1     CrashLoopBackOff   68 (18s ago)      3d21h   10.10.30.94     worker02   <none>           <none>

However, after a few minutes, pods in the kube-system namespace start flapping/crashing.

lab@master:~$ kubectl -nkube-system get po
NAME                                       READY   STATUS             RESTARTS          AGE
calico-kube-controllers-7ddc4f45bc-4qx7l   1/1     Running            12 (19m ago)      13d
calico-node-bqlnm                          0/1     Running            3 (19m ago)       4d2h
calico-node-mrd86                          0/1     CrashLoopBackOff   111 (2m28s ago)   4d2h
calico-node-r6w9s                          0/1     CrashLoopBackOff   116 (2m15s ago)   4d2h
coredns-5dd5756b68-njtpf                   1/1     Running            11 (19m ago)      13d
coredns-5dd5756b68-pxn8l                   1/1     Running            10 (19m ago)      13d
etcd-master                                1/1     Running            67 (19m ago)      13d
kube-apiserver-master                      1/1     Running            43 (19m ago)      13d
kube-controller-manager-master             1/1     Running            47 (19m ago)      13d
kube-proxy-ffnzb                           0/1     CrashLoopBackOff   127 (42s ago)     12d
kube-proxy-hf4mx                           0/1     CrashLoopBackOff   113 (2m17s ago)   12d
kube-proxy-ql576                           1/1     Running            15 (19m ago)      13d
kube-scheduler-master                      1/1     Running            46 (19m ago)      13d
metrics-server-54cb77cffd-q292x            0/1     CrashLoopBackOff   73 (64s ago)      3d22h

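To catch the flapping as it happens, one option is to watch the namespace continuously (a minimal sketch; -w streams state changes until interrupted with Ctrl-C):

kubectl -nkube-system get po -w
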
I have no clear idea what is going wrong; inspecting the pod description, I see recurring events:

lab@master:~$ kubectl -nkube-system describe po kube-proxy-ffnzb
.
.
.
Events:
  Type     Reason          Age                      From     Message
  ----     ------          ----                     ----     -------
  Normal   Killing         2d20h (x50 over 3d1h)    kubelet  Stopping container kube-proxy
  Warning  BackOff         2d20h (x1146 over 3d1h)  kubelet  Back-off restarting failed container kube-proxy in pod kube-proxy-ffnzb_kube-system(79f808ba-f450-4103-80a9-0e75af2e77cf)
  Normal   Pulled          8m11s (x3 over 10m)      kubelet  Container image "registry.k8s.io/kube-proxy:v1.28.6" already present on machine
  Normal   Created         8m10s (x3 over 10m)      kubelet  Created container kube-proxy
  Normal   Started         8m10s (x3 over 10m)      kubelet  Started container kube-proxy
  Normal   SandboxChanged  6m56s (x4 over 10m)      kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal   Killing         4m41s (x4 over 10m)      kubelet  Stopping container kube-proxy
  Warning  BackOff         12s (x28 over 10m)       kubelet  Back-off restarting failed container kube-proxy in pod kube-proxy-ffnzb_kube-system(79f808ba-f450-4103-80a9-0e75af2e77cf)

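The events above come from the pod description; for the actual crash reason, the previous container's logs and the node-level daemon logs are usually more revealing. A minimal sketch (journalctl should be run on the affected worker node; the pod name is just the one from above):

kubectl -nkube-system logs kube-proxy-ffnzb --previous
sudo journalctl -u kubelet -u containerd --since "10 min ago"
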
Note! This situation does not stop me from deploying a few sample deployments (nginx), and those seem to run stably. However, when I tried to add the metrics server, it crashes (probably related to the CrashLoopBackOff pods in the kube-system namespace).

Any ideas what might be wrong / where else I could look to troubleshoot this?

Answer 1

Someone hinted that I should check the SystemdCgroup setting in the containerd config file, following this link.

In my case, it turned out that /etc/containerd/config.toml was missing on the master node.
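
For context, this crash pattern is the classic symptom of a cgroup driver mismatch: kubeadm-configured kubelets default to the systemd driver, so containerd's runc runtime should have SystemdCgroup = true to match. A quick way to see which driver kubelet expects (a sketch, assuming the default kubeadm file layout):

sudo grep cgroupDriver /var/lib/kubelet/config.yaml
# typically prints: cgroupDriver: systemd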

  • Generate it:
    sudo containerd config default | sudo tee /etc/containerd/config.toml
    
  • Next, set SystemdCgroup = true in /etc/containerd/config.toml (a sketch of the change follows after this list)
  • Restart the containerd service:
    systemctl restart containerd
    
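A minimal sketch of what that change amounts to (in the default containerd 1.7 config the key sits under [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]; back up the file before editing):

sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
sudo grep -n SystemdCgroup /etc/containerd/config.toml   # should now show: SystemdCgroup = true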

However, this left my cluster in the following state:

lab@master:~$ kubectl -nkube-system get po
The connection to the server master:6443 was refused - did you specify the right host or port?
lab@master:~$ kubectl get nodes
The connection to the server master:6443 was refused - did you specify the right host or port?

I reverted it to false on the master and restarted containerd. However, on the worker nodes I left it set to true.
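
A related note, not part of the steps above: after changing the containerd config it is usually worth restarting kubelet as well, so the static pods (apiserver, etcd, controller-manager, scheduler) come back up against the new runtime settings. A sketch:

sudo systemctl restart containerd kubelet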

This solved the problem:

lab@master:~$ kubectl -nkube-system get po -o wide
NAME                                       READY   STATUS    RESTARTS       AGE    IP              NODE       NOMINATED NODE   READINESS GATES
calico-kube-controllers-7ddc4f45bc-4qx7l   1/1     Running   8 (18m ago)    14d    10.10.219.86    master     <none>           <none>
calico-node-c4rxp                          1/1     Running   7 (14m ago)    89m    192.168.1.166   worker02   <none>           <none>
calico-node-dhzr8                          1/1     Running   7 (18m ago)    14d    192.168.1.164   master     <none>           <none>
calico-node-wqv8w                          1/1     Running   1 (14m ago)    27m    192.168.1.165   worker01   <none>           <none>
coredns-5dd5756b68-njtpf                   1/1     Running   7 (18m ago)    14d    10.10.219.88    master     <none>           <none>
coredns-5dd5756b68-pxn8l                   1/1     Running   6 (18m ago)    14d    10.10.219.87    master     <none>           <none>
etcd-master                                1/1     Running   62 (18m ago)   14d    192.168.1.164   master     <none>           <none>
kube-apiserver-master                      1/1     Running   38 (18m ago)   14d    192.168.1.164   master     <none>           <none>
kube-controller-manager-master             1/1     Running   42 (18m ago)   14d    192.168.1.164   master     <none>           <none>
kube-proxy-mgsdr                           1/1     Running   7 (14m ago)    89m    192.168.1.166   worker02   <none>           <none>
kube-proxy-ql576                           1/1     Running   10 (18m ago)   14d    192.168.1.164   master     <none>           <none>
kube-proxy-zl68t                           1/1     Running   8 (14m ago)    106m   192.168.1.165   worker01   <none>           <none>
kube-scheduler-master                      1/1     Running   41 (18m ago)   14d    192.168.1.164   master     <none>           <none>
metrics-server-98bc7f888-xtdxd             1/1     Running   7 (14m ago)    99m    10.10.5.8       worker01   <none>           <none>

Side note: I also disabled apparmor (on the master and the workers):

sudo systemctl stop apparmor && sudo systemctl disable apparmor
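
To double-check that AppArmor is actually off, aa-status (from the apparmor-utils package) reports the loaded profiles; this is just a verification sketch:

sudo aa-status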
