I have already asked this question on Stack Overflow: https://stackoverflow.com/questions/77325355/pods-go-back-and-forth-between-state-running-and-state-crashloopbackoff/77325485
But since it was pointed out there that this is a sys-admin question, I am asking it here as well, hoping for hints and help from the experts.
Cluster information:
Kubernetes version:
root@k8s-eu-1-master:~# kubectl version
Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.2
Cloud being used: Contabo Cloud (bare-metal)
Installation method: followed the steps in https://www.linuxtechi.com/install-kubernetes-on-ubuntu-22-04/?utm_content=cmp-true
Host OS: Ubuntu 22.04
CNI and version:
root@k8s-eu-1-master:~# ls /etc/cni/net.d/
10-flannel.conflist
root@k8s-eu-1-master:~# cat /etc/cni/net.d/10-flannel.conflist
{
  "name": "cbr0",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "flannel",
      "delegate": {
        "hairpinMode": true,
        "isDefaultGateway": true
      }
    },
    {
      "type": "portmap",
      "capabilities": {
        "portMappings": true
      }
    }
  ]
}
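For completeness, one way to sanity-check the CNI itself (assuming the default kube-flannel namespace used by the install guide) is:

kubectl get pods -n kube-flannel -o wide   # the flannel daemonset pods should all be Running
kubectl get nodes -o wide                  # all nodes should report Ready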
CRI and version:
Container runtime: containerd
root@k8s-eu-1-master:~# cat /etc/containerd/config.toml | grep version
version = 2
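containerd's own view of the crashing container can also be inspected directly on the worker node with crictl, e.g. (assuming the default containerd socket path):

crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a   # lists running and exited containers
systemctl status containerd --no-pager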
OS:
Ubuntu 22.04
What happened: the pods go back and forth between state Running and state CrashLoopBackOff.
This is the arango-deployment.yaml file: https://drive.google.com/file/d/1VfCjQih5aJUEA4HD9ddsQDrZbLmquWIQ/view?usp=share_link
This is the arango-storage.yaml file: https://drive.google.com/file/d/1hqHU_H2Wr5VFrJLwM9GDUHF17b7_CYIG/view?usp=sharing
This is the output of kubectl describe pod for both pods: https://drive.google.com/file/d/1kZsYeKxOa5aSppV3IdS6c7-e8dnoLiiB/view?usp=share_link
These are the last lines of the syslog on the node hosting one of the two crashing pods:
Oct 20 15:44:10 k8s-eu-1-worker-1 kubelet[599]: I1020 15:44:10.594513 599 scope.go:117] "RemoveContainer" containerID="3e618ac247c1392fd6a6d67fad93d187c0dfae4d2cfe77c6a8b244c831dd0852"
Oct 20 15:44:10 k8s-eu-1-worker-1 kubelet[599]: E1020 15:44:10.594988 599 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"operator\" with CrashLoopBackOff: \"back-off 2m40s restarting failed container=operator pod=arango-deployment-operator-5f4d66bd86-4pxkn_default(397bd5c4-2bfc-4ca3-bc7d-bd149932e4b8)\"" pod="default/arango-deployment-operator-5f4d66bd86-4pxkn" podUID="397bd5c4-2bfc-4ca3-bc7d-bd149932e4b8"
Oct 20 15:44:21 k8s-eu-1-worker-1 kubelet[599]: I1020 15:44:21.594619 599 scope.go:117] "RemoveContainer" containerID="3e618ac247c1392fd6a6d67fad93d187c0dfae4d2cfe77c6a8b244c831dd0852"
Oct 20 15:44:21 k8s-eu-1-worker-1 kubelet[599]: E1020 15:44:21.595036 599 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"operator\" with CrashLoopBackOff: \"back-off 2m40s restarting failed container=operator pod=arango-deployment-operator-5f4d66bd86-4pxkn_default(397bd5c4-2bfc-4ca3-bc7d-bd149932e4b8)\"" pod="default/arango-deployment-operator-5f4d66bd86-4pxkn" podUID="397bd5c4-2bfc-4ca3-bc7d-bd149932e4b8"
Here are the logs of each pod: https://drive.google.com/file/d/1k5g4d7j2uaGQTN7EMgR5Z794KtAg70CD/view?usp=sharing
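Note that kubectl logs shows the current container instance by default; the output of the instance that just crashed can be pulled with the --previous flag, e.g.:

kubectl logs arango-storage-operator-5fcb46574-w2b6b --previous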
But the problem is that they keep going back and forth between the Running and CrashLoopBackOff states:
root@k8s-eu-1-master:~# kubectl get pods
NAME READY STATUS RESTARTS AGE
arango-deployment-operator-5f4d66bd86-bgqbd 0/1 Running 8 (5m6s ago) 17m
arango-storage-operator-5fcb46574-w2b6b 0/1 CrashLoopBackOff 7 (4m52s ago) 17m
root@k8s-eu-1-master:~# kubectl get pods
NAME READY STATUS RESTARTS AGE
arango-deployment-operator-5f4d66bd86-bgqbd 0/1 Running 9 (7s ago) 18m
arango-storage-operator-5fcb46574-w2b6b 0/1 Running 8 (5m53s ago) 18m
root@k8s-eu-1-master:~# kubectl get pods
NAME READY STATUS RESTARTS AGE
arango-deployment-operator-5f4d66bd86-bgqbd 0/1 CrashLoopBackOff 9 (2m35s ago) 21m
arango-storage-operator-5fcb46574-w2b6b 0/1 CrashLoopBackOff 9 (2m11s ago) 21m
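The flip-flopping is easiest to follow live with the watch flag:

kubectl get pods -w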
This is the output of kubectl describe pod for the second pod:
root@k8s-eu-1-master:~# kubectl describe arango-storage-operator-5fcb46574-w2b6b
error: the server doesn't have a resource type "arango-storage-operator-5fcb46574-w2b6b"
root@k8s-eu-1-master:~# kubectl describe pod arango-storage-operator-5fcb46574-w2b6b
Name: arango-storage-operator-5fcb46574-w2b6b
Namespace: default
Priority: 0
Service Account: arango-storage-operator
Node: k8s-eu-1-worker-2/zz.zzz.zzz.zzz
Start Time: Fri, 20 Oct 2023 16:29:06 +0200
Labels: app.kubernetes.io/instance=storage
app.kubernetes.io/managed-by=Tiller
app.kubernetes.io/name=kube-arangodb
helm.sh/chart=kube-arangodb-1.2.34
pod-template-hash=5fcb46574
release=storage
Annotations: <none>
Status: Running
IP: 10.244.0.13
IPs:
IP: 10.244.0.13
Controlled By: ReplicaSet/arango-storage-operator-5fcb46574
Containers:
operator:
Container ID: containerd://ce9249b8978d6f902c5162f74e4b4e71d401458587cf0f3a717bd86a75c1b65f
Image: arangodb/kube-arangodb:1.2.34
Image ID: docker.io/arangodb/kube-arangodb@sha256:a25d031e87ba5b0f3038ce9f346553b69760a3a065fe608727cde188602b59e8
Port: 8528/TCP
Host Port: 0/TCP
Args:
--scope=legacy
--operator.storage
--mode.single
--log.level=debug
--chaos.allowed=false
State: Running
Started: Fri, 20 Oct 2023 16:53:53 +0200
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Fri, 20 Oct 2023 16:47:47 +0200
Finished: Fri, 20 Oct 2023 16:48:46 +0200
Ready: False
Restart Count: 10
Limits:
cpu: 1
memory: 2Gi
Requests:
cpu: 500m
memory: 1Gi
Liveness: http-get https://:8528/health delay=5s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get https://:8528/ready delay=5s timeout=1s period=10s #success=1 #failure=3
Environment:
MY_POD_NAMESPACE: default (v1:metadata.namespace)
MY_POD_NAME: arango-storage-operator-5fcb46574-w2b6b (v1:metadata.name)
MY_POD_IP: (v1:status.podIP)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8bdkv (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-8bdkv:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 5s
node.kubernetes.io/unreachable:NoExecute op=Exists for 5s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 24m default-scheduler Successfully assigned default/arango-storage-operator-5fcb46574-w2b6b to k8s-eu-1-worker-2
Normal Pulled 24m kubelet Successfully pulled image "arangodb/kube-arangodb:1.2.34" in 830ms (830ms including waiting)
Normal Pulling 23m (x2 over 24m) kubelet Pulling image "arangodb/kube-arangodb:1.2.34"
Normal Created 23m (x2 over 24m) kubelet Created container operator
Normal Started 23m (x2 over 24m) kubelet Started container operator
Normal Pulled 23m kubelet Successfully pulled image "arangodb/kube-arangodb:1.2.34" in 1.033s (1.033s including waiting)
Warning Unhealthy 23m (x9 over 24m) kubelet Readiness probe failed: Get "https://10.244.0.13:8528/ready": dial tcp 10.244.0.13:8528: connect: connection refused
Warning Unhealthy 23m (x6 over 24m) kubelet Liveness probe failed: Get "https://10.244.0.13:8528/health": dial tcp 10.244.0.13:8528: connect: connection refused
Normal Killing 23m (x2 over 24m) kubelet Container operator failed liveness probe, will be restarted
Normal Pulled 19m kubelet Successfully pulled image "arangodb/kube-arangodb:1.2.34" in 827ms (827ms including waiting)
Warning BackOff 4m48s (x51 over 18m) kubelet Back-off restarting failed container operator in pod arango-storage-operator-5fcb46574-w2b6b_default(01f85e27-97bb-45c9-b42b-a2fc2aba0967)
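The events above show both probes being refused on port 8528 and the container exiting with code 137 (SIGKILL), which can come either from the kubelet restarting a container that failed its liveness probe or from the OOM killer. The last termination reason can be read directly (using the pod name from the output above):

kubectl get pod arango-storage-operator-5fcb46574-w2b6b -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

Here the Last State above already reports Error rather than OOMKilled, which points at the failed liveness probe rather than memory pressure.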
And this is the syslog on the node hosting the second pod:
Oct 20 16:59:24 k8s-eu-1-worker-2 kubelet[595]: I1020 16:59:24.180175 595 scope.go:117] "RemoveContainer" containerID="81fad793d8848c7507317c980e37470c06f4e77bb4d8f9b893137dc5fd8f85ef"
Oct 20 16:59:24 k8s-eu-1-worker-2 kubelet[595]: E1020 16:59:24.181224 595 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"operator\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=operator pod=arango-storage-operator-5fcb46574-w2b6b_default(01f85e27-97bb-45c9-b42b-a2fc2aba0967)\"" pod="default/arango-storage-operator-5fcb46574-w2b6b" podUID="01f85e27-97bb-45c9-b42b-a2fc2aba0967"
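In case the operator simply needs longer to start listening on port 8528 than the probes allow, one way to test that theory would be to relax the probe timing, e.g. (an untested sketch against the deployment above; the values are guesses, not a confirmed fix):

kubectl patch deployment arango-storage-operator --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds", "value": 60},
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/failureThreshold", "value": 10}
]'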
What am I doing wrong? How can I keep the pods in the Running state?