I have already asked this question on Stack Overflow: https://stackoverflow.com/questions/77325355/pods-go-back-and-forth-between-state-running-and-state-crashloopbackoff/77325485
But since it was pointed out there that this is a sys-admin question, I am asking it here as well, hoping for hints and help from the experts.
Cluster information:
Kubernetes version:
root@k8s-eu-1-master:~# kubectl version
Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.2
Cloud being used: Contabo Cloud (bare-metal)
Installation method: followed the steps in https://www.linuxtechi.com/install-kubernetes-on-ubuntu-22-04/?utm_content=cmp-true
Host OS: Ubuntu 22.04
CNI and version:
root@k8s-eu-1-master:~# ls /etc/cni/net.d/
10-flannel.conflist
root@k8s-eu-1-master:~# cat /etc/cni/net.d/10-flannel.conflist
{
  "name": "cbr0",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "flannel",
      "delegate": {
        "hairpinMode": true,
        "isDefaultGateway": true
      }
    },
    {
      "type": "portmap",
      "capabilities": {
        "portMappings": true
      }
    }
  ]
}
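For completeness, one way to sanity-check the CNI itself (assuming the default kube-flannel namespace used by the install guide) is:

kubectl get pods -n kube-flannel -o wide   # the flannel daemonset pods should all be Running
kubectl get nodes -o wide                  # all nodes should report Ready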
CRI and version:
Container runtime: containerd
root@k8s-eu-1-master:~# cat /etc/containerd/config.toml | grep version
version = 2
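containerd's own view of the crashing container can also be inspected directly on the worker node with crictl, e.g. (assuming the default containerd socket path):

crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a   # lists running and exited containers
systemctl status containerd --no-pager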
OS:
Ubuntu 22.04
What happened: the pods go back and forth between state Running and state CrashLoopBackOff.
This is the arango-deployment.yaml file: https://drive.google.com/file/d/1VfCjQih5aJUEA4HD9ddsQDrZbLmquWIQ/view?usp=share_link
This is the arango-storage.yaml file: https://drive.google.com/file/d/1hqHU_H2Wr5VFrJLwM9GDUHF17b7_CYIG/view?usp=sharing
This is the output of kubectl describe pod for both pods: https://drive.google.com/file/d/1kZsYeKxOa5aSppV3IdS6c7-e8dnoLiiB/view?usp=share_link
These are the last lines of the syslog on the node hosting one of the two crashing pods:
Oct 20 15:44:10 k8s-eu-1-worker-1 kubelet[599]: I1020 15:44:10.594513 599 scope.go:117] "RemoveContainer" containerID="3e618ac247c1392fd6a6d67fad93d187c0dfae4d2cfe77c6a8b244c831dd0852"
Oct 20 15:44:10 k8s-eu-1-worker-1 kubelet[599]: E1020 15:44:10.594988 599 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"operator\" with CrashLoopBackOff: \"back-off 2m40s restarting failed container=operator pod=arango-deployment-operator-5f4d66bd86-4pxkn_default(397bd5c4-2bfc-4ca3-bc7d-bd149932e4b8)\"" pod="default/arango-deployment-operator-5f4d66bd86-4pxkn" podUID="397bd5c4-2bfc-4ca3-bc7d-bd149932e4b8"
Oct 20 15:44:21 k8s-eu-1-worker-1 kubelet[599]: I1020 15:44:21.594619 599 scope.go:117] "RemoveContainer" containerID="3e618ac247c1392fd6a6d67fad93d187c0dfae4d2cfe77c6a8b244c831dd0852"
Oct 20 15:44:21 k8s-eu-1-worker-1 kubelet[599]: E1020 15:44:21.595036 599 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"operator\" with CrashLoopBackOff: \"back-off 2m40s restarting failed container=operator pod=arango-deployment-operator-5f4d66bd86-4pxkn_default(397bd5c4-2bfc-4ca3-bc7d-bd149932e4b8)\"" pod="default/arango-deployment-operator-5f4d66bd86-4pxkn" podUID="397bd5c4-2bfc-4ca3-bc7d-bd149932e4b8"
Here are the logs of each pod: https://drive.google.com/file/d/1k5g4d7j2uaGQTN7EMgR5Z794KtAg70CD/view?usp=sharing
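Note that kubectl logs shows the current container instance by default; the output of the instance that just crashed can be pulled with the --previous flag, e.g.:

kubectl logs arango-storage-operator-5fcb46574-w2b6b --previous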
But the problem is that they keep going back and forth between the Running and CrashLoopBackOff states:
root@k8s-eu-1-master:~# kubectl get pods
NAME READY STATUS RESTARTS AGE
arango-deployment-operator-5f4d66bd86-bgqbd 0/1 Running 8 (5m6s ago) 17m
arango-storage-operator-5fcb46574-w2b6b 0/1 CrashLoopBackOff 7 (4m52s ago) 17m
root@k8s-eu-1-master:~# kubectl get pods
NAME READY STATUS RESTARTS AGE
arango-deployment-operator-5f4d66bd86-bgqbd 0/1 Running 9 (7s ago) 18m
arango-storage-operator-5fcb46574-w2b6b 0/1 Running 8 (5m53s ago) 18m
root@k8s-eu-1-master:~# kubectl get pods
NAME READY STATUS RESTARTS AGE
arango-deployment-operator-5f4d66bd86-bgqbd 0/1 CrashLoopBackOff 9 (2m35s ago) 21m
arango-storage-operator-5fcb46574-w2b6b 0/1 CrashLoopBackOff 9 (2m11s ago) 21m
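The flip-flopping is easiest to follow live with the watch flag:

kubectl get pods -w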
This is the output of kubectl describe pod for the second pod:
root@k8s-eu-1-master:~# kubectl describe arango-storage-operator-5fcb46574-w2b6b
error: the server doesn't have a resource type "arango-storage-operator-5fcb46574-w2b6b"
root@k8s-eu-1-master:~# kubectl describe pod arango-storage-operator-5fcb46574-w2b6b
Name: arango-storage-operator-5fcb46574-w2b6b
Namespace: default
Priority: 0
Service Account: arango-storage-operator
Node: k8s-eu-1-worker-2/zz.zzz.zzz.zzz
Start Time: Fri, 20 Oct 2023 16:29:06 +0200
Labels: app.kubernetes.io/instance=storage
app.kubernetes.io/managed-by=Tiller
app.kubernetes.io/name=kube-arangodb
helm.sh/chart=kube-arangodb-1.2.34
pod-template-hash=5fcb46574
release=storage
Annotations: <none>
Status: Running
IP: 10.244.0.13
IPs:
IP: 10.244.0.13
Controlled By: ReplicaSet/arango-storage-operator-5fcb46574
Containers:
operator:
Container ID: containerd://ce9249b8978d6f902c5162f74e4b4e71d401458587cf0f3a717bd86a75c1b65f
Image: arangodb/kube-arangodb:1.2.34
Image ID: docker.io/arangodb/kube-arangodb@sha256:a25d031e87ba5b0f3038ce9f346553b69760a3a065fe608727cde188602b59e8
Port: 8528/TCP
Host Port: 0/TCP
Args:
--scope=legacy
--operator.storage
--mode.single
--log.level=debug
--chaos.allowed=false
State: Running
Started: Fri, 20 Oct 2023 16:53:53 +0200
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Fri, 20 Oct 2023 16:47:47 +0200
Finished: Fri, 20 Oct 2023 16:48:46 +0200
Ready: False
Restart Count: 10
Limits:
cpu: 1
memory: 2Gi
Requests:
cpu: 500m
memory: 1Gi
Liveness: http-get https://:8528/health delay=5s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get https://:8528/ready delay=5s timeout=1s period=10s #success=1 #failure=3
Environment:
MY_POD_NAMESPACE: default (v1:metadata.namespace)
MY_POD_NAME: arango-storage-operator-5fcb46574-w2b6b (v1:metadata.name)
MY_POD_IP: (v1:status.podIP)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8bdkv (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-8bdkv:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 5s
node.kubernetes.io/unreachable:NoExecute op=Exists for 5s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 24m default-scheduler Successfully assigned default/arango-storage-operator-5fcb46574-w2b6b to k8s-eu-1-worker-2
Normal Pulled 24m kubelet Successfully pulled image "arangodb/kube-arangodb:1.2.34" in 830ms (830ms including waiting)
Normal Pulling 23m (x2 over 24m) kubelet Pulling image "arangodb/kube-arangodb:1.2.34"
Normal Created 23m (x2 over 24m) kubelet Created container operator
Normal Started 23m (x2 over 24m) kubelet Started container operator
Normal Pulled 23m kubelet Successfully pulled image "arangodb/kube-arangodb:1.2.34" in 1.033s (1.033s including waiting)
Warning Unhealthy 23m (x9 over 24m) kubelet Readiness probe failed: Get "https://10.244.0.13:8528/ready": dial tcp 10.244.0.13:8528: connect: connection refused
Warning Unhealthy 23m (x6 over 24m) kubelet Liveness probe failed: Get "https://10.244.0.13:8528/health": dial tcp 10.244.0.13:8528: connect: connection refused
Normal Killing 23m (x2 over 24m) kubelet Container operator failed liveness probe, will be restarted
Normal Pulled 19m kubelet Successfully pulled image "arangodb/kube-arangodb:1.2.34" in 827ms (827ms including waiting)
Warning BackOff 4m48s (x51 over 18m) kubelet Back-off restarting failed container operator in pod arango-storage-operator-5fcb46574-w2b6b_default(01f85e27-97bb-45c9-b42b-a2fc2aba0967)
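The events above show both probes being refused on port 8528 and the container exiting with code 137 (SIGKILL), which can come either from the kubelet restarting a container that failed its liveness probe or from the OOM killer. The last termination reason can be read directly (using the pod name from the output above):

kubectl get pod arango-storage-operator-5fcb46574-w2b6b -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

Here the Last State above already reports Error rather than OOMKilled, which points at the failed liveness probe rather than memory pressure.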
And this is the syslog on the node hosting the second pod:
Oct 20 16:59:24 k8s-eu-1-worker-2 kubelet[595]: I1020 16:59:24.180175 595 scope.go:117] "RemoveContainer" containerID="81fad793d8848c7507317c980e37470c06f4e77bb4d8f9b893137dc5fd8f85ef"
Oct 20 16:59:24 k8s-eu-1-worker-2 kubelet[595]: E1020 16:59:24.181224 595 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"operator\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=operator pod=arango-storage-operator-5fcb46574-w2b6b_default(01f85e27-97bb-45c9-b42b-a2fc2aba0967)\"" pod="default/arango-storage-operator-5fcb46574-w2b6b" podUID="01f85e27-97bb-45c9-b42b-a2fc2aba0967"
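In case the operator simply needs longer to start listening on port 8528 than the probes allow, one way to test that theory would be to relax the probe timing, e.g. (an untested sketch against the deployment above; the values are guesses, not a confirmed fix):

kubectl patch deployment arango-storage-operator --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds", "value": 60},
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/failureThreshold", "value": 10}
]'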
What am I doing wrong? How can I keep the pods in the Running state?