I'm currently running a Kubernetes cluster hosted on AWS via EKS, and it's hitting failures that are strange (to me). Our nodes (instance type c5.2xlarge, AMI ami-0f54a2f7d2e9c88b3 / amazon-eks-node-v25) run along fine until, with no apparent change in load, the kubelet starts logging a flood of errors. (I'm watching them via journalctl -u kubelet.)
The error messages aren't entirely consistent; different nodes show different sets of events before falling over, but eventually the node ends up in a NotReady state. Sometimes the nodes recover on their own, but whether and when that happens varies.
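From the control-plane side I'm mostly watching the flapping with plain kubectl (nothing exotic assumed here):

kubectl get nodes -w                                          # watch Ready/NotReady transitions
kubectl describe node <node-name>                             # Conditions block shows Reason: KubeletNotReady
kubectl get events --field-selector involvedObject.kind=Node  # node lifecycle events (NodeNotReady, etc.)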
Here's a sample of the logs from just before one node's status changed:
Dec 05 21:41:57 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: W1205 21:41:57.671381 4051 fs.go:571] Killing cmd [nice -n 19 du -s /var/lib/docker/overlay2/2af435b23328675b6ccddcd29da7a8681118ae90c78755933916d15c247653cc/diff] due to timeout(2m0s)
Dec 05 21:41:57 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: E1205 21:41:57.673113 4051 remote_runtime.go:434] Status from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Dec 05 21:41:57 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: E1205 21:41:57.676913 4051 kubelet.go:2114] Container runtime sanity check failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Dec 05 21:41:57 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: E1205 21:41:57.809324 4051 remote_runtime.go:332] ExecSync e264b31c91ae2d10381cbebd0c4a1e3b0deeefcc60dd5762b7f6f3ac9a7c5d1a '/bin/bash -c pgrep python >/dev/null 2>&1' from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Dec 05 21:41:57 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: I1205 21:41:57.833254 4051 kubelet.go:1799] skipping pod synchronization - [container runtime is down]
Dec 05 21:41:57 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: I1205 21:41:57.843768 4051 kubelet_node_status.go:814] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2018-12-05 21:41:57.843747845 +0000 UTC m=+6231.746946646 LastTransitionTime:2018-12-05 21:41:57.843747845 +0000 UTC m=+6231.746946646 Reason:KubeletNotReady Message:container runtime is down}
Dec 05 21:41:57 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: I1205 21:41:57.933579 4051 kubelet.go:1799] skipping pod synchronization - [container runtime is down]
Dec 05 21:41:58 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: I1205 21:41:58.159892 4051 kubelet.go:1799] skipping pod synchronization - [container runtime is down]
Dec 05 21:41:58 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: I1205 21:41:58.561026 4051 kubelet.go:1799] skipping pod synchronization - [container runtime is down]
Dec 05 21:41:59 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: I1205 21:41:59.381016 4051 kubelet.go:1799] skipping pod synchronization - [container runtime is down]
Dec 05 21:42:00 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: I1205 21:42:00.985015 4051 kubelet.go:1799] skipping pod synchronization - [container runtime is down]
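The first line there is the kubelet's du of an overlay2 diff directory getting killed after its 2m0s timeout, so on a wedged node I can also poke the runtime by hand; a rough sketch of what I'd run (the layer ID is copied from the log above, and these AMIs use Docker as the runtime):

time nice -n 19 du -s /var/lib/docker/overlay2/2af435b23328675b6ccddcd29da7a8681118ae90c78755933916d15c247653cc/diff   # does the same du ever finish?
time docker info                       # does the daemon respond at all?
systemctl status docker --no-pager     # is dockerd still up / restarting?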
In other cases, things go off the rails with a series of warnings like NetworkPlugin cni failed on the status hook for pod "<pod-name>": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "<container-ID>".
A third failure mode doesn't actually show a change in node status, and instead ends with
kubelet_node_status.go:377] Error updating node status, will retry: error getting node "<node-private-ip>": Unauthorized
That one shows up after other errors such as
cni.go:227] Error while adding to cni network: add cmd: failed to assign an IP address to container
and
raw.go:87] Error while processing event ("/sys/fs/cgroup/devices/system.slice/run-27618.scope": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/devices/system.slice/run-27618.scope: no such file or directory
It's puzzling because the nodes drop out of service at a seemingly unpredictable cadence, and the error behavior isn't consistent. Could there be a single unifying cause (or several) behind these failures?

I'm happy to provide more information about the cluster or more detailed logs; just let me know. Thanks in advance for any help!