I'm currently running a Kubernetes cluster hosted on AWS via EKS, and it's hitting failures that are strange (to me). Our nodes (instance type c5.2xlarge, AMI ami-0f54a2f7d2e9c88b3 / amazon-eks-node-v25) run along fine until, with no apparent change in load, the kubelet starts logging a flood of errors. (I'm watching them via journalctl -u kubelet.)
The error messages aren't entirely consistent; different nodes show different sets of events before falling over, but eventually the node ends up in a NotReady state. Sometimes the nodes recover on their own, but whether and when that happens varies.
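From the control-plane side I'm mostly watching the flapping with plain kubectl (nothing exotic assumed here):

kubectl get nodes -w                                          # watch Ready/NotReady transitions
kubectl describe node <node-name>                             # Conditions block shows Reason: KubeletNotReady
kubectl get events --field-selector involvedObject.kind=Node  # node lifecycle events (NodeNotReady, etc.)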
Here's a sample of the logs from just before one node's status changed:
Dec 05 21:41:57 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: W1205 21:41:57.671381 4051 fs.go:571] Killing cmd [nice -n 19 du -s /var/lib/docker/overlay2/2af435b23328675b6ccddcd29da7a8681118ae90c78755933916d15c247653cc/diff] due to timeout(2m0s)
Dec 05 21:41:57 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: E1205 21:41:57.673113 4051 remote_runtime.go:434] Status from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Dec 05 21:41:57 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: E1205 21:41:57.676913 4051 kubelet.go:2114] Container runtime sanity check failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Dec 05 21:41:57 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: E1205 21:41:57.809324 4051 remote_runtime.go:332] ExecSync e264b31c91ae2d10381cbebd0c4a1e3b0deeefcc60dd5762b7f6f3ac9a7c5d1a '/bin/bash -c pgrep python >/dev/null 2>&1' from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Dec 05 21:41:57 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: I1205 21:41:57.833254 4051 kubelet.go:1799] skipping pod synchronization - [container runtime is down]
Dec 05 21:41:57 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: I1205 21:41:57.843768 4051 kubelet_node_status.go:814] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2018-12-05 21:41:57.843747845 +0000 UTC m=+6231.746946646 LastTransitionTime:2018-12-05 21:41:57.843747845 +0000 UTC m=+6231.746946646 Reason:KubeletNotReady Message:container runtime is down}
Dec 05 21:41:57 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: I1205 21:41:57.933579 4051 kubelet.go:1799] skipping pod synchronization - [container runtime is down]
Dec 05 21:41:58 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: I1205 21:41:58.159892 4051 kubelet.go:1799] skipping pod synchronization - [container runtime is down]
Dec 05 21:41:58 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: I1205 21:41:58.561026 4051 kubelet.go:1799] skipping pod synchronization - [container runtime is down]
Dec 05 21:41:59 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: I1205 21:41:59.381016 4051 kubelet.go:1799] skipping pod synchronization - [container runtime is down]
Dec 05 21:42:00 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: I1205 21:42:00.985015 4051 kubelet.go:1799] skipping pod synchronization - [container runtime is down]
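The first line there is the kubelet's du of an overlay2 diff directory getting killed after its 2m0s timeout, so on a wedged node I can also poke the runtime by hand; a rough sketch of what I'd run (the layer ID is copied from the log above, and these AMIs use Docker as the runtime):

time nice -n 19 du -s /var/lib/docker/overlay2/2af435b23328675b6ccddcd29da7a8681118ae90c78755933916d15c247653cc/diff   # does the same du ever finish?
time docker info                       # does the daemon respond at all?
systemctl status docker --no-pager     # is dockerd still up / restarting?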
In other cases, things go off the rails with a series of warnings like NetworkPlugin cni failed on the status hook for pod "<pod-name>": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "<container-ID>".
A third failure mode doesn't actually show a change in node status, and instead ends with
kubelet_node_status.go:377] Error updating node status, will retry: error getting node "<node-private-ip>": Unauthorized
That one shows up after other errors such as
cni.go:227] Error while adding to cni network: add cmd: failed to assign an IP address to container
and
raw.go:87] Error while processing event ("/sys/fs/cgroup/devices/system.slice/run-27618.scope": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/devices/system.slice/run-27618.scope: no such file or directory
It's puzzling because the nodes drop out of service at a seemingly unpredictable cadence, and the error behavior isn't consistent. Could there be a single unifying cause (or several) behind these failures?

I'm happy to provide more information about the cluster or more detailed logs; just let me know. Thanks in advance for any help!