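We have an EKS cluster with two t3.small nodes, each with 20Gi of ephemeral storage. The cluster currently runs only two small Node.js (node:12-alpine) applications.
This worked fine for several weeks, but now we are suddenly getting disk pressure errors.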
$ kubectl describe nodes
Name: ip-192-168-101-158.ap-southeast-1.compute.internal
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=t3.small
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=ap-southeast-1
failure-domain.beta.kubernetes.io/zone=ap-southeast-1a
kubernetes.io/hostname=ip-192-168-101-158.ap-southeast-1.compute.internal
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Sun, 31 Mar 2019 17:14:58 +0800
Taints: node.kubernetes.io/disk-pressure:NoSchedule
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Sun, 12 May 2019 12:22:47 +0800 Sun, 31 Mar 2019 17:14:58 +0800 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Sun, 12 May 2019 12:22:47 +0800 Sun, 31 Mar 2019 17:14:58 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure True Sun, 12 May 2019 12:22:47 +0800 Sun, 12 May 2019 06:51:38 +0800 KubeletHasDiskPressure kubelet has disk pressure
PIDPressure False Sun, 12 May 2019 12:22:47 +0800 Sun, 31 Mar 2019 17:14:58 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Sun, 12 May 2019 12:22:47 +0800 Sun, 31 Mar 2019 17:15:31 +0800 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 192.168.101.158
ExternalIP: 54.169.250.255
InternalDNS: ip-192-168-101-158.ap-southeast-1.compute.internal
ExternalDNS: ec2-54-169-250-255.ap-southeast-1.compute.amazonaws.com
Hostname: ip-192-168-101-158.ap-southeast-1.compute.internal
Capacity:
attachable-volumes-aws-ebs: 25
cpu: 2
ephemeral-storage: 20959212Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 2002320Ki
pods: 11
Allocatable:
attachable-volumes-aws-ebs: 25
cpu: 2
ephemeral-storage: 19316009748
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 1899920Ki
pods: 11
System Info:
Machine ID: ec2aa2ecfbbbdd798e2da086fc04afb6
System UUID: EC2AA2EC-FBBB-DD79-8E2D-A086FC04AFB6
Boot ID: 62c5eb9d-5f19-4558-8883-2da48ab1969c
Kernel Version: 4.14.106-97.85.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://18.6.1
Kubelet Version: v1.12.7
Kube-Proxy Version: v1.12.7
ProviderID: aws:///ap-southeast-1a/i-0a38342b60238d83e
Non-terminated Pods: (0 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 0 (0%) 0 (0%)
memory 0 (0%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ImageGCFailed 5m15s (x333 over 40h) kubelet, ip-192-168-101-158.ap-southeast-1.compute.internal (combined from similar events): failed to garbage collect required amount of images. Wanted to free 1423169945 bytes, but freed 0 bytes
Warning EvictionThresholdMet 17s (x2809 over 3d4h) kubelet, ip-192-168-101-158.ap-southeast-1.compute.internal Attempting to reclaim ephemeral-storage
Name: ip-192-168-197-198.ap-southeast-1.compute.internal
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=t3.small
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=ap-southeast-1
failure-domain.beta.kubernetes.io/zone=ap-southeast-1c
kubernetes.io/hostname=ip-192-168-197-198.ap-southeast-1.compute.internal
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Sun, 31 Mar 2019 17:15:02 +0800
Taints: node.kubernetes.io/disk-pressure:NoSchedule
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Sun, 12 May 2019 12:22:42 +0800 Thu, 09 May 2019 06:50:56 +0800 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Sun, 12 May 2019 12:22:42 +0800 Thu, 09 May 2019 06:50:56 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure True Sun, 12 May 2019 12:22:42 +0800 Sat, 11 May 2019 21:53:44 +0800 KubeletHasDiskPressure kubelet has disk pressure
PIDPressure False Sun, 12 May 2019 12:22:42 +0800 Sun, 31 Mar 2019 17:15:02 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Sun, 12 May 2019 12:22:42 +0800 Thu, 09 May 2019 06:50:56 +0800 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 192.168.197.198
ExternalIP: 13.229.138.38
InternalDNS: ip-192-168-197-198.ap-southeast-1.compute.internal
ExternalDNS: ec2-13-229-138-38.ap-southeast-1.compute.amazonaws.com
Hostname: ip-192-168-197-198.ap-southeast-1.compute.internal
Capacity:
attachable-volumes-aws-ebs: 25
cpu: 2
ephemeral-storage: 20959212Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 2002320Ki
pods: 11
Allocatable:
attachable-volumes-aws-ebs: 25
cpu: 2
ephemeral-storage: 19316009748
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 1899920Ki
pods: 11
System Info:
Machine ID: ec27ee0765e86a14ed63d771073e63fb
System UUID: EC27EE07-65E8-6A14-ED63-D771073E63FB
Boot ID: 7869a0ee-dc2f-4082-ae3f-42c5231ab0e3
Kernel Version: 4.14.106-97.85.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://18.6.1
Kubelet Version: v1.12.7
Kube-Proxy Version: v1.12.7
ProviderID: aws:///ap-southeast-1c/i-0bd4038f4dade284e
Non-terminated Pods: (0 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 0 (0%) 0 (0%)
memory 0 (0%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning EvictionThresholdMet 5m40s (x4865 over 3d5h) kubelet, ip-192-168-197-198.ap-southeast-1.compute.internal Attempting to reclaim ephemeral-storage
Warning ImageGCFailed 31s (x451 over 45h) kubelet, ip-192-168-197-198.ap-southeast-1.compute.internal (combined from similar events): failed to garbage collect required amount of images. Wanted to free 4006422937 bytes, but freed 0 bytes
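I'm not sure how to debug this, but it feels like K8s is unable to delete old, unused Docker images on the nodes. Is there any way to verify this hypothesis? Any other ideas?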
Answer 1
Here is my workaround:
kubectl drain --delete-local-data --ignore-daemonsets $NODE_NAME && kubectl uncordon $NODE_NAME
It wipes all local data and evicts all pods, which are then rescheduled. But I am still looking for the root cause.
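
As a rough sanity check on the root cause, the ImageGCFailed events in the question can be turned back into an approximate disk usage figure. This is only a sketch under two assumptions: that the kubelet runs with the default image GC low threshold of 80% (--image-gc-low-threshold), and that the image filesystem is the same 20959212Ki volume reported as ephemeral-storage capacity. The kubelet asks image GC to free enough bytes to get back down to the low threshold, so usage is roughly the low-threshold share of capacity plus the "Wanted to free" amount. The function name implied_usage_pct is made up for this illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of kubectl or the kubelet.
# Usage: implied_usage_pct CAPACITY_KI WANTED_TO_FREE_BYTES [LOW_THRESHOLD_PCT]
# Assumes the default low threshold of 80% unless a third argument is given.
implied_usage_pct() {
  local capacity_ki=$1
  local to_free=$2
  local low_pct=${3:-80}
  # Capacity is reported in Ki; convert to bytes.
  local capacity=$(( capacity_ki * 1024 ))
  # Image GC targets the low threshold, so:
  #   used = capacity * low_pct / 100 + bytes it wanted to free
  local used=$(( capacity * low_pct / 100 + to_free ))
  # Integer percentage of the disk in use (rounded down).
  echo $(( used * 100 / capacity ))
}

# Values copied from the two ImageGCFailed events in the question.
implied_usage_pct 20959212 1423169945
implied_usage_pct 20959212 4006422937
```

Under those assumptions, both nodes are well above the low threshold, while image GC reports "freed 0 bytes". That would suggest the kubelet found no unused images it could delete, so the space may be taken by something other than stale images, for example container logs or writable container layers. Checking disk usage directly on a node would be needed to confirm this.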