EKS nodes suddenly failing with disk pressure

2024-6-1

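We have an EKS cluster with two t3.small nodes, each with 20Gi of ephemeral storage. The cluster currently runs only two small Node.js (node:12-alpine) applications.

This worked fine for several weeks, but now we are suddenly getting disk pressure errors on both nodes.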

$ kubectl describe nodes
Name:               ip-192-168-101-158.ap-southeast-1.compute.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=t3.small
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=ap-southeast-1
                    failure-domain.beta.kubernetes.io/zone=ap-southeast-1a
                    kubernetes.io/hostname=ip-192-168-101-158.ap-southeast-1.compute.internal
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Sun, 31 Mar 2019 17:14:58 +0800
Taints:             node.kubernetes.io/disk-pressure:NoSchedule
Unschedulable:      false
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  OutOfDisk        False   Sun, 12 May 2019 12:22:47 +0800   Sun, 31 Mar 2019 17:14:58 +0800   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Sun, 12 May 2019 12:22:47 +0800   Sun, 31 Mar 2019 17:14:58 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     True    Sun, 12 May 2019 12:22:47 +0800   Sun, 12 May 2019 06:51:38 +0800   KubeletHasDiskPressure       kubelet has disk pressure
  PIDPressure      False   Sun, 12 May 2019 12:22:47 +0800   Sun, 31 Mar 2019 17:14:58 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Sun, 12 May 2019 12:22:47 +0800   Sun, 31 Mar 2019 17:15:31 +0800   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   192.168.101.158
  ExternalIP:   54.169.250.255
  InternalDNS:  ip-192-168-101-158.ap-southeast-1.compute.internal
  ExternalDNS:  ec2-54-169-250-255.ap-southeast-1.compute.amazonaws.com
  Hostname:     ip-192-168-101-158.ap-southeast-1.compute.internal
Capacity:
 attachable-volumes-aws-ebs:  25
 cpu:                         2
 ephemeral-storage:           20959212Ki
 hugepages-1Gi:               0
 hugepages-2Mi:               0
 memory:                      2002320Ki
 pods:                        11
Allocatable:
 attachable-volumes-aws-ebs:  25
 cpu:                         2
 ephemeral-storage:           19316009748
 hugepages-1Gi:               0
 hugepages-2Mi:               0
 memory:                      1899920Ki
 pods:                        11
System Info:
 Machine ID:                 ec2aa2ecfbbbdd798e2da086fc04afb6
 System UUID:                EC2AA2EC-FBBB-DD79-8E2D-A086FC04AFB6
 Boot ID:                    62c5eb9d-5f19-4558-8883-2da48ab1969c
 Kernel Version:             4.14.106-97.85.amzn2.x86_64
 OS Image:                   Amazon Linux 2
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://18.6.1
 Kubelet Version:            v1.12.7
 Kube-Proxy Version:         v1.12.7
ProviderID:                  aws:///ap-southeast-1a/i-0a38342b60238d83e
Non-terminated Pods:         (0 in total)
  Namespace                  Name    CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----    ------------  ----------  ---------------  -------------  ---
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests  Limits
  --------                    --------  ------
  cpu                         0 (0%)    0 (0%)
  memory                      0 (0%)    0 (0%)
  ephemeral-storage           0 (0%)    0 (0%)
  attachable-volumes-aws-ebs  0         0
Events:
  Type     Reason                Age                    From                                                         Message
  ----     ------                ----                   ----                                                         -------
  Warning  ImageGCFailed         5m15s (x333 over 40h)  kubelet, ip-192-168-101-158.ap-southeast-1.compute.internal  (combined from similar events): failed to garbage collect required amount of images. Wanted to free 1423169945 bytes, but freed 0 bytes
  Warning  EvictionThresholdMet  17s (x2809 over 3d4h)  kubelet, ip-192-168-101-158.ap-southeast-1.compute.internal  Attempting to reclaim ephemeral-storage


Name:               ip-192-168-197-198.ap-southeast-1.compute.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=t3.small
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=ap-southeast-1
                    failure-domain.beta.kubernetes.io/zone=ap-southeast-1c
                    kubernetes.io/hostname=ip-192-168-197-198.ap-southeast-1.compute.internal
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Sun, 31 Mar 2019 17:15:02 +0800
Taints:             node.kubernetes.io/disk-pressure:NoSchedule
Unschedulable:      false
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  OutOfDisk        False   Sun, 12 May 2019 12:22:42 +0800   Thu, 09 May 2019 06:50:56 +0800   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Sun, 12 May 2019 12:22:42 +0800   Thu, 09 May 2019 06:50:56 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     True    Sun, 12 May 2019 12:22:42 +0800   Sat, 11 May 2019 21:53:44 +0800   KubeletHasDiskPressure       kubelet has disk pressure
  PIDPressure      False   Sun, 12 May 2019 12:22:42 +0800   Sun, 31 Mar 2019 17:15:02 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Sun, 12 May 2019 12:22:42 +0800   Thu, 09 May 2019 06:50:56 +0800   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   192.168.197.198
  ExternalIP:   13.229.138.38
  InternalDNS:  ip-192-168-197-198.ap-southeast-1.compute.internal
  ExternalDNS:  ec2-13-229-138-38.ap-southeast-1.compute.amazonaws.com
  Hostname:     ip-192-168-197-198.ap-southeast-1.compute.internal
Capacity:
 attachable-volumes-aws-ebs:  25
 cpu:                         2
 ephemeral-storage:           20959212Ki
 hugepages-1Gi:               0
 hugepages-2Mi:               0
 memory:                      2002320Ki
 pods:                        11
Allocatable:
 attachable-volumes-aws-ebs:  25
 cpu:                         2
 ephemeral-storage:           19316009748
 hugepages-1Gi:               0
 hugepages-2Mi:               0
 memory:                      1899920Ki
 pods:                        11
System Info:
 Machine ID:                 ec27ee0765e86a14ed63d771073e63fb
 System UUID:                EC27EE07-65E8-6A14-ED63-D771073E63FB
 Boot ID:                    7869a0ee-dc2f-4082-ae3f-42c5231ab0e3
 Kernel Version:             4.14.106-97.85.amzn2.x86_64
 OS Image:                   Amazon Linux 2
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://18.6.1
 Kubelet Version:            v1.12.7
 Kube-Proxy Version:         v1.12.7
ProviderID:                  aws:///ap-southeast-1c/i-0bd4038f4dade284e
Non-terminated Pods:         (0 in total)
  Namespace                  Name    CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----    ------------  ----------  ---------------  -------------  ---
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests  Limits
  --------                    --------  ------
  cpu                         0 (0%)    0 (0%)
  memory                      0 (0%)    0 (0%)
  ephemeral-storage           0 (0%)    0 (0%)
  attachable-volumes-aws-ebs  0         0
Events:
  Type     Reason                Age                      From                                                         Message
  ----     ------                ----                     ----                                                         -------
  Warning  EvictionThresholdMet  5m40s (x4865 over 3d5h)  kubelet, ip-192-168-197-198.ap-southeast-1.compute.internal  Attempting to reclaim ephemeral-storage
  Warning  ImageGCFailed         31s (x451 over 45h)      kubelet, ip-192-168-197-198.ap-southeast-1.compute.internal  (combined from similar events): failed to garbage collect required amount of images. Wanted to free 4006422937 bytes, but freed 0 bytes

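I'm not sure how to debug this, but it looks as if K8s is unable to delete old, unused Docker images on the nodes. Is there a way to verify this hypothesis? Any other ideas?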

Answer 1

Here is my workaround:

kubectl drain --delete-local-data --ignore-daemonsets $NODE_NAME && kubectl uncordon $NODE_NAME

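This deletes all local data on the node and evicts every pod; after the uncordon, the pods are scheduled again. It only treats the symptom, though, and I am still looking for the root cause.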
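One thing that may help narrow it down: the ImageGCFailed events already tell you roughly how much free space each node has. A minimal sketch of the arithmetic, assuming the kubelet is running with its default --image-gc-low-threshold of 80 and that the image filesystem has the same capacity as the ephemeral-storage value under Capacity in the describe output (both are assumptions, and available_bytes is just a name made up for this example):

```python
KIB = 1024

def available_bytes(capacity_bytes, wanted_to_free, low_threshold_percent=80):
    """Back out the free space the kubelet saw when it logged ImageGCFailed.

    Assumes image GC aims to leave (100 - low_threshold_percent)% of the
    disk free, so: wanted_to_free = target_free - available.
    """
    target_free = capacity_bytes * (100 - low_threshold_percent) // 100
    return target_free - wanted_to_free

# Capacity as reported by `kubectl describe nodes`: ephemeral-storage 20959212Ki
capacity = 20959212 * KIB

# "Wanted to free N bytes" values copied from the two ImageGCFailed events
events = {
    "ip-192-168-101-158": 1423169945,
    "ip-192-168-197-198": 4006422937,
}

for node, wanted in events.items():
    avail = available_bytes(capacity, wanted)
    pct_free = 100 * avail / capacity
    print(f"{node}: {avail} bytes free ({pct_free:.1f}% of capacity)")
```

Under those assumptions the first node has roughly 13% of its disk free and the second only about 1%, and both results come out as exact multiples of 4096 bytes, which suggests the assumed defaults are at least plausible here. Since the kubelet also reports "freed 0 bytes", it apparently found no images it could remove, so it may be worth checking what else is using the disk on the node (for example with df -h and docker system df over SSH) before blaming stale images.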
