The issue I am currently facing is that one of my Kubernetes nodes keeps running into DiskPressure, which leads to pod evictions and service disruptions. Despite our best efforts, we have not been able to pin down the root cause, so I am looking for guidance on how to analyze and debug this effectively.
Here is the background and what we have tried so far:
- Kubernetes version: 1.24.1
- Node specs:
  - OS: Ubuntu 20.04.4 LTS (amd64)
  - Kernel: 5.13.0-51-generic
  - Container runtime: containerd://1.6.6
- Pod and resource utilization (output of kubectl describe node):
Capacity:
cpu: 16
ephemeral-storage: 256Gi
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 65776132Ki
pods: 110
Allocatable:
cpu: 16
ephemeral-storage: 241591910Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 65673732Ki
pods: 110
System Info:
Kernel Version: 5.13.0-51-generic
OS Image: Ubuntu 20.04.4 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.6.6
Kubelet Version: v1.24.1
Kube-Proxy Version: v1.24.1
Non-terminated Pods: (41 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
cert-manager cert-manager-7686fcb9bc-jptct 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
cert-manager cert-manager-cainjector-69d77789d-kmzb9 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
cert-manager cert-manager-webhook-84c6f5779-gs8h7 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
devops external-dns-7bdcbb7658-rvwqs 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
devops filebeat-7l62m 100m (0%) 0 (0%) 100Mi (0%) 200Mi (0%) 20m
devops jenkins-597c5d498c-prs5x 0 (0%) 0 (0%) 0 (0%) 0 (0%) 14m
devops kibana-6b577f877c-28ck4 100m (0%) 1 (6%) 0 (0%) 0 (0%) 46m
devops logstash-788d5f89b-pr79c 0 (0%) 0 (0%) 0 (0%) 0 (0%) 14m
devops nexus-6db65f8744-cxlhs 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
devops powerdns-authoritative-85dcd685c4-4mts8 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
devops powerdns-recursor-757854d6f8-5z25p 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
devops powerdns-recursor-nok8s-5db55c87f9-77ww6 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
devops sonarqube-5767c467c9-2crz2 0 (0%) 0 (0%) 200Mi (0%) 0 (0%) 46m
devops sonarqube-postgres-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
ingress-nginx ingress-nginx-controller-75f6588c7b-gw77s 100m (0%) 0 (0%) 90Mi (0%) 0 (0%) 13m
jenkins-agents my-cluster-dev-tenant-develop-328-76mr4-ns67p-3xczd 0 (0%) 0 (0%) 350Mi (0%) 0 (0%) 72s
kube-system calico-kube-controllers-56cdb7c587-zmz4t 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
kube-system calico-node-pshn4 250m (1%) 0 (0%) 0 (0%) 0 (0%) 354d
kube-system coredns-6d4b75cb6d-nrbmq 100m (0%) 0 (0%) 70Mi (0%) 170Mi (0%) 46m
kube-system coredns-6d4b75cb6d-q9hvs 100m (0%) 0 (0%) 70Mi (0%) 170Mi (0%) 46m
kube-system etcd-my-cluster 100m (0%) 0 (0%) 100Mi (0%) 0 (0%) 354d
kube-system kube-apiserver-my-cluster 250m (1%) 0 (0%) 0 (0%) 0 (0%) 354d
kube-system kube-controller-manager-my-cluster 200m (1%) 0 (0%) 0 (0%) 0 (0%) 354d
kube-system kube-proxy-qwmrd 0 (0%) 0 (0%) 0 (0%) 0 (0%) 354d
kube-system kube-scheduler-my-cluster 100m (0%) 0 (0%) 0 (0%) 0 (0%) 354d
kube-system metrics-server-5744cd7dbb-h758l 100m (0%) 0 (0%) 200Mi (0%) 0 (0%) 34m
kube-system metrics-server-6bf466fbf5-nt5k6 100m (0%) 0 (0%) 200Mi (0%) 0 (0%) 47m
kube-system node-shell-0c3bde15-32fa-4831-9f05-ebfe5d14a909 0 (0%) 0 (0%) 0 (0%) 0 (0%) 43m
kube-system node-shell-692c6032-8301-44ac-b12e-e5a222a6f80a 0 (0%) 0 (0%) 0 (0%) 0 (0%) 8m6s
lens-metrics prometheus-0 100m (0%) 0 (0%) 512Mi (0%) 0 (0%) 14m
imaginary-dev mailhog-7f666fdfbf-xgcwf 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
imaginary-dev ms-nginx-766bf76f87-ss8h6 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
imaginary-dev ms-tenant-f847987cc-rf9db 400m (2%) 500m (3%) 500M (0%) 700M (1%) 46m
imaginary-dev ms-webapp-5d6bcdcc4f-x68s4 100m (0%) 200m (1%) 200M (0%) 400M (0%) 46m
imaginary-dev mysql-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
imaginary-dev redis-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
imaginary-uat mailhog-685b7c6844-cpmfp 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
imaginary-uat ms-tenant-6965d68df8-nlm7p 500m (3%) 600m (3%) 512M (0%) 704M (1%) 46m
imaginary-uat ms-webapp-6cb7fb6c65-pfhsh 100m (0%) 200m (1%) 200M (0%) 400M (0%) 46m
imaginary-uat mysql-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
imaginary-uat redis-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 2800m (17%) 2500m (15%)
memory 3395905792 (5%) 2770231040 (4%)
ephemeral-storage 2Gi (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events: <none>
- Disk usage analysis: we inspected disk usage on the node with the du and df commands (roughly along the lines sketched below).
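For concreteness, the du/df checks looked roughly like this; the exact paths assume the default containerd/kubelet locations on this node, so they would need adjusting on a relocated install:

# overall usage per mounted filesystem
df -h

# size of the usual Kubernetes-related directories
# (assumed default locations for containerd data, kubelet state, and logs)
sudo du -xh --max-depth=1 /var/lib/containerd | sort -h
sudo du -xh --max-depth=1 /var/lib/kubelet | sort -h
sudo du -xh --max-depth=1 /var/log | sort -h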
Despite the efforts above, we still have not been able to determine the exact cause of the DiskPressure. We suspect it may be related to excessive logging, large container images, or inefficient resource allocation, but we are not sure how to confirm or rule out these suspicions.
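To make the question more concrete, these are the kinds of checks I assume would confirm or rule out each suspicion, though I am not sure they are the right approach (the node name is a placeholder, and crictl is assumed to be pointed at the containerd CRI socket):

# large container images: list images known to the runtime, with their sizes
sudo crictl images

# excessive logging: largest container log files written by the kubelet
sudo du -ah /var/log/pods | sort -h | tail -20

# per-pod ephemeral-storage usage as reported by the kubelet summary API
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" | less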
I would therefore appreciate help with the following:
- Best practices for analyzing and debugging DiskPressure issues on a Kubernetes node.
- Tools or techniques for identifying which specific processes or pods are consuming the most disk space.
- Strategies for optimizing resource allocation and disk usage in Kubernetes to mitigate DiskPressure.
- Any other insights or recommendations for resolving this issue effectively.
Any suggestions, recommendations, or experience-based insights would be greatly appreciated. Thanks in advance for your help!