My Kubernetes cluster is stuck in the Terminating state. Below is the current state.
Pods:
kubectl get po
NAME READY STATUS RESTARTS AGE
dashboard-0 1/1 Terminating 0 3h12m
data-cruncher-0 1/2 Terminating 0 3h12m
db-0 3/3 Terminating 0 3h12m
prometheus-0 3/3 Terminating 0 3h12m
register-0 3/3 Terminating 0 3h12m
The pod logs show an authorization error:
kubectl logs dashboard-0
Error from server (InternalError): Internal error occurred: Authorization error (user=kube-apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)
StatefulSets, Deployments, DaemonSets, events, and Services:
[ec2-user@ip-172-31-7-229 ~]$ kubectl get statefulset
No resources found in default namespace.
[ec2-user@ip-172-31-7-229 ~]$ kubectl get deploy
No resources found in default namespace.
[ec2-user@ip-172-31-7-229 ~]$ kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 172.20.0.1 <none> 443/TCP 3h20m
[ec2-user@ip-172-31-7-229 ~]$ kubectl get events
No resources found in default namespace.
[ec2-user@ip-172-31-7-229 ~]$ kubectl get daemonset
No resources found in default namespace.
PVCs and PVs:
[ec2-user@ip-172-31-7-229 ~]$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
db-persistent-storage-db-0 Bound pvc-2d287652-c927-4c63-a463-e40b7da1686f 100Gi RWO ssd 3h14m
prometheus-pvc Terminating pvc-0327f200-5b88-412a-a029-bc302f09333d 20Gi RWO hdd 3h14m
register-pvc Terminating pvc-dfd5deef-9f2d-4e60-a84b-55512e094cb6 20Gi RWO ssd 3h14m
[ec2-user@ip-172-31-7-229 ~]$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-0327f200-5b88-412a-a029-bc302f09333d 20Gi RWO Delete Bound default/prometheus-pvc hdd 3h14m
pvc-2d287652-c927-4c63-a463-e40b7da1686f 100Gi RWO Delete Bound default/db-persistent-storage-db-0 ssd 3h14m
pvc-dfd5deef-9f2d-4e60-a84b-55512e094cb6 20Gi RWO Delete Bound default/register-pvc ssd 3h14m
The nodes show as:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-134-174.us-west-2.compute.internal NotReady <none> 3h17m v1.21.12-eks-5308cf7
ip-10-0-142-12.us-west-2.compute.internal NotReady <none> 3h15m v1.21.12-eks-5308cf7
PVC description:
kubectl describe pvc prometheus-pvc
Name: prometheus-pvc
Namespace: default
StorageClass: hdd
Status: Terminating (lasts 3h11m)
Volume: pvc-0327f200-5b88-412a-a029-bc302f09333d
Labels: app=prometheus
Annotations: pv.kubernetes.io/bind-completed: yes
pv.kubernetes.io/bound-by-controller: yes
volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/aws-ebs
volume.kubernetes.io/selected-node: ip-10-0-134-174.us-west-2.compute.internal
Finalizers: [kubernetes.io/pvc-protection]
Capacity: 20Gi
Access Modes: RWO
VolumeMode: Filesystem
Used By: prometheus-0
Events: <none>
And the PVs:
kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-0327f200-5b88-412a-a029-bc302f09333d 20Gi RWO Delete Bound default/prometheus-pvc hdd 3h18m
pvc-2d287652-c927-4c63-a463-e40b7da1686f 100Gi RWO Delete Bound default/db-persistent-storage-db-0 ssd 3h18m
pvc-dfd5deef-9f2d-4e60-a84b-55512e094cb6 20Gi RWO Delete Bound default/register-pvc ssd 3h18m
And the PV description:
[ec2-user@ip-172-31-7-229 ~]$ kubectl describe pv pvc-0327f200-5b88-412a-a029-bc302f09333d
Name: pvc-0327f200-5b88-412a-a029-bc302f09333d
Labels: topology.kubernetes.io/region=us-west-2
topology.kubernetes.io/zone=us-west-2b
Annotations: kubernetes.io/createdby: aws-ebs-dynamic-provisioner
pv.kubernetes.io/bound-by-controller: yes
pv.kubernetes.io/provisioned-by: kubernetes.io/aws-ebs
Finalizers: [kubernetes.io/pv-protection]
StorageClass: hdd
Status: Bound
Claim: default/prometheus-pvc
Reclaim Policy: Delete
Access Modes: RWO
VolumeMode: Filesystem
Capacity: 20Gi
Node Affinity:
Required Terms:
Term 0: topology.kubernetes.io/zone in [us-west-2b]
topology.kubernetes.io/region in [us-west-2]
Message:
Source:
Type: AWSElasticBlockStore (a Persistent Disk resource in AWS)
VolumeID: aws://us-west-2b/vol-00d432f06a2fbd806
FSType: ext4
Partition: 0
ReadOnly: false
Events: <none>
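I tried to detach the EBS volume from the description above in the AWS web console. A volume was attached again immediately, and everything stayed stuck in the same Terminating state.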
[ec2-user@ip-172-31-7-229 ~]$ kubectl get events
LAST SEEN TYPE REASON OBJECT MESSAGE
90s Normal SuccessfulAttachVolume pod/prometheus-0 AttachVolume.Attach succeeded for volume "pvc-0327f200-5b88-412a-a029-bc302f09333d"
[ec2-user@ip-172-31-7-229 ~]$ kubctl describe po prometheus-0
-bash: kubctl: command not found
[ec2-user@ip-172-31-7-229 ~]$ kubectl describe po prometheus-0
Name: prometheus-0
Namespace: default
Priority: 0
Node: ip-10-0-134-174.us-west-2.compute.internal/10.0.134.174
Start Time: Sat, 25 Jun 2022 11:51:44 +0000
Labels: app=prometheus
controller-revision-hash=prometheus-5c84fc57f4
statefulset.kubernetes.io/pod-name=prometheus-0
Annotations: kubectl.kubernetes.io/default-container: prometheus
kubernetes.io/psp: eks.privileged
seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status: Terminating (lasts 3h20m)
Termination Grace Period: 30s
IP: 10.0.129.199
IPs:
IP: 10.0.129.199
Controlled By: StatefulSet/prometheus
Containers:
prometheus:
Container ID: docker://50f159a0e5e64502d614479791c0b6af381630dca28450dbc7fe237746998457
Image: 809541265033.dkr.ecr.us-east-2.amazonaws.com/prometheus:nightlye2e
Image ID: docker-pullable://809541265033.dkr.ecr.us-east-2.amazonaws.com/prometheus@sha256:d4602ccdc676a9211645fc9710a2668f6b62ee59d080ed0df6bbee7c92f26014
Port: 9090/TCP
Host Port: 0/TCP
State: Running
Started: Sat, 25 Jun 2022 11:51:59 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 5
memory: 10Gi
Requests:
cpu: 25m
memory: 500Mi
Liveness: http-get http://:9090/-/healthy delay=15s timeout=1s period=10s #success=1 #failure=3
Environment:
JOB_NAME: dev-default-nightlye2e
AWS_DEFAULT_REGION: us-west-2
AWS_REGION: us-west-2
AWS_ROLE_ARN: arn:aws:iam::775902114032:role/project-n-dev-default-a2ea-logs
AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
Mounts:
/prometheus/ from prometheus-storage-volume (rw)
/var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v5clx (ro)
pusher:
Container ID: docker://6b994ea0ac6fa15cb6ff51f582037a4c181818f98b1e000e112b731e73639ce5
Image: 809541265033.dkr.ecr.us-east-2.amazonaws.com/pusher:nightlye2e
Image ID: docker-pullable://809541265033.dkr.ecr.us-east-2.amazonaws.com/pusher@sha256:f410b40069f9d4a6e77fe7db03bd5f6a108c223b17f5d552b10838f073ae8c99
Port: <none>
Host Port: <none>
Command:
./prometheus.sh
900
s3://project-n-logs-us-west-2/dev-default/project-n-dev-default-a2ea-829a
State: Running
Started: Sat, 25 Jun 2022 11:52:00 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 500m
memory: 512Mi
Requests:
cpu: 200m
memory: 128Mi
Environment:
AWS_REGION: us-west-2
AWS_ROLE_ARN: arn:aws:iam::775902114032:role/project-n-dev-default-a2ea-logs
AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
Mounts:
/prometheus/ from prometheus-storage-volume (rw)
/var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v5clx (ro)
resize-buddy:
Container ID: docker://ba259862b18d91ae475a13993afe3b377c069dc48153c42f3fdca4c97a60fad1
Image: 809541265033.dkr.ecr.us-east-2.amazonaws.com/pusher:nightlye2e
Image ID: docker-pullable://809541265033.dkr.ecr.us-east-2.amazonaws.com/pusher@sha256:f410b40069f9d4a6e77fe7db03bd5f6a108c223b17f5d552b10838f073ae8c99
Port: <none>
Host Port: <none>
Command:
./pvc-expander.sh
60
prometheus-pvc
/prometheus
80
2
State: Running
Started: Sat, 25 Jun 2022 11:52:04 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 50m
memory: 256Mi
Requests:
cpu: 10m
memory: 128Mi
Environment:
AWS_DEFAULT_REGION: us-west-2
AWS_REGION: us-west-2
AWS_ROLE_ARN: arn:aws:iam::775902114032:role/project-n-dev-default-a2ea-logs
AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
Mounts:
/prometheus/ from prometheus-storage-volume (rw)
/var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v5clx (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady True
PodScheduled True
Volumes:
aws-iam-token:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 86400
prometheus-storage-volume:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: prometheus-pvc
ReadOnly: false
kube-api-access-v5clx:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: nodeUse=main
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulAttachVolume 2m15s (x2 over 3h26m) attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-0327f200-5b88-412a-a029-bc302f09333d"
It gets attached again automatically:
[ec2-user@ip-172-31-7-229 ~]$ kubectl describe pv pvc-0327f200-5b88-412a-a029-bc302f09333d
Name: pvc-0327f200-5b88-412a-a029-bc302f09333d
Labels: topology.kubernetes.io/region=us-west-2
topology.kubernetes.io/zone=us-west-2b
Annotations: kubernetes.io/createdby: aws-ebs-dynamic-provisioner
pv.kubernetes.io/bound-by-controller: yes
pv.kubernetes.io/provisioned-by: kubernetes.io/aws-ebs
Finalizers: [kubernetes.io/pv-protection]
StorageClass: hdd
Status: Terminating (lasts 4m7s)
Claim: default/prometheus-pvc
Reclaim Policy: Delete
Access Modes: RWO
VolumeMode: Filesystem
Capacity: 20Gi
Node Affinity:
Required Terms:
Term 0: topology.kubernetes.io/zone in [us-west-2b]
topology.kubernetes.io/region in [us-west-2]
Message:
Source:
Type: AWSElasticBlockStore (a Persistent Disk resource in AWS)
VolumeID: aws://us-west-2b/vol-00d432f06a2fbd806
FSType: ext4
Partition: 0
ReadOnly: false
Events: <none>
How do I fix this cleanup problem? Removing the finalizers does not delete the linked resources.
My new observation:
kubectl delete pod <podname> --force
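The command above deleted the pod but left the PVC and PV behind, so I ran the same force command on them as well. That removed the resources from Kubernetes but left the EBS volumes linked to those PVs in AWS.
Those have to be deleted manually from the console.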
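Since a forced delete removes the PV object before the EBS volume is gone, it can help to record the volume IDs first. The following is only a minimal sketch: `ebs_volume_id` is a hypothetical helper name, and it assumes the `aws://<zone>/<volume-id>` form shown in the `VolumeID` field of the PV description above.

```shell
# Hypothetical helper (not part of kubectl or the AWS CLI).
# Turns a PV VolumeID such as aws://us-west-2b/vol-00d432f06a2fbd806
# into the bare EBS volume ID.
ebs_volume_id() {
  # ${1##*/} strips the longest prefix ending in "/",
  # which leaves only the last path segment.
  printf '%s\n' "${1##*/}"
}
```

For in-tree EBS volumes like the ones here, the raw value can be read with `kubectl get pv <pv-name> -o jsonpath='{.spec.awsElasticBlockStore.volumeID}'` and passed to the helper. The resulting ID is what `aws ec2 delete-volume --volume-id <id>` expects once the volume is detached.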
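When I deleted the PV with kubectl's force option, the message below was shown; with the force option it does not seem to check whether the deletion actually happened.
The events for the PVs also still show up: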
[ec2-user@ip-172-31-14-155 .ssh]$ kubectl get events
LAST SEEN TYPE REASON OBJECT MESSAGE
37s Normal VolumeDelete persistentvolume/pvc-2d4b48d7-4da1-4872-9bd8-0afb6b94420e error deleting EBS volume "vol-0ceb4ee469f6a35ef" since volume is currently attached to "i-04fe5c6db79bea12e"
37s Normal VolumeDelete persistentvolume/pvc-6d22eb5c-cdf2-40d2-8ce2-9472c979a1de error deleting EBS volume "vol-01e2bbe04a7b30257" since volume is currently attached to "i-04fe5c6db79bea12e"
22s Normal VolumeDelete persistentvolume/pvc-c438bd0c-b90b-4138-a3db-e517fabe4d66 error deleting EBS volume "vol-06791441b4084d524" since volume is currently attached to "i-04fe5c6db79bea12e"
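Each VolumeDelete event above names the EBS volume and the instance it is still attached to, and those pairs are what block the delete. As a rough sketch (the function name `attached_volumes` is made up for illustration, and the pattern assumes the exact message wording shown in these events), the pairs can be pulled out of the event text like this:

```shell
# Hypothetical helper: reads `kubectl get events` output on stdin and
# prints one "volume-id instance-id" pair per VolumeDelete error line.
# Lines that do not match the message format are silently skipped.
attached_volumes() {
  sed -n 's/.*EBS volume "\(vol-[0-9a-f]*\)".*attached to "\(i-[0-9a-f]*\)".*/\1 \2/p'
}
```

It would be used as `kubectl get events | attached_volumes`. Each volume ID could then be given to `aws ec2 detach-volume --volume-id <id>`, but only once nothing on the instance is still writing to it.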
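Which conflicting piece needs to be fixed so that everything is cleaned up directly by terraform destroy?
Answer 1
First, when the output of "kubectl get nodes" shows the nodes as NotReady, that is not a good sign, and you should fix that first.
This is what causes the problem you see in the pod logs: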
Error from server (InternalError): Internal error occurred: Authorization error (user=kube-apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)
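It says the nodes cannot be reached, because your nodes are unavailable.
To fix the unavailable nodes, you should look at the CNI documentation.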
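As a starting point for that investigation, here is a small sketch that lists the nodes needing attention. `not_ready_nodes` is a hypothetical helper name, and it assumes the default column layout of `kubectl get nodes` (NAME first, STATUS second), as in the output in the question.

```shell
# Hypothetical helper: reads `kubectl get nodes` output on stdin and
# prints the name of every node whose STATUS is not exactly "Ready".
# The header row is skipped; a status such as "Ready,SchedulingDisabled"
# is also reported, since it is not plain "Ready".
not_ready_nodes() {
  awk '$1 != "NAME" && $2 != "Ready" { print $1 }'
}
```

Run it as `kubectl get nodes | not_ready_nodes`, then check each reported node with `kubectl describe node <name>` and read its Conditions and Events. On EKS the VPC CNI normally runs as the `aws-node` DaemonSet in `kube-system`, so `kubectl -n kube-system get pods -o wide` is a reasonable next place to look, though the actual cause may lie elsewhere.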