While creating a GKE deployment that uses an NVIDIA T4 GPU, I ran into this problem:

> Node scale up in zones europe-west1-b associated with this pod failed: GCE out of resources. Pod is at risk of not being scheduled.

so I thought I would try creating a compute reservation and then consuming it from my GKE cluster.
I created a reservation, described below:
```
$ gcloud compute reservations describe reservation-t4-gpu --zone europe-west1-d
creationTimestamp: '2024-03-02T10:54:53.354-08:00'
id: '7770049776700017426'
kind: compute#reservation
name: reservation-t4-gpu
resourceStatus: {}
selfLink: https://www.googleapis.com/compute/v1/projects/my-project/zones/europe-west1-d/reservations/reservation-t4-gpu
shareSettings:
  shareType: LOCAL
specificReservation:
  assuredCount: '1'
  count: '1'
  inUseCount: '0'
  instanceProperties:
    guestAccelerators:
    - acceleratorCount: 1
      acceleratorType: nvidia-tesla-t4
    machineType: custom-1-8192-ext
    minCpuPlatform: Any CPU Platform
specificReservationRequired: false
status: READY
zone: https://www.googleapis.com/compute/v1/projects/my-project/zones/europe-west1-d
```
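For reference, the reservation was created with a command roughly like the following (reconstructed from the describe output above, so the exact flag spellings are an approximation rather than the literal command I ran):

```
gcloud compute reservations create reservation-t4-gpu \
    --zone=europe-west1-d \
    --vm-count=1 \
    --machine-type=custom-1-8192-ext \
    --accelerator=count=1,type=nvidia-tesla-t4
# Without --require-specific-reservation, the reservation ends up with
# specificReservationRequired: false, matching the describe output above.
```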
and I want to consume it from my cluster, partially described below (let me know if more details are needed):
```
autopilot:
  enabled: true
resourceLimits:
- maximum: '1000000000'
  resourceType: cpu
- maximum: '1000000000'
  resourceType: memory
- maximum: '1000000000'
  resourceType: nvidia-tesla-t4
- maximum: '1000000000'
  resourceType: nvidia-tesla-a100
currentMasterVersion: 1.28.6-gke.1456000
zone: europe-west1
selfLink: https://container.googleapis.com/v1/projects/my-project/locations/europe-west1/clusters/my-cluster
```
I then followed the Google docs example <here> and applied the following Pod:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: specific-same-project-pod
spec:
  nodeSelector:
    cloud.google.com/compute-class: "Accelerator"
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
    cloud.google.com/reservation-name: reservation-t4-gpu
    cloud.google.com/reservation-affinity: "specific"
  containers:
  - name: my-container
    image: "k8s.gcr.io/pause"
    resources:
      requests:
        cpu: 1
        memory: "8Gi"
      limits:
        nvidia.com/gpu: 1
```
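To rule out a simple label typo on my side, I cross-checked the Pod's nodeSelector against the reservation by hand. The sketch below encodes that check; the dictionaries are hand-copied from the outputs above, and the last check reflects my reading of the docs (that `specific` affinity expects a reservation created with `--require-specific-reservation`), which may itself be wrong:

```python
# Sanity check: compare the reservation's properties (hand-copied from the
# `gcloud compute reservations describe` output) against the Pod's
# nodeSelector labels. Illustrative only, not an official GKE validation.

reservation = {
    "name": "reservation-t4-gpu",
    "zone": "europe-west1-d",
    "specificReservationRequired": False,
    "accelerator": "nvidia-tesla-t4",
}

node_selector = {
    "cloud.google.com/gke-accelerator": "nvidia-tesla-t4",
    "cloud.google.com/reservation-name": "reservation-t4-gpu",
    "cloud.google.com/reservation-affinity": "specific",
}

def check(reservation, node_selector):
    """Return a list of human-readable mismatch warnings."""
    warnings = []
    if node_selector.get("cloud.google.com/reservation-name") != reservation["name"]:
        warnings.append("reservation name mismatch")
    if node_selector.get("cloud.google.com/gke-accelerator") != reservation["accelerator"]:
        warnings.append("accelerator type mismatch")
    # Assumption (my reading of the docs): 'specific' reservation affinity
    # expects specificReservationRequired: true on the reservation.
    if (node_selector.get("cloud.google.com/reservation-affinity") == "specific"
            and not reservation["specificReservationRequired"]):
        warnings.append("affinity 'specific' but specificReservationRequired is false")
    return warnings

print(check(reservation, node_selector))
```

The names and accelerator type match, so the only flag this raises for me is the `specificReservationRequired: false` one.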
But it does not appear to consume the reservation:
```
$ kubectl describe pod specific-same-project-pod
Name:             specific-same-project-pod
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           <none>
Annotations:      autopilot.gke.io/resource-adjustment:
                    {"input":{"containers":[{"limits":{"nvidia.com/gpu":"1"},"requests":{"cpu":"1","memory":"8Gi","nvidia.com/gpu":"1"},"name":"my-container"}...
                  autopilot.gke.io/warden-version: 2.8.73
                  cloud.google.com/cluster_autoscaler_unhelpable_since: 2024-03-02T22:50:14+0000
                  cloud.google.com/cluster_autoscaler_unhelpable_until: Inf
Status:           Pending
SeccompProfile:   RuntimeDefault
IP:
IPs:              <none>
Containers:
  my-container:
    Image:      k8s.gcr.io/pause
    Port:       <none>
    Host Port:  <none>
    Limits:
      cloud.google.com/pod-slots:  1
      nvidia.com/gpu:              1
    Requests:
      cloud.google.com/pod-slots:  1
      cpu:                         1
      ephemeral-storage:           1Gi
      memory:                      8Gi
      nvidia.com/gpu:              1
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ql88t (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  kube-api-access-ql88t:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              cloud.google.com/compute-class=Accelerator
                             cloud.google.com/gke-accelerator=nvidia-tesla-t4
                             cloud.google.com/gke-accelerator-count=1
                             cloud.google.com/pod-isolation=1
                             cloud.google.com/reservation-affinity=specific
                             cloud.google.com/reservation-name=reservation-t4-gpu
Tolerations:                 cloud.google.com/compute-class=Accelerator:NoSchedule
                             cloud.google.com/gke-accelerator=nvidia-tesla-t4:NoSchedule
                             cloud.google.com/machine-family:NoSchedule op=Exists
                             cloud.google.com/pod-slots:NoSchedule op=Exists
                             kubernetes.io/arch:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason             Age               From                                   Message
  ----     ------             ----              ----                                   -------
  Warning  FailedScheduling   1s (x2 over 35s)  gke.io/optimize-utilization-scheduler  0/6 nodes are available: 6 node(s) didn't match Pod's node affinity/selector. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling..
  Normal   NotTriggerScaleUp  1s                cluster-autoscaler                     pod didn't trigger scale-up (it wouldn't fit if a new node is added): 21 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {cloud.google.com/gke-quick-remove: true}
```
At this point I don't know what went wrong. It's the same project, the same region, and the same resource request, and I did everything the way the tutorial describes. Any ideas?