While creating a GKE deployment that uses an NVIDIA T4 GPU, I ran into this problem:

> Node scale up in zones europe-west1-b associated with this pod failed: GCE out of resources. Pod is at risk of not being scheduled.

so I thought I would try creating a compute reservation and then consuming it from my GKE cluster.
I created a reservation, described below:
```
$ gcloud compute reservations describe reservation-t4-gpu --zone europe-west1-d
creationTimestamp: '2024-03-02T10:54:53.354-08:00'
id: '7770049776700017426'
kind: compute#reservation
name: reservation-t4-gpu
resourceStatus: {}
selfLink: https://www.googleapis.com/compute/v1/projects/my-project/zones/europe-west1-d/reservations/reservation-t4-gpu
shareSettings:
  shareType: LOCAL
specificReservation:
  assuredCount: '1'
  count: '1'
  inUseCount: '0'
  instanceProperties:
    guestAccelerators:
    - acceleratorCount: 1
      acceleratorType: nvidia-tesla-t4
    machineType: custom-1-8192-ext
    minCpuPlatform: Any CPU Platform
specificReservationRequired: false
status: READY
zone: https://www.googleapis.com/compute/v1/projects/my-project/zones/europe-west1-d
```
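For reference, the reservation was created with a command roughly like the following (reconstructed from the describe output above, so the exact flag spellings are an approximation rather than the literal command I ran):

```
gcloud compute reservations create reservation-t4-gpu \
    --zone=europe-west1-d \
    --vm-count=1 \
    --machine-type=custom-1-8192-ext \
    --accelerator=count=1,type=nvidia-tesla-t4
# Without --require-specific-reservation, the reservation ends up with
# specificReservationRequired: false, matching the describe output above.
```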
and I want to consume it from my cluster, partially described below (let me know if more details are needed):
```
autopilot:
  enabled: true
resourceLimits:
- maximum: '1000000000'
  resourceType: cpu
- maximum: '1000000000'
  resourceType: memory
- maximum: '1000000000'
  resourceType: nvidia-tesla-t4
- maximum: '1000000000'
  resourceType: nvidia-tesla-a100
currentMasterVersion: 1.28.6-gke.1456000
zone: europe-west1
selfLink: https://container.googleapis.com/v1/projects/my-project/locations/europe-west1/clusters/my-cluster
```
I then followed the Google docs example <here> and applied the following Pod:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: specific-same-project-pod
spec:
  nodeSelector:
    cloud.google.com/compute-class: "Accelerator"
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
    cloud.google.com/reservation-name: reservation-t4-gpu
    cloud.google.com/reservation-affinity: "specific"
  containers:
  - name: my-container
    image: "k8s.gcr.io/pause"
    resources:
      requests:
        cpu: 1
        memory: "8Gi"
      limits:
        nvidia.com/gpu: 1
```
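To rule out a simple label typo on my side, I cross-checked the Pod's nodeSelector against the reservation by hand. The sketch below encodes that check; the dictionaries are hand-copied from the outputs above, and the last check reflects my reading of the docs (that `specific` affinity expects a reservation created with `--require-specific-reservation`), which may itself be wrong:

```python
# Sanity check: compare the reservation's properties (hand-copied from the
# `gcloud compute reservations describe` output) against the Pod's
# nodeSelector labels. Illustrative only, not an official GKE validation.

reservation = {
    "name": "reservation-t4-gpu",
    "zone": "europe-west1-d",
    "specificReservationRequired": False,
    "accelerator": "nvidia-tesla-t4",
}

node_selector = {
    "cloud.google.com/gke-accelerator": "nvidia-tesla-t4",
    "cloud.google.com/reservation-name": "reservation-t4-gpu",
    "cloud.google.com/reservation-affinity": "specific",
}

def check(reservation, node_selector):
    """Return a list of human-readable mismatch warnings."""
    warnings = []
    if node_selector.get("cloud.google.com/reservation-name") != reservation["name"]:
        warnings.append("reservation name mismatch")
    if node_selector.get("cloud.google.com/gke-accelerator") != reservation["accelerator"]:
        warnings.append("accelerator type mismatch")
    # Assumption (my reading of the docs): 'specific' reservation affinity
    # expects specificReservationRequired: true on the reservation.
    if (node_selector.get("cloud.google.com/reservation-affinity") == "specific"
            and not reservation["specificReservationRequired"]):
        warnings.append("affinity 'specific' but specificReservationRequired is false")
    return warnings

print(check(reservation, node_selector))
```

The names and accelerator type match, so the only flag this raises for me is the `specificReservationRequired: false` one.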
But it does not appear to consume the reservation:
```
$ kubectl describe pod specific-same-project-pod
Name:             specific-same-project-pod
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           <none>
Annotations:      autopilot.gke.io/resource-adjustment:
                    {"input":{"containers":[{"limits":{"nvidia.com/gpu":"1"},"requests":{"cpu":"1","memory":"8Gi","nvidia.com/gpu":"1"},"name":"my-container"}...
                  autopilot.gke.io/warden-version: 2.8.73
                  cloud.google.com/cluster_autoscaler_unhelpable_since: 2024-03-02T22:50:14+0000
                  cloud.google.com/cluster_autoscaler_unhelpable_until: Inf
Status:           Pending
SeccompProfile:   RuntimeDefault
IP:
IPs:              <none>
Containers:
  my-container:
    Image:      k8s.gcr.io/pause
    Port:       <none>
    Host Port:  <none>
    Limits:
      cloud.google.com/pod-slots:  1
      nvidia.com/gpu:              1
    Requests:
      cloud.google.com/pod-slots:  1
      cpu:                         1
      ephemeral-storage:           1Gi
      memory:                      8Gi
      nvidia.com/gpu:              1
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ql88t (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  kube-api-access-ql88t:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              cloud.google.com/compute-class=Accelerator
                             cloud.google.com/gke-accelerator=nvidia-tesla-t4
                             cloud.google.com/gke-accelerator-count=1
                             cloud.google.com/pod-isolation=1
                             cloud.google.com/reservation-affinity=specific
                             cloud.google.com/reservation-name=reservation-t4-gpu
Tolerations:                 cloud.google.com/compute-class=Accelerator:NoSchedule
                             cloud.google.com/gke-accelerator=nvidia-tesla-t4:NoSchedule
                             cloud.google.com/machine-family:NoSchedule op=Exists
                             cloud.google.com/pod-slots:NoSchedule op=Exists
                             kubernetes.io/arch:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason             Age               From                                   Message
  ----     ------             ----              ----                                   -------
  Warning  FailedScheduling   1s (x2 over 35s)  gke.io/optimize-utilization-scheduler  0/6 nodes are available: 6 node(s) didn't match Pod's node affinity/selector. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling..
  Normal   NotTriggerScaleUp  1s                cluster-autoscaler                     pod didn't trigger scale-up (it wouldn't fit if a new node is added): 21 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {cloud.google.com/gke-quick-remove: true}
```
At this point I don't know what went wrong. It's the same project, the same region, and the same resource request, and I did everything the way the tutorial describes. Any ideas?