GKE cannot schedule newly created Pods that require GPUs onto newly added GPU nodes

2024-6-1

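After I add a new node pool with GPUs, Google Kubernetes Engine fails to schedule newly created pods that require GPUs onto the new nodes. I assumed this would happen automatically, but apparently it does not for GPU resources: the new pods stay in the "Pending" state forever. How can I fix this?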

Edit: here is the deployment YAML file. I do not want to bind the deployment to a specific node:

    ---
    apiVersion: machinelearning.seldon.io/v1alpha2
    kind: SldDeployment
    metadata:
      labels:
        app: sld
      name: trs-sld
      namespace: trs
    spec:
      annotations:
        project_name: Trs
        deployment_version: v1.0
        seldon.io/rest-connect-retries: '5'
        seldon.io/grpc-connect-retries: '5'
        seldon.io/istio-retries: '10' 
        seldon.io/istio-retries-timeout: '12' 
      name: trs
      predictors:
      - componentSpecs:
        - spec:
            containers:
            - image: eu.gcr.io/trs-141513/trs-native:latest
              imagePullPolicy: Always
              name: classifier
              resources:
                limits:
                  nvidia.com/gpu: 2
              volumeMounts:
                - mountPath: /etc/google_storage/creds
                  name: service-account-creds
                  readOnly: true
            volumes:
              - name: service-account-creds
                secret:
                  secretName: service-account-creds
            terminationGracePeriodSeconds: 20
        graph:
          children: []
          name: classifier
          endpoint:
            type: REST
          type: MODEL
        name: model
        replicas: 1
        annotations:
          predictor_version: v1.0
    ---

Answer 1

It turns out the GPU drivers have to be installed each time new nodes are added, for example on Ubuntu:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml
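To check whether the driver install actually took effect, it can help to look at what each node advertises under `status.allocatable`. As far as I understand it, a pod with `nvidia.com/gpu: 2` in its limits stays Pending until some node reports at least that many GPUs. Below is a rough sketch that reads the output of `kubectl get nodes -o json`; the helper names `allocatable_gpus` and `nodes_that_fit` are made up for illustration, and it only looks at the advertised total, not at GPUs already claimed by other pods.

```python
import json
import subprocess

GPU_RESOURCE = "nvidia.com/gpu"


def allocatable_gpus(node_list):
    """Map node name -> GPU count advertised in status.allocatable.

    `node_list` is the parsed JSON from `kubectl get nodes -o json`.
    Nodes that advertise no GPU resource are reported as 0.
    """
    result = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # The quantity arrives as a string such as "2".
        result[name] = int(allocatable.get(GPU_RESOURCE, "0"))
    return result


def nodes_that_fit(node_list, gpus_needed):
    """Sorted names of nodes advertising at least `gpus_needed` GPUs."""
    counts = allocatable_gpus(node_list)
    return sorted(name for name, count in counts.items() if count >= gpus_needed)


if __name__ == "__main__":
    # Assumes kubectl is on PATH and already pointed at the cluster.
    raw = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    nodes = json.loads(raw)
    for name, count in sorted(allocatable_gpus(nodes).items()):
        print(f"{name}\t{GPU_RESOURCE}={count}")
    # 2 matches the nvidia.com/gpu limit in the deployment from the question.
    print("nodes that could fit the pod:", nodes_that_fit(nodes, 2))
```

If the new GPU nodes show up with a count of 0, the drivers (or the device plugin) are probably not running on them yet, which would match the Pending pods described in the question.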
