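I would like to use GKE node auto-provisioning to create a node pool with GPUs on demand (that is, when I start a Job that needs GPU resources).
Following the GCP tutorial, I have set up a cluster with cluster autoscaling and node auto-provisioning (NAP) enabled. NAP has limits set for CPU, memory, and GPU: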
resourceLimits:
- maximum: '15'
  minimum: '1'
  resourceType: cpu
- maximum: '150'
  minimum: '1'
  resourceType: memory
- maximum: '2'
  resourceType: nvidia-tesla-k80
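I know that NAP works, because it has already spun up a few nodes for me, but all of them were "normal" nodes (without a GPU).
Now I want to "force" NAP to create a node pool with GPU machines. No GPU nodes existed on the cluster before this. To do that, I am creating a Job with a configuration file like this: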
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
spec:
  ttlSecondsAfterFinished: 100
  template:
    metadata:
      name: training-job
    spec:
      nodeSelector:
        gpu: "true"
        cloud.google.com/gke-spot: "true"
        cloud.google.com/gke-accelerator: nvidia-tesla-k80
      tolerations:
      - key: cloud.google.com/gke-spot
        operator: Equal
        value: "true"
        effect: NoSchedule
      containers:
      - name: gpu-test
        image: przomys/gpu-test
        resources:
          requests:
            cpu: 500m
          limits:
            nvidia.com/gpu: 2 # requesting 2 GPUs
      restartPolicy: Never # Do not restart containers after they exit
The Job gets created, but it is then marked as "Unschedulable", and the cluster autoscaler (CA) log shows this error:
{
  "noDecisionStatus": {
    "measureTime": "1650370630",
    "noScaleUp": {
      "unhandledPodGroups": [
        {
          "rejectedMigs": [
            {
              "reason": {
                "messageId": "no.scale.up.mig.failing.predicate",
                "parameters": [
                  "NodeAffinity",
                  "node(s) didn't match Pod's node affinity/selector"
                ]
              },
              "mig": {
                "zone": "us-central1-c",
                "nodepool": "pool-3",
                "name": "gke-cluster-activeid-pool-3-af526144-grp"
              }
            },
            {
              "mig": {
                "name": "gke-cluster-activeid-nap-e2-standard--c7a4d4f1-grp",
                "zone": "us-central1-c",
                "nodepool": "nap-e2-standard-2-w52e84k8"
              },
              "reason": {
                "parameters": [
                  "NodeAffinity",
                  "node(s) didn't match Pod's node affinity/selector"
                ],
                "messageId": "no.scale.up.mig.failing.predicate"
              }
            }
          ],
          "napFailureReasons": [
            {
              "parameters": [
                "Any GPU."
              ],
              "messageId": "no.scale.up.nap.pod.gpu.no.limit.defined"
            }
          ],
          "podGroup": {
            "totalPodCount": 1,
            "samplePod": {
              "controller": {
                "apiVersion": "batch/v1",
                "kind": "Job",
                "name": "training-job"
              },
              "namespace": "default",
              "name": "training-job-7k8zd"
            }
          }
        }
      ],
      "unhandledPodGroupsTotalCount": 1
    }
  }
}
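My guess is that no.scale.up.nap.pod.gpu.no.limit.defined is the most important part. The GCP tutorial points me here. But I do have this limit defined, so I am out of ideas...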
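For reference, here is a minimal sketch of how the rejection reasons can be pulled out of such an event, so that the NAP failure is easy to tell apart from the per-node-pool ones. It assumes the event has been saved as JSON with the same field names as in the log above; the helper name summarize_no_scale_up is made up for illustration and is not part of any GCP tooling, and the embedded event is a trimmed copy of the log.

```python
import json

# Trimmed copy of the noScaleUp event from the log above (only the fields used here).
EVENT = json.loads("""
{
  "noDecisionStatus": {
    "noScaleUp": {
      "unhandledPodGroups": [
        {
          "rejectedMigs": [
            {
              "mig": {"nodepool": "pool-3"},
              "reason": {
                "messageId": "no.scale.up.mig.failing.predicate",
                "parameters": ["NodeAffinity", "node(s) didn't match Pod's node affinity/selector"]
              }
            },
            {
              "mig": {"nodepool": "nap-e2-standard-2-w52e84k8"},
              "reason": {
                "messageId": "no.scale.up.mig.failing.predicate",
                "parameters": ["NodeAffinity", "node(s) didn't match Pod's node affinity/selector"]
              }
            }
          ],
          "napFailureReasons": [
            {
              "messageId": "no.scale.up.nap.pod.gpu.no.limit.defined",
              "parameters": ["Any GPU."]
            }
          ],
          "podGroup": {"samplePod": {"name": "training-job-7k8zd"}}
        }
      ]
    }
  }
}
""")


def summarize_no_scale_up(event):
    """Hypothetical helper: list (pod, source, messageId, parameters) per rejection."""
    rows = []
    no_scale_up = event.get("noDecisionStatus", {}).get("noScaleUp", {})
    for group in no_scale_up.get("unhandledPodGroups", []):
        pod = group.get("podGroup", {}).get("samplePod", {}).get("name", "<unknown>")
        # Existing node pools (MIGs) that were considered and rejected.
        for rejected in group.get("rejectedMigs", []):
            reason = rejected.get("reason", {})
            source = "mig:" + rejected.get("mig", {}).get("nodepool", "<unknown>")
            rows.append((pod, source, reason.get("messageId"), reason.get("parameters", [])))
        # Reasons why node auto-provisioning did not create a new node pool.
        for reason in group.get("napFailureReasons", []):
            rows.append((pod, "nap", reason.get("messageId"), reason.get("parameters", [])))
    return rows


if __name__ == "__main__":
    for row in summarize_no_scale_up(EVENT):
        print(row)
```

Read this way, the two existing node pools are rejected only because of the node selector, while the single NAP entry is the GPU limit message.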
Maybe someone knows what I am doing wrong?