gcloud GKE restarts pods when adding a node pool

2024-5-31

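We are using Google Kubernetes Engine (through gcloud) for machine learning algorithm development. We set up a cluster with a single pod for code development, then spin up a temporary node pool of 256 preemptible minions to test the algorithms on our dataset. Developers ssh into the development pod, edit code, and then run Kubernetes jobs on the minion pool.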
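
To make the workflow concrete, here is a rough sketch of the kind of job a developer would submit. It assumes the minion pool carries the `algoalpha=minion` label and the `cloud.google.com/gke-preemptible=true:NoSchedule` taint set by the node-pool command later in this question. The helper name `render_minion_job`, the job name, and the image are hypothetical placeholders, not our actual setup.

```shell
#!/usr/bin/env bash
# Sketch only: print a Job manifest aimed at the minion pool.
# render_minion_job, the job name, and the image are hypothetical.

render_minion_job() {
  local job_name="$1" image="$2"
  cat <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: ${job_name}
spec:
  template:
    spec:
      restartPolicy: Never
      # Schedule only onto nodes labelled algoalpha=minion.
      nodeSelector:
        algoalpha: minion
      # Tolerate the NoSchedule taint placed on the preemptible minions.
      tolerations:
      - key: cloud.google.com/gke-preemptible
        operator: Equal
        value: "true"
        effect: NoSchedule
      containers:
      - name: ${job_name}
        image: ${image}
EOF
}

# Usage (needs kubectl configured for the cluster):
#   render_minion_job algo-test example/image:tag | kubectl apply -f -
```

Without the toleration the job's pods would stay pending, since the taint keeps everything else off the minions.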

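The problem is that when we create the minion node pool, the development pod in the default pool is often (but not always) terminated and restarted. Why? The minion node pool usually takes about 3-5 minutes to come up. It looks as if gcloud has to upgrade the default node pool to accommodate the minion node pool. Is there a way to pre-provision the cluster so the restart is avoided, or to reduce the minions' startup time?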

Here are the commands I am using:

Create the initial cluster:

gcloud beta container clusters create $CLUSTER_NAME \
        --machine-type=n1-highmem-4 \
        --min-cpu-platform="Intel Sandy Bridge" \
        --num-nodes=1 \
        --enable-autoscaling \
        --min-nodes=1 \
        --max-nodes=4 \
        --disk-size=50 \
        --node-labels=algoalpha=control \
        --scopes=cloud-platform,cloud-source-repos-ro

Cluster creation script: https://gist.github.com/4590040f27f3cf17562baae5ae245b60

Create the minions:

gcloud beta container node-pools create algoalpha-minions \
       --cluster $CLUSTER_NAME \
       --enable-autoscaling \
       --num-nodes=$NUM_NODES \
       --min-nodes=0 \
       --max-nodes=$((NUM_NODES * 2)) \
       --preemptible \
       --machine-type=n1-highmem-16 \
       --disk-size=20 \
       --min-cpu-platform="Intel Sandy Bridge" \
       --node-labels=algoalpha=minion \
       --node-taints=cloud.google.com/gke-preemptible="true":NoSchedule

Minion creation script: https://gist.github.com/1391658975d3a28444ac823233c334da
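
For reference, this is how the sizing flags expand for the 256 minions mentioned above. It is a dry run that only prints the flags and never calls gcloud; `minion_pool_sizing` is a hypothetical helper written for this illustration.

```shell
#!/usr/bin/env bash
# Dry run: print the sizing flags from the node-pool command
# instead of passing them to gcloud.
minion_pool_sizing() {
  local num_nodes="$1"
  echo "--num-nodes=${num_nodes} --min-nodes=0 --max-nodes=$((num_nodes * 2))"
}

minion_pool_sizing 256
# prints: --num-nodes=256 --min-nodes=0 --max-nodes=512
```

So the autoscaler is allowed to grow the minion pool to 512 nodes, twice the initial request.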

Is there a better way?
