I ran into a problem while doing a rolling update of our website, which runs in a container in a pod on our cluster, website-cluster. The cluster contains two pods: one pod has a container running our production website, and the other pod has a container running a staging version of the same website. Here is the yaml for the production pod's replication controller:
apiVersion: v1
kind: ReplicationController
metadata:
  # These labels describe the replication controller
  labels:
    project: "website-prod"
    tier: "front-end"
    name: "website"
  name: "website"
spec: # specification of the RC's contents
  replicas: 1
  selector:
    # These labels indicate which pods the replication controller manages
    project: "website-prod"
    tier: "front-end"
    name: "website"
  template:
    metadata:
      labels:
        # These labels belong to the pod, and must match the ones immediately above
        # name: "website"
        project: "website-prod"
        tier: "front-end"
        name: "website"
    spec:
      containers:
      - name: "website"
        image: "us.gcr.io/skywatch-app/website"
        ports:
        - name: "http"
          containerPort: 80
        command: ["nginx", "-g", "daemon off;"]
        livenessProbe:
          httpGet:
            path: "/"
            port: 80
          initialDelaySeconds: 60
          timeoutSeconds: 3
We made a change that adds a new page to the website. After deploying it to the production pod, we got intermittent 404 errors while testing the production site. We use the following commands to update the pod (assume version 95.0 is currently running):
packer build website.json
gcloud docker push us.gcr.io/skywatch-app/website
gcloud container clusters get-credentials website-cluster --zone us-central1-f
kubectl rolling-update website --update-period=20s --image=us.gcr.io/skywatch-app/website:96.0
Here is the output of those commands:
==> docker: Creating a temporary directory for sharing data...
==> docker: Pulling Docker image: nginx:1.9.7
docker: 1.9.7: Pulling from library/nginx
docker: d4bce7fd68df: Already exists
docker: a3ed95caeb02: Already exists
docker: a3ed95caeb02: Already exists
docker: 573113c4751a: Already exists
docker: 31917632be33: Already exists
docker: a3ed95caeb02: Already exists
docker: 1e7c116578c5: Already exists
docker: 03c02c160fd7: Already exists
docker: f852bb4464c4: Already exists
docker: a3ed95caeb02: Already exists
docker: a3ed95caeb02: Already exists
docker: a3ed95caeb02: Already exists
docker: Digest: sha256:3b50ebc3ae6fb29b713a708d4dc5c15f4223bde18ddbf3c8730b228093788a3c
docker: Status: Image is up to date for nginx:1.9.7
==> docker: Starting docker container...
docker: Run command: docker run -v /tmp/packer-docker358675979:/packer-files -d -i -t nginx:1.9.7 /bin/bash
docker: Container ID: 0594bf37edd1311535598971140535166df907b1c19d5f76ddda97c53f884d5b
==> docker: Provisioning with shell script: /tmp/packer-shell010711780
==> docker: Uploading nginx.conf => /etc/nginx/nginx.conf
==> docker: Uploading ../dist/ => /var/www
==> docker: Uploading ../dist => /skywatch/website
==> docker: Uploading /skywatch/ssl/ => /skywatch/ssl
==> docker: Committing the container
docker: Image ID: sha256:d469880ae311d164da6786ec73afbf9190d2056accedc9d2dc186ef8ca79c4b6
==> docker: Killing the container: 0594bf37edd1311535598971140535166df907b1c19d5f76ddda97c53f884d5b
==> docker: Running post-processor: docker-tag
docker (docker-tag): Tagging image: sha256:d469880ae311d164da6786ec73afbf9190d2056accedc9d2dc186ef8ca79c4b6
docker (docker-tag): Repository: us.gcr.io/skywatch-app/website:96.0
Build 'docker' finished.
==> Builds finished. The artifacts of successful builds are:
--> docker: Imported Docker image: sha256:d469880ae311d164da6786ec73afbf9190d2056accedc9d2dc186ef8ca79c4b6
--> docker: Imported Docker image: us.gcr.io/skywatch-app/website:96.0
[2016-05-16 15:09:39,598, INFO] The push refers to a repository [us.gcr.io/skywatch-app/website]
e75005ca29bf: Preparing
5f70bf18a086: Preparing
5f70bf18a086: Preparing
5f70bf18a086: Preparing
0b3fbb980e2d: Preparing
40f240c1cbdb: Preparing
673cf6d9dedb: Preparing
5f70bf18a086: Preparing
ebfc3a74f160: Preparing
031458dc7254: Preparing
5f70bf18a086: Preparing
5f70bf18a086: Preparing
12e469267d21: Preparing
ebfc3a74f160: Waiting
031458dc7254: Waiting
12e469267d21: Waiting
5f70bf18a086: Layer already exists
673cf6d9dedb: Layer already exists
40f240c1cbdb: Layer already exists
0b3fbb980e2d: Layer already exists
ebfc3a74f160: Layer already exists
031458dc7254: Layer already exists
12e469267d21: Layer already exists
e75005ca29bf: Pushed
96.0: digest: sha256:ff865acd292409f3b5bf3c14494a6016a45d5ea831e5260304007a2b83e21189 size: 7328
[2016-05-16 15:09:40,483, INFO] Fetching cluster endpoint and auth data.
kubeconfig entry generated for website-cluster.
[2016-05-16 15:10:18,823, INFO] Created website-8c10af72294bdfc4d2d6a0e680e84f09
Scaling up website-8c10af72294bdfc4d2d6a0e680e84f09 from 0 to 1, scaling down website from 1 to 0 (keep 1 pods available, don't exceed 2 pods)
Scaling website-8c10af72294bdfc4d2d6a0e680e84f09 up to 1
Scaling website down to 0
Update succeeded. Deleting old controller: website
Renaming website-8c10af72294bdfc4d2d6a0e680e84f09 to website
replicationcontroller "website" rolling updated
This all looks fine, but after it completed we were getting random 404s on the new page. When I ran kubectl get pods, I found three pods running instead of the expected two:
NAME                                                     READY     STATUS    RESTARTS   AGE
website-8c10af72294bdfc4d2d6a0e680e84f09-iwfjo           1/1       Running   0          1d
website-keys9                                            1/1       Running   0          1d
website-staging-34caf57c958848415375d54214d98b8a-yo4sp   1/1       Running   0          3d
Using the kubectl describe pod command, I determined that pod website-8c10af72294bdfc4d2d6a0e680e84f09-iwfjo is running the new version (96.0), while pod website-keys9 is still running the old version (95.0). We get the 404 errors because incoming requests are randomly routed to the old version of the website. When I manually delete the pod running the old version, the 404 errors go away.
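(For reference, the Service that routes traffic to these pods is not shown above. A minimal sketch of such a Service, assuming it selects on the same three labels as the replication controller, which would explain why requests reach both versions:)

apiVersion: v1
kind: Service
metadata:
  name: "website"   # assumed name; the actual Service definition is not in this post
spec:
  selector:
    # every pod carrying these labels receives traffic,
    # including the leftover pod still running 95.0
    project: "website-prod"
    tier: "front-end"
    name: "website"
  ports:
  - name: "http"
    port: 80
    targetPort: 80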
Does anyone know under what circumstances a rolling update will fail to delete the pod running the old version of the website? Do I need to change something in the yaml or in the command to guarantee that the pod running the old version is always deleted?
Any help or advice on this is appreciated.
Answer 1
This is Kubernetes bug #27721. But even if it weren't, you would still have a window during the update in which user traffic is served by both old and new pods at the same time. That is fine for most applications, but in your case it is undesirable because it causes the unexpected 404s. I would suggest creating the new pods with a label set different from the old pods, for example by putting the image version in a label. You can then update the Service to select the new label, which will quickly (not atomically, but quickly) shift all traffic from the old backends to the new backends.
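A sketch of that approach, with illustrative names: each release gets its own controller whose pods carry a version label, and the Service is repointed once the new pods are ready:

apiVersion: v1
kind: ReplicationController
metadata:
  name: "website-96-0"   # one controller per release (name illustrative)
spec:
  replicas: 1
  selector:
    project: "website-prod"
    tier: "front-end"
    name: "website"
    version: "96.0"      # the new label that distinguishes releases
  template:
    metadata:
      labels:
        project: "website-prod"
        tier: "front-end"
        name: "website"
        version: "96.0"
    spec:
      containers:
      - name: "website"
        image: "us.gcr.io/skywatch-app/website:96.0"
        ports:
        - name: "http"
          containerPort: 80
        command: ["nginx", "-g", "daemon off;"]
        livenessProbe:
          httpGet:
            path: "/"
            port: 80
          initialDelaySeconds: 60
          timeoutSeconds: 3

Once the 96.0 pod is Ready, shift all traffic in one step by merging the version label into the Service's selector (assuming the Service is named website):

kubectl patch service website -p '{"spec":{"selector":{"version":"96.0"}}}'

After that, the 95.0 controller and its pod can be deleted at leisure, since the Service no longer selects them.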
It would probably be easier, though, to switch to using Deployments.
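For example, a minimal sketch of the same workload as a Deployment (shown with the current apps/v1 schema; the field values are carried over from the RC above):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: website
  labels:
    project: "website-prod"
    tier: "front-end"
spec:
  replicas: 1
  selector:
    matchLabels:
      project: "website-prod"
      tier: "front-end"
      name: "website"
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # always keep one pod serving
      maxSurge: 1         # allow one extra pod during the rollout
  template:
    metadata:
      labels:
        project: "website-prod"
        tier: "front-end"
        name: "website"
    spec:
      containers:
      - name: "website"
        image: "us.gcr.io/skywatch-app/website:96.0"
        ports:
        - name: "http"
          containerPort: 80
        command: ["nginx", "-g", "daemon off;"]
        livenessProbe:
          httpGet:
            path: "/"
            port: 80
          initialDelaySeconds: 60
          timeoutSeconds: 3

Rolling out the next release is then a single command (97.0 here standing in for whatever that version is), and the Deployment controller handles scaling down and cleaning up the old ReplicaSet:

kubectl set image deployment/website website=us.gcr.io/skywatch-app/website:97.0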