GKE context deadline exceeded: CreateContainerError and failed to reserve container name

2024-6-1

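I'm running a GKE cluster, and occasionally one of the nodes has trouble with a specific container built from php7-alpine.

We run two types of containers: the first is built from php7-alpine, and the second is built from the first (php7-alpine -> Base App -> App with extra). Only our Base App pods have these issues.

So far I have seen the following errors: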

failed to reserve container name
FailedSync: error determining status: rpc error: code = Unknown desc = Error: No such container: XYZ
Error: context deadline exceeded context deadline exceeded: CreateContainerError

There is plenty of disk space left on the nodes, and kubectl describe pod doesn't contain any relevant or useful information.

More details:

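Out of 50 Base App pods, 6 show the error, and none of the App with extra pods fail.
All failing pods are always on the same node.
We have already recreated/replaced the nodes. The problem persists: if we replace a node that has failing pods, there is a 50/50 chance that all pods on the next node will be fine. The issue seems somewhat random.
Running GKE v1.17.9-gke.1504.
We are running on preemptible nodes.
The container images are fairly large (~3 GB, working on reducing that).
The issue started roughly a month ago.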
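
One way to double-check the observation that all failing pods sit on the same node is to count them per node from the output of `kubectl get pods -o json`. The following is a minimal Python sketch, not part of the original setup: the function name `failing_pods_by_node` and the `FAILURE_REASONS` set are hypothetical, and it assumes the failing containers are reported in a waiting state with the reason `CreateContainerError`.

```python
from collections import Counter

# Waiting reasons treated as "container could not be created".
# This set is an assumption for the sketch; extend it as needed.
FAILURE_REASONS = {"CreateContainerError"}


def failing_pods_by_node(pod_list: dict) -> Counter:
    """Count, per node, the pods that have a container stuck in a failure reason.

    `pod_list` is the parsed JSON document printed by `kubectl get pods -o json`.
    """
    counts = Counter()
    for pod in pod_list.get("items", []):
        node = pod.get("spec", {}).get("nodeName", "<unscheduled>")
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for status in statuses:
            waiting = status.get("state", {}).get("waiting") or {}
            if waiting.get("reason") in FAILURE_REASONS:
                counts[node] += 1
                break  # count each pod only once
    return counts
```

To use it, save the pod list with `kubectl get pods -o json > pods.json`, load the file with `json.load`, and pass the result to the function. Calling `.most_common()` on the returned counter lists the nodes with the most failing pods first.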

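I really don't know what to look for; I spent a long time searching for similar issues. Any help is much appreciated!

Update:

Here is the deployment: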

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: my-app
    appType: web
    env: prod
  name: my-app
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: my-app
        version: v1.0
    spec:
      containers:
        - image: richarvey/nginx-php-fpm:latest  # We build upon that image to add content and services
          lifecycle:
            preStop:
              exec:
                command:
                  - /entry-point/stop.sh
          name: web
          ports:
            - containerPort: 80
              protocol: TCP
          resources:
            requests:
              cpu: 50m
              memory: 1500Mi
        - image: redis:4.0-alpine
          name: redis
          resources:
            requests:
              cpu: 25m
              memory: 25Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File

Answer 1

The issue has been investigated and resolved.

https://github.com/containerd/containerd/issues/4604
