K8s failed Job keeps running (duration keeps counting)

I defined a test Job:

apiVersion: batch/v1
kind: Job
metadata:
  name: testjob
spec:
  activeDeadlineSeconds: 100
  backoffLimit: 3
  template:
    spec:
      containers:
      - name: testjob
        image: bitnami/kubectl:1.20
        imagePullPolicy: IfNotPresent
        command:
        - /bin/sh
        - -c
        - echo "Test" && exit 1
      restartPolicy: Never

All of the pods fail "as expected", but the Job's duration counter never stops.

$ kubectl get pods,jobs
NAME                                            READY   STATUS    RESTARTS   AGE
pod/testjob-s2cbf                               0/1     Error     0          3m15s
pod/testjob-nhfgn                               0/1     Error     0          3m14s
pod/testjob-8jw74                               0/1     Error     0          3m4s
pod/testjob-jh7hl                               0/1     Error     0          2m24s

NAME                COMPLETIONS   DURATION   AGE
job.batch/testjob   0/1           3m15s      3m15s
$ kubectl describe job testjob
Name:                     testjob
Namespace:                default
Selector:                 controller-uid=8a1f31c7-8d9d-4b4d-a687-e8e297509a71
Labels:                   controller-uid=8a1f31c7-8d9d-4b4d-a687-e8e297509a71
                          job-name=testjob
Annotations:              <none>
Parallelism:              1
Completions:              1
Start Time:               Wed, 17 Mar 2021 18:13:56 +0000
Active Deadline Seconds:  100s
Pods Statuses:            0 Running / 0 Succeeded / 4 Failed
Pod Template:
  Labels:  controller-uid=8a1f31c7-8d9d-4b4d-a687-e8e297509a71
           job-name=testjob
  Containers:
   testjob:
    Image:      bitnami/kubectl:1.20
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/sh
      -c
      echo "Test" && exit 1
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:
  Type     Reason                Age    From            Message
  ----     ------                ----   ----            -------
  Normal   SuccessfulCreate      4m11s  job-controller  Created pod: testjob-s2cbf
  Normal   SuccessfulCreate      4m10s  job-controller  Created pod: testjob-nhfgn
  Normal   SuccessfulCreate      4m     job-controller  Created pod: testjob-8jw74
  Normal   SuccessfulCreate      3m20s  job-controller  Created pod: testjob-jh7hl
  Warning  BackoffLimitExceeded  2m     job-controller  Job has reached the specified backoff limit

However, if one of the pods completes successfully (status: Completed), the duration counter stops as expected.
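For comparison, a minimal variant of the same manifest where the command exits 0 shows that behaviour: the pod reaches Completed and DURATION stops increasing (the name testjob-ok is only for this sketch):

apiVersion: batch/v1
kind: Job
metadata:
  name: testjob-ok
spec:
  activeDeadlineSeconds: 100
  backoffLimit: 3
  template:
    spec:
      containers:
      - name: testjob-ok
        image: bitnami/kubectl:1.20
        imagePullPolicy: IfNotPresent
        command:
        - /bin/sh
        - -c
        - echo "Test" && exit 0   # exit 0 -> pod Completed, duration counter stops
      restartPolicy: Never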

What is wrong here?

Answer 1

If a Job completes successfully (condition type=Complete), its .status.completionTime is set to a specific date. If the Job failed (condition type=Failed), .status.completionTime is not set at all, so DURATION keeps counting up (to be honest, I'm not sure whether this is a bug).
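You can check this directly on the Job object; a quick sketch (replace testjob with your Job name):

# For a failed Job this prints an empty line: the field is simply absent
$ kubectl get job testjob -o jsonpath='{.status.completionTime}{"\n"}'

# What gets recorded instead is a condition of type Failed (reason BackoffLimitExceeded)
$ kubectl get job testjob -o jsonpath='{.status.conditions[0].type}{"\n"}'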


I created a simple example to illustrate how this works.

I have two Jobs: testjob (type=Failed) and testjob-2 (type=Complete):

$ kubectl get jobs
NAME        COMPLETIONS   DURATION   AGE
testjob     0/1           3m15s      3m15s
testjob-2   1/1           1s         2m49s

We can display more information with the -o custom-columns= option.
Note: as you can see, .status.completionTime is not set for the failed Job.

$ kubectl get jobs testjob testjob-2 -o custom-columns=NAME:.metadata.name,TYPE:.status.conditions[].type,REASON:.status.conditions[].reason,COMPLETIONTIME:.status.completionTime
NAME        TYPE       REASON                 COMPLETIONTIME
testjob     Failed     BackoffLimitExceeded   <none>
testjob-2   Complete   <none>                 2021-03-23T15:51:33Z
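If you need a timestamp for when the failed Job actually stopped retrying, the Failed condition carries one. This is a sketch based on the JobCondition field lastTransitionTime, which is not shown in the default columns:

$ kubectl get job testjob -o jsonpath='{.status.conditions[?(@.type=="Failed")].lastTransitionTime}{"\n"}'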

You can also find useful information on GitHub: the API documentation for the Job status.
