在 GKE 日志中检测 Kubernetes OOMKilled 事件

在 GKE 日志中检测 Kubernetes OOMKilled 事件

我想为 OOMKilled 事件设置检测,检查 pod 时它看起来像这样:

Name:   pnovotnak-manhole-123456789-82l2h
Namespace:  test
Node:   test-cluster-cja8smaK-oQSR/10.x.x.x
Start Time: Fri, 03 Feb 2017 14:34:57 -0800
Labels:   pod-template-hash=123456789
    run=pnovotnak-manhole
Status:   Running
IP:   10.x.x.x
Controllers:  ReplicaSet/pnovotnak-manhole-123456789
Containers:
  pnovotnak-manhole:
    Container ID: docker://...
    Image:    pnovotnak/it
    Image ID:   docker://sha256:...
    Port:
    Limits:
      cpu:  2
      memory: 3Gi
    Requests:
      cpu:    200m
      memory:   256Mi
    State:    Running
      Started:    Fri, 03 Feb 2017 14:41:12 -0800
    Last State:   Terminated
      Reason:   OOMKilled
      Exit Code:  137
      Started:    Fri, 03 Feb 2017 14:35:08 -0800
      Finished:   Fri, 03 Feb 2017 14:41:11 -0800
    Ready:    True
    Restart Count:  1
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-tder (ro)
    Environment Variables:  <none>
Conditions:
  Type    Status
  Initialized   True
  Ready   True
  PodScheduled  True
Volumes:
  default-token-46euo:
    Type: Secret (a volume populated by a Secret)
    SecretName: default-token-tder
QoS Class:  Burstable
Tolerations:  <none>
Events:
  FirstSeen LastSeen  Count From                SubObjectPath       Type    Reason    Message
  --------- --------  ----- ----                -------------       --------  ------    -------
  11m   11m   1 {default-scheduler }                      Normal    Scheduled Successfully assigned pnovotnak-manhole-123456789-82l2h to test-cluster-cja8smaK-oQSR
  10m   10m   1 {kubelet test-cluster-cja8smaK-oQSR} spec.containers{pnovotnak-manhole}  Normal    Created   Created container with docker id xxxxxxxxxxxx; Security:[seccomp=unconfined]
  10m   10m   1 {kubelet test-cluster-cja8smaK-oQSR} spec.containers{pnovotnak-manhole}  Normal    Started   Started container with docker id xxxxxxxxxxxx
  11m   4m    2 {kubelet test-cluster-cja8smaK-oQSR} spec.containers{pnovotnak-manhole}  Normal    Pulling   pulling image "pnovotnak/it"
  10m   4m    2 {kubelet test-cluster-cja8smaK-oQSR} spec.containers{pnovotnak-manhole}  Normal    Pulled    Successfully pulled image "pnovotnak/it"
  4m    4m    1 {kubelet test-cluster-cja8smaK-oQSR} spec.containers{pnovotnak-manhole}  Normal    Created   Created container with docker id yyyyyyyyyyyy; Security:[seccomp=unconfined]
  4m    4m    1 {kubelet test-cluster-cja8smaK-oQSR} spec.containers{pnovotnak-manhole}  Normal    Started   Started container with docker id yyyyyyyyyyyy

我从 pod 日志中获得的全部信息是;

{
 textPayload: "shutting down, got signal: Terminated
"
 insertId: "aaaaaaaaaaaaaaaa"
 resource: {
  type: "container"
  labels: {
   pod_id: "pnovotnak-manhole-123456789-82l2h"
   ...
  }
 }
 timestamp: "2017-02-03T22:34:48Z"
 severity: "ERROR"
 labels: {
  container.googleapis.com/container_name: "POD"
  ...
 }
 logName: "projects/myproj/logs/POD"
}

以及 kublet 日志;

{
 insertId: "bbbbbbbbbbbbbb"   
 jsonPayload: {
  _BOOT_ID: "ffffffffffffffffffffffffffffffff"    
  MESSAGE: "I0203 22:41:11.925928    1843 kubelet.go:1816] SyncLoop (PLEG): "pnovotnak-manhole-123456789-82l2h_test(a-uuid)", event: &pleg.PodLifecycleEvent{ID:"another-uuid", Type:"ContainerDied", Data:"..."}"
 ...

这似乎不足以唯一地将其标识为 OOM 事件。还有其他想法吗?

答案1

尽管日志中没有出现 OOMKilled 事件,但如果你能检测到 pod 被杀死,那么你可以使用kubectl get pod -o go-template=... <pod-id>以确定原因。作为直接来自文档

[13:59:01] $ ./cluster/kubectl.sh  get pod -o go-template='{{range.status.containerStatuses}}{{"Container Name: "}}{{.name}}{{"\r\nLastState: "}}{{.lastState}}{{end}}'  simmemleak-60xbc
Container Name: simmemleak
LastState: map[terminated:map[exitCode:137 reason:OOM Killed startedAt:2015-07-07T20:58:43Z finishedAt:2015-07-07T20:58:43Z containerID:docker://0e4095bba1feccdfe7ef9fb6ebffe972b4b14285d5acdec6f0d3ae8a22fad8b2]]

如果你以编程方式执行此操作,那么依赖kubectl输出的更好替代方案是使用 Kubernetes REST APIGET /api/v1/pods方法。访问 API 的方法还包括文档中给出

相关内容