nginx-ingress 控制器每隔几天就会崩溃

nginx-ingress 控制器每隔几天就会崩溃

我在裸机 CentOs 上有一个单机(未受污染)Kubernetes 集群。
我将其用作nginx-ingress-controller网关。我使用的图像来自https://quay.io/repository/kubernetes-ingress-controller/nginx-ingress-controller。我使用的是 版本0.13.0,最近因为崩溃问题升级到了 版本0.14.0。不幸的是,这没有帮助。

服务器运行良好,大约 3 天后,入口控制器将进入该CrashLoopBackOff状态。

我准备了一个服务,它每分钟尝试访问集群处理的 URL,如果它发现无法访问,它会捕获日志和有关 pod 的信息并通过电子邮件发送给我。如果集群在 5 分钟后仍未恢复,则执行系统重启。

所以昨天它又失败了,首先它在 2 分钟后恢复了,所以没有执行重新启动。所有 pod 都处于运行状态,以下是日志:

May 15 07:33:47 web-backend kubelet[2556]: E0515 07:33:47.845990 2556 remote_runtime.go:278] ContainerStatus "bdb53e54dbcc9250663b64db5827276efba7012b50edc1351195d34b87e46529" from runtime service failed: rpc error: code = Unknown desc = Error response from daemon: devmapper: Unknown device 516757f48afe0bd82be957abd70578bf46e5a89ccfb782f913acca08514fb58d
May 15 07:33:47 web-backend kubelet[2556]: E0515 07:33:47.849158 2556 kuberuntime_container.go:416] ContainerStatus for bdb53e54dbcc9250663b64db5827276efba7012b50edc1351195d34b87e46529 error: rpc error: code = Unknown desc = Error response from daemon: devmapper: Unknown device 516757f48afe0bd82be957abd70578bf46e5a89ccfb782f913acca08514fb58d
May 15 07:33:47 web-backend kubelet[2556]: E0515 07:33:47.849184 2556 kuberuntime_manager.go:874] getPodContainerStatuses for pod "nginx-ingress-controller-65b9795548-br445_ingress-nginx(0438c326-4d0e-11e8-ad63-005056b1f077)" failed: rpc error: code = Unknown desc = Error response from daemon: devmapper: Unknown device 516757f48afe0bd82be957abd70578bf46e5a89ccfb782f913acca08514fb58d
May 15 07:33:47 web-backend kubelet[2556]: E0515 07:33:47.850157 2556 generic.go:241] PLEG: Ignoring events for pod nginx-ingress-controller-65b9795548-br445/ingress-nginx: rpc error: code = Unknown desc = Error response from daemon: devmapper: Unknown device 516757f48afe0bd82be957abd70578bf46e5a89ccfb782f913acca08514fb58d
May 15 07:33:47 web-backend kubelet[2556]: E0515 07:33:47.851766 2556 pod_workers.go:186] Error syncing pod 0438c326-4d0e-11e8-ad63-005056b1f077 ("nginx-ingress-controller-65b9795548-br445_ingress-nginx(0438c326-4d0e-11e8-ad63-005056b1f077)"), skipping: rpc error: code = Unknown desc = Error response from daemon: devmapper: Unknown device 516757f48afe0bd82be957abd70578bf46e5a89ccfb782f913acca08514fb58d
May 15 07:33:57 web-backend kubelet[2556]: I0515 07:33:57.696417 2556 kuberuntime_manager.go:758] checking backoff for container "nginx-ingress-controller" in pod "nginx-ingress-controller-65b9795548-8nznl_ingress-nginx(eb6191db-4fab-11e8-bc40-005056b1f077)"
May 15 07:34:05 web-backend kubelet[2556]: W0515 07:34:05.779589 2556 prober.go:103] No ref for container "docker://2ac4271f1f2a9f515deb4c2d86465d3db5c23dae10858c39122879d67a458976" (nginx-ingress-controller-65b9795548-8nznl_ingress-nginx(eb6191db-4fab-11e8-bc40-005056b1f077):nginx-ingress-controller)
May 15 07:34:15 web-backend kubelet[2556]: W0515 07:34:15.784498 2556 prober.go:103] No ref for container "docker://2ac4271f1f2a9f515deb4c2d86465d3db5c23dae10858c39122879d67a458976" (nginx-ingress-controller-65b9795548-8nznl_ingress-nginx(eb6191db-4fab-11e8-bc40-005056b1f077):nginx-ingress-controller)
May 15 07:34:25 web-backend kubelet[2556]: W0515 07:34:25.820287 2556 prober.go:103] No ref for container "docker://2ac4271f1f2a9f515deb4c2d86465d3db5c23dae10858c39122879d67a458976" (nginx-ingress-controller-65b9795548-8nznl_ingress-nginx(eb6191db-4fab-11e8-bc40-005056b1f077):nginx-ingress-controller)
May 15 07:34:28 web-backend kubelet[2556]: W0515 07:34:28.577852 2556 pod_container_deletor.go:77] Container "e12f6be27df976112456671d640db9711cc5546127eee47fa52a0118db9fc573" not found in pod's containers

第二次故障发生在20分钟后,现在还没有恢复,所以5分钟后就重启了。
日志:

May 15 07:57:05 web-backend kubelet[2556]: I0515 07:57:05.946879 2556 kuberuntime_manager.go:768] Back-off 5m0s restarting failed container=nginx-ingress-controller pod=nginx-ingress-controller-65b9795548-8nznl_ingress-nginx(eb6191db-4fab-11e8-bc40-005056b1f077)
May 15 07:57:05 web-backend kubelet[2556]: E0515 07:57:05.946919 2556 pod_workers.go:186] Error syncing pod eb6191db-4fab-11e8-bc40-005056b1f077 ("nginx-ingress-controller-65b9795548-8nznl_ingress-nginx(eb6191db-4fab-11e8-bc40-005056b1f077)"), skipping: failed to "StartContainer" for "nginx-ingress-controller" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=nginx-ingress-controller pod=nginx-ingress-controller-65b9795548-8nznl_ingress-nginx(eb6191db-4fab-11e8-bc40-005056b1f077)"
May 15 07:57:11 web-backend kubelet[2556]: I0515 07:57:11.946715 2556 kuberuntime_manager.go:514] Container {Name:nginx-ingress-controller Image:quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.14.0 Command:[] Args:[/nginx-ingress-controller --default-backend-service=$(POD_NAMESPACE)/default-http-backend --configmap=$(POD_NAMESPACE)/nginx-configuration --tcp-services-configmap=$(POD_NAMESPACE)/tcp-services --udp-services-configmap=$(POD_NAMESPACE)/udp-services --annotations-prefix=nginx.ingress.kubernetes.io] WorkingDir: Ports:[{Name:http HostPort:0 ContainerPort:80 Protocol:TCP HostIP:} {Name:https HostPort:0 ContainerPort:443 Protocol:TCP HostIP:}] EnvFrom:[] Env:[{Name:POD_NAME Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.name,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}} {Name:POD_NAMESPACE Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.namespace,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}}] Resources:{Limits:map[] Requests:map[]} VolumeMounts:[{Name:nginx-ingress-serviceaccount-token-x87kw ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath: MountPropagation:<nil>}] VolumeDevices:[] LivenessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/healthz,Port:10254,Host:,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:10,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,} ReadinessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/healthz,Port:10254,Host:,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:0,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,} Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:nil Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
May 15 07:57:11 web-backend kubelet[2556]: I0515 07:57:11.946852 2556 kuberuntime_manager.go:758] checking backoff for container "nginx-ingress-controller" in pod "nginx-ingress-controller-65b9795548-4ddm7_ingress-nginx(eb691c72-4fab-11e8-bc40-005056b1f077)"
May 15 07:57:11 web-backend kubelet[2556]: I0515 07:57:11.946975 2556 kuberuntime_manager.go:768] Back-off 5m0s restarting failed container=nginx-ingress-controller pod=nginx-ingress-controller-65b9795548-4ddm7_ingress-nginx(eb691c72-4fab-11e8-bc40-005056b1f077)
May 15 07:57:11 web-backend kubelet[2556]: E0515 07:57:11.947002 2556 pod_workers.go:186] Error syncing pod eb691c72-4fab-11e8-bc40-005056b1f077 ("nginx-ingress-controller-65b9795548-4ddm7_ingress-nginx(eb691c72-4fab-11e8-bc40-005056b1f077)"), skipping: failed to "StartContainer" for "nginx-ingress-controller" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=nginx-ingress-controller pod=nginx-ingress-controller-65b9795548-4ddm7_ingress-nginx(eb691c72-4fab-11e8-bc40-005056b1f077)"
May 15 07:57:17 web-backend kubelet[2556]: I0515 07:57:17.958471 2556 kuberuntime_manager.go:514] Container {Name:nginx-ingress-controller Image:quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.14.0 Command:[] Args:[/nginx-ingress-controller --default-backend-service=$(POD_NAMESPACE)/default-http-backend --configmap=$(POD_NAMESPACE)/nginx-configuration --tcp-services-configmap=$(POD_NAMESPACE)/tcp-services --udp-services-configmap=$(POD_NAMESPACE)/udp-services --annotations-prefix=nginx.ingress.kubernetes.io] WorkingDir: Ports:[{Name:http HostPort:0 ContainerPort:80 Protocol:TCP HostIP:} {Name:https HostPort:0 ContainerPort:443 Protocol:TCP HostIP:}] EnvFrom:[] Env:[{Name:POD_NAME Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.name,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}} {Name:POD_NAMESPACE Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.namespace,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}}] Resources:{Limits:map[] Requests:map[]} VolumeMounts:[{Name:nginx-ingress-serviceaccount-token-x87kw ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath: MountPropagation:<nil>}] VolumeDevices:[] LivenessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/healthz,Port:10254,Host:,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:10,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,} ReadinessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/healthz,Port:10254,Host:,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:0,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,} Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:nil Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
May 15 07:57:17 web-backend kubelet[2556]: I0515 07:57:17.958587 2556 kuberuntime_manager.go:758] checking backoff for container "nginx-ingress-controller" in pod "nginx-ingress-controller-65b9795548-br445_ingress-nginx(0438c326-4d0e-11e8-ad63-005056b1f077)"
May 15 07:57:17 web-backend kubelet[2556]: I0515 07:57:17.958737 2556 kuberuntime_manager.go:768] Back-off 5m0s restarting failed container=nginx-ingress-controller pod=nginx-ingress-controller-65b9795548-br445_ingress-nginx(0438c326-4d0e-11e8-ad63-005056b1f077)
May 15 07:57:17 web-backend kubelet[2556]: E0515 07:57:17.958767 2556 pod_workers.go:186] Error syncing pod 0438c326-4d0e-11e8-ad63-005056b1f077 ("nginx-ingress-controller-65b9795548-br445_ingress-nginx(0438c326-4d0e-11e8-ad63-005056b1f077)"), skipping: failed to "StartContainer" for "nginx-ingress-controller" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=nginx-ingress-controller pod=nginx-ingress-controller-65b9795548-br445_ingress-nginx(0438c326-4d0e-11e8-ad63-005056b1f077)"

而这一次,nginx-ingress 控制器 pod 进入了 CrashLoopBackOff 状态,以下是有关它的一些详细信息kubectl describe pod

Name:           nginx-ingress-controller-65b9795548-4ddm7
Namespace:      ingress-nginx
Node:           web-backend/10.202.91.129
Start Time:     Fri, 04 May 2018 08:00:44 -0700
Labels:         app=ingress-nginx
                pod-template-hash=2165351104
Annotations:    prometheus.io/port=10254
                prometheus.io/scrape=true
Status:         Running
IP:             192.168.255.17
Controlled By:  ReplicaSet/nginx-ingress-controller-65b9795548
Containers:
  nginx-ingress-controller:
    Container ID:  docker://ad9447979d549757449cf26de19ba2485e39abbd32b9dd69cc9a18d27a48002d
    Image:         quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.14.0
    Image ID:      docker-pullable://quay.io/kubernetes-ingress-controller/nginx-ingress-controller@sha256:4091d87c1f81fdd1036ddc96e2da725b1aeb37f26bb8bdd97e16a6ea4d2e1b14
    Ports:         80/TCP, 443/TCP
    Args:
      /nginx-ingress-controller
      --default-backend-service=$(POD_NAMESPACE)/default-http-backend
      --configmap=$(POD_NAMESPACE)/nginx-configuration
      --tcp-services-configmap=$(POD_NAMESPACE)/tcp-services
      --udp-services-configmap=$(POD_NAMESPACE)/udp-services
      --annotations-prefix=nginx.ingress.kubernetes.io
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 15 May 2018 07:55:27 -0700
      Finished:     Tue, 15 May 2018 07:56:19 -0700
    Ready:          False
    Restart Count:  54
    Liveness:       http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:10254/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:       nginx-ingress-controller-65b9795548-4ddm7 (v1:metadata.name)
      POD_NAMESPACE:  ingress-nginx (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from nginx-ingress-serviceaccount-token-x87kw (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          False 
  PodScheduled   True 
Volumes:
  nginx-ingress-serviceaccount-token-x87kw:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nginx-ingress-serviceaccount-token-x87kw
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>
Name:           nginx-ingress-controller-65b9795548-8nznl
Namespace:      ingress-nginx
Node:           web-backend/10.202.91.129
Start Time:     Fri, 04 May 2018 08:00:44 -0700
Labels:         app=ingress-nginx
                pod-template-hash=2165351104
Annotations:    prometheus.io/port=10254
                prometheus.io/scrape=true
Status:         Running
IP:             192.168.255.32
Controlled By:  ReplicaSet/nginx-ingress-controller-65b9795548
Containers:
  nginx-ingress-controller:
    Container ID:  docker://57c588250fe42266d9c31494f8dd9c12b970f29f040d2975ee121279ed6af470
    Image:         quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.14.0
    Image ID:      docker-pullable://quay.io/kubernetes-ingress-controller/nginx-ingress-controller@sha256:4091d87c1f81fdd1036ddc96e2da725b1aeb37f26bb8bdd97e16a6ea4d2e1b14
    Ports:         80/TCP, 443/TCP
    Args:
      /nginx-ingress-controller
      --default-backend-service=$(POD_NAMESPACE)/default-http-backend
      --configmap=$(POD_NAMESPACE)/nginx-configuration
      --tcp-services-configmap=$(POD_NAMESPACE)/tcp-services
      --udp-services-configmap=$(POD_NAMESPACE)/udp-services
      --annotations-prefix=nginx.ingress.kubernetes.io
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 15 May 2018 07:55:24 -0700
      Finished:     Tue, 15 May 2018 07:56:06 -0700
    Ready:          False
    Restart Count:  63
    Liveness:       http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:10254/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:       nginx-ingress-controller-65b9795548-8nznl (v1:metadata.name)
      POD_NAMESPACE:  ingress-nginx (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from nginx-ingress-serviceaccount-token-x87kw (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          False 
  PodScheduled   True 
Volumes:
  nginx-ingress-serviceaccount-token-x87kw:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nginx-ingress-serviceaccount-token-x87kw
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>
Name:           nginx-ingress-controller-65b9795548-br445
Namespace:      ingress-nginx
Node:           web-backend/10.202.91.129
Start Time:     Tue, 01 May 2018 00:05:26 -0700
Labels:         app=ingress-nginx
                pod-template-hash=2165351104
Annotations:    prometheus.io/port=10254
                prometheus.io/scrape=true
Status:         Running
IP:             192.168.255.4
Controlled By:  ReplicaSet/nginx-ingress-controller-65b9795548
Containers:
  nginx-ingress-controller:
    Container ID:  docker://bd1db5de818e982b213b0ae2fe4e6208a0a4ec7bdc2d0a3c483867d63bd9fd76
    Image:         quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.14.0
    Image ID:      docker-pullable://quay.io/kubernetes-ingress-controller/nginx-ingress-controller@sha256:4091d87c1f81fdd1036ddc96e2da725b1aeb37f26bb8bdd97e16a6ea4d2e1b14
    Ports:         80/TCP, 443/TCP
    Args:
      /nginx-ingress-controller
      --default-backend-service=$(POD_NAMESPACE)/default-http-backend
      --configmap=$(POD_NAMESPACE)/nginx-configuration
      --tcp-services-configmap=$(POD_NAMESPACE)/tcp-services
      --udp-services-configmap=$(POD_NAMESPACE)/udp-services
      --annotations-prefix=nginx.ingress.kubernetes.io
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 15 May 2018 07:55:26 -0700
      Finished:     Tue, 15 May 2018 07:56:22 -0700
    Ready:          False
    Restart Count:  56
    Liveness:       http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:10254/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:       nginx-ingress-controller-65b9795548-br445 (v1:metadata.name)
      POD_NAMESPACE:  ingress-nginx (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from nginx-ingress-serviceaccount-token-x87kw (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          False 
  PodScheduled   True 
Volumes:
  nginx-ingress-serviceaccount-token-x87kw:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nginx-ingress-serviceaccount-token-x87kw
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>

我的 kubelet 版本:Kubernetes v1.9.3
Docker:Docker version 1.12.6, build 3e8e77d/1.12.6
操作系统:CentOS Linux release 7.2.1511 (Core)

重启后(大约需要 2 分钟才能恢复一切),它将在接下来的约 3 天内无任何中断地运行。

我注意到只有 nginx-ingress-controller 出现故障,但这基本上意味着整个集群都崩溃了,无法通过 Web 浏览器提供服务。

我注意到日志中出现了奇怪的错误消息:

May 15 05:28:15 web-backend kubelet[2556]: E0515 05:28:15.672576 2556 kubelet_node_status.go:383] Error updating node status, will retry: error getting node "web-backend": Get https://10.202.91.129:6443/api/v1/nodes/web-backend: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

由于这是一个单节点集群,它看起来像是 apiserver 的问题,但我不知道如何解决它。

我已经将 nginx-ingress-controller 扩展到 3 个副本,但这没有帮助。当时可能只有一个副本发生故障,但无论如何所有流量都会定向到它。

这是我在 POD 日志中下次崩溃后发现的内容:

kubectl --namespace=ingress-nginx logs nginx-ingress-controller-65b9795548-4ddm7 --since 15m
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:    0.14.0
  Build:      git-734361d
  Repository: https://github.com/kubernetes/ingress-nginx
-------------------------------------------------------------------------------
W0518 12:31:30.646157       7 client_config.go:533] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0518 12:31:30.674673       7 main.go:181] Creating API client for https://10.96.0.1:443
I0518 12:31:34.857433       7 main.go:225] Running in Kubernetes Cluster version v1.9 (v1.9.3) - git (clean) commit d2835416544f298c919e2ead3be3d0864b52323b - platform linux/amd64
I0518 12:31:35.650003       7 main.go:84] validated ingress-nginx/default-http-backend as the default backend
I0518 12:31:37.486330       7 stat_collector.go:77] starting new nginx stats collector for Ingress controller running in namespace  (class nginx)
I0518 12:31:37.490876       7 stat_collector.go:78] collector extracting information from port 18080
I0518 12:31:37.994480       7 nginx.go:278] starting Ingress controller
I0518 12:31:38.289328       7 event.go:218] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"ingress-nginx", Name:"nginx-configuration", UID:"c56ae692-373e-11e8-937e-005056b1f077", APIVersion:"v1", ResourceVersion:"6055322", FieldPath:""}): type: 'Normal' reason: 'CREATE' ConfigMap ingress-nginx/nginx-configuration
I0518 12:31:38.547714       7 event.go:218] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"ingress-nginx", Name:"tcp-services", UID:"ca14ac3e-373e-11e8-937e-005056b1f077", APIVersion:"v1", ResourceVersion:"4116575", FieldPath:""}): type: 'Normal' reason: 'CREATE' ConfigMap ingress-nginx/tcp-services
I0518 12:31:38.547931       7 event.go:218] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"ingress-nginx", Name:"udp-services", UID:"cd74ece7-373e-11e8-937e-005056b1f077", APIVersion:"v1", ResourceVersion:"4116581", FieldPath:""}): type: 'Normal' reason: 'CREATE' ConfigMap ingress-nginx/udp-services
I0518 12:31:39.518536       7 event.go:218] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"web", Name:"web-incubator", UID:"02a43f7c-37d2-11e8-937e-005056b1f077", APIVersion:"extensions", ResourceVersion:"7517344", FieldPath:""}): type: 'Normal' reason: 'CREATE' Ingress web/web-incubator
I0518 12:31:39.537898       7 event.go:218] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"web", Name:"web-production", UID:"64f13208-3765-11e8-937e-005056b1f077", APIVersion:"extensions", ResourceVersion:"7517346", FieldPath:""}): type: 'Normal' reason: 'CREATE' Ingress web/web-production
I0518 12:31:39.612898       7 nginx.go:299] starting NGINX process...
I0518 12:31:39.857564       7 leaderelection.go:175] attempting to acquire leader lease  ingress-nginx/ingress-controller-leader-nginx...
I0518 12:31:39.976074       7 status.go:196] new leader elected: nginx-ingress-controller-65b9795548-8nznl
I0518 12:31:40.018482       7 controller.go:168] backend reload required
I0518 12:31:40.028976       7 stat_collector.go:34] changing prometheus collector from  to default
I0518 12:31:48.350483       7 controller.go:177] ingress backend successfully reloaded...
I0518 12:31:55.369911       7 main.go:150] Received SIGTERM, shutting down
I0518 12:31:55.697116       7 nginx.go:362] shutting down controller queues
I0518 12:31:56.366677       7 nginx.go:370] stopping NGINX process...

知道可能是什么原因导致了这些问题吗?

答案1

Kubernetes是一个自我修复、高可用性和负载平衡的环境。即使只有一个节点,您仍然可以使用集群解决方案来实现最佳性能。在这种情况下,所有软件组件都在一台机器上运行。因此有很多软件在运行 - 您的应用程序映像和 Kubernetes 进程。这需要为该配置的每个方面都提供足够的资源。

如果你的应用程序(例如 Docker 镜像)资源耗尽怎么办?Kubernetes 会将其杀死。耗尽内存限制将导致OOO 行为运行并终止容器。

在您提供的日志中我发现了以下条目:

I0518 12:31:55.369911 7 main.go:150] 收到 SIGTERM,正在关闭

因此,由于资源耗尽,Kubernetes 决定终止 Ingress 控制器。尝试扩展机器(或虚拟机)的内存,并观察是否有帮助。

另一个选择是启动另一个节点并将集群从一个节点扩展到多节点,并启用 HA 和负载平衡等所有功能,以避免此类问题。

答案2

这是由于用户允许打开的文件数量限制造成的nobody

prometheus 和 Nginx-ingress 都以nobody用户身份运行。由于 Prometheus 保持大量文件句柄打开,因此没有足够的空间让 Nginx 正常运行。

添加一个包含以下内容的文件/etc/security/limits.d/

nobody soft nofile 4096

当你这样做时,你可能还需要一个/etc/sysctl.d/可以增加其他一些限制的文件:

# Increase number of watches (Kubernetes/Docker/IMAP)
fs.inotify.max_user_watches: 16384  # default 8192
fs.inotify.max_user_instances: 1024  # was 128

这应该可以确保 Nginx 持续运行。


最后,当您的系统接收许多传入连接时,也应该进行调整:

# Full speed on idle connection
net.ipv4.tcp_slow_start_after_idle: 0

# Handle more concurrent connections, resize listen() backlog
net.core.somaxconn: 1024

# Increase connection tracking, or face dropping packets (seen in dmesg).
# Happened at servers that receives many connections from statsd/collectd
net.nf_conntrack_max: 65536

# Increase the number of outstanding syn requests allowed.
net.ipv4.tcp_max_syn_backlog: 2048
net.ipv4.tcp_syncookies: 1

# Buffers: min default max (16MB buffer at 50ms RTT = 320MB/s max rate)
net.ipv4.tcp_rmem: 4096 65536 16777216
net.ipv4.tcp_wmem: 4096 65536 16777216

# Lower TIME_WAIT / FIN timeout
net.ipv4.tcp_fin_timeout: 20
net.ipv4.tcp_tw_reuse: 1
net.ipv4.tcp_max_tw_buckets: 131072

http://cdn.oreillystatic.com/en/assets/1/event/94/Tuning%20TCP%20For%20The%20Web%20Presentation.pdf了解网络值的详细信息。

相关内容