For the past few months we have been running into a painful problem.
When we request pods through the Kubernetes executor, they occasionally fail to be created.
For example, a Spark pod may fail with the following error:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePodSandBox 20m (x3 over 32m) kubelet, k8s-agentpool1-123456789-vmss00000q (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "spark-worker-cc1d28bf3de8428a826c04471e58487c-8577d5d654-2jg89": operation timeout: context deadline exceeded
Normal SandboxChanged 16m (x150 over 159m) kubelet, k8s-agentpool1-123456789-vmss00000q Pod sandbox changed, it will be killed and re-created.
Warning FailedCreatePodSandBox 5m7s (x14 over 161m) kubelet, k8s-agentpool1-123456789-vmss00000q Failed create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "spark-worker-cc1d28bf3de8428a826c04471e58487c-8577d5d654-2jg89": operation timeout: context deadline exceeded
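For reference, events like the ones above come from kubectl; the commands below are only an illustration of how we pull them (the pod name is the one from the events above, and the field-selector query is an assumption about how one might filter for the same failure cluster-wide):

kubectl describe pod spark-worker-cc1d28bf3de8428a826c04471e58487c-8577d5d654-2jg89
kubectl get events --all-namespaces --field-selector reason=FailedCreatePodSandBox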
Looking at the logs, we can see the kubelet's SyncLoop picking up the request (ADD) for a new pod:
Jul 16 16:33:58 k8s-agentpool1-123456789-vmss00000T kubelet[4797]: I0716 16:33:58.001997 4797 kubelet.go:1908] SyncLoop (ADD, "api"): "d9b3911585c4461c9728aefa39716c44_default(8272d74f-a7e7-11e9-8f1c-000d3a7b202b)
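(These kubelet entries come from the node's system journal; assuming a systemd-managed kubelet, which is what aks-engine appears to set up, they can be followed with something like the command below, grepping for the pod UID from the line above:)

sudo journalctl -u kubelet -f | grep 8272d74f-a7e7-11e9-8f1c-000d3a7b202b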
There are also logs indicating that the volume was mounted:
Jul 16 16:34:29 k8s-agentpool1-123456789-vmss00000T kubelet[4797]: I0716 16:33:58.175573 4797 reconciler.go:252] operationExecutor.MountVolume started for volume "default-shared" (UniqueName: "kubernetes.io/glusterfs/8272d74f-a7e7-11e9-8f1c-000d3a7b202b-default-shared") pod "d9b3911585c4461c9728aefa39716c44" (UID: "8272d74f-a7e7-11e9-8f1c-000d3a7b202b")
We then see that a pod sandbox is about to be created:
Jul 16 16:34:29 k8s-agentpool1-123456789-vmss00000T kubelet[4797]: I0716 16:34:29.627374 4797 kuberuntime_manager.go:397] No sandbox for pod "d9b3911585c4461c9728aefa39716c44_default(8272d74f-a7e7-11e9-8f1c-000d3a7b202b)" can be found. Need to start a new one
After that we don't seem to see anything else until, two minutes later, this is logged:
Jul 16 16:36:29 k8s-agentpool1-123456789-vmss00000T kubelet[4797]: E0716 16:36:29.629252 4797 kuberuntime_manager.go:662] createPodSandbox for pod "d9b3911585c4461c9728aefa39716c44_default(8272d74f-a7e7-11e9-8f1c-000d3a7b202b)" failed: rpc error: code = Unknown desc = failed to create a sandbox for pod "d9b3911585c4461c9728aefa39716c44": operation timeout: context deadline exceeded
Jul 16 16:36:29 k8s-agentpool1-123456789-vmss00000T kubelet[4797]: E0716 16:36:29.629301 4797 pod_workers.go:190] Error syncing pod 8272d74f-a7e7-11e9-8f1c-000d3a7b202b ("d9b3911585c4461c9728aefa39716c44_default(8272d74f-a7e7-11e9-8f1c-000d3a7b202b)"), skipping: failed to "CreatePodSandbox" for "d9b3911585c4461c9728aefa39716c44_default(8272d74f-a7e7-11e9-8f1c-000d3a7b202b)" with CreatePodSandboxError: "CreatePodSandbox for pod \"d9b3911585c4461c9728aefa39716c44_default(8272d74f-a7e7-11e9-8f1c-000d3a7b202b)\" failed: rpc error: code = Unknown desc = failed to create a sandbox for pod \"d9b3911585c4461c9728aefa39716c44\": operation timeout: context deadline exceeded"
Jul 16 16:36:43 k8s-agentpool1-123456789-vmss00000T kubelet[4797]: I0716 16:36:43.937085 4797 kuberuntime_manager.go:397] No sandbox for pod "d9b3911585c4461c9728aefa39716c44_default(8272d74f-a7e7-11e9-8f1c-000d3a7b202b)" can be found. Need to start a new one
Jul 16 16:36:43 k8s-agentpool1-123456789-vmss00000T kubelet[4797]: E0716 16:36:43.940691 4797 remote_runtime.go:96] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to create a sandbox for pod "d9b3911585c4461c9728aefa39716c44": Error response from daemon: Conflict. The container name "/k8s_POD_d9b3911585c4461c9728aefa39716c44_default_8272d74f-a7e7-11e9-8f1c-000d3a7b202b_0" is already in use by container "2a7ecfd3725bbe6604b3006abf6c59a36eb8a5d7142e71a3791f5f7378bf5e27". You have to remove (or rename) that container to be able to reuse that name.
Jul 16 16:36:43 k8s-agentpool1-123456789-vmss00000T kubelet[4797]: E0716 16:36:43.940731 4797 kuberuntime_sandbox.go:68] CreatePodSandbox for pod "d9b3911585c4461c9728aefa39716c44_default(8272d74f-a7e7-11e9-8f1c-000d3a7b202b)" failed: rpc error: code = Unknown desc = failed to create a sandbox for pod "d9b3911585c4461c9728aefa39716c44": Error response from daemon: Conflict. The container name "/k8s_POD_d9b3911585c4461c9728aefa39716c44_default_8272d74f-a7e7-11e9-8f1c-000d3a7b202b_0" is already in use by container "2a7ecfd3725bbe6604b3006abf6c59a36eb8a5d7142e71a3791f5f7378bf5e27". You have to remove (or rename) that container to be able to reuse that name.
Jul 16 16:36:43 k8s-agentpool1-123456789-vmss00000T kubelet[4797]: E0716 16:36:43.940747 4797 kuberuntime_manager.go:662] createPodSandbox for pod "d9b3911585c4461c9728aefa39716c44_default(8272d74f-a7e7-11e9-8f1c-000d3a7b202b)" failed: rpc error: code = Unknown desc = failed to create a sandbox for pod "d9b3911585c4461c9728aefa39716c44": Error response from daemon: Conflict. The container name "/k8s_POD_d9b3911585c4461c9728aefa39716c44_default_8272d74f-a7e7-11e9-8f1c-000d3a7b202b_0" is already in use by container "2a7ecfd3725bbe6604b3006abf6c59a36eb8a5d7142e71a3791f5f7378bf5e27". You have to remove (or rename) that container to be able to reuse that name.
Jul 16 16:36:43 k8s-agentpool1-123456789-vmss00000T kubelet[4797]: E0716 16:36:43.940805 4797 pod_workers.go:190] Error syncing pod 8272d74f-a7e7-11e9-8f1c-000d3a7b202b ("d9b3911585c4461c9728aefa39716c44_default(8272d74f-a7e7-11e9-8f1c-000d3a7b202b)"), skipping: failed to "CreatePodSandbox" for "d9b3911585c4461c9728aefa39716c44_default(8272d74f-a7e7-11e9-8f1c-000d3a7b202b)" with CreatePodSandboxError: "CreatePodSandbox for pod \"d9b3911585c4461c9728aefa39716c44_default(8272d74f-a7e7-11e9-8f1c-000d3a7b202b)\" failed: rpc error: code = Unknown desc = failed to create a sandbox for pod \"d9b3911585c4461c9728aefa39716c44\": Error response from daemon: Conflict. The container name \"/k8s_POD_d9b3911585c4461c9728aefa39716c44_default_8272d74f-a7e7-11e9-8f1c-000d3a7b202b_0\" is already in use by container \"2a7ecfd3725bbe6604b3006abf6c59a36eb8a5d7142e71a3791f5f7378bf5e27\". You have to remove (or rename) that container to be able to reuse that name."
Jul 16 16:36:44 k8s-agentpool1-123456789-vmss00000T kubelet[4797]: W0716 16:36:44.221607 4797 docker_sandbox.go:384] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "d9b3911585c4461c9728aefa39716c44_default": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "2a7ecfd3725bbe6604b3006abf6c59a36eb8a5d7142e71a3791f5f7378bf5e27"
Jul 16 16:36:44 k8s-agentpool1-123456789-vmss00000T kubelet[4797]: I0716 16:36:44.222749 4797 kubelet.go:1953] SyncLoop (PLEG): "d9b3911585c4461c9728aefa39716c44_default(8272d74f-a7e7-11e9-8f1c-000d3a7b202b)", event: &pleg.PodLifecycleEvent{ID:"8272d74f-a7e7-11e9-8f1c-000d3a7b202b", Type:"ContainerDied", Data:"2a7ecfd3725bbe6604b3006abf6c59a36eb8a5d7142e71a3791f5f7378bf5e27"}
Jul 16 16:36:44 k8s-agentpool1-123456789-vmss00000T kubelet[4797]: I0716 16:36:44.739387 4797 kuberuntime_manager.go:415] No ready sandbox for pod "d9b3911585c4461c9728aefa39716c44_default(8272d74f-a7e7-11e9-8f1c-000d3a7b202b)" can be found. Need to start a new one
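The "container name already in use" conflict suggests that the first, timed-out sandbox creation did eventually create the pause container on the Docker side, so the retry collides with its name. As a purely manual check on the node (not a fix), the leftover container from the logs above can be located and removed like this:

docker ps -a --filter "name=k8s_POD_d9b3911585c4461c9728aefa39716c44"
docker rm -f 2a7ecfd3725bbe6604b3006abf6c59a36eb8a5d7142e71a3791f5f7378bf5e27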
We thought this might be a CNI error, but when we checked the Azure CNI logs it doesn't even appear to get to the point of requesting an IP; the logs only show the DEL command completing with err:<nil>:
2019/07/16 16:23:06 [net] Deleting veth pair azv4ea5d9d9527 eth0.
2019/07/16 16:23:06 [net] Deleted endpoint &{Id:8e963d34-eth0 HnsId: SandboxKey: IfName:eth0 HostIfName:azv4ea5d9d9527 MacAddress:ce:93:bf:4d:e9:19 InfraVnetIP:{IP:<nil> Mask:<nil>} IPAddresses:[{IP:10.250.18.225 Mask:fffff800}] Gateways:[10.250.16.1] DNS:{Suffix: Servers:[168.63.129.16]} Routes:[{Dst:{IP:0.0.0.0 Mask:00000000} Src:<nil> Gw:10.250.16.1 Protocol:0 DevName: Scope:0}] VlanID:0 EnableSnatOnHost:false EnableInfraVnet:false EnableMultitenancy:false NetworkNameSpace:/proc/10781/ns/net ContainerID:8e963d340597f1c9f789b93a7784e8d44ffb00687086de8ee6561338aab7c72d PODName:jupyter-some-person-10 PODNameSpace:default InfraVnetAddressSpace:}.
2019/07/16 16:23:06 [net] Save succeeded.
2019/07/16 16:23:06 [cni] Calling plugin azure-vnet-ipam DEL nwCfg:&{CNIVersion:0.3.0 Name:azure Type:azure-vnet Mode:bridge Master: Bridge:azure0 LogLevel: LogTarget: InfraVnetAddressSpace: PodNamespaceForDualNetwork:[] MultiTenancy:false EnableSnatOnHost:false EnableExactMatchForPodName:false CNSUrl: Ipam:{Type:azure-vnet-ipam Environment: AddrSpace: Subnet:10.250.16.0/21 Address:10.250.18.225 QueryInterval:} DNS:{Nameservers:[] Domain: Search:[] Options:[]} RuntimeConfig:{PortMappings:[] DNS:{Servers:[] Searches:[] Options:[]}} AdditionalArgs:[]}.
2019/07/16 16:23:06 [cni] Plugin azure-vnet-ipam returned err:<nil>.
2019/07/16 16:23:06 Get number of endpoints for ifname eth0 network azure
2019/07/16 16:23:06 [cni-net] DEL command completed with err:<nil>.
2019/07/16 16:23:06 [cni-net] Plugin stopped.
2019/07/16 16:36:38 [cni-net] Plugin azure-vnet version v1.0.18.
2019/07/16 16:36:38 [cni-net] Running on Linux version 4.15.0-1040-azure (buildd@lgw01-amd64-030) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10)) #44-Ubuntu SMP Thu Feb 21 14:24:01 UTC 2019
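(The CNI excerpts above come from the plugin's own log files on the node; on our aks-engine nodes the azure-vnet plugin appears to write to the paths below, though the exact locations are an assumption based on the default configuration:)

sudo tail -n 200 /var/log/azure-vnet.log /var/log/azure-vnet-ipam.log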
Here is the Docker version information from the node:
Client:
 Version:       3.0.3
 API version:   1.40
 Go version:    go1.11.4
 Git commit:    48bd4c6d
 Built:         Wed Jan 23 16:17:56 2019
 OS/Arch:       linux/amd64
 Experimental:  false

Server:
 Engine:
  Version:      3.0.4
  API version:  1.40 (minimum version 1.12)
  Go version:   go1.11.4
  Git commit:   8ecd530
  Built:        Fri Jan 25 01:45:38 2019
  OS/Arch:      linux/amd64
  Experimental: false
 containerd:
  Version:      v1.2.2
  GitCommit:    9754871865f7fe2f4e74d43e2fc7ccd237edcbce
 runc:
  Version:      1.0.0-rc6+dev
  GitCommit:    96ec2177ae841256168fcf76954f7177af9446eb
 docker-init:
  Version:      0.18.0
  GitCommit:    fec3683
Here is the Kubernetes version information:
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
And here is our node information - we are using AKS Engine to create a Kubernetes cluster with Azure VMSS nodes:
Kernel Version: 4.15.0-1050-azure
OS Image: Ubuntu 16.04.6 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://3.0.4
Kubelet Version: v1.13.5
Kube-Proxy Version: v1.13.5
I'm not sure what else we can do to troubleshoot this further, especially since we haven't been able to reproduce the problem on demand.
I know the "context deadline exceeded" message is a generic gRPC timeout, but I'm not sure which gRPC call is actually hanging.
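One thing we can at least check on an affected node is whether the Docker daemon is the slow endpoint behind that gRPC call. The checks below are purely illustrative; in particular the kubelet read-only port (10255) and the exact metric names are assumptions for this kubelet version:

time docker ps -a > /dev/null
time docker info > /dev/null
curl -s http://localhost:10255/metrics | grep -E 'kubelet_(pleg_relist|runtime_operations)_latency'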
Answer 1
The problem turned out to be a Kubernetes/Linux kernel bug in which cgroups grow exponentially (leak), causing problems that include high CPU/memory usage on the node.
The issue is described here:
A workaround/fix is documented here: Fix cgroup leak
The fix is a daemon that runs a cleanup task once an hour.
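This is not the exact script from the linked fix, but a minimal sketch of such a cleanup, assuming it amounts to periodically removing empty (process-less) pod cgroups and is run privileged on each node (e.g. hourly from cron or wrapped in a DaemonSet), looks roughly like this:

#!/bin/sh
# Hypothetical hourly cleanup sketch: walk the kubepods memory-cgroup tree
# depth-first and remove cgroups that no longer have any processes attached,
# so the kernel can release the per-cgroup structures that were leaking.
for d in $(find /sys/fs/cgroup/memory/kubepods -mindepth 1 -depth -type d 2>/dev/null); do
    if [ ! -s "$d/cgroup.procs" ]; then
        rmdir "$d" 2>/dev/null && echo "removed empty cgroup: $d"
    fi
done

It needs to run as root, since removing entries under /sys/fs/cgroup requires host-level privileges.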