Pods in Kubernetes cannot communicate with other pods or with hosts outside the cluster

2024-6-1

networking kubernetes

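I have 2 master nodes, 3 worker nodes, and an HAProxy for the control plane. I run many Java microservices that communicate with each other and with a DB or Kafka outside the Kubernetes cluster. Network access is any to any open in all hosts. I created a Deployment for each microservice. But when I exec into a container, there is no TCP connection on the relevant ports between the pod and the DB or Kafka outside the Kubernetes cluster.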

From the host I can connect to the DB or Kafka with telnet, but from a container in a pod I cannot reach them.

From the host:

[root@master1 ~]# telnet oracle.local 1521
Trying 192.198.10.30...
Connected to oracle.local.
Escape character is '^]'.
^C^CConnection closed by foreign host.
[root@master1 ~]#

For example, from a busybox pod:

[root@master1 ~]# kubectl run -i --tty busybox --image=busybox --restart=Never -- sh
If you don't see a command prompt, try pressing enter.
/ # telnet 192.168.10.30 1521
telnet: can't connect to remote host (192.168.10.30): Connection timed out

Cluster status:

[root@master1 ~]# kubectl get nodes -o wide
NAME                 STATUS   ROLES                  AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                  KERNEL-VERSION                  CONTAINER-RUNTIME
master1.project.co   Ready    control-plane,master   11d   v1.22.2   192.168.10.1    <none>        Oracle Linux Server 8.3   5.4.17-2011.7.4.el8uek.x86_64   containerd://1.4.9
master2.project.co   Ready    control-plane,master   11d   v1.22.2   192.168.10.2    <none>        Oracle Linux Server 8.3   5.4.17-2011.7.4.el8uek.x86_64   containerd://1.4.9
worker1.project.co   Ready    <none>                 11d   v1.22.2   192.168.10.3   <none>        Oracle Linux Server 8.3   5.4.17-2011.7.4.el8uek.x86_64   containerd://1.4.9
worker2.project.co   Ready    <none>                 11d   v1.22.2   192.168.10.4   <none>        Oracle Linux Server 8.3   5.4.17-2011.7.4.el8uek.x86_64   containerd://1.4.9
worker3.project.co   Ready    <none>                 11d   v1.22.2   192.168.10.5   <none>        Oracle Linux Server 8.3   5.4.17-2011.7.4.el8uek.x86_64   containerd://1.4.9

Describe output for the busybox pod:

[root@master1 ~]# kubectl describe pod busybox
Name:         busybox
Namespace:    default
Priority:     0
Node:         worker3.project.co/192.168.10.5
Start Time:   Sat, 02 Oct 2021 10:27:05 +0330
Labels:       run=busybox
Annotations:  cni.projectcalico.org/containerID: 75d7222e8f402c68d9161a7b399df2de6b45e7194b2bb3b0b2730adbdac680c4
              cni.projectcalico.org/podIP: 192.168.205.76/32
              cni.projectcalico.org/podIPs: 192.168.205.76/32
Status:       Pending
IP:
IPs:          <none>
Containers:
  busybox:
    Container ID:
    Image:         busybox
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      sh
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-69snv (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-69snv:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  21s   default-scheduler  Successfully assigned default/busybox to worker3.project.co
  Normal  Pulling    20s   kubelet            Pulling image "busybox"

Answer 1

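There can be many causes for this. Without knowing the cluster and network architecture the problem cannot be pinned down, but here are some ideas: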

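Check whether any NetworkPolicies are applied by running kubectl -n <namespace> get netpol. NetworkPolicies can restrict traffic both inside the cluster and to the outside.
Run a pod with the hostNetwork: true option (not something to do in production, just as a test) and repeat some connectivity tests (in both directions).
Check that the cluster's network is configured correctly by tracing the network calls. Are the routers configured correctly, and can they be used by applications inside the cluster?
Check whether your statement that network access is any to any open in all hosts actually holds; this could be a firewall configuration issue.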
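
The first two checks can be sketched as a few shell helpers. This is only a rough sketch under assumptions: the function names, the pod name hostnet-test, and the 5-second timeout are made up for illustration, and the target address and port are simply the ones from the question. The tcp_check helper relies on bash's /dev/tcp, so it is meant for a node; the busybox shell does not provide it, so inside the pod telnet remains the tool to use.

```shell
# Target taken from the question; override via the environment if needed.
TARGET_HOST="${TARGET_HOST:-192.168.10.30}"
TARGET_PORT="${TARGET_PORT:-1521}"

# Check 1: list NetworkPolicies in every namespace, not just one.
list_netpols() {
  kubectl get networkpolicy --all-namespaces
}

# Check 2: start a throwaway busybox pod that shares the node's network
# namespace (hostNetwork: true). Test only; the pod name is hypothetical.
hostnet_shell() {
  kubectl run hostnet-test --rm -it --restart=Never --image=busybox \
    --overrides='{"apiVersion":"v1","spec":{"hostNetwork":true}}' -- sh
}

# TCP probe from the current machine. Needs bash (for /dev/tcp) and
# coreutils timeout; prints "open" or "unreachable".
tcp_check() {
  if timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null; then
    echo "open"
  else
    echo "unreachable"
  fi
}
```

After sourcing these, tcp_check "$TARGET_HOST" "$TARGET_PORT" on a node gives a baseline, and repeating the telnet test inside the shell opened by hostnet_shell shows whether the failure is specific to the pod network. If the hostNetwork pod connects while a normal pod does not, the pod network (CNI, routing, or a NetworkPolicy) is the more likely place to look than the external firewall.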

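Bonus: having only 2 master nodes makes little sense if etcd runs inside the Kubernetes cluster (kubectl -n kube-system get pods | grep etcd will show 2 pods if that is the case). Two etcd members give you exactly the same fault tolerance as a 1-node cluster, but you waste resources on another VM that consumes memory, CPU and so on. Consider increasing the masters to 3 so that the fault tolerance becomes 1. A majority of the etcd cluster must always be running, and keep in mind that the majority of 2 is still 2.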
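
The majority arithmetic behind that advice can be worked through in a few lines of shell. The function names are made up for this illustration; the formula is the usual majority rule, quorum = floor(N/2) + 1, and the tolerated failures are whatever is left over.

```shell
# Quorum (majority) for a cluster of N etcd members: floor(N/2) + 1.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Members that may fail while a majority is still running.
etcd_fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Print the numbers for small cluster sizes.
for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(etcd_quorum "$n") tolerated_failures=$(etcd_fault_tolerance "$n")"
done
```

For 1 and 2 members the tolerated failures are both 0, and for 3 members it is 1, which is why a third master is the first step that actually buys any fault tolerance.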
