Troubleshooting cross-node Pod-to-Pod communication

Prerequisites:

I set up 2 VMs with VirtualBox. The VirtualBox version is 6.1.16 r140961 (Qt5.6.2). The OS is CentOS 7 (CentOS-7-x86_64-Minimal-2003.iso). One VM acts as the Kubernetes master node, named k8s-master1. The other VM acts as the Kubernetes worker node, named k8s-node1.
Each VM has 2 network interfaces attached: NAT + Host-Only. NAT is used for internet access. Host-Only is used for connectivity between the VMs.
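
For reference, the two adapters could be attached with VBoxManage roughly as follows (a sketch; the VM names, the adapter order, and the Host-Only adapter name vboxnet0 are assumptions, not details from the original setup):

# Attach one NAT adapter (internet) and one Host-Only adapter (VM-to-VM) to each VM.
VBoxManage modifyvm "k8s-master1" --nic1 nat --nic2 hostonly --hostonlyadapter2 vboxnet0
VBoxManage modifyvm "k8s-node1"   --nic1 nat --nic2 hostonly --hostonlyadapter2 vboxnet0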

Network information for k8s-master1

[root@k8s-master1 ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:d1:41:4b brd ff:ff:ff:ff:ff:ff
    inet 192.168.56.101/24 brd 192.168.56.255 scope global noprefixroute enp0s3
       valid_lft forever preferred_lft forever
    inet6 fe80::62ae:c676:da76:cbff/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:8a:2d:63 brd ff:ff:ff:ff:ff:ff
    inet 10.0.3.15/24 brd 10.0.3.255 scope global noprefixroute dynamic enp0s8
       valid_lft 80380sec preferred_lft 80380sec
    inet6 fe80::952e:9af8:a1cb:8a07/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 02:42:04:6e:c7:dc brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
5: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default 
    link/ether 06:cd:b1:24:62:fc brd ff:ff:ff:ff:ff:ff
    inet 10.244.0.0/32 scope global flannel.1
       valid_lft forever preferred_lft forever
    inet6 fe80::4cd:b1ff:fe24:62fc/64 scope link 
       valid_lft forever preferred_lft forever
6: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether 06:75:06:d0:17:ee brd ff:ff:ff:ff:ff:ff
    inet 10.244.0.1/24 brd 10.244.0.255 scope global cni0
       valid_lft forever preferred_lft forever
    inet6 fe80::475:6ff:fed0:17ee/64 scope link 
       valid_lft forever preferred_lft forever
8: veth221fb276@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether 62:b9:61:95:0f:73 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::60b9:61ff:fe95:f73/64 scope link 
       valid_lft forever preferred_lft forever

Network information for k8s-node1

[root@k8s-node1 ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:ba:34:4f brd ff:ff:ff:ff:ff:ff
    inet 192.168.56.102/24 brd 192.168.56.255 scope global noprefixroute enp0s3
       valid_lft forever preferred_lft forever
    inet6 fe80::660d:ec2:cb1c:49de/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:48:6d:8b brd ff:ff:ff:ff:ff:ff
    inet 10.0.3.15/24 brd 10.0.3.255 scope global noprefixroute dynamic enp0s8
       valid_lft 80039sec preferred_lft 80039sec
    inet6 fe80::62af:7770:3b6d:6576/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 02:42:e4:65:f4:28 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
5: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default 
    link/ether 8e:d3:79:63:40:22 brd ff:ff:ff:ff:ff:ff
    inet 10.244.1.0/32 scope global flannel.1
       valid_lft forever preferred_lft forever
    inet6 fe80::8cd3:79ff:fe63:4022/64 scope link 
       valid_lft forever preferred_lft forever
6: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether 8e:22:8d:e5:29:24 brd ff:ff:ff:ff:ff:ff
    inet 10.244.1.1/24 brd 10.244.1.255 scope global cni0
       valid_lft forever preferred_lft forever
    inet6 fe80::8c22:8dff:fee5:2924/64 scope link 
       valid_lft forever preferred_lft forever
8: veth5314dbaf@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether 82:58:1c:3a:a1:a3 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::8058:1cff:fe3a:a1a3/64 scope link 
       valid_lft forever preferred_lft forever
10: calic849a8cefe4@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default 
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 2
    inet6 fe80::ecee:eeff:feee:eeee/64 scope link 
       valid_lft forever preferred_lft forever
11: cali82f9786c604@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default 
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 3
    inet6 fe80::ecee:eeff:feee:eeee/64 scope link 
       valid_lft forever preferred_lft forever
12: caliaa01fae8214@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default 
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 4
    inet6 fe80::ecee:eeff:feee:eeee/64 scope link 
       valid_lft forever preferred_lft forever
13: cali88097e17cd0@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default 
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 5
    inet6 fe80::ecee:eeff:feee:eeee/64 scope link 
       valid_lft forever preferred_lft forever
15: cali788346fba46@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default 
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::ecee:eeff:feee:eeee/64 scope link 
       valid_lft forever preferred_lft forever
16: caliafdfba3871a@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default 
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 6
    inet6 fe80::ecee:eeff:feee:eeee/64 scope link 
       valid_lft forever preferred_lft forever
18: cali329803e4ee5@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default 
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 7
    inet6 fe80::ecee:eeff:feee:eeee/64 scope link 
       valid_lft forever preferred_lft forever

I made a table to summarize the network information.

| k8s-master1                              | k8s-node1                                |
| ---------------------------------------- | ---------------------------------------- |
| NAT          10.0.3.15/24        enp0s8  | NAT          10.0.3.15/24        enp0s8  |
| Host-Only    192.168.56.101/24   enp0s3  | Host-Only    192.168.56.102/24   enp0s3  |
| docker0      172.17.0.1/16               | docker0      172.17.0.1/16               |
| flannel.1    10.244.0.0/32               | flannel.1    10.244.1.0/32               |
| cni0         10.244.0.1/24               | cni0         10.244.1.1/24               |

Problem description

Test case 1:

The Kubernetes Service nginx-svc is exposed as a NodePort, and the nginx pod is running on k8s-node1.
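
The deployment and service were presumably created along these lines (a minimal sketch; the actual image and manifests are not shown in the question):

# A minimal sketch: run nginx and expose it as a NodePort service named nginx-svc.
kubectl create deployment nginx --image=nginx
kubectl expose deployment nginx --name=nginx-svc --port=80 --type=NodePort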

Check the NodePort of nginx-svc:

[root@k8s-master1 ~]# kubectl get svc 
NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
nginx-svc               NodePort    10.103.236.60    <none>        80:30309/TCP   9h

Run curl http://localhost:30309 on k8s-master1. It eventually returns Connection timed out.


Run curl http://localhost:30309 on k8s-node1. It returns a response.


Test case 2:

I created another nginx pod on k8s-master1. There is already an nginx pod on k8s-node1. I made a call from the nginx pod on k8s-master1 to the nginx pod on k8s-node1. It also returned Connection timed out. The problem is most likely that cross-node pod-to-pod communication does not work in this Kubernetes cluster with flannel.
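
The cross-node check can be reproduced with something like the following (a sketch; the pod name and pod IP are placeholders to fill in from kubectl get pods -o wide, and it assumes curl is available inside the image):

# Find the pod IPs and the nodes they are scheduled on.
kubectl get pods -o wide

# From the nginx pod on k8s-master1, call the nginx pod on k8s-node1 by its pod IP.
kubectl exec <nginx-pod-on-master> -- curl -m 5 http://<pod-ip-on-node1>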

Test case 3:

If the 2 nginx pods are on the same node, whether k8s-master1 or k8s-node1, the connection works fine: one pod gets a response from the other pod.

Test case 4:

I sent several ping requests from k8s-master1 to k8s-node1 using different IP addresses.

Ping the Host-Only IP address of k8s-node1: connection succeeds.

ping 192.168.56.102

Ping the flannel.1 IP address of k8s-node1: connection fails.

ping 10.244.1.0

Ping the cni0 IP address of k8s-node1: connection fails.

ping 10.244.1.1
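
One way to narrow down where the packets stop (a sketch of a possible investigation step) is to capture traffic on both hosts while the ping is running; flannel's vxlan backend encapsulates pod traffic in UDP on port 8472 on the underlying host interface:

# Install tcpdump on the minimal CentOS 7 install if it is missing.
yum install -y tcpdump

# Watch ICMP on the overlay device on k8s-master1.
tcpdump -ni flannel.1 icmp

# Watch the encapsulated VXLAN traffic on any host interface (flannel vxlan uses UDP 8472).
tcpdump -ni any udp port 8472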

Test case 5:

I set up a new k8s cluster on the same laptop using a bridged network. Test cases 1, 2, 3, and 4 could not be reproduced there; there was no "cross-node pod-to-pod communication" problem.

Guesses

  1. Firewall. I disabled the firewall on all VMs (see the sketch after this list), and also disabled the firewall on the laptop (Win 10), including all antivirus software. The connection problem persisted.

  2. Compared iptables between the k8s cluster attached to NAT + Host-Only and the k8s cluster attached to the bridged network. No significant differences were found.

    yum install -y net-tools
    iptables -L
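
For guess 1, the firewall on each CentOS 7 VM can be disabled with the standard firewalld commands, roughly as follows (a sketch of what "disabled the firewall" amounts to on these VMs):

# Stop firewalld immediately and keep it disabled across reboots (run on every VM).
systemctl stop firewalld
systemctl disable firewalld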
    

Solution

# Install net-tools (provides the route command).
[root@k8s-master1 tzhong]# yum install -y net-tools

# View the routing table after the k8s cluster was installed via kubeadm.
[root@k8s-master1 tzhong]# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         gateway         0.0.0.0         UG    100    0        0 enp0s3
10.0.2.0        0.0.0.0         255.255.255.0   U     100    0        0 enp0s3
10.244.0.0      0.0.0.0         255.255.255.0   U     0      0        0 cni0
10.244.1.0      10.244.1.0      255.255.255.0   UG    0      0        0 flannel.1
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
192.168.56.0    0.0.0.0         255.255.255.0   U     101    0        0 enp0s8

Focus on the routing table rule below on k8s-master1. This rule means that all traffic from k8s-master1 destined for 10.244.1.0/24 is forwarded through flannel.1:
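
10.244.1.0      10.244.1.0      255.255.255.0   UG    0      0        0 flannel.1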

But flannel.1 seemed to have some problem, so I gave it a try: add new routing table rules so that all traffic from k8s-master1 destined for 10.244.1.0/24 is forwarded through enp0s8 instead.

# Run below commands on k8s-master1 to add routing table rules. 10.244.1.0 is the IP address of flannel.1 on k8s-node1.
route add -net 10.244.1.0 netmask 255.255.255.0 enp0s8
route add -net 10.244.1.0 netmask 255.255.255.0 gw 10.244.1.0

[root@k8s-master1 ~]# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         gateway         0.0.0.0         UG    100    0        0 enp0s3
10.0.2.0        0.0.0.0         255.255.255.0   U     100    0        0 enp0s3
10.244.0.0      0.0.0.0         255.255.255.0   U     0      0        0 cni0
10.244.1.0      10.244.1.0      255.255.255.0   UG    0      0        0 enp0s8
10.244.1.0      0.0.0.0         255.255.255.0   U     0      0        0 enp0s8
10.244.1.0      10.244.1.0      255.255.255.0   UG    0      0        0 flannel.1
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
192.168.56.0    0.0.0.0         255.255.255.0   U     101    0        0 enp0s8

Two extra routing table rules have been added:

10.244.1.0      10.244.1.0      255.255.255.0   UG    0      0        0 enp0s8
10.244.1.0      0.0.0.0         255.255.255.0   U     0      0        0 enp0s8

Do the same on k8s-node1:

# Run below commands on k8s-node1 to add routing table rules. 10.244.0.0 is the IP address of flannel.1 on k8s-master1.
route add -net 10.244.0.0 netmask 255.255.255.0 enp0s8
route add -net 10.244.0.0 netmask 255.255.255.0 gw 10.244.0.0
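
Routes added with route add do not survive a reboot; on CentOS 7 they could be persisted roughly like this (a sketch, assuming the network-scripts route-<interface> file mechanism):

# On k8s-master1: persist the route to the k8s-node1 pod subnet.
echo "10.244.1.0/24 dev enp0s8" >> /etc/sysconfig/network-scripts/route-enp0s8

# On k8s-node1: persist the route to the k8s-master1 pod subnet.
echo "10.244.0.0/24 dev enp0s8" >> /etc/sysconfig/network-scripts/route-enp0s8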

Run test cases 1, 2, 3, 4, and 5 again. All of them pass. I am confused: why? I have a few questions.

Questions:

  1. Why is the connection through flannel.1 blocked? Is there any way to investigate this?

  2. Is this a limitation of flannel with a Host-Only network interface? Would the connection work if I switched to another CNI plugin?

  3. Why does the connection work after adding the routing table rules via enp0s8?

Answer 1

Finally, I found the root cause. I set up the VMs with VirtualBox using two network interfaces: a NAT interface facing the internet and a Host-Only interface for communication within the k8s cluster. Flannel was always using the NAT interface, which is not correct. We need to configure the correct network interface in kube-flannel.yaml:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-flannel-ds
  namespace: kube-system
  ...
spec:
  ...
  template:
    ...
    spec:
      ...
      containers:
      - name: kube-flannel
        image: quay.io/coreos/flannel:v0.13.1-rc1
        command:
        - /opt/bin/flanneld
        args:
        - --ip-masq
        - --kube-subnet-mgr
        # THIS LINE
        - --iface=enp0s8
