How to debug a service that cannot be reached from outside an (on-prem) Kubernetes cluster via its load-balancer IP

I have an on-premises Kubernetes (v1.24) cluster with CRI-O as the container runtime, Calico as the CNI, and MetalLB handing out load-balancer IPs.

The master and worker nodes run Rocky Linux 9 with SELinux enabled and firewalld disabled, and kube-proxy runs in IPVS mode.
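A few quick sanity checks on this setup might be worth running first (a sketch; resource and interface names are the ones from this post):

```shell
# Confirm kube-proxy really runs in IPVS mode
kubectl -n kube-system get configmap kube-proxy -o yaml | grep 'mode:'

# On a node: list the IPVS virtual servers programmed for the LB IP
sudo ipvsadm -Ln | grep -A 3 '10.16.0.1'

# SELinux denials, in case dontaudit rules keep them out of the audit log
sudo ausearch -m avc -ts recent
```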

I have set up BGP peering with a Mikrotik router, and I can see the IP range I configured being advertised to the router; its Routes section shows an entry:

10.16.0.0/28 reachable through bridge

The external IP of the nginx service I am testing is 10.16.0.1. Curling that IP from a master, a worker, or inside a pod works and returns the default nginx welcome page, but curling from my laptop just hangs until it times out, and the SELinux audit log shows no denials.

nmap against the IP lists port 80 (although the scan below actually reports it as filtered), ping works, and traceroute shows the expected path. As a sanity check I also deleted the service and reran nmap, ping, and traceroute; everything stopped working, as expected.
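When the curl from the laptop hangs, it helps to distinguish a silent drop from an active reject (a sketch using standard tools):

```shell
# A connection that is actively refused fails immediately;
# one whose SYNs are dropped (what nmap reports as "filtered") times out.
curl -v --connect-timeout 5 http://10.16.0.1/

# netcat gives the same distinction: "connection refused" vs a timeout
nc -vz -w 5 10.16.0.1 80
```

The "filtered" state in the nmap scan below points the same way: the SYN to port 80 is being dropped somewhere along the path, not rejected.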

# commands below run on my laptop,
# which is connected to the local network,
# but the results are the same when run
# from other devices on the network

# ---

❯ nmap -T4 10.16.0.1
Starting Nmap 7.94 ( https://nmap.org ) at 2023-09-23 11:01 +07
Nmap scan report for 10.16.0.1
Host is up (0.0048s latency).
Not shown: 997 closed tcp ports (conn-refused)
PORT    STATE    SERVICE
22/tcp  open     ssh
80/tcp  filtered http
179/tcp open     bgp

# ---

# 192.168.88.43 is the master node's local IP
❯ ping 10.16.0.1
PING 10.16.0.1 (10.16.0.1): 56 data bytes
64 bytes from 10.16.0.1: icmp_seq=0 ttl=64 time=9.144 ms
92 bytes from 192.168.88.1: Redirect Host(New addr: 192.168.88.43)
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 0054 c3b4   0 0000  3f  01 94cc 192.168.88.111  10.16.0.1

64 bytes from 10.16.0.1: icmp_seq=1 ttl=64 time=3.003 ms
92 bytes from 192.168.88.1: Redirect Host(New addr: 192.168.88.43)
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 0054 c9cc   0 0000  3f  01 8eb4 192.168.88.111  10.16.0.1

64 bytes from 10.16.0.1: icmp_seq=2 ttl=64 time=3.209 ms
92 bytes from 192.168.88.1: Redirect Host(New addr: 192.168.88.43)
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 0054 305b   0 0000  3f  01 2826 192.168.88.111  10.16.0.1

64 bytes from 10.16.0.1: icmp_seq=3 ttl=64 time=2.557 ms
92 bytes from 192.168.88.1: Redirect Host(New addr: 192.168.88.43)
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 0054 25d8   0 0000  3f  01 32a9 192.168.88.111  10.16.0.1

64 bytes from 10.16.0.1: icmp_seq=4 ttl=64 time=3.594 ms
92 bytes from 192.168.88.1: Redirect Host(New addr: 192.168.88.43)
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 0054 4016   0 0000  3f  01 186b 192.168.88.111  10.16.0.1

64 bytes from 10.16.0.1: icmp_seq=5 ttl=64 time=2.974 ms

64 bytes from 10.16.0.1: icmp_seq=6 ttl=64 time=4.397 ms
^C
--- 10.16.0.1 ping statistics ---
7 packets transmitted, 7 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 2.557/4.125/9.144/2.119 ms

# --

# 192.168.88.1 is my default gateway
❯ traceroute 10.16.0.1
traceroute to 10.16.0.1 (10.16.0.1), 64 hops max, 52 byte packets
 1  192.168.88.1 (192.168.88.1)  3.669 ms  2.713 ms  2.552 ms
 2  10.16.0.1 (10.16.0.1)  3.303 ms  3.292 ms  3.145 ms

Checking the nginx resources

❯ kubectl get pods -o wide
NAME                                 READY   STATUS    RESTARTS        AGE   IP              NODE           NOMINATED NODE   READINESS GATES
nginx                                1/1     Running   0               39m   172.16.29.154   k8s-worker-0   <none>           <none>

❯ kubectl get svc
NAME                 TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)                               AGE
kubernetes           ClusterIP      10.96.0.1      <none>        443/TCP                               2d5h
nginx                LoadBalancer   10.97.153.41   10.16.0.1     80:30422/TCP                          40m

❯ kubectl get endpoints
NAME                 ENDPOINTS                                                               AGE
kubernetes           192.168.88.43:6443                                                      2d5h
nginx                172.16.29.154:80                                                        41m

❯ kubectl describe service nginx
Name:                     nginx
Namespace:                default
Labels:                   <none>
Annotations:              metallb.universe.tf/address-pool: public
                          metallb.universe.tf/ip-allocated-from-pool: public
Selector:                 app=nginx
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.97.153.41
IPs:                      10.97.153.41
LoadBalancer Ingress:     10.16.0.1
Port:                     <unset>  80/TCP
TargetPort:               80/TCP
NodePort:                 <unset>  30422/TCP
Endpoints:                172.16.29.154:80
Session Affinity:         None
External Traffic Policy:  Local
HealthCheck NodePort:     31552
Events:
  Type    Reason       Age   From                Message
  ----    ------       ----  ----                -------
  Normal  IPAllocated  41m   metallb-controller  Assigned IP ["10.16.0.1"]
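One detail in the describe output worth probing: External Traffic Policy: Local means only nodes that host a ready nginx endpoint answer the health-check NodePort and accept LB traffic. As a debugging step (not a fix for the root cause), temporarily switching the policy to Cluster shows whether endpoint placement is the issue:

```shell
# Temporarily let any node forward LB traffic (kube-proxy SNATs to the node IP)
kubectl patch svc nginx -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'

# Revert after testing
kubectl patch svc nginx -p '{"spec":{"externalTrafficPolicy":"Local"}}'
```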

Calico configuration

❯ kubectl describe bgppeer
Name:         global-peer
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  projectcalico.org/v3
Kind:         BGPPeer
Metadata:
  Creation Timestamp:  2023-09-22T06:37:35Z
  Resource Version:    430164
  UID:                 62d549ac-b7ed-47ab-bc2c-b94de8a08939
Spec:
  As Number:  65530
  Filters:
    default
  Peer IP:  192.168.88.1
Events:     <none>

❯ kubectl describe bgpconfiguration
Name:         default
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  projectcalico.org/v3
Kind:         BGPConfiguration
Metadata:
  Creation Timestamp:  2023-09-22T06:37:24Z
  Resource Version:    839696
  UID:                 fd6aef3e-0a4c-4ecc-afe4-09395b76d107
Spec:
  As Number:                  65500
  Node To Node Mesh Enabled:  false
  Service Load Balancer I Ps:
    Cidr:  10.16.0.0/28
Events:    <none>
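The flattened "Service Load Balancer I Ps" lines above are just how kubectl describe renders the resource; the underlying BGPConfiguration (field names per the projectcalico.org/v3 API) is equivalent to this sketch, applied with calicoctl (or kubectl, if the Calico API server is installed):

```shell
cat <<'EOF' | calicoctl apply -f -
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  asNumber: 65500
  nodeToNodeMeshEnabled: false
  serviceLoadBalancerIPs:
  - cidr: 10.16.0.0/28
EOF
```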

❯ kubectl describe bgpfilter
Name:         default
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  projectcalico.org/v3
Kind:         BGPFilter
Metadata:
  Creation Timestamp:  2023-09-22T06:37:34Z
  Resource Version:    434512
  UID:                 df8b5aeb-e25e-481c-9403-fcc54727eea8
Spec:
  exportV4:
    Action:          Reject
    Cidr:            10.16.0.0/28
    Match Operator:  NotIn
Events:              <none>

❯ kubectl describe installation
Name:         default
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  operator.tigera.io/v1
Kind:         Installation
Metadata:
  Creation Timestamp:  2023-09-21T05:32:53Z
  Finalizers:
    tigera.io/operator-cleanup
  Generation:        3
  Resource Version:  1052996
  UID:               2f7928cd-3815-42be-9483-512f2f39fcf9
Spec:
  Calico Network:
    Bgp:         Enabled
    Host Ports:  Enabled
    Ip Pools:
      Block Size:          26
      Cidr:                172.16.0.0/16
      Disable BGP Export:  false
      Encapsulation:       None
      Nat Outgoing:        Disabled
      Node Selector:       all()
    Linux Dataplane:       Iptables
    Multi Interface Mode:  None
    nodeAddressAutodetectionV4:
      First Found:  true
  Cni:
    Ipam:
      Type:                    Calico
    Type:                      Calico
  Control Plane Replicas:      2
  Flex Volume Path:            /usr/libexec/kubernetes/kubelet-plugins/volume/exec/
  Kubelet Volume Plugin Path:  /var/lib/kubelet
  Logging:
    Cni:
      Log File Max Age Days:  30
      Log File Max Count:     10
      Log File Max Size:      100Mi
      Log Severity:           Info
  Node Update Strategy:
    Rolling Update:
      Max Unavailable:  1
    Type:               RollingUpdate
  Non Privileged:       Disabled
  Variant:              Calico
Status:
  Calico Version:  v3.26.1
  Computed:
    Calico Network:
      Bgp:         Enabled
      Host Ports:  Enabled
      Ip Pools:
        Block Size:          26
        Cidr:                172.16.0.0/16
        Disable BGP Export:  false
        Encapsulation:       None
        Nat Outgoing:        Disabled
        Node Selector:       all()
      Linux Dataplane:       Iptables
      Multi Interface Mode:  None
      nodeAddressAutodetectionV4:
        First Found:  true
    Cni:
      Ipam:
        Type:                    Calico
      Type:                      Calico
    Control Plane Replicas:      2
    Flex Volume Path:            /usr/libexec/kubernetes/kubelet-plugins/volume/exec/
    Kubelet Volume Plugin Path:  /var/lib/kubelet
    Logging:
      Cni:
        Log File Max Age Days:  30
        Log File Max Count:     10
        Log File Max Size:      100Mi
        Log Severity:           Info
    Node Update Strategy:
      Rolling Update:
        Max Unavailable:  1
      Type:               RollingUpdate
    Non Privileged:       Disabled
    Variant:              Calico
  Conditions:
    Last Transition Time:  2023-09-23T15:08:30Z
    Message:
    Observed Generation:   3
    Reason:                Unknown
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2023-09-23T15:08:30Z
    Message:
    Observed Generation:   3
    Reason:                Unknown
    Status:                False
    Type:                  Ready
    Last Transition Time:  2023-09-23T15:08:30Z
    Message:               DaemonSet "calico-system/calico-node" is not available (awaiting 1 nodes)
    Observed Generation:   3
    Reason:                ResourceNotReady
    Status:                True
    Type:                  Progressing
  Mtu:                     1450
  Variant:                 Calico
Events:                    <none>

calicoctl node status shows

(screenshot of the calicoctl node status output; image not reproduced here)

MetalLB configuration

❯ kubectl -n metallb-system describe ipaddresspool
Name:         public
Namespace:    metallb-system
Labels:       <none>
Annotations:  <none>
API Version:  metallb.io/v1beta1
Kind:         IPAddressPool
Metadata:
  Creation Timestamp:  2023-09-22T07:57:01Z
  Generation:          3
  Resource Version:    434511
  UID:                 aa2d0e08-4fee-4ce1-a822-9bf4f52a3a3c
Spec:
  Addresses:
    10.16.0.0/28
  Auto Assign:       true
  Avoid Buggy I Ps:  true
Events:              <none>
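With Calico handling the BGP advertisement, MetalLB here only needs to assign addresses. It is still worth checking which MetalLB advertisement resources exist at all (a sketch; notably, without an L2Advertisement MetalLB will not answer ARP for 10.16.0.1, which would match the unanswered arping requests in the update below):

```shell
kubectl -n metallb-system get bgpadvertisements.metallb.io,l2advertisements.metallb.io
```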

So how do I debug this, and why can't I reach the nginx service from outside the cluster?

Update 0

I ran tcpdump -n -i ens18 host 10.16.0.1 on the master and worker nodes, then from another machine outside the cluster ran ping 10.16.0.1 and arping -I 10.16.0.1, which gives the following on both the master and the worker:

# ping replies only show up on the master node
# 192.168.88.111 is the other machine (my laptop)
dropped privs to tcpdump
tcpdump: listening on ens18, link-type EN10MB (Ethernet), snapshot length 262144 bytes
15:31:29.842416 IP (tos 0x0, ttl 63, id 33297, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.88.111 > 10.16.0.1: ICMP echo request, id 14677, seq 0, length 64
15:31:29.842465 IP (tos 0x0, ttl 64, id 38235, offset 0, flags [none], proto ICMP (1), length 84)
    10.16.0.1 > 192.168.88.111: ICMP echo reply, id 14677, seq 0, length 64
15:31:30.846173 IP (tos 0x0, ttl 63, id 14503, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.88.111 > 10.16.0.1: ICMP echo request, id 14677, seq 1, length 64
15:31:30.846206 IP (tos 0x0, ttl 64, id 38836, offset 0, flags [none], proto ICMP (1), length 84)
    10.16.0.1 > 192.168.88.111: ICMP echo reply, id 14677, seq 1, length 64
# arping received 0 responses
# the arping requests show up on both the master and the worker node
15:33:08.269603 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.16.0.1 (Broadcast) tell 192.168.88.63, length 46
15:33:09.269759 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.16.0.1 (Broadcast) tell 192.168.88.63, length 46
15:33:10.269768 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.16.0.1 (Broadcast) tell 192.168.88.63, length 46
15:33:11.269725 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.16.0.1 (Broadcast) tell 192.168.88.63, length 46
15:33:12.269575 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.16.0.1 (Broadcast) tell 192.168.88.63, length 46
15:33:13.269695 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.16.0.1 (Broadcast) tell 192.168.88.63, length 46
15:33:14.269685 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.16.0.1 (Broadcast) tell 192.168.88.63, length 46
15:33:15.269715 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.16.0.1 (Broadcast) tell 192.168.88.63, length 46
15:33:16.269754 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.16.0.1 (Broadcast) tell 192.168.88.63, length 46
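A follow-up capture that targets the HTTP traffic itself, rather than ICMP and ARP, would narrow down where the SYNs die (a sketch; ens18 is the node interface from above):

```shell
# Run on both master and worker, then curl http://10.16.0.1 from the laptop:
sudo tcpdump -n -i ens18 'host 10.16.0.1 and tcp port 80'
```

If the SYNs reach a node but no SYN-ACK goes back, the drop is on-node (host firewall, IPVS, or policy); if they never arrive, the router is dropping or misrouting them.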

Answer 1

I would suggest also asking this question in the Calico Users Slack.

In your BGPConfiguration I see you have disabled the full node-to-node mesh (Node To Node Mesh Enabled: false). Is that intentional? Are you using route reflectors?

I would start by checking the BGP routes on the Mikrotik, e.g. (adjust if you are running an older RouterOS version):

/ip/route/print detail from=[find gateway~"^192.168.88.63[0x00-0xff]*"]

If that looks right, I would check whether the LB is reachable from the Mikrotik itself (substituting your LB IP):

/tool/fetch http://10.43.1.1

Then check the BIRD protocol state by running birdcl from the calico-node pods:

kubectl exec -n calico-system ds/calico-node -c calico-node -- birdcl show protocols

kubectl exec -n calico-system ds/calico-node -c calico-node -- birdcl show route
