DNS queries between nodes fail in Kubernetes

I have run into a problem with Kubernetes and CoreDNS that I cannot solve.

System information

  • 1 master + 2 workers
  • Red Hat Enterprise Linux 9.3 (Plow)
  • K8s: v1.28.7
  • Flannel: v0.24.2
  • CoreDNS: v1.10.1
  • Cluster created with kubeadm
  • Firewall (firewalld) and SELinux disabled
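
As a quick sanity check (firewalld being the stock firewall service on RHEL 9), both can be confirmed off:

[user@worker1 ~]$ sudo systemctl is-active firewalld
inactive
[user@worker1 ~]$ getenforce
Disabled

An overview of the pods in the cluster: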
NAMESPACE      NAME                                               READY   STATUS    RESTARTS       AGE     IP              NODE
default        dnsutils                                           1/1     Running   78 (51m ago)   3d6h    10.244.1.9      worker2
kube-flannel   kube-flannel-ds-6qkt4                              1/1     Running   0              172m    10.229.144.10   master
kube-flannel   kube-flannel-ds-jqhql                              1/1     Running   0              172m    10.229.144.12   worker2
kube-flannel   kube-flannel-ds-qzw5c                              1/1     Running   0              172m    10.229.144.11   worker1
kube-system    coredns-565799df8b-7nncc                           1/1     Running   1 (3h7m ago)   3d7h    10.244.1.8      worker2
kube-system    coredns-565799df8b-mcwff                           1/1     Running   1 (3h8m ago)   3d7h    10.244.2.8      worker1
kube-system    etcd-master                                        1/1     Running   5 (3h8m ago)   3h13m   10.229.144.10   master
kube-system    kube-apiserver-master                              1/1     Running   1 (3h8m ago)   4d3h    10.229.144.10   master
kube-system    kube-controller-manager-master                     1/1     Running   1 (3h8m ago)   4d3h    10.229.144.10   master
kube-system    kube-proxy-4lbsp                                   1/1     Running   1 (3h7m ago)   3h12m   10.229.144.12   worker2
kube-system    kube-proxy-6m4l4                                   1/1     Running   1 (3h8m ago)   3h12m   10.229.144.10   master
kube-system    kube-proxy-r9zgx                                   1/1     Running   1 (3h8m ago)   3h12m   10.229.144.11   worker1
kube-system    kube-scheduler-master                              1/1     Running   1 (3h8m ago)   4d3h    10.229.144.10   master

The problem is that DNS queries sent from one worker to the other fail. To demonstrate this, I am running the dnsutils pod on worker2. If I look up a name against the CoreDNS instance on the same worker:

[jenkins@nordlabkubem ~]$ kubectl exec -it dnsutils -- nslookup kubernetes
Server:     10.96.0.10
Address:    10.96.0.10#53

Name:   kubernetes.default.svc.cluster.local
Address: 10.96.0.1

Everything works. But if I point the query at the CoreDNS pod on worker1:

[jenkins@nordlabkubem ~]$ kubectl exec -it dnsutils -- nslookup kubernetes 10.244.2.8
;; connection timed out; no servers could be reached

command terminated with exit code 1
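
To rule out nslookup itself, the same query can be made with dig (also included in the dnsutils image); +time and +tries just keep it from retrying for long:

[jenkins@nordlabkubem ~]$ kubectl exec -it dnsutils -- dig @10.244.2.8 kubernetes.default.svc.cluster.local +time=2 +tries=1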

The pod's /etc/resolv.conf looks like this (the bare name kubernetes is expanded via the search list, which is why the answer above comes back as kubernetes.default.svc.cluster.local):

[user@master ~]$ kubectl exec -it dnsutils -- cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local mylab.local
nameserver 10.96.0.10
options ndots:5

I have been chasing this for a while. If I run tcpdump on the interfaces of both workers while doing the nslookup from dnsutils on worker2 against the CoreDNS pod on worker1, the queries leave worker2 but never show up on worker1:

[user@worker2 ~]$ sudo tcpdump -nn udp port 8472
13:58:13.330215 IP <worker2>.54673 > <worker1>.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.1.9.56384 > 10.244.2.8.53: 50759+ A? kubernetes.default.svc.cluster.local. (54)
13:58:18.330300 IP <worker2>.54673 > <worker1>.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.1.9.56384 > 10.244.2.8.53: 50759+ A? kubernetes.default.svc.cluster.local. (54)
13:58:23.330377 IP <worker2>.54673 > <worker1>.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.1.9.56384 > 10.244.2.8.53: 50759+ A? kubernetes.default.svc.cluster.local. (54)

[user@worker1 ~]$ sudo tcpdump -nn udp port 8472
# nothing here
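
A refinement worth noting: with -vv, tcpdump also verifies and prints the outer UDP checksum, which is exactly what VXLAN offload bugs corrupt. On the sending host an offloaded checksum legitimately shows as bad (the NIC fills it in after the capture point), so the receiving side is the meaningful place to check:

[user@worker1 ~]$ sudo tcpdump -nn -vv udp port 8472
# a corrupted outer checksum would be printed as "bad udp cksum" on arriving packets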

If I ping instead, the traffic does arrive:

[user@master ~]$ kubectl exec -it dnsutils -- ping 10.244.2.8
PING 10.244.2.8 (10.244.2.8) 56(84) bytes of data.
64 bytes from 10.244.2.8: icmp_seq=1 ttl=62 time=0.956 ms


#worker2
14:17:25.308459 IP <worker2>.47186 > <worker1>.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.1.9 > 10.244.2.8: ICMP echo request, id 98, seq 1, length 64
14:17:25.309282 IP <worker1>.48946 > <worker2>.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.2.8 > 10.244.1.9: ICMP echo reply, id 98, seq 1, length 64

#worker1
14:17:25.308827 IP <worker2>.47186 > <worker1>.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.1.9 > 10.244.2.8: ICMP echo request, id 98, seq 1, length 64
14:17:25.308927 IP <worker1>.48946 > <worker2>.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.2.8 > 10.244.1.9: ICMP echo reply, id 98, seq 1, length 64
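
Since ICMP makes it across while UDP port 53 does not, another data point worth collecting is the same lookup forced over TCP:

[user@master ~]$ kubectl exec -it dnsutils -- dig @10.244.2.8 kubernetes.default.svc.cluster.local +tcp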

I turned up kube-proxy logging, but it shows no errors. There are no errors in the Flannel logs. No errors in the CoreDNS logs either, though I do not know how to increase the logging there.
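
For what it is worth, the log plugin in the Corefile below already writes every query that reaches CoreDNS to its pod logs, so arriving (or missing) queries can be confirmed there using the kubeadm default label:

$ kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50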

The CoreDNS ConfigMap:

$ kubectl get cm -n kube-system coredns -o yaml
apiVersion: v1
data:
  Corefile: |
    .:53 {
        log
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
           max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }
kind: ConfigMap

The kube-flannel-cfg ConfigMap:

$ kubectl get cm -n kube-flannel kube-flannel-cfg -o yaml
apiVersion: v1
data:
  cni-conf.json: |
    {
      "name": "cbr0",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          "type": "flannel",
          "delegate": {
            "hairpinMode": true,
            "isDefaultGateway": true
          }
        },
        {
          "type": "portmap",
          "capabilities": {
            "portMappings": true
          }
        }
      ]
    }
  net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "Backend": {
        "Type": "vxlan"
      }
    }
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"cni-conf.json":"{\n  \"name\": \"cbr0\",\n  \"cniVersion\": \"0.3.1\",\n  \"plugins\": [\n    {\n      \"type\": \"flannel\",\n      \"delegate\": {\n        \"hairpinMode\": true,\n        \"isDefaultGateway\": true\n      }\n    },\n    {\n      \"type\": \"portmap\",\n      \"capabilities\": {\n        \"portMappings\": true\n      }\n    }\n  ]\n}\n","net-conf.json":"{\n  \"Network\": \"10.244.0.0/16\",\n  \"Backend\": {\n    \"Type\": \"vxlan\"\n  }\n}\n"},"kind":"ConfigMap","metadata":{"annotations":{},"labels":{"app":"flannel","k8s-app":"flannel","tier":"node"},"name":"kube-flannel-cfg","namespace":"kube-flannel"}}
  creationTimestamp: "2024-03-11T11:47:42Z"
  labels:
    app: flannel
    k8s-app: flannel
    tier: node
  name: kube-flannel-cfg
  namespace: kube-flannel
  resourceVersion: "524202"
  uid: 036330ff-62aa-4bf8-8066-f9e5d7314869
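
The vxlan backend above is what creates the flannel.1 device on each node; its VNI and UDP port (flannel's defaults are 1 and 8472) can be cross-checked against the "instance 1" and port 8472 seen in the captures:

[user@worker1 ~]$ ip -d link show flannel.1
# the detail line should contain: vxlan id 1 ... dstport 8472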

So I do not think the underlying network is broken, but there is something strange going on with CoreDNS?

The problem also occurs if either of the pods runs on the master.

I have restarted the pods and the servers, reinstalled Flannel, and so on, but simply cannot get it to work. Apart from this, the cluster seems to work fine: pods are running and there are no errors in the logs.

Answer 1

The problem was solved with the steps listed in https://github.com/kubernetes/kubernetes/issues/72370#issuecomment-1647206933:

$ sudo ethtool -K ens192 tx-checksum-ip-generic off
$ sudo nmcli con modify ens192 ethtool.feature-tx-checksum-ip-generic off
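
The first command turns the offload off immediately; the second makes the change persistent through NetworkManager. The resulting state can be verified with:

$ sudo ethtool -k ens192 | grep tx-checksum-ip-generic
tx-checksum-ip-generic: off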

I neglected to mention that the servers run on VMware; the linked issue points at the virtual NIC's UDP checksum offload corrupting the VXLAN-encapsulated packets, which fits the one-way traffic in the captures above.
