I have run into a problem with Kubernetes and CoreDNS that I cannot solve.
System information:
- 1 master + 2 workers
- Red Hat Enterprise Linux release 9.3 (Plow)
- K8s: v1.28.7
- Flannel: v0.24.2
- CoreDNS: v1.10.1
- kubeadm was used to create the cluster
- Firewall and SELinux are disabled (see the quick check below)
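A quick way to double-check on each node that SELinux and firewalld really are off (plain RHEL commands, nothing cluster-specific):
$ getenforce
# expected: Permissive or Disabled
$ systemctl is-active firewalld
# expected: inactive (or an error if firewalld is not installed at all)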
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
default dnsutils 1/1 Running 78 (51m ago) 3d6h 10.244.1.9 worker2
kube-flannel kube-flannel-ds-6qkt4 1/1 Running 0 172m 10.229.144.10 master
kube-flannel kube-flannel-ds-jqhql 1/1 Running 0 172m 10.229.144.12 worker2
kube-flannel kube-flannel-ds-qzw5c 1/1 Running 0 172m 10.229.144.11 worker1
kube-system coredns-565799df8b-7nncc 1/1 Running 1 (3h7m ago) 3d7h 10.244.1.8 worker2
kube-system coredns-565799df8b-mcwff 1/1 Running 1 (3h8m ago) 3d7h 10.244.2.8 worker1
kube-system etcd-master 1/1 Running 5 (3h8m ago) 3h13m 10.229.144.10 master
kube-system kube-apiserver-master 1/1 Running 1 (3h8m ago) 4d3h 10.229.144.10 master
kube-system kube-controller-manager-master 1/1 Running 1 (3h8m ago) 4d3h 10.229.144.10 master
kube-system kube-proxy-4lbsp 1/1 Running 1 (3h7m ago) 3h12m 10.229.144.12 worker2
kube-system kube-proxy-6m4l4 1/1 Running 1 (3h8m ago) 3h12m 10.229.144.10 master
kube-system kube-proxy-r9zgx 1/1 Running 1 (3h8m ago) 3h12m 10.229.144.11 worker1
kube-system kube-scheduler-master 1/1 Running 1 (3h8m ago) 4d3h 10.229.144.10 master
The problem is that DNS queries fail when they are sent from one worker to the other. To demonstrate this, I ran a dnsutils pod on worker2. If I query a name against the CoreDNS pod on the same worker:
[jenkins@nordlabkubem ~]$ kubectl exec -it dnsutils -- nslookup kubernetes
Server: 10.96.0.10
Address: 10.96.0.10#53
Name: kubernetes.default.svc.cluster.local
Address: 10.96.0.1
everything works as expected. But if I point the same query at the CoreDNS pod on worker1:
[jenkins@nordlabkubem ~]$ kubectl exec -it dnsutils -- nslookup kubernetes 10.244.2.8
;; connection timed out; no servers could be reached
command terminated with exit code 1
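For reference, the dnsutils pod was created with a minimal manifest along these lines (a sketch: the image is the one suggested by the Kubernetes DNS-debugging docs, and pinning the pod to worker2 via nodeName is an assumption to reproduce the layout described above):
apiVersion: v1
kind: Pod
metadata:
  name: dnsutils
  namespace: default
spec:
  nodeName: worker2                 # assumption: keep the pod on worker2
  containers:
  - name: dnsutils
    image: registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3
    command: ["sleep", "infinity"]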
The pod's /etc/resolv.conf is in order:
[user@master ~]$ kubectl exec -it dnsutils -- cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local mylab.local
nameserver 10.96.0.10
options ndots:5
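Because of ndots:5, the short name kubernetes is expanded through the search list before it is sent, so the resolver is really asking for kubernetes.default.svc.cluster.local. Querying the fully qualified name (with a trailing dot) directly against the worker1 CoreDNS pod takes the search path out of the picture, for example:
$ kubectl exec -it dnsutils -- nslookup kubernetes.default.svc.cluster.local. 10.244.2.8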
I have been chasing this problem for a while. If I run tcpdump on the interfaces of both workers while dnsutils on worker2 runs an nslookup against the CoreDNS pod on worker1, no incoming traffic shows up on worker1:
[user@worker2 ~]$ sudo tcpdump -nn udp port 8472
13:58:13.330215 IP <worker2>.54673 > <worker1>.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.1.9.56384 > 10.244.2.8.53: 50759+ A? kubernetes.default.svc.cluster.local. (54)
13:58:18.330300 IP <worker2>.54673 > <worker1>.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.1.9.56384 > 10.244.2.8.53: 50759+ A? kubernetes.default.svc.cluster.local. (54)
13:58:23.330377 IP <worker2>.54673 > <worker1>.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.1.9.56384 > 10.244.2.8.53: 50759+ A? kubernetes.default.svc.cluster.local. (54)
[user@worker1 ~]$ sudo tcpdump -nn udp port 8472
# nothing here
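The UDP port 8472 in these captures is Flannel's VXLAN tunnel port. The tunnel parameters (VXLAN ID, local VTEP address, port) and the forwarding entries pointing at the other nodes can be confirmed on each worker with something like the following (flannel.1 is the device name Flannel creates by default, assumed here):
$ ip -d link show flannel.1
$ bridge fdb show dev flannel.1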
If I ping the same pod instead, the incoming traffic does show up on both sides:
[user@master ~]$ kubectl exec -it dnsutils -- ping 10.244.2.8
PING 10.244.2.8 (10.244.2.8) 56(84) bytes of data.
64 bytes from 10.244.2.8: icmp_seq=1 ttl=62 time=0.956 ms
#worker2
14:17:25.308459 IP <worker2>.47186 > <worker1>.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.1.9 > 10.244.2.8: ICMP echo request, id 98, seq 1, length 64
14:17:25.309282 IP <worker1>.48946 > <worker2>.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.2.8 > 10.244.1.9: ICMP echo reply, id 98, seq 1, length 64
#worker1
14:17:25.308827 IP <worker2>.47186 > <worker1>.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.1.9 > 10.244.2.8: ICMP echo request, id 98, seq 1, length 64
14:17:25.308927 IP <worker1>.48946 > <worker2>.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.244.2.8 > 10.244.1.9: ICMP echo reply, id 98, seq 1, length 64
I increased kube-proxy logging, but it shows no errors. There are no errors in the Flannel logs, and none in the CoreDNS logs either, though I don't know how to increase the logging level there.
The coredns ConfigMap:
$ kubectl get cm -n kube-system coredns -o yaml
apiVersion: v1
data:
  Corefile: |
    .:53 {
        log
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
           max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }
kind: ConfigMap
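Note that the log plugin is already enabled in this Corefile, so CoreDNS logs one line per query it actually receives. Tailing both CoreDNS pods while repeating the failing nslookup is an easy way to confirm whether the query ever reaches the pod on worker1 (k8s-app=kube-dns is the standard kubeadm label for the CoreDNS deployment):
$ kubectl logs -n kube-system -l k8s-app=kube-dns -f --prefix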
The kube-flannel-cfg ConfigMap:
$ kubectl get cm -n kube-flannel kube-flannel-cfg -o yaml
apiVersion: v1
data:
  cni-conf.json: |
    {
      "name": "cbr0",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          "type": "flannel",
          "delegate": {
            "hairpinMode": true,
            "isDefaultGateway": true
          }
        },
        {
          "type": "portmap",
          "capabilities": {
            "portMappings": true
          }
        }
      ]
    }
  net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "Backend": {
        "Type": "vxlan"
      }
    }
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"cni-conf.json":"{\n \"name\": \"cbr0\",\n \"cniVersion\": \"0.3.1\",\n \"plugins\": [\n {\n \"type\": \"flannel\",\n \"delegate\": {\n \"hairpinMode\": true,\n \"isDefaultGateway\": true\n }\n },\n {\n \"type\": \"portmap\",\n \"capabilities\": {\n \"portMappings\": true\n }\n }\n ]\n}\n","net-conf.json":"{\n \"Network\": \"10.244.0.0/16\",\n \"Backend\": {\n \"Type\": \"vxlan\"\n }\n}\n"},"kind":"ConfigMap","metadata":{"annotations":{},"labels":{"app":"flannel","k8s-app":"flannel","tier":"node"},"name":"kube-flannel-cfg","namespace":"kube-flannel"}}
  creationTimestamp: "2024-03-11T11:47:42Z"
  labels:
    app: flannel
    k8s-app: flannel
    tier: node
  name: kube-flannel-cfg
  namespace: kube-flannel
  resourceVersion: "524202"
  uid: 036330ff-62aa-4bf8-8066-f9e5d7314869
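For completeness, the 10.244.0.0/16 Network above is split into one podCIDR per node, which is where the 10.244.1.x (worker2) and 10.244.2.x (worker1) pod addresses come from; the per-node assignment can be listed with:
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,PODCIDR:.spec.podCIDR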
So I don't think there is anything wrong with the network itself, but perhaps there is some strange issue with CoreDNS?
The same problem also occurs if either of the pods is running on the master.
I have restarted the pods and the servers, reinstalled Flannel, and so on, but I just cannot get this to work. Apart from this, the cluster appears to work fine: pods are running and there are no errors in the logs.
Answer 1
The problem was solved with the steps listed in https://github.com/kubernetes/kubernetes/issues/72370#issuecomment-1647206933:
$ sudo ethtool -K ens192 tx-checksum-ip-generic off
$ sudo nmcli con modify ens192 ethtool.feature-tx-checksum-ip-generic off
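The ethtool command takes effect immediately and the nmcli change makes it persist across reboots; whether the offload really is off can be verified with:
$ ethtool -k ens192 | grep tx-checksum-ip-generic
# should now report: tx-checksum-ip-generic: off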
I had not mentioned that the servers are running on VMware.