I have been investigating the root cause of a tricky routing problem on a CentOS 7 cluster...
Behavior:
- TCP packets from Docker containers reach their destination outside the cluster, but the response packets never make it back to the container waiting for that answer
- iptables logging now strongly suggests that the "routing decision" (in iptables terms) is causing the problem. More precisely: the response packets are still present at stage "mangle PREROUTING" but missing at stage "mangle FORWARD/INPUT" (a sketch of the logging rules is included right after the routing table below)
- The results of "ip route get" are as follows:
## Check route from container to service host outside of cluster
ip route get to 172.17.27.1 from 10.233.70.32 iif cni0
## Works just fine as mentioned. Result:
# 172.17.27.1 from 10.233.70.32 dev ens192
# cache iif cni0
## Check route from service host outside of cluster back to container
ip route get to 10.233.70.32 from 172.17.27.1 iif ens192
## Does not work. Error Msg:
# RTNETLINK answers: No route to host
- So I was fairly sure there had to be a faulty route configured somewhere in the routing table. The command "ip route list" gives:
default via 172.17.0.2 dev ens192 proto static
10.233.64.0/24 via 10.233.64.0 dev flannel.1 onlink
10.233.65.0/24 via 10.233.65.0 dev flannel.1 onlink
10.233.66.0/24 via 10.233.66.0 dev flannel.1 onlink
10.233.67.0/24 via 10.233.67.0 dev flannel.1 onlink
10.233.68.0/24 via 10.233.68.0 dev flannel.1 onlink
10.233.69.0/24 via 10.233.69.0 dev flannel.1 onlink
10.233.70.0/24 dev cni0 proto kernel scope link src 10.233.70.1 # this is the local container network
10.233.71.0/24 via 10.233.71.0 dev flannel.1 onlink
172.17.0.0/18 dev ens192 proto kernel scope link src 172.17.31.118
192.168.1.0/24 dev docker0 proto kernel scope link src 192.168.1.5 linkdown
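For reference, the iptables logging mentioned above was done with plain LOG rules in the mangle table. The following is only a minimal sketch of that idea, not the exact rules we used; the source address 172.17.27.1 is the external service host from the example above:
## Log the reply packet right after it enters the node (before the routing decision) ...
iptables -t mangle -I PREROUTING 1 -p tcp -s 172.17.27.1 -j LOG --log-prefix "mangle-prerouting: "
## ... and again after the routing decision, for both possible paths.
iptables -t mangle -I FORWARD 1 -p tcp -s 172.17.27.1 -j LOG --log-prefix "mangle-forward: "
iptables -t mangle -I INPUT 1 -p tcp -s 172.17.27.1 -j LOG --log-prefix "mangle-input: "
## Watch the kernel log; on the unhealthy node only the PREROUTING entries show up.
tail -f /var/log/messages | grep "mangle-"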
While I cannot find anything wrong with the routes above, it gets even more confusing when compared to a second cluster that was provisioned with the same Ansible scripts. Output of the healthy cluster:
- "ip route get":
## Check route from container to service host outside of cluster
ip route get to 172.17.27.1 from 10.233.66.2 iif cni0
## Works:
# 172.17.27.1 from 10.233.66.2 dev eth0
# cache iif cni0
## Check route from service host outside of cluster back to container
ip route get to 10.233.66.2 from 172.17.27.1 iif eth0
## Worked! But why, when using the same rules as the unhealthy cluster above? Please see below:
# 10.233.66.2 from 172.17.27.1 dev cni0
# cache iif eth0
- "ip route list":
default via 172.17.0.2 dev eth0 proto dhcp metric 100
10.233.64.0/24 via 10.233.64.0 dev flannel.1 onlink
10.233.65.0/24 via 10.233.65.0 dev flannel.1 onlink
10.233.66.0/24 dev cni0 proto kernel scope link src 10.233.66.1 # this is the local container network
10.233.67.0/24 via 10.233.67.0 dev flannel.1 onlink
172.17.0.0/18 dev eth0 proto kernel scope link src 172.17.43.231 metric 100
192.168.1.0/24 dev docker0 proto kernel scope link src 192.168.1.5 linkdown
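One straightforward way to compare the two clusters is to dump and diff the routing state of one node from each, and to repeat the failing reverse lookup on both; a minimal sketch, assuming SSH access (the host names unhealthy-node and healthy-node are placeholders):
## Dump and diff the routing tables of one node per cluster (host names are placeholders).
ssh unhealthy-node 'ip route list' > /tmp/routes-unhealthy.txt
ssh healthy-node 'ip route list' > /tmp/routes-healthy.txt
diff -u /tmp/routes-unhealthy.txt /tmp/routes-healthy.txt
## Repeat the failing reverse check on both nodes (pod IP and incoming interface differ per node).
ssh unhealthy-node 'ip route get to 10.233.70.32 from 172.17.27.1 iif ens192'
ssh healthy-node 'ip route get to 10.233.66.2 from 172.17.27.1 iif eth0'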
Any ideas or hints?
Thank you very much!
Answer 1
In the end we figured out what caused this strange behavior: it turned out that "systemd-networkd" was installed on the unhealthy cluster in addition to NetworkManager.
In this setup "systemd-networkd" was only briefly active during boot. Apparently that was enough to leave the network stack in a subtly broken state.
Disabling "systemd-networkd" and re-rolling out Kubernetes on those machines fixed the problem.
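For anyone running into the same symptom, the fix on each affected node amounts to something like the following; a minimal sketch, assuming the stock CentOS 7 systemd units and that NetworkManager is the network manager that should stay in charge (the Kubernetes re-rollout itself is environment specific, in our case the existing Ansible scripts):
## Check whether systemd-networkd is installed and was active during boot.
systemctl status systemd-networkd
## Stop and disable it, and mask it so nothing pulls it in again on the next boot.
systemctl stop systemd-networkd
systemctl disable systemd-networkd
systemctl mask systemd-networkd
## Make sure NetworkManager stays the only service managing the interfaces, then reboot
## the node and re-run the cluster rollout (in our case the existing Ansible scripts).
systemctl enable NetworkManager
systemctl restart NetworkManager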