Linux CentOS 7.7: "ip route get" returns "No route to host"


I have been investigating the root cause of a tricky routing problem on a CentOS 7 cluster...

Behavior:

  • TCP packets from a Docker container reach their destination outside the cluster, but the response packets never make it back to the container waiting for the reply
  • iptables logging now strongly suggests that the "routing decision" (in iptables terms) is causing the problem. More precisely: the response packets are still present at stage "mangle PREROUTING" but are missing at stage "mangle FORWARD/INPUT" (see the LOG-rule sketch further below)
  • The result of "ip route get" is as follows:
## Check route from container to service host outside of cluster
ip route get to 172.17.27.1 from 10.233.70.32 iif cni0
## Works just fine as mentioned. Result:
# 172.17.27.1 from 10.233.70.32 dev ens192 
# cache iif cni0 

## Check route from service host outside of cluster back to container
ip route get to 10.233.70.32 from 172.17.27.1 iif ens192
## Does not work. Error Msg:
# RTNETLINK answers: No route to host
  • So I was quite sure there must be a wrong route configured somewhere in the routing table. The command "ip route list" gives:
default via 172.17.0.2 dev ens192 proto static 
10.233.64.0/24 via 10.233.64.0 dev flannel.1 onlink 
10.233.65.0/24 via 10.233.65.0 dev flannel.1 onlink 
10.233.66.0/24 via 10.233.66.0 dev flannel.1 onlink 
10.233.67.0/24 via 10.233.67.0 dev flannel.1 onlink 
10.233.68.0/24 via 10.233.68.0 dev flannel.1 onlink 
10.233.69.0/24 via 10.233.69.0 dev flannel.1 onlink 
10.233.70.0/24 dev cni0 proto kernel scope link src 10.233.70.1 # this is the local container network  
10.233.71.0/24 via 10.233.71.0 dev flannel.1 onlink 
172.17.0.0/18 dev ens192 proto kernel scope link src 172.17.31.118 
192.168.1.0/24 dev docker0 proto kernel scope link src 192.168.1.5 linkdown 
  • While I cannot find anything wrong in the routes above, it gets even more confusing when compared to a second cluster that was provisioned with the same Ansible scripts. Output of the healthy cluster:

    • "ip route get":

    ## Check route from container to service host outside of cluster
    ip route get to 172.17.27.1 from 10.233.66.2 iif cni0
    ## Works:
    # 172.17.27.1 from 10.233.66.2 dev eth0 
    # cache iif cni0 
    
    ## Check route from service host outside of cluster back to container
    ip route get to 10.233.66.2 from 172.17.27.1 iif eth0
    ## Works! But why, when using the same rules as the unhealthy cluster above? Please see below:
    # 10.233.66.2 from 172.17.27.1 dev cni0 
    # cache iif eth0 
    
    
    • "ip route list":

    default via 172.17.0.2 dev eth0 proto dhcp metric 100 
    10.233.64.0/24 via 10.233.64.0 dev flannel.1 onlink 
    10.233.65.0/24 via 10.233.65.0 dev flannel.1 onlink 
    10.233.66.0/24 dev cni0 proto kernel scope link src 10.233.66.1 # this is the local container network
    10.233.67.0/24 via 10.233.67.0 dev flannel.1 onlink 
    172.17.0.0/18 dev eth0 proto kernel scope link src 172.17.43.231 metric 100 
    192.168.1.0/24 dev docker0 proto kernel scope link src 192.168.1.5 linkdown
    

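For completeness, here is a minimal sketch of mangle-table LOG rules that can reproduce the observation from the second bullet above (the exact rules are not part of the question; the addresses and the ens192 interface are taken from the unhealthy cluster, and the log prefixes are only illustrative):

## Log the response packet at mangle PREROUTING (it still shows up here)
iptables -t mangle -I PREROUTING 1 -i ens192 -s 172.17.27.1 -d 10.233.70.32 -j LOG --log-prefix "mangle-PRE: "
## Log it again at mangle FORWARD and INPUT (it never shows up here)
iptables -t mangle -I FORWARD 1 -s 172.17.27.1 -d 10.233.70.32 -j LOG --log-prefix "mangle-FWD: "
iptables -t mangle -I INPUT 1 -s 172.17.27.1 -d 10.233.70.32 -j LOG --log-prefix "mangle-IN: "
## Watch the kernel log while reproducing the problem
journalctl -kf | grep "mangle-"
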
Any ideas or hints?

Thank you very much!

Answer 1

In the end we figured out what caused this strange behavior. It turned out that "systemd-networkd" was installed on the unhealthy cluster in addition to NetworkManager.

In this setup "systemd-networkd" was only briefly active during boot. Apparently that was enough to leave the network stack in a subtly broken state.

Disabling "systemd-networkd" and re-rolling out Kubernetes on these machines solved the problem.
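For anyone hitting the same symptom, a minimal sketch of the check and the disable step on an affected node (the re-rollout itself is whatever provisioning you normally use; in this setup, the existing Ansible scripts):

## Check whether systemd-networkd is installed and was active alongside NetworkManager
systemctl status systemd-networkd
## Stop and disable it so that NetworkManager alone manages the interfaces
systemctl stop systemd-networkd
systemctl disable systemd-networkd
## Then re-roll out Kubernetes on the node (here: re-run the Ansible scripts)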
