I have two machines, edge1 (10.22.46.11) and edge2 (10.22.46.48), both Kubernetes worker nodes. On edge1, I try to access a service whose endpoint is on edge2. I send a request like this:
curl -m 5 host-edge-nginx
curl: (28) Connection timed out after 5001 milliseconds
bash-5.1# nslookup host-edge-nginx
Server: 169.254.25.10
Address: 169.254.25.10#53
Name: host-edge-nginx.fabedge-e2e-test.svc.cluster.local
Address: 10.233.52.186
The IP of the service host-edge-nginx-793 is 10.233.52.186, and that IP is assigned to kube-ipvs0 on edge1. As you can see, the request times out. tcpdump output:
[root@edge1 ~]# tcpdump -nn -i any port 80 or 30080
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
17:06:49.407148 IP 10.22.46.11.57104 > 10.22.46.48.30080: Flags [S], seq 1003993294, win 43690, options [mss 65495,sackOK,TS val 84142687 ecr 0,nop,wscale 7], length 0
17:06:49.407481 IP 10.22.46.48.30080 > 10.22.46.11.57104: Flags [S.], seq 4067044182, ack 1003993295, win 28960, options [mss 1460,sackOK,TS val 197964760 ecr 84142687,nop,wscale 7], length 0
17:06:50.407345 IP 10.22.46.11.57104 > 10.22.46.48.30080: Flags [S], seq 1003993294, win 43690, options [mss 65495,sackOK,TS val 84143688 ecr 0,nop,wscale 7], length 0
17:06:50.407688 IP 10.22.46.48.30080 > 10.22.46.11.57104: Flags [S.], seq 4067044182, ack 1003993295, win 28960, options [mss 1460,sackOK,TS val 197965760 ecr 84142687,nop,wscale 7], length 0
17:06:51.408815 IP 10.22.46.48.30080 > 10.22.46.11.57104: Flags [S.], seq 4067044182, ack 1003993295, win 28960, options [mss 1460,sackOK,TS val 197966762 ecr 84142687,nop,wscale 7], length 0
17:06:52.411309 IP 10.22.46.11.57104 > 10.22.46.48.30080: Flags [S], seq 1003993294, win 43690, options [mss 65495,sackOK,TS val 84145692 ecr 0,nop,wscale 7], length 0
17:06:52.411652 IP 10.22.46.48.30080 > 10.22.46.11.57104: Flags [S.], seq 4067044182, ack 1003993295, win 28960, options [mss 1460,sackOK,TS val 197967764 ecr 84142687,nop,wscale 7], length 0
17:06:54.808781 IP 10.22.46.48.30080 > 10.22.46.11.57104: Flags [S.], seq 4067044182, ack 1003993295, win 28960, options [mss 1460,sackOK,TS val 197970162 ecr 84142687,nop,wscale 7], length 0
17:06:58.808756 IP 10.22.46.48.30080 > 10.22.46.11.57104: Flags [S.], seq 4067044182, ack 1003993295, win 28960, options [mss 1460,sackOK,TS val 197974162 ecr 84142687,nop,wscale 7], length 0
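It looks like the connection cannot complete the handshake: edge2 answers every SYN with a SYN-ACK, but edge1 never sends the final ACK, and the client retransmits the SYN packet again and again.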
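To make the pattern in a capture like this easier to read, the packets can be counted per direction and flag combination. The following is a minimal sketch, assuming the default one-line `tcpdump -nn` text format; the function name `summarize_flags` is hypothetical and not part of any tool.

```shell
#!/usr/bin/env bash
# Count packets per (source > destination, TCP flags) from `tcpdump -nn` text on stdin.
# Assumes lines of the form: "<time> IP <src> > <dst>: Flags [..], ..."
summarize_flags() {
  awk '
    $2 == "IP" && $6 == "Flags" {
      dst = $5; sub(/:$/, "", dst)       # strip the trailing colon from the destination
      flags = $7; sub(/,$/, "", flags)   # strip the trailing comma from the flags field
      count[$3 " > " dst " " flags]++
    }
    END { for (k in count) print count[k], k }
  ' | sort -k2
}

# Possible usage on the node (stops after 20 packets so the summary is printed):
#   tcpdump -nn -c 20 -i any port 80 or 30080 | summarize_flags
```

Fed the capture above, this reports 3 `[S]` packets from 10.22.46.11 and 6 `[S.]` packets from 10.22.46.48, with no plain `[.]` ACK in either direction.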
I checked with ss and netstat but found nothing. Then I used conntrack and found this:
[root@edge1 ~]# conntrack -L | grep 30080
conntrack v1.4.4 (conntrack-tools): 17 flow entries have been shown.
tcp 6 59 SYN_RECV src=10.233.52.186 dst=10.233.52.186 sport=52626 dport=80 src=10.22.46.48 dst=10.22.46.11 sport=30080 dport=1147 mark=0 use=1
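As you can see, the connection is stuck in the SYN_RECV state. My guess was that the client never received the SYN-ACK packet, but I also used iptables tracing: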
[84771.535104] TRACE: raw:PREROUTING:policy:2 IN=eth0 OUT= MAC=fa:16:3e:39:f6:2d:fa:16:3e:c8:8d:b2:08:00 SRC=10.22.46.48 DST=10.22.46.11 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=30080 DPT=38970 SEQ=3634038526 ACK=1413810229 WINDOW=28960 RES=0x00 ACK SYN URGP=0 OPT (020405B40402080A0BD1B6490508ECD001030307)
[84771.535121] TRACE: mangle:PREROUTING:policy:1 IN=eth0 OUT= MAC=fa:16:3e:39:f6:2d:fa:16:3e:c8:8d:b2:08:00 SRC=10.22.46.48 DST=10.22.46.11 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=30080 DPT=38970 SEQ=3634038526 ACK=1413810229 WINDOW=28960 RES=0x00 ACK SYN URGP=0 OPT (020405B40402080A0BD1B6490508ECD001030307)
[84771.535148] TRACE: mangle:INPUT:policy:1 IN=eth0 OUT= MAC=fa:16:3e:39:f6:2d:fa:16:3e:c8:8d:b2:08:00 SRC=10.22.46.48 DST=10.233.52.186 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=30080 DPT=52628 SEQ=3634038526 ACK=1413810229 WINDOW=28960 RES=0x00 ACK SYN URGP=0 OPT (020405B40402080A0BD1B6490508ECD001030307)
[84771.535160] TRACE: filter:INPUT:rule:1 IN=eth0 OUT= MAC=fa:16:3e:39:f6:2d:fa:16:3e:c8:8d:b2:08:00 SRC=10.22.46.48 DST=10.233.52.186 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=30080 DPT=52628 SEQ=3634038526 ACK=1413810229 WINDOW=28960 RES=0x00 ACK SYN URGP=0 OPT (020405B40402080A0BD1B6490508ECD001030307)
[84771.535175] TRACE: filter:KUBE-NODE-PORT:return:2 IN=eth0 OUT= MAC=fa:16:3e:39:f6:2d:fa:16:3e:c8:8d:b2:08:00 SRC=10.22.46.48 DST=10.233.52.186 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=30080 DPT=52628 SEQ=3634038526 ACK=1413810229 WINDOW=28960 RES=0x00 ACK SYN URGP=0 OPT (020405B40402080A0BD1B6490508ECD001030307)
[84771.535186] TRACE: filter:INPUT:policy:2 IN=eth0 OUT= MAC=fa:16:3e:39:f6:2d:fa:16:3e:c8:8d:b2:08:00 SRC=10.22.46.48 DST=10.233.52.186 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=30080 DPT=52628 SEQ=3634038526 ACK=1413810229 WINDOW=28960 RES=0x00 ACK SYN URGP=0 OPT (020405B40402080A0BD1B6490508ECD001030307)
As you can see, the SYN-ACK packet went through the filter INPUT chain and reached its policy, which defaults to ACCEPT:
[root@edge1 ~]# iptables -t filter -S | grep INPUT
-P INPUT ACCEPT
-A INPUT -m comment --comment "kubernetes health check rules" -j KUBE-NODE-PORT
So I think this means the client did receive the SYN-ACK packet.
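The trace lines are long, so it can help to reduce each one to the hop (table:chain:type:rulenum) and the destination it carried at that point. This is a rough sketch, assuming the kernel log format shown above and that tracing was enabled with a TRACE rule in the raw table; the function name `trace_path` is hypothetical.

```shell
#!/usr/bin/env bash
# Reduce iptables TRACE log lines on stdin to "hop DST=... DPT=...".
# Assumption: tracing was enabled with something like
#   iptables -t raw -A PREROUTING -p tcp --sport 30080 -j TRACE
trace_path() {
  awk '
    {
      hop = ""; dst = ""; dpt = ""
      for (i = 1; i <= NF; i++) {
        if ($i == "TRACE:")     hop = $(i + 1)   # table:chain:type:rulenum
        else if ($i ~ /^DST=/)  dst = $i         # destination address at this hop
        else if ($i ~ /^DPT=/)  dpt = $i         # destination port at this hop
      }
      if (hop != "") print hop, dst, dpt
    }
  '
}

# Possible usage (where the TRACE lines end up depends on the logging backend):
#   dmesg | grep 'TRACE:' | trace_path
```

On the log above, this makes it visible that the destination changes from 10.22.46.11 to 10.233.52.186 between mangle:PREROUTING and mangle:INPUT. Note that reaching the filter INPUT policy only shows that these chains did not drop the packet; it does not by itself prove that the socket received it.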
I'm stuck here with no more clues. Any help is welcome, and thanks in advance.
Answer 1
It turned out that the SYN-ACK packet happened to match an xfrm policy and was dropped by xfrm.
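For anyone who wants to check for the same cause, one way is to list the xfrm policies and look at the kernel's xfrm error counters. The sketch below assumes the kernel exposes /proc/net/xfrm_stat (this needs CONFIG_XFRM_STATISTICS); the function name `nonzero_xfrm_counters` is hypothetical, and which counter would increase in this particular case is not stated in the answer.

```shell
#!/usr/bin/env bash
# Print xfrm counters that have a non-zero value.
# Reads /proc/net/xfrm_stat by default; another file in the same
# "<name> <value>" format can be passed as the first argument.
nonzero_xfrm_counters() {
  stat_file="${1:-/proc/net/xfrm_stat}"
  awk '$2 > 0 { print $1, $2 }' "$stat_file"
}

# Possible follow-up on the affected node (root required):
#   ip xfrm policy           # list policies; check whether a selector covers the node or service IPs
#   ip xfrm state            # list the security associations behind those policies
#   nonzero_xfrm_counters    # run before and after a failing curl and compare
```

If a counter grows each time the request is retried, that suggests the packet is being discarded in the xfrm layer rather than by iptables.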
Answer 2
If you use externalTrafficPolicy: Local together with proxy-mode=ipvs, you may run into this issue: https://github.com/kubernetes/kubernetes/issues/93456#issuecomment-733069629