Debugging unreliable IPv6 connectivity

On our VPS we are having IPv6 connectivity problems, and we hope someone can help us debug them.

Ping fails at first but succeeds a little later:

2020-06-01 23:20:55 <user>@<host>:~# ping -6 google.com
PING google.com(ams15s30-in-x0e.1e100.net (2a00:1450:400e:807::200e)) 56 data bytes
From <host>.com (<ip>) icmp_seq=1 Destination unreachable: Address unreachable
...
From <host>.com (<ip>) icmp_seq=6 Destination unreachable: Address unreachable
64 bytes from ams15s30-in-x0e.1e100.net (2a00:1450:400e:807::200e): icmp_seq=7 ttl=54 time=14.0 ms
...
64 bytes from ams15s30-in-x0e.1e100.net (2a00:1450:400e:807::200e): icmp_seq=13 ttl=54 time=12.1 ms
--- google.com ping statistics ---
13 packets transmitted, 7 received, +6 errors, 46% packet loss, time 12174ms
rtt min/avg/max/mdev = 12.151/12.683/14.069/0.767 ms

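As you can see, DNS resolution succeeds immediately, so that is not the problem. The first pings return an error message, and they succeed from the 7th one onward. How long it takes until the first ping succeeds varies.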

curl immediately falls back to IPv4:

2020-06-01 23:21:16 <user>@<host>:~# curl -vIL google.com
* Rebuilt URL to: google.com/
*   Trying 2a00:1450:400e:807::200e...
* TCP_NODELAY set
*   Trying 172.217.17.142...
* TCP_NODELAY set
* Connected to google.com (172.217.17.142) port 80 (#0)
...

wget keeps trying to connect for longer; it sometimes succeeds and sometimes fails, and it also falls back to IPv4:

2020-06-02 00:49:11 <user>@<host>:~# wget --spider google.com
Spider mode enabled. Check if remote file exists.
--2020-06-02 00:51:01--  http://google.com/
Resolving google.com (google.com)... 2a00:1450:400e:807::200e, 172.217.17.142
Connecting to google.com (google.com)|2a00:1450:400e:807::200e|:80... failed: No route to host.
Connecting to google.com (google.com)|172.217.17.142|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://www.google.com/ [following]
Spider mode enabled. Check if remote file exists.
--2020-06-02 00:51:20--  http://www.google.com/
Resolving www.google.com (www.google.com)... 2a00:1450:400e:804::2004, 172.217.17.36
Connecting to www.google.com (www.google.com)|2a00:1450:400e:804::2004|:80... failed: No route to host.
Connecting to www.google.com (www.google.com)|172.217.17.36|:80... connected.
HTTP request sent, awaiting response... 200 OK

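By the way, this happens regardless of the host or IP address. There is a default route, and the interface has a link-local address and a global IPv6 address, assigned via DHCPv6: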

2020-06-02 00:58:25 <user>@<host>:~# ip -6 r
::1 dev lo proto kernel metric 256 pref medium
::/64 dev eth0 proto kernel metric 256 expires 2590394sec pref medium
<ipv6> dev eth0 proto kernel metric 256 pref medium
fe80::/64 dev eth0 proto kernel metric 256 pref medium
default via <gateway> dev eth0 proto ra metric 1024 expires 194sec pref medium

2020-06-02 00:58:56 <user>@<host>:~# ip -6 a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN qlen 1000
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
    inet6 <ipv6>/128 scope global
       valid_lft forever preferred_lft forever
    inet6 <LLA>/64 scope link
       valid_lft forever preferred_lft forever

IPv4 connections always succeed immediately.

rdisc6 output:

2020-06-02 13:10:36 <user>@<host>:~# rdisc6 eth0
Soliciting ff02::2 (ff02::2) on eth0...

Hop limit                 :    undefined (      0x00)
Stateful address conf.    :          Yes
Stateful other conf.      :           No
Mobile home agent         :           No
Router preference         :       medium
Neighbor discovery proxy  :           No
Router lifetime           :         1800 (0x00000708) seconds
Reachable time            :  unspecified (0x00000000)
Retransmit time           :  unspecified (0x00000000)
 Source link-layer address: <MAC>
 Prefix                   : ::/64
  On-link                 :          Yes
  Autonomous address conf.:           No
  Valid time              :      2592000 (0x00278d00) seconds
  Pref. time              :       604800 (0x00093a80) seconds
 from fe80::<ipv6>

traceroute6 (sometimes fails with 30 empty lines):

2020-06-02 13:14:18 <user>@<host>:~# traceroute6 google.com
traceroute to google.com (2a00:1450:400e:807::200e) from <ipv6>::142, port 33434, from port 54573, 30 hops max, 60 bytes packets
 1  * * <ipv6>::1 (<ipv6>::1)  2055.792 ms
 2  * 2a06:7f80::1 (2a06:7f80::1)  2055.700 ms  1.262 ms
 3  ipv6.decix-dusseldorf.core1.dus1.he.net (2001:7f8:9e::1b1b:0:1)  2058.316 ms  2.655 ms  2.810 ms
 4  100ge5-2.core1.ams1.he.net (2001:470:0:371::1)  4.658 ms  3.804 ms  3.865 ms
 5  de-cix.fra.google.com (2001:7f8::3b41:0:1)  4.731 ms  12.465 ms  9.900 ms
 6  2001:4860:0:11e1::e (2001:4860:0:11e1::e)  14.691 ms  10.691 ms  10.654 ms
 7  2001:4860:0:1::1c7f (2001:4860:0:1::1c7f)  12.320 ms  11.433 ms  11.476 ms
 8  2001:4860::c:4000:d9a9 (2001:4860::c:4000:d9a9)  15.681 ms  16.138 ms  14.906 ms
 9  ams15s30-in-x0e.1e100.net (2a00:1450:400e:807::200e)  15.327 ms  12.979 ms  12.162 ms

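ip monitor / ip mon route suggests that the default router is not reliably reachable, and that the default route is regularly deleted after it expires and is not always re-created shortly afterwards. Here is the output from a few hours: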

fe80::<ipv6_1> dev eth0 lladdr <mac_1> PROBE
fe80::<ipv6_1> dev eth0 lladdr <mac_1> REACHABLE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> PROBE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> REACHABLE
fe80::<ipv6_1> dev eth0 lladdr <mac_1> STALE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> STALE
default via fe80::<ipv6_2> dev eth0 proto ra metric 1024 pref medium
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router STALE
prefix ::/64dev eth0 onlink valid 2592000 preferred 604800
default via fe80::<ipv6_2> dev eth0 proto ra metric 1024 pref medium
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router PROBE
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router REACHABLE
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router STALE
prefix ::/64dev eth0 onlink valid 2592000 preferred 604800
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router PROBE
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_4> dev eth0 lladdr <mac_4> PROBE
fe80::<ipv6_4> dev eth0 lladdr <mac_4> REACHABLE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> PROBE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> REACHABLE
fe80::<ipv6_4> dev eth0 lladdr <mac_4> STALE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> STALE
Deleted default via fe80::<ipv6_2> dev eth0 proto ra metric 1024 expires -4sec pref medium
default via fe80::<ipv6_2> dev eth0 proto ra metric 1024 pref medium
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router STALE
prefix ::/64dev eth0 onlink valid 2592000 preferred 604800
default via fe80::<ipv6_2> dev eth0 proto ra metric 1024 pref medium
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router PROBE
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router STALE
prefix ::/64dev eth0 onlink valid 2592000 preferred 604800
prefix ::/64dev eth0 onlink valid 2592000 preferred 604800
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router PROBE
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_3> dev eth0 lladdr <mac_3> PROBE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> REACHABLE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> STALE
Deleted default via fe80::<ipv6_2> dev eth0 proto ra metric 1024 expires -11sec pref medium
default via fe80::<ipv6_2> dev eth0 proto ra metric 1024 pref medium
default via fe80::<ipv6_2> dev eth0 proto ra metric 1024 pref medium
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router STALE
prefix ::/64dev eth0 onlink valid 2592000 preferred 604800
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router REACHABLE
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router STALE
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router PROBE
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_3> dev eth0 lladdr <mac_3> PROBE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> REACHABLE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> STALE
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router REACHABLE
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router STALE
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router REACHABLE
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router STALE
Deleted default via fe80::<ipv6_2> dev eth0 proto ra metric 1024 expires -3sec pref medium
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router PROBE
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_3> dev eth0 lladdr <mac_3> PROBE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> REACHABLE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> STALE
<ipv4_1> dev eth0 lladdr <mac_1> PROBE
<ipv4_1> dev eth0 lladdr <mac_1> REACHABLE
<ipv4_1> dev eth0 lladdr <mac_1> STALE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> PROBE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> REACHABLE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> STALE

Narrowing down the problem

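The following shows that the router does not always send router advertisements often enough, so the default gateway entry expires after 1800 seconds. Note the timestamp of the last PS1 prompt, printed when tcpdump was interrupted: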

2020-06-03 12:26:31 <user>@<host>:/var/log# tcpdump -n -i eth0 icmp6 and ip6[40] == 134
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
13:45:41.290680 IP6 fe80::XXX > ff02::1: ICMP6, router advertisement, length 56
14:11:10.133781 IP6 fe80::XXX > ff02::1: ICMP6, router advertisement, length 56
^C
2 packets captured
5 packets received by filter
0 packets dropped by kernel
2020-06-03 14:58:07 <user>@<host>:/var/log#

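The first two RAs were close enough together to keep the default route (although only about 4 minutes remained before expiry), but the third RA stayed away for too long, so the default route was lost and IPv6 connections were no longer possible.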
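The arithmetic behind that observation can be reproduced from the capture text itself. The following is only a sketch: the helper name ra_gaps is made up for illustration, and it assumes the saved tcpdump output uses the HH:MM:SS.fraction timestamp format shown above and that all timestamps fall on the same day.

```shell
#!/bin/sh
# Print the gap in whole seconds between consecutive router advertisements
# found in saved `tcpdump -n` text output (read from stdin).
# Assumption: all timestamps are from the same day.
ra_gaps() {
    awk '/router advertisement/ {
        split($1, t, "[:.]")                  # t[1]=HH, t[2]=MM, t[3]=SS
        now = t[1] * 3600 + t[2] * 60 + t[3]
        if (seen) print now - prev
        prev = now; seen = 1
    }'
}

ROUTER_LIFETIME=1800   # "Router lifetime" reported by rdisc6 in the question

# The two RA lines are copied verbatim from the capture above.
gap=$(ra_gaps <<'EOF'
13:45:41.290680 IP6 fe80::XXX > ff02::1: ICMP6, router advertisement, length 56
14:11:10.133781 IP6 fe80::XXX > ff02::1: ICMP6, router advertisement, length 56
EOF
)
echo "gap between the two RAs: ${gap}s"
echo "lifetime left when the second RA arrived: $((ROUTER_LIFETIME - gap))s"
```

For the two RAs captured here, the gap is 1529 seconds, which leaves 271 seconds of the 1800-second lifetime, in line with the "about 4 minutes" noted above.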

At the same time, I can see plenty of neighbor solicitations from the router, so its ICMPv6 requests do arrive:

2020-06-03 14:56:03 <user>@<host>:/var/log# tcpdump icmp6
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
15:03:07.750318 IP6 fe80::XXX > ff02::YYY: ICMP6, neighbor solicitation, who has 2a06:ZZZ, length 32
15:03:08.356100 IP6 fe80::XXX > ff02::YYY: ICMP6, neighbor solicitation, who has 2a06:ZZZ, length 32

But no RA arrives at the moment, even when I try to force one:

2020-06-03 15:03:21 <user>@<host>:/var/log# rdisc6 eth0
Soliciting ff02::2 (ff02::2) on eth0...
Timed out.
Timed out.
Timed out.
No response.

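This matches the ip monitor output above, where probing the router frequently fails. However, since I see ND packets from the router, I guess it could answer me, but for some reason it does not, or it ignores my ND packets?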

I can restore the default route manually and permanently with:

ip -6 r add default dev eth0 via fe80::<ipv6>

IPv6 connections then become possible again, but there are often still long delays or complete timeouts.

Answer 1

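Note 1: You are only using DHCPv6 to obtain the address; it is not used for the default route. That is still done through SLAAC (i.e. ICMPv6 "Router Advertisement" packets).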

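Note 2: ip monitor shows several different kinds of events mixed together: addresses, routes, and neighbor cache entries. You can run ip mon route or ip mon neigh to view them separately.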
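If the mixed output has already been saved, it can also be split after the fact. This is a rough sketch only: events_of_kind and sample_log are hypothetical helper names, and the classification goes purely by the shape of the lines seen in the question's log (it would not recognise routes other than the default route). The sample lines are copied from that log, placeholders included.

```shell
#!/bin/sh
# Print only the events of one kind ("route", "prefix" or "neigh") from saved
# `ip monitor` output read on stdin. Lines are classified by their shape only.
events_of_kind() {
    awk -v want="$1" '
        /^(Deleted )?default /                               { kind = "route" }
        /^prefix /                                           { kind = "prefix" }
        / (INCOMPLETE|REACHABLE|STALE|DELAY|PROBE|FAILED)$/  { kind = "neigh" }
        kind == want { print }
        { kind = "" }
    '
}

# A few lines taken verbatim from the log in the question.
sample_log() {
    cat <<'EOF'
default via fe80::<ipv6_2> dev eth0 proto ra metric 1024 pref medium
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router STALE
prefix ::/64dev eth0 onlink valid 2592000 preferred 604800
fe80::<ipv6_2> dev eth0  router FAILED
EOF
}

echo "neighbor events:"
sample_log | events_of_kind neigh
echo "route events:"
sample_log | events_of_kind route
```

For live monitoring, running ip mon route and ip mon neigh in separate terminals, as suggested in the note, avoids the need for any post-processing.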

My guess is that there is a problem between your VPS and the nearest gateway, because:

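  1. The neighbor entry for your default gateway (the IPv6 equivalent of an ARP cache entry) often fails to reach the REACHABLE state; it keeps dropping to FAILED, which means your host sent several ND solicitations (the equivalent of ARP queries) to refresh the cache entry but received no response.

    Neighbor discovery, just like ARP for IPv4, is a minimum requirement for an IPv6 network to work at all.

  2. Every time an SLAAC advertisement is received, the expiry time of the ::/0 default route is reset according to the "router lifetime". In your case the advertised lifetime is 1800 seconds, so the router should repeat the advertisement at least once every 900 seconds, so that the default route never drops below half of its lifetime.

    But as the ip -6 route output shows, your ::/0 route is only 194 seconds away from expiring. This means that either the router's timers are misconfigured, or the RAs it sends out are not reaching you for some reason, and so you keep losing the default route.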
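Both observations in the list above can be read off the saved output with a little awk. This is a sketch under assumptions: neigh_state_counts and default_route_expiry are hypothetical helper names, the input lines are copied from the question (placeholders included), and the "half of the lifetime" threshold is the rule of thumb stated in point 2, not something the kernel enforces.

```shell
#!/bin/sh
# Tally the neighbor states logged for one address in saved `ip monitor`
# output (stdin). The state is taken to be the last word of each line.
neigh_state_counts() {
    awk -v addr="$1" '$1 == addr { n[$NF]++ }
        END { for (s in n) print s, n[s] }' | sort
}

# Print the remaining lifetime in seconds of the default route, taken from
# the "expires NNNsec" field of `ip -6 route` output (stdin).
default_route_expiry() {
    awk '$1 == "default" {
        for (i = 1; i < NF; i++)
            if ($i == "expires") { sub(/sec$/, "", $(i + 1)); print $(i + 1) }
    }'
}

neigh_state_counts 'fe80::<ipv6_2>' <<'EOF'
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router PROBE
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0  router FAILED
fe80::<ipv6_2> dev eth0 lladdr <mac_2> router REACHABLE
fe80::<ipv6_3> dev eth0 lladdr <mac_3> STALE
EOF

ROUTER_LIFETIME=1800   # advertised "Router lifetime" from rdisc6
left=$(echo 'default via <gateway> dev eth0 proto ra metric 1024 expires 194sec pref medium' \
        | default_route_expiry)
if [ "$left" -lt $((ROUTER_LIFETIME / 2)) ]; then
    echo "default route has ${left}s left, below half of the ${ROUTER_LIFETIME}s lifetime"
fi
```

A router entry that collects many FAILED states relative to REACHABLE ones, together with a default route that regularly runs below half of its lifetime, would be consistent with the diagnosis above.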

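Both problems have one thing in common: ND and SLAAC both use ICMPv6 multicast, so check very carefully that your firewall does not apply strict rate limits to incoming router advertisements, neighbor advertisements, or multicast packets in general.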

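(You can use tcpdump to check whether the packets are being received; for example, if RAs show up in tcpdump but fail to refresh the default route, the firewall is probably at fault.)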
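To watch one Neighbor Discovery message type at a time, the ip6[40] filter used in the question can be generalised. A minimal sketch, assuming the packets carry no IPv6 extension headers (so byte 40 is the ICMPv6 type); nd_filter is a hypothetical helper name, and the type numbers are those assigned in RFC 4861.

```shell
#!/bin/sh
# Build a tcpdump filter expression for one Neighbor Discovery message type.
# ICMPv6 types per RFC 4861: 133 router solicitation, 134 router advertisement,
# 135 neighbor solicitation, 136 neighbor advertisement.
# Assumption: no extension headers, so the ICMPv6 type sits at byte offset 40.
nd_filter() {
    case "$1" in
        rs) type=133 ;;
        ra) type=134 ;;
        ns) type=135 ;;
        na) type=136 ;;
        *)  echo "unknown ND message: $1" >&2; return 1 ;;
    esac
    echo "icmp6 and ip6[40] == $type"
}

# Example (needs root and runs until interrupted), to check whether the
# router ever answers with neighbor advertisements:
#   tcpdump -n -i eth0 "$(nd_filter na)"
nd_filter ra
```

Quoting the expression, as in the commented example, keeps the shell from treating the square brackets as a glob pattern.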
