Unexpected ARP Probe and ARP Announcement on Windows 10

2024-6-1

In our system, three hosts are all connected to the same Ethernet switch, as shown below:

A (192.168.0.21, WIN10_1809) <-> Switch <-> B (192.168.0.100, Debian Linux 9)
                                  ^
                                  |
                       C (192.168.0.201, WIN10_1809)

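Any two of these hosts communicate with each other periodically, both through lower-layer ping operations and through upper-layer application messages (over TCP or UDP).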

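Occasionally (for example, once every one or two days), hosts B and C find that host A cannot be reached by ping for about 7 seconds, while host A can still ping hosts B and C. During the same window, upper-layer TCP and UDP communication involving host A also fails, whereas communication between hosts B and C remains completely normal.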

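The problem occurs on multiple systems in our company. Neither the network hardware (we replaced the switch and the cables) nor the network traffic (the problem still occurs when the system is idle and bandwidth usage is below 1%) appears to contribute significantly to it.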

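Then, by inspecting the network traffic in the system with Wireshark (captured through the Ethernet switch), we found that the ping requests were sent but no responses were received: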

No.     Time        Source          Destination     Protocol Length Info
1455    1.509228    192.168.0.100   192.168.0.21    ICMP    98  Echo (ping) request  id=0x6812, seq=1/256, ttl=64 (no response found!)
1848    2.250592    192.168.0.201   192.168.0.21    ICMP    66  Echo (ping) request  id=0x30f0, seq=30977/377, ttl=128 (no response found!)
2413    3.512684    192.168.0.100   192.168.0.21    ICMP    98  Echo (ping) request  id=0x6818, seq=1/256, ttl=64 (no response found!)
3269    5.516020    192.168.0.100   192.168.0.21    ICMP    98  Echo (ping) request  id=0x681c, seq=1/256, ttl=64 (no response found!)

Meanwhile, the ping requests sent by host A were answered, as shown below:

1130    1.130713    192.168.0.21    192.168.0.100   ICMP    60  Echo (ping) request  id=0x0008, seq=2313/2313, ttl=255 (reply in 1133)
1131    1.130713    192.168.0.21    192.168.0.201   ICMP    60  Echo (ping) request  id=0x0008, seq=2312/2057, ttl=255 (reply in 1132)
1795    2.131109    192.168.0.21    192.168.0.100   ICMP    60  Echo (ping) request  id=0x0008, seq=2314/2569, ttl=255 (reply in 1798)
1796    2.131110    192.168.0.21    192.168.0.201   ICMP    60  Echo (ping) request  id=0x0008, seq=2315/2825, ttl=255 (reply in 1797)
2249    3.131295    192.168.0.21    192.168.0.100   ICMP    60  Echo (ping) request  id=0x0008, seq=2316/3081, ttl=255 (reply in 2252)
2250    3.131296    192.168.0.21    192.168.0.201   ICMP    60  Echo (ping) request  id=0x0008, seq=2317/3337, ttl=255 (reply in 2251)

We also found that when the failure occurs, host A starts an ARP Probe and ARP Announcement sequence.

2838    1.501535    SuperMic_78:e0:f1   Broadcast   ARP 60  Who has 192.168.0.100? Tell 192.168.0.21
2841    1.501831    JUMPINDU_64:8b:23   SuperMic_78:e0:f1   ARP 60  192.168.0.100 is at 00:e0:4b:64:8b:23
2876    1.516569    SuperMic_78:e0:f1   Broadcast   ARP 60  Who has 192.168.0.201? Tell 192.168.0.21
2879    1.516654    SuperMic_8d:2f:67   SuperMic_78:e0:f1   ARP 60  192.168.0.201 is at ac:1f:6b:8d:2f:67
3234    1.817465    SuperMic_78:e0:f1   Broadcast   ARP 60  Who has 192.168.0.21? (ARP Probe)
4179    2.817637    SuperMic_78:e0:f1   Broadcast   ARP 60  Who has 192.168.0.21? (ARP Probe)
5043    3.817780    SuperMic_78:e0:f1   Broadcast   ARP 60  Who has 192.168.0.21? (ARP Probe)
5897    4.817833    SuperMic_78:e0:f1   Broadcast   ARP 60  ARP Announcement for 192.168.0.21

Here, SuperMic_78:e0:f1 is host A, JUMPINDU_64:8b:23 is host B, and SuperMic_8d:2f:67 is host C.
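For readers who want to pick these frames out of their own captures, here is a minimal Python sketch of how the labels Wireshark shows ("ARP Probe", "ARP Announcement") follow from the ARP fields as described in RFC 5227: a probe is an ARP request with sender IP 0.0.0.0, and an announcement is an ARP request whose sender and target IP are both the host's own address. The function names `classify_arp` and `build_arp` are hypothetical, and the MAC address in the example is a placeholder, not host A's real address.

```python
import ipaddress
import struct

ARP_REQUEST = 1
ARP_REPLY = 2


def build_arp(oper, sender_mac, sender_ip, target_ip, target_mac=bytes(6)):
    """Build a 28-byte Ethernet/IPv4 ARP payload (hypothetical helper for the demo)."""
    header = struct.pack("!HHBBH", 1, 0x0800, 6, 4, oper)
    return (header
            + sender_mac + ipaddress.IPv4Address(sender_ip).packed
            + target_mac + ipaddress.IPv4Address(target_ip).packed)


def classify_arp(payload):
    """Classify an Ethernet/IPv4 ARP payload using the RFC 5227 terminology."""
    if len(payload) < 28:
        raise ValueError("an Ethernet/IPv4 ARP payload is 28 bytes long")
    htype, ptype, hlen, plen, oper = struct.unpack("!HHBBH", payload[:8])
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        raise ValueError("not an Ethernet/IPv4 ARP packet")
    sender_ip = ipaddress.IPv4Address(payload[14:18])  # sender protocol address
    target_ip = ipaddress.IPv4Address(payload[24:28])  # target protocol address
    if oper == ARP_REPLY:
        return "reply"
    if oper != ARP_REQUEST:
        return "other"
    if sender_ip == ipaddress.IPv4Address("0.0.0.0"):
        return "probe"          # "Who has X?" with an all-zero sender IP
    if sender_ip == target_ip:
        return "announcement"   # request with sender IP == target IP
    return "request"            # ordinary address resolution


# Placeholder MAC (locally administered), NOT the real address of host A.
mac_a = bytes.fromhex("02000078e0f1")

print(classify_arp(build_arp(ARP_REQUEST, mac_a, "192.168.0.21", "192.168.0.100")))
print(classify_arp(build_arp(ARP_REQUEST, mac_a, "0.0.0.0", "192.168.0.21")))
print(classify_arp(build_arp(ARP_REQUEST, mac_a, "192.168.0.21", "192.168.0.21")))
```

The three calls mirror the three kinds of frames host A sends in the capture above: a normal request for host B, a probe for its own address, and the final announcement.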

According to RFC 5227:

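Before beginning to use an IPv4 address (whether received from manual configuration, DHCP, or some other means), a host implementing this specification MUST test to see if the address is already in use, by broadcasting ARP Probe packets. This also applies when a network interface transitions from an inactive to an active state, when a computer awakes from sleep, when a link-state change signals that an Ethernet cable has been connected, when an 802.11 wireless interface associates with a new base station, or when any other change in connectivity occurs where a host becomes actively connected to a logical link.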
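As a rough cross-check, the timestamps of the probe and announcement frames in the capture above can be compared with the timing constants that RFC 5227 defines (PROBE_NUM, PROBE_MIN, PROBE_MAX, ANNOUNCE_WAIT). The sketch below only does the arithmetic on the four timestamps copied from the capture; it makes no claim about how Windows implements the sequence.

```python
# RFC 5227 timing constants (seconds, except PROBE_NUM).
PROBE_NUM = 3
PROBE_MIN = 1.0
PROBE_MAX = 2.0
ANNOUNCE_WAIT = 2.0

# Timestamps copied from the ARP capture above (frames 3234, 4179, 5043, 5897).
probe_times = [1.817465, 2.817637, 3.817780]
announce_time = 4.817833


def gaps(times):
    """Return the intervals between consecutive timestamps."""
    return [round(later - earlier, 6) for earlier, later in zip(times, times[1:])]


probe_gaps = gaps(probe_times)
announce_gap = round(announce_time - probe_times[-1], 6)
window = round(announce_time - probe_times[0], 6)

print("number of probes:", len(probe_times), "(RFC PROBE_NUM =", PROBE_NUM, ")")
print("gaps between probes:", probe_gaps)
print("gap from last probe to announcement:", announce_gap)
print("first probe to announcement:", window)
print("probe gaps within [PROBE_MIN, PROBE_MAX]:",
      all(PROBE_MIN <= g <= PROBE_MAX for g in probe_gaps))
print("announcement gap shorter than ANNOUNCE_WAIT:", announce_gap < ANNOUNCE_WAIT)
```

In this capture the probes are spaced about one second apart, and the whole sequence from the first probe to the announcement takes about three seconds.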

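However, the Windows event log on host A shows no evidence of any of the events listed above. It contains only the three entries below, and we are not sure whether they are a cause or a consequence of the problem: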

ID   Source                   Description
7040 Service Control Manager  The start type of the windows modules installer service was changed from auto start to demand start
16   Kernel-General           The access history in hive \??\C:\ProgramData\Microsoft\Provisioning\Microsoft-Desktop-Provisioning-Sequence.dat was cleared updating 0 keys and creating 0 modified pages
7040 Service Control Manager  The start type of the windows modules installer service was changed from demand start to auto start

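We also checked the log files from the field and found no evidence of the problem there. Note that the field systems run WIN7 with an older version of the SW, whereas in-house we use WIN10 with the new SW.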

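We have been investigating for nearly two months and still have not found the root cause. Any advice or suggestions would be greatly appreciated. Also, if there is a better place to ask this kind of question, please let me know.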

Answer 1

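It turns out that the problem is caused by a scheduled task shipped with Windows 10 itself, located under Microsoft/Windows/Management/Provisioning/Logon. The first time it runs after the OS boots, it triggers a restart of the network stack (since version 1803 or 1809):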

\windows\system32\provtool.exe /turn 5 /source LogonIdleTask

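When we ran the task manually after the OS had booted, the problem could be reproduced. After we disabled the task, the problem no longer occurred on five systems, which we have now observed for nearly a week.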
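For anyone who wants to script this workaround across several machines, here is a minimal Python sketch that builds the corresponding `schtasks` command lines. It assumes the full task path is `\Microsoft\Windows\Management\Provisioning\Logon` (inferred from the location given above; check it in Task Scheduler on your own system), and the helper names `schtasks_args` and `run_schtasks` are hypothetical. Changing a task under `\Microsoft\Windows` normally requires an elevated prompt.

```python
import subprocess
from typing import List

# Assumed full task path, inferred from the location mentioned in this answer.
TASK_NAME = r"\Microsoft\Windows\Management\Provisioning\Logon"


def schtasks_args(action: str, task_name: str = TASK_NAME) -> List[str]:
    """Return the schtasks command line for 'query', 'disable' or 'enable'."""
    if action == "query":
        # Show the task's details, including its current status.
        return ["schtasks", "/Query", "/TN", task_name, "/V", "/FO", "LIST"]
    if action == "disable":
        return ["schtasks", "/Change", "/TN", task_name, "/Disable"]
    if action == "enable":
        return ["schtasks", "/Change", "/TN", task_name, "/Enable"]
    raise ValueError("unknown action: " + action)


def run_schtasks(action: str) -> None:
    """Run the command on a Windows host; raises CalledProcessError on failure."""
    subprocess.run(schtasks_args(action), check=True)


# Printing the command first makes it easy to review before running it.
print(" ".join(schtasks_args("disable")))
```

On a Windows machine, `run_schtasks("query")` followed by `run_schtasks("disable")` would apply the change, and `run_schtasks("enable")` would revert it.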

We got this far mainly thanks to a post on OSR. We still do not know what the task actually does or why it needs to restart the network stack.

P.S. I am leaving this here in case anyone runs into the same problem; I hope it helps.
