我的问题

我的问题

我的问题

AF_NETLINK对内核的查询间歇性地需要很多秒才能得到答复,例如在以下strace跟踪中:

10:42:38.864353 socket(AF_NETLINK, SOCK_RAW|SOCK_CLOEXEC, NETLINK_ROUTE) = 3
10:42:38.864377 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [32768], 4) = 0
10:42:38.864399 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [1048576], 4) = 0
10:42:38.864418 setsockopt(3, SOL_NETLINK, NETLINK_EXT_ACK, [1], 4) = 0
10:42:38.864436 bind(3, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 12) = 0
10:42:38.864459 getsockname(3, {sa_family=AF_NETLINK, nl_pid=16296, nl_groups=00000000}, [12]) = 0
10:42:38.864491 sendto(3, {{len=40, type=RTM_GETLINK, flags=NLM_F_REQUEST|NLM_F_DUMP, seq=1588581759, pid=0}, {ifi_family=AF_UNSPEC, ifi_type=ARPHRD_NETROM, ifi_index=0, ifi_flags=0, ifi_change=0}, {{nla_len=8, nla_type=IFLA_EXT_MASK}, 1}}, 40, 0, NULL, 0) = 40
10:42:51.894848 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 2608

背景

我注意到,有时软件会在尝试解析 IP 地址时挂起。主要是浏览器,但也有新的ssh或任何其他需要 DNS 的东西。

使用 Wireshark,我能够检查挂起是否发生在 DNS 查询数据包发送到名称服务器之前,因此它本身并不是一个滞后的名称服务器。

跟踪一些相关进程表明,该进程时不时地会/etc/resolv.conf首先读取具有 IPV6 地址的数据:

# Generated by NetworkManager
search example.de otherexample.de
nameserver 192.168.178.1
nameserver 2a02:8070:c19e:b400:xxxx:xxxx:xxxx:xxxx
nameserver fd00::9a9b:cbff:xxxx:xxxx

然后读取/etc/gai.conf其中除了注释之外不包含任何内容,然后,显然,使用 AF_NETLINK 套接字获取接口列表。

大多数时候,sendto和 相应的recvmsg时间间隔只有几毫秒,但在某些情况下,这感觉就像永远挂起。

这让我意识到问题甚至不在于 DNS,事实上,ip a循环运行有时也会挂起几秒钟。所以我这样做的同时将每个 ip a and logging the output and thestrace` 跟踪到两个不同的文件。这表明问题大约每分钟发生一次,持续约 12-13 秒:

10:41:58.561713 sendto(3, {{len=40, type=RTM_GETLINK, flags=NLM_F_REQUEST|NLM_F_DUMP, seq=1588581719, pid=0}, {ifi_family=AF_UNSPEC, ifi_type=ARPHRD_NETROM, ifi_index=0, ifi_flags=0, ifi_change=0}, {{nla_len=8, nla_type=IFLA_EXT_MASK}, 1}}, 40, 0, NULL, 0) = 40
10:41:58.561943 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 2608

10:42:38.864491 sendto(3, {{len=40, type=RTM_GETLINK, flags=NLM_F_REQUEST|NLM_F_DUMP, seq=1588581759, pid=0}, {ifi_family=AF_UNSPEC, ifi_type=ARPHRD_NETROM, ifi_index=0, ifi_flags=0, ifi_change=0}, {{nla_len=8, nla_type=IFLA_EXT_MASK}, 1}}, 40, 0, NULL, 0) = 40
10:42:51.894848 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 2608

10:43:42.269435 sendto(3, {{len=40, type=RTM_GETLINK, flags=NLM_F_REQUEST|NLM_F_DUMP, seq=1588581823, pid=0}, {ifi_family=AF_UNSPEC, ifi_type=ARPHRD_NETROM, ifi_index=0, ifi_flags=0, ifi_change=0}, {{nla_len=8, nla_type=IFLA_EXT_MASK}, 1}}, 40, 0, NULL, 0) = 40
10:43:54.894689 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 2608

10:44:45.276410 sendto(3, {{len=40, type=RTM_GETLINK, flags=NLM_F_REQUEST|NLM_F_DUMP, seq=1588581886, pid=0}, {ifi_family=AF_UNSPEC, ifi_type=ARPHRD_NETROM, ifi_index=0, ifi_flags=0, ifi_change=0}, {{nla_len=8, nla_type=IFLA_EXT_MASK}, 1}}, 40, 0, NULL, 0) = 40
10:44:57.894722 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 2608

10:45:48.273509 sendto(3, {{len=40, type=RTM_GETLINK, flags=NLM_F_REQUEST|NLM_F_DUMP, seq=1588581949, pid=0}, {ifi_family=AF_UNSPEC, ifi_type=ARPHRD_NETROM, ifi_index=0, ifi_flags=0, ifi_change=0}, {{nla_len=8, nla_type=IFLA_EXT_MASK}, 1}}, 40, 0, NULL, 0) = 40
10:46:00.894574 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 2608

第一对是正常情况下发生的示例;其他对显示问题如何每分钟发生一次并持续约 12 秒。

在此期间,网络没有发生重大变化;这是ip a第一次暂停之前和之后的输出示例:

Mon May  4 10:42:38 CEST 2020
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether a8:5e:45:60:e4:be brd ff:ff:ff:ff:ff:ff
    inet 192.168.178.131/24 brd 192.168.178.255 scope global dynamic noprefixroute enp3s0
       valid_lft 83515sec preferred_lft 83515sec
    inet6 2a02:8070:c19e:b400:bec7:94b4:34f1:86b4/64 scope global dynamic noprefixroute 
       valid_lft 7078sec preferred_lft 3478sec
    inet6 fe80::d27:8efd:f696:3c47/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
3: wlp7s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether d0:ab:d5:0e:02:09 brd ff:ff:ff:ff:ff:ff
    inet 192.168.10.10/24 brd 192.168.10.255 scope global dynamic noprefixroute wlp7s0
       valid_lft 602858sec preferred_lft 602858sec
    inet6 fe80::c694:6683:6353:e9c9/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
4: wlxf4f26d08d54e: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether f4:f2:6d:08:d5:4e brd ff:ff:ff:ff:ff:ff
Mon May  4 10:42:52 CEST 2020
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether a8:5e:45:60:e4:be brd ff:ff:ff:ff:ff:ff
    inet 192.168.178.131/24 brd 192.168.178.255 scope global dynamic noprefixroute enp3s0
       valid_lft 83514sec preferred_lft 83514sec
    inet6 2a02:8070:c19e:b400:bec7:94b4:34f1:86b4/64 scope global dynamic noprefixroute 
       valid_lft 7077sec preferred_lft 3477sec
    inet6 fe80::d27:8efd:f696:3c47/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
3: wlp7s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether d0:ab:d5:0e:02:09 brd ff:ff:ff:ff:ff:ff
    inet 192.168.10.10/24 brd 192.168.10.255 scope global dynamic noprefixroute wlp7s0
       valid_lft 602857sec preferred_lft 602857sec
    inet6 fe80::c694:6683:6353:e9c9/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
4: wlxf4f26d08d54e: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether f4:f2:6d:08:d5:4e brd ff:ff:ff:ff:ff:ff

问题

什么可能导致内核延迟对AF_NETLINK/ RTM_GETLINKsocket 调用的回复几秒钟,每分钟一次?

据我所知,这些调用是由内核直接处理的,而不是由其他进程处理(我可以strace超时处理)。它是否正确?

如果是这样,什么会使内核一次又一次地阻止这些请求?怎么调试呢?

答案1

感谢 Eduardo Trapani 的评论,我能够解决这个问题。

当我移除提供wlxf4f26d08d54e上述ip a输出中的接口的 USB WIFI 适配器后,问题就消失了。

根据lsusb,该设备的确切规格是

Bus 001 Device 010: ID 0bda:8179 Realtek Semiconductor Corp. RTL8188EUS 802.11n Wireless Network Adapter

有趣的是,在 中根本没有关于该设备的条目/var/log/syslog,除了它在启动时被识别并且我拔掉了它;所以我不认为它有错误的连接或类似的情况。

根据https://linux-hardware.org/index.php?id=usb:0bda-8179,这个驱动程序从 3.12 版本开始就已经存在于内核中,并且几乎在任何地方都可以工作,所以我真的不知道我的具体问题是什么。

不过很高兴现在已经解决了。

相关内容