I'm experimenting with netfilter in Docker containers. I have three containers: one "router" and two "endpoints". Each endpoint is connected to the router via pipework, so there is an external (host) bridge for every endpoint<->router link. Something like this:
containerA (eth1) -- hostbridgeA -- (eth1) containerR
containerB (eth1) -- hostbridgeB -- (eth2) containerR
Then in the "router" container, containerR, I have a bridge, br0, configured like this:
bridge name     bridge id               STP enabled     interfaces
br0             8000.3a047f7a7006       no              eth1
                                                        eth2
I have set net.bridge.bridge-nf-call-iptables=0 on the host, because it was interfering with some of my other tests.
containerA has the IP 192.168.10.1/24 and containerB has 192.168.10.2/24.
Then I have a very simple ruleset to trace forwarded packets:
flush ruleset
table bridge filter {
chain forward {
type filter hook forward priority 0; policy accept;
meta nftrace set 1
}
}
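For reference, a minimal way to load this ruleset and watch the traces from inside containerR (the file path here is only an example):
# load the ruleset above, saved e.g. as /root/trace.nft, then watch trace events
nft -f /root/trace.nft
nft monitor trace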
With this, I find that only ARP packets are traced, not ICMP. In other words, if I run nft monitor while containerA is pinging containerB, I can see the ARP packets traced, but not the ICMP packets. This surprised me, because as I understand nftables' bridge filter chain types, the only time a packet would not pass through the forward hook is when it is delivered to the host itself (containerR in this case) via input.
Going by the Linux packet flow diagram, I would still expect the ICMP packets to take the forward path, just like ARP. I do see the packets if I trace prerouting and postrouting. So my question is: what is going on here? Is there a flowtable or some other short-circuit I'm not aware of? Is it specific to container networking and/or Docker? I could test with VMs instead of containers, but I'm curious whether anyone else is aware of this or has run into it.
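For context, these are the kinds of checks that can be run inside containerR to rule out the obvious suspects; a sketch, not part of the original setup:
# look for flowtables or rules in other families that might bypass the bridge forward chain
nft list ruleset
nft list flowtables
# check the bridge-netfilter sysctls as seen from this network namespace
sysctl net.bridge.bridge-nf-call-iptables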
Edit: I have since created a similar setup in VirtualBox with a set of Alpine VMs. There the ICMP packets do reach the forward chain, so something on the host or in Docker seems to be interfering with my expectations. I'll leave this question unanswered until I or someone else can determine the cause, in case that turns out to be useful to others.
Thanks!
Minimal reproducible example
For this I'm using Alpine Linux 3.19.1 in a VM, with the community repository enabled in /etc/apk/repositories:
# Prerequisites of host
apk add bridge bridge-utils iproute2 docker openrc
service docker start
# When using linux bridges instead of openvswitch, disable iptables on bridges
sysctl net.bridge.bridge-nf-call-iptables=0
# Pipework to let me avoid docker's IPAM
git clone https://github.com/jpetazzo/pipework.git
cp pipework/pipework /usr/local/bin/
# Create two containers each on their own network (bridge)
pipework brA $(docker run -itd --name hostA alpine:3.19) 192.168.10.1/24
pipework brB $(docker run -itd --name hostB alpine:3.19) 192.168.10.2/24
# Create bridge-filtering container then connect it to both of the other networks
R=$(docker run --cap-add NET_ADMIN -itd --name hostR alpine:3.19)
pipework brA -i eth1 $R 0/0
pipework brB -i eth2 $R 0/0
# Note: `hostR` doesn't have/need an IP address on the bridge for this example
# Add bridge tools and netfilter to the bridging container
docker exec hostR apk add bridge bridge-utils nftables
docker exec hostR brctl addbr br
docker exec hostR brctl addif br eth1 eth2
docker exec hostR ip link set dev br up
# hostA should be able to ping hostB
docker exec hostA ping -c 1 192.168.10.2
# 64 bytes from 192.168.10.2...
# Set nftables rules
docker exec hostR nft add table bridge filter
docker exec hostR nft add chain bridge filter forward '{type filter hook forward priority 0;}'
docker exec hostR nft add rule bridge filter forward meta nftrace set 1
# Now ping hostB from hostA while nft monitor is running...
docker exec hostA ping -c 4 192.168.10.2 & docker exec hostR nft monitor
# Ping will succeed, nft monitor will not show any echo-request/-response packets traced, only arps
# Example:
trace id abc bridge filter forward packet: iif "eth2" oif "eth1" ether saddr ... daddr ... arp operation request
trace id abc bridge filter forward rule meta nftrace set 1 (verdict continue)
trace id abc bridge filter forward verdict continue
trace id abc bridge filter forward policy accept
...
trace id def bridge filter forward packet: iif "eth1" oif "eth2" ether saddr ... daddr ... arp operation reply
trace id def bridge filter forward rule meta nftrace set 1 (verdict continue)
trace id def bridge filter forward verdict continue
trace id def bridge filter forward policy accept
# Add tracing in prerouting and the icmp packets are visible:
docker exec hostR nft add chain bridge filter prerouting '{type filter hook prerouting priority 0;}'
docker exec hostR nft add rule bridge filter prerouting meta nftrace set 1
# Run again
docker exec hostA ping -c 4 192.168.10.2 & docker exec hostR nft monitor
# Ping still works (obviously), but we can see its packets in prerouting, which then disappear from the forward chain, but ARP shows up in both.
# Example:
trace id abc bridge filter prerouting packet: iif "eth1" ether saddr ... daddr ... ... icmp type echo-request ...
trace id abc bridge filter prerouting rule meta nftrace set 1 (verdict continue)
trace id abc bridge filter prerouting verdict continue
trace id abc bridge filter prerouting policy accept
...
trace id def bridge filter prerouting packet: iif "eth2" ether saddr ... daddr ... ... icmp type echo-reply ...
trace id def bridge filter prerouting rule meta nftrace set 1 (verdict continue)
trace id def bridge filter prerouting verdict continue
trace id def bridge filter prerouting policy accept
...
trace id 123 bridge filter prerouting packet: iif "eth1" ether saddr ... daddr ... ... arp operation request
trace id 123 bridge filter prerouting rule meta nftrace set 1 (verdict continue)
trace id 123 bridge filter prerouting verdict continue
trace id 123 bridge filter prerouting policy accept
trace id 123 bridge filter forward packet: iif "eth1" oif "eth2" ether saddr ... daddr ... arp operation request
trace id 123 bridge filter forward rule meta nftrace set 1 (verdict continue)
trace id 123 bridge filter forward verdict continue
trace id 123 bridge filter forward policy accept
...
trace id 456 bridge filter prerouting packet: iif "eth2" ether saddr ... daddr ... ... arp operation reply
trace id 456 bridge filter prerouting rule meta nftrace set 1 (verdict continue)
trace id 456 bridge filter prerouting verdict continue
trace id 456 bridge filter prerouting policy accept
trace id 456 bridge filter forward packet: iif "eth2" oif "eth1" ether saddr ... daddr ... arp operation reply
trace id 456 bridge filter forward rule meta nftrace set 1 (verdict continue)
trace id 456 bridge filter forward verdict continue
trace id 456 bridge filter forward policy accept
# Note the trace id matching across prerouting and forward chains
I also tried this with openvswitch, but for simplicity I've shown the Linux bridge example here; it produces the same results anyway. The only real difference with openvswitch is that net.bridge.bridge-nf-call-iptables=0 isn't needed, IIRC.
Answer 1
Introduction and simplified reproducer setup
Docker loads the br_netfilter module. Once loaded, it affects all existing as well as future network namespaces. This is for historical and compatibility reasons, as described in a previous answer of mine.
So when this is done on the host:
service docker start
# When using linux bridges instead of openvswitch, disable iptables on bridges
sysctl net.bridge.bridge-nf-call-iptables=0
it only affects the host network namespace. A network namespace created later, such as hostR's, will still get:
# docker exec hostR sysctl net.bridge.bridge-nf-call-iptables
net.bridge.bridge-nf-call-iptables = 1
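A quick way to see the discrepancy from the host (a sketch):
# the module was loaded by Docker, and the host namespace has the sysctl at 0 ...
lsmod | grep br_netfilter
sysctl net.bridge.bridge-nf-call-iptables
# ... while the container's network namespace still has it enabled
docker exec hostR sysctl net.bridge.bridge-nf-call-iptables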
Here is a much simpler reproducer of the issue than the OP's. It needs neither Docker nor a VM: it can run on the current Linux host, requires only the iproute2 package, and creates a bridge inside the affected named network namespace hostR:
#!/bin/sh
modprobe br_netfilter # as Docker would have done
sysctl net.bridge.bridge-nf-call-iptables=0 # actually it won't matter: netns hostR will still get 1 when created
ip netns add hostA
ip netns add hostB
ip netns add hostR
ip -n hostR link add name br address 02:00:00:00:01:00 up type bridge
ip -n hostR link add name eth1 up master br type veth peer netns hostA name eth1
ip -n hostR link add name eth2 up master br type veth peer netns hostB name eth1
ip -n hostA addr add dev eth1 192.168.10.1/24
ip -n hostA link set eth1 up
ip -n hostB addr add dev eth1 192.168.10.2/24
ip -n hostB link set eth1 up
ip netns exec hostR nft -f - <<'EOF'
table bridge filter # for idempotence
delete table bridge filter # for idempotence
table bridge filter {
chain forward {
type filter hook forward priority 0;
meta nftrace set 1
}
}
EOF
Note that br_netfilter still has its default settings inside the hostR network namespace:
# ip netns exec hostR sysctl net.bridge.bridge-nf-call-iptables
net.bridge.bridge-nf-call-iptables = 1
On one side, run:
ip netns exec hostR nft monitor trace
and on another:
ip netns exec hostA ping -c 4 192.168.10.2
This will trigger the issue: no IPv4 is seen, only ARP (with the usual lazy ARP refresh, typically a few seconds into the ping). It always triggers on kernels 6.6.x and earlier; on kernels 6.7.x and later it may or may not trigger (see below).
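To confirm that the ICMP traffic really does cross the bridge while staying invisible to the forward chain, a capture on one of the bridge ports can be used (a sketch, assuming tcpdump is installed on the host):
# echo-request/echo-reply frames from hostA enter hostR's bridge via eth1 and show up here
ip netns exec hostR tcpdump -ni eth1 icmp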
Effects of br_netfilter
This module creates an interaction between the bridge path and the Netfilter hooks for IPv4, which are normally used for the routing path but are now also invoked from the bridge path. These IPv4 hooks are both iptables' and nftables' hooks in the ip family (the same also happens for ARP and IPv6; IPv6 isn't used here and won't be discussed further).
This means frames now reach the Netfilter hooks as described in ebtables/iptables interaction on a Linux-based bridge, section 5: Chain traversal for bridged IP packets:
They should reach bridge filter forward (blue) first, then ip filter forward (green)...
...but not when the original hook priorities are changed, which in turn changes the order of the boxes above. The standard hook priorities of the bridge family are described in nft(8):
Table 7. Standard priority names and hook compatibility for the bridge family
Name     Value   Hooks
dstnat   -300    prerouting
filter   -200    all
out      100     output
srcnat   300     postrouting
So the schematic above expects filter forward to be hooked at priority -200, not 0. With 0, all bets are off.
Indeed, when the running kernel is compiled with the option CONFIG_NETFILTER_NETLINK_HOOK, nft list hooks can be used to query all hooks in use in the current namespace, including br_netfilter's. On kernels 6.6.x and earlier:
# ip netns exec hostR nft list hooks
family ip {
hook prerouting {
-2147483648 ip_sabotage_in [br_netfilter]
}
hook postrouting {
-0000000225 apparmor_ip_postroute
}
}
family ip6 {
hook prerouting {
-2147483648 ip_sabotage_in [br_netfilter]
}
hook postrouting {
-0000000225 apparmor_ip_postroute
}
}
family bridge {
hook prerouting {
0000000000 br_nf_pre_routing [br_netfilter]
}
hook input {
+2147483647 br_nf_local_in [br_netfilter]
}
hook forward {
-0000000001 br_nf_forward_ip [br_netfilter]
0000000000 chain bridge filter forward [nf_tables]
0000000000 br_nf_forward_arp [br_netfilter]
}
hook postrouting {
+2147483647 br_nf_post_routing [br_netfilter]
}
}
We can see that the br_netfilter kernel module (which was not deactivated in this network namespace) hooks at priority -1 for IPv4 and again at 0 for ARP: the expected hook order is not met, and this disrupts the OP's bridge filter forward chain registered at priority 0.
On kernels 6.7.x and later, since this commit, the default order observed after running the reproducer changes:
# ip netns exec hostR nft list hooks
[...]
family bridge {
hook prerouting {
0000000000 br_nf_pre_routing [br_netfilter]
}
hook input {
+2147483647 br_nf_local_in [br_netfilter]
}
hook forward {
0000000000 chain bridge filter forward [nf_tables]
0000000000 br_nf_forward [br_netfilter]
}
hook postrouting {
+2147483647 br_nf_post_routing [br_netfilter]
}
}
After a simplification, br_netfilter now hooks only once at priority 0 to handle forward, but what matters is that it now comes after bridge filter forward: the expected order, which does not cause the OP's issue.
Since two hooks registered at the same priority are considered undefined behavior, this is a fragile setup: the issue can still be triggered from here (at least on kernel 6.7.x) simply by running:
rmmod br_netfilter
modprobe br_netfilter
which now changes the order:
[...]
hook forward {
0000000000 br_nf_forward [br_netfilter]
0000000000 chain bridge filter forward [nf_tables]
}
[...]
and triggers the issue again, because br_netfilter is once more ordered before bridge filter forward.
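A quick way to check only the relative order within the bridge forward hook, instead of reading the whole dump (a sketch):
# whichever of the matching lines is printed first is hooked first
ip netns exec hostR nft list hooks | grep -E 'br_nf_forward|bridge filter forward'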
How to avoid this
To address this in a network namespace (or container), choose one of the following options:
- Don't have br_netfilter loaded on the host at all:
  rmmod br_netfilter
- Or disable br_netfilter's effects in the additional network namespace. As explained above, every new network namespace gets the feature enabled again at creation time, so it must be disabled where it matters: in the hostR network namespace:
  ip netns exec hostR sysctl net.bridge.bridge-nf-call-iptables=0
  Once done, all of br_netfilter's hooks disappear from hostR and can no longer cause any interference when the unexpected order happens.
  There is one caveat: this does not work with Docker alone:
  # docker exec hostR sysctl net.bridge.bridge-nf-call-iptables=0
  sysctl: error setting key 'net.bridge.bridge-nf-call-iptables': Read-only file system
  # docker exec --privileged hostR sysctl net.bridge.bridge-nf-call-iptables=0
  sysctl: error setting key 'net.bridge.bridge-nf-call-iptables': Read-only file system
  because Docker protects some settings from being tampered with from inside containers.
  Instead, the container's network namespace must be bind-mounted (with ip netns attach ...) so that it can be used with ip netns exec ... without having to enter the container's mount namespace:
  ip netns attach hostR $(docker inspect --format '{{.State.Pid}}' hostR)
  This now allows running the previous command and affecting the container:
  ip netns exec hostR sysctl net.bridge.bridge-nf-call-iptables=0
- Or use a priority for bridge filter forward that is guaranteed to run first. As shown in the table above, the default filter priority in the bridge family (forward included) is -200. So use -200, or at most the value -2, to always run before br_netfilter on any kernel version:
  ip netns exec hostR nft delete chain bridge filter forward
  ip netns exec hostR nft add chain bridge filter forward '{ type filter hook forward priority -200; }'
  ip netns exec hostR nft add rule bridge filter forward meta nftrace set 1
  Or similarly, when using Docker:
  docker exec hostR nft delete chain bridge filter forward
  docker exec hostR nft add chain bridge filter forward '{ type filter hook forward priority -200; }'
  docker exec hostR nft add rule bridge filter forward meta nftrace set 1
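After either workaround, re-running the earlier test should show the echo-request and echo-reply packets traced in the bridge forward chain alongside ARP (a quick check reusing the commands from the question):
# ICMP should now appear in bridge filter forward, not just ARP
docker exec hostA ping -c 4 192.168.10.2 & docker exec hostR nft monitor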
Tested on:
- (OP's) Alpine 3.19.1
- Debian 12.5 with:
  - stock Debian kernel 6.1.x
  - 6.6.x with CONFIG_NETFILTER_NETLINK_HOOK
  - 6.7.11 with CONFIG_NETFILTER_NETLINK_HOOK
Not tested with openvswitch bridges.
Final note: whenever possible, avoid using Docker or the br_netfilter kernel module when doing network experiments. As my reproducer shows, experimenting with ip netns alone is very easy when only networking is involved (it can become harder if daemons, such as OpenVPN, are needed for the experiment).
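For completeness, a possible teardown for the reproducer; deleting the namespaces also removes the veth pairs and the bridge created inside hostR:
# remove the three network namespaces created by the reproducer script
ip netns delete hostA
ip netns delete hostB
ip netns delete hostR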