How to troubleshoot a flapping LACP interface that is triggered by a Layer 3 IP address (Juniper Networks)

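I am currently troubleshooting a network problem that is causing an outage across our whole organization (a simple topology diagram is linked below). Our network consists of a router/firewall [SRX340], two access switches [EX2300], and one top-of-rack (ToR) switch [EX2300].

(Sorry, I can't post the image directly in the question because I don't have enough reputation, but you can find it here: https://i.stack.imgur.com/ldX23.png)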

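  • All links between the switches are trunks carrying the VLANs (4, 10, 20, ...).

  • RSTP (Rapid Spanning Tree) is enabled globally on all devices in the topology. The root bridge is the ToR switch, and after convergence the blocked ports are ge-0/0/4 and ge-0/0/5 on the router/firewall.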

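The problem:

We keep running into trouble with the ae0 interface, which connects the router/firewall to our ToR switch. According to our logs, LACP starts flapping whenever we add a new device with a Layer 3 IP address on any access port of the ToR switch, but only for VLAN 10.

It is worth noting that if I unplug the offending device from the ToR switch, the problem clears sooner; otherwise it resolves by itself a few minutes after the event starts.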

root@TOR-SW01> show log messages
Apr  4 12:24:04  TOR-SW01 mib2d[17124]: SNMP_TRAP_LINK_DOWN: ifIndex 610, ifAdminStatus up(1), ifOperStatus down(2), ifName ae0
Apr  4 12:24:06  TOR-SW01 lacpd[17161]: LACP_INTF_MUX_STATE_CHANGED: ae0: ge-0/0/44: Lacp state changed from ATTACHED to DETACHED, actor port state : |-|DEF|-|-|OUT_OF_SYNC|AGG|SHORT|ACT|, partner port state : |-|DEF|-|-|OUT_OF_SYNC|AGG|SHORT|PAS|
Apr  4 12:24:06  TOR-SW01 dc-pfe[16887]: ETH: ifd (ge-0/0/44) unknown boolean option 112
Apr  4 12:24:06  TOR-SW01 dc-pfe[16887]: IFFPC: 'IFD Ether boolean set' (opcode 55) failed
Apr  4 12:24:06  TOR-SW01 dc-pfe[16887]:   ifd 705; Ether boolean set error (22)
Apr  4 12:24:06  TOR-SW01 fpc0 ETH: ifd (ge-0/0/44) unknown boolean option 112
Apr  4 12:24:06  TOR-SW01 fpc0 IFFPC: 'IFD Ether boolean set' (opcode 55) failed
Apr  4 12:24:06  TOR-SW01 lacpd[17161]: LACP_INTF_MUX_STATE_CHANGED: ae0: ge-0/0/45: Lacp state changed from ATTACHED to DETACHED, actor port state : |-|DEF|-|-|OUT_OF_SYNC|AGG|SHORT|ACT|, partner port state : |-|DEF|-|-|OUT_OF_SYNC|AGG|SHORT|PAS|
Apr  4 12:24:06  TOR-SW01 dc-pfe[16887]: ETH: ifd (ge-0/0/45) unknown boolean option 112
Apr  4 12:24:06  TOR-SW01 dc-pfe[16887]: IFFPC: 'IFD Ether boolean set' (opcode 55) failed
Apr  4 12:24:06  TOR-SW01 dc-pfe[16887]:   ifd 706; Ether boolean set error (22)
Apr  4 12:24:06  TOR-SW01 fpc0   ifd 705; Ether boolean set error (22)
Apr  4 12:24:06  TOR-SW01 fpc0 ETH: ifd (ge-0/0/45) unknown boolean option 112
Apr  4 12:24:07  TOR-SW01 lacpd[17161]: LACP_INTF_MUX_STATE_CHANGED: ae0: ge-0/0/46: Lacp state changed from ATTACHED to DETACHED, actor port state : |-|DEF|-|-|OUT_OF_SYNC|AGG|SHORT|ACT|, partner port state : |-|DEF|-|-|OUT_OF_SYNC|AGG|SHORT|PAS|
Apr  4 12:24:07  TOR-SW01 dc-pfe[16887]: ETH: ifd (ge-0/0/46) unknown boolean option 112
Apr  4 12:24:07  TOR-SW01 dc-pfe[16887]: IFFPC: 'IFD Ether boolean set' (opcode 55) failed
Apr  4 12:24:07  TOR-SW01 dc-pfe[16887]:   ifd 707; Ether boolean set error (22)
Apr  4 12:24:07  TOR-SW01 fpc0 IFFPC: 'IFD Ether boolean set' (opcode 55) failed
Apr  4 12:24:07  TOR-SW01 lacpd[17161]: LACP_INTF_MUX_STATE_CHANGED: ae0: ge-0/0/47: Lacp state changed from ATTACHED to DETACHED, actor port state : |-|DEF|-|-|OUT_OF_SYNC|AGG|SHORT|ACT|, partner port state : |-|DEF|-|-|OUT_OF_SYNC|AGG|SHORT|PAS|

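As noted above, the problem reproduces only when a device connected to any access port on the ToR switch gets a VLAN 10 IP address, whether through DHCP or by manual configuration. The result is a network-wide outage, because nobody can reach the router. The problem does not occur for any other VLAN. Also, if the device is connected to a port on either access switch and gets a VLAN 10 IP address, nothing is triggered and no problem occurs.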

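What have I done so far?

  • I tried manually removing (unplugging) both interfaces ge-0/0/4 and ge-0/0/5 on the router/firewall to rule out any network loop, but the problem persisted.

  • I tried updating all Juniper devices to the latest release as of this writing (22.4R1).

  • I also cleared the MAC address tables on all access switches and the ARP table on the router/firewall.

  • I tried rebooting our router/firewall and all the switches, but the problem persisted.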

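My questions:

  • Why does my ae0 link (LACP) go down when a Layer 3 device connected to any ToR switch access port (RSTP edge) gets an IP address?
  • How can I tell whether this is a Layer 3 problem or a Layer 2 problem?
  • Why does the problem occur only on the ToR switch?
  • What else can I do to help resolve this?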

Useful configuration:

Router/firewall configuration:

root@RT01> show configuration interfaces ae0
aggregated-ether-options {
    lacp {
        active;
    }
}
unit 0 {
    family ethernet-switching {
        interface-mode trunk;
        vlan {
            members all;
        }
    }
}
root@RT01> show configuration protocols rstp
bridge-priority 16k;
interface ge-0/0/4 {
    mode point-to-point;
    no-root-port;
}
interface ge-0/0/5 {
    mode point-to-point;
}
interface ae0 {
    mode point-to-point;
}

ToR switch configuration:

root@TOR-SW01> show configuration interfaces ae0
traceoptions {
    flag all;
}
aggregated-ether-options {
    lacp {
        active;
    }
}
unit 0 {
    family ethernet-switching {
        interface-mode trunk;
        vlan {
            members all;
        }
    }
}
root@TOR-SW01> show configuration protocols rstp
bridge-priority 8k;
interface xe-0/1/2 {
    mode point-to-point;
}
interface xe-0/1/3 {
    mode point-to-point;
}
interface ae0 {
    mode point-to-point;
}

Useful logs:

Logs from the ToR switch - normal behavior:

root@RT01> show lacp interfaces ae0
Aggregated interface: ae0
    LACP state:       Role   Exp   Def  Dist  Col  Syn  Aggr  Timeout  Activity
      ge-0/0/0       Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/0     Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/1       Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/1     Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/2       Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/2     Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/3       Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/3     Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
    LACP protocol:        Receive State  Transmit State          Mux State
      ge-0/0/0                  Current   Fast periodic Collecting distributing
      ge-0/0/1                  Current   Fast periodic Collecting distributing
      ge-0/0/2                  Current   Fast periodic Collecting distributing
      ge-0/0/3                  Current   Fast periodic Collecting distributing

Logs from the router/firewall - when the problem starts:

root@RT01> show log message
Apr  4 12:24:07  RT01 l2cpd[2018]: ROOT_PORT: for Instance 0 in  routing-instance default Interface ge-0/0/5.0
Apr  4 12:24:07  RT01 l2cpd[2018]: TOPO_CH: for Instance 0 in  routing-instance default generated on port ge-0/0/5.0
Apr  4 12:24:12  RT01 l2cpd[2018]: ROOT_PORT: for Instance 0 in  routing-instance default Interface ae0.0
Apr  4 12:24:12  RT01 l2cpd[2018]: TOPO_CH: for Instance 0 in  routing-instance default generated on port ae0.0
Apr  4 12:24:18  RT01 l2cpd[2018]: TOPO_CH: for Instance 0 in  routing-instance default received on port ae0.0
Apr  4 12:24:20  RT01 l2cpd[2018]: TOPO_CH: for Instance 0 in  routing-instance default received on port ae0.0
Apr  4 12:24:27  RT01 l2cpd[2018]: ROOT_PORT: for Instance 0 in  routing-instance default Interface ge-0/0/5.0

Logs from the ToR switch - when the problem starts:

root@TOR-SW01> show lacp interfaces ae0
Aggregated interface: ae0
    LACP state:       Role   Exp   Def  Dist  Col  Syn  Aggr  Timeout  Activity
      ge-0/0/44      Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/44    Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/45      Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/45    Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/46      Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/46    Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/47      Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/47    Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
    LACP protocol:        Receive State  Transmit State          Mux State
      ge-0/0/44                 Current   Fast periodic Collecting distributing
      ge-0/0/45                 Current   Fast periodic Collecting distributing
      ge-0/0/46                 Current   Fast periodic Collecting distributing
      ge-0/0/47                 Current   Fast periodic Collecting distributing


(...)

Aggregated interface: ae0
    LACP state:       Role   Exp   Def  Dist  Col  Syn  Aggr  Timeout  Activity
      ge-0/0/44      Actor   Yes    No    No   No  Yes   Yes     Fast    Active
      ge-0/0/44    Partner    No    No   Yes  Yes   No   Yes     Fast    Active
      ge-0/0/45      Actor   Yes    No    No   No  Yes   Yes     Fast    Active
      ge-0/0/45    Partner    No    No   Yes  Yes   No   Yes     Fast    Active
      ge-0/0/46      Actor   Yes    No    No   No  Yes   Yes     Fast    Active
      ge-0/0/46    Partner    No    No   Yes  Yes   No   Yes     Fast    Active
      ge-0/0/47      Actor   Yes    No    No   No  Yes   Yes     Fast    Active
      ge-0/0/47    Partner    No    No   Yes  Yes   No   Yes     Fast    Active
    LACP protocol:        Receive State  Transmit State          Mux State
      ge-0/0/44                 Expired   Fast periodic           Attached
      ge-0/0/45                 Expired   Fast periodic           Attached
      ge-0/0/46                 Expired   Fast periodic           Attached
      ge-0/0/47                 Expired   Fast periodic           Attached

(...)

Aggregated interface: ae0
    LACP state:       Role   Exp   Def  Dist  Col  Syn  Aggr  Timeout  Activity
      ge-0/0/44      Actor    No   Yes    No   No   No   Yes     Fast    Active
      ge-0/0/44    Partner    No   Yes    No   No   No   Yes     Fast   Passive
      ge-0/0/45      Actor    No   Yes    No   No   No   Yes     Fast    Active
      ge-0/0/45    Partner    No   Yes    No   No   No   Yes     Fast   Passive
      ge-0/0/46      Actor    No   Yes    No   No   No   Yes     Fast    Active
      ge-0/0/46    Partner    No   Yes    No   No   No   Yes     Fast   Passive
      ge-0/0/47      Actor    No   Yes    No   No   No   Yes     Fast    Active
      ge-0/0/47    Partner    No   Yes    No   No   No   Yes     Fast   Passive
    LACP protocol:        Receive State  Transmit State          Mux State
      ge-0/0/44               Defaulted   Fast periodic           Detached
      ge-0/0/45               Defaulted   Fast periodic           Detached
      ge-0/0/46               Defaulted   Fast periodic           Detached
      ge-0/0/47               Defaulted   Fast periodic           Detached

(...)

Aggregated interface: ae0
    LACP state:       Role   Exp   Def  Dist  Col  Syn  Aggr  Timeout  Activity
      ge-0/0/44      Actor    No    No    No   No   No   Yes     Fast    Active
      ge-0/0/44    Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/45      Actor    No    No    No   No   No   Yes     Fast    Active
      ge-0/0/45    Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/46      Actor    No    No    No   No   No   Yes     Fast    Active
      ge-0/0/46    Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/47      Actor    No    No    No   No   No   Yes     Fast    Active
      ge-0/0/47    Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
    LACP protocol:        Receive State  Transmit State          Mux State
      ge-0/0/44                 Current   Fast periodic            Waiting
      ge-0/0/45                 Current   Fast periodic            Waiting
      ge-0/0/46                 Current   Fast periodic            Waiting
      ge-0/0/47                 Current   Fast periodic            Waiting

(...)

Aggregated interface: ae0
    LACP state:       Role   Exp   Def  Dist  Col  Syn  Aggr  Timeout  Activity
      ge-0/0/44      Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/44    Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/45      Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/45    Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/46      Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/46    Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/47      Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/47    Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
    LACP protocol:        Receive State  Transmit State          Mux State
      ge-0/0/44                 Current   Fast periodic Collecting distributing
      ge-0/0/45                 Current   Fast periodic Collecting distributing
      ge-0/0/46                 Current   Fast periodic Collecting distributing
      ge-0/0/47                 Current   Fast periodic Collecting distributing

Logs from the ToR switch - spanning tree:

root@TOR-SW01> show log rstp
Apr  4 12:24:24.173413 BDSM: Port ae0.0: Bridge Detection State Machine Called with Event: PORT_DISABLED, State: NOT_EDGE
Apr  4 12:24:24.174609 BDSM: Port ae0.0: Moved to state NOT OPER EDGE
Apr  4 12:24:24.174653 TCSM: Port ae0.0: Topo Ch State Machine Called with Event: OPEREDGE_RESET, State: ACTIVE
Apr  4 12:24:24.174682 TCSM: Port ae0.0: No Operations to perform
Apr  4 12:24:24.174748 PISM: Port ae0.0: Port Info State Machine Called with Event: PORT_DISABLED, State: CURRENT
Apr  4 12:24:24.174783 PISM: Port ae0.0: Moving to state DISABLED
Apr  4 12:24:24.174812 PISM: Port ae0.0: Moved to state DISABLED
Apr  4 12:24:24.186004 TCSM: Port ae0.0: Topo Ch State Machine Called with Event: NOT_DESG_ROOT, State: ACTIVE
Apr  4 12:24:24.186094 TCSM: Port ae0.0: Moved to state LEARNING
Apr  4 12:24:24.186129 TCSM: Port ae0.0 Role is NOT ROOT/DESIGNATED; Changing to INACTIVE state
Apr  4 12:24:24.186158 TCSM: Port ae0.0: Moved to state INACTIVE
Apr  4 12:24:24.192122 TCSM: Learnt Entries on Port ae0.0 have been flushed!
Apr  4 12:24:24.192423 MSG: Management Disabling of Port 3 Success

Apr  4 12:24:29.552531 PISM: Port ae0.0: Port Info State Machine Called with Event: PORT_ENABLED, State: DISABLED
Apr  4 12:24:29.552658 PISM: Port ae0.0: Moved to state AGED
Apr  4 12:24:29.552779 PISM: Port ae0.0: Port Info State Machine Called with Event: UPDATE_INFO, State: AGED
Apr  4 12:24:29.552820 PISM: Port ae0.0: UPDATING port info
Apr  4 12:24:29.552865 PISM: Port ae0.0: Moved to state UPDATE
Apr  4 12:24:29.552968 PISM: Port ae0.0: Moved to state CURRENT
Apr  4 12:24:29.553278 MSG: Management Enabling of Port 3 Success

Apr  4 12:24:32.562537 TMR: Port ae0.0: EDGEDELAYWHILE Timer EXPIRED forInstance: 0
Apr  4 12:24:32.562673 BDSM: Port ae0.0: Bridge Detection State Machine Called with Event: EDGEDELAYWHILE_EXP, State: NOT_EDGE
Apr  4 12:24:32.562711 BDSM: Port ae0.0: Moved to state OPER EDGE
Apr  4 12:24:32.562802 TCSM: Port ae0.0: Topo Ch State Machine Called with Event: LEARN_SET, State: INACTIVE
Apr  4 12:24:32.562844 TCSM: Port ae0.0: Moved to state LEARNING
Apr  4 12:24:32.563595 TCSM: Port ae0.0: Topo Ch State Machine Called with Event: FORWARD, State: LEARNING
Apr  4 12:24:32.563683 TCSM: Port ae0.0: Topo Ch State Machine Called with Event: OPEREDGE_SET, State: LEARNING
Apr  4 12:24:32.563719 TCSM: Port ae0.0: No Operations to perform

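Answer 1

In short, the answer is that the combination of a specific Junos release and the EX2300 is the problem.

My answers:

  1. If security features that process client IP traffic (DHCP snooping, DAI, IPSG) are configured, the EX2300's CPU can be temporarily overloaded. High CPU utilization can cause trouble for any OS process, whether it is LACP, OSPF, or SNMP. Remember that CPU time is shared among all system processes. The Juniper EX2200/EX2300 series is notorious for its limited resources (CPU, RAM, and TCAM size). I have to say the EX2300 has a modest single-core ARM Cortex-A9 CPU. To see it, run the CLI command show system boot-messages. If you read the "Aggregated Ethernet Interfaces" section of the "Ethernet Interfaces User Guide for Routing Devices" carefully, you will see:

NOTE: On EX2300 and EX3400 switches, you must configure the LACP protocol with the periodic SLOW timer to prevent flaps during CPU-intensive operational events such as Routing Engine switchover, interface flaps, and exhaustive data collection from the Packet Forwarding Engine.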

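  2. You can only tell whether it is a Layer 2 or a Layer 3 problem by experiment. Try adding or removing the Layer 3 header from the Ethernet frames (disable IPv4/IPv6 on the connected device, or learn to use a packet generator such as scapy to produce test traffic), or try deactivating suspect Junos features such as DHCP snooping, DAI, and IPSG, either in a lab environment or during a LAN maintenance window.

  3. It happens on the ToR switch because that switch differs from the others, either in configuration or in Junos release.

  4. First try set interfaces <ae interface> aggregated-ether-options lacp periodic slow. Then you could try 18.3R3-S4 or another release (believe me, the latest Junos release is not always the best). Finally, you can make the aggregated Ethernet interface static (a static LAG instead of LACP).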

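Here is our story...

We upgraded our EX2300 switches from 18.3R3-S4 to 21.4R3-S3.4, and our monitoring system started reporting STP topology changes. (By the way, 18.3R3-S4 was very stable.)

Our investigation showed that CPU utilization spikes when the control plane (Junos running on the integrated CPU) receives errors from the data plane (the PFE software running on the Broadcom switching chip, a.k.a. the ASIC):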

dc-pfe[17947]: ETH: ifd (xe-0/1/0) unknown boolean option 112
dc-pfe[17947]: IFFPC: 'IFD Ether boolean set' (opcode 55) failed
dc-pfe[17947]:   ifd 697; Ether boolean set error (22)
dc-pfe[17947]: ETH: ifd (xe-0/1/1) unknown boolean option 112
dc-pfe[17947]: IFFPC: 'IFD Ether boolean set' (opcode 55) failed
dc-pfe[17947]:   ifd 698; Ether boolean set error (22)

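When this happens, the EX2300 cannot transmit LACPDUs in time. Either some Junos process overloads the CPU, leaving lacpd and other processes in a WAIT state, or the PFE temporarily drops packets.

Either way, at the other end of the link this makes the AE interface flap, because LACP has to send and receive LACPDUs at fixed intervals.

If the AE interface is an RSTP/MSTP non-edge port, an STP topology change follows, which requires flushing the MAC addresses learned on non-edge ports. Junos then has to change the PFE state in the background (reprogramming the ASIC CAM/TCAM hardware tables, relearning MAC addresses, and so on), and the disruption caused by the STP topology change spreads across the whole LAN...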
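One way to check whether the dc-pfe errors and the LACP detach events land on the same member links is to tally both per interface from a saved copy of the messages log. The following is only a minimal sketch: the regular expressions are inferred from the log lines quoted in the question, other Junos releases may word these messages differently, and summarize is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Four lines copied verbatim from the ToR switch syslog in the question.
SAMPLE_LOG = """\
Apr  4 12:24:06  TOR-SW01 lacpd[17161]: LACP_INTF_MUX_STATE_CHANGED: ae0: ge-0/0/44: Lacp state changed from ATTACHED to DETACHED, actor port state : |-|DEF|-|-|OUT_OF_SYNC|AGG|SHORT|ACT|, partner port state : |-|DEF|-|-|OUT_OF_SYNC|AGG|SHORT|PAS|
Apr  4 12:24:06  TOR-SW01 dc-pfe[16887]: ETH: ifd (ge-0/0/44) unknown boolean option 112
Apr  4 12:24:06  TOR-SW01 lacpd[17161]: LACP_INTF_MUX_STATE_CHANGED: ae0: ge-0/0/45: Lacp state changed from ATTACHED to DETACHED, actor port state : |-|DEF|-|-|OUT_OF_SYNC|AGG|SHORT|ACT|, partner port state : |-|DEF|-|-|OUT_OF_SYNC|AGG|SHORT|PAS|
Apr  4 12:24:06  TOR-SW01 dc-pfe[16887]: ETH: ifd (ge-0/0/45) unknown boolean option 112
"""

# Assumption: patterns inferred from the messages above, not from Juniper documentation.
LACP_RE = re.compile(
    r"LACP_INTF_MUX_STATE_CHANGED: (?P<ae>\S+): (?P<member>\S+): "
    r"Lacp state changed from (?P<old>\w+) to (?P<new>\w+)"
)
PFE_RE = re.compile(r"dc-pfe\[\d+\]: ETH: ifd \((?P<ifd>[^)]+)\) unknown boolean option")


def summarize(log_text):
    """Count LACP detach events and dc-pfe errors per physical interface."""
    counts = defaultdict(lambda: {"lacp_detached": 0, "pfe_errors": 0})
    for line in log_text.splitlines():
        lacp = LACP_RE.search(line)
        if lacp and lacp.group("new") == "DETACHED":
            counts[lacp.group("member")]["lacp_detached"] += 1
            continue
        pfe = PFE_RE.search(line)
        if pfe:
            counts[pfe.group("ifd")]["pfe_errors"] += 1
    return dict(counts)


if __name__ == "__main__":
    for ifname, tally in sorted(summarize(SAMPLE_LOG).items()):
        print(ifname, tally)
```

Interfaces that show both counters rising in the same time window are worth a closer look. Matching counts alone do not prove which event caused the other.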

So we learned that the LACP timeout + STP problem is tied to high CPU utilization, and that it can be mitigated by configuring the EX2300 with

 set interfaces <interface> aggregated-ether-options lacp periodic slow 

And don't forget to do this on both ends of the link... https://supportportal.juniper.net/s/article/EX-How-transmit-rate-LACP-Interval-is-negotiated-between-Actor-and-Partner?language=en_US
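To see why the slow timer gives a busy CPU more headroom, it helps to put numbers on it. In IEEE 802.1AX, fast periodic means one LACPDU per second with a 3-second timeout, and slow periodic means one LACPDU every 30 seconds with a 90-second timeout. The sketch below is a simplified model under those values; the helper names and the 5-second stall are hypothetical, and a real implementation's state machine has more steps than a single comparison.

```python
# Timer values from IEEE 802.1AX (LACP); the helper names are hypothetical.
PERIODIC_INTERVAL_S = {"fast": 1, "slow": 30}
TIMEOUT_MULTIPLIER = 3  # partner information expires after three missed intervals


def lacp_timeout_s(periodic):
    """Seconds without a received LACPDU before the partner information expires."""
    return PERIODIC_INTERVAL_S[periodic] * TIMEOUT_MULTIPLIER


def survives_stall(periodic, stall_s):
    """Simplified model: True if a transmit stall is shorter than the timeout."""
    return stall_s < lacp_timeout_s(periodic)


if __name__ == "__main__":
    stall_s = 5  # hypothetical CPU stall, not a measured value
    for mode in ("fast", "slow"):
        print(mode, lacp_timeout_s(mode), survives_stall(mode, stall_s))
```

In this model a 5-second stall expires a fast-timer link but not a slow-timer one. The trade-off is that with the slow timer a genuinely dead partner can take up to 90 seconds to be detected through LACP alone.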
