Diagnosing an M.2 PCIe I225-V NIC that fails intermittently under PCI passthrough

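I am using an M.2 PCIe I225-V NIC (http://www.iocrest.com/index.php?id=2316) as a PCI passthrough device on my Proxmox host, passed through to a pfSense VM (running FreeBSD 14, though the same thing happens on FreeBSD 13). For some time it has been misbehaving: the interface keeps going down and coming back up, especially under load. dmesg shows: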

igc0: link state changed to DOWN
igc0: link state changed to UP
igc0: link state changed to DOWN
igc0: link state changed to UP
igc0: link state changed to DOWN
igc0: link state changed to UP
...

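It is my WAN interface, so of course the network drops whenever this happens. I have double-checked the cabling and connections, and everything looks fine.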
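To put a number on the flapping, the dmesg lines above can be counted per interface. This is only a minimal sketch: the function name count_link_flaps is made up for illustration, and it assumes the messages keep the exact "link state changed to UP/DOWN" wording shown above.

```python
import re

# Matches e.g. "igc0: link state changed to DOWN", with or without a
# leading timestamp or other prefix on the line.
LINK_RE = re.compile(r"(\w+): link state changed to (UP|DOWN)")

def count_link_flaps(dmesg_text):
    """Return {interface: number of DOWN transitions} found in dmesg-style text."""
    flaps = {}
    for line in dmesg_text.splitlines():
        m = LINK_RE.search(line)
        if m and m.group(2) == "DOWN":
            flaps[m.group(1)] = flaps.get(m.group(1), 0) + 1
    return flaps

sample = """igc0: link state changed to DOWN
igc0: link state changed to UP
igc0: link state changed to DOWN
igc0: link state changed to UP
igc0: link state changed to DOWN
igc0: link state changed to UP
..."""

print(count_link_flaps(sample))  # {'igc0': 3}
```

Running this against dmesg output captured before and after a load test would show whether the flap count really tracks load.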

Output of lspci -s 03:00.0 -vv on the Proxmox host:

03:00.0 Ethernet controller: Intel Corporation Ethernet Controller I225-V (rev 03)
        Subsystem: Intel Corporation Ethernet Controller I225-V
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 16
        IOMMU group: 10
        Region 0: Memory at c1100000 (32-bit, non-prefetchable) [size=1M]
        Region 3: Memory at c1200000 (32-bit, non-prefetchable) [size=16K]
        Expansion ROM at c1000000 [disabled] [size=1M]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
                Address: 0000000000000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [70] MSI-X: Enable+ Count=5 Masked-
                Vector table: BAR=3 offset=00000000
                PBA: BAR=3 offset=00002000
        Capabilities: [a0] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L1, Exit Latency L1 <4us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s (ok), Width x1 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
                         EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [140 v1] Device Serial Number 88-c9-b3-ff-ff-bf-77-ab
        Capabilities: [1c0 v1] Latency Tolerance Reporting
                Max snoop latency: 3145728ns
                Max no snoop latency: 3145728ns
        Capabilities: [1f0 v1] Precision Time Measurement
                PTMCap: Requester:+ Responder:- Root:-
                PTMClockGranularity: 4ns
                PTMControl: Enabled:+ RootSelected:-
                PTMEffectiveGranularity: 4ns
        Capabilities: [1e0 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2- PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+ L1_PM_Substates+
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                L1SubCtl2:
        Kernel driver in use: vfio-pci
        Kernel modules: igc

This looks fine to me.
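Rather than eyeballing the dump, the lines most likely to reveal a PCIe-level problem can be checked mechanically: any "+" flag in the AER status lines (UESta, CESta) and a "(downgraded)" marker on LnkSta. The sketch below is an assumption about what is worth checking, not a complete health test, and check_lspci is a hypothetical helper name.

```python
def set_flags(line):
    """Return the names of the flags marked '+' on one lspci status line."""
    return [tok[:-1] for tok in line.split()[1:] if tok.endswith("+")]

def check_lspci(text):
    """Scan `lspci -vv` text and return a dict of suspicious findings."""
    findings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(("UESta:", "CESta:")):
            flags = set_flags(line)
            if flags:
                # e.g. {"CESta": ["RxErr", "BadTLP"]}
                findings[line.split(":")[0]] = flags
        elif line.startswith("LnkSta:") and "(downgraded)" in line:
            # Link trained below the speed or width the device is capable of.
            findings["LnkSta"] = [line]
    return findings

# Three lines copied from the output above.
SAMPLE = """
                LnkSta: Speed 5GT/s (ok), Width x1 (ok)
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
"""

print(check_lspci(SAMPLE))  # {}
```

An empty result for the dump above matches the "looks fine" reading. The dump is a single snapshot, though, so it would be more telling to run the check on output captured while the link is flapping.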

Here is what I have tried:

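  • Disabling/enabling offloads: disabling hardware checksum offload makes things worse, and enabling TSO or LRO also makes things worse, even though the driver (igc) supports them. All done through the pfSense GUI.
  • Disabling EEE with the following entry in /boot/loader.conf:
hw.igc.eee_enable=0
  • Manually setting speed/duplex to 2500Base-T in the pfSense GUI instead of autoselect.
  • Watching the lspci output on the Proxmox host for power management changes. Even when the interface goes down, the PMv3 state stays at D0.
  • Checking input/output errors in the pfSense GUI: none found. Interrupts look normal too (roughly 100-200 per second).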
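The power-state check in the list above can be automated so that a brief excursion out of D0 is not missed between manual runs. This is a rough sketch under several assumptions: it runs as root on the Proxmox host (lspci hides capabilities from unprivileged users), the slot is 03:00.0 as in the output above, and the names power_state and watch are hypothetical.

```python
import re
import subprocess
import time

# The Power Management capability prints "Status: D0 NoSoftRst+ ...".
# The device-level "Status: Cap+ ..." line does not match this pattern.
PM_STATUS_RE = re.compile(r"^\s*Status:\s+(D[0-3])\b", re.MULTILINE)

def power_state(lspci_text):
    """Return the power state ('D0'..'D3') from `lspci -vv` text, or None."""
    m = PM_STATUS_RE.search(lspci_text)
    return m.group(1) if m else None

def watch(slot="03:00.0", interval=1.0):
    """Poll lspci and print a timestamped line whenever the state changes."""
    last = None
    while True:
        out = subprocess.run(
            ["lspci", "-s", slot, "-vv"],
            capture_output=True, text=True,
        ).stdout
        state = power_state(out)
        if state != last:
            print(time.strftime("%H:%M:%S"), slot, "power state:", state)
            last = state
        time.sleep(interval)

# Call watch() on the host while reproducing the problem; stop with Ctrl-C.
```

A one-second poll can still miss very short transitions, so a log that only ever shows D0 is suggestive rather than conclusive.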

So I am out of ideas. How can I diagnose this further, and what else could be causing it? Thanks!