升级至 16.04 版后，tg3 以太网适配器出现问题

2024-6-9 • tag-icon

在我的 HP ProLiant MicroServer Gen8 服务器上全新安装 16.04 Server 后，我遇到了网络随机断开连接的问题。

服务器将在几个小时到一个多星期内正常运行。但是，在某个时候，它会断开与网络的连接。当这种情况发生时，Syslog 只会显示以下消息。

Jul 12 22:46:11 gil kernel: [210256.898076] tg3 0000:03:00.0 eno1: Link is down

拔掉并重新插入网线没有用。我也尝试了交换机上的另一个端口。

该服务器之前在 14.04 版本下运行稳定，因此我怀疑这是 4.4 内核的 tg3 驱动程序中的一个错误。

ethtool：

ole@gil:~$ sudo ethtool eno1
Settings for eno1:
    Supported ports: [ TP ]
    Supported link modes:   10baseT/Half 10baseT/Full
                            100baseT/Half 100baseT/Full
                            1000baseT/Half 1000baseT/Full
    Supported pause frame use: No
    Supports auto-negotiation: Yes
    Advertised link modes:  10baseT/Half 10baseT/Full
                            100baseT/Half 100baseT/Full
                            1000baseT/Half 1000baseT/Full
    Advertised pause frame use: Symmetric
    Advertised auto-negotiation: Yes
    Link partner advertised link modes:  10baseT/Half 10baseT/Full
                                         100baseT/Half 100baseT/Full
                                         1000baseT/Half 1000baseT/Full
    Link partner advertised pause frame use: Symmetric Receive-only
    Link partner advertised auto-negotiation: Yes
    Speed: 1000Mb/s
    Duplex: Full
    Port: Twisted Pair
    PHYAD: 1
    Transceiver: internal
    Auto-negotiation: on
    MDI-X: off
    Supports Wake-on: g
    Wake-on: g
    Current message level: 0x000000ff (255)
                           drv probe link timer ifdown ifup rx_err tx_err
    Link detected: yes

ip 链接显示

ole@gil:~$ ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether b0:5a:da:87:43:80 brd ff:ff:ff:ff:ff:ff
3: eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether b0:5a:da:87:43:81 brd ff:ff:ff:ff:ff:ff

消息

ole@gil:~$ dmesg | grep tg3    
[    5.341202] tg3.c:v3.137 (May 11, 2014)
[    5.441154] tg3 0000:03:00.0 eth0: Tigon3 [partno(N/A) rev 5720000] (PCI Express) MAC address b0:5a:da:87:43:80
[    5.483079] tg3 0000:03:00.0 eth0: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
[    5.591514] tg3 0000:03:00.0 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
[    5.634705] tg3 0000:03:00.0 eth0: dma_rwctrl[00000001] dma_mask[64-bit]
[    5.685464] tg3 0000:03:00.1 eth1: Tigon3 [partno(N/A) rev 5720000] (PCI Express) MAC address b0:5a:da:87:43:81
[    5.769032] tg3 0000:03:00.1 eth1: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
[    5.809242] tg3 0000:03:00.1 eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
[    5.851124] tg3 0000:03:00.1 eth1: dma_rwctrl[00000001] dma_mask[64-bit]
[    5.873733] tg3 0000:03:00.0 eno1: renamed from eth0
[    6.577027] tg3 0000:03:00.1 eno2: renamed from eth1
[   18.700979] tg3 0000:03:00.0 eno1: Link is up at 1000 Mbps, full duplex
[   18.700982] tg3 0000:03:00.0 eno1: Flow control is on for TX and on for RX
[   18.700983] tg3 0000:03:00.0 eno1: EEE is disabled

有什么提示可以解决此问题吗？我不希望降级到 14.04。

更新：在最近的故障发生后，注意到 kern.log 中出现了以下新条目：

Jul 28 01:46:23 gil kernel: [709412.700133] NMI: PCI system error (SERR) for reason b1 on CPU 0.
Jul 28 01:46:23 gil kernel: [709412.700998] Dazed and confused, but trying to continue
Jul 28 01:46:35 gil kernel: [709424.063839] tg3 0000:03:00.0 eno1: Link is down

知道是什么原因造成的吗？在 14.04 中从未见过这样的情况。

相关内容