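After a fresh install of 16.04 Server on my HP ProLiant MicroServer Gen8, I am seeing random network disconnects.
The server runs fine for anywhere from a few hours to more than a week. At some point, however, it drops off the network. When that happens, syslog shows only the following message: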
Jul 12 22:46:11 gil kernel: [210256.898076] tg3 0000:03:00.0 eno1: Link is down
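Unplugging and re-plugging the network cable does not help. I have also tried a different port on the switch.
The server ran stably under 14.04 before, so I suspect a bug in the tg3 driver in the 4.4 kernel.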
ethtool:
ole@gil:~$ sudo ethtool eno1
Settings for eno1:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Half 1000baseT/Full
Supported pause frame use: No
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Half 1000baseT/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Link partner advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Half 1000baseT/Full
Link partner advertised pause frame use: Symmetric Receive-only
Link partner advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
MDI-X: off
Supports Wake-on: g
Wake-on: g
Current message level: 0x000000ff (255)
drv probe link timer ifdown ifup rx_err tx_err
Link detected: yes
ip link show:
ole@gil:~$ ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether b0:5a:da:87:43:80 brd ff:ff:ff:ff:ff:ff
3: eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether b0:5a:da:87:43:81 brd ff:ff:ff:ff:ff:ff
dmesg:
ole@gil:~$ dmesg | grep tg3
[ 5.341202] tg3.c:v3.137 (May 11, 2014)
[ 5.441154] tg3 0000:03:00.0 eth0: Tigon3 [partno(N/A) rev 5720000] (PCI Express) MAC address b0:5a:da:87:43:80
[ 5.483079] tg3 0000:03:00.0 eth0: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
[ 5.591514] tg3 0000:03:00.0 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
[ 5.634705] tg3 0000:03:00.0 eth0: dma_rwctrl[00000001] dma_mask[64-bit]
[ 5.685464] tg3 0000:03:00.1 eth1: Tigon3 [partno(N/A) rev 5720000] (PCI Express) MAC address b0:5a:da:87:43:81
[ 5.769032] tg3 0000:03:00.1 eth1: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
[ 5.809242] tg3 0000:03:00.1 eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
[ 5.851124] tg3 0000:03:00.1 eth1: dma_rwctrl[00000001] dma_mask[64-bit]
[ 5.873733] tg3 0000:03:00.0 eno1: renamed from eth0
[ 6.577027] tg3 0000:03:00.1 eno2: renamed from eth1
[ 18.700979] tg3 0000:03:00.0 eno1: Link is up at 1000 Mbps, full duplex
[ 18.700982] tg3 0000:03:00.0 eno1: Flow control is on for TX and on for RX
[ 18.700983] tg3 0000:03:00.0 eno1: EEE is disabled
Any hints on how to solve this? I would prefer not to downgrade to 14.04.
Update: After the most recent failure, I noticed the following new entries in kern.log:
Jul 28 01:46:23 gil kernel: [709412.700133] NMI: PCI system error (SERR) for reason b1 on CPU 0.
Jul 28 01:46:23 gil kernel: [709412.700998] Dazed and confused, but trying to continue
Jul 28 01:46:35 gil kernel: [709424.063839] tg3 0000:03:00.0 eno1: Link is down
Any idea what could be causing this? I never saw anything like it under 14.04.
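To check whether an NMI shows up before every link drop or only before the latest one, something like the following rough sketch could scan the kernel log. It is only an illustration: the function name `link_drops_with_nmi`, the 60-second window, and the assumption that every relevant line carries a `kernel: [seconds.micros]` timestamp are mine, not something taken from the tg3 driver or its documentation.

```python
import re

# Matches the kernel timestamp and message,
# e.g. "kernel: [709412.700133] NMI: PCI system error ..."
LINE_RE = re.compile(r"kernel: \[\s*(\d+\.\d+)\] (.*)")


def link_drops_with_nmi(lines, iface="eno1", window=60.0):
    """Return a list of (timestamp, nmi_seen) for each 'Link is down' on iface.

    nmi_seen is True when an 'NMI: PCI system error' line was logged at most
    `window` seconds before the link drop. Timestamps are seconds since boot,
    so they restart after a reboot; a negative gap is treated as "no NMI".
    """
    last_nmi = None
    results = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # not a kernel line with a timestamp
        ts, msg = float(m.group(1)), m.group(2)
        if msg.startswith("NMI: PCI system error"):
            last_nmi = ts
        elif msg.endswith(f"{iface}: Link is down"):
            nmi_seen = last_nmi is not None and 0 <= ts - last_nmi <= window
            results.append((ts, nmi_seen))
    return results
```

Passing it an open file handle for /var/log/kern.log (or a rotated copy) gives one entry per link drop, which should make it easy to see whether the SERR message is a consistent precursor or a one-off.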