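I run Debian bullseye on a (second-hand) bare-metal server that occasionally crashes (3 times in 8 days so far), and I can't work out why. I also haven't found a way to reproduce it, because the cause seems to come from outside the system.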
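In all three cases, the following happened:
- The system was (practically) idle.
- The kernel log contains a stack trace with the error
NETDEV WATCHDOG: enp0s31f6 (e1000e): transmit queue 0 timed out
and no preceding messages (the gap to the previous log message is typically several hours).
- The message
e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
then repeats every 10-20 seconds.
- At this point the network is down and I can no longer reach the machine, so each time I issued a hardware reset to get it up and running again.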
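The first time, I did try to see whether I could reset the network via the console (I did not try removing and re-inserting the driver module, and I'm not sure whether that would have helped). All in all it did not seem a very productive effort, so I decided to reboot and hope for the best.
Can anyone help me with an approach to debug this situation if it occurs again, perhaps with some pointers on how to reproduce the problem, and a way to get the network running again without resetting the hardware?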
Log file
(This is only the first occurrence; the logs from all 3 occurrences are identical, or at least very similar.)
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.662109] ------------[ cut here ]------------
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.662249] NETDEV WATCHDOG: enp0s31f6 (e1000e): transmit queue 0 timed out
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.662401] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:467 dev_watchdog+0x260/0x270
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.662554] Modules linked in: dm_mod xt_nat vhost_net vhost vhost_iotlb tap tun xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_
ipv6 nf_defrag_ipv4 nft_counter nf_tables nfnetlink bridge stp llc intel_rapl_msr intel_rapl_common intel_pmc_core_pltdrv intel_pmc_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel evdev kvm irqbypass rapl intel_cstate intel_uncore wdat_wdt intel_pch_thermal
watchdog ee1004 serio_raw ie31200_edac acpi_pad button drm fuse configfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid0 multip
ath linear raid1 md_mod crc32_pclmul crc32c_intel ahci xhci_pci ghash_clmulni_intel xhci_hcd libahci nvme e1000e libata aesni_intel usbcore libaes crypto_simd scsi_mod nvme_core ptp psmouse pps_core cryptd glue_helper t10_pi i2c_i801 crc_t10dif
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.663385] crct10dif_generic i2c_smbus crct10dif_pclmul crct10dif_common wmi usb_common video
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.664310] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.10.0-21-amd64 #1 Debian 5.10.162-1
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.664461] Hardware name: FUJITSU /D3417-B2, BIOS V5.0.0.12 R1.27.0.SR.1 for D3417-B2x 06/10/2020
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.664630] RIP: 0010:dev_watchdog+0x260/0x270
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.664747] Code: eb a9 48 8b 1c 24 c6 05 c7 16 0d 01 01 48 89 df e8 b5 73 fa ff 44 89 e9 48 89 de 48 c7 c7 08 b8 b6 91 48 89 c2 e8 da a0 14 00 <0f> 0b eb 86 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.664968] RSP: 0018:ffffbb7e40128eb0 EFLAGS: 00010282
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.665088] RAX: 0000000000000000 RBX: ffff920c20740000 RCX: 000000000000083f
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.665234] RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000083f
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.665381] RBP: ffff920c207403dc R08: 0000000000000000 R09: ffffbb7e40128cd0
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.665532] R10: ffffbb7e40128cc8 R11: ffffffff920cb6a8 R12: ffff920b4143c080
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.665681] R13: 0000000000000000 R14: ffff920c20740480 R15: 0000000000000001
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.665832] FS: 0000000000000000(0000) GS:ffff921a2e440000(0000) knlGS:0000000000000000
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.665985] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.666101] CR2: 000000c0002f9000 CR3: 0000000c9480a001 CR4: 00000000003726e0
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.666249] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.666394] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.666537] Call Trace:
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.666646] <IRQ>
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.666754] ? pfifo_fast_enqueue+0x150/0x150
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.666868] call_timer_fn+0x27/0x100
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.666988] __run_timers.part.0+0x1d9/0x250
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667106] ? ktime_get+0x35/0xa0
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667223] ? lapic_next_deadline+0x28/0x40
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667340] ? clockevents_program_event+0x8a/0xf0
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667462] run_timer_softirq+0x26/0x50
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667536] __do_softirq+0xc2/0x279
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667610] asm_call_irq_on_stack+0xf/0x20
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667684] </IRQ>
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667755] do_softirq_own_stack+0x37/0x50
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667830] irq_exit_rcu+0x92/0xc0
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667904] sysvec_apic_timer_interrupt+0x36/0x80
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667980] asm_sysvec_apic_timer_interrupt+0x12/0x20
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668057] RIP: 0010:cpuidle_enter_state+0xc7/0x350
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668133] Code: 8b 3d dd 71 f4 6e e8 b8 9a 9f ff 49 89 c5 0f 1f 44 00 00 31 ff e8 29 a6 9f ff 45 84 ff 0f 85 fe 00 00 00 fb 66 0f 1f 44 00 00 <45> 85 f6 0f 88 0a 01 00 00 49 63 c6 4c 2b 2c 24 48 8d 14 40 48 8d
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668256] RSP: 0018:ffffbb7e400c3ea8 EFLAGS: 00000246
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668334] RAX: ffff921a2e473c40 RBX: 0000000000000006 RCX: 000000000000001f
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668425] RDX: 0000000000000000 RSI: 0000000021c15a3d RDI: 0000000000000000
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668517] RBP: ffff921a2e47e800 R08: 00007429fb821b6a R09: 0000000000000001
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668608] R10: 0000000000000000 R11: 0000000000002b55 R12: ffffffff921aea80
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668700] R13: 00007429fb821b6a R14: 0000000000000006 R15: 0000000000000000
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668792] ? cpuidle_enter_state+0xb7/0x350
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668867] cpuidle_enter+0x29/0x40
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668941] do_idle+0x1f3/0x2b0
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.669015] cpu_startup_entry+0x19/0x20
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.669089] secondary_startup_64_no_verify+0xb0/0xbb
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.669166] ---[ end trace 4e1f5ac6215c3384 ]---
Hardware info
# lspci -vvvv -s 0000:00:1f.6
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
Subsystem: Fujitsu Technology Solutions Ethernet Connection (2) I219-LM
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 126
IOMMU group: 8
Region 0: Memory at ef200000 (32-bit, non-prefetchable) [size=128K]
Capabilities: [c8] Power Management version 3
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee002b8 Data: 0000
Capabilities: [e0] PCI Advanced Features
AFCap: TP+ FLR+
AFCtrl: FLR-
AFStatus: TP-
Kernel driver in use: e1000e
Kernel modules: e1000e
uname -a
Linux Debian-1106-bullseye-amd64-base 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux
Kernel package info
apt show linux-image-5.10.0-21-amd64
Package: linux-image-5.10.0-21-amd64
Version: 5.10.162-1
Built-Using: linux (= 5.10.162-1)
Priority: optional
Section: kernel
Source: linux-signed-amd64 (5.10.162+1)
Maintainer: Debian Kernel Team <[email protected]>
Installed-Size: 318 MB
Depends: kmod, linux-base (>= 4.3~), initramfs-tools (>= 0.120+deb8u2) | linux-initramfs-tool
Recommends: firmware-linux-free, apparmor
Suggests: linux-doc-5.10, debian-kernel-handbook, grub-pc | grub-efi-amd64 | extlinux
Conflicts: linux-image-5.10.0-21-amd64-unsigned
Breaks: fwupdate (<< 12-7), initramfs-tools (<< 0.120+deb8u2), wireless-regdb (<< 2019.06.03-1~), xserver-xorg-input-vmmouse (<< 1:13.0.99)
Replaces: linux-image-5.10.0-21-amd64-unsigned
Homepage: https://www.kernel.org/
Download-Size: 55.5 MB
APT-Manual-Installed: no
APT-Sources: http://security.debian.org/debian-security bullseye-security/main amd64 Packages
Description: Linux 5.10 for 64-bit PCs (signed)
The Linux kernel 5.10 and modules for use on PCs with AMD64, Intel 64 or
VIA Nano processors.
.
The kernel image and modules are signed for use with Secure Boot.
Answer 1
One of the first things to try when you see TX timeouts is disabling TSO (TCP segmentation offload).
sudo ethtool -K enp0s31f6 tso off
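To confirm that the change took effect, the offload state can be read back with lowercase `-k` (show features), as opposed to uppercase `-K` (change features). The following is a minimal sketch; the helper name `tso_state` is my own invention, and it assumes the usual `tcp-segmentation-offload: on|off` line in the `ethtool -k` output.

```shell
# tso_state: read `ethtool -k IFACE` output on stdin and print "on" or "off"
# for TCP segmentation offload. Returns 1 if no such line is found.
# (Hypothetical helper, not part of ethtool.)
tso_state() {
    awk -F': *' '
        # Match only the top-level feature line, not the indented sub-features.
        $1 ~ /^[ \t]*tcp-segmentation-offload$/ {
            split($2, v, " ")   # drop a trailing "[fixed]" marker, if any
            print v[1]
            found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Usage on the affected machine (not executed here):
#   sudo ethtool -K enp0s31f6 tso off
#   ethtool -k enp0s31f6 | tso_state
```

Note that a setting made with `ethtool -K` does not survive a reboot, so it would need to be reapplied after each boot while you are testing.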
I'd also be interested to know whether ethtool -S enp0s31f6 shows any odd counters, such as any errors, or in particular tx_tcp_seg_failed and tx_tcp_seg_good.
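One way to spot odd counters quickly is to filter the statistics down to the non-zero ones whose names suggest trouble. This is only a sketch: the function name and the list of name fragments (`err`, `fail`, `drop`, `timeout`, `abort`) are my own heuristic, not something defined by ethtool or the driver.

```shell
# suspicious_counters: read `ethtool -S IFACE` output on stdin and print every
# counter whose name hints at a problem and whose value is non-zero.
# (Hypothetical helper; the name patterns are an assumed heuristic.)
suspicious_counters() {
    awk -F': *' '
        { name = $1; sub(/^[ \t]+/, "", name) }   # strip leading indentation
        name ~ /(err|fail|drop|timeout|abort)/ && $2 + 0 != 0 {
            print name ": " $2
        }
    '
}

# Usage on the affected machine (not executed here):
#   ethtool -S enp0s31f6 | suspicious_counters
```

An empty result means none of the matched counters has moved; tx_tcp_seg_good is expected to be non-zero whenever TSO is in use, so it is deliberately not matched.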
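If you are having interrupt problems (which would surprise me), you can always try disabling MSI or MSI-X by loading the driver with the IntMode= parameter. See the kernel documentation.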
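As a rough sketch of what that could look like: according to the e1000e driver documentation, IntMode accepts 0 (legacy interrupts), 1 (MSI) and 2 (MSI-X). The helper below only writes a modprobe options line; its name and the default file path `/etc/modprobe.d/e1000e.conf` are assumptions of mine, so adjust them to your setup.

```shell
# write_intmode_conf MODE [FILE]: write a modprobe options line that selects
# the e1000e interrupt mode (0 = legacy, 1 = MSI, 2 = MSI-X).
# (Hypothetical helper; FILE defaults to an assumed path.)
write_intmode_conf() {
    intmode=$1
    conf_file=${2:-/etc/modprobe.d/e1000e.conf}
    case "$intmode" in
        0|1|2) ;;
        *) echo "IntMode must be 0, 1 or 2" >&2; return 1 ;;
    esac
    printf 'options e1000e IntMode=%s\n' "$intmode" > "$conf_file"
}

# Usage as root on the affected machine (not executed here):
#   write_intmode_conf 0
```

The option is only read when the module is loaded, so the driver has to be reloaded or the machine rebooted afterwards. If e1000e is loaded from the initramfs, the initramfs probably needs regenerating as well.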
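For reference, here is the output from my I219 running e1000e. If any of the statistics below are non-zero for you where mine are zero, I suggest looking closely at why those counters are rising.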
$ ethtool -S enp0s31f6 | grep tx_
tx_packets: 133102433
tx_bytes: 178802443357
tx_broadcast: 163
tx_multicast: 5121
tx_errors: 0
tx_dropped: 0
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_fifo_errors: 0
tx_heartbeat_errors: 0
tx_window_errors: 0
tx_abort_late_coll: 0
tx_deferred_ok: 0
tx_single_coll_ok: 0
tx_multi_coll_ok: 0
tx_timeout_count: 0
tx_restart_queue: 0
tx_tcp_seg_good: 20245901
tx_tcp_seg_failed: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0
tx_smbus: 0
tx_dma_failed: 0
tx_hwtstamp_timeouts: 0
tx_hwtstamp_skipped: 0
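To make that comparison less tedious, the reference output above and your own output can each be saved to a file and compared mechanically. A minimal sketch follows; the function name and the file names in the usage comment are hypothetical.

```shell
# zero_in_reference_only REF MINE: both files hold `ethtool -S` output.
# Prints every counter that is 0 in REF but non-zero in MINE.
# (Hypothetical helper, not part of ethtool.)
zero_in_reference_only() {
    awk -F': *' '
        { name = $1; sub(/^[ \t]+/, "", name) }   # strip leading indentation
        # First file: remember counters whose value is exactly zero.
        NR == FNR { if ($2 != "" && $2 + 0 == 0) ref_zero[name] = 1; next }
        # Second file: report those counters when they are non-zero here.
        (name in ref_zero) && $2 + 0 != 0 { print name ": " $2 }
    ' "$1" "$2"
}

# Usage (not executed here; file names are placeholders):
#   ethtool -S enp0s31f6 | grep tx_ > mine.txt
#   zero_in_reference_only reference.txt mine.txt
```

Given the reported symptom, tx_timeout_count is the counter I would expect to show up first, since it should go up each time the watchdog fires.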