Server hangs due to kernel BUG at ./include/linux/skbuff.h:4470

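The server runs Linux 5.5.0-050500-generic on Ubuntu 20.04 LTS, and I have two interfaces attached to an OVS bridge. In the normal flow, when pinging from an external traffic generator [a standard NIC with 2 interfaces, each in a different namespace], packets are redirected from one interface on the bridge to the other. This works fine. When I run iperf/iperf3, however, the kernel crashes. The kernel log at that time is below.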

[  589.827773] kernel BUG at ./include/linux/skbuff.h:4470!
[  589.827812] invalid opcode: 0000 [#1] SMP NOPTI
[  589.827818] CPU: 49 PID: 0 Comm: swapper/49 Tainted: G           OE     5.5.0-050500-generic #202001262030
[  589.827820] Hardware name: Dell Inc. PowerEdge R740/0WXD1Y, BIOS 2.6.4 04/09/2020
[  589.827881] Code: 28 89 47 2c e9 66 ff ff ff 48 8d 5f 50 48 89 df e8 ee a3 45 fa 84 c0 0f 84 52 ff ff ff 48 89 df e8 ae f6 45 fa e9 45 ff ff ff <0f> 0b 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 54 53 48
[  589.827889] RSP: 0018:ffffb1e0872cc660 EFLAGS: 00010202
[  589.827896] RAX: 0000000000000008 RBX: ffff935334acd300 RCX: 0000000000000001
[  589.827899] RDX: 37815ffd09b20000 RSI: ffff934b67091000 RDI: ffff935334acd300
[  589.827901] RBP: ffffb1e0872cc698 R08: ffff9357175114ac R09: 0000000000000001
[  589.827904] R10: 0000000000000128 R11: 0000000000000178 R12: ffff934b67091000
[  589.827906] R13: ffff934b67094000 R14: ffff93571578c480 R15: 0000000000000001
[  589.827909] FS:  0000000000000000(0000) GS:ffff93571fc00000(0000) knlGS:0000000000000000
[  589.827914] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  589.827917] CR2: 00007f3cac01b468 CR3: 000000097160a001 CR4: 00000000007606e0
[  589.827920] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  589.827922] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  589.827924] PKRU: 55555554
[  589.827927] Call Trace:
[  589.827930]  <IRQ>
[  589.827942]  dev_hard_start_xmit+0x91/0x1f0
[  589.827953]  ? validate_xmit_skb+0x2f0/0x340
[  589.827965]  sch_direct_xmit+0x113/0x340
[  589.827976]  __dev_queue_xmit+0x57e/0x9d0
[  589.827986]  ? reweight_entity+0x16d/0x1b0
[  589.827995]  dev_queue_xmit+0x10/0x20
[  589.828007]  ovs_vport_send+0xa3/0x140 [openvswitch]
[  589.828014]  do_output+0x59/0x170 [openvswitch]
[  589.828022]  do_execute_actions+0x9ae/0x9d0 [openvswitch]
[  589.828031]  ? timerqueue_add+0x9b/0xb0
[  589.828044]  ? enqueue_hrtimer+0x3d/0x90
[  589.828054]  ? ktime_get+0x3e/0xa0
[  589.828062]  ? __update_load_avg_cfs_rq+0x1eb/0x2c0
[  589.828066]  ? attach_entity_load_avg+0x132/0x1a0
[  589.828071]  ? kmem_cache_alloc_node+0x1b3/0x260
[  589.828079]  ovs_execute_actions+0x48/0x110 [openvswitch]
[  589.828086]  ovs_dp_process_packet+0x99/0x1c0 [openvswitch]
[  589.828101]  ? netdev_create+0x40/0x40 [openvswitch]
[  589.828114]  ? ovs_ct_update_key+0x4d/0x110 [openvswitch]
[  589.828122]  ? netdev_create+0x40/0x40 [openvswitch]
[  589.828130]  ovs_vport_receive+0x77/0xd0 [openvswitch]
[  589.828135]  ? __update_load_avg_cfs_rq+0x1eb/0x2c0
[  589.828139]  ? account_entity_enqueue+0xa7/0xd0
[  589.828149]  ? __enqueue_entity+0x96/0xa0
[  589.828161]  ? enqueue_entity+0x116/0x660
[  589.828170]  ? record_times+0x1b/0x90
[  589.828179]  ? native_smp_send_reschedule+0x2a/0x40
[  589.828190]  netdev_frame_hook+0xca/0x190 [openvswitch]
[  589.828196]  __netif_receive_skb_core+0x2db/0xf70
[  589.828210]  ? get_page_from_freelist+0x1dc/0x390
[  589.828218]  ? tcp4_gro_receive+0x136/0x1a0
[  589.828225]  __netif_receive_skb_list_core+0x126/0x2c0
[  589.828231]  netif_receive_skb_list_internal+0x1d5/0x300
[  589.828237]  gro_normal_list.part.0+0x1e/0x40
[  589.828247]  napi_complete_done+0x91/0x140
[  589.828273]  efx_poll+0x282/0x580 [sfc]
[  589.828280]  net_rx_action+0x147/0x3b0
[  589.828289]  __do_softirq+0xe1/0x2d6
[  589.828297]  irq_exit+0xae/0xb0
[  589.828302]  do_IRQ+0x5a/0xf0
[  589.828306]  common_interrupt+0xf/0xf
[  589.828308]  </IRQ>
[  589.828316] RIP: 0010:cpuidle_enter_state+0xca/0x3e0

Answer 1

Go back to the standard Ubuntu kernel (currently v5.4):

sudo apt update && sudo apt install linux-generic
sudo apt-get autoremove "linux-image-unsigned-5.5.0-*"

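Or, if you really need a later version, you can get a reasonably modern and supported kernel (currently v5.8) by installing the hardware enablement branch: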

sudo apt-get install linux-generic-hwe-20.04

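The kernel that led to this problem is probably a "mainline" build provided by Canonical: a one-off binary meant only to help you diagnose kernel problems. Do not run unsupported mainline builds in production, and stop running them as soon as you have tracked down whatever bug you were using them for.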
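
If you are not sure which machines are still booted into such a build, a rough check on the release string can help. The sketch below is only a heuristic and rests on an assumption of mine: that mainline builds carry a six-digit ABI field (as in the 5.5.0-050500-generic from the question), while stock Ubuntu kernels use a short number (for example 5.4.0-42-generic). The function name is_mainline_build is made up for illustration.

```shell
#!/bin/sh
# Heuristic (an assumption, not an official rule): mainline builds carry a
# six-digit ABI field such as "050500", while stock Ubuntu kernels use a
# short ABI number such as "42".
is_mainline_build() {
    case "$1" in
        *-[0-9][0-9][0-9][0-9][0-9][0-9]-*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the kernel that is currently running.
release="$(uname -r)"
if is_mainline_build "$release"; then
    echo "$release looks like a mainline build"
else
    echo "$release does not look like a mainline build"
fi
```

Release-candidate builds or locally compiled kernels may use other naming patterns, so treat a negative result as a hint rather than proof.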

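OVS has been broken a number of times and will be broken again, and the problem you are hitting has most likely already been fixed in all supported versions (distribution or upstream).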

Still, do try asking whoever set things up this way.

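Of course it is bad for a server's kernel to be left deprecated without any attention for a year, but the problem that led to that decision may also have had a serious business impact, and the consequences could be ugly if switching kernels reintroduces an old bug and you cannot test after the switch.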
