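The server runs Linux kernel 5.5.0-050500-generic on Ubuntu 20.04 LTS, and I have two interfaces attached to an OVS bridge. In the normal flow, packets are forwarded from one interface on the bridge to the other when I ping from an external traffic generator [a standard NIC with 2 ports, each port in a different namespace]. That works fine. When I run iperf/iperf3, the kernel crashes. The kernel log from that moment is below.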
[ 589.827773] kernel BUG at ./include/linux/skbuff.h:4470!
[ 589.827812] invalid opcode: 0000 [#1] SMP NOPTI
[ 589.827818] CPU: 49 PID: 0 Comm: swapper/49 Tainted: G OE 5.5.0-050500-generic #202001262030
[ 589.827820] Hardware name: Dell Inc. PowerEdge R740/0WXD1Y, BIOS 2.6.4 04/09/2020
[ 589.827881] Code: 28 89 47 2c e9 66 ff ff ff 48 8d 5f 50 48 89 df e8 ee a3 45 fa 84 c0 0f 84 52 ff ff ff 48 89 df e8 ae f6 45 fa e9 45 ff ff ff <0f> 0b 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 54 53 48
[ 589.827889] RSP: 0018:ffffb1e0872cc660 EFLAGS: 00010202
[ 589.827896] RAX: 0000000000000008 RBX: ffff935334acd300 RCX: 0000000000000001
[ 589.827899] RDX: 37815ffd09b20000 RSI: ffff934b67091000 RDI: ffff935334acd300
[ 589.827901] RBP: ffffb1e0872cc698 R08: ffff9357175114ac R09: 0000000000000001
[ 589.827904] R10: 0000000000000128 R11: 0000000000000178 R12: ffff934b67091000
[ 589.827906] R13: ffff934b67094000 R14: ffff93571578c480 R15: 0000000000000001
[ 589.827909] FS: 0000000000000000(0000) GS:ffff93571fc00000(0000) knlGS:0000000000000000
[ 589.827914] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 589.827917] CR2: 00007f3cac01b468 CR3: 000000097160a001 CR4: 00000000007606e0
[ 589.827920] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 589.827922] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 589.827924] PKRU: 55555554
[ 589.827927] Call Trace:
[ 589.827930] <IRQ>
[ 589.827942] dev_hard_start_xmit+0x91/0x1f0
[ 589.827953] ? validate_xmit_skb+0x2f0/0x340
[ 589.827965] sch_direct_xmit+0x113/0x340
[ 589.827976] __dev_queue_xmit+0x57e/0x9d0
[ 589.827986] ? reweight_entity+0x16d/0x1b0
[ 589.827995] dev_queue_xmit+0x10/0x20
[ 589.828007] ovs_vport_send+0xa3/0x140 [openvswitch]
[ 589.828014] do_output+0x59/0x170 [openvswitch]
[ 589.828022] do_execute_actions+0x9ae/0x9d0 [openvswitch]
[ 589.828031] ? timerqueue_add+0x9b/0xb0
[ 589.828044] ? enqueue_hrtimer+0x3d/0x90
[ 589.828054] ? ktime_get+0x3e/0xa0
[ 589.828062] ? __update_load_avg_cfs_rq+0x1eb/0x2c0
[ 589.828066] ? attach_entity_load_avg+0x132/0x1a0
[ 589.828071] ? kmem_cache_alloc_node+0x1b3/0x260
[ 589.828079] ovs_execute_actions+0x48/0x110 [openvswitch]
[ 589.828086] ovs_dp_process_packet+0x99/0x1c0 [openvswitch]
[ 589.828101] ? netdev_create+0x40/0x40 [openvswitch]
[ 589.828114] ? ovs_ct_update_key+0x4d/0x110 [openvswitch]
[ 589.828122] ? netdev_create+0x40/0x40 [openvswitch]
[ 589.828130] ovs_vport_receive+0x77/0xd0 [openvswitch]
[ 589.828135] ? __update_load_avg_cfs_rq+0x1eb/0x2c0
[ 589.828139] ? account_entity_enqueue+0xa7/0xd0
[ 589.828149] ? __enqueue_entity+0x96/0xa0
[ 589.828161] ? enqueue_entity+0x116/0x660
[ 589.828170] ? record_times+0x1b/0x90
[ 589.828179] ? native_smp_send_reschedule+0x2a/0x40
[ 589.828190] netdev_frame_hook+0xca/0x190 [openvswitch]
[ 589.828196] __netif_receive_skb_core+0x2db/0xf70
[ 589.828210] ? get_page_from_freelist+0x1dc/0x390
[ 589.828218] ? tcp4_gro_receive+0x136/0x1a0
[ 589.828225] __netif_receive_skb_list_core+0x126/0x2c0
[ 589.828231] netif_receive_skb_list_internal+0x1d5/0x300
[ 589.828237] gro_normal_list.part.0+0x1e/0x40
[ 589.828247] napi_complete_done+0x91/0x140
[ 589.828273] efx_poll+0x282/0x580 [sfc]
[ 589.828280] net_rx_action+0x147/0x3b0
[ 589.828289] __do_softirq+0xe1/0x2d6
[ 589.828297] irq_exit+0xae/0xb0
[ 589.828302] do_IRQ+0x5a/0xf0
[ 589.828306] common_interrupt+0xf/0xf
[ 589.828308] </IRQ>
[ 589.828316] RIP: 0010:cpuidle_enter_state+0xca/0x3e0
Answer 1
Go back to the stock Ubuntu kernel (currently v5.4):
sudo apt update && sudo apt install linux-generic
sudo apt-get autoremove "linux-image-unsigned-5.5.0-*"
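Or, if you really do need a later version, you can get a reasonably modern and supported kernel (currently v5.8) by installing the hardware enablement branch: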
sudo apt-get install linux-generic-hwe-20.04
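The kernel that produced this problem is probably one of the "mainline" builds provided by Canonical: one-off binaries meant only to help you diagnose kernel problems. Do not run unsupported mainline builds in production, and stop running them as soon as you have tracked down whatever bug you were using them to chase.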
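As a rough way to tell whether a machine is still booted into one of these mainline builds, you can inspect the kernel release string. The sketch below assumes that mainline packages carry a six-digit ABI field (as in 5.5.0-050500-generic from the log above), while distribution kernels use a short one (for example 5.4.0-42-generic). The function name is_mainline_release is made up for illustration, and the pattern is a heuristic, not an official check.

```shell
#!/bin/sh
# Heuristic (assumption): mainline builds look like X.Y.Z-NNNNNN-flavour,
# with a six-digit field after the first dash; distro kernels do not.
is_mainline_release() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+-[0-9]{6}'
}

# Check the kernel that is currently running.
release="$(uname -r)"
if is_mainline_release "$release"; then
    echo "$release looks like a mainline build"
else
    echo "$release does not look like a mainline build"
fi
```

If the check still reports a mainline build after installing linux-generic and rebooting, the boot loader is probably still selecting the old kernel image.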
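OVS has been broken many times and will break again, and the problem you hit has most likely already been fixed in every supported release (distribution or upstream).
Still, try asking whoever set the server up this way.
Of course it is bad for a server's kernel to sit abandoned without any attention for a year, but the problem that led to that decision may itself have had a serious business impact, and if you cannot test after switching kernels and you reintroduce an old bug, the consequences could be bad.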