I recently bought a Dell R320 with a Xeon E5-2450 v1. All firmware has been updated to the latest version using the Lifecycle Controller. At boot, dmesg reports:
microcode: microcode updated early to revision 0x71a, date = 2020-03-24
[ 12.384040] clocksource: timekeeping watchdog on CPU9: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 12.395572] clocksource: 'hpet' wd_now: 3b1bb82 wd_last: 2e247ff mask: ffffffff
[ 12.413476] clocksource: 'tsc' cs_now: 1c62267fd4b cs_last: 1c30b8dcf7f mask: ffffffffffffffff
[ 12.425567] tsc: Marking TSC unstable due to clocksource watchdog
[ 12.431666] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
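To get a feel for how large the skew in those two watchdog lines is, here is a rough Python sketch that recomputes the counter deltas from the logged values. The masked subtraction is roughly what the kernel's clocksource watchdog does before converting to nanoseconds. The two frequencies are my assumptions and are not in the log: the common 14.31818 MHz HPET rate and the E5-2450's nominal 2.1 GHz base clock, so treat the resulting seconds as ballpark figures only.

```python
def masked_delta(now: int, last: int, mask: int) -> int:
    """Counter delta that stays correct across a wraparound of the counter."""
    return (now - last) & mask


# Values copied from the dmesg lines above.
HPET_MASK = 0xFFFFFFFF
TSC_MASK = 0xFFFFFFFFFFFFFFFF

hpet_delta = masked_delta(0x3B1BB82, 0x2E247FF, HPET_MASK)
tsc_delta = masked_delta(0x1C62267FD4B, 0x1C30B8DCF7F, TSC_MASK)

# ASSUMPTIONS (not in the log): typical HPET rate and nominal CPU base clock.
ASSUMED_HPET_HZ = 14_318_180
ASSUMED_TSC_HZ = 2_100_000_000

hpet_seconds = hpet_delta / ASSUMED_HPET_HZ
tsc_seconds = tsc_delta / ASSUMED_TSC_HZ

print(f"hpet delta: {hpet_delta} ticks, about {hpet_seconds:.3f} s (assumed rate)")
print(f"tsc  delta: {tsc_delta} ticks, about {tsc_seconds:.3f} s (assumed rate)")
print(f"apparent skew: about {tsc_seconds - hpet_seconds:.3f} s")
```

If the two intervals come out far apart under these assumed rates, that matches the kernel's complaint that the TSC and HPET disagree about how much time has passed.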
Then, if I run phoronix-test-suite stress-run stress-ng, the system becomes unresponsive after about a minute.
During the test, I see watchdog events from the network adapter:
[ 705.412997] NETDEV WATCHDOG: eno1 (tg3): transmit queue 0 timed out
[ 705.412997] WARNING: CPU: 9 PID: 6812 at net/sched/sch_generic.c:473 dev_watchdog+0x27d/0x281
[ 705.412997] Modules linked in: xt_CHECKSUM ipt_REJECT nf_nat_tftp nft_objref nf_conntrack_tftp nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set tun rfkill scsi_transport_iscsi ip_set xt_conntrack xt_multiport xt_nat xt_addrtype xt_mark xt_MASQUERADE nft_counter xt_comment nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 veth sunrpc iTCO_wdt intel_rapl_msr iTCO_vendor_support dcdbas intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel vfat fat kvm irqbypass crct10dif_pclmul crc32_pclmul mgag200 ghash_clmulni_intel drm_vram_helper aesni_intel ttm crypto_simd cryptd glue_helper drm_kms_helper pcspkr drm syscopyarea sysfillrect sysimgblt fb_sys_fops lpc_ich i2c_algo_bit zfs(POE) joydev zunicode(POE) zzstd(OE) zlua(OE) mei_me zavl(POE) mei icp(POE) zcommon(POE) znvpair(POE) ipmi_ssif spl(OE) ioatdma dca ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter
[ 705.412997] sch_fq_codel ip_tables xfs libcrc32c sd_mod sg ahci libahci libata mpt3sas tg3 raid_class scsi_transport_sas wmi fuse
[ 705.412997] CPU: 9 PID: 6812 Comm: stress-ng Kdump: loaded Tainted: P OE 5.4.17-2136.300.7.el8uek.x86_64 #2
[ 705.412997] Hardware name: Dell Inc. PowerEdge R320/0KM5PX, BIOS 2.4.2 01/29/2015
[ 705.412997] RIP: 0010:dev_watchdog+0x27d/0x281
[ 705.412997] Code: 48 85 c0 75 e6 eb a0 4c 89 e7 c6 05 9b 59 17 01 01 e8 c7 a9 fa ff 89 d9 4c 89 e6 48 c7 c7 68 3b 53 ac 48 89 c2 e8 be f1 82 ff <0f> 0b eb 82 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66
[ 705.412997] RSP: 0000:ffffac6d003d0e50 EFLAGS: 00010282
[ 705.412997] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
[ 705.412997] RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff9e853f457d00
[ 705.412997] RBP: ffffac6d003d0e80 R08: 0000000000000514 R09: 00000000ffffffff
[ 705.412997] R10: 0000000000000000 R11: ffff9e851d84f3d0 R12: ffff9e850d8e4000
[ 705.412997] R13: 0000000000000005 R14: ffff9e850d8e4480 R15: ffff9e8537d377c0
[ 705.412997] FS: 00007fa4baba5740(0000) GS:ffff9e853f440000(0000) knlGS:0000000000000000
[ 705.412997] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 705.412997] CR2: 00007f54983fad0c CR3: 0000000b99992006 CR4: 00000000000606e0
[ 705.412997] Call Trace:
[ 705.412997] <IRQ>
[ 705.412997] ? pfifo_fast_enqueue+0x160/0x151
[ 705.412997] call_timer_fn+0x32/0x12c
[ 705.412997] run_timer_softirq+0x1a5/0x42e
[ 705.412997] __do_softirq+0xe1/0x2e7
[ 705.412997] ? hrtimer_interrupt+0x12a/0x222
[ 705.412997] irq_exit+0xf3/0xf8
[ 705.412997] smp_apic_timer_interrupt+0x79/0x130
[ 705.412997] apic_timer_interrupt+0xf/0x14
[ 705.412997] </IRQ>
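If I add the kernel command-line parameter mitigations=off at boot, phoronix runs for 4 to 7 minutes, and then the system becomes unresponsive again. The same thing happens with KVM guests: I tried to install Debian 11 five times, and each time the installation froze during the initial package installation or while unpacking the kernel.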
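For anyone trying to reproduce this, a small Python sketch for confirming which boot parameters actually took effect and which clocksource the kernel ended up using. It reads /proc/cmdline and the standard clocksource sysfs file; the helper names parse_cmdline and read_text are just mine for illustration, and the parser is deliberately naive (it does not handle quoted values containing spaces).

```python
from pathlib import Path


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to None.

    Naive on purpose: quoted values containing spaces are not handled.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def read_text(path: str):
    """Return the stripped file contents, or None if the file is unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    cmdline = read_text("/proc/cmdline")
    if cmdline is not None:
        params = parse_cmdline(cmdline)
        print("mitigations =", params.get("mitigations"))
        print("tsc =", params.get("tsc"))
    print(
        "current clocksource:",
        read_text("/sys/devices/system/clocksource/clocksource0/current_clocksource"),
    )
```

On a box where the TSC was marked unstable, I would expect the current clocksource to be something other than tsc (hpet, for example), but check it on your own system rather than taking that as given.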
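Screenshot of the freeze messages: https://ibb.co/k2Jk4QG
Has anyone run into a similar problem? Thanks!
PS: The current kernel is 5.4.17-2136.300.7.el8uek.x86_64. I also tried 4.18.0-305.19.1.el8_4.x86_64, and it made no difference.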
Answer 1
Swapping the CPU for an E5-2470 v2 solved the problem, so it looks like the previous CPU was faulty.