Dell R320, Xeon E5-2450 v1, Oracle Linux 8: clocksource 'tsc' marked as unstable, random crashes under load


I recently bought a Dell R320 with a Xeon E5-2450 v1. All firmware was updated to the latest versions using the Lifecycle Controller. At boot, dmesg reports:

microcode: microcode updated early to revision 0x71a, date = 2020-03-24
[   12.384040] clocksource: timekeeping watchdog on CPU9: Marking clocksource 'tsc' as unstable because the skew is too large:
[   12.395572] clocksource:                       'hpet' wd_now: 3b1bb82 wd_last: 2e247ff mask: ffffffff
[   12.413476] clocksource:                       'tsc' cs_now: 1c62267fd4b cs_last: 1c30b8dcf7f mask: ffffffffffffffff
[   12.425567] tsc: Marking TSC unstable due to clocksource watchdog
[   12.431666] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
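
To get a feel for how large the reported skew is, the counter readings in the watchdog message can be converted to elapsed time. The following is only a rough sketch: the hexadecimal values are copied from the log, but the two frequencies are assumptions that do not appear in the post (14.31818 MHz is a common HPET rate on Intel chipsets, and 2.1 GHz is the nominal base clock of the E5-2450). The helper name `elapsed_seconds` is made up for illustration.

```python
# Counter readings copied from the dmesg output (hexadecimal).
WD_NOW, WD_LAST = 0x3b1bb82, 0x2e247ff          # 'hpet' (watchdog) counter
CS_NOW, CS_LAST = 0x1c62267fd4b, 0x1c30b8dcf7f  # 'tsc' counter

# ASSUMPTIONS, not taken from the post:
HPET_HZ = 14_318_180      # common HPET frequency on Intel chipsets
TSC_HZ = 2_100_000_000    # nominal base clock of a Xeon E5-2450

def elapsed_seconds(now, last, mask, hz):
    """Counter delta (masked to handle wrap-around) converted to seconds."""
    return ((now - last) & mask) / hz

hpet_s = elapsed_seconds(WD_NOW, WD_LAST, 0xffffffff, HPET_HZ)
tsc_s = elapsed_seconds(CS_NOW, CS_LAST, 0xffffffffffffffff, TSC_HZ)

print(f"HPET interval: {hpet_s:.3f} s")
print(f"TSC interval:  {tsc_s:.3f} s")
print(f"skew:          {tsc_s - hpet_s:.3f} s")
```

If the assumed frequencies are close to the real ones, the two clocks disagree by several seconds over a single watchdog interval, which is far more than ordinary drift. The actual rates on a given machine are normally printed earlier in dmesg and should be used instead of the assumed constants.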

Then, if I run phoronix-test-suite stress-run stress-ng, the system becomes unresponsive after about a minute.

During the test, I see watchdog events from the network adapter:

[  705.412997] NETDEV WATCHDOG: eno1 (tg3): transmit queue 0 timed out
[  705.412997] WARNING: CPU: 9 PID: 6812 at net/sched/sch_generic.c:473 dev_watchdog+0x27d/0x281
[  705.412997] Modules linked in: xt_CHECKSUM ipt_REJECT nf_nat_tftp nft_objref nf_conntrack_tftp nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set tun rfkill scsi_transport_iscsi ip_set xt_conntrack xt_multiport xt_nat xt_addrtype xt_mark xt_MASQUERADE nft_counter xt_comment nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 veth sunrpc iTCO_wdt intel_rapl_msr iTCO_vendor_support dcdbas intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel vfat fat kvm irqbypass crct10dif_pclmul crc32_pclmul mgag200 ghash_clmulni_intel drm_vram_helper aesni_intel ttm crypto_simd cryptd glue_helper drm_kms_helper pcspkr drm syscopyarea sysfillrect sysimgblt fb_sys_fops lpc_ich i2c_algo_bit zfs(POE) joydev zunicode(POE) zzstd(OE) zlua(OE) mei_me zavl(POE) mei icp(POE) zcommon(POE) znvpair(POE) ipmi_ssif spl(OE) ioatdma dca ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter
[  705.412997]  sch_fq_codel ip_tables xfs libcrc32c sd_mod sg ahci libahci libata mpt3sas tg3 raid_class scsi_transport_sas wmi fuse
[  705.412997] CPU: 9 PID: 6812 Comm: stress-ng Kdump: loaded Tainted: P           OE     5.4.17-2136.300.7.el8uek.x86_64 #2
[  705.412997] Hardware name: Dell Inc. PowerEdge R320/0KM5PX, BIOS 2.4.2 01/29/2015
[  705.412997] RIP: 0010:dev_watchdog+0x27d/0x281
[  705.412997] Code: 48 85 c0 75 e6 eb a0 4c 89 e7 c6 05 9b 59 17 01 01 e8 c7 a9 fa ff 89 d9 4c 89 e6 48 c7 c7 68 3b 53 ac 48 89 c2 e8 be f1 82 ff <0f> 0b eb 82 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66
[  705.412997] RSP: 0000:ffffac6d003d0e50 EFLAGS: 00010282
[  705.412997] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
[  705.412997] RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff9e853f457d00
[  705.412997] RBP: ffffac6d003d0e80 R08: 0000000000000514 R09: 00000000ffffffff
[  705.412997] R10: 0000000000000000 R11: ffff9e851d84f3d0 R12: ffff9e850d8e4000
[  705.412997] R13: 0000000000000005 R14: ffff9e850d8e4480 R15: ffff9e8537d377c0
[  705.412997] FS:  00007fa4baba5740(0000) GS:ffff9e853f440000(0000) knlGS:0000000000000000
[  705.412997] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  705.412997] CR2: 00007f54983fad0c CR3: 0000000b99992006 CR4: 00000000000606e0
[  705.412997] Call Trace:
[  705.412997]  <IRQ>
[  705.412997]  ? pfifo_fast_enqueue+0x160/0x151
[  705.412997]  call_timer_fn+0x32/0x12c
[  705.412997]  run_timer_softirq+0x1a5/0x42e
[  705.412997]  __do_softirq+0xe1/0x2e7
[  705.412997]  ? hrtimer_interrupt+0x12a/0x222
[  705.412997]  irq_exit+0xf3/0xf8
[  705.412997]  smp_apic_timer_interrupt+0x79/0x130
[  705.412997]  apic_timer_interrupt+0xf/0x14
[  705.412997]  </IRQ>

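If I add the kernel command-line parameter mitigations=off at boot, phoronix runs for 4 to 7 minutes before the system becomes unresponsive again. The same thing happens with KVM guests: I tried to install Debian 11 five times, and each time the installation froze either during the initial package installation or while unpacking the kernel.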

Screenshot of the freeze message: https://ibb.co/k2Jk4QG

Has anyone run into a similar problem? Thanks!

PS: The current kernel is 5.4.17-2136.300.7.el8uek.x86_64. I also tried 4.18.0-305.19.1.el8_4.x86_64, and it made no difference.

Answer 1

Swapping the CPU for an E5-2470 v2 solved the problem. It looks like the previous CPU was somewhat defective.
