Debugging unexpected system shutdown

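Note:

The original post has been truncated to fit the spirit of a StackExchange post (a question with answers). However, the log I kept while working through the problem is still valuable.

I have archived the last "log state" of this post on my blog: https://erotemic.wordpress.com/2021/10/01/debugging-unexpected-system-shutdown-initial-archive/

I will post further updates on the blog, and I will also edit this Super User question down to the core problem and a description of the experiments used to resolve it.

In the meantime, I have removed everything except the most recent update. I expect the whole post will be thoroughly reorganized.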


Symptoms

My machine experiences a hard shutdown when running a custom PyTorch script.

I recorded three videos demonstrating the problem:

https://www.youtube.com/watch?v=Ue4XHcusqto

https://www.youtube.com/watch?v=LPwaI1SRlXk

https://www.youtube.com/watch?v=yQ7i-8Kp6xg


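Summary of debugging steps and results

  • Measured wattage at the time of shutdown: within limits. The likelihood of a wattage problem is greatly reduced.
  • Measured temperatures at the time of shutdown: well within the CPU/GPU limits, with no serious anomalies. The likelihood of a thermal problem is greatly reduced.
  • Ran MemTest86+ overnight: all tests passed. Faulty RAM is essentially ruled out.
  • Swapped the 1600W PSU for a 1000W PSU of the same model line. The shutdown still occurred. A faulty PSU is essentially ruled out.
  • Ran with only the 1080ti, in PCIE slots #1 and #3. The shutdown still occurred in both cases. The likelihood that the 3090 is at fault is greatly reduced.
  • Ran with only the 3090, in PCIE slot #1. The shutdown still occurred. The likelihood that the 1080ti is at fault is greatly reduced.
  • Ran a different ML script with no shutdown. The likelihood that the problem lies in my custom ML script has increased.
  • Ran heavy stress tests on the CPU/GPU. The shutdown seems to occur only when running my custom ML script.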

Culprits effectively ruled out

  • Heat
  • Wattage
  • PSU
  • GPUs
  • RAM

Potential culprits and TODOs:

  • Bisect the custom ML script to find an MWE that causes the shutdown
  • Motherboard problem?
  • CPU problem?
  • Storage problem? (unlikely)
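The bisection item in the list above can be sketched roughly as follows. This is only an illustration: the step names and the triggers_shutdown callback are hypothetical, and it assumes that exactly one step is responsible and that a run with only a subset of steps is meaningful. Because the failure is a hard power-off, each trial would in practice be a manual run whose outcome is recorded after rebooting.

```python
from typing import Callable, List, Sequence


def bisect_culprit(
    steps: Sequence[str],
    triggers_shutdown: Callable[[List[str]], bool],
) -> str:
    """Return the single step whose presence triggers the failure.

    Assumes exactly one step in `steps` is responsible (an assumption,
    not something the logs confirm).
    """
    if not steps:
        raise ValueError("need at least one candidate step")
    candidates = list(steps)
    while len(candidates) > 1:
        half = len(candidates) // 2
        first, second = candidates[:half], candidates[half:]
        # Run a trial with only the first half enabled. If it still
        # fails, the culprit is in that half; otherwise it is in the rest.
        candidates = first if triggers_shutdown(first) else second
    return candidates[0]
```

With five hypothetical stages this needs at most three trials, which matters when every failed trial costs a reboot.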

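Potential solution!?

I have updated my blog with more information. The gist is that I found a BIOS setting, ASUS MultiCore Enhancement, which was set to Auto. Setting it to Disabled seems to fix the problem. I ran an experiment for over 14 hours without a power-off.

Contents of the original post: to be cleaned up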


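I am trying to debug a recurring unexpected system shutdown. It sometimes happens when the machine is under heavy load, but I cannot make it happen reliably. My current hypotheses are:

  • Drawing too much power from the wall
  • A thermal problem
  • An undiscovered hardware problem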

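Hardware + software + workload

The list of hardware in my machine can be found here: https://pcpartpicker.com/user/erotemic/saved/#view=WKpmD3

The relevant parts are:

  • CPU: Intel i9-11900K with a Noctua NH-D15 air cooler
  • GPU0: RTX 3090 MSI Trio (connected to the monitor)
  • GPU1: GTX 1080ti, since upgraded to a second RTX 3090 (EVGA XC3 Hybrid)
  • PSU: EVGA T2 1600 W 80+ Titanium

I am running stock Ubuntu 21.04.

I stress the machine with a few different workloads:

  • ethermine, using both GPUs.
  • BOINC, with climateprediction.net and World Community Grid (set to use 90% of the CPU whenever the machine is not in use).
  • A custom machine learning workflow using PyTorch.

I have not used ethermine recently; I have been running my ML workload.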

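Wattage hypothesis

I measured the system's power draw with a P3 Kill A Watt meter: it draws roughly 700-800 watts (this includes the monitor and everything else plugged into the surge protector). I live in an old US building that was converted into apartments, so I am not 100% sure of the circuit's capacity, but assuming everything is up to code (which I am not convinced it is), the circuit should handle 1800 watts. The other electronics in the room are a 10-watt lamp and a 989-watt air conditioner, which puts the total right at the 1800-watt limit. At first I was sure this had to be the culprit, but one cool night I started a job with the air conditioner unplugged, and in the morning the machine was powered off, so this hypothesis no longer explains the symptoms well.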
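The arithmetic behind this estimate can be checked with a short sketch. The load figures come from the paragraph above; reading the 1800 W limit as a 15 A breaker at 120 V is an assumption about the circuit, not something that was measured.

```python
CIRCUIT_LIMIT_W = 1800  # assumption: 15 A breaker x 120 V nominal

# Loads quoted in the text, in watts.
pc_and_peripherals_w = (700, 800)  # meter reading, low and high estimate
lamp_w = 10
air_conditioner_w = 989


def total_draw(pc_w: int, *other_loads_w: int) -> int:
    """Sum the PC draw and any other loads sharing the circuit."""
    return pc_w + sum(other_loads_w)


with_ac = tuple(
    total_draw(pc, lamp_w, air_conditioner_w) for pc in pc_and_peripherals_w
)
without_ac = tuple(total_draw(pc, lamp_w) for pc in pc_and_peripherals_w)

print(f"with AC:    {with_ac[0]}-{with_ac[1]} W, "
      f"headroom {CIRCUIT_LIMIT_W - with_ac[1]} W")
print(f"without AC: {without_ac[0]}-{without_ac[1]} W, "
      f"headroom {CIRCUIT_LIMIT_W - without_ac[1]} W")
```

With the air conditioner running, the high estimate leaves essentially no headroom. With it unplugged there is roughly 1 kW to spare, which is why a shutdown on that night argues against the circuit being the cause.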

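In addition, I suspected that my cheap "Quirky Pivot Power" surge protector might be at fault, so I ordered a Tripp Lite ISOBAR6Ultra in the hope that it is higher quality. It has not arrived yet, and I do not really think this is the problem.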

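Thermal hypothesis

I currently lean toward the problem being thermal, but when I search the logs I see nothing about a heat-related shutdown.

I have been monitoring temperatures with psensor and dumping the log to disk every 300 seconds (so the recorded temperatures may not include the peak that caused the shutdown).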
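One way to narrow that 300-second blind spot would be a small logger that samples more often and forces every sample to disk. The sketch below is not what psensor does. It assumes the sensors are exposed through the Linux hwmon sysfs interface under /sys/class/hwmon, and the function names and log path are made up for illustration.

```python
import glob
import os
import time


def read_temps(hwmon_root: str = "/sys/class/hwmon") -> dict:
    """Read every temp*_input file under hwmon_root, in degrees Celsius."""
    temps = {}
    pattern = os.path.join(hwmon_root, "hwmon*", "temp*_input")
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as fh:
                # hwmon reports temperatures in millidegrees Celsius.
                temps[path] = int(fh.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # sensor disappeared or returned garbage; skip it
    return temps


def append_sample(log_path: str, temps: dict, now: float) -> None:
    """Append one CSV row per sensor and force the data to disk."""
    with open(log_path, "a") as fh:
        for path, celsius in temps.items():
            fh.write(f"{now:.0f},{path},{celsius:.1f}\n")
        fh.flush()
        os.fsync(fh.fileno())  # so the last rows survive a hard power-off


def monitor(log_path: str, interval_s: float = 1.0) -> None:
    """Sample forever; stop with Ctrl-C."""
    while True:
        append_sample(log_path, read_temps(), time.time())
        time.sleep(interval_s)
```

Calling monitor("temps.csv") in a tmux pane before starting a training run would leave a per-second record whose last rows show the temperatures just before the machine lost power.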

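I plotted the temperatures recorded before the most recent shutdown (which happened around 3:00 AM on 2021-08-18):

[Image: plot of the recorded temperatures]

Note that, to guard against this kind of problem, I deliberately did not use the RTX 3090 here, yet running just the 1080ti still seems to trigger whatever is causing the shutdown.

The highest CPU temperature recorded here is 93C, but I have seen it logged as high as nearly 99C, and "sensors" reports a critical temperature of 100C. So, given that the CPU temperature was rising before the shutdown and that samples are logged only every 5 minutes, it is quite possible that the system hit the critical temperature and shut down before the next sample was written.

But I am still not satisfied with this explanation. First, running journalctl -g 'temperature|critical' -b -2, as suggested in https://unix.stackexchange.com/questions/502226/how-do-you-find-out-if-a-linux-machine-overheated-before-the-previous-boot-and-w, gives no indication that the system logged a temperature problem.

The output of journalctl -b -1: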

Aug 18 02:46:57 toothbrush smartd[1857]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 84 to 71
Aug 18 02:46:57 toothbrush smartd[1857]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 84 to 72
Aug 18 02:46:57 toothbrush smartd[1857]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 20 to 19
Aug 18 02:46:57 toothbrush smartd[1857]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 83 to 74
Aug 18 02:46:57 toothbrush smartd[1857]: Device: /dev/sdc [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 20 to 23
Aug 18 02:46:57 toothbrush smartd[1857]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 67 to 77
Aug 18 02:46:57 toothbrush smartd[1857]: Device: /dev/sdd [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 27 to 23
Aug 18 02:47:00 toothbrush boinc[3170]: 18-Aug-2021 02:47:00 [---] Suspending computation - CPU is busy
Aug 18 02:47:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r344465819 t8087795, 64bit:1), syncing.
Aug 18 02:47:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r398433847 t7744494, 64bit:1), syncing.
Aug 18 02:47:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r349747452 t8371229, 64bit:1), syncing.
Aug 18 02:48:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r390229100 t7980049, 64bit:1), syncing.
Aug 18 02:48:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r352409333 t7226854, 64bit:1), syncing.
Aug 18 02:48:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r508920330 t10538384, 64bit:1), syncing.
Aug 18 02:48:50 toothbrush boinc[3170]: 18-Aug-2021 02:48:50 [---] Resuming computation
Aug 18 02:49:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r261199946 t4894398, 64bit:1), syncing.
Aug 18 02:49:01 toothbrush boinc[3170]: 18-Aug-2021 02:49:01 [---] Suspending computation - CPU is busy
Aug 18 02:49:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r251680223 t6509690, 64bit:1), syncing.
Aug 18 02:49:21 toothbrush boinc[3170]: 18-Aug-2021 02:49:21 [---] Resuming computation
Aug 18 02:49:31 toothbrush boinc[3170]: 18-Aug-2021 02:49:31 [---] Suspending computation - CPU is busy
Aug 18 02:49:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r346528983 t5840449, 64bit:1), syncing.
Aug 18 02:50:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r560923145 t12173867, 64bit:1), syncing.
Aug 18 02:50:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r567474866 t11497897, 64bit:1), syncing.
Aug 18 02:50:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r519892497 t10585216, 64bit:1), syncing.
Aug 18 02:51:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r553040012 t11503711, 64bit:1), syncing.
Aug 18 02:51:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r489967052 t11999909, 64bit:1), syncing.
Aug 18 02:51:31 toothbrush boinc[3170]: 18-Aug-2021 02:51:31 [---] Resuming computation
Aug 18 02:51:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r279491189 t4690385, 64bit:1), syncing.
Aug 18 02:51:41 toothbrush boinc[3170]: 18-Aug-2021 02:51:41 [---] Suspending computation - CPU is busy
Aug 18 02:52:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r233899151 t4847426, 64bit:1), syncing.
Aug 18 02:52:01 toothbrush boinc[3170]: 18-Aug-2021 02:52:01 [---] Resuming computation
Aug 18 02:52:11 toothbrush boinc[3170]: 18-Aug-2021 02:52:11 [---] Suspending computation - CPU is busy
Aug 18 02:52:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r268957755 t5537306, 64bit:1), syncing.
Aug 18 02:52:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r333913668 t7187733, 64bit:1), syncing.
Aug 18 02:53:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r450294755 t8957939, 64bit:1), syncing.
Aug 18 02:53:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r264028304 t5582071, 64bit:1), syncing.
Aug 18 02:53:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r379501357 t8308167, 64bit:1), syncing.
Aug 18 02:54:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r364408338 t9670592, 64bit:1), syncing.
Aug 18 02:54:12 toothbrush boinc[3170]: 18-Aug-2021 02:54:12 [---] Resuming computation
Aug 18 02:54:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r410359086 t6437227, 64bit:1), syncing.
Aug 18 02:54:22 toothbrush boinc[3170]: 18-Aug-2021 02:54:22 [---] Suspending computation - CPU is busy
Aug 18 02:54:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r266936223 t4903133, 64bit:1), syncing.
Aug 18 02:54:42 toothbrush boinc[3170]: 18-Aug-2021 02:54:42 [---] Resuming computation
Aug 18 02:54:52 toothbrush boinc[3170]: 18-Aug-2021 02:54:52 [---] Suspending computation - CPU is busy
Aug 18 02:55:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r258961514 t5642594, 64bit:1), syncing.
Aug 18 02:55:01 toothbrush CRON[313877]: pam_unix(cron:session): session opened for user root by (uid=0)
Aug 18 02:55:01 toothbrush CRON[313878]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 18 02:55:01 toothbrush CRON[313877]: pam_unix(cron:session): session closed for user root
Aug 18 02:55:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r485119089 t10059003, 64bit:1), syncing.
Aug 18 02:55:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r482961424 t9750792, 64bit:1), syncing.
Aug 18 02:56:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r334697691 t7035018, 64bit:1), syncing.
Aug 18 02:56:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r449591310 t9490996, 64bit:1), syncing.
Aug 18 02:56:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r415820654 t10568703, 64bit:1), syncing.
Aug 18 02:56:43 toothbrush boinc[3170]: 18-Aug-2021 02:56:43 [---] Resuming computation
Aug 18 02:56:53 toothbrush boinc[3170]: 18-Aug-2021 02:56:53 [---] Suspending computation - CPU is busy
Aug 18 02:57:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r326675026 t4890602, 64bit:1), syncing.
Aug 18 02:57:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r461180383 t10357149, 64bit:1), syncing.
Aug 18 02:57:23 toothbrush boinc[3170]: 18-Aug-2021 02:57:23 [---] Resuming computation
Aug 18 02:57:33 toothbrush boinc[3170]: 18-Aug-2021 02:57:33 [---] Suspending computation - CPU is busy
Aug 18 02:57:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r311496530 t5584467, 64bit:1), syncing.
Aug 18 02:58:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r345401175 t6977056, 64bit:1), syncing.
Aug 18 02:58:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r413257951 t8468887, 64bit:1), syncing.
Aug 18 02:58:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r430901546 t9350168, 64bit:1), syncing.
Aug 18 02:59:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r316409469 t6532987, 64bit:1), syncing.
Aug 18 02:59:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r496502797 t11915940, 64bit:1), syncing.
Aug 18 02:59:24 toothbrush boinc[3170]: 18-Aug-2021 02:59:24 [---] Resuming computation

The output of cat /var/log/syslog around the shutdown is:

Aug 18 02:52:11 toothbrush boinc[3170]: 18-Aug-2021 02:52:11 [---] Suspending computation - CPU is busy
Aug 18 02:52:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r268957755 t5537306, 64bit:1), syncing.
Aug 18 02:52:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r333913668 t7187733, 64bit:1), syncing.
Aug 18 02:53:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r450294755 t8957939, 64bit:1), syncing.
Aug 18 02:53:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r264028304 t5582071, 64bit:1), syncing.
Aug 18 02:53:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r379501357 t8308167, 64bit:1), syncing.
Aug 18 02:54:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r364408338 t9670592, 64bit:1), syncing.
Aug 18 02:54:12 toothbrush boinc[3170]: 18-Aug-2021 02:54:12 [---] Resuming computation
Aug 18 02:54:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r410359086 t6437227, 64bit:1), syncing.
Aug 18 02:54:22 toothbrush boinc[3170]: 18-Aug-2021 02:54:22 [---] Suspending computation - CPU is busy
Aug 18 02:54:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r266936223 t4903133, 64bit:1), syncing.
Aug 18 02:54:42 toothbrush boinc[3170]: 18-Aug-2021 02:54:42 [---] Resuming computation
Aug 18 02:54:52 toothbrush boinc[3170]: 18-Aug-2021 02:54:52 [---] Suspending computation - CPU is busy
Aug 18 02:55:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r258961514 t5642594, 64bit:1), syncing.
Aug 18 02:55:01 toothbrush CRON[313878]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 18 02:55:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r485119089 t10059003, 64bit:1), syncing.
Aug 18 02:55:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r482961424 t9750792, 64bit:1), syncing.
Aug 18 02:56:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r334697691 t7035018, 64bit:1), syncing.
Aug 18 02:56:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r449591310 t9490996, 64bit:1), syncing.
Aug 18 02:56:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r415820654 t10568703, 64bit:1), syncing.
Aug 18 02:56:43 toothbrush boinc[3170]: 18-Aug-2021 02:56:43 [---] Resuming computation
Aug 18 02:56:53 toothbrush boinc[3170]: 18-Aug-2021 02:56:53 [---] Suspending computation - CPU is busy
Aug 18 02:57:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r326675026 t4890602, 64bit:1), syncing.
Aug 18 02:57:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r461180383 t10357149, 64bit:1), syncing.
Aug 18 02:57:23 toothbrush boinc[3170]: 18-Aug-2021 02:57:23 [---] Resuming computation
Aug 18 02:57:33 toothbrush boinc[3170]: 18-Aug-2021 02:57:33 [---] Suspending computation - CPU is busy
Aug 18 02:57:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r311496530 t5584467, 64bit:1), syncing.
Aug 18 02:58:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r345401175 t6977056, 64bit:1), syncing.
Aug 18 02:58:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r413257951 t8468887, 64bit:1), syncing.
Aug 18 02:58:40 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r430901546 t9350168, 64bit:1), syncing.
Aug 18 02:59:00 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r316409469 t6532987, 64bit:1), syncing.
Aug 18 02:59:20 toothbrush vnstatd[1944]: Info: Traffic rate for "tun0" higher than set maximum 10 Mbit (20s->27262976, r496502797 t11915940, 64bit:1), syncing.
Aug 18 02:59:24 toothbrush boinc[3170]: 18-Aug-2021 02:59:24 [---] Resuming computation
Aug 18 09:25:52 toothbrush systemd-modules-load[472]: Inserted module 'lp'
Aug 18 09:25:52 toothbrush systemd-modules-load[472]: Inserted module 'ppdev'
Aug 18 09:25:52 toothbrush systemd-modules-load[472]: Inserted module 'parport_pc'
Aug 18 09:25:52 toothbrush systemd-modules-load[472]: Inserted module 'msr'
Aug 18 09:25:52 toothbrush kernel: [    0.000000] microcode: microcode updated early to revision 0x40, date = 2021-04-11
Aug 18 09:25:52 toothbrush lvm[461]:   2 logical volume(s) in volume group "vgubuntu" monitored
Aug 18 09:25:52 toothbrush kernel: [    0.000000] Linux version 5.11.0-25-generic (buildd@lgw01-amd64-044) (gcc (Ubuntu 10.3.0-1ubuntu1) 10.3.0, GNU ld (GNU Binutils for Ubuntu) 2.36.1) #27-Ubuntu SMP Fri Jul 9 23:06:29 UTC 2021 (Ubuntu 5.11.0-25.27-generic 5.11.22)
Aug 18 09:25:52 toothbrush kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.11.0-25-generic root=/dev/mapper/vgubuntu-root ro quiet splash vt.handoff=7
Aug 18 09:25:52 toothbrush kernel: [    0.000000] KERNEL supported cpus:
Aug 18 09:25:52 toothbrush systemd[1]: Starting Flush Journal to Persistent Storage...
Aug 18 09:25:52 toothbrush kernel: [    0.000000]   Intel GenuineIntel
Aug 18 09:25:52 toothbrush kernel: [    0.000000]   AMD AuthenticAMD
Aug 18 09:25:52 toothbrush kernel: [    0.000000]   Hygon HygonGenuine
Aug 18 09:25:52 toothbrush kernel: [    0.000000]   Centaur CentaurHauls
Aug 18 09:25:52 toothbrush kernel: [    0.000000]   zhaoxin   Shanghai  
Aug 18 09:25:52 toothbrush systemd[1]: Finished Load Kernel Modules.

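The interesting thing here is that the last log entry before the shutdown is Aug 18 02:59:24 toothbrush boinc[3170]: 18-Aug-2021 02:59:24 [---] Resuming computation, which indicates that BOINC was about to start running a CPU-intensive process.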
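Finding that last entry by eye gets tedious across many shutdowns. A rough sketch of automating it is to look for the largest gap between consecutive syslog timestamps. The helper names are hypothetical, and the code assumes the classic "Mon DD HH:MM:SS" prefix, an English locale for month names, and a year supplied by the caller, since this syslog format omits it.

```python
from datetime import datetime


def parse_syslog_time(line: str, year: int) -> datetime:
    """Parse the leading 'Aug 18 02:59:24' stamp of a syslog line."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def largest_gap(lines, year: int):
    """Return (gap, last_line_before, first_line_after), or None."""
    best = None
    prev_line, prev_time = None, None
    for line in lines:
        try:
            current = parse_syslog_time(line, year)
        except ValueError:
            continue  # blank or continuation line without a timestamp
        if prev_time is not None:
            gap = current - prev_time
            if best is None or gap > best[0]:
                best = (gap, prev_line, line)
        prev_line, prev_time = line, current
    return best
```

Run over the syslog excerpt above with year=2021, the largest gap falls between the 02:59:24 BOINC entry and the first 09:25:52 boot message, which brackets the time the machine was off.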

Running cat /var/log/kern.log and looking at the nearby timestamps is even less informative:

Aug 17 23:47:21 toothbrush kernel: [100858.782842] pcieport 0000:00:01.0: AER: Corrected error received: 0000:00:01.0
Aug 17 23:47:21 toothbrush kernel: [100858.782850] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Aug 17 23:47:21 toothbrush kernel: [100858.782851] pcieport 0000:00:01.0:   device [8086:4c01] error status/mask=00000001/00002000
Aug 17 23:47:21 toothbrush kernel: [100858.782852] pcieport 0000:00:01.0:    [ 0] RxErr                  (First)
Aug 18 00:00:01 toothbrush kernel: [101618.605604] audit: type=1400 audit(1629259201.304:83): apparmor="DENIED" operation="capable" profile="/usr/sbin/cupsd" pid=3302495 comm="cupsd" capability=12  capname="net_admin"
Aug 18 00:00:05 toothbrush kernel: [101622.407042] audit: type=1400 audit(1629259205.104:84): apparmor="DENIED" operation="capable" profile="/usr/sbin/cups-browsed" pid=3302502 comm="cups-browsed" capability=23  capname="sys_nice"
Aug 18 09:25:52 toothbrush kernel: [    0.000000] microcode: microcode updated early to revision 0x40, date = 2021-04-11
Aug 18 09:25:52 toothbrush kernel: [    0.000000] Linux version 5.11.0-25-generic (buildd@lgw01-amd64-044) (gcc (Ubuntu 10.3.0-1ubuntu1) 10.3.0, GNU ld (GNU Binutils for Ubuntu) 2.36.1) #27-Ubuntu SMP Fri Jul 9 23:06:29 UTC 2021 (Ubuntu 5.11.0-25.27-generic 5.11.22)
Aug 18 09:25:52 toothbrush kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.11.0-25-generic root=/dev/mapper/vgubuntu-root ro quiet splash vt.handoff=7
Aug 18 09:25:52 toothbrush kernel: [    0.000000] KERNEL supported cpus:

Running last -x | head | tac:

joncrall :0           :0               Mon Aug 16 19:46 - crash (1+13:38)
runlevel (to lvl 5)   5.11.0-25-generi Mon Aug 16 19:47 - 09:26 (1+13:39)
joncrall pts/3        tmux(11727).%0   Mon Aug 16 20:41 - 21:49  (01:07)
joncrall pts/23       tmux(3215922).%0 Tue Aug 17 23:46 - crash  (09:39)
reboot   system boot  5.11.0-25-generi Wed Aug 18 09:25   still running
joncrall :0           :0               Wed Aug 18 09:25   still logged in
runlevel (to lvl 5)   5.11.0-25-generi Wed Aug 18 09:26   still running

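I have forgotten exactly what "crash" versus "still running" means in the output of last reboot, so I am not sure how to interpret this, or whether there is any diagnostic information here.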
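For what it is worth, in the util-linux implementation of last, a logout column of "crash" appears to mark a session that had no logout or shutdown record before the next boot, while "still running" marks the current boot. Under that reading, the sessions cut short by a power loss can be picked out with a small filter. The function name is made up, and the regular expression only assumes the column layout shown in the output above.

```python
import re

# Matches a logout column of "crash", e.g. "... 19:46 - crash (1+13:38)".
CRASH_RE = re.compile(r"\s-\s+crash\s+\(")


def crashed_sessions(last_output: str) -> list:
    """Return the lines of `last -x` output whose session ended in a crash."""
    return [line for line in last_output.splitlines() if CRASH_RE.search(line)]
```

Applied to the listing above, this keeps the two joncrall sessions that were open when the machine lost power and drops the runlevel and reboot records.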

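So if it is heat, I do not think the system is logging it.

My questions + summary:

So my machine shuts down, and I am not sure whether it is a thermal problem, a power problem, or something else. To mitigate thermal problems I installed 4 extra fans in my case, with 2 intake fans at the front bottom, 1 intake fan at the bottom front, 2 exhaust fans at the top rear, and 1 exhaust fan at the rear. The NH-D15 has two fans mounted on it (I double-checked their orientation).

  • Are there other logs I can check to debug a thermal problem?

  • Am I being foolish for using air cooling? Could this simply be a temperature spike that an AIO CPU water cooler would mitigate?

  • Are there other hypotheses I have not considered?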

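Update 2021-10-01

Happy October. My machine is still driving me crazy, but at least I have a few angles from which to bisect and try to pin down the problem.

I reconfigured the hardware to determine whether the 3090 is part of the problem, and I do not think it is.

I removed the 3090 entirely, so the 1080ti is now the only graphics card in the machine. I did not change the PCIE slot the 1080ti is in. Previously the 3090 was in slot 1/3 (the one closest to the CPU) and the 1080ti was in slot 3/3 (the farthest one). I simply removed the 3090 and left the 1080ti in slot 3. I hooked up a DVI cable, booted, and ran the PyTorch training code. I started training at 10:18 PM on 2021-09-30. It was still running when I went to bed, but when I woke up the machine was off. According to the logs it powered off around 2:14 AM on 2021-10-01, so it ran for almost 4 hours before hitting the problem.

So the problem persists even without the 3090 (the overpriced GPU is not the issue). Using the 3090 does seem to trigger the problem faster, but it is not the root cause.

I wonder whether I have stumbled onto an exploit involving my hardware and the type of training I am doing. Hopefully I can find an MWE so I can point to the specific instructions that cause this (recall that running standard ConvNet training with stock torch / tensorflow scripts does not trigger the problem; the code I am running now trains a transformer network with pytorch-lightning).

Before doing that, I will try a few more hardware configurations while I have the machine open.

Later: reproduced the error with the 1080ti in slot 0. I think the next test is to swap the power supply. That will wreck my cable management, but it should either rule out the PSU or focus attention on it.

Later: it is not the power supply. I swapped in the 1000W PSU and started the experiment at 7:43. It shut down at 7:57. So is it a torch exploit, the CPU, the motherboard, or something else? The initial RAM was faulty, but that has been replaced. I will rerun.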

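Answer 1

The potential solution I found on 2021-10-03 has fixed the problem! I have now been running for 17 days without any issues.

The problem was a BIOS overclocking setting on my ASUS ROG STRIX Z590-E GAMING WIFI ATX LGA1200 motherboard.

[Image: BIOS setting]

The BIOS setting ASUS MultiCore Enhancement was originally set to "Auto"; setting it to "Disabled" fixed my problem.

My guess is that Ai Tweaker is tuned for gaming rather than for scientific workloads.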
