Is there a way to know what is really causing a particular segmentation fault?

I recently got a new motherboard/CPU (ASUS ROG Strix Z690-A Gaming WiFi D4 / Intel i7-12700K) and initially kept the existing Ubuntu 20.04 install from my previous Z370 motherboard.
During initial use (training deep learning models) I noticed sudden segmentation faults at seemingly random intervals (sometimes a few minutes after training started, sometimes hours in), or my NVMe drive would suddenly disappear, leaving me with errors like this:
(screenshot of the error)

I figured it might be the old install, so I went ahead with a fresh Ubuntu 20.04 install, installed the latest NVIDIA driver (515) and the required stack (PyTorch 1.11, Anaconda3), then rebooted. I hit one system hang (I was browsing in the latest Firefox when everything froze and nothing responded, not even Ctrl+Alt+F-keys), so I had to hard-reset. After rebooting I resumed training, and then another segmentation fault occurred, as follows:
Note the RuntimeError: DataLoader worker (pid 2477) is killed by signal: Segmentation fault. part below:

Train: 22 [1200/5004 ( 24%)]  Loss: 3.231 (3.24)  Time-Batch: 0.110s, 2325.76/s  LR: 1.000e-01  Data: 0.003 (0.130)
Train: 22 [1400/5004 ( 28%)]  Loss: 3.278 (3.24)  Time-Batch: 0.102s, 2500.91/s  LR: 1.000e-01  Data: 0.002 (0.128)
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File "/home/hossein/pytorch-image-models/train.py", line 736, in <module>
    main()
  File "/home/hossein/pytorch-image-models/train.py", line 525, in main
    train_metrics = train_one_epoch(epoch, model, loader_train, optimizer, train_loss_fn, args,
  File "/home/hossein/pytorch-image-models/train.py", line 600, in train_one_epoch
    loss_scaler(loss, optimizer,
  File "/home/hossein/pytorch-image-models/timm/utils/cuda.py", line 43, in __call__
    self._scaler.scale(loss).backward(create_graph=create_graph)
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 2477) is killed by signal: Segmentation fault. 
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2405) of binary: /home/hossein/anaconda3/bin/python3
Traceback (most recent call last):
  File "/home/hossein/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/hossein/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-06-08_20:28:21
  host      : hossein-pc
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2405)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
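One thing that should at least reveal which native library the worker died in is the kernel's per-segfault log line; a sketch (the sample line piped in below is illustrative, not taken from my actual logs):

```shell
# Every user-space segfault is logged by the kernel with the faulting
# instruction pointer and the mapped object (library/binary) it fell in.
# The sample line here is illustrative, not a real log entry:
echo 'python3[2477]: segfault at 10 ip 00007f3a2c1b4e60 sp 00007ffd8 error 4 in libc-2.31.so[7f3a2c09d000+178000]' \
  | grep -oE 'in [^[]+'

# On the real machine you would run something like:
#   sudo dmesg --ctime | grep -iE 'segfault|general protection'
```

If the faulting object is always the same library (say, a JPEG decoder), the problem is likely software; if it moves around randomly, that points back at hardware.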

It's worth noting that before the upgrade this script ran for a week straight without any issues, so I'm 99% sure the script itself is fine.
I also ran the Aida64 stress test (CPU/FPU/Cache) successfully for 2.5 hours in one configuration, and for multiple 1-hour runs in others.
I also ran memtest successfully for 6 hours (basically all the default tests, 4 full passes). I was already having these issues before upgrading the BIOS, and after updating to the latest BIOS version I'm still facing this segmentation fault.

At this point I have no idea what the cause is. I also installed 22.04 and it froze during training as well, at which point I reinstalled Ubuntu 20.04 again. All of this happened today!

Here are the errors I got earlier, in case they matter:

  • A strange error where the path to the dataset on my NVMe drive came out corrupted. Note the error:
FileNotFoundError: [Errno 2] No such file or directory: /media/hossein/SSE/ImageNdt_DataS`t/trainjn036970p7/n03693007_276q.JPEG

The correct path is

/media/hossein/SSD/ImageNet_DataSet/train/n03697007/n03697007_2760.JPEG

This looks like a memory issue to me. I thought my NVMe might be acting up due to overheating, so after this I moved the NVMe drive (Samsung 980 1TB) to another slot lower on the motherboard (it was previously in the slot between the CPU socket and the graphics card port, which caused very high temperatures, around 65/76°C). After that, I updated the BIOS.
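The corruption pattern itself supports the memory theory: several of the garbled characters differ from the correct ones by exactly one flipped bit, which is the classic signature of unstable RAM rather than disk damage. A quick check (a sketch; the three character pairs are taken from the two paths above):

```shell
# XOR corresponding corrupted vs. correct characters from the two paths;
# a result with a single 1-bit set points at a flipped bit in memory.
# Pairs: 'D'->'E' (SSD/SSE), 'e'->'d' (ImageNet/ImageNdt), '0'->'p'
# (n03697007/n036970p7).
for pair in "D E" "e d" "0 p"; do
  set -- $pair
  a=$(printf '%d' "'$1")   # numeric code of the correct character
  b=$(printf '%d' "'$2")   # numeric code of the corrupted character
  printf '%s vs %s -> XOR 0x%02x\n' "$1" "$2" $(( a ^ b ))
done
# -> D vs E -> XOR 0x01
#    e vs d -> XOR 0x01
#    0 vs p -> XOR 0x40
```

Each XOR is a power of two, i.e. a single flipped bit per character.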

  • Connection-abort errors (the NVMe drive suddenly disappearing, pictured above):
Train: 0 [2200/5004 ( 44%)]  Loss: 6.439 (6.79)  Time-Batch: 0.120s, 2135.13/s  LR: 1.000e-01  Data: 0.006 (0.092)
Train: 0 [2400/5004 ( 48%)]  Loss: 6.337 (6.76)  Time-Batch: 0.118s, 2164.70/s  LR: 1.000e-01  Data: 0.003 (0.092)
WARNING: Skipped sample (index 1068111, file n04347754/n04347754_93404.JPEG). [Errno 103] Software caused connection abort
WARNING: Skipped sample (index 349727, file n02115641/n02115641_30352.JPEG). [Errno 103] Software caused connection abort
WARNING: Skipped sample (index 910908, file n03908714/n03908714_3517.JPEG). [Errno 107] Transport endpoint is not connected
WARNING: Skipped sample (index 894431, file n03877472/n03877472_17451.JPEG). [Errno 107] Transport endpoint is not connected
WARNING: Skipped sample (index 779988, file n03590841/n03590841_10648.JPEG). [Errno 107] Transport endpoint is not connected
WARNING: Skipped sample (index 213196, file n02089078/n02089078_8336.JPEG). [Errno 103] Software caused connection abort
WARNING: Skipped sample (index 629596, file n03000134/n03000134_6084.JPEG). [Errno 103] Software caused connection abort
WARNING: Skipped sample (index 1221601, file n07753592/n07753592_1779.JPEG). [Errno 107] Transport endpoint is not connected
WARNING: Skipped sample (index 1089611, file n04399382/n04399382_31586.JPEG). [Errno 107] Transport endpoint is not connected
Traceback (most recent call last):
 ...
  raise exception
ConnectionAbortedError: Caught ConnectionAbortedError in DataLoader worker process 10.
 ...
ConnectionAbortedError: [Errno 103] Software caused connection abort: '/media/hossein/SSD/ImageNet_DataSet/train/n02883205/n02883205_6142.JPEG'

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3721) of binary: /home/hossein/anaconda3/bin/python3
Traceback (most recent call last):
...
  • Another segmentation fault:
Train: 0 [1600/5004 ( 32%)]  Loss: 6.612 (6.85)  Time-Batch: 0.103s, 2492.08/s  LR: 1.000e-01  Data: 0.002 (0.099)
Train: 0 [1800/5004 ( 36%)]  Loss: 6.658 (6.82)  Time-Batch: 0.108s, 2376.50/s  LR: 1.000e-01  Data: 0.008 (0.099)
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File "/home/hossein/pytorch-image-models/train.py", line 736, in <module>
    main()
  File "/home/hossein/pytorch-image-models/train.py", line 525, in main
    train_metrics = train_one_epoch(epoch, model, loader_train, optimizer, train_loss_fn, args,
  File "/home/hossein/pytorch-image-models/train.py", line 600, in train_one_epoch
    loss_scaler(loss, optimizer,
  File "/home/hossein/pytorch-image-models/timm/utils/cuda.py", line 48, in __call__
    self._scaler.step(optimizer)
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py", line 338, in step
    retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py", line 284, in _maybe_opt_step
    if not sum(v.item() for v in optimizer_state["found_inf_per_device"].values()):
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py", line 284, in <genexpr>
    if not sum(v.item() for v in optimizer_state["found_inf_per_device"].values()):
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 6063) is killed by signal: Segmentation fault. 
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 5934) of binary: /home/hossein/anaconda3/bin/python3
Traceback (most recent call last):
  File "/home/hossein/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/hossein/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-06-04_18:17:11
  host      : hossein-pc
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 5934)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Is there any way for me to find out exactly which part is causing the segmentation fault?
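The most direct route I know of is to capture a core dump from the crashed worker and open it in gdb; a sketch, assuming core_pattern writes plain files (on stock Ubuntu it often pipes crashes to apport, which puts them under /var/crash instead), with the gdb paths as placeholders:

```shell
# Allow core dumps in the shell that launches training:
ulimit -c unlimited

# See where the kernel will put cores; a leading '|' means they are piped
# to a handler (apport on stock Ubuntu) rather than written as files:
cat /proc/sys/kernel/core_pattern

# After a worker segfaults, load the core against the same interpreter
# and print the native backtrace (paths here are placeholders):
#   gdb /home/hossein/anaconda3/bin/python3 core
#   (gdb) bt
```

The top frames of `bt` name the exact shared library and function that faulted; a backtrace that lands in a different library every crash again suggests hardware.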

Side note:
Under Windows, Hard Disk Sentinel reports my NVMe drive as 100% healthy. Also, since I have 32 GB of RAM, I disabled the swap file entirely. (Could that be related to the segfaults?)
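On the swap side note: a worker that runs out of memory is killed with SIGKILL by the kernel's OOM killer, not with SIGSEGV, so disabled swap shouldn't by itself produce segmentation faults; telling the two signals apart is straightforward (a sketch):

```shell
# A segmentation fault is signal 11 (SIGSEGV); an out-of-memory kill is
# signal 9 (SIGKILL) and leaves an "Out of memory: Killed process" line
# in the kernel log:
kill -l 11
kill -l 9

# To confirm on the machine (commented out; needs the real system):
#   swapon --show                          # verify swap really is off
#   sudo dmesg | grep -i 'out of memory'   # look for OOM kills
```

Since the traceback above explicitly says "killed by signal: Segmentation fault", this was SIGSEGV, not an OOM kill.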

Here are the contents of dmesg and dmesg.0, plus the Important and Hardware categories from the Ubuntu Logs application:

dmesg contents: https://pastebin.com/fctcEmnB
dmesg.0: https://pastebin.com/mmvR8hSV
Logs, Important: https://pastebin.com/NsVgsxYx
Logs, Hardware: https://pastebin.com/cYyPCgCL

Here is the output of smartctl:

(base) hossein@hossein-pc:~$ sudo smartctl -a -x /dev/nvme0n1
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.13.0-48-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 980 1TB
Serial Number:                      S649NJ0R331701H
Firmware Version:                   2B4QFXO7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      5
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Utilization:            465,389,219,840 [465 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 d311422bf4
Local Time is:                      Thu Jun  9 10:11:14 2022 +0430
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0055):     Comp DS_Mngmt Sav/Sel_Feat Timestmp
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     85 Celsius
Namespace 1 Features (0x10):        *Other*

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     5.24W       -        -    0  0  0  0        0       0
 1 +     4.49W       -        -    1  1  1  1        0       0
 2 +     2.19W       -        -    2  2  2  2        0     500
 3 -   0.0500W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     1000    9000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        38 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    62,083,877 [31.7 TB]
Data Units Written:                 7,758,068 [3.97 TB]
Host Read Commands:                 626,314,400
Host Write Commands:                90,214,169
Controller Busy Time:               1,148
Power Cycles:                       290
Power On Hours:                     1,668
Unsafe Shutdowns:                   49
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               38 Celsius
Temperature Sensor 2:               38 Celsius
Thermal Temp. 2 Transition Count:   185
Thermal Temp. 2 Total Time:         54

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged

Update 3:

I got some new error logs today; they seem to be related to ntfs-3g. (screenshots of the ntfs-3g errors)

Answer 1

OK, here's the update.
All of these problems were caused by my memory clock running at 2400 MHz (XMP was disabled!), which confirmed my suspicion that this was a memory problem.

From my research, for Intel 12th-gen CPUs/motherboards the validated speed is 3200 MHz (that is, tested and supposed to run flawlessly). Speeds as low as 2400 MHz appear to work, but not without problems under heavy load.

That's when I went ahead and activated the XMP profile, so my memory started running at its advertised speed (3000 MHz). I also overclocked it to 3200 MHz and 3600 MHz and found that the problems disappeared from 3000 MHz upward, so there's no need to push the memory above 3000 MHz!

So always activate XMP and watch your speeds. If your system runs intensive workloads where CPU/GPU/RAM/disk sit at 100% utilization around the clock, this is a must; otherwise you may not see any problems right away.
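To check what the memory is actually running at versus what the modules are rated for, dmidecode reports both fields; a sketch, with sample output piped in because the real command needs root (the 2400 MT/s value mirrors my pre-XMP setting):

```shell
# On the real machine:
#   sudo dmidecode --type memory | grep -i 'speed'
# "Speed" is the module's rated maximum; "Configured Memory Speed" is
# what the BIOS actually set. Simulated here with sample values:
printf 'Speed: 3000 MT/s\nConfigured Memory Speed: 2400 MT/s\n' \
  | grep -i 'speed'
```

If "Configured Memory Speed" is lower than the advertised speed, XMP (or its manual equivalent) hasn't been applied.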

Note that I never had any problems under Windows, even running benchmarks for long stretches and testing each component individually. It only rears its ugly head when the whole system is under heavy load!

As for why I didn't activate XMP in the first place: I assumed the motherboard and the system overall would work better sticking to the defaults ("Optimized!", as ASUS calls them), and that enabling XMP (and with it any overclocking, which by the way I'd never used) would introduce errors, since the whole platform and architecture were still new and I didn't want any trouble. Obviously I was wrong. Always set XMP first!
