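I am currently struggling with a TensorFlow training problem on Ubuntu 22.04. After calling model.fit(), but before the actual training starts, roughly 30% of training runs fail with the following error message: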
2023-12-19 08:55:15.324930: I external/local_tsl/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-12-19 08:55:17.453482: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8907
2023-12-19 08:55:17.512133: I external/local_tsl/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-12-19 08:55:38.981980: I external/local_xla/xla/service/service.cc:168] XLA service 0x7fa71c003840 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-12-19 08:55:38.982036: I external/local_xla/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA RTX A6000, Compute Capability 8.6
2023-12-19 08:55:38.982045: I external/local_xla/xla/service/service.cc:176] StreamExecutor device (1): NVIDIA RTX A6000, Compute Capability 8.6
2023-12-19 08:55:39.137561: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1702972539.585702 137974 device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
2023-12-19 08:55:40.604670: F external/local_tsl/tsl/platform/default/env.cc:74] Check failed: ret == 0 (11 vs. 0)Thread tf_ creation via pthread_create() failed.
Linux version:
lsb_release -a:
Distributor ID: Ubuntu
Description: Ubuntu 22.04.3 LTS
Release: 22.04
Codename: jammy
uname -a:
Linux AI1 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
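Once training actually starts, everything seems fine and there are no further crashes; the problem occurs only at the start of training. Unfortunately, I cannot reproduce the issue in a minimal example, and I cannot share the actual code.
The machine should be powerful enough:
Output of nvidia-smi: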
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A6000 On | 00000000:C1:00.0 Off | Off |
| 30% 43C P8 17W / 300W | 3MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A6000 On | 00000000:E1:00.0 Off | Off |
| 30% 28C P8 16W / 300W | 3MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7713 64-Core Processor
CPU family: 25
Model: 1
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
Stepping: 1
Frequency boost: enabled
CPU max MHz: 3720.7029
CPU min MHz: 1500.0000
BogoMIPS: 4000.14
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca
Virtualization features:
Virtualization: AMD-V
Caches (sum of all):
L1d: 4 MiB (128 instances)
L1i: 4 MiB (128 instances)
L2: 64 MiB (128 instances)
L3: 512 MiB (16 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-63,128-191
NUMA node1 CPU(s): 64-127,192-255
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec rstack overflow: Mitigation; safe RET
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
free:
total used free shared buff/cache available
Mem: 263896372 4645244 256639004 60716 2612124 257540276
Swap: 268435452 0 268435452
TensorFlow version: 2.15, CUDA: 12.2
Output of ulimit -a:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1030047
max locked memory (kbytes, -l) 32987044
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 1024
cpu time (seconds, -t) unlimited
max user processes (-u) 1030047
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
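The values above come from an interactive shell; a training process started from a service, an IDE, or a container may see different limits. As a rough sketch, assuming Linux and Python's standard `resource` module, the same limits can be printed from inside the training process. The `LIMITS` table, `fmt`, and `read_limits` are illustrative names, not part of TensorFlow. Note that `getrlimit` reports sizes in bytes, while `ulimit` prints kbytes.

```python
import resource

# Limits that could plausibly matter when pthread_create() fails with EAGAIN.
# Which of them (if any) is the culprit here is unknown.
LIMITS = {
    "max user processes (RLIMIT_NPROC)": resource.RLIMIT_NPROC,
    "stack size (RLIMIT_STACK)": resource.RLIMIT_STACK,
    "max locked memory (RLIMIT_MEMLOCK)": resource.RLIMIT_MEMLOCK,
    "open files (RLIMIT_NOFILE)": resource.RLIMIT_NOFILE,
    "virtual memory (RLIMIT_AS)": resource.RLIMIT_AS,
}


def fmt(value):
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def read_limits():
    """Return {label: (soft, hard)} as seen by the current process."""
    return {label: resource.getrlimit(res) for label, res in LIMITS.items()}


if __name__ == "__main__":
    for label, (soft, hard) in read_limits().items():
        print(f"{label}: soft={fmt(soft)} hard={fmt(hard)}")
```

Calling `read_limits()` right before `model.fit()` would show whether the process runs with the same limits as the shell output above.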
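The same code runs without problems on less powerful machines. The top/free output just before the crash shows nothing unusual, and neither does dmesg. With earlier TensorFlow versions (<= 2.10), the affected machine had no problems either.
I also tried reducing the load by lowering the batch size, using only a single GPU, and reducing the parallelism of the dataset processing, but none of this made a noticeable difference.
I am out of ideas and would appreciate any help. Is it possible to log the exact cause of the EAGAIN (errno 11 in the check failure above)?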
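The pthread_create(3) man page lists several possible sources of EAGAIN: the RLIMIT_NPROC soft limit, /proc/sys/kernel/threads-max, /proc/sys/kernel/pid_max, or generally insufficient resources. A failure to allocate the thread stack, or a cgroup pids limit, may surface the same way. One way to narrow this down, sketched here under the assumption of Linux with /proc mounted and cgroup v2, is to log the relevant counters from a background thread while training starts. All function names below are hypothetical helpers, not TensorFlow APIs.

```python
import threading
from pathlib import Path


def read_text(path):
    """Read a small text file, returning None if it is missing or unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def thread_count():
    """Number of threads in this process, from /proc/self/status."""
    status = read_text("/proc/self/status") or ""
    for line in status.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    return None


def cgroup_pids():
    """(pids.current, pids.max) of this process's cgroup; cgroup v2 only."""
    entry = read_text("/proc/self/cgroup") or ""
    for line in entry.splitlines():
        if line.startswith("0::"):
            base = Path("/sys/fs/cgroup") / line[3:].lstrip("/")
            return read_text(base / "pids.current"), read_text(base / "pids.max")
    return None, None


def snapshot():
    """Collect counters and limits that can make pthread_create() fail."""
    maps = read_text("/proc/self/maps")
    current, maximum = cgroup_pids()
    return {
        "threads": thread_count(),
        "threads_max": read_text("/proc/sys/kernel/threads-max"),
        "pid_max": read_text("/proc/sys/kernel/pid_max"),
        "map_count": len(maps.splitlines()) if maps is not None else None,
        "max_map_count": read_text("/proc/sys/vm/max_map_count"),
        "cgroup_pids_current": current,
        "cgroup_pids_max": maximum,
    }


def start_monitor(interval=0.5):
    """Print a snapshot every `interval` seconds until the returned event is set."""
    stop = threading.Event()

    def loop():
        while not stop.wait(interval):
            print(snapshot(), flush=True)

    threading.Thread(target=loop, daemon=True, name="limit-monitor").start()
    return stop


if __name__ == "__main__":
    print(snapshot())
```

Calling `stop = start_monitor()` before `model.fit()` and `stop.set()` afterwards would leave the last snapshot before the crash in the log, which may show which counter was close to its limit. This does not report the kernel's own reason for the failure; it only makes the candidates visible.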