TensorFlow crash on Ubuntu 22.04: EAGAIN in pthread_create

2024-6-11

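I'm currently struggling with a TensorFlow training problem on Ubuntu 22.04. After calling model.fit(), but before the actual training starts, roughly 30% of training runs fail with the following error message: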

2023-12-19 08:55:15.324930: I external/local_tsl/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-12-19 08:55:17.453482: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8907
2023-12-19 08:55:17.512133: I external/local_tsl/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-12-19 08:55:38.981980: I external/local_xla/xla/service/service.cc:168] XLA service 0x7fa71c003840 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-12-19 08:55:38.982036: I external/local_xla/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA RTX A6000, Compute Capability 8.6
2023-12-19 08:55:38.982045: I external/local_xla/xla/service/service.cc:176]   StreamExecutor device (1): NVIDIA RTX A6000, Compute Capability 8.6
2023-12-19 08:55:39.137561: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1702972539.585702  137974 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
2023-12-19 08:55:40.604670: F external/local_tsl/tsl/platform/default/env.cc:74] Check failed: ret == 0 (11 vs. 0)Thread tf_ creation via pthread_create() failed.

Linux version (lsb_release -a):

Distributor ID: Ubuntu
Description:    Ubuntu 22.04.3 LTS
Release:        22.04
Codename:       jammy

uname -a:

Linux AI1 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

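Once training actually starts, everything seems fine and there are no further crashes; the problem occurs only at the start of training. Unfortunately, I can't reproduce the issue in a minimal example, and I can't share the actual code.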

The machine should be powerful enough:

Output of nvidia-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               On  | 00000000:C1:00.0 Off |                  Off |
| 30%   43C    P8              17W / 300W |      3MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               On  | 00000000:E1:00.0 Off |                  Off |
| 30%   28C    P8              16W / 300W |      3MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

lscpu:

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  256
  On-line CPU(s) list:   0-255
Vendor ID:               AuthenticAMD
  Model name:            AMD EPYC 7713 64-Core Processor
    CPU family:          25
    Model:               1
    Thread(s) per core:  2
    Core(s) per socket:  64
    Socket(s):           2
    Stepping:            1
    Frequency boost:     enabled
    CPU max MHz:         3720.7029
    CPU min MHz:         1500.0000
    BogoMIPS:            4000.14
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca
Virtualization features:
  Virtualization:        AMD-V
Caches (sum of all):
  L1d:                   4 MiB (128 instances)
  L1i:                   4 MiB (128 instances)
  L2:                    64 MiB (128 instances)
  L3:                    512 MiB (16 instances)
NUMA:
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-63,128-191
  NUMA node1 CPU(s):     64-127,192-255
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec rstack overflow:  Mitigation; safe RET
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

free:

                total        used        free      shared  buff/cache   available                                                                                                                                                                                              
Mem:       263896372     4645244   256639004       60716     2612124   257540276                                                                                                                                                                                              
Swap:      268435452           0   268435452

TensorFlow version: 2.15, CUDA: 12.2

Output of ulimit -a:

core file size              (blocks, -c) 0                                                                                                                                                                                                                                   
data seg size               (kbytes, -d) unlimited                                                                                                                                                                                                                           
scheduling priority                 (-e) 0                                                                                                                                                                                                                                   
file size                   (blocks, -f) unlimited                                                                                                                                                                                                                           
pending signals                     (-i) 1030047                                                                                                                                                                                                                              
max locked memory           (kbytes, -l) 32987044                                                                                                                                                                                                                             
max memory size             (kbytes, -m) unlimited                                                                                                                                                                                                                           
open files                          (-n) 1024                                                                                                                                                                                                                                
pipe size                (512 bytes, -p) 8                                                                                                                                                                                                                                    
POSIX message queues         (bytes, -q) 819200                                                                                                                                                                                                                               
real-time priority                  (-r) 0                                                                                                                                                                                                                                   
stack size                  (kbytes, -s) 1024                                                                                                                                                                                                                                 
cpu time                   (seconds, -t) unlimited                                                                                                                                                                                                                            
max user processes                  (-u) 1030047                                                                                                                                                                                                                              
virtual memory              (kbytes, -v) unlimited                                                                                                                                                                                                                           
file locks                          (-x) unlimited
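
The values above are the ones the interactive shell reports; a process launched some other way (a service manager, a job scheduler, a container) may see different limits. The following is a minimal sketch, assuming Linux and Python's standard `resource` module, that prints the same limits from inside the training process. The `LIMITS` table and the names `fmt` and `read_process_limits` are illustrative and not part of the original post.

```python
import resource

# Limits that plausibly matter for thread creation, keyed by the label
# that `ulimit -a` prints for them (illustrative selection).
LIMITS = {
    "max user processes (-u)": resource.RLIMIT_NPROC,
    "stack size (-s)": resource.RLIMIT_STACK,
    "open files (-n)": resource.RLIMIT_NOFILE,
    "max locked memory (-l)": resource.RLIMIT_MEMLOCK,
}


def fmt(value):
    """Render a raw rlimit value; RLIM_INFINITY means no limit."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def read_process_limits():
    """Return {label: (soft, hard)} as seen by the current process."""
    return {label: resource.getrlimit(res) for label, res in LIMITS.items()}


if __name__ == "__main__":
    # Note: getrlimit reports stack and locked-memory sizes in bytes,
    # whereas `ulimit` shows them in kbytes.
    for label, (soft, hard) in read_process_limits().items():
        print(f"{label:26s} soft={fmt(soft)} hard={fmt(hard)}")
```

Calling `read_process_limits()` just before `model.fit()` would show whether the training process runs under the same limits as the shell.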

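The same code runs without problems on a less powerful machine, and the top/free output just before the crash shows nothing unusual; neither does dmesg. With earlier TensorFlow versions (<= 2.10), the affected machine had no problems either.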

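I also tried reducing the load by lowering the batch size, using only a single GPU, and reducing the parallelism of the dataset processing, but nothing changed significantly.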
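
One further knob in the same spirit, not among the attempts listed above, is the size of TensorFlow's own thread pools, which by default scale with the number of logical CPUs (256 on this machine). The sketch below assumes the documented `tf.config.threading` setters; the helper names and the default cap of 16 are hypothetical values chosen for illustration, and whether this affects the crash is untested.

```python
def choose_thread_caps(logical_cpus, cap=16):
    """Pick (intra_op, inter_op) pool sizes.

    `cap` is a hypothetical upper bound for the intra-op pool; the
    inter-op pool is kept small (at most 2) in this sketch.
    """
    intra = max(1, min(logical_cpus, cap))
    inter = max(1, min(logical_cpus, 2))
    return intra, inter


def apply_thread_caps(intra, inter):
    """Apply the caps. Must run before TensorFlow executes any op,
    otherwise the setters raise RuntimeError."""
    # Imported lazily so choose_thread_caps() is usable without TensorFlow.
    import tensorflow as tf

    tf.config.threading.set_intra_op_parallelism_threads(intra)
    tf.config.threading.set_inter_op_parallelism_threads(inter)
```

A call such as `apply_thread_caps(*choose_thread_caps(os.cpu_count() or 1))` would go at the very top of the training script, before the model or dataset is built.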

I'm out of ideas at this point and would appreciate any help. Is there a way to log the exact reason why the EAGAIN is raised?
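
pthread_create() reports only the error number, so the specific cause generally has to be inferred by comparing the process's thread count against the limits that can produce EAGAIN. The pthread_create(3) man page names RLIMIT_NPROC, /proc/sys/kernel/threads-max and /proc/sys/kernel/pid_max; the cgroup `pids.max` files and `vm.max_map_count` checked below are additional candidates assumed here, not confirmed causes. A minimal sketch, assuming Linux with cgroup v2 mounted at /sys/fs/cgroup; all function names are illustrative.

```python
import threading
from pathlib import Path

KERNEL_LIMITS = {
    "threads-max": "/proc/sys/kernel/threads-max",
    "pid_max": "/proc/sys/kernel/pid_max",
    "max_map_count": "/proc/sys/vm/max_map_count",
}
CGROUP_ROOT = Path("/sys/fs/cgroup")  # assumption: cgroup v2 mount point


def read_int(path):
    """Return the integer stored in `path`, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def thread_count(pid="self"):
    """Number of kernel threads in a process, from /proc/<pid>/status."""
    try:
        for line in Path(f"/proc/{pid}/status").read_text().splitlines():
            if line.startswith("Threads:"):
                return int(line.split()[1])
    except OSError:
        pass
    # Fallback counts Python-level threads only, not native TF threads.
    return threading.active_count()


def cgroup_pids_limits():
    """List (path, value) for each pids.max on this process's cgroup path."""
    try:
        lines = Path("/proc/self/cgroup").read_text().splitlines()
    except OSError:
        return []
    rel = next((l.split("::", 1)[1] for l in lines if l.startswith("0::")), None)
    if rel is None:
        return []
    found = []
    node = CGROUP_ROOT / rel.lstrip("/")
    while True:
        limit_file = node / "pids.max"
        try:
            found.append((str(limit_file), limit_file.read_text().strip()))
        except OSError:
            pass  # no pids.max at this level (e.g. the root cgroup)
        if node == CGROUP_ROOT:
            break
        node = node.parent
    return found


def snapshot():
    """Collect kernel-wide limits plus the current thread count."""
    info = {name: read_int(path) for name, path in KERNEL_LIMITS.items()}
    info["threads_in_process"] = thread_count()
    return info


if __name__ == "__main__":
    for name, value in snapshot().items():
        print(f"{name}: {value}")
    for path, value in cgroup_pids_limits():
        print(f"{path}: {value}")
```

Logging `snapshot()` immediately before `model.fit()`, or sampling `thread_count()` from a background thread during startup, would show how close the process gets to each limit in the runs that crash.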
