Unable to run a TensorFlow model with CUDA on Ubuntu 20.04


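For the past few days I have been trying to install CUDA so I can train my TensorFlow CNN on the GPU. This is what is currently installed on my machine (Ubuntu 20.04 LTS, RTX 3060):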

tensorflow-gpu 2.4

Python 3.8.10

cuDNN 8.0

CUDA 11.0

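nvidia-driver-495

The driver was installed together with CUDA 11.0.

When I fit a model, I can see that all of my GPU memory gets allocated, but the verbose output stays stuck at Epoch 1/50 and never gets any further.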

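I tried downgrading the driver to nvidia-driver-470, since 495 has not been officially released yet. That broke everything: my GPU is no longer allocated when fitting, nvidia-smi no longer works, and importing tensorflow now returns: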

Could not load dynamic library 'libcudart.so.11.0'; dlerror:

which was not the case before.

Does anyone know where this problem might come from?

Thanks

Edit 1:

After rebooting, importing TensorFlow returns:

tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/cuda/include:/usr/lib/cuda/lib64:
2021-11-02 06:24:40.852786: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

The directories /usr/lib/cuda/include and /usr/lib/cuda/lib64 do exist.
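The directories existing does not show that libcudart.so.11.0 is actually inside them. A minimal sketch of how that could be checked, assuming the library is looked up by its exact file name; the helper find_library_on_path is a hypothetical name, not part of TensorFlow or CUDA:

```python
import os
from pathlib import Path


def find_library_on_path(lib_name, search_path=None):
    """Return the entries of a colon-separated search path that contain lib_name.

    Hypothetical helper for illustration; defaults to LD_LIBRARY_PATH.
    """
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for entry in search_path.split(":"):
        if not entry:
            continue  # skip empty entries, e.g. from a trailing colon
        if (Path(entry) / lib_name).exists():
            hits.append(entry)
    return hits


if __name__ == "__main__":
    # An empty list means no directory on LD_LIBRARY_PATH holds the file.
    print(find_library_on_path("libcudart.so.11.0"))
```

An empty result here would be consistent with the "cannot open shared object file" message above, although the dynamic loader also searches other locations (such as the ldconfig cache) that this sketch does not cover.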

Edit 2:

After reinstalling CUDA by following this link: https://askubuntu.com/a/1288405/231142

Importing TensorFlow now works and returns no errors. However, this code:

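# Assumed import; the original post does not show it
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau, TensorBoard
# Stop after 10 epochs without improvement and restore the best weights
EarlyStop = EarlyStopping(patience=10, restore_best_weights=True)
# Halve the learning rate when val_accuracy stops improving
Reduce_LR = ReduceLROnPlateau(monitor='val_accuracy', verbose=2, factor=0.5, min_lr=0.00001)
# Save the model only when val_loss improves
model_check = ModelCheckpoint('model.hdf5', monitor='val_loss', verbose=1, save_best_only=True)
tensorbord = TensorBoard(log_dir='logs')  # write TensorBoard logs to ./logs
callback = [EarlyStop, Reduce_LR, model_check, tensorbord]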

returns:

2021-11-02 20:09:55.607299: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2021-11-02 20:09:55.607335: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
2021-11-02 20:09:55.608325: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1614] Profiler found 1 GPUs
2021-11-02 20:09:55.609026: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcupti.so.11.2'; dlerror: libcupti.so.11.2: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.5/lib64:/usr/lib/cuda/include:/usr/lib/cuda/lib64:/usr/local/cuda-11.5/lib64
2021-11-02 20:09:55.609320: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcupti.so'; dlerror: libcupti.so: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.5/lib64:/usr/lib/cuda/include:/usr/lib/cuda/lib64:/usr/local/cuda-11.5/lib64
2021-11-02 20:09:55.609372: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1666] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.
2021-11-02 20:09:55.609476: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.
2021-11-02 20:09:55.609527: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1757] function cupti_interface_->Finalize()failed with error CUPTI could not be loaded or symbol could not be found.

Model fitting starts and uses all of my GPU and CPU, but it still runs very slowly and returns:

2021-11-02 20:09:55.832301: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 428802048 exceeds 10% of free system memory.
2021-11-02 20:09:56.269844: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 571736064 exceeds 10% of free system memory.
2021-11-02 20:09:56.669900: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 428802048 exceeds 10% of free system memory.
2021-11-02 20:09:56.821919: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 571736064 exceeds 10% of free system memory.
2021-11-02 20:09:57.065544: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
Epoch 1/20
2021-11-02 20:09:59.868007: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8204
  1/137 [..............................] - ETA: 1:15:21 - loss: 0.7485 - accuracy: 0.3871
2021-11-02 20:10:30.404084: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2021-11-02 20:10:30.404114: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
2021-11-02 20:10:30.404277: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1666] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.

The problem may lie with the library libcupti.so.11.2, but I have not tracked it down yet.
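One way to narrow this down outside of TensorFlow is to ask the dynamic loader for the library directly. A minimal sketch using Python's standard ctypes module; the helper try_load is a hypothetical name, and the two library names are taken from the log above:

```python
import ctypes


def try_load(lib_name):
    """Try to load a shared library; return (True, None) or (False, error text)."""
    try:
        ctypes.CDLL(lib_name)
        return True, None
    except OSError as exc:
        # Raised when the loader cannot find or open the library.
        return False, str(exc)


if __name__ == "__main__":
    for name in ("libcupti.so.11.2", "libcupti.so"):
        ok, err = try_load(name)
        print(name, "loaded" if ok else f"failed: {err}")
```

If both names fail here as well, the issue is with the library search path rather than with TensorFlow itself.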
