I'm using an NVIDIA VM on Azure with Ubuntu 20.04. I have installed the NVIDIA driver and CUDA, but when I run my program it still reports that libraries cannot be found.
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000001:00:00.0 Off |                  Off |
| N/A   32C    P0    25W /  70W |      0MiB / 16127MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
These are the errors I see for several CUDA libraries:
Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2022-05-02 05:33:53.131224: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2022-05-02 05:33:53.131235: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory
2022-05-02 05:33:55.738515: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1592] Cannot dlopen some GPU libraries.
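One way to confirm what these warnings mean is to ask the dynamic loader directly whether it can open each library, independently of TensorFlow. The following is a minimal sketch, not a definitive diagnostic: the library names are copied from the warnings above, while the helper names `can_load` and `report` are made up for illustration.

```python
import ctypes

# Sonames copied from the TensorFlow warnings above.
LIBRARIES = ["libnvinfer.so.6", "libnvinfer_plugin.so.6", "libcudnn.so.7"]


def can_load(soname):
    """Return True if the dynamic loader can open the given shared library."""
    try:
        ctypes.CDLL(soname)
        return True
    except OSError:
        # Raised when the file is not on the loader's search path
        # (or when one of its own dependencies cannot be resolved).
        return False


def report(sonames):
    """Map each soname to whether it could be loaded (hypothetical helper)."""
    return {name: can_load(name) for name in sonames}


if __name__ == "__main__":
    for name, found in report(LIBRARIES).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If a library is reported as missing here too, the problem lies with the installation or the loader search path (for example `LD_LIBRARY_PATH`), not with TensorFlow itself.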
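I'm not sure whether this error is caused by the GPG key rotation or by something else, because I also tried installing the driver separately, but it keeps failing with an error saying the driver cannot be found.
I also tried: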
sudo apt-get -y install cuda
Reading package lists... Done
Building dependency tree
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:
The following packages have unmet dependencies:
cuda : Depends: cuda-11-6 (>= 11.6.2) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
I'm using tensorflow-gpu-2.1.3 because my program requires it.
I have also posted this on the NVIDIA forums.
Answer 1
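Some options I can suggest:
Search for the problem on Google using different keyword combinations. (Most likely the issue is related to the library path for "libcudnn.so.7".)
Does the libcudnn.so.7 named in the error dump really match the CUDA Version: 11.4 reported by nvidia-smi? You need to verify this.
With the ready-made nvidia-docker plugin, you can find tensorflow, torch, and similar containers that are ready to run. Your current nvidia-smi output suggests this should work.
Packaging and environment managers such as anaconda and miniconda can give you a more stable and complete development environment. If necessary, you can keep 3 different TF and CUDA versions side by side in separate environments.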
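To make the version comparison concrete, here is a small sketch that pulls the version numbers out of the two dumps in the question. Note that the "CUDA Version" shown by nvidia-smi is the highest CUDA version the installed driver supports, not necessarily the toolkit that is installed, and that libcudnn.so.7 refers to cuDNN major version 7. As far as I know, TensorFlow's tested build table lists CUDA 10.1 and cuDNN 7.6 for the 2.1 series, but please double-check that against the official table. The function names below are invented for illustration.

```python
import re

# Header line and warnings copied from the question.
SMI_HEADER = "| NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4 |"
LOG = """
Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file
Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file
Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file
"""


def driver_cuda_version(smi_text):
    """Return (major, minor) of the CUDA version in nvidia-smi output, or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def missing_libraries(log_text):
    """Return sorted, de-duplicated (library name, major version) pairs."""
    pattern = r"Could not load dynamic library '(lib\w+)\.so\.(\d+)'"
    return sorted({(name, int(major)) for name, major in re.findall(pattern, log_text)})


if __name__ == "__main__":
    print("Driver supports CUDA up to:", driver_cuda_version(SMI_HEADER))
    for name, major in missing_libraries(LOG):
        print(f"TensorFlow wants {name}, major version {major}")
```

If the major versions TensorFlow asks for are not the ones installed on the machine, a container or a separate conda environment with matching versions is probably the least painful fix.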