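Introduction
Hi, I recently got access to a remote machine with a GPU that I can use to speed up my research. However, I have had a lot of trouble setting it up. I went through the NVIDIA Linux installation guide several times, but I still cannot use CUDA with PyTorch (print(torch.cuda.is_available()) returns False).
Details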
lspci | grep -i nvidia
00:0a.0 VGA compatible controller: NVIDIA Corporation GP102GL [Tesla P40] (rev a1)
The Linux distribution is Ubuntu 16.04.6 LTS.
Problem
After installing the toolkit, I got a success message, and running
nvcc --version
results in
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
However, running nvidia-smi
results in
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running
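For reference, this message generally means that the NVIDIA kernel module is not loaded. Below is a minimal sketch of how that could be checked from Python by reading /proc/modules. The helper names (loaded_modules, driver_report) and the list of watched modules are illustrative assumptions, not part of any NVIDIA tool. Including nouveau in the list is also an assumption: a loaded nouveau driver is a commonly reported conflict with the proprietary driver, but nothing in this post confirms it is the cause here.

```python
from pathlib import Path

# Modules of interest: the NVIDIA driver stack and the open-source nouveau driver.
# This list is an illustrative assumption, not an official one.
WATCHED = ("nvidia", "nvidia_drm", "nvidia_modeset", "nvidia_uvm", "nouveau")


def loaded_modules(proc_modules_text):
    """Return the set of module names listed in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])  # the first column is the module name
    return names


def driver_report(proc_modules_text):
    """Map each watched module name to True if it is currently loaded."""
    present = loaded_modules(proc_modules_text)
    return {name: name in present for name in WATCHED}


if __name__ == "__main__":
    modules_file = Path("/proc/modules")
    text = modules_file.read_text() if modules_file.exists() else ""
    for name, is_loaded in driver_report(text).items():
        print(f"{name:15s} {'loaded' if is_loaded else 'not loaded'}")

    # This file only exists while the NVIDIA kernel module is loaded.
    version_file = Path("/proc/driver/nvidia/version")
    if version_file.exists():
        print(version_file.read_text().strip())
    else:
        print("/proc/driver/nvidia/version not found: the driver module is not loaded")
```

If nvidia shows as not loaded while nouveau shows as loaded, that would point to a driver conflict rather than a CUDA toolkit problem.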
I then ran
$ sudo /usr/bin/nvidia-uninstall
to uninstall the driver, followed by
sudo ./NVIDIA-Linux-x86_64-410.104.run --no-x-check
to reinstall it. However, the installation failed with an error. These are the logs:
cat /var/log/nvidia-installer.log
nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Thu Jun 6 17:32:05 2019
installer version: 410.104
PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
nvidia-installer command line:
./nvidia-installer
--no-x-check
Unable to load: nvidia-installer ncurses v6 user interface
Using: nvidia-installer ncurses user interface
-> Detected 8 CPUs online; setting concurrency level to 8.
-> Installing NVIDIA driver version 410.104.
-> Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later. (Answer: Yes)
WARNING: nvidia-installer was forced to guess the X library path '/usr/lib' and X module path '/usr/lib/xorg/modules'; these paths were not queryable from the system. If X fails to find the NVIDIA X driver module, please install the `pkg-config` utility and the X.Org SDK/development package for your distribution and reinstall the driver.
-> Installing both new and classic TLS OpenGL libraries.
-> Installing classic TLS 32bit OpenGL libraries.
-> Install NVIDIA's 32-bit compatibility libraries? (Answer: Yes)
-> Will install GLVND GLX client libraries.
-> Will install GLVND EGL client libraries.
-> Skipping GLX non-GLVND file: "libGL.so.410.104"
-> Skipping GLX non-GLVND file: "libGL.so.1"
-> Skipping GLX non-GLVND file: "libGL.so"
-> Skipping EGL non-GLVND file: "libEGL.so.410.104"
-> Skipping EGL non-GLVND file: "libEGL.so"
-> Skipping EGL non-GLVND file: "libEGL.so.1"
-> Skipping GLX non-GLVND file: "./32/libGL.so.410.104"
-> Skipping GLX non-GLVND file: "libGL.so.1"
-> Skipping GLX non-GLVND file: "libGL.so"
-> Skipping EGL non-GLVND file: "./32/libEGL.so.410.104"
-> Skipping EGL non-GLVND file: "libEGL.so"
-> Skipping EGL non-GLVND file: "libEGL.so.1"
Looking for install checker script at ./libglvnd_install_checker/check-libglvnd-install.sh
executing: '/bin/sh ./libglvnd_install_checker/check-libglvnd-install.sh'...
Checking for libglvnd installation.
Checking libGLdispatch...
Can't load library libGLdispatch.so.0: libGLdispatch.so.0: cannot open shared object file: No such file or directory
Will install libglvnd libraries.
Will install libEGL vendor library config file to /usr/share/glvnd/egl_vendor.d
-> Searching for conflicting files:
-> done.
-> Installing 'NVIDIA Accelerated Graphics Driver for Linux-x86_64' (410.104):
executing: '/sbin/ldconfig'...
-> done.
-> Driver file installation is complete.
-> Installing DKMS kernel module:
-> done.
ERROR: Unable to load the 'nvidia-drm' kernel module.
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com
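Since most of the log above is routine output, here is a small sketch that pulls only the ERROR and WARNING lines out of the installer log. The log path is the one shown above; the function name problem_lines and the regular expression are illustrative assumptions based on the line format seen in this log, not on any documented log specification.

```python
import re
from pathlib import Path

LOG_PATH = Path("/var/log/nvidia-installer.log")

# Matches lines such as "ERROR: ..." or "WARNING: ..." as they appear in the log above.
PROBLEM_LINE = re.compile(r"^\s*(ERROR|WARNING):\s*(.*)$")


def problem_lines(log_text):
    """Return (level, message) pairs for every ERROR/WARNING line in the log text."""
    found = []
    for line in log_text.splitlines():
        match = PROBLEM_LINE.match(line)
        if match:
            found.append((match.group(1), match.group(2).strip()))
    return found


if __name__ == "__main__":
    try:
        text = LOG_PATH.read_text(errors="replace")
    except OSError as exc:
        # Covers a missing file as well as insufficient permissions.
        print(f"Could not read {LOG_PATH}: {exc}")
    else:
        for level, message in problem_lines(text):
            print(f"{level}: {message}")
```

On the log above, this would surface the X library path warning and the two errors, the first of which (the 'nvidia-drm' kernel module failing to load) is the actual failure.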
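Does anyone know how to fix my setup and install the driver correctly so that PyTorch can recognize CUDA?
Answer 1
You do not need to install CUDA manually when installing PyTorch; see "Is it required to install CUDA separately after installing the NVIDIA display driver?". Its first sentence, which comes from PyTorch, is the main point:
You don't need the system "CUDA Toolkit"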
You may ask why. The short answer is that PyTorch installs its own CUDA binaries and does not care which system CUDA Toolkit is installed.
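To see this on a given machine, the following sketch reports which CUDA version the installed PyTorch build was compiled with, independently of what nvcc --version says. It assumes nothing beyond PyTorch's torch.version.cuda, torch.cuda.is_available() and torch.cuda.device_count(); the function name cuda_report is hypothetical. Note that the bundled binaries still rely on a working NVIDIA driver, so cuda_available stays False until the driver loads.

```python
import importlib.util


def cuda_report():
    """Collect what the installed PyTorch build reports about CUDA."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch

    available = torch.cuda.is_available()
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version the PyTorch binaries were built with; None for CPU-only builds.
        "bundled_cuda": torch.version.cuda,
        "cuda_available": available,
        "device_count": torch.cuda.device_count() if available else 0,
    }


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

A bundled_cuda value that differs from the system toolkit version is expected and harmless; a value of None suggests a CPU-only build was installed.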
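If a suitable option is available for your system, just use the official page at https://pytorch.org/get-started/locally/ to find the right one-liner for installing CUDA-enabled PyTorch on Linux. If none of the options there fits your system, you need to install from source; in that case, first see "How to install PyTorch from source (with CUDA enabled for a deprecated CUDA cc 3.5 of an old GPU) using anaconda prompt on Windows 10?".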