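I have spent the whole day trying to get this (V100) GPU working in a new Ubuntu VM. I have tried installing the driver and rebooting, and also purging/uninstalling everything NVIDIA-related, but none of it seems to work.
Specifically, I ran this: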
apt update;
apt install build-essential;
sudo add-apt-repository ppa:graphics-drivers
sudo apt install ubuntu-drivers-common
ubuntu-drivers devices
sudo apt-get install nvidia-driver-460
sudo reboot now
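After that, nvidia-smi sometimes seems to work (it is not working as I write this question, so I cannot copy-paste what it prints when it does), but when it does not work it says something like this: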
(synthesis) miranda9@miranda9:~$ nvidia-smi
Unable to determine the device handle for GPU 0000:00:06.0: Unknown Error
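The error names a full PCI address (0000:00:06.0), which can be matched against the shorter bus:device.function form that lspci prints. A minimal sketch of that extraction, assuming the message keeps the format shown above; the function name `gpu_address_from_error` is made up for illustration and is not part of any NVIDIA tool:

```python
import re

# A PCI address of the form domain:bus:device.function, e.g. 0000:00:06.0
PCI_ADDR = re.compile(
    r"\b([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])\b"
)


def gpu_address_from_error(message):
    """Return the bus:device.function address named in the message, or None.

    Hypothetical helper: it only parses text and does not talk to the driver.
    """
    match = PCI_ADDR.search(message)
    if match is None:
        return None
    _domain, bus, device, function = match.groups()
    return f"{bus}:{device}.{function}".lower()


error = "Unable to determine the device handle for GPU 0000:00:06.0: Unknown Error"
print(gpu_address_from_error(error))  # prints 00:06.0
```

The result, 00:06.0, is the address to look for in the lspci output.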
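Any help is appreciated.
Note that I do not have access to the VM's vmx file either, so this question and its answer are of no use to me: https://forums.developer.nvidia.com/t/nvidia-smi-reports-unable-to-determine-the-device-handle-for-gpu/46835
In addition, I also tried uninstalling everything NVIDIA-related and then reinstalling: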
sudo apt-get --purge remove "*nvidia*"
sudo /usr/bin/nvidia-uninstall
and then
apt update;
apt install build-essential;
sudo add-apt-repository ppa:graphics-drivers
sudo apt install ubuntu-drivers-common
ubuntu-drivers devices
sudo apt-get install nvidia-driver-460
sudo reboot now
but that does not seem to work either.
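After a purge/reinstall cycle like this, one thing worth checking is which GPU kernel modules are actually loaded after the reboot. A minimal sketch, assuming a Linux guest where /proc/modules is readable; the helper name `gpu_modules` is made up, and including `nouveau` in the prefixes is my assumption (the open-source driver can sometimes be loaded instead of the proprietary one), not something established in this thread:

```python
def gpu_modules(proc_modules_text, prefixes=("nvidia", "nouveau")):
    """Return names of loaded kernel modules that start with one of `prefixes`.

    `proc_modules_text` is the content of /proc/modules, where the first
    whitespace-separated field of each line is the module name.
    """
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith(prefixes):
            names.append(fields[0])
    return names


if __name__ == "__main__":
    try:
        with open("/proc/modules") as handle:
            text = handle.read()
    except OSError:
        text = ""  # not on Linux, or /proc is unavailable
    print(gpu_modules(text) or "no nvidia/nouveau module loaded")
```

An empty result after installing nvidia-driver-460 would suggest the module was never built or loaded, which is a different problem from a module that loads but cannot talk to the card.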
More information in case it helps:
(synthesis) miranda9@miranda9:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.2 LTS
Release: 20.04
Codename: focal
Also:
(synthesis) miranda9@miranda9:~$ python
Python 3.9.5 (default, Jun 4 2021, 12:28:51)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
/home/miranda9/miniconda3/envs/synthesis/lib/python3.9/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 101: invalid device ordinal (Triggered internally at /opt/conda/conda-bld/pytorch_1623448238472/work/c10/cuda/CUDAFunctions.cpp:115.)
return torch._C._cuda_getDeviceCount() > 0
False
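In the session above the real cause shows up only as a UserWarning while the call itself just returns False. A small sketch for capturing that warning text so it can be logged or pasted, using only the standard `warnings` module; `probe` is a hypothetical helper name, and passing `torch.cuda.is_available` to it assumes torch is installed in the active environment:

```python
import warnings


def probe(check):
    """Call `check()` and return (result, list of warning messages it raised)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # do not let filters hide repeats
        result = check()
    return result, [str(item.message) for item in caught]


# Intended use, assuming torch is importable:
#   import torch
#   available, messages = probe(torch.cuda.is_available)
#   print(available, messages)
```

This does not fix anything; it only makes the CUDA initialization message easier to collect alongside the False result.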
As requested in the comments:
# lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 SCSI storage controller: XenSource, Inc. Xen Platform Device (rev 01)
00:05.0 System peripheral: XenSource, Inc. Citrix XenServer PCI Device for Windows Update (rev 01)
00:06.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)
Another VM:
$ lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 SCSI storage controller: XenSource, Inc. Xen Platform Device (rev 01)
00:05.0 System peripheral: XenSource, Inc. Citrix XenServer PCI Device for Windows Update (rev 01)
00:06.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)
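Both listings show the V100 at 00:06.0, the same address that nvidia-smi complains about, so the guest does see the card on its PCI bus. A minimal sketch of picking that line out of lspci output programmatically; `nvidia_devices` is a made-up helper name, and the sample text is copied from the listing above:

```python
def nvidia_devices(lspci_output):
    """Return (address, description) pairs for lspci lines that mention NVIDIA."""
    found = []
    for line in lspci_output.splitlines():
        # Each lspci line is "<address> <class>: <vendor and device>"
        address, _, description = line.strip().partition(" ")
        if "nvidia" in description.lower():
            found.append((address, description))
    return found


sample = """\
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:06.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)
"""

for address, description in nvidia_devices(sample):
    print(address, "->", description)
```

On a live system the input could come from `subprocess.run(["lspci"], capture_output=True, text=True).stdout` instead of the pasted sample.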
Resources I have looked at for help:
- How can I completely uninstall the nvidia driver?
- https://unix.stackexchange.com/questions/25663/how-to-get-the-version-of-my-nvidia-driver
- https://forums.developer.nvidia.com/t/nvidia-smi-reports-unable-to-determine-the-device-handle-for-gpu/46835
- UBUNTU 18.04 Unable to determine the device handle for GPU 0000:00:04.0: Unknown Error
- https://stackoverflow.com/questions/10871412/resetting-gpu-and-driver-after-cuda-error
- Unable to get an NVidia GPU working on Ubuntu 18.04 (Asus laptop)
- NVIDIA RTX 3080 GPU not working with Ubuntu 20.04, kernel 5.8.0-50-generic
- https://www.reddit.com/r/nvidia/comments/onorog/how_does_one_make_a_gpu_in_a_brand_new_ubuntu/
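Answer 1
A virtual machine emulates a graphics card, so the native card installed in the host system should be transparent to the guest system. VMs are meant for "sharing" resources; they are not real systems with direct access to the hardware. Installing the Nvidia driver in the guest system is therefore pointless. You can verify this by checking the current driver inside the VM: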
inxi -G
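(run in a terminal) will show you a VM/Oracle driver rather than your native card.
With tweaks and tricks it may be possible to get high-performance graphics output, but VMs are not made for that kind of work...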
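If inxi is not installed, the same "which driver is in use" check can be sketched by reading sysfs directly. This assumes a Linux guest with sysfs mounted at /sys and takes the PCI address 0000:00:06.0 from the error message in the question; `bound_driver` is a hypothetical helper name:

```python
import os


def bound_driver(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the kernel driver bound to a PCI device.

    Returns None when the device does not exist or has no driver bound.
    """
    link = os.path.join(sysfs_root, pci_address, "driver")
    try:
        target = os.readlink(link)  # e.g. ../../../bus/pci/drivers/<name>
    except OSError:
        return None
    return os.path.basename(target)


print(bound_driver("0000:00:06.0"))
```

A result of None would mean no kernel driver has claimed the device at that address; any other value is the name of the driver that currently owns it.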