Error loading NVIDIA driver and CUDA after reinstallation

2024-6-10

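I have a computer with the following specifications:

System: Ubuntu 14.04
Graphics card: NVIDIA GTX1080ti

About a year ago I installed the system and then installed CUDA 8.0 with the NVIDIA driver on this machine. The GPU and CUDA worked fine until today, when I tried to install a newer version of CUDA.

For various reasons, I tried to install CUDA 10.0 to replace the installed CUDA 8.0. First, I uninstalled the old driver with nvidia-uninstall. Then I uninstalled the old CUDA with /usr/local/cuda-8.0/bin/uninstall_cuda_8.0.pl. After that, I installed CUDA 10.0 and the new driver using the runfile installer downloaded from this page, but the installation failed. After many failed debugging attempts I gave up, uninstalled the new driver and the new CUDA, and reinstalled CUDA 8.0 using the runfile installer downloaded from this page. That installation succeeded, but I can no longer run anything that uses CUDA, including pycuda, pyopencl, and tensorflow. All of these packages report that they cannot find a GPU device.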

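Update:

I tried uninstalling all NVIDIA components with nvidia-uninstall, uninstall_cuda_8.0.pl, and sudo apt-get --purge remove nvidia-*, but the problem persists. The error reports and system logs have also changed since then.

Here are some of my current system logs: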

In the Python CLI, pycuda fails:

Python 2.7.6 (default, Nov 23 2017, 15:49:48) 
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pycuda.driver as cuda
>>> import pycuda.autoinit
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pycuda/autoinit.py", line 5, in <module>
    cuda.init()
pycuda._driver.RuntimeError: cuInit failed: no CUDA-capable device is detected
>>>

nvidia-smi reports:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  ERR!                Off  | 0000:01:00.0      On |                  N/A |
| 28%   52C    P8    15W / 300W |     43MiB / 11168MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1868    G   /usr/lib/xorg/Xorg                              40MiB |
+-----------------------------------------------------------------------------+

dmesg | grep nvidia reports:

[    2.370841] nvidia: loading out-of-tree module taints kernel.
[    2.370844] nvidia: module license 'NVIDIA' taints kernel.
[    2.374116] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    2.380809] nvidia-nvlink: Nvlink Core is being initialized, major device number 242
[    2.383631] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  375.26  Thu Dec  8 18:04:14 PST 2016
[    2.385803] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[    2.717844] init: nvidia-prime main process (1094) terminated with status 127
[    7.447032] nvidia-modeset: Allocated GPU:0 (GPU-3727ccd9-f1fc-78c9-f908-5e1edf205194) @ PCI:0000:01:00.0
[   72.737634] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 241
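One thing these lines make it possible to check is which driver version the loaded kernel module announces, so it can be compared with the version nvidia-smi prints. The following is a minimal sketch, assuming the dmesg text has already been captured into a string; the function name `kernel_module_version` is hypothetical and not part of any NVIDIA tool.

```python
import re

# Matches the version in a line such as:
# "nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  375.26  Thu Dec ..."
_VERSION_RE = re.compile(
    r"nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver.*?platforms\s+(\d+(?:\.\d+)+)"
)


def kernel_module_version(dmesg_text):
    """Return the driver version announced by nvidia-modeset, or None if absent.

    Hypothetical helper for illustration only.
    """
    for line in dmesg_text.splitlines():
        match = _VERSION_RE.search(line)
        if match:
            return match.group(1)
    return None


sample = (
    "[    2.383631] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver "
    "for UNIX platforms  375.26  Thu Dec  8 18:04:14 PST 2016"
)
print(kernel_module_version(sample))
```

For the log above this yields 375.26, the same version nvidia-smi reports, so a version mismatch between the kernel module and the user-space tools does not look like the cause here.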

nvidia-smi -a reports (note that the Product Name field is Unknown Error):

==============NVSMI LOG==============

Timestamp                           : Thu Sep 27 10:16:41 2018
Driver Version                      : 375.26

Attached GPUs                       : 1
GPU 0000:01:00.0
    Product Name                    : Unknown Error
    Product Brand                   : GeForce
    Display Mode                    : Enabled
    Display Active                  : Enabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : N/A
    GPU UUID                        : GPU-3727ccd9-f1fc-78c9-f908-5e1edf205194
    Minor Number                    : 0
    VBIOS Version                   : 86.02.40.00.2E
    MultiGPU Board                  : No
    Board ID                        : 0x100
    GPU Part Number                 : N/A
    Inforom Version
        Image Version               : G001.0000.01.04
        OEM Object                  : 1.1
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    PCI
        Bus                         : 0x01
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x1B0610DE
        Bus Id                      : 0000:01:00.0
        Sub System Id               : 0x11117377
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : 0 %
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Sync Boost                  : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 11168 MiB
        Used                        : 43 MiB
        Free                        : 11125 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 5 MiB
        Free                        : 251 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 2 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : N/A
        Aggregate
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending                     : N/A
    Temperature
        GPU Current Temp            : 43 C
        GPU Shutdown Temp           : 96 C
        GPU Slowdown Temp           : 93 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 14.68 W
        Power Limit                 : 300.00 W
        Default Power Limit         : 300.00 W
        Enforced Power Limit        : 300.00 W
        Min Power Limit             : 125.00 W
        Max Power Limit             : 330.00 W
    Clocks
        Graphics                    : 240 MHz
        SM                          : 240 MHz
        Memory                      : 405 MHz
        Video                       : 544 MHz
    Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Default Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Max Clocks
        Graphics                    : 1999 MHz
        SM                          : 1999 MHz
        Memory                      : 5505 MHz
        Video                       : 1708 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes
        Process ID                  : 1868
            Type                    : G
            Name                    : /usr/lib/xorg/Xorg
            Used GPU Memory         : 40 MiB
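Because the full report is long, it can help to pick out the fields whose value is an error marker automatically. Below is a rough sketch, assuming the report has been saved as text; `find_error_fields` is a hypothetical helper, and treating any value that contains "Error" or equals "ERR!" as suspicious is an assumption of this sketch, not an nvidia-smi convention.

```python
def find_error_fields(nvsmi_text):
    """Return (field, value) pairs whose value looks like an error marker.

    Hypothetical helper: flags values containing "Error" or equal to "ERR!".
    """
    suspicious = []
    for line in nvsmi_text.splitlines():
        if ":" not in line:
            continue  # section headers such as "ECC Errors" carry no value
        field, _, value = line.partition(":")
        field, value = field.strip(), value.strip()
        if "Error" in value or value == "ERR!":
            suspicious.append((field, value))
    return suspicious


sample = """\
    Product Name                    : Unknown Error
    Product Brand                   : GeForce
    Fan Speed                       : 0 %
"""
print(find_error_fields(sample))
```

Run against the report above, only the Product Name field would be flagged; the remaining fields are either populated or N/A.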

I don't know what the problem is or how to fix it. Can anyone help?

Answer 1

Try running a CUDA program as root. I have seen this on 14.04 machines in similar circumstances. It should fix the problem until the next reboot.
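The answer does not say why running as root helps. One possible explanation, stated here purely as an assumption, is that the /dev/nvidia* device nodes are missing or not accessible to a normal user until a root process creates them. A small sketch for checking this as the regular user follows; the list of node paths assumes a single-GPU machine, and `check_device_nodes` is a hypothetical helper.

```python
import os

# Device nodes the NVIDIA driver normally exposes on Linux.
# Assumption: single GPU, so only /dev/nvidia0 is expected.
DEVICE_NODES = ["/dev/nvidiactl", "/dev/nvidia0", "/dev/nvidia-uvm"]


def check_device_nodes(paths=DEVICE_NODES):
    """Map each path to 'missing', 'no access', or 'ok' for the current user."""
    status = {}
    for path in paths:
        if not os.path.exists(path):
            status[path] = "missing"
        elif not os.access(path, os.R_OK | os.W_OK):
            status[path] = "no access"
        else:
            status[path] = "ok"
    return status


if __name__ == "__main__":
    for path, state in check_device_nodes().items():
        print(path, state)
```

If a node shows up as missing before a CUDA program has been run as root and as ok afterwards, that would be consistent with the workaround lasting only until the next reboot.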
