Upgrading the kernel and NVidia Tesla V100 driver on an Azure VM

I have an Azure VM with an NVidia Tesla V100 and need to upgrade its kernel from 6.2.0-1011-azure to the latest version. The VM is based on an NVidia image; here are some details:

Operating system     Linux
Image publisher    nvidia
Image offer    ngc_azure_17_11
Image plan    ngc-base-version-23_09_1_gen2
VM generation    V2
VM architecture    x64

# uname -r
6.2.0-1011-azure

# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

As far as I understand, the target version would be 6.5.0.1017:

# rmadison linux-image-azure | grep jammy
 linux-image-azure | 5.15.0.1003.4           | jammy           | amd64, arm64
 linux-image-azure | 6.5.0.1017.17~22.04.1   | jammy-security  | amd64, arm64
 linux-image-azure | 6.5.0.1017.17~22.04.1   | jammy-updates   | amd64, arm64
 linux-image-azure | 6.5.0.1018.19~22.04.2   | jammy-proposed  | amd64, arm64

The straightforward approach with apt update and apt upgrade had no effect on the kernel:

# apt update
Hit:1 https://packages.microsoft.com/repos/azure-cli jammy InRelease
Hit:2 https://download.docker.com/linux/ubuntu jammy InRelease
Hit:3 http://azure.archive.ubuntu.com/ubuntu jammy InRelease
Hit:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:5 http://azure.archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:6 http://azure.archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:7 http://azure.archive.ubuntu.com/ubuntu jammy-security InRelease
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
14 packages can be upgraded. Run 'apt list --upgradable' to see them.
W: https://download.docker.com/linux/ubuntu/dists/jammy/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.

# apt list --upgradable
Listing... Done
ethtool/jammy-updates 1:5.16-1ubuntu0.1 amd64 [upgradable from: 1:5.16-1]
libnvidia-container-tools/unknown 1.14.6-1 amd64 [upgradable from: 1.14.1-1]
libnvidia-container1/unknown 1.14.6-1 amd64 [upgradable from: 1.14.1-1]
linux-cloud-tools-azure/jammy-updates,jammy-security 6.5.0.1017.17~22.04.1 amd64 [upgradable from: 6.2.0.1011.11~22.04.1]
linux-headers-azure/jammy-updates,jammy-security 6.5.0.1017.17~22.04.1 amd64 [upgradable from: 6.2.0.1011.11~22.04.1]
linux-image-azure/jammy-updates,jammy-security 6.5.0.1017.17~22.04.1 amd64 [upgradable from: 6.2.0.1011.11~22.04.1]
linux-tools-azure/jammy-updates,jammy-security 6.5.0.1017.17~22.04.1 amd64 [upgradable from: 6.2.0.1011.11~22.04.1]
linux-tools-common/jammy-updates,jammy-updates,jammy-security,jammy-security 5.15.0-101.111 all [upgradable from: 5.15.0-83.92]
nvidia-container-toolkit-base/unknown 1.14.6-1 amd64 [upgradable from: 1.13.5-1]
nvidia-container-toolkit/unknown 1.14.6-1 amd64 [upgradable from: 1.13.5-1]
nvidia-fabricmanager-535/jammy-updates,jammy-security 535.161.07-0ubuntu0.22.04.1 amd64 [upgradable from: 535.54.03-0ubuntu0.22.04.1]
python3-update-manager/jammy-updates,jammy-updates 1:22.04.19 all [upgradable from: 1:22.04.10]
ubuntu-advantage-tools/jammy-updates,jammy-updates 31.2~22.04 amd64 [upgradable from: 28.1~22.04]
update-manager-core/jammy-updates,jammy-updates 1:22.04.19 all [upgradable from: 1:22.04.10]

# apt upgrade
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Calculating upgrade... Done
The following NEW packages will be installed:
  hwdata ubuntu-pro-client ubuntu-pro-client-l10n
The following packages have been kept back:
  libnvidia-container-tools libnvidia-container1 linux-cloud-tools-azure linux-headers-azure linux-image-azure linux-tools-azure
  nvidia-container-toolkit nvidia-container-toolkit-base nvidia-fabricmanager-535
The following packages will be upgraded:
  ethtool linux-tools-common python3-update-manager ubuntu-advantage-tools update-manager-core
5 upgraded, 3 newly installed, 0 to remove and 9 not upgraded.
1 standard LTS security update
Need to get 816 kB of archives.
After this operation, 494 kB of additional disk space will be used.
Do you want to continue? [Y/n] y

......................................................

Setting up update-manager-core (1:22.04.19) ...
Processing triggers for man-db (2.10.2-1) ...
Scanning processes...
Scanning linux images...

Running kernel seems to be up-to-date.

No services need to be restarted.
No containers need to be restarted.
No user sessions are running outdated binaries.
No VM guests are running outdated hypervisor (qemu) binaries on this host.

The kernel packages are still listed as not upgraded:

# apt list --upgradable
Listing... Done
libnvidia-container-tools/unknown 1.14.6-1 amd64 [upgradable from: 1.14.1-1]
libnvidia-container1/unknown 1.14.6-1 amd64 [upgradable from: 1.14.1-1]
linux-cloud-tools-azure/jammy-updates,jammy-security 6.5.0.1017.17~22.04.1 amd64 [upgradable from: 6.2.0.1011.11~22.04.1]
linux-headers-azure/jammy-updates,jammy-security 6.5.0.1017.17~22.04.1 amd64 [upgradable from: 6.2.0.1011.11~22.04.1]
linux-image-azure/jammy-updates,jammy-security 6.5.0.1017.17~22.04.1 amd64 [upgradable from: 6.2.0.1011.11~22.04.1]
linux-tools-azure/jammy-updates,jammy-security 6.5.0.1017.17~22.04.1 amd64 [upgradable from: 6.2.0.1011.11~22.04.1]
nvidia-container-toolkit-base/unknown 1.14.6-1 amd64 [upgradable from: 1.13.5-1]
nvidia-container-toolkit/unknown 1.14.6-1 amd64 [upgradable from: 1.13.5-1]
nvidia-fabricmanager-535/jammy-updates,jammy-security 535.161.07-0ubuntu0.22.04.1 amd64 [upgradable from: 535.54.03-0ubuntu0.22.04.1]
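
To see at a glance which packages apt is holding back, the "kept back" section of the upgrade output above can be extracted with a small filter. This is only a minimal sketch: `kept_back` is a hypothetical helper name, and it assumes apt's English messages exactly as they appear above.

```shell
# kept_back: read `apt upgrade` / `apt-get upgrade` output on stdin and
# print every package from the "kept back" section, one per line.
kept_back() {
    awk '
        # The section header switches collection on.
        /^The following packages have been kept back:/ { grab = 1; next }
        # Indented lines inside the section hold the package names.
        grab && /^[ \t]/ { for (i = 1; i <= NF; i++) print $i; next }
        # Any other line ends the section.
        { grab = 0 }
    '
}
```

It could be fed with `LC_ALL=C apt-get -s upgrade | kept_back` (the `-s` flag only simulates the upgrade). Running `apt-mark showhold` and `apt-cache policy linux-image-azure` afterwards may show whether a hold or a pin is involved.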

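However, I found that aptitude can perform this upgrade, once I rejected the proposed solutions that keep or remove the kernel packages: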

# aptitude full-upgrade
The following NEW packages will be installed:
  linux-azure-6.5-cloud-tools-6.5.0-1017{a} linux-azure-6.5-tools-6.5.0-1017{a} linux-cloud-tools-6.5.0-1017-azure{a}
  linux-tools-6.5.0-1017-azure{a}
The following packages will be upgraded:
  libnvidia-container-tools libnvidia-container1 linux-cloud-tools-azure linux-headers-azure{b} linux-image-azure{b} linux-tools-azure
  nvidia-container-toolkit nvidia-container-toolkit-base nvidia-fabricmanager-535
9 packages upgraded, 4 newly installed, 0 to remove and 14 not upgraded.
Need to get 13.9 MB of archives. After unpacking 27.8 MB will be used.
The following packages have unmet dependencies:
 linux-image-azure : Depends: linux-image-6.5.0-1017-azure but it is not installable
 linux-headers-azure : Depends: linux-headers-6.5.0-1017-azure but it is not installable
 linux-azure : Depends: linux-image-azure (= 6.2.0.1011.11~22.04.1) but 6.5.0.1017.17~22.04.1 is to be installed
               Depends: linux-headers-azure (= 6.2.0.1011.11~22.04.1) but 6.5.0.1017.17~22.04.1 is to be installed
               Depends: linux-tools-azure (= 6.2.0.1011.11~22.04.1) but 6.5.0.1017.17~22.04.1 is to be installed
               Depends: linux-cloud-tools-azure (= 6.2.0.1011.11~22.04.1) but 6.5.0.1017.17~22.04.1 is to be installed
The following actions will resolve these dependencies:

     Keep the following packages at their current version:
1)     linux-cloud-tools-azure [6.2.0.1011.11~22.04.1 (now)]
2)     linux-headers-azure [6.2.0.1011.11~22.04.1 (now)]
3)     linux-image-azure [6.2.0.1011.11~22.04.1 (now)]
4)     linux-tools-azure [6.2.0.1011.11~22.04.1 (now)]



Accept this solution? [Y/n/q/?] n
The following actions will resolve these dependencies:

     Remove the following packages:
1)     linux-azure [6.2.0.1011.11~22.04.1 (now)]

     Keep the following packages at their current version:
2)     linux-headers-azure [6.2.0.1011.11~22.04.1 (now)]
3)     linux-image-azure [6.2.0.1011.11~22.04.1 (now)]



Accept this solution? [Y/n/q/?] n
The following actions will resolve these dependencies:

     Remove the following packages:
1)     linux-azure [6.2.0.1011.11~22.04.1 (now)]
2)     linux-headers-azure [6.2.0.1011.11~22.04.1 (now)]

     Keep the following packages at their current version:
3)     linux-image-azure [6.2.0.1011.11~22.04.1 (now)]



Accept this solution? [Y/n/q/?] n
The following actions will resolve these dependencies:

     Remove the following packages:
1)     linux-azure [6.2.0.1011.11~22.04.1 (now)]
2)     linux-image-azure [6.2.0.1011.11~22.04.1 (now)]

     Keep the following packages at their current version:
3)     linux-headers-azure [6.2.0.1011.11~22.04.1 (now)]



Accept this solution? [Y/n/q/?] n
The following actions will resolve these dependencies:

     Remove the following packages:
1)     linux-azure [6.2.0.1011.11~22.04.1 (now)]
2)     linux-headers-azure [6.2.0.1011.11~22.04.1 (now)]
3)     linux-image-azure [6.2.0.1011.11~22.04.1 (now)]



Accept this solution? [Y/n/q/?] n
The following actions will resolve these dependencies:

     Install the following packages:
1)     linux-azure-6.5-headers-6.5.0-1017 [6.5.0-1017.17~22.04.1 (jammy-security, jammy-updates)]
2)     linux-headers-6.5.0-1017-azure [6.5.0-1017.17~22.04.1 (jammy-security, jammy-updates)]
3)     linux-image-6.5.0-1017-azure [6.5.0-1017.17~22.04.1 (jammy-security, jammy-updates)]
4)     linux-modules-6.5.0-1017-azure [6.5.0-1017.17~22.04.1 (jammy-security, jammy-updates)]

     Upgrade the following packages:
5)     linux-azure [6.2.0.1011.11~22.04.1 (now) -> 6.5.0.1017.17~22.04.1 (jammy-security, jammy-updates)]



Accept this solution? [Y/n/q/?] y
The following NEW packages will be installed:
  linux-azure-6.5-cloud-tools-6.5.0-1017{a} linux-azure-6.5-headers-6.5.0-1017{a} linux-azure-6.5-tools-6.5.0-1017{a}
  linux-cloud-tools-6.5.0-1017-azure{a} linux-headers-6.5.0-1017-azure{a} linux-image-6.5.0-1017-azure{a} linux-modules-6.5.0-1017-azure{a}
  linux-tools-6.5.0-1017-azure{a}
The following packages will be upgraded:
  libnvidia-container-tools libnvidia-container1 linux-azure linux-cloud-tools-azure linux-headers-azure linux-image-azure linux-tools-azure
  nvidia-container-toolkit nvidia-container-toolkit-base nvidia-fabricmanager-535
10 packages upgraded, 8 newly installed, 0 to remove and 13 not upgraded.
Need to get 67.3 MB of archives. After unpacking 278 MB will be used.
Do you want to continue? [Y/n/?] y
Get: 1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  libnvidia-container1 1.14.6-1 [926 kB]
Get: 2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  libnvidia-container-tools 1.14.6-1 [21.1 kB]
Get: 3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  nvidia-container-toolkit 1.14.6-1 [923 kB]
Get: 4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  nvidia-container-toolkit-base 1.14.6-1 [2354 kB]
Get: 5 http://azure.archive.ubuntu.com/ubuntu jammy-updates/main amd64 linux-modules-6.5.0-1017-azure amd64 6.5.0-1017.17~22.04.1 [23.5 MB]
Get: 6 http://azure.archive.ubuntu.com/ubuntu jammy-updates/main amd64 linux-image-6.5.0-1017-azure amd64 6.5.0-1017.17~22.04.1 [13.4 MB]
Get: 7 http://azure.archive.ubuntu.com/ubuntu jammy-updates/main amd64 linux-azure amd64 6.5.0.1017.17~22.04.1 [1756 B]
Get: 8 http://azure.archive.ubuntu.com/ubuntu jammy-updates/main amd64 linux-image-azure amd64 6.5.0.1017.17~22.04.1 [2598 B]
Get: 9 http://azure.archive.ubuntu.com/ubuntu jammy-updates/main amd64 linux-azure-6.5-headers-6.5.0-1017 all 6.5.0-1017.17~22.04.1 [13.3 MB]
Get: 10 http://azure.archive.ubuntu.com/ubuntu jammy-updates/main amd64 linux-headers-6.5.0-1017-azure amd64 6.5.0-1017.17~22.04.1 [3245 kB]
Get: 11 http://azure.archive.ubuntu.com/ubuntu jammy-updates/main amd64 linux-headers-azure amd64 6.5.0.1017.17~22.04.1 [2492 B]
Get: 12 http://azure.archive.ubuntu.com/ubuntu jammy-updates/main amd64 linux-azure-6.5-tools-6.5.0-1017 amd64 6.5.0-1017.17~22.04.1 [7633 kB]
Get: 13 http://azure.archive.ubuntu.com/ubuntu jammy-updates/main amd64 linux-tools-6.5.0-1017-azure amd64 6.5.0-1017.17~22.04.1 [1778 B]
Get: 14 http://azure.archive.ubuntu.com/ubuntu jammy-updates/main amd64 linux-tools-azure amd64 6.5.0.1017.17~22.04.1 [2508 B]
Get: 15 http://azure.archive.ubuntu.com/ubuntu jammy-updates/main amd64 linux-azure-6.5-cloud-tools-6.5.0-1017 amd64 6.5.0-1017.17~22.04.1 [131 kB]
Get: 16 http://azure.archive.ubuntu.com/ubuntu jammy-updates/main amd64 linux-cloud-tools-6.5.0-1017-azure amd64 6.5.0-1017.17~22.04.1 [1704 B]
Get: 17 http://azure.archive.ubuntu.com/ubuntu jammy-updates/main amd64 linux-cloud-tools-azure amd64 6.5.0.1017.17~22.04.1 [2522 B]
Get: 18 http://azure.archive.ubuntu.com/ubuntu jammy-updates/multiverse amd64 nvidia-fabricmanager-535 amd64 535.161.07-0ubuntu0.22.04.1 [1861 kB]
Fetched 67.3 MB in 2s (35.1 MB/s)
(Reading database ... 133576 files and directories currently installed.)
Preparing to unpack .../00-libnvidia-container1_1.14.6-1_amd64.deb ...
Unpacking libnvidia-container1:amd64 (1.14.6-1) over (1.14.1-1) ...
Preparing to unpack .../01-libnvidia-container-tools_1.14.6-1_amd64.deb ...
Unpacking libnvidia-container-tools (1.14.6-1) over (1.14.1-1) ...
Selecting previously unselected package linux-modules-6.5.0-1017-azure.
Preparing to unpack .../02-linux-modules-6.5.0-1017-azure_6.5.0-1017.17~22.04.1_amd64.deb ...
Unpacking linux-modules-6.5.0-1017-azure (6.5.0-1017.17~22.04.1) ...
Selecting previously unselected package linux-image-6.5.0-1017-azure.
Preparing to unpack .../03-linux-image-6.5.0-1017-azure_6.5.0-1017.17~22.04.1_amd64.deb ...
Unpacking linux-image-6.5.0-1017-azure (6.5.0-1017.17~22.04.1) ...
Preparing to unpack .../04-linux-azure_6.5.0.1017.17~22.04.1_amd64.deb ...
Unpacking linux-azure (6.5.0.1017.17~22.04.1) over (6.2.0.1011.11~22.04.1) ...
Preparing to unpack .../05-linux-image-azure_6.5.0.1017.17~22.04.1_amd64.deb ...
Unpacking linux-image-azure (6.5.0.1017.17~22.04.1) over (6.2.0.1011.11~22.04.1) ...
Selecting previously unselected package linux-azure-6.5-headers-6.5.0-1017.
Preparing to unpack .../06-linux-azure-6.5-headers-6.5.0-1017_6.5.0-1017.17~22.04.1_all.deb ...
Unpacking linux-azure-6.5-headers-6.5.0-1017 (6.5.0-1017.17~22.04.1) ...
Selecting previously unselected package linux-headers-6.5.0-1017-azure.
Preparing to unpack .../07-linux-headers-6.5.0-1017-azure_6.5.0-1017.17~22.04.1_amd64.deb ...
Unpacking linux-headers-6.5.0-1017-azure (6.5.0-1017.17~22.04.1) ...
Preparing to unpack .../08-linux-headers-azure_6.5.0.1017.17~22.04.1_amd64.deb ...
Unpacking linux-headers-azure (6.5.0.1017.17~22.04.1) over (6.2.0.1011.11~22.04.1) ...
Selecting previously unselected package linux-azure-6.5-tools-6.5.0-1017.
Preparing to unpack .../09-linux-azure-6.5-tools-6.5.0-1017_6.5.0-1017.17~22.04.1_amd64.deb ...
Unpacking linux-azure-6.5-tools-6.5.0-1017 (6.5.0-1017.17~22.04.1) ...
Selecting previously unselected package linux-tools-6.5.0-1017-azure.
Preparing to unpack .../10-linux-tools-6.5.0-1017-azure_6.5.0-1017.17~22.04.1_amd64.deb ...
Unpacking linux-tools-6.5.0-1017-azure (6.5.0-1017.17~22.04.1) ...
Preparing to unpack .../11-linux-tools-azure_6.5.0.1017.17~22.04.1_amd64.deb ...
Unpacking linux-tools-azure (6.5.0.1017.17~22.04.1) over (6.2.0.1011.11~22.04.1) ...
Selecting previously unselected package linux-azure-6.5-cloud-tools-6.5.0-1017.
Preparing to unpack .../12-linux-azure-6.5-cloud-tools-6.5.0-1017_6.5.0-1017.17~22.04.1_amd64.deb ...
Unpacking linux-azure-6.5-cloud-tools-6.5.0-1017 (6.5.0-1017.17~22.04.1) ...
Selecting previously unselected package linux-cloud-tools-6.5.0-1017-azure.
Preparing to unpack .../13-linux-cloud-tools-6.5.0-1017-azure_6.5.0-1017.17~22.04.1_amd64.deb ...
Unpacking linux-cloud-tools-6.5.0-1017-azure (6.5.0-1017.17~22.04.1) ...
Preparing to unpack .../14-linux-cloud-tools-azure_6.5.0.1017.17~22.04.1_amd64.deb ...
Unpacking linux-cloud-tools-azure (6.5.0.1017.17~22.04.1) over (6.2.0.1011.11~22.04.1) ...
Preparing to unpack .../15-nvidia-container-toolkit_1.14.6-1_amd64.deb ...
Unpacking nvidia-container-toolkit (1.14.6-1) over (1.13.5-1) ...
Preparing to unpack .../16-nvidia-container-toolkit-base_1.14.6-1_amd64.deb ...
Unpacking nvidia-container-toolkit-base (1.14.6-1) over (1.13.5-1) ...
dpkg: warning: unable to delete old directory '/etc/nvidia-container-runtime': Directory not empty
Preparing to unpack .../17-nvidia-fabricmanager-535_535.161.07-0ubuntu0.22.04.1_amd64.deb ...
Unpacking nvidia-fabricmanager-535 (535.161.07-0ubuntu0.22.04.1) over (535.54.03-0ubuntu0.22.04.1) ...
Setting up linux-azure-6.5-tools-6.5.0-1017 (6.5.0-1017.17~22.04.1) ...
Setting up linux-azure-6.5-headers-6.5.0-1017 (6.5.0-1017.17~22.04.1) ...
Setting up linux-modules-6.5.0-1017-azure (6.5.0-1017.17~22.04.1) ...
Setting up nvidia-container-toolkit-base (1.14.6-1) ...
Setting up linux-tools-6.5.0-1017-azure (6.5.0-1017.17~22.04.1) ...
Setting up libnvidia-container1:amd64 (1.14.6-1) ...
Setting up nvidia-fabricmanager-535 (535.161.07-0ubuntu0.22.04.1) ...
nvidia-fabricmanager.service is a disabled or a static unit not running, not starting it.
Setting up linux-azure-6.5-cloud-tools-6.5.0-1017 (6.5.0-1017.17~22.04.1) ...
Setting up linux-tools-azure (6.5.0.1017.17~22.04.1) ...
Setting up libnvidia-container-tools (1.14.6-1) ...
Setting up linux-headers-6.5.0-1017-azure (6.5.0-1017.17~22.04.1) ...
/etc/kernel/header_postinst.d/dkms:
 * dkms: running auto installation service for kernel 6.5.0-1017-azure
   ...done.
Setting up linux-image-6.5.0-1017-azure (6.5.0-1017.17~22.04.1) ...
I: /boot/vmlinuz is now a symlink to vmlinuz-6.5.0-1017-azure
I: /boot/initrd.img is now a symlink to initrd.img-6.5.0-1017-azure
Setting up linux-cloud-tools-6.5.0-1017-azure (6.5.0-1017.17~22.04.1) ...
Setting up linux-headers-azure (6.5.0.1017.17~22.04.1) ...
Setting up linux-image-azure (6.5.0.1017.17~22.04.1) ...
Setting up nvidia-container-toolkit (1.14.6-1) ...
Setting up linux-cloud-tools-azure (6.5.0.1017.17~22.04.1) ...
Setting up linux-azure (6.5.0.1017.17~22.04.1) ...
Processing triggers for libc-bin (2.35-0ubuntu3.6) ...
Processing triggers for linux-image-6.5.0-1017-azure (6.5.0-1017.17~22.04.1) ...
/etc/kernel/postinst.d/dkms:
 * dkms: running auto installation service for kernel 6.5.0-1017-azure
   ...done.
/etc/kernel/postinst.d/initramfs-tools:
update-initramfs: Generating /boot/initrd.img-6.5.0-1017-azure
/etc/kernel/postinst.d/zz-update-grub:
Sourcing file `/etc/default/grub'
Sourcing file `/etc/default/grub.d/40-force-partuuid.cfg'
Sourcing file `/etc/default/grub.d/50-cloudimg-settings.cfg'
Sourcing file `/etc/default/grub.d/init-select.cfg'
Generating grub configuration file ...
GRUB_FORCE_PARTUUID is set, will attempt initrdless boot
Found linux image: /boot/vmlinuz-6.5.0-1017-azure
Found initrd image: /boot/initrd.img-6.5.0-1017-azure
Found linux image: /boot/vmlinuz-6.2.0-1011-azure
Found initrd image: /boot/initrd.img-6.2.0-1011-azure
Warning: os-prober will not be executed to detect other bootable partitions.
Systems on them will not be added to the GRUB boot configuration.
Check GRUB_DISABLE_OS_PROBER documentation entry.
Adding boot menu entry for UEFI Firmware Settings ...
done
Scanning processes...
Scanning linux images...

No services need to be restarted.

No containers need to be restarted.

No user sessions are running outdated binaries.

No VM guests are running outdated hypervisor (qemu) binaries on this host.

Current status: 0 (-10) upgradable.

After running this command I can see that all packages have been upgraded:

# apt list --upgradable
Listing... Done

The new kernel version is:

$ uname -a
Linux ************** 6.5.0-1017-azure #17~22.04.1-Ubuntu SMP Sat Mar  9 04:50:38 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

But after rebooting the VM, the NVidia driver no longer works:

===============================================================
--------         trying to restart the session         --------
===============================================================

Expanded Security Maintenance for Applications is not enabled.

0 updates can be applied immediately.

Enable ESM Apps to receive additional future security updates.
See https://ubuntu.com/esm or run: sudo pro status

======================================================================================
Automatic installation of updates:      Enabled (All main updates)
======================================================================================


The following Azure CLI version has been pre-installed. Begin using the Azure CLI by first configuring your credentials using az login
{
  "azure-cli": "2.59.0",
  "azure-cli-core": "2.59.0",
  "azure-cli-telemetry": "1.1.0",
  "extensions": {}
}


Welcome to the NVIDIA GPU Cloud image.  This image provides an optimized
environment for running the deep learning and HPC containers from the
NVIDIA GPU Cloud Container Registry.  Many NGC containers are freely
available.  However, some NGC containers require that you log in with
a valid NGC API key in order to access them.  This is indicated by a
"pull access denied for xyz ..." or "Get xyz: unauthorized: ..." error
message from the daemon.

Documentation on using this image and accessing the NVIDIA GPU Cloud
Container Registry can be found at
  http://docs.nvidia.com/ngc/index.html

Last login: Tue Apr  2 09:59:32 2024 from 17*************
Installing drivers ...
modprobe: FATAL: Module nvidia not found in directory /lib/modules/6.5.0-1017-azure
Install complete

$ ls -la /lib/modules
total 16
drwxr-xr-x  4 root root 4096 Apr  2 10:08 .
drwxr-xr-x 91 root root 4096 Apr  2 10:09 ..
drwxr-xr-x  5 root root 4096 Apr  2 09:57 6.2.0-1011-azure
drwxr-xr-x  5 root root 4096 Apr  2 10:09 6.5.0-1017-azure


$ ls -la /lib/modules/6.5.0-1017-azure
total 1240
drwxr-xr-x  5 root root   4096 Apr  2 10:09 .
drwxr-xr-x  4 root root   4096 Apr  2 10:08 ..
lrwxrwxrwx  1 root root     39 Mar  9 03:35 build -> /usr/src/linux-headers-6.5.0-1017-azure
drwxr-xr-x  2 root root   4096 Mar  9 03:35 initrd
drwxr-xr-x 13 root root   4096 Apr  2 10:08 kernel
-rw-r--r--  1 root root  96434 Apr  2 10:09 modules.alias
-rw-r--r--  1 root root 102556 Apr  2 10:09 modules.alias.bin
-rw-r--r--  1 root root   8977 Mar  9 03:35 modules.builtin
-rw-r--r--  1 root root  27445 Apr  2 10:09 modules.builtin.alias.bin
-rw-r--r--  1 root root  11156 Apr  2 10:09 modules.builtin.bin
-rw-r--r--  1 root root  80140 Mar  9 03:35 modules.builtin.modinfo
-rw-r--r--  1 root root  75461 Apr  2 10:09 modules.dep
-rw-r--r--  1 root root 108685 Apr  2 10:09 modules.dep.bin
-rw-r--r--  1 root root    240 Apr  2 10:09 modules.devname
-rw-r--r--  1 root root 103585 Mar  9 03:35 modules.order
-rw-r--r--  1 root root    940 Apr  2 10:09 modules.softdep
-rw-r--r--  1 root root 276670 Apr  2 10:09 modules.symbols
-rw-r--r--  1 root root 324428 Apr  2 10:09 modules.symbols.bin
drwxr-xr-x  3 root root   4096 Apr  2 10:08 vdso


$ ls -la /lib/modules/6.5.0-1017-azure/kernel/
total 52
drwxr-xr-x 13 root root 4096 Apr  2 10:08 .
drwxr-xr-x  5 root root 4096 Apr  2 10:09 ..
drwxr-xr-x  3 root root 4096 Apr  2 10:08 arch
drwxr-xr-x  2 root root 4096 Apr  2 10:08 block
drwxr-xr-x  4 root root 4096 Apr  2 10:08 crypto
drwxr-xr-x 36 root root 4096 Apr  2 10:08 drivers
drwxr-xr-x 23 root root 4096 Apr  2 10:08 fs
drwxr-xr-x  9 root root 4096 Apr  2 10:08 lib
drwxr-xr-x 40 root root 4096 Apr  2 10:08 net
drwxr-xr-x  3 root root 4096 Apr  2 10:08 ubuntu
drwxr-xr-x  2 root root 4096 Apr  2 10:08 v4l2loopback
drwxr-xr-x  3 root root 4096 Apr  2 10:08 virt
drwxr-xr-x  2 root root 4096 Apr  2 10:08 zfs


$ du -hs /lib/modules/6.5.0-1017-azure
116M    /lib/modules/6.5.0-1017-azure

The problem appears to be that the "kernel" directory of the 6.5.0 kernel contains no subdirectories with NVidia driver modules.

For the 6.2.0 kernel, however, those subdirectories are present:

$ ls -la /lib/modules/6.2.0-1011-azure/
total 1244
drwxr-xr-x  5 root root   4096 Apr  2 09:57 .
drwxr-xr-x  4 root root   4096 Apr  2 10:08 ..
lrwxrwxrwx  1 root root     39 Aug 23  2023 build -> /usr/src/linux-headers-6.2.0-1011-azure
drwxr-xr-x  2 root root   4096 Aug 23  2023 initrd
drwxr-xr-x 20 root root   4096 Apr  2 09:57 kernel
-rw-r--r--  1 root root  96591 Apr  2 09:57 modules.alias
-rw-r--r--  1 root root 103045 Apr  2 09:57 modules.alias.bin
-rw-r--r--  1 root root   9831 Aug 23  2023 modules.builtin
-rw-r--r--  1 root root  25717 Apr  2 09:57 modules.builtin.alias.bin
-rw-r--r--  1 root root  12955 Apr  2 09:57 modules.builtin.bin
-rw-r--r--  1 root root  78550 Aug 23  2023 modules.builtin.modinfo
-rw-r--r--  1 root root  77269 Apr  2 09:57 modules.dep
-rw-r--r--  1 root root 110485 Apr  2 09:57 modules.dep.bin
-rw-r--r--  1 root root    240 Apr  2 09:57 modules.devname
-rw-r--r--  1 root root 101308 Aug 23  2023 modules.order
-rw-r--r--  1 root root    652 Apr  2 09:57 modules.softdep
-rw-r--r--  1 root root 278685 Apr  2 09:57 modules.symbols
-rw-r--r--  1 root root 325721 Apr  2 09:57 modules.symbols.bin
drwxr-xr-x  3 root root   4096 Sep  8  2023 vdso

$ ls -la /lib/modules/6.2.0-1011-azure/kernel/
total 80
drwxr-xr-x 20 root root 4096 Apr  2 09:57 .
drwxr-xr-x  5 root root 4096 Apr  2 09:57 ..
drwxr-xr-x  3 root root 4096 Sep  8  2023 arch
drwxr-xr-x  2 root root 4096 Sep  8  2023 block
drwxr-xr-x  4 root root 4096 Sep  8  2023 crypto
drwxr-xr-x 35 root root 4096 Sep  8  2023 drivers
drwxr-xr-x 24 root root 4096 Sep  8  2023 fs
drwxr-xr-x  9 root root 4096 Sep  8  2023 lib
drwxr-xr-x 40 root root 4096 Sep  8  2023 net
drwxr-xr-x  3 root root 4096 Apr  2 09:57 nvidia-390
drwxr-xr-x  3 root root 4096 Apr  2 09:57 nvidia-450srv
drwxr-xr-x  3 root root 4096 Apr  2 09:57 nvidia-470
drwxr-xr-x  3 root root 4096 Apr  2 09:57 nvidia-470srv
drwxr-xr-x  3 root root 4096 Apr  2 09:57 nvidia-525
drwxr-xr-x  3 root root 4096 Apr  2 09:57 nvidia-525srv
drwxr-xr-x  3 root root 4096 Apr  2 09:57 nvidia-535
drwxr-xr-x  3 root root 4096 Apr  2 09:57 nvidia-535srv
drwxr-xr-x  3 root root 4096 Sep  8  2023 ubuntu
drwxr-xr-x  3 root root 4096 Sep  8  2023 virt
drwxr-xr-x  2 root root 4096 Sep  8  2023 zfs

$ du -hs /lib/modules/6.2.0-1011-azure/
512M    /lib/modules/6.2.0-1011-azure/
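
The directory comparison above can also be scripted. The following is a minimal sketch, where `list_nvidia_dirs` is a hypothetical helper name and the modules root is passed in explicitly so it can be pointed at any tree:

```shell
# list_nvidia_dirs ROOT KERNEL: print the names of the nvidia-* directories
# found under ROOT/KERNEL/kernel, one per line (prints nothing if none exist).
list_nvidia_dirs() {
    for d in "$1/$2"/kernel/nvidia-*; do
        # An unmatched glob stays literal, so test that the path really exists.
        if [ -d "$d" ]; then
            basename "$d"
        fi
    done
    return 0
}
```

Calling `list_nvidia_dirs /lib/modules 6.2.0-1011-azure` and then `list_nvidia_dirs /lib/modules 6.5.0-1017-azure` should reproduce the difference seen in the listings: eight nvidia-* entries for the old kernel and none for the new one.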

I could not find the driver via apt update and apt install. So my question is: how do I install or upgrade this driver the proper way?
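
One possible starting point, offered as a sketch rather than a confirmed fix: if the nvidia-* directories of the old kernel are owned by a package, `dpkg -S /lib/modules/6.2.0-1011-azure/kernel/nvidia-535srv` would report its name, and a package with the same name for the new kernel might exist in the archive. The helper below (`retarget_pkg`, a hypothetical name) only performs the string substitution. The package name in the comment is an illustrative assumption and has not been verified against the Ubuntu archive.

```shell
# retarget_pkg NAME OLD NEW: replace the first occurrence of kernel release
# OLD inside package name NAME with NEW. Returns 1 if NAME lacks OLD.
retarget_pkg() {
    case "$1" in
        *"$2"*)
            head=${1%%"$2"*}   # everything before the first OLD
            tail=${1#*"$2"}    # everything after the first OLD
            printf '%s%s%s\n' "$head" "$3" "$tail"
            ;;
        *)
            return 1
            ;;
    esac
}

# Hypothetical example (the package name is an assumption; check dpkg -S first):
# retarget_pkg linux-modules-nvidia-535-server-6.2.0-1011-azure \
#     6.2.0-1011-azure 6.5.0-1017-azure
```

The resulting name could then be checked with `apt-cache policy` before attempting an install.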
