First, my system:
- AMD Threadripper 1950X
- 2x Vega FE + Radeon VII
- Ubuntu 18.04, kernel 4.18.0-16-generic
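After an upgrade, my system no longer boots. I went into GRUB and removed "quiet splash" from the kernel parameters to see the boot log; unfortunately it freezes at a different point each time, with every line showing a green "OK". Then, following some online guides, I added "nomodeset" to the GRUB kernel parameters, and with that the system boots normally. However, the kernel then does not load the GPUs: I can no longer see them in clinfo, and I cannot use them.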
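For anyone checking the same thing: "nomodeset" disables kernel mode setting, which is why the GPUs are not brought up while it is set. Below is a minimal sketch for confirming whether the running kernel was booted with it, assuming a Linux system that exposes the boot parameters in /proc/cmdline. The function names are my own and purely illustrative.

```python
from pathlib import Path
from typing import List


def kernel_params(cmdline: str) -> List[str]:
    """Split a kernel command line into its whitespace-separated parameters."""
    return cmdline.split()


def has_nomodeset(cmdline: str) -> bool:
    """Return True if 'nomodeset' appears as a standalone boot parameter."""
    return "nomodeset" in kernel_params(cmdline)


if __name__ == "__main__":
    # /proc/cmdline holds the parameters the running kernel was booted with.
    path = Path("/proc/cmdline")
    if path.exists():
        print("nomodeset active:", has_nomodeset(path.read_text()))
    else:
        print("/proc/cmdline not found (not a Linux system?)")
```

If this reports that nomodeset is active, the missing devices in clinfo are expected and do not by themselves indicate a hardware fault.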
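The new kernel 4.18.0-16 was installed on 03-07, and the machine was rebooted on 03-11 without any problem, so I don't think the kernel is the cause. I tried removing rocm with autoremove, but the problem persists (after removal, the system still boots only with nomodeset).
Below is the list of upgrades installed on 2019-03-13, in case anything looks suspicious; unfortunately it is long.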
Commandline: apt upgrade
Requested-By: sandbo (1000)
Upgrade: hsa-rocr-dev:amd64 (1.1.9-49-g39f1af5, 1.1.9-55-gbac2a9b),
libxcb-present-dev:amd64 (1.13-1, 1.13-2~ubuntu18.04),
hsakmt-roct-dev:amd64 (1.0.9-111-gc65f2de, 1.0.9-121-g876627e),
libseccomp2:amd64 (2.3.1-2.1ubuntu4, 2.3.1-2.1ubuntu4.1),
hsakmt-roct:amd64 (1.0.9-111-gc65f2de, 1.0.9-121-g876627e),
virtinst:amd64 (1:1.5.1-0ubuntu1.1, 1:1.5.1-0ubuntu1.2),
libxcb-xfixes0:amd64 (1.13-1, 1.13-2~ubuntu18.04),
rock-dkms:amd64 (2.1-96, 2.2-31),
rocm-opencl:amd64 (1.2.0-2019020110, 1.2.0-2019030702),
libsystemd0:amd64 (237-3ubuntu10.13, 237-3ubuntu10.15),
libsystemd0:i386 (237-3ubuntu10.13, 237-3ubuntu10.15),
hip_base:amd64 (1.5.19025, 1.5.19055),
libxcb-present0:amd64 (1.13-1, 1.13-2~ubuntu18.04),
libxcb-present0:i386 (1.13-1, 1.13-2~ubuntu18.04),
hsa-ext-rocr-dev:amd64 (1.1.9-49-g39f1af5, 1.1.9-55-gbac2a9b),
libxcb-xfixes0-dev:amd64 (1.13-1, 1.13-2~ubuntu18.04),
rocrand:amd64 (1.8.2, 1.8.2), rocfft:amd64 (0.8.9.0, 0.9.0.0),
google-chrome-stable:amd64 (72.0.3626.121-1, 73.0.3683.75-1),
hcc:amd64 (1.3.19045, 1.3.19092),
udev:amd64 (237-3ubuntu10.13, 237-3ubuntu10.15),
libxcb-shm0:amd64 (1.13-1, 1.13-2~ubuntu18.04),
libxcb-shm0:i386 (1.13-1, 1.13-2~ubuntu18.04),
libxcb-randr0:amd64 (1.13-1, 1.13-2~ubuntu18.04),
libxcb-render0:amd64 (1.13-1, 1.13-2~ubuntu18.04),
libxcb-render0:i386 (1.13-1, 1.13-2~ubuntu18.04),
libxcb1-dev:amd64 (1.13-1, 1.13-2~ubuntu18.04),
libudev1:amd64 (237-3ubuntu10.13, 237-3ubuntu10.15),
libudev1:i386 (237-3ubuntu10.13, 237-3ubuntu10.15),
comgr:amd64 (1.1.0, 1.1.0),
libtiff5:amd64 (4.0.9-5ubuntu0.1, 4.0.9-5ubuntu0.2),
libtiff5:i386 (4.0.9-5ubuntu0.1, 4.0.9-5ubuntu0.2),
libxcb-randr0-dev:amd64 (1.13-1, 1.13-2~ubuntu18.04),
libxcb-dri3-dev:amd64 (1.13-1, 1.13-2~ubuntu18.04),
libxcb1:amd64 (1.13-1, 1.13-2~ubuntu18.04),
libxcb1:i386 (1.13-1, 1.13-2~ubuntu18.04),
libxcb-shape0-dev:amd64 (1.13-1, 1.13-2~ubuntu18.04),
libnss-myhostname:amd64 (237-3ubuntu10.13, 237-3ubuntu10.15),
libxcb-res0:amd64 (1.13-1, 1.13-2~ubuntu18.04),
rocm-libs:amd64 (2.1.96, 2.2.31),
systemd-sysv:amd64 (237-3ubuntu10.13, 237-3ubuntu10.15),
rocm-dev:amd64 (2.1.96, 2.2.31),
libxcb-xv0:amd64 (1.13-1, 1.13-2~ubuntu18.04),
rocm-utils:amd64 (2.1.96, 2.2.31),
libpam-systemd:amd64 (237-3ubuntu10.13, 237-3ubuntu10.15),
libxcb-render0-dev:amd64 (1.13-1, 1.13-2~ubuntu18.04),
libxcb-shape0:amd64 (1.13-1, 1.13-2~ubuntu18.04),
virt-manager:amd64 (1:1.5.1-0ubuntu1.1, 1:1.5.1-0ubuntu1.2),
systemd:amd64 (237-3ubuntu10.13, 237-3ubuntu10.15),
hip_doc:amd64 (1.5.19025, 1.5.19055),
libnss-systemd:amd64 (237-3ubuntu10.13, 237-3ubuntu10.15),
miopen-hip:amd64 (1.7.1, 1.7.1),
rocm-device-libs:amd64 (0.0.1, 0.0.1),
hip_hcc:amd64 (1.5.19025, 1.5.19055),
rocm-opencl-dev:amd64 (1.2.0-2019020110, 1.2.0-2019030702),
hip_samples:amd64 (1.5.19025, 1.5.19055),
libxcb-sync-dev:amd64 (1.13-1, 1.13-2~ubuntu18.04),
cxlactivitylogger:amd64 (5.6.7254, 5.6.7259),
libxcb-dri2-0-dev:amd64 (1.13-1, 1.13-2~ubuntu18.04),
libxcb-glx0:amd64 (1.13-1, 1.13-2~ubuntu18.04),
libxcb-glx0:i386 (1.13-1, 1.13-2~ubuntu18.04),
libxcb-glx0-dev:amd64 (1.13-1, 1.13-2~ubuntu18.04),
rocprofiler-dev:amd64 (1.0.0, 1.0.0),
libxcb-dri2-0:amd64 (1.13-1, 1.13-2~ubuntu18.04),
libxcb-dri2-0:i386 (1.13-1, 1.13-2~ubuntu18.04),
rocm-smi:amd64 (1.0.0-100-g3cacdb9, 1.0.0-102-gdb444a9),
libxcb-dri3-0:amd64 (1.13-1, 1.13-2~ubuntu18.04),
libxcb-dri3-0:i386 (1.13-1, 1.13-2~ubuntu18.04),
rocm-dkms:amd64 (2.1.96, 2.2.31),
libxcb-xkb1:amd64 (1.13-1, 1.13-2~ubuntu18.04),
libxcb-sync1:amd64 (1.13-1, 1.13-2~ubuntu18.04),
libxcb-sync1:i386 (1.13-1, 1.13-2~ubuntu18.04)
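Since the list is long, here is a rough sketch for narrowing it down. It parses an apt history "Upgrade:" entry and keeps only packages whose name ends in "-dkms" and whose version actually changed, since DKMS packages build kernel modules and are therefore the ones most likely to affect boot. The "-dkms" filter is only a heuristic of mine, not a diagnosis, and the sample text is a shortened excerpt of the log above.

```python
import re

# One entry looks like: name:arch (old_version, new_version)
ENTRY = re.compile(r"([^\s,():]+):([^\s,():]+) \(([^,()]+), ([^,()]+)\)")


def parse_upgrades(text):
    """Return (name, arch, old, new) tuples for every entry found in text."""
    return [m.groups() for m in ENTRY.finditer(text)]


def kernel_module_changes(entries):
    """Keep DKMS packages whose version changed (heuristic filter)."""
    return [
        (name, old, new)
        for name, _arch, old, new in entries
        if name.endswith("-dkms") and old != new
    ]


# Shortened excerpt of the upgrade list, for illustration only.
SAMPLE = """Upgrade: hsa-rocr-dev:amd64 (1.1.9-49-g39f1af5, 1.1.9-55-gbac2a9b),
rock-dkms:amd64 (2.1-96, 2.2-31),
rocrand:amd64 (1.8.2, 1.8.2), rocfft:amd64 (0.8.9.0, 0.9.0.0),
virtinst:amd64 (1:1.5.1-0ubuntu1.1, 1:1.5.1-0ubuntu1.2),
rocm-dkms:amd64 (2.1.96, 2.2.31)"""

if __name__ == "__main__":
    for name, old, new in kernel_module_changes(parse_upgrades(SAMPLE)):
        print(name, old, "->", new)
```

On Ubuntu the full log normally lives in /var/log/apt/history.log, so the same functions can be applied to that file's contents instead of SAMPLE.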
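Things I have tried:
- Changed the default display manager to lightdm: did not work
- Disabled Wayland: did not work
Since this started on the first reboot after the upgrade, I believe the hardware is fine. (I was using the GPUs normally under heavy load just a few days ago.)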
Answer 1
It has been fixed; this was a hardware-specific issue with ROCm and AMD GPUs. https://github.com/RadeonOpenCompute/ROCm/issues/735#issuecomment-473100963