Random freezes in Kubuntu 20.04

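This is the first time I have run into an Ubuntu problem on my machine. I recently replaced my computer's SSD with a brand-new one; it runs fine under Windows, and its firmware is up to date.

Hardware

  • Kingston A2000 NVMe 500 GB (Btrfs and XFS)
  • Hybrid graphics (Intel HD 530, NVIDIA GeForce GTX 950M)

Software

  • NVIDIA driver 440 (from the official repository, PRIME profile: On-Demand)
  • CUDA driver (from the official repository)
  • Linux kernel 5.4.0-42-generic (Secure Boot enabled)

Sometimes, while I am using the laptop, KWin stops working: I cannot open the application launcher, although I can still switch windows with Alt+Tab. A few seconds later the screen freezes completely, I lose control of the mouse, and the temperature starts to rise. I cannot switch to another console (Ctrl+Alt+F2) to check for errors, so the only way to restart the computer is the Magic SysRq key + REISUB.

Relevant information about my system:

BIOS version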

sudo dmidecode -s bios-version
E5CN63WW

RAM and swap data:

free -h
              total        used        free      shared  buff/cache   available
Mem:           15Gi       3,9Gi       7,0Gi       1,3Gi       4,5Gi        10Gi
Swap:         3,8Gi       1,8Gi       2,0Gi

Swappiness

sysctl vm.swappiness
vm.swappiness = 60

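The kernel log from journalctl -k -b -1 shows nothing relevant (as far as I can tell), but I am including the warning and alert messages below in case I have missed something.

First log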

aug 11 20:49:22 josejacomeb-Lenovo-ideapad-700-15ISK kernel: IRQ 125: no longer affine to CPU1
aug 11 20:49:22 josejacomeb-Lenovo-ideapad-700-15ISK kernel: IRQ 140: no longer affine to CPU4
aug 11 20:49:22 josejacomeb-Lenovo-ideapad-700-15ISK kernel: IRQ 124: no longer affine to CPU6
aug 11 20:49:22 josejacomeb-Lenovo-ideapad-700-15ISK kernel: IRQ 128: no longer affine to CPU6
aug 11 20:49:22 josejacomeb-Lenovo-ideapad-700-15ISK kernel: IRQ 138: no longer affine to CPU7
aug 11 20:49:22 josejacomeb-Lenovo-ideapad-700-15ISK kernel: ACPI: button: The lid device is not compliant to SW_LID.
aug 11 20:49:22 josejacomeb-Lenovo-ideapad-700-15ISK kernel: iwlwifi 0000:02:00.0: FW already configured (0) - re-configuring
aug 11 20:49:23 josejacomeb-Lenovo-ideapad-700-15ISK kernel: Bluetooth: hci0: unexpected event for opcode 0xfc2f
aug 11 20:49:29 josejacomeb-Lenovo-ideapad-700-15ISK kernel: kauditd_printk_skb: 43 callbacks suppressed
aug 11 20:49:55 josejacomeb-Lenovo-ideapad-700-15ISK kernel: xfs filesystem being remounted at /run/systemd/unit-root/var/cache/private/fwupdmgr supports timestamps until 2038 (0x7fffffff)

Second log

aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: [Firmware Bug]: TPM Final Events table missing or invalid
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: TAA CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/tsx_async_abort.html for more details.
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel:  #5 #6 #7
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: ACPI BIOS Error (bug): Failure creating named object [\_PR.CPU0._PPC], AE_ALREADY_EXISTS (20190816/dswload2-326)
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20190816/psobject-220)
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: ACPI BIOS Error (bug): Failure creating named object [\_PR.CPU0._PCT], AE_ALREADY_EXISTS (20190816/dswload2-326)
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20190816/psobject-220)
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: ACPI BIOS Error (bug): Failure creating named object [\_PR.CPU0._PSS], AE_ALREADY_EXISTS (20190816/dswload2-326)
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20190816/psobject-220)
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: ACPI BIOS Error (bug): Failure creating named object [\_PR.CPU0.LPSS], AE_ALREADY_EXISTS (20190816/dswload2-326)
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20190816/psobject-220)
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: ACPI BIOS Error (bug): Failure creating named object [\_PR.CPU0.TPSS], AE_ALREADY_EXISTS (20190816/dswload2-326)
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20190816/psobject-220)
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: ACPI BIOS Error (bug): Failure creating named object [\_PR.CPU0.PSDF], AE_ALREADY_EXISTS (20190816/dswload2-326)
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20190816/psobject-220)
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: ACPI BIOS Error (bug): Failure creating named object [\_PR.CPU0._PSD], AE_ALREADY_EXISTS (20190816/dswload2-326)
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20190816/psobject-220)
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: ACPI BIOS Error (bug): Failure creating named object [\_PR.CPU0.HPSD], AE_ALREADY_EXISTS (20190816/dswload2-326)
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20190816/psobject-220)
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: ACPI BIOS Error (bug): Failure creating named object [\_PR.CPU0.SPSD], AE_ALREADY_EXISTS (20190816/dswload2-326)
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20190816/psobject-220)
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: platform MSFT0101:00: failed to claim resource 1: [mem 0xfed40000-0xfed40fff]
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: acpi MSFT0101:00: platform device creation failed: -16
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: usb: port power management may be unreliable
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: platform eisa.0: EISA: Cannot allocate resource for mainboard
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: platform eisa.0: Cannot allocate resource for EISA slot 1
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: platform eisa.0: Cannot allocate resource for EISA slot 2
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: platform eisa.0: Cannot allocate resource for EISA slot 3
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: platform eisa.0: Cannot allocate resource for EISA slot 4
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: platform eisa.0: Cannot allocate resource for EISA slot 5
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: platform eisa.0: Cannot allocate resource for EISA slot 6
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: platform eisa.0: Cannot allocate resource for EISA slot 7
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: platform eisa.0: Cannot allocate resource for EISA slot 8
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: acpi PNP0C14:02: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:01)
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: r8169 0000:03:00.0: can't disable ASPM; OS doesn't have ASPM control
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: nvme nvme0: missing or invalid SUBNQN field.
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: xfs filesystem being remounted at / supports timestamps until 2038 (0x7fffffff)
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: asus_wmi: ASUS Management GUID not found
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: uvcvideo 1-5:1.0: Entity type for entity Realtek Extended Controls Unit was not initialized!
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: uvcvideo 1-5:1.0: Entity type for entity Extension 4 was not initialized!
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: uvcvideo 1-5:1.0: Entity type for entity Processing 2 was not initialized!
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: uvcvideo 1-5:1.0: Entity type for entity Camera 1 was not initialized!
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: nvidia: loading out-of-tree module taints kernel.
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: nvidia: module license 'NVIDIA' taints kernel.
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: Disabling lock debugging due to kernel taint
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: thermal thermal_zone3: failed to read out thermal zone (-61)
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  440.100  Fri May 29 08:45:51 UTC 2020
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: Bluetooth: hci0: unexpected event for opcode 0xfc2f
aug 11 21:02:01 josejacomeb-Lenovo-ideapad-700-15ISK kernel: iwlwifi 0000:02:00.0: FW already configured (0) - re-configuring
aug 11 21:02:01 josejacomeb-Lenovo-ideapad-700-15ISK kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190816/nsarguments-59)

Third log

aug 11 21:44:10 josejacomeb-Lenovo-ideapad-700-15ISK kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  440.100  Fri May 29 08:45:51 UTC 2020
aug 11 21:44:10 josejacomeb-Lenovo-ideapad-700-15ISK kernel: thermal thermal_zone3: failed to read out thermal zone (-61)
aug 11 21:44:10 josejacomeb-Lenovo-ideapad-700-15ISK kernel: Bluetooth: hci0: unexpected event for opcode 0xfc2f
aug 11 21:44:13 josejacomeb-Lenovo-ideapad-700-15ISK kernel: iwlwifi 0000:02:00.0: FW already configured (0) - re-configuring
aug 11 21:44:13 josejacomeb-Lenovo-ideapad-700-15ISK kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190816/nsarguments-59)
aug 11 22:21:25 josejacomeb-Lenovo-ideapad-700-15ISK kernel: iwlwifi 0000:02:00.0: FW already configured (0) - re-configuring
aug 11 22:21:26 josejacomeb-Lenovo-ideapad-700-15ISK kernel: Bluetooth: hci0: unexpected event for opcode 0xfc2f
aug 11 22:22:31 josejacomeb-Lenovo-ideapad-700-15ISK kernel: kauditd_printk_skb: 37 callbacks suppressed
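
A rough way to narrow logs like the three above down to storage-related messages is a keyword filter. The sketch below is only an illustration: the function name storage_lines and the keyword list are assumptions, not part of the original troubleshooting.

```shell
# Hypothetical helper: keep only kernel log lines that mention the NVMe
# driver, a filesystem, or an I/O problem. The keyword list is a guess.
storage_lines() {
    grep -E -i 'nvme|ext4-fs|xfs|i/o error|read-only'
}

# Intended use on the previous boot's kernel log (warnings and worse):
#   journalctl -k -b -1 -p warning --no-pager | storage_lines
```

Applied to the second log, this would keep the "nvme nvme0: missing or invalid SUBNQN field." and XFS remount lines and drop the Bluetooth, ACPI and EISA messages.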

Update

I did a clean reinstall of Kubuntu 20.04.1 on EXT4 partitions, and the problem looks like an SSD error. The new information:

  • nvme0n1p5: / partition
  • nvme0n1p4: /home partition

It happens randomly while I am using the computer, and the machine freezes completely.

[ 3378.408344] systemd-journald[423]: Failed to write entry (22 items, 780 bytes), ignoring: Read-only
[ 3378.408611] systemd-journald[423]: Failed to write entry (22 items, 769 bytes), ignoring: Read-only

Another log from a freeze:

[  827.214225] EXT4-fs error (device nvme0n1p5): __ext4_find_entry:1531: inode #3407921: comm gmain: reading directory lblock 0
[  827.214749] EXT4-fs error (device nvme0n1p5): __ext4_find_entry:1531: inode #3407921: comm gmain: reading directory lblock 0
[  827.214764] EXT4-fs error (device nvme0n1p5): __ext4_find_entry:1531: inode #3407921: comm gmain: reading directory lblock 0

Sometimes when I shut down the laptop, this error appears:

[ 16918.166564] systemd-shutdown[1]: Remounting '/' timed out, issuing SIGKILL to PID 11240.
[ 16982.141788] nvme nvme0: Device not ready; aborting reset
[ 16982.143784] nvme nvme0: Removing after probe failure status: -19

Update 2

Using the Kubuntu live ISO, I ran fsck on the partitions and it found no problems.

root@kubuntu:/home/kubuntu# fsck /dev/nvme0n1p3 
fsck from util-linux 2.34
e2fsck 1.45.5 (07-Jan-2020)
/dev/nvme0n1p3: clean, 257827/6111232 files, 8741020/24413952 blocks
root@kubuntu:/home/kubuntu# echo $?
0
root@kubuntu:/home/kubuntu# fsck /dev/nvme0n1p5
fsck from util-linux 2.34
e2fsck 1.45.5 (07-Jan-2020)
/dev/nvme0n1p5: clean, 754959/6447104 files, 10749435/25785856 blocks
root@kubuntu:/home/kubuntu# echo $?
0

Problem when rebooting:

nvme nvme0: Device not ready; aborting reset
nvme nvme0: Abort status: 0x371
nvme nvme0: Abort status: 0x371
nvme nvme0: Abort status: 0x371
Remounting '/' timed out, issuing SIGKILL to PID 7544.

The SMART analysis follows:

sudo smartctl -i /dev/nvme0
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-42-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       KINGSTON SA2000M8500G
Serial Number:                      50026B7683BC98CE
Firmware Version:                   S5Z42105
PCI Vendor/Subsystem ID:            0x2646
IEEE OUI Identifier:                0x0026b7
Controller ID:                      1
Number of Namespaces:               1
Namespace 1 Size/Capacity:          500.107.862.016 [500 GB]
Namespace 1 Utilization:            142.133.460.992 [142 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            0026b7 683bc98ce5
Local Time is:                      Wed Aug 26 23:49:45 2020 CEST


sudo smartctl -a /dev/nvme0         
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-42-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       KINGSTON SA2000M8500G
Serial Number:                      50026B7683BC98CE
Firmware Version:                   S5Z42105
PCI Vendor/Subsystem ID:            0x2646
IEEE OUI Identifier:                0x0026b7
Controller ID:                      1
Number of Namespaces:               1
Namespace 1 Size/Capacity:          500.107.862.016 [500 GB]
Namespace 1 Utilization:            142.114.676.736 [142 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            0026b7 683bc98ce5
Local Time is:                      Wed Aug 26 23:51:50 2020 CEST
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     75 Celsius
Critical Comp. Temp. Threshold:     80 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.00W       -        -    0  0  0  0        0       0
 1 +     4.60W       -        -    1  1  1  1        0       0
 2 +     3.80W       -        -    2  2  2  2        0       0
 3 -   0.0450W       -        -    3  3  3  3     2000    2000
 4 -   0.0040W       -        -    4  4  4  4    15000   15000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        30 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    3.966.522 [2,03 TB]
Data Units Written:                 6.036.943 [3,09 TB]
Host Read Commands:                 38.899.250
Host Write Commands:                46.064.389
Controller Busy Time:               601
Power Cycles:                       390
Power On Hours:                     241
Unsafe Shutdowns:                   160
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Thermal Temp. 1 Transition Count:   7
Thermal Temp. 1 Total Time:         24

Error Information (NVMe Log 0x01, max 256 entries)
No Errors Logged

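SSD firmware

Thanks for reading. What am I doing wrong? Any comments are much appreciated!

Regards

Answer 1

The problem is an SSD feature: Autonomous Power State Transition (APST) is causing the freezes. To mitigate it until a fix is released, add nvme_core.default_ps_max_latency_us=0 to the GRUB_CMDLINE_LINUX_DEFAULT option in /etc/default/grub, then run sudo update-grub and reboot. For example: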

GRUB_TIMEOUT=10
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nvme_core.default_ps_max_latency_us=0"
GRUB_CMDLINE_LINUX=""
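
To confirm after a reboot that the parameter actually reached the kernel, the running command line can be checked. This is a minimal sketch: has_kernel_param is a made-up helper name, and the sysfs path in the comments is where the nvme_core module normally exposes this setting.

```shell
# Hypothetical helper: succeed if parameter $1 appears as a whole word in the
# kernel command line text $2.
has_kernel_param() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Intended use on the running system:
#   has_kernel_param nvme_core.default_ps_max_latency_us=0 "$(cat /proc/cmdline)" \
#       && echo "APST latency limit is set"
#   cat /sys/module/nvme_core/parameters/default_ps_max_latency_us   # should read 0
```

If nvme-cli happens to be installed, sudo nvme get-feature /dev/nvme0 -f 0x0c -H should also report the drive's APST feature state.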

Answer 2

BIOS

Your current BIOS version is E5CN63WW.

MDS

You have the MDS and TAA bugs:

aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: TAA CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/tsx_async_abort.html for more details.

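Mitigation control on the kernel command line

The kernel command line allows controlling the MDS mitigations at boot time with the option "mds=". The valid arguments for this option are:

full

If the CPU is vulnerable, enable all available mitigations for the MDS vulnerability: CPU buffer clearing on exit to userspace and when entering a VM. Idle transitions are protected as well if SMT is enabled.

It does not automatically disable SMT.

full,nosmt

The same as mds=full, with SMT disabled on vulnerable CPUs. This is the complete mitigation.

off

Disables MDS mitigations completely.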


sudo -H gedit /etc/default/grub

Change:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"

to:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash mds=full,nosmt"

Save the file and quit gedit.

sudo update-grub

reboot

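Note: be aware that performance takes a large hit on multi-CPU or multi-core configurations.

Note: if the performance loss is too great, try mds=full instead of mds=full,nosmt.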
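
The manual edit of /etc/default/grub could also be scripted. The sketch below is an assumption-laden illustration, not part of the original answer: add_grub_param is a hypothetical helper, it relies on GNU sed, and it expects the existing line to be double-quoted.

```shell
# Hypothetical helper: append one parameter to GRUB_CMDLINE_LINUX_DEFAULT in
# the given file unless it is already there. Assumes GNU sed, and that neither
# the existing value nor the parameter contains '|', '&' or backslashes.
add_grub_param() {
    param=$1
    file=$2
    line=$(grep -m1 '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file") || return 1
    value=${line#GRUB_CMDLINE_LINUX_DEFAULT=\"}
    value=${value%\"}
    case " $value " in
        *" $param "*) return 0 ;;  # already present, nothing to do
    esac
    sed -i "s|^GRUB_CMDLINE_LINUX_DEFAULT=.*|GRUB_CMDLINE_LINUX_DEFAULT=\"${value:+$value }$param\"|" "$file"
}

# Intended use, working on a copy first:
#   cp /etc/default/grub /tmp/grub.new
#   add_grub_param mds=full,nosmt /tmp/grub.new
#   sudo cp /tmp/grub.new /etc/default/grub && sudo update-grub
# After the reboot, the kernel reports the resulting state in:
#   cat /sys/devices/system/cpu/vulnerabilities/mds
```

Working on a copy keeps the original file untouched until the result has been inspected.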

NVMe

Kingston A2000 NVMe 500 GB

You may have a firmware problem:

aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: nvme nvme0: missing or invalid SUBNQN field.

Visit the manufacturer's website and check whether newer firmware is available.
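
To compare against whatever version the manufacturer lists, the installed firmware revision can be read from the smartctl output already shown in the question. A small sketch, with firmware_version as a hypothetical helper name:

```shell
# Hypothetical helper: print the value of the "Firmware Version:" line from
# `smartctl -i` output supplied on stdin.
firmware_version() {
    awk -F': *' '/^Firmware Version:/ { print $2; exit }'
}

# Intended use:
#   sudo smartctl -i /dev/nvme0 | firmware_version
```

For the output posted in the question this would print S5Z42105.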

TPM

You have a TPM error:

aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: platform MSFT0101:00: failed to claim resource 1: [mem 0xfed40000-0xfed40fff]
aug 11 21:01:58 josejacomeb-Lenovo-ideapad-700-15ISK kernel: acpi MSFT0101:00: platform device creation failed: -16

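Check that your Software Updates are current and that you are running the latest kernel.

Check the TPM settings in the BIOS, and disable the TPM if possible.

Memory

Your swap and vm.swappiness settings look fine.

Go to https://www.memtest86.com/ and download and run their free memtest to test your memory. Complete at least one pass of all 4/4 tests to confirm that the memory is good. This may take several hours.

NVIDIA

You are using NVIDIA driver 440. A newer version, 450.57, is now available for download.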
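
Before and after a driver upgrade, the version of the kernel module that actually loaded can be read from the NVRM line in the kernel log. This is a rough sketch; nvidia_module_version is a hypothetical helper and the pattern is based only on the log lines quoted in the question.

```shell
# Hypothetical helper: extract the NVIDIA kernel module version from kernel
# log text on stdin, using the last "NVRM: loading ..." line found.
nvidia_module_version() {
    sed -n 's/.*NVRM: loading NVIDIA UNIX .* Kernel Module  *\([0-9][0-9.]*\) .*/\1/p' | tail -n 1
}

# Intended use on the current boot:
#   journalctl -k -b --no-pager | nvidia_module_version
```

If the NVIDIA tools are installed, nvidia-smi --query-gpu=driver_version --format=csv,noheader should report the same number.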
