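I have Ubuntu 22.04, upgraded from 20.04, which was installed in 2020. This machine ran fine under kernel 5.15.0-43. After a recent update (2023.11.28) it frequently crashes at random with I/O errors. When it crashes it cannot write to the log files, because the SSD is not writable (!). This looks like a firmware or kernel bug, because Ubuntu 22.04 with kernel 5.15.0-43 runs fine from a pendrive. Windows 11 also works well in its own partition. I have not found any hardware problem. The root directory / is mounted on /dev/nvme0n1p5 and /home is on /dev/nvme0n1p6. I need help from a firmware or kernel programmer on how to debug this. Can anyone help me solve this problem?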

[Image: error 1]

And errors like the following:

[Image: error 2]

$ cat /etc/*release

$ cat /etc/*release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.3 LTS"
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

$ udisksctl status

$ udisksctl status
MODEL                     REVISION  SERIAL               DEVICE
--------------------------------------------------------------------------
INTEL SSDPEKNW512G8       004C      BTNH05020T67512A     nvme0n1 

$ df -T

$ df -T
Filesystem     Type  1K-blocks      Used Available Use% Mounted on
tmpfs          tmpfs    780904      2232    778672   1% /run
/dev/nvme0n1p5 ext4   50080992  28573776  18930800  61% /
tmpfs          tmpfs   3904504    168276   3736228   5% /dev/shm
tmpfs          tmpfs      5120         4      5116   1% /run/lock
/dev/nvme0n1p6 ext4  334721912 207130268 110515420  66% /home
/dev/nvme0n1p1 vfat     262144     53548    208596  21% /boot/efi
tmpfs          tmpfs    780900       112    780788   1% /run/user/1000

sudo parted -l

$ sudo parted -l
Model: INTEL SSDPEKNW512G8 (nvme)
Disk /dev/nvme0n1: 512GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags: 

Number  Start   End     Size    File system     Name                          Flags
 1      1049kB  274MB   273MB   fat32           EFI system partition          boot, esp
 2      274MB   290MB   16,8MB                  Microsoft reserved partition  msftres
 3      290MB   90,3GB  90,0GB                  Basic data partition          msftdata
 4      90,3GB  91,4GB  1156MB  ntfs                                          hidden, diag
 5      91,4GB  144GB   52,4GB  ext4
 6      144GB   493GB   349GB   ext4
 7      493GB   511GB   17,6GB  linux-swap(v1)  swap                          swap
 8      512GB   512GB   210MB   fat32           Basic data partition          hidden, diag

uname -a

$ uname -a
Linux bkb 6.2.0-37-generic #38~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov  2 18:01:13 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

$ sudo dpkg --list|grep linux-image

$ sudo dpkg --list|grep linux-image
hi  linux-image-5.15.0-43-generic              5.15.0-43.46                                amd64        Signed kernel image generic
ii  linux-image-6.2.0-37-generic               6.2.0-37.38~22.04.1                         amd64        Signed kernel image generic
ii  linux-image-generic-hwe-22.04              6.2.0.37.38~22.04.15                        amd64        Generic Linux kernel image

$ sudo smartctl -a /dev/nvme0n1p6

$ sudo smartctl -a /dev/nvme0n1p6
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-6.2.0-37-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       INTEL SSDPEKNW512G8
Serial Number:                      BTNH05020T67512A
Firmware Version:                   004C
PCI Vendor/Subsystem ID:            0x8086
IEEE OUI Identifier:                0x5cd2e4
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512.110.190.592 [512 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Mon Dec  4 22:16:07 2023 CET
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     77 Celsius
Critical Comp. Temp. Threshold:     80 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     3.50W       -        -    0  0  0  0        0       0
 1 +     2.70W       -        -    1  1  1  1        0       0
 2 +     2.00W       -        -    2  2  2  2        0       0
 3 -   0.0250W       -        -    3  3  3  3     5000    5000
 4 -   0.0040W       -        -    4  4  4  4     5000    9000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        32 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    4%
Data Units Read:                    33.133.223 [16,9 TB]
Data Units Written:                 29.155.023 [14,9 TB]
Host Read Commands:                 564.528.828
Host Write Commands:                395.005.395
Controller Busy Time:               9.251
Power Cycles:                       1.459
Power On Hours:                     8.288
Unsafe Shutdowns:                   278
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

sudo nvme error-log /dev/nvme0n1p6

sudo nvme error-log  /dev/nvme0n1p6
Error Log Entries for device:nvme0n1p6 entries:64
.................
 Entry[ 0]   
.................
error_count : 0
sqid        : 0
cmdid       : 0
status_field    : 0(SUCCESS: The command completed successfully)
phase_tag   : 0
parm_err_loc    : 0
lba     : 0
nsid        : 0
vs      : 0
trtype      : The transport type is not indicated or the error is not transport related.
cs      : 0
trtype_spec_info: 0
.................
 Entry[ 1]   
.................
error_count : 0
sqid        : 0
cmdid       : 0
status_field    : 0(SUCCESS: The command completed successfully)
phase_tag   : 0
parm_err_loc    : 0
lba     : 0
nsid        : 0
vs      : 0
trtype      : The transport type is not indicated or the error is not transport related.
cs      : 0
trtype_spec_info: 0
.................
 Entry[ 2]   
.................
error_count : 0
sqid        : 0
cmdid       : 0
status_field    : 0(SUCCESS: The command completed successfully)
phase_tag   : 0
parm_err_loc    : 0
lba     : 0
nsid        : 0
vs      : 0
trtype      : The transport type is not indicated or the error is not transport related.
cs      : 0
trtype_spec_info: 0
.................
...
...
.................
error_count : 0
sqid        : 0
cmdid       : 0
status_field    : 0(SUCCESS: The command completed successfully)
phase_tag   : 0
parm_err_loc    : 0
lba     : 0
nsid        : 0
vs      : 0
trtype      : The transport type is not indicated or the error is not transport related.
cs      : 0
trtype_spec_info: 0
.................

sudo fsck -f /dev/nvme0n1p6

sudo fsck -f /dev/nvme0n1p6

fsck from util-linux 2.37.2
e2fsck 1.46.5 (30-Dec-2021)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/nvme0n1p6: 2052426/21331968 files (0.9% non-contiguous), 56383172/85299200 blocks
echo $?
0

e2fsck bad-block check (non-destructive read-write test) found no errors

sudo e2fsck -fccky /dev/nvme0n1p6

$ fwupdmgr update

$ fwupdmgr update
Devices with no available firmware updates: 
 • ELAN1300:00 04F3:3104
 • INTEL SSDPEKNW512G8
 • System Firmware
 • UEFI Device Firmware
 • UEFI dbx
No updatable devices

$ fwupdmgr get-devices

$ fwupdmgr get-devices
VivoBook_ASUSLaptop X521EA_K533EA
├─ELAN1300:00 04F3:3104:
│     Device ID:          3ab6179e75c876a50f6dcb40ae0a83ac471fb394
│     Summary:            Touchpad
│     Current version:    0x0001
│     Bootloader Version: 0x0000
│     Vendor:             HIDRAW:0x04F3
│     GUIDs:              dcbfd629-c2d8-53b4-bb8b-306fd916f0e0
│                         9573bac6-3cee-5094-90cd-4d7dc8122a8e
│                         bd873b66-c478-5130-9968-00dc0d89d15d
│                         3852e430-731f-55fa-a0e1-f2ff3b818c9f
│                         646d07fa-2f99-5404-870e-e834a3386353
│     Device Flags:       • Internal device
│                         • Updatable
├─INTEL SSDPEKNW512G8:
│     Device ID:          c430a03ca2a65dfe2412ff950c79c51f6aec1317
│     Summary:            NVM Express solid state drive
│     Current version:    004C
│     Vendor:             Intel Corporation (NVME:0x8086)
│     GUIDs:              c5fe8b70-dc9a-5c3b-9634-659091d29812
│                         1122104f-b10a-5f32-bc13-7a1ac0f52ea2
│                         c6cd9ab0-8f20-512e-9e1f-1af55b8454b9
│                         82741c78-f5dc-5c23-a152-00de5799edc8
│                         2b8c6418-6719-51b3-a700-f6061c86874b
│     Device Flags:       • Updatable
│                         • System requires external power source
│                         • Needs a reboot after installation
├─System Firmware:
│ │   Device ID:          a45df35ac0e948ee180fe216a5f703f32dda163f
│ │   Summary:            UEFI ESRT device
│ │   Current version:    787
│ │   Minimum Version:    787
│ │   Vendor:             ASUSTeK COMPUTER INC. (DMI:American Megatrends International, LLC.)
│ │   Update State:       Success
│ │   GUIDs:              60c270d7-c1c7-55d6-a556-f8ed502657b8
│ │                       230c8b18-8d9b-53ec-838b-6cfc0383493a
│ │   Device Flags:       • Internal device
│ │                       • Updatable
│ │                       • System requires external power source
│ │                       • Needs a reboot after installation
│ │                       • Cryptographic hash verification is available
│ │                       • Device is usable for the duration of the update
│ │                       • Full disk encryption secrets may be invalidated when updating
│ │ 
│ └─UEFI dbx:
│       Device ID:        362301da643102b9f38477387e2193e57abaa590
│       Summary:          UEFI revocation database
│       Current version:  272
│       Minimum Version:  272
│       Vendor:           UEFI:Linux Foundation
│       Install Duration: 1 second
│       GUIDs:            6c9777b8-19f2-5e2c-9210-66ef3691a9f3
│                         c8749f7f-439b-5c3c-a2ea-3baacf663a5a
│                         c6682ade-b5ec-57c4-b687-676351208742
│                         f8ba2887-9411-5c36-9cee-88995bb39731
│                         7d5759e5-9aa0-5f0c-abd6-7439bb11b9f6
│                         0c7691e1-b6f2-5d71-bc9c-aabee364c916
│       Device Flags:     • Internal device
│                         • Updatable
│                         • Needs a reboot after installation
│                         • Only version upgrades are allowed
│                         • Signed Payload
└─UEFI Device Firmware:
      Device ID:          349bb341230b1a86e5effe7dfe4337e1590227bd
      Summary:            UEFI ESRT device
      Current version:    1
      Vendor:             DMI:American Megatrends International, LLC.
      Update State:       Success
      GUID:               9bb97156-241b-34a5-90be-06f0048895e5
      Device Flags:       • Internal device
                          • Updatable
                          • System requires external power source
                          • Needs a reboot after installation
                          • Device is usable for the duration of the update

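Answer 1

I managed to solve the problem, and the machine has not gone down since. I modified GRUB to set the default power state max latency to 0, which turns off APST.

When the SSD crashes there are no log entries, because there is nothing to write to. Before shutting down I managed to take a photo of what was printed to the screen from memory. Here is a photo of the log during the crash: [Image: ssd error log]

Based on this image, I modified the GRUB settings.

So, here is the solution: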

# 1. Backup

sudo cp /etc/default/grub /etc/default/grub.$(date +%Y-%m-%d)

# 2. Edit grub

sudo gedit /etc/default/grub

# 3. Replace this

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"

with this

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"

# Save

# Update grub

sudo update-grub

# Reboot

sudo reboot
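
For anyone who prefers not to edit the file by hand, steps 2 and 3 could probably be scripted along these lines. This is only a sketch under a few assumptions: `add_kernel_params` is a hypothetical helper name (not a GRUB tool), GNU sed is available (as on Ubuntu), and the file has a single-line, double-quoted GRUB_CMDLINE_LINUX_DEFAULT entry. The demonstration runs against a scratch file, not the real /etc/default/grub.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: append each given kernel parameter to the
# GRUB_CMDLINE_LINUX_DEFAULT="..." line of FILE unless it is already there.
add_kernel_params() {
    local file=$1 param current
    shift
    for param in "$@"; do
        # Current value between the double quotes of the assignment.
        current=$(sed -n 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"$/\1/p' "$file")
        case " $current " in
            *" $param "*) continue ;;   # already present, nothing to do
        esac
        # Append the parameter just before the closing quote (GNU sed -i).
        sed -i "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\".*\)\"\$|\1 ${param}\"|" "$file"
    done
}

# Demonstration on a scratch file standing in for /etc/default/grub.
demo=$(mktemp)
printf 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n' > "$demo"
add_kernel_params "$demo" nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
add_kernel_params "$demo" pcie_aspm=off   # second call leaves the line unchanged
cat "$demo"
rm -f "$demo"
```

To apply it for real, the function would have to run as root against /etc/default/grub, after the backup from step 1 and followed by sudo update-grub.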

# Verify

sudo nvme get-feature /dev/nvme0 -f 0x0c -H

get-feature:0x0c (Autonomous Power State Transition), Current value:00000000
    Autonomous Power State Transition Enable (APSTE): Disabled
    Auto PST Entries    .................
    Entry[ 0]   
    .................
    Idle Time Prior to Transition (ITPT): 0 ms
    Idle Transition Power State   (ITPS): 0
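
Besides nvme get-feature, it may be worth confirming after the reboot that the kernel really booted with the new parameters. A minimal sketch, assuming the two parameters from the GRUB line above: `cmdline_has` is a hypothetical helper, while /proc/cmdline and the nvme_core parameter file under /sys/module are standard kernel interfaces.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: succeed if PARAM appears as a whole word in CMDLINE.
cmdline_has() {
    case " $1 " in
        *" $2 "*) return 0 ;;
    esac
    return 1
}

# Kernel command line of the running system (empty if /proc is unavailable).
cmdline=$(cat /proc/cmdline 2>/dev/null || true)

for p in nvme_core.default_ps_max_latency_us=0 pcie_aspm=off; do
    if cmdline_has "$cmdline" "$p"; then
        echo "$p: present"
    else
        echo "$p: MISSING"
    fi
done

# Value the nvme_core module is actually using (0 means APST is off).
f=/sys/module/nvme_core/parameters/default_ps_max_latency_us
if [ -r "$f" ]; then
    echo "default_ps_max_latency_us=$(cat "$f")"
fi
```

If either parameter is reported as MISSING, the likely causes are that update-grub was not run or that a different boot entry was selected.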

# default_ps_max_latency_us = default power state max latency (microseconds)
# Users can set ps_max_latency_us to zero to turn off APST
# So when set to 0, the SSD won't enter power management states autonomously, which means it should remain operational and not enter any power-saving modes
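
To see why a limit of 0 keeps the drive out of its power-saving states, the "Supported Power States" table from the smartctl output in the question can be filtered by exit latency. This is a simplified sketch, not the kernel's code: it assumes that a state is only a candidate for APST when it is non-operational ("-" in the Op column) and its exit latency (Ex_Lat, in microseconds) does not exceed the limit, and that the kernel default limit is 100000. `apst_eligible` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: read a smartctl power-state table on stdin and print the
# numbers of non-operational states whose Ex_Lat (column 11) is <= the limit.
apst_eligible() {
    awk -v max="$1" '
        $1 ~ /^[0-9]+$/ && $2 == "-" && ($11 + 0) <= (max + 0) { print $1 }
    '
}

# Table copied from the smartctl -a output in the question.
table='St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     3.50W       -        -    0  0  0  0        0       0
 1 +     2.70W       -        -    1  1  1  1        0       0
 2 +     2.00W       -        -    2  2  2  2        0       0
 3 -   0.0250W       -        -    3  3  3  3     5000    5000
 4 -   0.0040W       -        -    4  4  4  4     5000    9000'

echo "limit 100000 us: $(printf '%s\n' "$table" | apst_eligible 100000 | tr '\n' ' ')"
echo "limit 0 us:      $(printf '%s\n' "$table" | apst_eligible 0 | tr '\n' ' ')"
```

Under these assumptions, states 3 and 4 qualify with a limit of 100000 and no state qualifies with a limit of 0, which matches the "APSTE: Disabled" result from nvme get-feature.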

I hope this helps anyone who runs into the same problem.
