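My laptop has a Kingston KC3000 NVMe SSD that becomes unavailable during heavy IO. It is my boot drive, so the problem crashes Linux. I have tried setting the kernel parameters nvme_core.default_ps_max_latency_us=0 and pcie_aspm=off, but this did not resolve the problem. I don't believe the cause is thermal throttling, because the drive temperature was 61°C at the time of the error and the operating temperature is specified as 0°C~70°C. I don't know whether this is a hardware problem or a software bug, or what to try next.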
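One thing worth ruling out is that the two parameters never reached the kernel (for example, because the bootloader configuration was not regenerated). The following is a minimal sketch of such a check, not something taken from the original troubleshooting; has_param is a helper name made up for this example, and the sysfs files are only printed if they exist on the running system.

```shell
#!/bin/sh
# Report whether an exact parameter (name=value) appears on a kernel command line.
# $1 = command line string, $2 = parameter to look for
has_param() {
    case " $1 " in
        *" $2 "*) echo "present" ;;
        *)        echo "missing" ;;
    esac
}

# Check the command line of the running kernel.
cmdline=$(cat /proc/cmdline 2>/dev/null)
for p in nvme_core.default_ps_max_latency_us=0 pcie_aspm=off; do
    echo "$p: $(has_param "$cmdline" "$p")"
done

# Show the values the loaded modules report, where these files are available.
for f in /sys/module/nvme_core/parameters/default_ps_max_latency_us \
         /sys/module/pcie_aspm/parameters/policy; do
    if [ -r "$f" ]; then
        echo "$f: $(cat "$f")"
    fi
done
```

If either parameter is reported as missing, the boot entry in use is probably not the one that was edited.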
Reproduction
I can reproduce the problem with a large write using fio:
mint@mint:~$ fio --name=write_test_100GB --rw=write --size=100GB --filename=/mnt/303637a5-5c86-467e-938e-af26d8d667e0/fio
write_test_100GB: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.28
Starting 1 process
write_test_100GB: Laying out IO file (1 file / 102400MiB)
fio: io_u error on file /mnt/303637a5-5c86-467e-938e-af26d8d667e0/fio: Read-only file system: write offset=76983128064, buflen=4096
fio: pid=4636, err=30/file:io_u.c:1845, func=io_u error, error=Read-only file system
write_test_100GB: (groupid=0, jobs=1): err=30 (file:io_u.c:1845, func=io_u error, error=Read-only file system): pid=4636: Sun Apr 9 10:14:44 2023
write: IOPS=190k, BW=743MiB/s (779MB/s)(71.7GiB/98845msec); 0 zone resets
clat (nsec): min=1306, max=29625M, avg=4996.06, stdev=6834410.60
lat (nsec): min=1329, max=29625M, avg=5030.62, stdev=6834410.60
clat percentiles (nsec):
| 1.00th=[ 1464], 5.00th=[ 1496], 10.00th=[ 1512], 20.00th=[ 1592],
| 30.00th=[ 1784], 40.00th=[ 1976], 50.00th=[ 2256], 60.00th=[ 2480],
| 70.00th=[ 2928], 80.00th=[ 3440], 90.00th=[ 4512], 95.00th=[ 5792],
| 99.00th=[ 8896], 99.50th=[10688], 99.90th=[19072], 99.95th=[26240],
| 99.99th=[47872]
bw ( MiB/s): min= 5, max= 2064, per=100.00%, avg=1056.23, stdev=318.45, samples=139
iops : min= 1500, max=528396, avg=270394.14, stdev=81524.32, samples=139
lat (usec) : 2=40.73%, 4=45.10%, 10=13.51%, 20=0.56%, 50=0.09%
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
lat (msec) : 100=0.01%, 250=0.01%, >=2000=0.01%
cpu : usr=6.94%, sys=50.32%, ctx=2332, majf=0, minf=26
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,18794710,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=743MiB/s (779MB/s), 743MiB/s-743MiB/s (779MB/s-779MB/s), io=71.7GiB (77.0GB), run=98845-98845msec
This causes a controller reset, after which the drive becomes unavailable:
Apr 9 10:14:44 mint kernel: [ 816.267878] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Apr 9 10:14:44 mint kernel: [ 816.339860] nvme 0000:01:00.0: enabling device (0000 -> 0002)
Apr 9 10:14:44 mint kernel: [ 816.340571] nvme nvme0: Removing after probe failure status: -19
Apr 9 10:14:44 mint kernel: [ 816.363768] nvme0n1: detected capacity change from 4000797360 to 0
Apr 9 10:14:44 mint kernel: [ 816.363827] blk_update_request: I/O error, dev nvme0n1, sector 25368832 op 0x0:(READ) flags 0x1000 phys_seg 1 prio class 0
Apr 9 10:14:44 mint kernel: [ 816.363826] blk_update_request: I/O error, dev nvme0n1, sector 3792103000 op 0x1:(WRITE) flags 0x104000 phys_seg 6 prio class 0
Apr 9 10:14:44 mint kernel: [ 816.363842] blk_update_request: I/O error, dev nvme0n1, sector 3792105560 op 0x1:(WRITE) flags 0x104000 phys_seg 6 prio class 0
Apr 9 10:14:44 mint kernel: [ 816.363841] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
Apr 9 10:14:44 mint kernel: [ 816.363864] blk_update_request: I/O error, dev nvme0n1, sector 3792108120 op 0x1:(WRITE) flags 0x104000 phys_seg 6 prio class 0
Apr 9 10:14:44 mint kernel: [ 816.363871] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 1, rd 1, flush 0, corrupt 0, gen 0
Apr 9 10:14:44 mint kernel: [ 816.363968] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 1, rd 2, flush 0, corrupt 0, gen 0
Apr 9 10:14:44 mint kernel: [ 816.364180] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 1, rd 3, flush 0, corrupt 0, gen 0
Apr 9 10:14:44 mint kernel: [ 816.364311] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 1, rd 4, flush 0, corrupt 0, gen 0
Apr 9 10:14:44 mint kernel: [ 816.364834] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 1, rd 5, flush 0, corrupt 0, gen 0
Apr 9 10:14:44 mint kernel: [ 816.364898] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 1, rd 6, flush 0, corrupt 0, gen 0
Apr 9 10:14:44 mint kernel: [ 816.365008] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 1, rd 7, flush 0, corrupt 0, gen 0
Apr 9 10:14:44 mint kernel: [ 816.365084] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 1, rd 8, flush 0, corrupt 0, gen 0
Apr 9 10:14:44 mint kernel: [ 816.365111] BTRFS: error (device nvme0n1p2) in btrfs_run_delayed_refs:2149: errno=-5 IO failure
Apr 9 10:14:44 mint kernel: [ 816.365117] BTRFS info (device nvme0n1p2): forced readonly
Apr 9 10:14:44 mint kernel: [ 816.366446] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 2, rd 8, flush 0, corrupt 0, gen 0
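I tried the same write test on the other NVMe SSD in the computer (TEAM MP33 PRO), and it did not crash. However, its cache filled up and the drive's performance ground to a halt.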
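Research
The Arch Wiki has a section on [controller failure due to broken APST support](https://wiki.archlinux.org/title/Solid_state_drive/NVMe#Power_Saving_(APST)), which mentions earlier Kingston drives and older kernels. There is no firmware update for my drive, and I'm running the 5.19 kernel.
I have also read these posts, which suggest the default_ps_max_latency_us and pcie_aspm tweaks: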
- ServerFault - Linux/ext4/nvme crashes during high IO
- Ask Ubuntu - Random freezes in Kubuntu 20.04
- Ask Ubuntu - WD (Sandisk) NVMe M.2 stick not quite working
- Linux mailing list - [BUG][5.18rc5] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
- r/pop_os - Experiencing random freezes after installing a new NVMe SSD
- Arch Linux Forums - NVMe SSD I/O errors, not sure if hardware failure
System information
inxi
System:
Host: minted Kernel: 5.19.0-38-generic x86_64 bits: 64
Desktop: Cinnamon 5.6.8 Distro: Linux Mint 21.1 Vera
Machine:
Type: Laptop System: Metabox product: Flo L140PU v: N/A
serial: <superuser required>
Mobo: IT Channel Pty model: L140PU serial: <superuser required>
UEFI: INSYDE v: 1.07.03TMB date: 05/17/2022
Battery:
ID-1: BAT0 charge: 49.0 Wh (67.8%) condition: 72.3/73.9 Wh (97.8%)
volts: 8.0 min: 7.7
CPU:
Info: 10-core (2-mt/8-st) model: 12th Gen Intel Core i5-1235U bits: 64
type: MST AMCP cache: L2: 6.5 MiB
Speed (MHz): avg: 2343 min/max: 400/4400:3300 cores: 1: 575 2: 2500
3: 684 4: 2500 5: 2500 6: 2806 7: 2500 8: 3281 9: 2500 10: 3281 11: 2500
12: 2500
Graphics:
Device-1: Intel driver: i915 v: kernel
Device-2: Chicony USB2.0 Camera type: USB driver: uvcvideo
Display: x11 server: X.Org v: 1.21.1.3 driver: X: loaded: modesetting
unloaded: fbdev,vesa gpu: i915 resolution: 1: 2560x1440~60Hz
2: 2560x1440~60Hz
OpenGL: renderer: Mesa Intel Graphics (ADL GT2) v: 4.6 Mesa 22.2.5
Audio:
Device-1: Intel Alder Lake PCH-P High Definition Audio
driver: snd_hda_intel
Device-2: Audioengine D1 24-bit DAC type: USB
driver: hid-generic,snd-usb-audio,usbhid
Device-3: Antlion Audio USB Microphone type: USB
driver: hid-generic,snd-usb-audio,usbhid
Sound Server-1: ALSA v: k5.19.0-38-generic running: yes
Sound Server-2: PulseAudio v: 15.99.1 running: yes
Sound Server-3: PipeWire v: 0.3.66 running: yes
Network:
Device-1: Intel Alder Lake-P PCH CNVi WiFi driver: iwlwifi
IF: wlp0s20f3 state: up mac: 70:a6:cc:2e:c9:35
Bluetooth:
Device-1: Intel AX201 Bluetooth type: USB driver: btusb
Report: hciconfig ID: hci0 rfk-id: 0 state: down
bt-service: enabled,running rfk-block: hardware: no software: yes
address: 70:A6:CC:2E:C9:39
Drives:
Local Storage: total: 2.33 TiB used: 2.02 TiB (86.8%)
ID-1: /dev/nvme0n1 vendor: Kingston model: SKC3000D2048G size: 1.86 TiB
ID-2: /dev/nvme1n1 vendor: TeamGroup model: TM8FPD512G size: 476.94 GiB
Partition:
ID-1: / size: 1.86 TiB used: 1.68 TiB (89.9%) fs: btrfs dev: /dev/nvme0n1p2
ID-2: /boot/efi size: 486 MiB used: 6.1 MiB (1.2%) fs: vfat
dev: /dev/nvme0n1p1
ID-3: /home size: 1.86 TiB used: 1.68 TiB (89.9%) fs: btrfs
dev: /dev/nvme0n1p2
Swap:
ID-1: swap-1 type: partition size: 1.91 GiB used: 0 KiB (0.0%)
dev: /dev/nvme1n1p1
Sensors:
System Temperatures: cpu: 66.0 C mobo: N/A
Fan Speeds (RPM): N/A
Info:
Processes: 354 Uptime: 4m Memory: 23.19 GiB used: 2.17 GiB (9.3%)
Shell: Bash inxi: 3.3.13
lspci
00:00.0 Host bridge: Intel Corporation Device 4601 (rev 04)
00:02.0 VGA compatible controller: Intel Corporation Device 46a8 (rev 0c)
00:04.0 Signal processing controller: Intel Corporation Alder Lake Innovation Platform Framework Processor Participant (rev 04)
00:06.0 PCI bridge: Intel Corporation 12th Gen Core Processor PCI Express x4 Controller #0 (rev 04)
00:07.0 PCI bridge: Intel Corporation Alder Lake-P Thunderbolt 4 PCI Express Root Port #0 (rev 04)
00:08.0 System peripheral: Intel Corporation 12th Gen Core Processor Gaussian & Neural Accelerator (rev 04)
00:0a.0 Signal processing controller: Intel Corporation Platform Monitoring Technology (rev 01)
00:0d.0 USB controller: Intel Corporation Alder Lake-P Thunderbolt 4 USB Controller (rev 04)
00:0d.2 USB controller: Intel Corporation Alder Lake-P Thunderbolt 4 NHI #0 (rev 04)
00:14.0 USB controller: Intel Corporation Alder Lake PCH USB 3.2 xHCI Host Controller (rev 01)
00:14.2 RAM memory: Intel Corporation Alder Lake PCH Shared SRAM (rev 01)
00:14.3 Network controller: Intel Corporation Alder Lake-P PCH CNVi WiFi (rev 01)
00:15.0 Serial bus controller: Intel Corporation Alder Lake PCH Serial IO I2C Controller #0 (rev 01)
00:15.1 Serial bus controller: Intel Corporation Alder Lake PCH Serial IO I2C Controller #1 (rev 01)
00:16.0 Communication controller: Intel Corporation Alder Lake PCH HECI Controller (rev 01)
00:1c.0 PCI bridge: Intel Corporation Device 51bd (rev 01)
00:1d.0 PCI bridge: Intel Corporation Device 51b0 (rev 01)
00:1f.0 ISA bridge: Intel Corporation Alder Lake PCH eSPI Controller (rev 01)
00:1f.3 Audio device: Intel Corporation Alder Lake PCH-P High Definition Audio Controller (rev 01)
00:1f.4 SMBus: Intel Corporation Alder Lake PCH-P SMBus Host Controller (rev 01)
00:1f.5 Serial bus controller: Intel Corporation Alder Lake-P PCH SPI Controller (rev 01)
01:00.0 Non-Volatile memory controller: Kingston Technology Company, Inc. Device 5013 (rev 01)
2b:00.0 SD Host controller: O2 Micro, Inc. SD/MMC Card Reader Controller (rev 01)
2c:00.0 Non-Volatile memory controller: Phison Electronics Corporation PS5013 E13 NVMe Controller (rev 01)
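The short listing above does not show the PCIe link state, so it cannot confirm whether pcie_aspm=off actually left ASPM disabled on the Kingston controller at 01:00.0. A sketch of how the link lines could be pulled from the verbose output follows; link_lines and show_link are illustrative helper names, and root is typically needed for lspci to print the capability details.

```shell
#!/bin/sh
# Keep only the link capability, control, and status lines
# of `lspci -vv` output read from stdin.
link_lines() {
    grep -E 'LnkCap:|LnkCtl:|LnkSta:'
}

# Print the link lines for one device; defaults to 01:00.0,
# the Kingston controller in the listing above.
show_link() {
    lspci -vv -s "${1:-01:00.0}" | link_lines
}
```

Running show_link as root should print a LnkCtl line that states whether ASPM is enabled or disabled for the device.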
SMART
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0
temperature : 34 C (307 Kelvin)
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 0%
endurance group critical warning summary: 0
data_units_read : 17,450,136
data_units_written : 8,395,033
host_read_commands : 40,201,204
host_write_commands : 38,110,961
controller_busy_time : 268
power_cycles : 480
power_on_hours : 608
unsafe_shutdowns : 275
media_errors : 0
num_err_log_entries : 2,790
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 2 : 77 C (350 Kelvin)
Thermal Management T1 Trans Count : 7
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 3830
Thermal Management T2 Total Time : 0
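The composite temperature in this log (34 C) is well below the 61 C seen at the time of the error, while Temperature Sensor 2 reads 77 C and the T1 transition count is 7, so it may be worth watching all of these values while the write test runs. The following is a rough sketch, assuming nvme-cli is installed and the controller is /dev/nvme0; temps and sample_loop are names invented for this example.

```shell
#!/bin/sh
# Keep only the temperature and thermal-management lines
# of `nvme smart-log` output read from stdin.
temps() {
    grep -iE '^[[:space:]]*(temperature|Thermal Management)'
}

# Device to sample; an assumption, adjust to the controller under test.
DEV=/dev/nvme0

# Print a timestamp and the filtered SMART values every 5 seconds.
# Intended to run as root in a second terminal while fio is writing.
sample_loop() {
    while :; do
        date '+%T'
        nvme smart-log "$DEV" | temps
        sleep 5
    done
}
```

Calling sample_loop starts the sampling; stop it with Ctrl+C once the controller drops out.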
APST
get-feature:0x0c (Autonomous Power State Transition), Current value:00000000
Autonomous Power State Transition Enable (APSTE): Disabled
Auto PST Entries .................
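The output above reports APSTE as Disabled, which is what nvme_core.default_ps_max_latency_us=0 is expected to produce. To re-check this after a reboot or a kernel change, something like the following sketch could be used; it assumes nvme-cli and /dev/nvme0, and apst_state and check_apst are hypothetical helper names.

```shell
#!/bin/sh
# Extract the APSTE state ("Enabled" or "Disabled") from
# human-readable get-feature output read from stdin.
apst_state() {
    sed -n 's/.*(APSTE): *//p'
}

# Feature 0x0c is Autonomous Power State Transition; -H asks nvme-cli
# for human-readable output. The default device path is an assumption.
check_apst() {
    nvme get-feature "${1:-/dev/nvme0}" -f 0x0c -H | apst_state
}
```

Run as root, check_apst should print a single word describing the current APST state.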