How to debug a mysterious operating system problem

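I bought a new computer less than a year ago (4 CPUs, Intel i5, 32 GB RAM, 250 GB SSD). I did a fresh install of Debian 10. The installation is very lean: I kept bloat to a minimum so that the new OS would run fast.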

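Over the past few days I have noticed a strange pattern. I have some very large files (compressed with zst) that I sometimes need to decompress. They are about 1 GB compressed and about 15 GB decompressed (not a herculean task, but certainly not negligible for my system). I decompress them with zstd -cd 20201216.zst > 20201216.log. While it runs, zstd prints its progress so far. I noticed that it sometimes stalls for 20-30 seconds and then resumes. At first I thought I had accidentally started several tasks and that some kind of contention was causing this. But htop shows that very little else is happening on the system at the same time (plenty of free RAM, all 4 CPUs at around 1%). I also checked with iotop: while zstd reports progress, iotop shows very high read and write rates of about 100 MB/s; while zstd makes no progress, iotop shows 0 B/s for both reads and writes. So the problem is neither CPU contention nor disk contention.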
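To line the stalls up with what the kernel is doing, one option is to log the page-cache writeback counters once per second in a second terminal while zstd runs. The sketch below is only an illustration: the function name sample_writeback is made up here, and whether writeback is involved at all is an assumption, not something the observations above establish. It reads the Dirty and Writeback fields from /proc/meminfo.

```shell
#!/bin/sh
# Print one timestamped line with the Dirty and Writeback counters (in kB).
# sample_writeback is a hypothetical helper name. The file argument defaults
# to /proc/meminfo; it is a parameter only so the function can also be run
# against a saved copy.
sample_writeback() {
    meminfo="${1:-/proc/meminfo}"
    awk -v ts="$(date +%H:%M:%S)" '
        $1 == "Dirty:"     { dirty = $2 }
        $1 == "Writeback:" { wb = $2 }
        END { printf "%s dirty_kB=%s writeback_kB=%s\n", ts, dirty, wb }
    ' "$meminfo"
}

# Possible usage while zstd is decompressing (stop with Ctrl-C):
#   while sleep 1; do sample_writeback; done
```

If the counters change sharply right when zstd stops reporting progress, that would be one more data point to include in the question; if they stay flat, this path can probably be ruled out.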

Occasionally, but rarely, the whole system freezes during this. Most of the time I can use the system normally while zstd is stalled.

What else should I look at to debug this?

Edit: I have run smartctl; the report is below. I don't yet know how to interpret it and am still looking into it.

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-9-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     VENO SCORP SSD 240GB
Serial Number:    GSDMC206010008
Firmware Version: XKR905
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Dec 20 17:36:11 2020 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)     Offline data collection activity
                                    was never started.
                                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)     The previous self-test routine completed
                                    without error or no self-test has ever
                                    been run.
Total time to complete Offline
data collection:            (  120) seconds.
Offline data collection
capabilities:                        (0x11) SMART execute Offline immediate.
                                    No Auto Offline data collection support.
                                    Suspend Offline collection upon new
                                    command.
                                    No Offline surface scan supported.
                                    Self-test supported.
                                    No Conveyance Self-test supported.
                                    No Selective Self-test supported.
SMART capabilities:            (0x0002)     Does not save SMART data before
                                    entering power-saving mode.
                                    Supports SMART auto save timer.
Error logging capability:        (0x01)     Error logging supported.
                                    General Purpose Logging supported.
Short self-test routine
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    (  10) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   100   100   050    Old_age   Always       -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   050    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   050    Old_age   Always       -       816
 12 Power_Cycle_Count       0x0032   100   100   050    Old_age   Always       -       138
160 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       0
161 Unknown_Attribute       0x0033   100   100   050    Pre-fail  Always       -       100
163 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       17
164 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       9546
165 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       30
166 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       3
167 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       13
168 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       1500
169 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       100
175 Program_Fail_Count_Chip 0x0032   100   100   050    Old_age   Always       -       0
176 Erase_Fail_Count_Chip   0x0032   100   100   050    Old_age   Always       -       0
177 Wear_Leveling_Count     0x0032   100   100   050    Old_age   Always       -       0
178 Used_Rsvd_Blk_Cnt_Chip  0x0032   100   100   050    Old_age   Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   050    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   050    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   050    Old_age   Always       -       27
194 Temperature_Celsius     0x0022   100   100   050    Old_age   Always       -       40
195 Hardware_ECC_Recovered  0x0032   100   100   050    Old_age   Always       -       7896
196 Reallocated_Event_Count 0x0032   100   100   050    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   050    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0032   100   100   050    Old_age   Always       -       0
232 Available_Reservd_Space 0x0032   100   100   050    Old_age   Always       -       100
241 Total_LBAs_Written      0x0030   100   100   050    Old_age   Offline      -       19968
242 Total_LBAs_Read         0x0030   100   100   050    Old_age   Offline      -       5880
245 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       25733

SMART Error Log Version: 1
Warning: ATA error count 0 inconsistent with error log pointer 4

ATA Error Count: 0
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error -4 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 00 00 00 00 00 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  b0 d0 01 00 4f c2 00 08      00:00:00.000  SMART READ DATA
  b0 d1 01 01 4f c2 00 08      00:00:00.000  SMART READ ATTRIBUTE THRESHOLDS [OBS-4]
  b0 da 00 00 4f c2 00 08      00:00:00.000  SMART RETURN STATUS
  b0 d5 01 00 4f c2 00 08      00:00:00.000  SMART READ LOG
  b0 d5 01 01 4f c2 00 08      00:00:00.000  SMART READ LOG

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

Selective Self-tests/Logging not supported
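Since the attribute table is long, a small filter can reduce it to the rows whose RAW_VALUE is not zero. This is a minimal sketch: the function name smart_nonzero is hypothetical, and it assumes the ten-column layout shown in the report above. A nonzero raw value is not by itself a sign of failure, and many of these attributes are vendor-specific, so the output is only a shorter list to research.

```shell
#!/bin/sh
# Read a smartctl attribute table on stdin and print ID, name and raw value
# for every attribute row whose raw value is not 0.
# smart_nonzero is a hypothetical helper name. Rows are recognised by a
# numeric first column and at least ten whitespace-separated columns.
smart_nonzero() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $10 != 0 {
        printf "%-4s %-24s %s\n", $1, $2, $10
    }'
}

# Possible usage (the device name is an example):
#   sudo smartctl -A /dev/sda | smart_nonzero
```

For the report above, this would keep rows such as Power_On_Hours and Hardware_ECC_Recovered and drop rows such as Reallocated_Sector_Ct, whose raw value is 0.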

Answer 1

If you want to check an HDD/SSD for errors from the Linux terminal, I recommend HDSentinel for Linux rather than smartctl, because its results are easier to read...

HDSentinel for Linux

Just download it, extract it to /usr/bin, chmod it to 755, and run it as root:

sudo hdsentinel -dev /dev/sda (or replace sda with your drive name)
