Has my Solid State Disk become a Super Slim Doorstopper?

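I know this is a long question, but I have tried to make it as detailed and informative as possible. For a tl;dr, skip the first half of the question, although I think the information there may well be relevant.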

What happened

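First of all: I live in an area that is currently suffering a severe heat wave. For 2-3 weeks the air temperature in my room has never dropped below 30°C, and for the last few days it has never dropped below 34°C, not even in the middle of the night. I have no air conditioning and my fans do almost nothing. My SSD's temperature sensor seems to be broken (it always reports 5°C); my HDDs are almost always at 48°C, 54°C and 54°C. The GPU sits at about 60°C and the CPU at about 52°C. That is not great, but it still sounds tolerable to me.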

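Last night, while I was using my PC (Arch Linux on a 64GB SSD), everything froze. I could not even SSH into the machine anymore. After waiting half an hour in the hope of getting at least an SSH connection, I had to cut the power. I should also mention that my PC sometimes becomes very slow when I use Audacity (it writes its temporary data to the SSD, because Audacity does not seem to support NTFS file systems and my SSD holds the only non-NTFS file system I have), and that I recently came across a question about SSDs slowing down once they are full. Because of heavy Audacity recording, my SSD reaches 95%+ used space several times a week, if not daily.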

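After powering the PC off, I tried to turn it on again. On the BIOS screen, where it checks all the disks, the SSD showed S.M.A.R.T. error. After starting GRUB (on another drive) and trying to boot Arch (the boot partition is also on another drive), I got the message Device /dev/mapper/mydisk-root not found, or something similar. mydisk-root should be the root partition inside the volume group on my LUKS-encrypted SSD. I rebooted a few times but always got the same result, so in the end I gave up, switched the PC off (at the PSU) and went to bed.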

Next actions I performed

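After waking up, I wanted to boot a live Linux USB stick to run a SMART scan, look at dmesg, whatever. Suddenly the BIOS showed S.M.A.R.T. ok again. I went ahead with the live USB anyway, and there I could unlock and mount the SSD as usual. I could also make a full backup without any problems.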

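Then I ran the SMART tests. The long test failed both times with 50% remaining; details are below. The short test completed, and I cannot see anything bad in its results. The last SMART test I had run before that was a long test 2 weeks ago (see the test log), and everything was fine.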
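To make the self-test results easier to read, here is a minimal Python sketch that parses self-test log lines like the ones quoted further down and converts the first failing LBA into a byte offset. The regular expression is fitted to the layout shown in this post only, and the 512-byte logical sector size is an assumption, not something the output above confirms.

```python
import re

# Layout assumed from the self-test log quoted in this post:
# "# N  <test>  <status>  <remaining>%  <hours>  <LBA or ->"
ENTRY = re.compile(
    r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)\s*$"
)
SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def parse_selftest_log(text):
    """Return one dict per self-test log entry found in `text`."""
    entries = []
    for line in text.splitlines():
        m = ENTRY.match(line.strip())
        if not m:
            continue
        lba = None if m.group(6) == "-" else int(m.group(6))
        entries.append({
            "num": int(m.group(1)),
            "test": m.group(2),
            "status": m.group(3),
            "remaining_pct": int(m.group(4)),
            "hours": int(m.group(5)),
            "first_error_lba": lba,
        })
    return entries


LOG = """\
# 1  Extended offline    Completed: read failure       50%     23891         66387896
# 2  Extended offline    Completed: read failure       50%     23889         66387896
# 3  Extended offline    Completed without error       00%     23437         -
"""

if __name__ == "__main__":
    for entry in parse_selftest_log(LOG):
        if entry["first_error_lba"] is not None:
            offset = entry["first_error_lba"] * SECTOR_SIZE
            print(entry["num"], entry["status"], "first error at byte", offset)
```

Both failed runs stop at the same LBA, which points at one specific unreadable area rather than errors scattered at random.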

Question 1: What is the state of my SSD?

This is the output of the SMART attribute table before I attempted any tests today, so I assume these are the results from the long test I ran two weeks ago:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   050    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   001    Old_age   Always       -       23891
 12 Power_Cycle_Count       0x0032   100   100   001    Old_age   Always       -       1063
170 Grown_Failing_Block_Ct  0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   001    Old_age   Always       -       10
172 Erase_Fail_Count        0x0032   100   100   001    Old_age   Always       -       0
173 Wear_Leveling_Count     0x0033   080   080   010    Pre-fail  Always       -       611
174 Unexpect_Power_Loss_Ct  0x0032   100   100   001    Old_age   Always       -       244
181 Non4k_Aligned_Access    0x0022   100   100   001    Old_age   Always       -       302 89 212
183 SATA_Iface_Downshift    0x0032   100   100   001    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   001    Old_age   Always       -       2
188 Command_Timeout         0x0032   100   100   001    Old_age   Always       -       0
189 Factory_Bad_Block_Ct    0x000e   100   100   001    Old_age   Always       -       58
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       0
195 Hardware_ECC_Recovered  0x003a   100   100   001    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   001    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   001    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   001    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   001    Old_age   Always       -       1
202 Perc_Rated_Life_Used    0x0018   080   080   001    Old_age   Offline      -       20
206 Write_Error_Rate        0x000e   100   100   001    Old_age   Always       -       10

This is the complete -a output after today's attempted long test, which failed (see the test log):

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 117) The previous self-test completed having
                    the read element of the test failed.
Total time to complete Offline 
data collection:        (  295) seconds.
Offline data collection
capabilities:            (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    (   4) minutes.
Conveyance self-test routine
recommended polling time:    (   3) minutes.
SCT capabilities:          (0x003d) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   050    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   001    Old_age   Always       -       23891
 12 Power_Cycle_Count       0x0032   100   100   001    Old_age   Always       -       1063
170 Grown_Failing_Block_Ct  0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   001    Old_age   Always       -       10
172 Erase_Fail_Count        0x0032   100   100   001    Old_age   Always       -       0
173 Wear_Leveling_Count     0x0033   080   080   010    Pre-fail  Always       -       611
174 Unexpect_Power_Loss_Ct  0x0032   100   100   001    Old_age   Always       -       244
181 Non4k_Aligned_Access    0x0022   100   100   001    Old_age   Always       -       302 89 212
183 SATA_Iface_Downshift    0x0032   100   100   001    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   001    Old_age   Always       -       2
188 Command_Timeout         0x0032   100   100   001    Old_age   Always       -       0
189 Factory_Bad_Block_Ct    0x000e   100   100   001    Old_age   Always       -       58
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       0
195 Hardware_ECC_Recovered  0x003a   100   100   001    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   001    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   001    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   001    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   001    Old_age   Always       -       1
202 Perc_Rated_Life_Used    0x0018   080   080   001    Old_age   Offline      -       20
206 Write_Error_Rate        0x000e   100   100   001    Old_age   Always       -       10

SMART Error Log Version: 1
Warning: ATA error count 0 inconsistent with error log pointer 2

ATA Error Count: 0
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 0 occurred at disk power-on lifetime: 23890 hours (995 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 50 00 d0 14 d1 40   at LBA = 0x00d114d0 = 13702352

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 d0 14 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 00 08 c8 14 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 03 08 c0 14 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 10 08 b8 14 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 00 08 b0 14 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED

Error -1 occurred at disk power-on lifetime: 23890 hours (995 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 50 00 d0 14 d1 40   at LBA = 0x00d114d0 = 13702352

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 d5 00 d8 13 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 00 00 d8 12 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 da 00 d8 11 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 d0 00 d8 10 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 d1 80 58 10 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       50%     23891         66387896
# 2  Extended offline    Completed: read failure       50%     23889         66387896
# 3  Extended offline    Completed without error       00%     23437         -
# 4  Short offline       Completed without error       00%       564         -
# 5  Vendor (0xff)       Completed without error       00%       558         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

This is the complete -a output after today's short test, which succeeded:

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (  295) seconds.
Offline data collection
capabilities:            (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    (   4) minutes.
Conveyance self-test routine
recommended polling time:    (   3) minutes.
SCT capabilities:          (0x003d) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   050    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   001    Old_age   Always       -       23891
 12 Power_Cycle_Count       0x0032   100   100   001    Old_age   Always       -       1063
170 Grown_Failing_Block_Ct  0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   001    Old_age   Always       -       10
172 Erase_Fail_Count        0x0032   100   100   001    Old_age   Always       -       0
173 Wear_Leveling_Count     0x0033   080   080   010    Pre-fail  Always       -       611
174 Unexpect_Power_Loss_Ct  0x0032   100   100   001    Old_age   Always       -       244
181 Non4k_Aligned_Access    0x0022   100   100   001    Old_age   Always       -       302 89 212
183 SATA_Iface_Downshift    0x0032   100   100   001    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   001    Old_age   Always       -       2
188 Command_Timeout         0x0032   100   100   001    Old_age   Always       -       0
189 Factory_Bad_Block_Ct    0x000e   100   100   001    Old_age   Always       -       58
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       0
195 Hardware_ECC_Recovered  0x003a   100   100   001    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   001    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   001    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   001    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   001    Old_age   Always       -       1
202 Perc_Rated_Life_Used    0x0018   080   080   001    Old_age   Offline      -       20
206 Write_Error_Rate        0x000e   100   100   001    Old_age   Always       -       10

SMART Error Log Version: 1
Warning: ATA error count 0 inconsistent with error log pointer 2

ATA Error Count: 0
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 0 occurred at disk power-on lifetime: 23890 hours (995 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 50 00 d0 14 d1 40   at LBA = 0x00d114d0 = 13702352

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 d0 14 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 00 08 c8 14 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 03 08 c0 14 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 10 08 b8 14 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 00 08 b0 14 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED

Error -1 occurred at disk power-on lifetime: 23890 hours (995 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 50 00 d0 14 d1 40   at LBA = 0x00d114d0 = 13702352

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 d5 00 d8 13 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 00 00 d8 12 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 da 00 d8 11 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 d0 00 d8 10 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED
  60 d1 80 58 10 d1 40 00   1d+05:22:14.080  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     23891         -
# 2  Extended offline    Completed: read failure       50%     23891         66387896
# 3  Extended offline    Completed: read failure       50%     23889         66387896
# 4  Extended offline    Completed without error       00%     23437         -
# 5  Short offline       Completed without error       00%       564         -
# 6  Vendor (0xff)       Completed without error       00%       558         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

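It strikes me as interesting that all three attribute tables are identical. Or am I missing something here? I am no SMART expert, but as far as I can tell all three are perfect results. (?) I have not tried it yet, but since mounting the SSD and retrieving files worked and the BIOS reports ok again, I assume I could also boot from it again. Should I?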
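Rather than comparing the tables by eye, the check can be scripted. The following is a minimal Python sketch, assuming the ten-column attribute layout shown above; the function names are made up for illustration and are not part of smartmontools.

```python
import re

# One attribute row: ID, name, flag, value, worst, thresh, type,
# updated, when_failed, raw value (the raw value may contain spaces).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(.+?)\s*$"
)


def parse_attributes(text):
    """Map attribute name -> (value, worst, thresh, raw) for each table row."""
    table = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            table[m.group(2)] = (
                int(m.group(4)), int(m.group(5)), int(m.group(6)), m.group(10)
            )
    return table


def diff_attributes(before, after):
    """Return {name: (old, new)} for every attribute that differs."""
    names = sorted(set(before) | set(after))
    return {
        n: (before.get(n), after.get(n))
        for n in names
        if before.get(n) != after.get(n)
    }


SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
181 Non4k_Aligned_Access    0x0022   100   100   001    Old_age   Always       -       302 89 212
187 Reported_Uncorrect      0x0032   100   100   001    Old_age   Always       -       2
"""

if __name__ == "__main__":
    before = parse_attributes(SAMPLE)
    after = dict(before)
    after["Reported_Uncorrect"] = (100, 100, 1, "3")  # made-up change for the demo
    print(diff_attributes(before, before))
    print(diff_attributes(before, after))
```

An empty result means the two tables match in every parsed column; pasting each of the three dumps into parse_attributes and diffing them pairwise would confirm or refute the impression that nothing changed.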

Question 2: Why did this happen?

Is this simply due to aging, or did my constant use of Audacity on the SSD cause it?

Is it related to the SSD's used space constantly reaching 90-100%?

How can it go from "everything is fine" to "I can't even complete a SMART test anymore" within just two weeks?

What do these SMART test results tell me? The attribute table after today's tests still looks great to me, or am I wrong?

Question 3: Is this contagious?

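If this SSD is dead and I buy a new one, can I simply run dd if=/old/ssd of=/new/ssd and be done with it? Or will that cause trouble? What is the best way to move to a new disk? Note that I use LUKS with a detached header in raw mode on the whole device, and I just want to "clone" all of that onto the new disk.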


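EDIT: I just booted from that SSD again and it seems to work fine. I will still buy a new SSD as soon as possible, though, because I think continuing to use this one is a bad idea. Here are the last entries in the syslog before the crash: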

syslog

Answer 1

The SMART status shows a lot of old-age or dying indicators, but nothing that particularly screams "this killed it!".

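Your log shows a power-on lifetime of 995 days and 10 hours, which suggests you have been leaving the machine switched on. That is not a bad thing in itself; it just means the drive has seen a lot of small writes from operating-system bookkeeping and general use.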
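The arithmetic behind that figure is easy to reproduce. This small Python sketch uses only the Power_On_Hours value from the error log and the Power_Cycle_Count from the attribute table in the question; the helper name is made up for illustration.

```python
def hours_to_days(hours):
    """Split a power-on hour count into whole days and leftover hours."""
    return divmod(hours, 24)


POWER_ON_HOURS = 23890      # from "Error 0 occurred at disk power-on lifetime"
POWER_CYCLE_COUNT = 1063    # raw value of attribute 12 in the question

if __name__ == "__main__":
    days, rest = hours_to_days(POWER_ON_HOURS)
    print(f"{days} days + {rest} hours")  # 995 days + 10 hours
    # Average uptime per power cycle: a rough hint at how long the
    # machine stays on each time it is started.
    print(f"{POWER_ON_HOURS / POWER_CYCLE_COUNT:.1f} hours per power cycle")
```

An average of many hours per power cycle fits the picture of a machine that is mostly left running.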

To me it looks like the SSD is simply old and worn. Surprisingly, though, Perc_Rated_Life_Used (raw value 20) and Erase_Fail_Count (raw value 0) are both low.

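What worries me is that your "regular" usage reaches 95% or more, which shrinks the pool of empty blocks available to the wear-leveling algorithm. When space is short, you end up putting more stress on a small subset of blocks, so that small subset sees a lot of writes while the average across the whole drive stays low. If you do this repeatedly, the wear leveler may well pick the "best" (least-written) blocks first, but by the time you reach 100% full all that is left are the "worst" blocks. Add the general programs and the operating system going about their tasks, and you wear out the worst blocks even faster. It is a perfect way to stress the worst parts of the drive and send it to its grave.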

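You are effectively forcing critical file-system and SSD bookkeeping structures into the worst cells, because those structures are likely to be written regularly. Especially when the SSD is nearly full, something bad is going to happen sooner or later. If you run out of blocks available for reallocation and critical structures cannot be moved, the drive may lock itself up.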

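This is why people say you should always try to keep some free space on your drive: the less free space there is, the harder you work whatever is still free.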
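The effect can be illustrated with a deliberately crude Python model. It assumes that blocks holding existing data are never relocated and that every write goes to the least-worn free block; real controllers also move static data around and keep hidden spare area, so the numbers only show the direction of the effect, not what this particular drive did.

```python
def simulate_wear(total_blocks, fill_percent, writes):
    """Toy model: return (max erases on any free block, mean erases per block).

    Assumption: occupied blocks are pinned and never rewritten, so all
    writes are spread over the free blocks only.
    """
    free = total_blocks * (100 - fill_percent) // 100
    if free <= 0:
        raise ValueError("no free blocks left to write to")
    erase_counts = [0] * free
    for _ in range(writes):
        # Naive wear leveling: always pick the least-worn free block.
        i = erase_counts.index(min(erase_counts))
        erase_counts[i] += 1
    return max(erase_counts), writes / total_blocks


if __name__ == "__main__":
    for fill in (50, 80, 95):
        worst, mean = simulate_wear(total_blocks=100, fill_percent=fill, writes=10_000)
        print(f"{fill}% full: worst block {worst} erases, drive-wide mean {mean:.0f}")
```

In this model the drive-wide average stays the same at every fill level, while the wear on the busiest blocks grows as the free pool shrinks, which is the pattern described above.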

It is possible that the combination of aged data and frequent writes to a small group of blocks has worn out some parts of the drive.

Copying what you need onto a new drive will be fine; hardware failures like this tend not to be contagious.
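If you want to confirm that a raw copy arrived intact, the copy can be hashed on the way through and the destination re-read afterwards. This is a minimal Python sketch of that idea, not a replacement for dd. The file names are hypothetical stand-ins for /old/ssd and /new/ssd, and it assumes every sector of the source is still readable; a read error will simply raise OSError and stop the copy, which is the case dedicated rescue tools are built for.

```python
import hashlib
import os
import tempfile

CHUNK = 1024 * 1024  # copy in 1 MiB pieces


def clone(src_path, dst_path):
    """Copy src to dst chunk by chunk; return (sha256 hex digest, bytes copied)."""
    digest = hashlib.sha256()
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(CHUNK)
            if not chunk:
                break
            dst.write(chunk)
            digest.update(chunk)
            copied += len(chunk)
    return digest.hexdigest(), copied


def sha256_of(path, length=None):
    """Hash a file, or only its first `length` bytes (useful if dst is larger)."""
    digest = hashlib.sha256()
    remaining = length
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            size = CHUNK if remaining is None else min(CHUNK, remaining)
            chunk = f.read(size)
            if not chunk:
                break
            digest.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        old = os.path.join(tmp, "old.img")  # hypothetical stand-in for the old SSD
        new = os.path.join(tmp, "new.img")  # hypothetical stand-in for the new SSD
        with open(old, "wb") as f:
            f.write(os.urandom(3 * CHUNK + 123))
        digest, copied = clone(old, new)
        print(copied, "bytes copied, verified:", sha256_of(new, copied) == digest)
```

Because the copy is byte for byte, it should not matter that the contents are LUKS ciphertext; the detached header stays wherever you already keep it.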
