Unable to identify the SMART error/problem with my NVMe disk

I regularly receive emails from the smartd daemon about my NVMe disk.

SMART error (ErrorCount) detected on host: desk

This message was generated by the smartd daemon running on:

   host name:  [redacted]
   DNS domain: [redacted]

The following warning/error was logged by the smartd daemon:

Device: /dev/nvme0, number of Error Log entries increased from 2519 to 2521

Device info:
KBG30ZMV256G TOSHIBA, S/N:X8OPD1PGP12P, FW:ADHA0101

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Sat Oct  7 23:38:04 2023 EDT
Another message will be sent in 24 hours if the problem persists.

I have been trying to resolve this for several months without success. Below are the commands I have tried, with their output.

smartctl -a /dev/nvme0

$ sudo smartctl -a /dev/nvme0
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-13-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       KBG30ZMV256G TOSHIBA
Serial Number:                      X8OPD1PGP12P
Firmware Version:                   ADHA0101
PCI Vendor/Subsystem ID:            0x1179
IEEE OUI Identifier:                0x00080d
Controller ID:                      0
NVMe Version:                       1.2.1
Number of Namespaces:               1
Namespace 1 Size/Capacity:          256,060,514,304 [256 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            00080d 04004ad9aa
Local Time is:                      Sun Oct 15 17:53:35 2023 EDT
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0017):     Comp Wr_Unc DS_Mngmt Sav/Sel_Feat
Log Page Attributes (0x02):         Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     3.30W       -        -    0  0  0  0        0       0
 1 +     2.70W       -        -    1  1  1  1        0       0
 2 +     2.30W       -        -    2  2  2  2        0       0
 3 -   0.0500W       -        -    4  4  4  4     8000   32000
 4 -   0.0050W       -        -    4  4  4  4     8000   40000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 -    4096       0         0
 1 +     512       0         3

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        31 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    33%
Data Units Read:                    35,454,740 [18.1 TB]
Data Units Written:                 70,575,255 [36.1 TB]
Host Read Commands:                 306,457,518
Host Write Commands:                881,616,851
Controller Busy Time:               12,766
Power Cycles:                       342
Power On Hours:                     21,991
Unsafe Shutdowns:                   617
Media and Data Integrity Errors:    0
Error Information Log Entries:      2,528
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               31 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0       2528     0  0x301c  0xc002  0x000            -     4     -
  1       2527     0  0x201d  0xc004  0x028            -     1     -
  2       2526     0  0x101d  0xc004  0x028            -     1     -
  3       2525     0  0x6005  0xc002  0x000            -     4     -
  4       2524     0  0x6004  0xc004  0x028            -     1     -
  5       2523     0  0x5006  0xc004  0x028            -     1     -
  6       2522     0  0x1006  0xc005  0x028            -     1     -
  7       2521     0  0x4013  0xc005  0x028            -     0     -

nvme error-log /dev/nvme0

nvme.log

nvme list

$ sudo ./nvme-cli-latest-x86_64.AppImage list
Node                  Generic               SN                   Model                                    Namespace  Usage                      Format           FW Rev  
--------------------- --------------------- -------------------- ---------------------------------------- ---------- -------------------------- ---------------- --------
/dev/nvme0n1          /dev/ng0n1            X8OPD1PGP12P         KBG30ZMV256G TOSHIBA                     0x1        256.06  GB / 256.06  GB    512   B +  0 B   ADHA0101

Answer 1

According to this answer by 比克隆德:

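If you are asking how the error code is encoded in 0xC502: shift the value right by one bit (0xC502 >> 1) to drop the phase tag, which gives 0x6281. Then apply the mask 0x7ff to extract the low 11 bits (3 for the status code type, 8 for the status code), which leaves 0x281. 0x2xx is "Media and Data Integrity Errors", and status code 0x81 is "Unrecovered Read Error".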
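The shift-and-mask arithmetic quoted above can be sketched in a few lines of Python. The function name `split_status` is made up for this illustration; the bit positions are the ones described in the quote (bit 0 is the phase tag, the next 8 bits are the status code, the 3 bits above those are the status code type).

```python
def split_status(status):
    """Split a 16-bit Status value from the NVMe error log.

    Returns (phase, sct, sc). `split_status` is a hypothetical helper
    name, not part of smartctl or nvme-cli.
    """
    phase = status & 0x1            # bit 0: phase tag
    field = (status >> 1) & 0x7FF   # drop the phase tag, keep the low 11 bits
    sct = field >> 8                # upper 3 of those bits: status code type
    sc = field & 0xFF               # lower 8 bits: status code
    return phase, sct, sc


phase, sct, sc = split_status(0xC502)
print(f"phase={phase} sct={sct:#x} sc={sc:#04x}")  # phase=0 sct=0x2 sc=0x81
```

The bits above the mask (0x6000 in 0x6281) are discarded here, exactly as in the quoted explanation.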

We can apply the same logic to your errors.

Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0       2528     0  0x301c  0xc002  0x000            -     4     -

Status 0xc002

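  • Drop the phase tag (shift right by one bit, the same as an integer division by 2): 0x6001.
  • Apply the mask 0x7ff (keep the low 11 bits, i.e. the rightmost three hex digits with the top bit cleared) to get 0x001.
  • 0x0xx gives us NVME_STATUS_TYPE_GENERIC_COMMAND
  • 0x01 gives us NVME_STATUS_INVALID_COMMAND_OPCODE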

Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  1       2527     0  0x201d  0xc004  0x028            -     1     -

Status 0xc004

  • Drop the phase tag: 0x6002.
  • Apply the mask 0x7ff to get 0x002.
  • 0x0xx gives us NVME_STATUS_TYPE_GENERIC_COMMAND
  • 0x02 gives us NVME_STATUS_INVALID_FIELD_IN_COMMAND

And so on for the remaining entries.
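To go through every row rather than decoding by hand, the eight rows from the question's smartctl output can be tallied by decoded status. This is only a sketch: it assumes the column layout shown in the question (Status is the fifth whitespace-separated column), and `decode` and `tally` are names invented for the example.

```python
from collections import Counter

# Rows copied from the "Error Information (NVMe Log 0x01)" table in the question.
ROWS = """\
  0       2528     0  0x301c  0xc002  0x000            -     4     -
  1       2527     0  0x201d  0xc004  0x028            -     1     -
  2       2526     0  0x101d  0xc004  0x028            -     1     -
  3       2525     0  0x6005  0xc002  0x000            -     4     -
  4       2524     0  0x6004  0xc004  0x028            -     1     -
  5       2523     0  0x5006  0xc004  0x028            -     1     -
  6       2522     0  0x1006  0xc005  0x028            -     1     -
  7       2521     0  0x4013  0xc005  0x028            -     0     -
"""


def decode(status):
    """Return (status code type, status code) for a raw Status value."""
    field = (status >> 1) & 0x7FF  # drop the phase tag, keep 11 bits
    return field >> 8, field & 0xFF


def tally(rows):
    """Count how often each decoded (sct, sc) pair occurs in the table rows."""
    counts = Counter()
    for line in rows.splitlines():
        cols = line.split()
        if len(cols) < 5:
            continue  # skip blank or truncated lines
        counts[decode(int(cols[4], 16))] += 1
    return counts


for (sct, sc), n in sorted(tally(ROWS).items()):
    print(f"sct={sct:#x} sc={sc:#04x}: {n} entries")
```

For the posted rows this gives two entries with type 0x0 / code 0x01 and six with type 0x0 / code 0x02. 0xc005 differs from 0xc004 only in the phase tag bit, so both decode to the same status.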

In general, errors of this type are caused by an invalid or unsupported command being sent to the NVMe SSD, so there is nothing to worry about.

Lookup codes:

NVME_STATUS_TYPE_GENERIC_COMMAND = 0,
NVME_STATUS_TYPE_COMMAND_SPECIFIC = 1,
NVME_STATUS_TYPE_MEDIA_ERROR = 2,
NVME_STATUS_TYPE_VENDOR_SPECIFIC = 7,

// Status Code (SC) of NVME_STATUS_TYPE_GENERIC_COMMAND

NVME_STATUS_SUCCESS_COMPLETION = 0x00,
NVME_STATUS_INVALID_COMMAND_OPCODE = 0x01,
NVME_STATUS_INVALID_FIELD_IN_COMMAND = 0x02,
NVME_STATUS_COMMAND_ID_CONFLICT = 0x03,
NVME_STATUS_DATA_TRANSFER_ERROR = 0x04,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_POWER_LOSS_NOTIFICATION = 0x05,
NVME_STATUS_INTERNAL_DEVICE_ERROR = 0x06,
NVME_STATUS_COMMAND_ABORT_REQUESTED = 0x07,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_SQ_DELETION = 0x08,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_FAILED_FUSED_COMMAND = 0x09,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_FAILED_MISSING_COMMAND = 0x0A,
NVME_STATUS_INVALID_NAMESPACE_OR_FORMAT = 0x0B,
NVME_STATUS_COMMAND_SEQUENCE_ERROR = 0x0C,
NVME_STATUS_INVALID_SGL_LAST_SEGMENT_DESCR = 0x0D,
NVME_STATUS_INVALID_NUMBER_OF_SGL_DESCR = 0x0E,
NVME_STATUS_DATA_SGL_LENGTH_INVALID = 0x0F,
NVME_STATUS_METADATA_SGL_LENGTH_INVALID = 0x10,
NVME_STATUS_SGL_DESCR_TYPE_INVALID = 0x11,
NVME_STATUS_INVALID_USE_OF_CONTROLLER_MEMORY_BUFFER = 0x12,
NVME_STATUS_PRP_OFFSET_INVALID = 0x13,
NVME_STATUS_ATOMIC_WRITE_UNIT_EXCEEDED = 0x14,
NVME_STATUS_OPERATION_DENIED = 0x15,
NVME_STATUS_SGL_OFFSET_INVALID = 0x16,
NVME_STATUS_RESERVED = 0x17,
NVME_STATUS_HOST_IDENTIFIER_INCONSISTENT_FORMAT = 0x18,
NVME_STATUS_KEEP_ALIVE_TIMEOUT_EXPIRED = 0x19,
NVME_STATUS_KEEP_ALIVE_TIMEOUT_INVALID = 0x1A,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_PREEMPT_ABORT = 0x1B,
NVME_STATUS_SANITIZE_FAILED = 0x1C,
NVME_STATUS_SANITIZE_IN_PROGRESS = 0x1D,
NVME_STATUS_SGL_DATA_BLOCK_GRANULARITY_INVALID = 0x1E,
NVME_STATUS_DIRECTIVE_TYPE_INVALID = 0x70,
NVME_STATUS_DIRECTIVE_ID_INVALID = 0x71,
NVME_STATUS_NVM_LBA_OUT_OF_RANGE = 0x80,
NVME_STATUS_NVM_CAPACITY_EXCEEDED = 0x81,
NVME_STATUS_NVM_NAMESPACE_NOT_READY = 0x82,
NVME_STATUS_NVM_RESERVATION_CONFLICT = 0x83,
NVME_STATUS_FORMAT_IN_PROGRESS = 0x84,

// Status Code (SC) of NVME_STATUS_TYPE_COMMAND_SPECIFIC

NVME_STATUS_COMPLETION_QUEUE_INVALID = 0x00,
NVME_STATUS_INVALID_QUEUE_IDENTIFIER = 0x01,
NVME_STATUS_MAX_QUEUE_SIZE_EXCEEDED = 0x02,
NVME_STATUS_ABORT_COMMAND_LIMIT_EXCEEDED = 0x03,
NVME_STATUS_ASYNC_EVENT_REQUEST_LIMIT_EXCEEDED = 0x05,
NVME_STATUS_INVALID_FIRMWARE_SLOT = 0x06,
NVME_STATUS_INVALID_FIRMWARE_IMAGE = 0x07,
NVME_STATUS_INVALID_INTERRUPT_VECTOR = 0x08,
NVME_STATUS_INVALID_LOG_PAGE = 0x09,
NVME_STATUS_INVALID_FORMAT = 0x0A,
NVME_STATUS_FIRMWARE_ACTIVATION_REQUIRES_CONVENTIONAL_RESET = 0x0B,
NVME_STATUS_INVALID_QUEUE_DELETION = 0x0C,
NVME_STATUS_FEATURE_ID_NOT_SAVEABLE = 0x0D,
NVME_STATUS_FEATURE_NOT_CHANGEABLE = 0x0E,
NVME_STATUS_FEATURE_NOT_NAMESPACE_SPECIFIC = 0x0F,
NVME_STATUS_FIRMWARE_ACTIVATION_REQUIRES_NVM_SUBSYSTEM_RESET = 0x10,
NVME_STATUS_FIRMWARE_ACTIVATION_REQUIRES_RESET = 0x11,
NVME_STATUS_FIRMWARE_ACTIVATION_REQUIRES_MAX_TIME_VIOLATION = 0x12,
NVME_STATUS_FIRMWARE_ACTIVATION_PROHIBITED = 0x13,
NVME_STATUS_OVERLAPPING_RANGE = 0x14,
NVME_STATUS_NAMESPACE_INSUFFICIENT_CAPACITY = 0x15,
NVME_STATUS_NAMESPACE_IDENTIFIER_UNAVAILABLE = 0x16,
NVME_STATUS_NAMESPACE_ALREADY_ATTACHED = 0x18,
NVME_STATUS_NAMESPACE_IS_PRIVATE = 0x19,
NVME_STATUS_NAMESPACE_NOT_ATTACHED = 0x1A,
NVME_STATUS_NAMESPACE_THIN_PROVISIONING_NOT_SUPPORTED = 0x1B,
NVME_STATUS_CONTROLLER_LIST_INVALID = 0x1C,
NVME_STATUS_DEVICE_SELF_TEST_IN_PROGRESS = 0x1D,
NVME_STATUS_BOOT_PARTITION_WRITE_PROHIBITED = 0x1E,
NVME_STATUS_INVALID_CONTROLLER_IDENTIFIER = 0x1F,
NVME_STATUS_INVALID_SECONDARY_CONTROLLER_STATE = 0x20,
NVME_STATUS_INVALID_NUMBER_OF_CONTROLLER_RESOURCES = 0x21,
NVME_STATUS_INVALID_RESOURCE_IDENTIFIER = 0x22,
NVME_STATUS_STREAM_RESOURCE_ALLOCATION_FAILED = 0x7F,
NVME_STATUS_NVM_CONFLICTING_ATTRIBUTES = 0x80,
NVME_STATUS_NVM_INVALID_PROTECTION_INFORMATION = 0x81,
NVME_STATUS_NVM_ATTEMPTED_WRITE_TO_READ_ONLY_RANGE = 0x82,

// Status Code (SC) of NVME_STATUS_TYPE_MEDIA_ERROR

NVME_STATUS_NVM_WRITE_FAULT = 0x80,
NVME_STATUS_NVM_UNRECOVERED_READ_ERROR = 0x81,
NVME_STATUS_NVM_END_TO_END_GUARD_CHECK_ERROR = 0x82,
NVME_STATUS_NVM_END_TO_END_APPLICATION_TAG_CHECK_ERROR = 0x83,
NVME_STATUS_NVM_END_TO_END_REFERENCE_TAG_CHECK_ERROR = 0x84,
NVME_STATUS_NVM_COMPARE_FAILURE = 0x85,
NVME_STATUS_NVM_ACCESS_DENIED = 0x86,
NVME_STATUS_NVM_DEALLOCATED_OR_UNWRITTEN_LOGICAL_BLOCK = 0x87,
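As the listing shows, the same status code value means different things under different status code types (0x81 appears in all three groups), so a lookup has to be keyed on both fields. Below is a minimal sketch with only a handful of entries copied from the listing above; the names `STATUS_TYPES`, `STATUS_CODES` and `describe` are invented for this example.

```python
# Status code types, copied from the listing.
STATUS_TYPES = {
    0: "NVME_STATUS_TYPE_GENERIC_COMMAND",
    1: "NVME_STATUS_TYPE_COMMAND_SPECIFIC",
    2: "NVME_STATUS_TYPE_MEDIA_ERROR",
    7: "NVME_STATUS_TYPE_VENDOR_SPECIFIC",
}

# A small subset of the listing, keyed by (status code type, status code).
STATUS_CODES = {
    (0, 0x00): "NVME_STATUS_SUCCESS_COMPLETION",
    (0, 0x01): "NVME_STATUS_INVALID_COMMAND_OPCODE",
    (0, 0x02): "NVME_STATUS_INVALID_FIELD_IN_COMMAND",
    (0, 0x81): "NVME_STATUS_NVM_CAPACITY_EXCEEDED",
    (1, 0x81): "NVME_STATUS_NVM_INVALID_PROTECTION_INFORMATION",
    (2, 0x81): "NVME_STATUS_NVM_UNRECOVERED_READ_ERROR",
}


def describe(status):
    """Map a raw 16-bit Status value to (type name, code name)."""
    field = (status >> 1) & 0x7FF   # drop the phase tag, keep 11 bits
    sct, sc = field >> 8, field & 0xFF
    type_name = STATUS_TYPES.get(sct, f"unknown type {sct:#x}")
    code_name = STATUS_CODES.get((sct, sc), f"unlisted code {sc:#04x}")
    return type_name, code_name


for raw in (0xC002, 0xC004, 0xC502):
    print(f"{raw:#06x}:", *describe(raw))
```

Codes missing from the subset come back as "unlisted code ..." instead of raising an error, so the table can be extended with further rows from the listing as needed.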

Answer 2

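In addition to @joep-van-steen's answer: if you can install the nvme-cli package (available by default in all major distributions), the nvme command will do the decoding for you.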

Show the entire error log (with decoded descriptions); this must of course be run with sudo:

# nvme error-log /dev/nvme0

Or, to retrieve only the lines that contain an "error":

# nvme error-log /dev/nvme0 | grep -Ei 'status_field\s+:\s+0[^(]'

The output of the last command will look like this:

status_field    : 0x4002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
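If the log needs to be post-processed rather than just read, the same filter can be sketched in Python over captured output. Two things here are assumptions: the text is taken to have been saved beforehand (for example by redirecting the command's output), and the "success" sample line is a guess at the format implied by the `0[^(]` pattern above, since only the failing line appears in this answer. `failed_entries` is a name invented for the example.

```python
import re

# Matches lines such as: status_field    : 0x4002(Invalid Field in Command: ...)
STATUS_LINE = re.compile(r"status_field\s*:\s*(0x[0-9a-fA-F]+|0)\((.*)\)\s*$")


def failed_entries(text):
    """Return (value, description) for every status_field that is not 0."""
    found = []
    for line in text.splitlines():
        match = STATUS_LINE.search(line)
        if not match:
            continue  # not a status_field line
        value = int(match.group(1), 16)
        if value != 0:
            found.append((value, match.group(2)))
    return found


# First line: assumed format of a successful entry (not taken from the answer).
# Second line: copied from the output shown above.
SAMPLE = """\
status_field    : 0(SUCCESS: The command completed successfully)
status_field    : 0x4002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
"""

for value, text in failed_entries(SAMPLE):
    print(f"{value:#06x}: {text}")
```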
