I regularly receive emails from the smartd daemon about my NVMe disk.
SMART error (ErrorCount) detected on host: desk
This message was generated by the smartd daemon running on:
host name: [redacted]
DNS domain: [redacted]
The following warning/error was logged by the smartd daemon:
Device: /dev/nvme0, number of Error Log entries increased from 2519 to 2521
Device info:
KBG30ZMV256G TOSHIBA, S/N:X8OPD1PGP12P, FW:ADHA0101
For details see host's SYSLOG.
You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Sat Oct 7 23:38:04 2023 EDT
Another message will be sent in 24 hours if the problem persists.
I have been trying to resolve this for months without success. Below are the commands I have tried and their output.
smartctl -a /dev/nvme0
$ sudo smartctl -a /dev/nvme0
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-13-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: KBG30ZMV256G TOSHIBA
Serial Number: X8OPD1PGP12P
Firmware Version: ADHA0101
PCI Vendor/Subsystem ID: 0x1179
IEEE OUI Identifier: 0x00080d
Controller ID: 0
NVMe Version: 1.2.1
Number of Namespaces: 1
Namespace 1 Size/Capacity: 256,060,514,304 [256 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 00080d 04004ad9aa
Local Time is: Sun Oct 15 17:53:35 2023 EDT
Firmware Updates (0x12): 1 Slot, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0017): Comp Wr_Unc DS_Mngmt Sav/Sel_Feat
Log Page Attributes (0x02): Cmd_Eff_Lg
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 82 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 3.30W - - 0 0 0 0 0 0
1 + 2.70W - - 1 1 1 1 0 0
2 + 2.30W - - 2 2 2 2 0 0
3 - 0.0500W - - 4 4 4 4 8000 32000
4 - 0.0050W - - 4 4 4 4 8000 40000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 - 4096 0 0
1 + 512 0 3
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 31 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 33%
Data Units Read: 35,454,740 [18.1 TB]
Data Units Written: 70,575,255 [36.1 TB]
Host Read Commands: 306,457,518
Host Write Commands: 881,616,851
Controller Busy Time: 12,766
Power Cycles: 342
Power On Hours: 21,991
Unsafe Shutdowns: 617
Media and Data Integrity Errors: 0
Error Information Log Entries: 2,528
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 31 Celsius
Error Information (NVMe Log 0x01, 16 of 64 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 2528 0 0x301c 0xc002 0x000 - 4 -
1 2527 0 0x201d 0xc004 0x028 - 1 -
2 2526 0 0x101d 0xc004 0x028 - 1 -
3 2525 0 0x6005 0xc002 0x000 - 4 -
4 2524 0 0x6004 0xc004 0x028 - 1 -
5 2523 0 0x5006 0xc004 0x028 - 1 -
6 2522 0 0x1006 0xc005 0x028 - 1 -
7 2521 0 0x4013 0xc005 0x028 - 0 -
nvme error-log /dev/nvme0
nvme list
$ sudo ./nvme-cli-latest-x86_64.AppImage list
Node Generic SN Model Namespace Usage Format FW Rev
--------------------- --------------------- -------------------- ---------------------------------------- ---------- -------------------------- ---------------- --------
/dev/nvme0n1 /dev/ng0n1 X8OPD1PGP12P KBG30ZMV256G TOSHIBA 0x1 256.06 GB / 256.06 GB 512 B + 0 B ADHA0101
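Answer 1
If you are asking how the error code is encoded in 0xC502: first compute 0xC502 >> 1 to get rid of the phase tag, which gives 0x6281. Then apply the mask 0x7ff to extract the lower 11 bits (3 for the status code type, 8 for the status code), which leaves 0x281. 0x2xx is "Media and Data Integrity Errors", and status code 0x81 is "Unrecovered Read Error".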
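The shift-and-mask steps above can be sketched in Python. This is a minimal illustration, not code from any NVMe tool: the function name `decode_status` is made up for this example, and it assumes the input is the 16-bit Status value as printed by smartctl, with the phase tag in bit 0.

```python
def decode_status(raw: int) -> tuple[int, int]:
    """Split a raw 16-bit status value into (status code type, status code)."""
    no_phase = raw >> 1        # drop the phase tag in bit 0
    field = no_phase & 0x7FF   # keep the low 11 bits
    sct = field >> 8           # upper 3 bits: status code type
    sc = field & 0xFF          # lower 8 bits: status code
    return sct, sc


sct, sc = decode_status(0xC502)
print(f"SCT=0x{sct:x} SC=0x{sc:02x}")  # SCT=0x2 SC=0x81
```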
We can apply the same logic to your errors.
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 2528 0 0x301c 0xc002 0x000 - 4 -
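Status 0xc002
- Remove the phase tag (shift right by one bit, the same as dividing by 2): 0x6001.
- Apply the mask 0x7ff (here the same as taking the rightmost three nibbles) to get 0x001.
- 0x0xx gives us NVME_STATUS_TYPE_GENERIC_COMMAND
- 0x01 gives us NVME_STATUS_INVALID_COMMAND_OPCODE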
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
1 2527 0 0x201d 0xc004 0x028 - 1 -
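Status 0xc004
- Remove the phase tag: 0x6002.
- Apply the mask 0x7ff to get 0x002.
- 0x0xx gives us NVME_STATUS_TYPE_GENERIC_COMMAND
- 0x02 gives us NVME_STATUS_INVALID_FIELD_IN_COMMAND
And so on.
In general, this type of error is caused by an invalid or unsupported command being sent to the NVMe SSD, so there is nothing to worry about.
Lookup codes: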
NVME_STATUS_TYPE_GENERIC_COMMAND = 0,
NVME_STATUS_TYPE_COMMAND_SPECIFIC = 1,
NVME_STATUS_TYPE_MEDIA_ERROR = 2,
NVME_STATUS_TYPE_VENDOR_SPECIFIC = 7,
// Status Code (SC) of NVME_STATUS_TYPE_GENERIC_COMMAND
NVME_STATUS_SUCCESS_COMPLETION = 0x00,
NVME_STATUS_INVALID_COMMAND_OPCODE = 0x01,
NVME_STATUS_INVALID_FIELD_IN_COMMAND = 0x02,
NVME_STATUS_COMMAND_ID_CONFLICT = 0x03,
NVME_STATUS_DATA_TRANSFER_ERROR = 0x04,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_POWER_LOSS_NOTIFICATION = 0x05,
NVME_STATUS_INTERNAL_DEVICE_ERROR = 0x06,
NVME_STATUS_COMMAND_ABORT_REQUESTED = 0x07,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_SQ_DELETION = 0x08,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_FAILED_FUSED_COMMAND = 0x09,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_FAILED_MISSING_COMMAND = 0x0A,
NVME_STATUS_INVALID_NAMESPACE_OR_FORMAT = 0x0B,
NVME_STATUS_COMMAND_SEQUENCE_ERROR = 0x0C,
NVME_STATUS_INVALID_SGL_LAST_SEGMENT_DESCR = 0x0D,
NVME_STATUS_INVALID_NUMBER_OF_SGL_DESCR = 0x0E,
NVME_STATUS_DATA_SGL_LENGTH_INVALID = 0x0F,
NVME_STATUS_METADATA_SGL_LENGTH_INVALID = 0x10,
NVME_STATUS_SGL_DESCR_TYPE_INVALID = 0x11,
NVME_STATUS_INVALID_USE_OF_CONTROLLER_MEMORY_BUFFER = 0x12,
NVME_STATUS_PRP_OFFSET_INVALID = 0x13,
NVME_STATUS_ATOMIC_WRITE_UNIT_EXCEEDED = 0x14,
NVME_STATUS_OPERATION_DENIED = 0x15,
NVME_STATUS_SGL_OFFSET_INVALID = 0x16,
NVME_STATUS_RESERVED = 0x17,
NVME_STATUS_HOST_IDENTIFIER_INCONSISTENT_FORMAT = 0x18,
NVME_STATUS_KEEP_ALIVE_TIMEOUT_EXPIRED = 0x19,
NVME_STATUS_KEEP_ALIVE_TIMEOUT_INVALID = 0x1A,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_PREEMPT_ABORT = 0x1B,
NVME_STATUS_SANITIZE_FAILED = 0x1C,
NVME_STATUS_SANITIZE_IN_PROGRESS = 0x1D,
NVME_STATUS_SGL_DATA_BLOCK_GRANULARITY_INVALID = 0x1E,
NVME_STATUS_DIRECTIVE_TYPE_INVALID = 0x70,
NVME_STATUS_DIRECTIVE_ID_INVALID = 0x71,
NVME_STATUS_NVM_LBA_OUT_OF_RANGE = 0x80,
NVME_STATUS_NVM_CAPACITY_EXCEEDED = 0x81,
NVME_STATUS_NVM_NAMESPACE_NOT_READY = 0x82,
NVME_STATUS_NVM_RESERVATION_CONFLICT = 0x83,
NVME_STATUS_FORMAT_IN_PROGRESS = 0x84,
// Status Code (SC) of NVME_STATUS_TYPE_COMMAND_SPECIFIC
NVME_STATUS_COMPLETION_QUEUE_INVALID = 0x00,
NVME_STATUS_INVALID_QUEUE_IDENTIFIER = 0x01,
NVME_STATUS_MAX_QUEUE_SIZE_EXCEEDED = 0x02,
NVME_STATUS_ABORT_COMMAND_LIMIT_EXCEEDED = 0x03,
NVME_STATUS_ASYNC_EVENT_REQUEST_LIMIT_EXCEEDED = 0x05,
NVME_STATUS_INVALID_FIRMWARE_SLOT = 0x06,
NVME_STATUS_INVALID_FIRMWARE_IMAGE = 0x07,
NVME_STATUS_INVALID_INTERRUPT_VECTOR = 0x08,
NVME_STATUS_INVALID_LOG_PAGE = 0x09,
NVME_STATUS_INVALID_FORMAT = 0x0A,
NVME_STATUS_FIRMWARE_ACTIVATION_REQUIRES_CONVENTIONAL_RESET = 0x0B,
NVME_STATUS_INVALID_QUEUE_DELETION = 0x0C,
NVME_STATUS_FEATURE_ID_NOT_SAVEABLE = 0x0D,
NVME_STATUS_FEATURE_NOT_CHANGEABLE = 0x0E,
NVME_STATUS_FEATURE_NOT_NAMESPACE_SPECIFIC = 0x0F,
NVME_STATUS_FIRMWARE_ACTIVATION_REQUIRES_NVM_SUBSYSTEM_RESET = 0x10,
NVME_STATUS_FIRMWARE_ACTIVATION_REQUIRES_RESET = 0x11,
NVME_STATUS_FIRMWARE_ACTIVATION_REQUIRES_MAX_TIME_VIOLATION = 0x12,
NVME_STATUS_FIRMWARE_ACTIVATION_PROHIBITED = 0x13,
NVME_STATUS_OVERLAPPING_RANGE = 0x14,
NVME_STATUS_NAMESPACE_INSUFFICIENT_CAPACITY = 0x15,
NVME_STATUS_NAMESPACE_IDENTIFIER_UNAVAILABLE = 0x16,
NVME_STATUS_NAMESPACE_ALREADY_ATTACHED = 0x18,
NVME_STATUS_NAMESPACE_IS_PRIVATE = 0x19,
NVME_STATUS_NAMESPACE_NOT_ATTACHED = 0x1A,
NVME_STATUS_NAMESPACE_THIN_PROVISIONING_NOT_SUPPORTED = 0x1B,
NVME_STATUS_CONTROLLER_LIST_INVALID = 0x1C,
NVME_STATUS_DEVICE_SELF_TEST_IN_PROGRESS = 0x1D,
NVME_STATUS_BOOT_PARTITION_WRITE_PROHIBITED = 0x1E,
NVME_STATUS_INVALID_CONTROLLER_IDENTIFIER = 0x1F,
NVME_STATUS_INVALID_SECONDARY_CONTROLLER_STATE = 0x20,
NVME_STATUS_INVALID_NUMBER_OF_CONTROLLER_RESOURCES = 0x21,
NVME_STATUS_INVALID_RESOURCE_IDENTIFIER = 0x22,
NVME_STATUS_STREAM_RESOURCE_ALLOCATION_FAILED = 0x7F,
NVME_STATUS_NVM_CONFLICTING_ATTRIBUTES = 0x80,
NVME_STATUS_NVM_INVALID_PROTECTION_INFORMATION = 0x81,
NVME_STATUS_NVM_ATTEMPTED_WRITE_TO_READ_ONLY_RANGE = 0x82,
// Status Code (SC) of NVME_STATUS_TYPE_MEDIA_ERROR
NVME_STATUS_NVM_WRITE_FAULT = 0x80,
NVME_STATUS_NVM_UNRECOVERED_READ_ERROR = 0x81,
NVME_STATUS_NVM_END_TO_END_GUARD_CHECK_ERROR = 0x82,
NVME_STATUS_NVM_END_TO_END_APPLICATION_TAG_CHECK_ERROR = 0x83,
NVME_STATUS_NVM_END_TO_END_REFERENCE_TAG_CHECK_ERROR = 0x84,
NVME_STATUS_NVM_COMPARE_FAILURE = 0x85,
NVME_STATUS_NVM_ACCESS_DENIED = 0x86,
NVME_STATUS_NVM_DEALLOCATED_OR_UNWRITTEN_LOGICAL_BLOCK = 0x87,
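Putting the decoding steps and the lookup table together, here is a rough Python sketch that tallies the eight Status values visible in the question's smartctl output. The dictionary holds only the three entries needed here, and the names `STATUS_NAMES`, `LOGGED`, and `status_name` are invented for this illustration.

```python
from collections import Counter

# Subset of the lookup table above: (status code type, status code) -> name.
STATUS_NAMES = {
    (0, 0x01): "NVME_STATUS_INVALID_COMMAND_OPCODE",
    (0, 0x02): "NVME_STATUS_INVALID_FIELD_IN_COMMAND",
    (2, 0x81): "NVME_STATUS_NVM_UNRECOVERED_READ_ERROR",
}

# Status column of the eight error-log entries shown in the question.
LOGGED = [0xC002, 0xC004, 0xC004, 0xC002, 0xC004, 0xC004, 0xC005, 0xC005]


def status_name(raw: int) -> str:
    """Decode a raw status value and look up its name."""
    field = (raw >> 1) & 0x7FF          # drop phase tag, keep low 11 bits
    key = (field >> 8, field & 0xFF)    # (status code type, status code)
    return STATUS_NAMES.get(key, f"unknown (SCT={key[0]}, SC=0x{key[1]:02x})")


counts = Counter(status_name(raw) for raw in LOGGED)
for name, count in counts.items():
    print(f"{count} x {name}")
```

All eight entries decode to one of the two generic statuses. 0xc005 differs from 0xc004 only in bit 0, the phase tag, so it also decodes to Invalid Field in Command.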
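Answer 2
In addition to @joep-van-steen's answer: if you can install the nvme-cli package (available by default in all major distributions), it will decode the status for you.
To show the entire error log with decoded descriptions (this must of course be run with sudo or as root):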
# nvme error-log /dev/nvme0
Or, to retrieve only the lines that contain an actual error:
# nvme error-log /dev/nvme0 | grep -Ei 'status_field\s+:\s+0[^(]'
The output of the last command will be:
status_field : 0x4002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
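Judging from this sample line, the value nvme-cli prints appears to have the phase tag already removed: 0x4002 masked with 0x7ff gives 0x002 (Invalid Field in Command), which matches the description in parentheses, whereas shifting right first would give 0x001. A small Python sketch under that assumption follows; the regex and the name `parse_status_line` are illustrative and not part of nvme-cli, and the line format is inferred from the single sample above.

```python
import re

# Pattern assumed from the one sample line above; other nvme-cli versions may differ.
LINE_RE = re.compile(r"status_field\s*:\s*(0x[0-9a-fA-F]+)\((.*)\)")


def parse_status_line(line: str):
    """Return (status code type, status code, description) or None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    value = int(match.group(1), 16)
    field = value & 0x7FF  # no shift: phase tag assumed to be stripped already
    return field >> 8, field & 0xFF, match.group(2)


sample = ("status_field : 0x4002(Invalid Field in Command: "
          "A reserved coded value or an unsupported value in a defined field)")
print(parse_status_line(sample))
```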