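My server has two HDDs (RAID1), two SSDs (RAID0), and one NVMe drive. Occasionally the HDDs hang at random for a few minutes; the last time, they were unresponsive for 20 minutes. I have no idea what could be causing this. I checked the node exporter data and found nothing unusual. It happened at night, so the server was quiet: there were no spikes in CPU usage, memory usage, or IOPS.
When the problem occurs, the HDDs are completely blocked, but I am not sure whether the same is true of the SSDs, or whether the NVMe drive is unaffected. The SMART data looks fine. I cannot tell whether this is a hardware problem or a software problem.
Any ideas on how to find more information?
Thanks.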
Debian Stretch, kernel 4.19.0-0.bpo.1-amd64. Relevant information from the log:
INFO: task md0_raid1:248 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/dm-17-8 D 0 4937 2 0x80000000
Call Trace:
? __schedule+0x3f5/0x880
schedule+0x32/0x80
rwsem_down_read_failed+0x12e/0x190
? call_rwsem_down_read_failed+0x14/0x30
call_rwsem_down_read_failed+0x14/0x30
down_read+0x1c/0x30
dm_thin_find_block+0x2e/0x70 [dm_thin_pool]
thin_map+0x168/0x270 [dm_thin_pool]
__map_bio+0x42/0x170 [dm_mod]
__split_and_process_non_flush+0x12c/0x220 [dm_mod]
? __process_bio+0x170/0x170 [dm_mod]
__split_and_process_bio+0xb2/0x1a0 [dm_mod]
__dm_make_request.isra.31+0x3f/0xa0 [dm_mod]
generic_make_request+0x1e7/0x410
? submit_bio+0x6c/0x140
submit_bio+0x6c/0x140
? guard_bio_eod+0x36/0x100
submit_bh_wbc+0x163/0x190
? jbd2_journal_begin_ordered_truncate+0xa0/0xa0 [jbd2]
jbd2_journal_commit_transaction+0x5ec/0x18a0 [jbd2]
? __switch_to_asm+0x40/0x70
? __switch_to_asm+0x34/0x70
? __switch_to_asm+0x40/0x70
? kjournald2+0xc1/0x260 [jbd2]
kjournald2+0xc1/0x260 [jbd2]
? remove_wait_queue+0x60/0x60
kthread+0xf8/0x130
? commit_timeout+0x10/0x10 [jbd2]
? kthread_create_worker_on_cpu+0x70/0x70
ret_from_fork+0x35/0x40
INFO: task rs:main Q:Reg:5075 blocked for more than 120 seconds.
Not tainted 4.19.0-0.bpo.1-amd64 #1 Debian 4.19.12-1~bpo9+1
? alloc_set_pte+0x3f8/0x5b0
__do_page_fault+0x255/0x4f0
RAX: 00007f6d6e592638 RBX: 00007fff92429f80 RCX: 00007f6d6e4e57c0
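To see at a glance which tasks the kernel reported as blocked, the hung-task warnings in a log like the one above can be pulled out with a short script. This is only a sketch: the function name `find_hung_tasks` and the `SAMPLE` text are illustrative, and the pattern assumes the message format shown in this log.

```python
import re

# Matches the kernel's hung-task warning, for example:
#   INFO: task md0_raid1:248 blocked for more than 120 seconds.
# A task name may itself contain colons and spaces ("rs:main Q:Reg"),
# so the name group is greedy and the PID is the last ":<digits>"
# that appears before the word "blocked".
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<name>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return a list of (task_name, pid, seconds) tuples found in log_text."""
    hits = []
    for line in log_text.splitlines():
        match = HUNG_TASK_RE.search(line)
        if match:
            hits.append(
                (match.group("name"), int(match.group("pid")), int(match.group("secs")))
            )
    return hits


# Two warning lines copied from the log above, plus the hint line
# that the kernel prints alongside them.
SAMPLE = """\
INFO: task md0_raid1:248 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
INFO: task rs:main Q:Reg:5075 blocked for more than 120 seconds.
"""

if __name__ == "__main__":
    for name, pid, secs in find_hung_tasks(SAMPLE):
        print(f"{name} (pid {pid}) blocked for more than {secs}s")
```

Feeding it the full kernel log rather than an excerpt would show whether the blocked tasks all sit on the same storage stack (md, dm-thin, jbd2) or are spread more widely.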
SMART data
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Re
Device Model: WDC WD2004FBYZ-01YCBB1
[...]
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 216) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 182 182 021 Pre-fail Always - 3866
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 62
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 100 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 058 058 000 Old_age Always - 31119
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 62
16 Unknown_Attribute 0x0022 007 193 000 Old_age Always - 390630975468
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 42
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 86
194 Temperature_Celsius 0x0022 119 104 000 Old_age Always - 28
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 29834 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Re
Device Model: WDC WD2004FBYZ-01YCBB1
[...]
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 216) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 180 180 021 Pre-fail Always - 3991
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 62
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 058 058 000 Old_age Always - 31118
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 62
16 Unknown_Attribute 0x0022 006 194 000 Old_age Always - 370886226476
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 42
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 81
194 Temperature_Celsius 0x0022 116 106 000 Old_age Always - 31
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 29833 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
=== START OF INFORMATION SECTION ===
Device Model: INTEL SSDSC2KG480G8
[...]
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x02) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 72) seconds.
Offline data collection
capabilities: (0x79) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 2) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 9754
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 4
170 Unknown_Attribute 0x0033 100 100 010 Pre-fail Always - 0
171 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
172 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
174 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 4
175 Program_Fail_Count_Chip 0x0033 100 100 010 Pre-fail Always - 21474773638
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0033 100 100 090 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 070 066 000 Old_age Always - 30 (Min/Max 17/36)
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 4
194 Temperature_Celsius 0x0022 100 100 000 Old_age Always - 30
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
225 Unknown_SSD_Attribute 0x0032 100 100 000 Old_age Always - 25630
226 Unknown_SSD_Attribute 0x0032 100 100 000 Old_age Always - 61
227 Unknown_SSD_Attribute 0x0032 100 100 000 Old_age Always - 69
228 Power-off_Retract_Count 0x0032 100 100 000 Old_age Always - 585288
232 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0
233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age Always - 0
234 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
235 Unknown_Attribute 0x0033 100 100 010 Pre-fail Always - 21474773638
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 25630
242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 58487
243 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 132255
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 9516 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
=== START OF INFORMATION SECTION ===
Device Model: INTEL SSDSC2KG480G8
[...]
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x02) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 72) seconds.
Offline data collection
capabilities: (0x79) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 2) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0032 099 099 000 Old_age Always - 1
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 9754
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 9
170 Unknown_Attribute 0x0033 099 099 010 Pre-fail Always - 0
171 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
172 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
174 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 9
175 Program_Fail_Count_Chip 0x0033 100 100 010 Pre-fail Always - 42949609951
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0033 100 100 090 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 069 065 000 Old_age Always - 31 (Min/Max 17/36)
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 9
194 Temperature_Celsius 0x0022 100 100 000 Old_age Always - 31
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
225 Unknown_SSD_Attribute 0x0032 100 100 000 Old_age Always - 25308
226 Unknown_SSD_Attribute 0x0032 100 100 000 Old_age Always - 61
227 Unknown_SSD_Attribute 0x0032 100 100 000 Old_age Always - 69
228 Power-off_Retract_Count 0x0032 100 100 000 Old_age Always - 585269
232 Available_Reservd_Space 0x0033 099 099 010 Pre-fail Always - 0
233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age Always - 0
234 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
235 Unknown_Attribute 0x0033 100 100 010 Pre-fail Always - 42949609951
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 25308
242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 58473
243 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 133623
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 9516 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Answer 1
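The cause was an NFS server running in a container/VPS (a long-reported bug; it does not matter whether it is a container or a VPS). When you mount NFS on the physical host from a container or VPS running on that same physical host, and the server comes under some pressure, the NFS server can hang (and the hang then spreads to other processes).
In this case it was caused by memory "pressure": about 50% of the memory was taken by applications and almost 50% by the file cache (so the server was not really under load).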
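A rough way to check for a similar split between application memory and file cache is to read the counters in `/proc/meminfo`. The sketch below is an approximation only: it treats `Buffers` + `Cached` as file cache and everything that is neither free nor cache as "used", ignoring finer categories such as `SReclaimable` or `Shmem`. The function name and the sample numbers are made up for illustration.

```python
def memory_breakdown(meminfo_text):
    """Return rough used/cache/free percentages from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])  # value in kB

    total = values["MemTotal"]
    free = values.get("MemFree", 0)
    # Approximation: count buffers and page cache together as "file cache".
    cache = values.get("Buffers", 0) + values.get("Cached", 0)
    used = total - free - cache

    return {
        "used_pct": 100 * used / total,
        "cache_pct": 100 * cache / total,
        "free_pct": 100 * free / total,
    }


# Hypothetical numbers chosen to resemble the situation described here.
SAMPLE_MEMINFO = """\
MemTotal:       32000000 kB
MemFree:          640000 kB
Buffers:         1280000 kB
Cached:         14080000 kB
"""

if __name__ == "__main__":
    # On a live system, pass the contents of /proc/meminfo instead.
    for name, pct in memory_breakdown(SAMPLE_MEMINFO).items():
        print(f"{name}: {pct:.1f}%")
```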
Mounting the NFS share from another container on the same host works fine.
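To check whether a host is in the situation this answer describes, one could scan its mount table for NFS mounts whose server address belongs to a container or VPS on the same machine. The sketch below makes two assumptions: the text is in `/proc/mounts` format, and you supply the set of addresses you know to be local guests yourself. The function names, sample mount lines, and addresses are all hypothetical.

```python
def nfs_server_address(source, options):
    """Prefer the addr= mount option; fall back to the host part of the source."""
    for opt in options.split(","):
        if opt.startswith("addr="):
            return opt[len("addr="):]
    return source.partition(":/")[0].strip("[]")


def find_local_nfs_mounts(mounts_text, local_addresses):
    """Return (server, export, mountpoint) for NFS mounts served from local_addresses."""
    flagged = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        source, mountpoint, fstype, options = fields[:4]
        if fstype not in ("nfs", "nfs4"):
            continue
        server = nfs_server_address(source, options)
        if server in local_addresses:
            export = "/" + source.partition(":/")[2]
            flagged.append((server, export, mountpoint))
    return flagged


# Hypothetical mount table: one NFS share served by a local container
# (10.0.3.15) and one served by a remote machine.
SAMPLE_MOUNTS = """\
/dev/md0 / ext4 rw,relatime 0 0
10.0.3.15:/srv/share /mnt/share nfs4 rw,relatime,vers=4.2,addr=10.0.3.15 0 0
nas.example.com:/backup /mnt/backup nfs rw,relatime,vers=3,addr=192.0.2.40 0 0
"""

# Assumption: addresses of containers/VPS guests running on this host.
LOCAL_GUEST_ADDRESSES = {"10.0.3.15"}

if __name__ == "__main__":
    # On a live system, read the text from /proc/mounts instead.
    for server, export, mountpoint in find_local_nfs_mounts(
        SAMPLE_MOUNTS, LOCAL_GUEST_ADDRESSES
    ):
        print(f"{mountpoint} is served by local guest {server}:{export}")
```

Any mount it reports is a candidate for the hang described above. It does not prove that the mount is the cause.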