My server crashes when I run stress-ng -d 9

I would like to know why my server crashes. A few minutes after I start the following command, the server reboots on its own:

stress-ng -d 9

The last log output I captured is as follows:

[pid  1547] write(3, "Z\26\260\2273\0Z\346\251\232\311\273e\10\263\6  \376\325(\330O\fG\326\326\330w\344\214t"..., 65536 <unfinished ...>
[pid  1546] write(3, "eT\323a\304\314\300^\25\360\224\224\20\342\6\201!\323\314T\nV\10A\214\25c!\256[\300K"..., 65536 <unfinished ...>
[pid  1545] write(3, "\3135\271\370\264\366\20\307\354\260a\236\337\223,\233u\212\327 a~\37\251\\E\365\217wR\304\200"..., 65536 <unfinished ...>
[pid  1544] write(3, "\357\240\353\341/\345\257\324\205\202&\342\25`\2162\306R\306\275\367\0061\206,ex(T\247S|"..., 65536 <unfinished ...>
[pid  1543] write(3, "\31\345T[a\35\201F\341\343\5\243F\250\23\221r\301\0367\221\3\202\320\310\32\263-\204B\234\32"..., 65536 <unfinished ...>
[pid  1547] <... write resumed> )       = 65536
[pid  1546] <... write resumed> )       = 65536
[pid  1542] write(3, "f;\337\363\340\332)\32nS:\204\254ab\223A\233Z\2\265.j\254\244\324b!p\275Xz"..., 65536 <unfinished ...>
[pid  1541] write(3, "\356\327\\`*\4K\350\

(The server crashed in the middle of the last line!)

I checked smartctl and everything looks fine:

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   028   100   000    Old_age   Always       -       28
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       112
166 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       1
167 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
168 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       3
169 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       33
170 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
171 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
173 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       2
174 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       111
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   050   100   000    Old_age   Always       -       50 (Min/Max 0/52)
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       1
230 Unknown_SSD_Attribute   0x0032   100   100   000    Old_age   Always       -       0
232 Available_Reservd_Space 0x0033   100   100   004    Pre-fail  Always       -       100
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       530
241 Total_LBAs_Written      0x0030   253   253   000    Old_age   Offline      -       489
242 Total_LBAs_Read         0x0030   253   253   000    Old_age   Offline      -       507

The disk speed also seems OK:

root@aaa:/home/customer# hdparm -Tt /dev/sda
/dev/sda:
Timing cached reads: 24102 MB in 2.00 seconds = 12063.26 MB/sec
Timing buffered disk reads: 968 MB in 3.00 seconds = 322.25 MB/sec
root@aaa:/home/customer# hdparm -Tt /dev/sda
/dev/sda:
Timing cached reads: 24290 MB in 2.00 seconds = 12156.88 MB/sec
Timing buffered disk reads: 968 MB in 3.00 seconds = 322.28 MB/sec

Any ideas?

Answer 1

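It may be worth taking that specific drive out of the equation, for example by re-running the test on an externally mounted drive, to see whether this is a general kernel issue or a problem with that particular drive. The stress-ng -d (HDD) stressor just hammers the file system with a lot of generic read/write patterns, so it is surprising that it causes this kind of hang. I therefore assume this may be a problem with that specific drive.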
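As a rough sketch of that re-run, something like the following could point the same nine HDD workers at a different drive. The mount point /mnt/external is only a placeholder I made up, and the 10 minute timeout is an arbitrary choice; the RUN variable and the run_or_print helper are likewise just illustrative. By default the script only prints the command, so nothing is stressed until you set RUN=1.

```shell
#!/bin/sh
# Hypothetical mount point of the second drive; change to match your system.
TEST_DIR="${TEST_DIR:-/mnt/external}"

# Illustrative helper: print the command unless RUN=1, in which case execute it.
run_or_print() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

# Same load as "stress-ng -d 9" (--hdd is the long form of -d), but with the
# temporary files placed under TEST_DIR instead of the current directory.
# --timeout stops the run after 10 minutes; --metrics-brief prints a summary.
run_or_print stress-ng --hdd 9 --temp-path "$TEST_DIR" --timeout 10m --metrics-brief
```

If the machine survives this run but still reboots when the files land on /dev/sda, that would point at the original drive (or its cabling or controller) rather than at the kernel in general.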