I'm trying to find out why my server keeps crashing. It reboots on its own a few minutes after I start:
stress-ng -d 9
The last log output I captured is below:
[pid 1547] write(3, "Z\26\260\2273\0Z\346\251\232\311\273e\10\263\6 \376\325(\330O\fG\326\326\330w\344\214t"..., 65536 <unfinished ...>
[pid 1546] write(3, "eT\323a\304\314\300^\25\360\224\224\20\342\6\201!\323\314T\nV\10A\214\25c!\256[\300K"..., 65536 <unfinished ...>
[pid 1545] write(3, "\3135\271\370\264\366\20\307\354\260a\236\337\223,\233u\212\327 a~\37\251\\E\365\217wR\304\200"..., 65536 <unfinished ...>
[pid 1544] write(3, "\357\240\353\341/\345\257\324\205\202&\342\25`\2162\306R\306\275\367\0061\206,ex(T\247S|"..., 65536 <unfinished ...>
[pid 1543] write(3, "\31\345T[a\35\201F\341\343\5\243F\250\23\221r\301\0367\221\3\202\320\310\32\263-\204B\234\32"..., 65536 <unfinished ...>
[pid 1547] <... write resumed> ) = 65536
[pid 1546] <... write resumed> ) = 65536
[pid 1542] write(3, "f;\337\363\340\332)\32nS:\204\254ab\223A\233Z\2\265.j\254\244\324b!p\275Xz"..., 65536 <unfinished ...>
[pid 1541] write(3, "\356\327\\`*\4K\350\
(The server crashed in the middle of that last line!)
I checked smartctl and everything looks fine:
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   028   100   000    Old_age   Always       -       28
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       112
166 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       1
167 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
168 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       3
169 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       33
170 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
171 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
173 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       2
174 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       111
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   050   100   000    Old_age   Always       -       50 (Min/Max 0/52)
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       1
230 Unknown_SSD_Attribute   0x0032   100   100   000    Old_age   Always       -       0
232 Available_Reservd_Space 0x0033   100   100   004    Pre-fail  Always       -       100
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       530
241 Total_LBAs_Written      0x0030   253   253   000    Old_age   Offline      -       489
242 Total_LBAs_Read         0x0030   253   253   000    Old_age   Offline      -       507
The disk speed also seems fine:
root@aaa:/home/customer# hdparm -Tt /dev/sda
/dev/sda:
Timing cached reads: 24102 MB in 2.00 seconds = 12063.26 MB/sec
Timing buffered disk reads: 968 MB in 3.00 seconds = 322.25 MB/sec
root@aaa:/home/customer# hdparm -Tt /dev/sda
/dev/sda:
Timing cached reads: 24290 MB in 2.00 seconds = 12156.88 MB/sec
Timing buffered disk reads: 968 MB in 3.00 seconds = 322.28 MB/sec
Any ideas?
Answer 1
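It may be worth eliminating that specific drive from the test, e.g. by re-running the test on an externally mounted drive, to see whether this is a general kernel issue or a problem with that particular drive. The -d (hdd) stress-ng stressor just hammers the file system with a lot of generic read/write patterns, so it is surprising that it causes this kind of hang. For that reason, I assume this is probably an issue with that specific drive.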
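As a rough sketch of re-running the same test against a different drive: stress-ng has a --temp-path option that controls where the hdd stressor creates its temporary files. The mount point /mnt/external, the TEST_DIR and DRY_RUN variables, the run_hdd_stress helper, and the 10 minute timeout are my assumptions, not something from the question, so adjust them for your system.

```shell
#!/bin/sh
# Sketch: run the same 9 hdd workers, but with the temp files on another drive.
# /mnt/external is a hypothetical mount point for the second drive.
TEST_DIR="${TEST_DIR:-/mnt/external}"

run_hdd_stress() {
    dir=$1
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: only print the command so it can be reviewed first.
        echo "stress-ng --hdd 9 --temp-path $dir --timeout 10m --metrics-brief"
    else
        # --hdd 9 is the long form of -d 9; --timeout stops the run after 10 minutes.
        stress-ng --hdd 9 --temp-path "$dir" --timeout 10m --metrics-brief
    fi
}

run_hdd_stress "$TEST_DIR"
```

Run it once as-is to see the command, then with DRY_RUN=0 to actually start the test. If the machine survives on the second drive but still reboots on the original one, that points at the drive (or its cable or controller port) rather than the kernel.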