CPU thermal management: detecting faulty behavior

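CPUs can dynamically lower their clock frequency based on temperature to avoid overheating. At work I have two servers, and one of them has been misbehaving (random reboots).

The snippet below is what I see in the system logs of both machines. Is this the result of CPU dynamic frequency scaling working normally, or a sign of some fault (for example, improperly applied thermal paste)?

I would expect something as mundane as a modern CPU's dynamic frequency scaling not to show up in the system log.

P.S. No overclocking has been done or attempted at any point while we have been running these servers.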

The kernel log indicates that hardware errors were detected.
System log may have more information.
The last 20 mcelog lines of system log are:
==========================================
Jan 31 17:13:12 apollo3 mcelog: Family 6 Model 4f CPU: only decoding architectural errors
Feb  2 15:07:50 apollo3 mcelog: Family 6 Model 4f CPU: only decoding architectural errors
Feb  2 15:07:50 apollo3 mcelog: Hardware event. This is not a software error.
Feb  2 15:07:50 apollo3 mcelog: MCE 0
Feb  2 15:07:50 apollo3 mcelog: CPU 1 THERMAL EVENT TSC 15900247053fc
Feb  2 15:07:50 apollo3 mcelog: TIME 1486044329 Thu Feb  2 15:05:29 2017
Feb  2 15:07:50 apollo3 mcelog: Processor 1 heated above trip temperature. Throttling enabled.
Feb  2 15:07:50 apollo3 mcelog: Please check your system cooling. Performance will be impacted
Feb  2 15:07:50 apollo3 mcelog: STATUS 88000bcb MCGSTATUS 0
Feb  2 15:07:50 apollo3 mcelog: MCGCAP 7000c16 APICID 4 SOCKETID 0
Feb  2 15:07:50 apollo3 mcelog: CPUID Vendor Intel Family 6 Model 79
Feb  2 15:07:50 apollo3 mcelog: Family 6 Model 4f CPU: only decoding architectural errors
Feb  2 15:07:50 apollo3 mcelog: Hardware event. This is not a software error.
Feb  2 15:07:50 apollo3 mcelog: MCE 1
Feb  2 15:07:50 apollo3 mcelog: CPU 1 THERMAL EVENT TSC 15900247241ad
Feb  2 15:07:50 apollo3 mcelog: TIME 1486044329 Thu Feb  2 15:05:29 2017
Feb  2 15:07:50 apollo3 mcelog: Processor 1 below trip temperature. Throttling disabled
Feb  2 15:07:50 apollo3 mcelog: STATUS 88010a8a MCGSTATUS 0
Feb  2 15:07:50 apollo3 mcelog: MCGCAP 7000c16 APICID 4 SOCKETID 0
Feb  2 15:07:50 apollo3 mcelog: CPUID Vendor Intel Family 6 Model 79
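
Since both machines log these events, one way to compare them is to tally how often each CPU crosses the trip temperature. The following is only a minimal sketch: it assumes the syslog wording shown in the excerpt above, and the function name `count_thermal_events` is made up for illustration.

```python
import re
from collections import Counter

# Patterns for the two mcelog messages seen in the excerpt above.
ENABLED = re.compile(r"mcelog: Processor (\d+) heated above trip temperature")
DISABLED = re.compile(r"mcelog: Processor (\d+) below trip temperature")


def count_thermal_events(lines):
    """Return (throttle_on, throttle_off) Counters keyed by CPU number."""
    on, off = Counter(), Counter()
    for line in lines:
        m = ENABLED.search(line)
        if m:
            on[int(m.group(1))] += 1
            continue
        m = DISABLED.search(line)
        if m:
            off[int(m.group(1))] += 1
    return on, off
```

Passing it an open log file from each server (the file location depends on the distribution) gives per-CPU counts that can be compared side by side. A much higher count on the rebooting machine would point toward a cooling problem on that box specifically.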

Answer 1

Just as it says: the CPU is overheating.

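  1. Clean all fans and check that they are running properly

  2. Replace the thermal paste (if the machine is still under warranty, skip to step 3)

  3. If the problem persists, contact the manufacturer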
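
To see whether a step above actually helped, the kernel's per-core throttle counters can be read before and after. This is a rough sketch, assuming a Linux system with an Intel CPU that exposes `thermal_throttle/core_throttle_count` under `/sys/devices/system/cpu`; the function name `read_throttle_counts` is hypothetical.

```python
from pathlib import Path


def read_throttle_counts(root="/sys/devices/system/cpu"):
    """Map each cpuN directory name to its core_throttle_count, where present."""
    counts = {}
    pattern = "cpu[0-9]*/thermal_throttle/core_throttle_count"
    for f in Path(root).glob(pattern):
        try:
            counts[f.parent.parent.name] = int(f.read_text().strip())
        except (OSError, ValueError):
            # Counter unreadable or malformed on this kernel/CPU: skip it.
            continue
    return counts
```

If the counters keep climbing under normal load after the fans have been cleaned and the paste replaced, that supports escalating to the manufacturer.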
