Ubuntu 12.10 segmentation faults, corrupted crash reports, random letters changing in files

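I have Ubuntu 12.10 installed on an SSD (OCZ Agility 3 128GB), with a moderately overclocked i5-2500k (4.4GHz) on a P8Z68V_LX motherboard. I suspect there may be something wrong with the SSD. It is barely used and is only 11% full at the moment.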

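When running Ruby on Rails, something will occasionally break for no apparent reason, usually because a single letter in a core library seems to have changed. For example, in a hash an "S" had changed to "{", and a few days later, in a spork file, the name in a def had changed to "s{ite" when it should obviously have been "suite".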
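One way to look at the two corruptions described above is to compare the expected and observed characters bit by bit. The sketch below is only illustrative (the helper name `bit_diff` is made up for this post); it XORs the ASCII codes of the two characters and counts the differing bits. A change confined to a few bits of one byte is at least consistent with bad memory or storage rather than an accidental edit, though it does not prove it.

```python
def bit_diff(expected, observed):
    """Return (xor_mask, differing_bit_count) for two single characters."""
    mask = ord(expected) ^ ord(observed)
    return mask, bin(mask).count("1")


# The two substitutions reported in the question: "S" -> "{" and "u" -> "{".
for expected, observed in [("S", "{"), ("u", "{")]:
    mask, flipped = bit_diff(expected, observed)
    print(f"{expected!r} -> {observed!r}: xor=0b{mask:08b}, bits flipped={flipped}")
```

Both substitutions land on the same byte value (0x7B), differing from the expected character in two and three bits respectively.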

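Ubuntu keeps hitting lots of internal errors, but it cannot report them, which produces another error that it then tries to report... and so on. Sometimes it complains about incorrect padding.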

This is not my main work machine, so I am happy to experiment on it to find out what is wrong.

smartctl output:

> sudo smartctl -a /dev/sda
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.5.0-27-generic] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     SandForce Driven SSDs
Device Model:     OCZ-AGILITY3
Serial Number:    OCZ-822QB5MV0QDI394P
LU WWN Device Id: 5 e83a97 e3d1ecf1a
Firmware Version: 2.15
User Capacity:    120,034,123,776 bytes [120 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ACS-2 revision 3
Local Time is:    Thu Apr 18 15:40:12 2013 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
          was completed without error.
          Auto Offline Data Collection: Disabled.
Self-test execution status:      (   1) The previous self-test routine completed
          without error or no self-test has ever
          been run.
Total time to complete Offline
data collection:    ( 1465) seconds.
Offline data collection
capabilities:        (0x7f) SMART execute Offline immediate.
          Auto Offline data collection on/off support.
          Abort Offline collection upon new
          command.
          Offline surface scan supported.
          Self-test supported.
          Conveyance Self-test supported.
          Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
          power-saving mode.
          Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
          General Purpose Logging supported.
Short self-test routine
recommended polling time:    (   1) minutes.
Extended self-test routine
recommended polling time:    (  48) minutes.
Conveyance self-test routine
recommended polling time:    (   2) minutes.
SCT capabilities:          (0x0021) SCT Status supported.
          SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   090   090   050    Pre-fail  Always       -       0/2566041
  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0
  9 Power_On_Hours_and_Msec 0x0032   100   100   000    Old_age   Always       -       731h+39m+09.960s
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       256
171 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       68
177 Wear_Range_Delta        0x0000   000   000   000    Old_age   Offline      -       1
181 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0
182 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   030   030   000    Old_age   Always       -       30 (Min/Max 30/30)
195 ECC_Uncorr_Error_Count  0x001c   120   120   000    Old_age   Offline      -       0/2566041
196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always       -       0
201 Unc_Soft_Read_Err_Rate  0x001c   120   120   000    Old_age   Offline      -       0/2566041
204 Soft_ECC_Correct_Rate   0x001c   120   120   000    Old_age   Offline      -       0/2566041
230 Life_Curve_Status       0x0013   100   100   000    Pre-fail  Always       -       100
231 SSD_Life_Left           0x0013   100   100   010    Pre-fail  Always       -       0
233 SandForce_Internal      0x0000   000   000   000    Old_age   Offline      -       481
234 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       454
241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       454
242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always       -       1025

SMART Error Log not supported
SMART Self-test Log not supported
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
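For anyone who wants to check an attribute table like the one above without reading it by eye, here is a rough sketch. It assumes the ten-column layout shown in this output; the names `parse_attributes` and `failing` are made up for illustration, and `SAMPLE` is just three rows copied from the table. It flags an attribute when its threshold is non-zero and the normalized VALUE has dropped to or below THRESH.

```python
import re

# One attribute row: ID, name, flag, value, worst, thresh, type, updated,
# when_failed, then the raw value (which may contain spaces).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(.*)$"
)


def parse_attributes(text):
    """Map attribute ID -> parsed fields for every row that matches ROW."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            attrs[int(m.group(1))] = {
                "name": m.group(2),
                "value": int(m.group(4)),
                "thresh": int(m.group(6)),
                "raw": m.group(10).strip(),
            }
    return attrs


def failing(attrs):
    """IDs whose normalized value is at or below a non-zero threshold."""
    return sorted(
        i for i, a in attrs.items() if a["thresh"] > 0 and a["value"] <= a["thresh"]
    )


SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0
174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       68
231 SSD_Life_Left           0x0013   100   100   010    Pre-fail  Always       -       0
"""

attrs = parse_attributes(SAMPLE)
print("failing attribute IDs:", failing(attrs))
```

On these rows nothing is flagged, which matches the PASSED overall-health result and gives no particular reason to blame the SSD.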

Update:

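Ran Memtest86. At first it showed a lot of errors on the second pass, so I rebooted and checked the voltages in the BIOS, which were all fine. I dropped the clock back to the stock 3.3GHz and rechecked the voltages; everything was still fine.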

Voltages:

CPU  : 1.096V
3.3V : 3.344V
5V   : 5.000V
12V  : 12.096V

Re-ran Memtest86 overnight:

Time 16:23:23  Iterations: 6  AdrsMode:64Bit   Pass: 24 Errors:65535+

Error Confidence Value: 50
Lowest Error Address: 00180a73000 - 6154.4MB
Highest Error Address: 001dffffffc - 7679.9MB
Bits in Error Mask: ffffffff
Bits in Error - Total: 32  Min: 1  Max:31  Avg:32768
Max Contiguous Errors: 65535+

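According to the MemTest86 documentation, a confidence value above 100 means there is definitely a memory problem. Since the value here is only 50, I will swap the RAM around to see whether the fault is in the RAM or the motherboard.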

Update 2:

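I swapped the two 4GB RAM sticks between A2 and B2 (which is where they are supposed to go, rather than A1 and B1, which would be too intuitive) and ran memtest: 6 passes with no errors. Overclocked to 4.3GHz: 6 more passes, still no errors. Maybe I had not seated the RAM properly...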

Update 3:

Left it running over the weekend and it found errors, which suggests the motherboard may be at fault:

Time:  61:07:22   Iterations:240   AdrsMode:64Bit   Pass: 106   Errors: 65535+

Error Confidence Value: 77
Lowest Error Address  : 001c0027000 -  7168.1MB
Highest Error Address : 001dffffffc -  7679.9MB
Bits in Error Mask    : ffffffff
Bits in Error - Total : 32  Min: 1  Max: 31  Avg: 32768
Max Contiguous Errors : 65535+

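My best guess is that, because the error addresses are still high (above 4GB even after swapping the RAM between the slots in use), the problem is the motherboard.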
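As a quick sanity check on the addresses in the two Memtest86 reports, the sketch below converts the hexadecimal addresses to MB (1 MB = 1024 × 1024 bytes here, truncated to one decimal place, which reproduces the figures Memtest86 printed) and checks whether each lies above the 4GB mark. The function name `addr_to_mb` is just illustrative, and the sketch makes no attempt to map an address to a physical slot, since that depends on how the board interleaves the channels.

```python
import math

MB = 1024 * 1024
FOUR_GB = 4 * 1024 ** 3


def addr_to_mb(hex_addr):
    """Convert a hex physical address to MB, truncated to one decimal place."""
    return math.floor(int(hex_addr, 16) / MB * 10) / 10


# Lowest and highest error addresses from the overnight and weekend runs.
reports = {
    "overnight lowest": "00180a73000",
    "overnight highest": "001dffffffc",
    "weekend lowest": "001c0027000",
    "weekend highest": "001dffffffc",
}

for label, addr in reports.items():
    above = int(addr, 16) >= FOUR_GB
    print(f"{label:18s} {addr} = {addr_to_mb(addr):7.1f} MB  above 4GB: {above}")
```

All four addresses fall between roughly 6GB and 7.5GB, and the highest error address is identical in both runs even though the sticks had been swapped in between.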

Update 4:

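Put the RAM in slots A1 and B1. MemTest has now done 44 passes with no errors. It must be the motherboard: one of the slots is bad. I would not be too quick to blame the ASUS board, though; it could have been shipping damage or my own clumsy hands.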

Answer 1

The problem was the motherboard; specifically, memory errors occurred when particular RAM slots were used. MemTest86 is a very useful tool!
