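I have Ubuntu 12.10 installed on an SSD (OCZ Agility 3 128GB), with a moderately overclocked i5-2500K (4.4GHz) on a P8Z68V_LX motherboard. I think the SSD may be faulty. It has barely been used and is currently only 11% full.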
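When running Ruby on Rails, things sometimes break for no apparent reason, usually because a single character in one of the core library files seems to have changed. For example, in a hash an "S" had changed to "{", and a few days later, in a Spork file, the name of a def had changed to "s{ite" when it should obviously have been "suite".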
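Both substitutions replace one ASCII character with another, which is the kind of silent corruption faulty RAM can cause before data is written back to disk. One quick way to see how "close" the characters are is to XOR their character codes and list the bit positions that differ. This is a minimal Python sketch; `flipped_bits` is a hypothetical helper, and it only compares the two character pairs reported above.

```python
def flipped_bits(expected: str, actual: str) -> list[int]:
    """Return the bit positions (0 = least significant) that differ between two characters."""
    diff = ord(expected) ^ ord(actual)
    return [bit for bit in range(8) if (diff >> bit) & 1]


# The two corruptions described above: "S" -> "{" and "suite" -> "s{ite" (u -> {).
for expected, actual in [("S", "{"), ("u", "{")]:
    print(f"{expected!r} -> {actual!r}: bits {flipped_bits(expected, actual)}")
```

For these pairs it reports bits [3, 5] and [1, 2, 3]. Both are multi-bit differences rather than single-bit flips, so this is suggestive of memory corruption rather than proof of it.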
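Ubuntu keeps hitting lots of internal errors but can't report them, so it raises another error about failing to report them... and so on. Sometimes it complains about incorrect padding.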
This isn't my main work machine, so I'm keen to experiment with it to find out what's going on.
smartctl output:
> sudo smartctl -a /dev/sda
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.5.0-27-generic] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: SandForce Driven SSDs
Device Model: OCZ-AGILITY3
Serial Number: OCZ-822QB5MV0QDI394P
LU WWN Device Id: 5 e83a97 e3d1ecf1a
Firmware Version: 2.15
User Capacity: 120,034,123,776 bytes [120 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ACS-2 revision 3
Local Time is: Thu Apr 18 15:40:12 2013 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x02) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 1) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 1465) seconds.
Offline data collection
capabilities: (0x7f) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Abort Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 48) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x0021) SCT Status supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 090 090 050 Pre-fail Always - 0/2566041
5 Retired_Block_Count 0x0033 100 100 003 Pre-fail Always - 0
9 Power_On_Hours_and_Msec 0x0032 100 100 000 Old_age Always - 731h+39m+09.960s
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 256
171 Program_Fail_Count 0x0032 000 000 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 000 000 000 Old_age Always - 0
174 Unexpect_Power_Loss_Ct 0x0030 000 000 000 Old_age Offline - 68
177 Wear_Range_Delta 0x0000 000 000 000 Old_age Offline - 1
181 Program_Fail_Count 0x0032 000 000 000 Old_age Always - 0
182 Erase_Fail_Count 0x0032 000 000 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 030 030 000 Old_age Always - 30 (Min/Max 30/30)
195 ECC_Uncorr_Error_Count 0x001c 120 120 000 Old_age Offline - 0/2566041
196 Reallocated_Event_Count 0x0033 100 100 003 Pre-fail Always - 0
201 Unc_Soft_Read_Err_Rate 0x001c 120 120 000 Old_age Offline - 0/2566041
204 Soft_ECC_Correct_Rate 0x001c 120 120 000 Old_age Offline - 0/2566041
230 Life_Curve_Status 0x0013 100 100 000 Pre-fail Always - 100
231 SSD_Life_Left 0x0013 100 100 010 Pre-fail Always - 0
233 SandForce_Internal 0x0000 000 000 000 Old_age Offline - 481
234 SandForce_Internal 0x0032 000 000 000 Old_age Always - 454
241 Lifetime_Writes_GiB 0x0032 000 000 000 Old_age Always - 454
242 Lifetime_Reads_GiB 0x0032 000 000 000 Old_age Always - 1025
SMART Error Log not supported
SMART Self-test Log not supported
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
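Since the SSD was the first suspect, one way to read the attribute table above is to parse each row and flag any attribute whose normalized VALUE has dropped to or below a non-zero THRESH, which is roughly how smartctl decides an attribute has failed. This is a minimal Python sketch: the regex assumes the column layout shown (ID, name, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE), `parse_attributes` and `failing` are hypothetical helper names, and it is fed only a handful of rows copied from the output.

```python
import re

# A few attribute rows copied from the `smartctl -a /dev/sda` output above.
SAMPLE = """\
5 Retired_Block_Count 0x0033 100 100 003 Pre-fail Always - 0
174 Unexpect_Power_Loss_Ct 0x0030 000 000 000 Old_age Offline - 68
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
196 Reallocated_Event_Count 0x0033 100 100 003 Pre-fail Always - 0
231 SSD_Life_Left 0x0013 100 100 010 Pre-fail Always - 0
"""

# Columns: ID, name, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d{3})\s+(\d{3})\s+(\d{3})"
    r"\s+\S+\s+\S+\s+(\S+)\s+(.+)$"
)


def parse_attributes(text):
    """Map attribute name -> numeric columns plus the raw value string.

    Keyed by name, so duplicate names in the full table (e.g. 171/181
    Program_Fail_Count) would overwrite each other; key by ID for the whole table.
    """
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # skip headers and anything that isn't an attribute row
        attr_id, name, value, worst, thresh, when_failed, raw = m.groups()
        attrs[name] = {
            "id": int(attr_id),
            "value": int(value),
            "worst": int(worst),
            "thresh": int(thresh),
            "when_failed": when_failed,
            "raw": raw.strip(),
        }
    return attrs


def failing(attrs):
    """Names whose normalized VALUE is at or below a non-zero THRESH (THRESH 0 never fails)."""
    return [n for n, a in attrs.items() if a["thresh"] > 0 and a["value"] <= a["thresh"]]


attrs = parse_attributes(SAMPLE)
print("failing:", failing(attrs))
for name in ("Retired_Block_Count", "Reported_Uncorrect", "Reallocated_Event_Count"):
    print(name, "raw =", attrs[name]["raw"])
```

On these rows nothing is flagged and the retired-block, uncorrectable and reallocation counters are all 0, which fits the overall PASSED result and the later conclusion that the SSD was not the culprit.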
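Update:
Ran Memtest86. At first it showed lots of errors on the second pass, so I rebooted and checked the voltages in the BIOS; they were all fine. Dropped the clock back to the stock 3.3GHz, rechecked the voltages, and everything was normal.
Voltages: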
CPU : 1.096V
3.3V : 3.344V
5V : 5.000V
12V : 12.096V
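To put "all fine" into numbers, each rail reading can be compared with its nominal voltage. This Python sketch assumes the commonly cited ATX tolerance of ±5% for the +3.3V, +5V and +12V rails; `deviation_pct` and `TOLERANCE_PCT` are illustrative names, and the CPU core voltage is left out because it has no fixed nominal value to compare against.

```python
# Nominal rail voltage -> BIOS reading, from the list above.
READINGS = {3.3: 3.344, 5.0: 5.000, 12.0: 12.096}
TOLERANCE_PCT = 5.0  # assumed ATX tolerance for the +3.3V, +5V and +12V rails


def deviation_pct(nominal, measured):
    """Signed deviation of a reading from its nominal value, in percent."""
    return (measured - nominal) / nominal * 100


for nominal, measured in READINGS.items():
    dev = deviation_pct(nominal, measured)
    status = "OK" if abs(dev) <= TOLERANCE_PCT else "OUT OF SPEC"
    print(f"{nominal:>4}V rail: {measured:.3f}V ({dev:+.2f}%) {status}")
```

The largest deviation is about +1.3% on the 3.3V rail, well inside the assumed tolerance.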
Reran Memtest86 overnight:
Time 16:23:23 Iterations: 6 AdsrMode:64Bit Pass: 24 Errors:65535+
Error Confidence Value: 50
Lowest Error Address: 00180a73000 - 6154.4MB
Highest Error Address: 001dffffffc - 7679.9MB
Bits in Error Mask: ffffffff
Bits in Error - Total: 32 Min: 1 Max:31 Avg:32768
Max Contiguous Errors: 65535+
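According to the MemTest86 documentation, an error confidence value above 100 indicates a definite memory problem. Since the value here is only 50, I'm going to swap the RAM around to see whether the problem is the RAM or the motherboard.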
Update 2:
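I swapped the two 4GB sticks between A2 and B2 (that's where they're supposed to go, not A1 and B1; that would be too intuitive) and ran memtest: no errors after 6 passes. Overclocked to 4.3GHz: still no errors after 6 passes. Maybe I just hadn't seated the sticks properly...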
Update 3:
Left it running over the weekend and it found errors, suggesting the motherboard may be at fault:
Time: 61:07:22 Iterations:240 AdrsMode:64Bit Pass: 106 Errors: 65535+
Error Confidence Value: 77
Lowest Error Address : 001c0027000 - 7168.1MB
Highest Error Address : 001dffffffc - 7679.9MB
Bits in Error Mask : ffffffff
Bits in Error - Total : 32 Min: 1 Max: 31 Avg: 32768
Max Contiguous Errors : 65535+
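My best guess is that it's the motherboard, because the error addresses are still high (above 4GB) even after swapping the RAM between the slots in use.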
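The addresses MemTest86 reports are byte addresses in hex; dividing by 2^20 reproduces the MB figures printed next to them and confirms that every reported error sits above the 4GB mark. This Python sketch only checks that arithmetic. `to_mb_string` is a hypothetical helper, and the one-decimal truncation is an assumption chosen because it matches the printed values (0x1dffffffc is shown as 7679.9MB, not 7680.0MB). It says nothing about which physical slot a given address maps to.

```python
MIB = 1 << 20      # bytes per MB as MemTest86 reports it (2^20)
FOUR_GB = 4 << 30  # the 4GB boundary mentioned above, in bytes


def to_mb_string(addr):
    """Format a byte address as MB with one decimal, truncating rather than rounding."""
    tenths = addr * 10 // MIB
    return f"{tenths // 10}.{tenths % 10}MB"


# Lowest/highest error addresses from the two failing MemTest86 runs above.
ERROR_ADDRESSES = {
    "overnight run, lowest": 0x00180A73000,
    "overnight run, highest": 0x001DFFFFFFC,
    "weekend run, lowest": 0x001C0027000,
    "weekend run, highest": 0x001DFFFFFFC,
}

for label, addr in ERROR_ADDRESSES.items():
    print(f"{label}: {addr:#x} = {to_mb_string(addr)}, above 4GB: {addr >= FOUR_GB}")
```

The conversions come out as 6154.4MB, 7679.9MB and 7168.1MB, matching the MemTest86 output, and all four addresses are above 4GB.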
Update 4:
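Put the RAM in slots A1 and B1. MemTest has now completed 44 passes with no errors. It's definitely the motherboard: one of the slots is bad. I'm reluctant to blame ASUS; it could have been damaged in shipping or by my own clumsy hands.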
Answer 1
The problem was the motherboard: specifically, memory errors occurred when particular RAM slots were used. MemTest86 is a really useful tool!