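My server logs frequent segmentation faults from various tools to /var/log/kern.log. So far I have seen them in Perl, PHP, and rsync. All installed software comes from up-to-date Debian packages. Here is an excerpt from the log file: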
Mar 2 01:07:54 gaz kernel: [ 5316.246303] imapsync[4533]: segfault at 8b ip 00007fb448c98fe6 sp 00007ffff571dd68 error 4 in libperl.so.5.10.1[7fb448bd7000+164000]
Mar 2 01:17:42 gaz kernel: [ 5904.354307] php5-cgi[4441]: segfault at 2bb3dc8 ip 0000000002bb3dc8 sp 00007fffbeeaae48 error 15
Mar 2 02:54:05 gaz kernel: [11687.922316] php5-cgi[4495]: segfault at 2d7acf9 ip 0000000002d7acf9 sp 00007fff60c6eb18 error 15
Mar 2 10:50:08 gaz kernel: [40250.390322] BUG: unable to handle kernel paging request at 00000000024b03f0
Mar 2 10:50:08 gaz kernel: [40250.390341] IP: [<00000000024b03f0>] 0x24b03f0
Mar 2 10:50:08 gaz kernel: [40250.390353] PGD 208c71067 PUD 21c811067 PMD 209329067 PTE 8000000211c88067
Mar 2 10:50:08 gaz kernel: [40250.390365] Oops: 0011 [#1] SMP
Mar 2 10:50:08 gaz kernel: [40250.390373] last sysfs file: /sys/devices/pci0000:00/0000:00:12.0/host4/target4:0:0/4:0:0:0/block/sdb/stat
Mar 2 10:50:08 gaz kernel: [40250.390386] CPU 1
Mar 2 10:50:08 gaz kernel: [40250.390392] Modules linked in: cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative xt_recent xt_tcpudp iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 ip6table_filter ip6_tables xt_DSCP xt_TCPMSS ipt_LOG ipt_REJECT iptable_mangle iptable_filter xt_multiport xt_state xt_limit xt_conntrack nf_conntrack_ftp nf_conntrack ip_tables x_tables loop snd_hda_codec_atihdmi snd_hda_intel snd_hda_codec snd_hwdep snd_pcm radeon snd_timer ttm snd drm_kms_helper soundcore drm snd_page_alloc i2c_algo_bit shpchp i2c_piix4 edac_core pcspkr k8temp evdev edac_mce_amd pci_hotplug i2c_core button ext3 jbd mbcache dm_mod powernow_k8 aacraid 3w_9xxx 3w_xxxx raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 md_mod sata_nv sata_sil sata_via sd_mod crc_t10dif ata_generic ahci pata_atiixp ohci_hcd libata r8169 mii thermal ehci_hcd processor thermal_sys scsi_mod usbcore nls_base [last unloaded: scsi_wait_scan]
Mar 2 10:50:08 gaz kernel: [40250.390566] Pid: 11482, comm: munin-limits Not tainted 2.6.32-5-amd64 #1 MS-7368
Mar 2 10:50:08 gaz kernel: [40250.390576] RIP: 0010:[<00000000024b03f0>] [<00000000024b03f0>] 0x24b03f0
Mar 2 10:50:08 gaz kernel: [40250.390586] RSP: 0018:ffff88021cc8dec0 EFLAGS: 00010286
Mar 2 10:50:08 gaz kernel: [40250.390593] RAX: 000000001ddc1000 RBX: 0000000000000010 RCX: ffffffff810f9904
Mar 2 10:50:08 gaz kernel: [40250.390600] RDX: 0000000000000000 RSI: ffffea0007688200 RDI: 0000000000000286
Mar 2 10:50:08 gaz kernel: [40250.390608] RBP: 00000000ffffffea R08: 0000000000000025 R09: 7865542f30312e35
Mar 2 10:50:08 gaz kernel: [40250.390615] R10: 000000d01cc8ddf8 R11: 0000000000000246 R12: ffff88021cc8def8
Mar 2 10:50:08 gaz kernel: [40250.390622] R13: 0000000002295010 R14: 00000000022c9db0 R15: 0000000002488d78
Mar 2 10:50:08 gaz kernel: [40250.390630] FS: 00007f3b3c8b2700(0000) GS:ffff880008d00000(0000) knlGS:0000000000000000
Mar 2 10:50:08 gaz kernel: [40250.390641] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 2 10:50:08 gaz kernel: [40250.390648] CR2: 00000000024b03f0 CR3: 000000021c5d1000 CR4: 00000000000006e0
Mar 2 10:50:08 gaz kernel: [40250.390656] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 2 10:50:08 gaz kernel: [40250.390663] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Mar 2 10:50:08 gaz kernel: [40250.390671] Process munin-limits (pid: 11482, threadinfo ffff88021cc8c000, task ffff88021bf59530)
Mar 2 10:50:08 gaz kernel: [40250.390681] Stack:
Mar 2 10:50:08 gaz kernel: [40250.390687] ffffffff810f1d4a ffff880208c63228 0000000000000000 00007fffc2dcecc0
Mar 2 10:50:08 gaz kernel: [40250.390697] <0> 00000000024ba2b0 0000000002295010 ffffffff810f1e3d 0000000000000004
Mar 2 10:50:08 gaz kernel: [40250.390712] <0> ffff88021bf59530 ffff88021c4edc00 ffffffff812fe0b6 ffff88021c4edc60
Mar 2 10:50:08 gaz kernel: [40250.390732] Call Trace:
Mar 2 10:50:08 gaz kernel: [40250.390742] [<ffffffff810f1d4a>] ? vfs_fstatat+0x2c/0x57
Mar 2 10:50:08 gaz kernel: [40250.390750] [<ffffffff810f1e3d>] ? sys_newstat+0x11/0x30
Mar 2 10:50:08 gaz kernel: [40250.390760] [<ffffffff812fe0b6>] ? do_page_fault+0x2e0/0x2fc
Mar 2 10:50:08 gaz kernel: [40250.390768] [<ffffffff812fbf55>] ? page_fault+0x25/0x30
Mar 2 10:50:08 gaz kernel: [40250.390777] [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
Mar 2 10:50:08 gaz kernel: [40250.390783] Code: Bad RIP value.
Mar 2 10:50:08 gaz kernel: [40250.390791] RIP [<00000000024b03f0>] 0x24b03f0
Mar 2 10:50:08 gaz kernel: [40250.390799] RSP <ffff88021cc8dec0>
Mar 2 10:50:08 gaz kernel: [40250.390805] CR2: 00000000024b03f0
Mar 2 10:50:08 gaz kernel: [40250.391051] ---[ end trace 1cc1473b539c7f6e ]---
Mar 2 11:42:20 gaz kernel: [43382.242301] php5-cgi[10963]: segfault at d81160 ip 0000000000d81160 sp 00007fff3adcb058 error 15
Mar 2 21:51:14 gaz kernel: [79916.418302] php5-cgi[20089]: segfault at 1c59dc8 ip 0000000001c59dc8 sp 00007fff9b877fb8 error 15
Mar 3 03:45:01 gaz kernel: [101143.334305] munin-update[22519] general protection ip:7f516dce204c sp:7fff6049a978 error:0 in libperl.so.5.10.1[7f516dc7d000+164000]
Mar 3 11:22:37 gaz kernel: [128599.570307] php5-cgi[22888]: segfault at 36485a8 ip 00000000036485a8 sp 00007fff2d56e1c8 error 15
Mar 4 08:32:17 gaz kernel: [204779.842304] php5-cgi[22090]: segfault at 18 ip 0000000000689e5e sp 00007fff677a6a48 error 6 in php5-cgi[400000+6f9000]
Mar 4 10:01:02 gaz kernel: [210104.434706] rsync[22236] general protection ip:7f14a07137f9 sp:7fff88f940b8 error:0 in libc-2.11.2.so[7f14a069d000+158000]
Mar 4 11:32:22 gaz kernel: [215584.262316] BUG: unable to handle kernel paging request at 00000000ffffff9c
Mar 4 11:32:22 gaz kernel: [215584.262331] IP: [<00000000ffffff9c>] 0xffffff9c
Mar 4 11:32:22 gaz kernel: [215584.262343] PGD 0
Mar 4 11:32:22 gaz kernel: [215584.262350] Oops: 0010 [#2] SMP
Mar 4 11:32:22 gaz kernel: [215584.262359] last sysfs file: /sys/devices/pci0000:00/0000:00:12.0/host4/target4:0:0/4:0:0:0/block/sdb/stat
Mar 4 11:32:22 gaz kernel: [215584.262371] CPU 1
Mar 4 11:32:22 gaz kernel: [215584.262378] Modules linked in: cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative xt_recent xt_tcpudp iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 ip6table_filter ip6_tables xt_DSCP xt_TCPMSS ipt_LOG ipt_REJECT iptable_mangle iptable_filter xt_multiport xt_state xt_limit xt_conntrack nf_conntrack_ftp nf_conntrack ip_tables x_tables loop snd_hda_codec_atihdmi snd_hda_intel snd_hda_codec snd_hwdep snd_pcm radeon snd_timer ttm snd drm_kms_helper soundcore drm snd_page_alloc i2c_algo_bit shpchp i2c_piix4 edac_core pcspkr k8temp evdev edac_mce_amd pci_hotplug i2c_core button ext3 jbd mbcache dm_mod powernow_k8 aacraid 3w_9xxx 3w_xxxx raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 md_mod sata_nv sata_sil sata_via sd_mod crc_t10dif ata_generic ahci pata_atiixp ohci_hcd libata r8169 mii thermal ehci_hcd processor thermal_sys scsi_mod usbcore nls_base [last unloaded: scsi_wait_scan]
Mar 4 11:32:22 gaz kernel: [215584.262552] Pid: 1960, comm: proxymap Tainted: G D 2.6.32-5-amd64 #1 MS-7368
Mar 4 11:32:22 gaz kernel: [215584.262563] RIP: 0010:[<00000000ffffff9c>] [<00000000ffffff9c>] 0xffffff9c
Mar 4 11:32:22 gaz kernel: [215584.262573] RSP: 0018:ffff880209257e00 EFLAGS: 00010212
Mar 4 11:32:22 gaz kernel: [215584.262580] RAX: ffff8801514eb780 RBX: ffffffff810efb2d RCX: 0000000000000000
Mar 4 11:32:22 gaz kernel: [215584.262590] RDX: 0000000000000020 RSI: 0000000000000001 RDI: ffff8801514eb780
Mar 4 11:32:22 gaz kernel: [215584.262600] RBP: 00000000ffffffe9 R08: 0000000000000000 R09: 0000000000000000
Mar 4 11:32:22 gaz kernel: [215584.262611] R10: ffff880209257e78 R11: ffffffff81152c7c R12: 0000000000000001
Mar 4 11:32:22 gaz kernel: [215584.262622] R13: 0000000000008001 R14: 0000000000000024 R15: 00000000ffffff9c
Mar 4 11:32:22 gaz kernel: [215584.262633] FS: 00007fca4de35700(0000) GS:ffff880008d00000(0000) knlGS:0000000000000000
Mar 4 11:32:22 gaz kernel: [215584.262644] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 4 11:32:22 gaz kernel: [215584.262650] CR2: 00000000ffffff9c CR3: 00000001c9cbb000 CR4: 00000000000006e0
Mar 4 11:32:22 gaz kernel: [215584.262661] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 4 11:32:22 gaz kernel: [215584.262671] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Mar 4 11:32:22 gaz kernel: [215584.262682] Process proxymap (pid: 1960, threadinfo ffff880209256000, task ffff88021c4b1c40)
Mar 4 11:32:22 gaz kernel: [215584.262693] Stack:
Mar 4 11:32:22 gaz kernel: [215584.262698] ffffffff810f8566 ffff880209257e78 ffff88021c7bf000 ffff88021c7bf0c8
Mar 4 11:32:22 gaz kernel: [215584.262709] <0> 0000800000000000 ffff88021fc0f000 ffff880209257e78 00000000fffffffe
Mar 4 11:32:22 gaz kernel: [215584.262724] <0> ffffffff810e5881 ffff880209257f48 0000000000000286 ffff88021fc0f000
Mar 4 11:32:22 gaz kernel: [215584.262743] Call Trace:
Mar 4 11:32:22 gaz kernel: [215584.262753] [<ffffffff810f8566>] ? do_filp_open+0xa7/0x94b
Mar 4 11:32:22 gaz kernel: [215584.262763] [<ffffffff810e5881>] ? virt_to_head_page+0x9/0x2a
Mar 4 11:32:22 gaz kernel: [215584.262771] [<ffffffff810f9904>] ? user_path_at+0x52/0x79
Mar 4 11:32:22 gaz kernel: [215584.262779] [<ffffffff810cfec1>] ? get_unmapped_area+0xd7/0x139
Mar 4 11:32:22 gaz kernel: [215584.262787] [<ffffffff811019d5>] ? alloc_fd+0x67/0x10c
Mar 4 11:32:22 gaz kernel: [215584.262795] [<ffffffff810eceaf>] ? do_sys_open+0x55/0xfc
Mar 4 11:32:22 gaz kernel: [215584.262804] [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
Mar 4 11:32:22 gaz kernel: [215584.262811] Code: Bad RIP value.
Mar 4 11:32:22 gaz kernel: [215584.262819] RIP [<00000000ffffff9c>] 0xffffff9c
Mar 4 11:32:22 gaz kernel: [215584.262828] RSP <ffff880209257e00>
Mar 4 11:32:22 gaz kernel: [215584.262833] CR2: 00000000ffffff9c
Mar 4 11:32:22 gaz kernel: [215584.263077] ---[ end trace 1cc1473b539c7f6f ]---
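As you can see, there are segfaults, general protection faults, and kernel errors. My first guess was some kind of hardware problem, so I asked my hosting provider (it is a rented root server) to run a full hardware check. They did, but found nothing.
I don't know what they checked or how, but their support team is usually very good. I also ran memtester and cpuburn myself, and neither reported any errors.
Unfortunately, I have no reliable way to reproduce the segfaults; they seem to be more or less random. On a hunch, I disabled the system's firewall and ran a program that segfaults often (imapsync). It seemed to take longer to segfault than before, so the problem might be related to the network stack. Or it might just be chance.
Here are the kernel specs: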
# uname -a
Linux gaz 2.6.32-5-amd64 #1 SMP Wed Jan 12 03:40:32 UTC 2011 x86_64 GNU/Linux
# cat /etc/debian_version
6.0
# lsmod
Module Size Used by
cpufreq_userspace 1992 0
cpufreq_stats 2659 0
cpufreq_powersave 902 0
cpufreq_conservative 5162 0
xt_recent 5977 0
xt_tcpudp 2319 0
iptable_nat 4299 0
nf_nat 13388 1 iptable_nat
nf_conntrack_ipv4 9833 3 iptable_nat,nf_nat
nf_defrag_ipv4 1139 1 nf_conntrack_ipv4
ip6table_filter 2384 0
ip6_tables 15075 1 ip6table_filter
xt_DSCP 1995 0
xt_TCPMSS 2919 0
ipt_LOG 4518 0
ipt_REJECT 1953 0
iptable_mangle 2817 0
iptable_filter 2258 0
xt_multiport 2267 0
xt_state 1303 0
xt_limit 1782 0
xt_conntrack 2407 0
nf_conntrack_ftp 5537 0
nf_conntrack 46535 6 iptable_nat,nf_nat,nf_conntrack_ipv4,xt_state,xt_conntrack,nf_conntrack_ftp
ip_tables 13899 3 iptable_nat,iptable_mangle,iptable_filter
x_tables 12845 13 xt_recent,xt_tcpudp,iptable_nat,ip6_tables,xt_DSCP,xt_TCPMSS,ipt_LOG,ipt_REJECT,xt_multiport,xt_state,xt_limit,xt_conntrack,ip_tables
loop 11799 0
radeon 573996 0
ttm 39986 1 radeon
drm_kms_helper 20065 1 radeon
snd_hda_codec_atihdmi 2251 1
drm 142359 3 radeon,ttm,drm_kms_helper
snd_hda_intel 20019 0
i2c_algo_bit 4225 1 radeon
pcspkr 1699 0
i2c_piix4 8328 0
snd_hda_codec 54244 2 snd_hda_codec_atihdmi,snd_hda_intel
i2c_core 15712 5 radeon,drm_kms_helper,drm,i2c_algo_bit,i2c_piix4
snd_hwdep 5380 1 snd_hda_codec
snd_pcm 60503 2 snd_hda_intel,snd_hda_codec
snd_timer 15582 1 snd_pcm
snd 46446 5 snd_hda_intel,snd_hda_codec,snd_hwdep,snd_pcm,snd_timer
soundcore 4598 1 snd
evdev 7352 3
snd_page_alloc 6249 2 snd_hda_intel,snd_pcm
k8temp 3283 0
edac_core 29261 0
edac_mce_amd 6433 0
shpchp 26264 0
pci_hotplug 21203 1 shpchp
button 4650 0
ext3 106518 2
jbd 37085 1 ext3
mbcache 5050 1 ext3
dm_mod 53754 0
powernow_k8 10978 1
aacraid 59779 0
3w_9xxx 28684 0
3w_xxxx 20569 0
raid10 17809 0
raid456 44500 0
async_raid6_recov 5170 1 raid456
async_pq 3479 2 raid456,async_raid6_recov
raid6_pq 77179 2 async_raid6_recov,async_pq
async_xor 2478 3 raid456,async_raid6_recov,async_pq
xor 4380 1 async_xor
async_memcpy 1198 2 raid456,async_raid6_recov
async_tx 1734 5 raid456,async_raid6_recov,async_pq,async_xor,async_memcpy
raid1 18431 3
raid0 5517 0
md_mod 73824 7 raid10,raid456,raid1,raid0
sata_nv 19166 0
sata_sil 7412 0
sata_via 7928 0
sd_mod 29889 8
crc_t10dif 1276 1 sd_mod
ata_generic 3047 0
ahci 32374 6
r8169 29229 0
mii 3210 1 r8169
thermal 11674 0
pata_atiixp 3489 0
libata 133632 6 sata_nv,sata_sil,sata_via,ata_generic,ahci,pata_atiixp
ohci_hcd 19212 0
ehci_hcd 31151 0
processor 29935 1 powernow_k8
thermal_sys 11942 2 thermal,processor
scsi_mod 122149 5 aacraid,3w_9xxx,3w_xxxx,sd_mod,libata
usbcore 122034 3 ohci_hcd,ehci_hcd
nls_base 6377 1 usbcore
# free
total used free shared buffers cached
Mem: 8166128 1228036 6938092 0 140412 782060
-/+ buffers/cache: 305564 7860564
Swap: 2102456 0 2102456
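So, basically my questions are:
- How can I diagnose this problem further?
- Is there any data in the logs above that could help me track down the troublemaker?
- Are there known issues with the hardware/software above that I overlooked when googling?
- Is there a way to stop the kernel from loading modules automatically? (I probably don't need all of these modules, and one of them might be the culprit.)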
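Answer 1
Check your memory!
The most common cause of random segfaults like these is faulty memory. Run a memory tester (e.g. memtest86+) and see what it reports.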
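Answer 2
The first things to check: how much memory the server has, and how large the swap partition is. Look through the other log files (syslog) for potential sources of information. Check whether there are known issues with this kernel version on the current hardware (or virtualization system). I run Debian 6 with this kernel in small (VMware) virtual machines without any problems.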
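When going through the logs, it can help to decode the `error` field of each segfault line rather than reading it as an opaque number. The following is only a rough sketch: it assumes the x86 page-fault error code layout and relies on the kernel printing that code in hexadecimal, so `error 15` means 0x15. The names `decode_error`, `parse_segfaults` and `SAMPLE` are made up for this illustration; the sample lines are copied from the question.

```python
import re

# Matches the kernel's "segfault at ..." lines as they appear in kern.log.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<obj>[^\[\s]+)\[)?"
)


def decode_error(code_hex):
    """Decode an x86 page-fault error code given as a hex string."""
    code = int(code_hex, 16)
    parts = [
        "protection violation" if code & 0x01 else "page not present",
        "write" if code & 0x02 else "read",
        "user mode" if code & 0x04 else "kernel mode",
    ]
    if code & 0x08:
        parts.append("reserved bit set in page table")
    if code & 0x10:
        parts.append("instruction fetch")
    return ", ".join(parts)


def parse_segfaults(lines):
    """Return one dict per segfault line found in `lines`."""
    events = []
    for line in lines:
        m = SEGFAULT_RE.search(line)
        if not m:
            continue
        addr, ip = int(m["addr"], 16), int(m["ip"], 16)
        events.append({
            "comm": m["comm"],
            "pid": int(m["pid"]),
            "object": m["obj"],  # None when the kernel named no mapping
            "error": decode_error(m["err"]),
            # True when the CPU faulted while executing at the fault address
            "jumped_to_fault_addr": addr == ip,
        })
    return events


# Three lines taken verbatim from the log excerpt in the question.
SAMPLE = [
    "Mar 2 01:07:54 gaz kernel: [ 5316.246303] imapsync[4533]: segfault at 8b ip 00007fb448c98fe6 sp 00007ffff571dd68 error 4 in libperl.so.5.10.1[7fb448bd7000+164000]",
    "Mar 2 01:17:42 gaz kernel: [ 5904.354307] php5-cgi[4441]: segfault at 2bb3dc8 ip 0000000002bb3dc8 sp 00007fffbeeaae48 error 15",
    "Mar 4 08:32:17 gaz kernel: [204779.842304] php5-cgi[22090]: segfault at 18 ip 0000000000689e5e sp 00007fff677a6a48 error 6 in php5-cgi[400000+6f9000]",
]

for event in parse_segfaults(SAMPLE):
    print(event)
```

To run it over the real log, pass an open file instead of `SAMPLE`, for example `parse_segfaults(open("/var/log/kern.log"))`. Read this way, the `error 15` entries where the fault address equals `ip` are instruction fetches from a non-executable page: the process jumped to a bogus address instead of merely reading bad data.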
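Answer 3
One thing I would check is whether your hosting provider uses so-called "burst RAM". Cheap hosting often comes with some base amount of RAM that can be extended temporarily. The problem with this temporarily extended RAM is that you cannot rely on it: it may be taken away in the middle of a computation, which leads to segfaults.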
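One possible way to check this from inside the machine, assuming the host is an OpenVZ/Virtuozzo container (nothing in the question confirms that): such containers expose their resource limits in `/proc/user_beancounters`, and a non-zero `failcnt` means an allocation was refused. The sketch below parses that table. The function names and the numbers in `SAMPLE` are invented for illustration, and on a dedicated machine the file will simply not exist.

```python
def parse_beancounters(text):
    """Parse /proc/user_beancounters-style text into {resource: counters}."""
    fields = ("held", "maxheld", "barrier", "limit", "failcnt")
    counters = {}
    for line in text.splitlines():
        tokens = line.split()
        if tokens and tokens[0].endswith(":"):
            tokens = tokens[1:]  # drop the "uid:" prefix on the first row
        if len(tokens) != 6:
            continue  # version line, header line, or blank line
        name, values = tokens[0], tokens[1:]
        if not all(v.isdigit() for v in values):
            continue
        counters[name] = dict(zip(fields, map(int, values)))
    return counters


def failed_resources(counters):
    """Return {resource: failcnt} for every resource that hit its limit."""
    return {n: c["failcnt"] for n, c in counters.items() if c["failcnt"] > 0}


# Invented sample in the beancounters layout; not real data from this server.
SAMPLE = """\
Version: 2.5
       uid  resource           held    maxheld    barrier      limit    failcnt
      101:  kmemsize        2825379    3160927   11055923   11377049          0
            privvmpages       65530      69640      65536      69632         12
"""

print(failed_resources(parse_beancounters(SAMPLE)))
```

On a real container you would feed it the file contents, for example `parse_beancounters(open("/proc/user_beancounters").read())`, which usually requires root.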