Lots of general protection faults

I recently upgraded my home server from Ubuntu 10.04 to 12.04.1. It runs the linux-image-server kernel on x86_64.

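I don't think I'm running anything particularly unusual: the deluge daemon, apache2, an iptables firewall with IP masquerading, a DHCP server, a bind DNS server (whose zone files are updated automatically with the hostnames that DHCP clients identify themselves as), sshd, an nfs server, and a few other things. This machine is my router; it sits between the internet and the local network.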

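Since the upgrade, it has been failing intermittently. Everything works for a while after boot, then suddenly we lose network connectivity over wifi. If I plug in a network cable, I can't get an IP address from the DHCP server. If I give myself a static IP address, I can carry on using the internet normally. This looks like the DHCP server falling over (indeed, I ran dhclient -v eth0 and nothing answered its DHCPDISCOVER broadcasts), which clients notice when they try to renew their IP leases. But with a static IP I can still reach the internet, so iptables is still working fine.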

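So I tried to SSH into the machine, but it appeared to hang. With ssh in verbose mode, I can see that it does establish a connection to the server and then fails some time later; it is hard to tell exactly where.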

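I noticed that if I fetch a web page from its HTTP server, I get the page I asked for, but none of the follow-up requests (for images, stylesheets, javascript) are served. However, if I request those files directly, e.g. with curl, I can get them.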

Does this mean that things go wrong whenever something tries to fork?

I hauled a monitor and keyboard over to the server (it is normally headless) and had a look: I saw stack traces.

I switched to a new virtual terminal and tried to log in. After entering my password, I got a stack trace (general protection fault). Here it is:

Jan  6 20:19:54 localhost kernel: [ 1475.178245] general protection fault: 0000 [#12] SMP 
Jan  6 20:19:54 localhost kernel: [ 1475.178292] CPU 1 
Jan  6 20:19:54 localhost kernel: [ 1475.178309] Modules linked in: btrfs zlib_deflate libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs reiserfs ext2 nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc dm_crypt ppdev ipt_REJECT ipt_LOG ipt_MASQUERADE xt_state iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables joydev sp5100_tco edac_core i2c_piix4 serio_raw k8temp edac_mce_amd snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore parport_pc snd_page_alloc mac_hid shpchp lp parport radeon 8139too ttm drm_kms_helper drm pata_atiixp i2c_algo_bit usbhid hid wmi r8169
Jan  6 20:19:54 localhost kernel: [ 1475.178911] 
Jan  6 20:19:54 localhost kernel: [ 1475.178927] Pid: 1305, comm: login Tainted: G    B D      3.2.0-35-generic #55-Ubuntu Gigabyte Technology Co., Ltd. GA-MA785GM-US2H/GA-MA785GM-US2H
Jan  6 20:19:54 localhost kernel: [ 1475.179028] RIP: 0010:[<ffffffff8116589a>]  [<ffffffff8116589a>] kmem_cache_alloc+0x5a/0x140
Jan  6 20:19:54 localhost kernel: [ 1475.179096] RSP: 0018:ffff88006b251d78  EFLAGS: 00010206
Jan  6 20:19:54 localhost kernel: [ 1475.179135] RAX: 0000000000000000 RBX: 00007f062bb91000 RCX: 000000000005b2ed
Jan  6 20:19:54 localhost kernel: [ 1475.179186] RDX: 000000000005b2ec RSI: 0000000000016da0 RDI: ffff88006d408a00
Jan  6 20:19:54 localhost kernel: [ 1475.179236] RBP: ffff88006b251dc8 R08: ffff88006fa96da0 R09: 0000000000000001
Jan  6 20:19:54 localhost kernel: [ 1475.179287] R10: 00000000000000d1 R11: ffff88006b23a8f0 R12: ffff88006d408a00
Jan  6 20:19:54 localhost kernel: [ 1475.179336] R13: 2665c4979a04b7b8 R14: ffffffff811447c5 R15: 00000000000080d0
Jan  6 20:19:54 localhost kernel: [ 1475.179387] FS:  00007f062bb81700(0000) GS:ffff88006fa80000(0000) knlGS:0000000000000000
Jan  6 20:19:54 localhost kernel: [ 1475.179445] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan  6 20:19:54 localhost kernel: [ 1475.179486] CR2: 00007f9b4d79da00 CR3: 0000000059a34000 CR4: 00000000000006e0
Jan  6 20:19:54 localhost kernel: [ 1475.179536] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan  6 20:19:54 localhost kernel: [ 1475.179586] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan  6 20:19:54 localhost kernel: [ 1475.179637] Process login (pid: 1305, threadinfo ffff88006b250000, task ffff880036058000)
Jan  6 20:19:54 localhost kernel: [ 1475.179695] Stack:
Jan  6 20:19:54 localhost kernel: [ 1475.179711]  ffff880036058000 0000000000000041 0000000000000001 ffffffff81188cec
Jan  6 20:19:54 localhost kernel: [ 1475.179777]  0000000000000282 00007f062bb91000 ffff88006822ce00 0000000000000001
Jan  6 20:19:54 localhost kernel: [ 1475.179841]  0000000000001000 0000000000000000 ffff88006b251e88 ffffffff811447c5
Jan  6 20:19:54 localhost kernel: [ 1475.179905] Call Trace:
Jan  6 20:19:54 localhost kernel: [ 1475.179928]  [<ffffffff81188cec>] ? path_openat+0xfc/0x3f0
Jan  6 20:19:54 localhost kernel: [ 1475.179971]  [<ffffffff811447c5>] mmap_region+0x2a5/0x4f0
Jan  6 20:19:54 localhost kernel: [ 1475.180012]  [<ffffffff81144d58>] do_mmap_pgoff+0x348/0x360
Jan  6 20:19:54 localhost kernel: [ 1475.180054]  [<ffffffff81144e36>] sys_mmap_pgoff+0xc6/0x230
Jan  6 20:19:54 localhost kernel: [ 1475.180098]  [<ffffffff81018b12>] sys_mmap+0x22/0x30
Jan  6 20:19:54 localhost kernel: [ 1475.180136]  [<ffffffff816655c2>] system_call_fastpath+0x16/0x1b
Jan  6 20:19:54 localhost kernel: [ 1475.180180] Code: 00 4d 8b 04 24 65 4c 03 04 25 50 da 00 00 49 8b 50 08 4d 8b 28 4d 85 ed 0f 84 d8 00 00 00 49 63 44 24 20 49 8b 34 24 48 8d 4a 01 <49> 8b 5c 05 00 4c 89 e8 65 48 0f c7 0e 0f 94 c0 84 c0 74 c2 4d 
Jan  6 20:19:54 localhost kernel: [ 1475.180503] RIP  [<ffffffff8116589a>] kmem_cache_alloc+0x5a/0x140
Jan  6 20:19:54 localhost kernel: [ 1475.180552]  RSP <ffff88006b251d78>
Jan  6 20:19:54 localhost kernel: [ 1475.180603] ---[ end trace 766ef1ef52f774b9 ]---

If I watch for long enough, I see more general protection faults. So far I have seen them for login, apache2, deluge-web, head and powerbtn.sh.

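I have to hard-reset the machine to get it back into a working state (even powerbtn.sh hits a general protection fault when I press the power button), but before long it does the same thing again.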

I haven't worked out how to reproduce this on demand; it seems to happen at random.

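Just in case, I looked through kern.log and found the first of these errors. They come one after another, starting with zsh, then deluged, apache2, cron, head, console-kit-dae, irqbalance, ... nmbd. Here is the first one, for zsh, followed immediately by a bad page state error: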

Jan  6 20:13:35 localhost kernel: [ 1096.184250] general protection fault: 0000 [#1] SMP 
Jan  6 20:13:35 localhost kernel: [ 1096.186339] CPU 1 
Jan  6 20:13:35 localhost kernel: [ 1096.186355] Modules linked in: btrfs zlib_deflate libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs reiserfs ext2 nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc dm_crypt ppdev ipt_REJECT ipt_LOG ipt_MASQUERADE xt_state iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables joydev sp5100_tco edac_core i2c_piix4 serio_raw k8temp edac_mce_amd snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore parport_pc snd_page_alloc mac_hid shpchp lp parport radeon 8139too ttm drm_kms_helper drm pata_atiixp i2c_algo_bit usbhid hid wmi r8169
Jan  6 20:13:35 localhost kernel: [ 1096.188008] 
Jan  6 20:13:35 localhost kernel: [ 1096.188008] Pid: 2564, comm: zsh Not tainted 3.2.0-35-generic #55-Ubuntu Gigabyte Technology Co., Ltd. GA-MA785GM-US2H/GA-MA785GM-US2H
Jan  6 20:13:35 localhost kernel: [ 1096.188008] RIP: 0010:[<ffffffff8116589a>]  [<ffffffff8116589a>] kmem_cache_alloc+0x5a/0x140
Jan  6 20:13:35 localhost kernel: [ 1096.188008] RSP: 0018:ffff880059877d78  EFLAGS: 00010206
Jan  6 20:13:35 localhost kernel: [ 1096.188008] RAX: 0000000000000000 RBX: 00007f202c59d000 RCX: 000000000005b2ed
Jan  6 20:13:35 localhost kernel: [ 1096.188008] RDX: 000000000005b2ec RSI: 0000000000016da0 RDI: ffff88006d408a00
Jan  6 20:13:35 localhost kernel: [ 1096.188008] RBP: ffff880059877dc8 R08: ffff88006fa96da0 R09: 0000000000000001
Jan  6 20:13:35 localhost kernel: [ 1096.188008] R10: 0000000000100073 R11: ffff880059dbb2c0 R12: ffff88006d408a00
Jan  6 20:13:35 localhost kernel: [ 1096.188008] R13: 2665c4979a04b7b8 R14: ffffffff811447c5 R15: 00000000000080d0
Jan  6 20:13:35 localhost kernel: [ 1096.188008] FS:  00007f202c5ac700(0000) GS:ffff88006fa80000(0000) knlGS:0000000000000000
Jan  6 20:13:35 localhost kernel: [ 1096.188008] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan  6 20:13:35 localhost kernel: [ 1096.188008] CR2: 00000000025991f0 CR3: 0000000059dbc000 CR4: 00000000000006e0
Jan  6 20:13:35 localhost kernel: [ 1096.188008] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan  6 20:13:35 localhost kernel: [ 1096.188008] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan  6 20:13:35 localhost kernel: [ 1096.188008] Process zsh (pid: 2564, threadinfo ffff880059876000, task ffff88006b6b5c00)
Jan  6 20:13:35 localhost kernel: [ 1096.188008] Stack:
Jan  6 20:13:35 localhost kernel: [ 1096.188008]  0000000000000001 0000000000001000 0000000000000001 ffffffff8129e2e0
Jan  6 20:13:35 localhost kernel: [ 1096.188008]  0000000000000001 00007f202c59d000 ffff88006822f480 0000000000000001
Jan  6 20:13:35 localhost kernel: [ 1096.188008]  0000000000001000 0000000000000000 ffff880059877e88 ffffffff811447c5
Jan  6 20:13:35 localhost kernel: [ 1096.188008] Call Trace:
Jan  6 20:13:35 localhost kernel: [ 1096.188008]  [<ffffffff8129e2e0>] ? cap_vm_enough_memory+0x50/0x60
Jan  6 20:13:35 localhost kernel: [ 1096.188008]  [<ffffffff811447c5>] mmap_region+0x2a5/0x4f0
Jan  6 20:13:35 localhost kernel: [ 1096.188008]  [<ffffffff81144d58>] do_mmap_pgoff+0x348/0x360
Jan  6 20:13:35 localhost kernel: [ 1096.188008]  [<ffffffff81144eb1>] sys_mmap_pgoff+0x141/0x230
Jan  6 20:13:35 localhost kernel: [ 1096.188008]  [<ffffffff81018b12>] sys_mmap+0x22/0x30
Jan  6 20:13:35 localhost kernel: [ 1096.188008]  [<ffffffff816655c2>] system_call_fastpath+0x16/0x1b
Jan  6 20:13:35 localhost kernel: [ 1096.188008] Code: 00 4d 8b 04 24 65 4c 03 04 25 50 da 00 00 49 8b 50 08 4d 8b 28 4d 85 ed 0f 84 d8 00 00 00 49 63 44 24 20 49 8b 34 24 48 8d 4a 01 <49> 8b 5c 05 00 4c 89 e8 65 48 0f c7 0e 0f 94 c0 84 c0 74 c2 4d 
Jan  6 20:13:35 localhost kernel: [ 1096.188008] RIP  [<ffffffff8116589a>] kmem_cache_alloc+0x5a/0x140
Jan  6 20:13:35 localhost kernel: [ 1096.188008]  RSP <ffff880059877d78>
Jan  6 20:13:35 localhost kernel: [ 1096.274513] ---[ end trace 766ef1ef52f774ae ]---
Jan  6 20:13:37 localhost kernel: [ 1097.836149] BUG: Bad page state in process swapper/0  pfn:59a33
Jan  6 20:13:37 localhost kernel: [ 1097.838885] page:ffffea0001668cc0 count:0 mapcount:-1 mapping:          (null) index:0xffff880059a33160
Jan  6 20:13:37 localhost kernel: [ 1097.841673] page flags: 0x100000000000000()
Jan  6 20:13:37 localhost kernel: [ 1097.844440] Modules linked in: btrfs zlib_deflate libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs reiserfs ext2 nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc dm_crypt ppdev ipt_REJECT ipt_LOG ipt_MASQUERADE xt_state iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables joydev sp5100_tco edac_core i2c_piix4 serio_raw k8temp edac_mce_amd snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore parport_pc snd_page_alloc mac_hid shpchp lp parport radeon 8139too ttm drm_kms_helper drm pata_atiixp i2c_algo_bit usbhid hid wmi r8169
Jan  6 20:13:37 localhost kernel: [ 1097.856881] Pid: 0, comm: swapper/0 Tainted: G      D      3.2.0-35-generic #55-Ubuntu
Jan  6 20:13:37 localhost kernel: [ 1097.860020] Call Trace:
Jan  6 20:13:37 localhost kernel: [ 1097.863063]  <IRQ>  [<ffffffff8111fe8f>] bad_page.part.61+0x9f/0xf0
Jan  6 20:13:37 localhost kernel: [ 1097.866119]  [<ffffffff8111fef8>] bad_page+0x18/0x30
Jan  6 20:13:37 localhost kernel: [ 1097.869158]  [<ffffffff8112098e>] free_pages_prepare+0x10e/0x120
Jan  6 20:13:37 localhost kernel: [ 1097.872178]  [<ffffffff81120af9>] free_hot_cold_page+0x49/0x1a0
Jan  6 20:13:37 localhost kernel: [ 1097.875183]  [<ffffffff81120c7d>] __free_pages+0x2d/0x40
Jan  6 20:13:37 localhost kernel: [ 1097.878163]  [<ffffffff8159a8fb>] tcp_v4_destroy_sock+0x25b/0x2c0
Jan  6 20:13:37 localhost kernel: [ 1097.881105]  [<ffffffff81582695>] inet_csk_destroy_sock+0x55/0x140
Jan  6 20:13:37 localhost kernel: [ 1097.883970]  [<ffffffff815849b0>] tcp_done+0x50/0x90
Jan  6 20:13:37 localhost kernel: [ 1097.886853]  [<ffffffff81591d92>] tcp_rcv_state_process+0x422/0x5f0
Jan  6 20:13:37 localhost kernel: [ 1097.889724]  [<ffffffff8159a597>] tcp_v4_do_rcv+0xc7/0x1d0
Jan  6 20:13:37 localhost kernel: [ 1097.892513]  [<ffffffff8159c1f1>] tcp_v4_rcv+0x581/0x820
Jan  6 20:13:37 localhost kernel: [ 1097.895301]  [<ffffffff81577b60>] ? ip_rcv_finish+0x370/0x370
Jan  6 20:13:37 localhost kernel: [ 1097.898110]  [<ffffffff81577b60>] ? ip_rcv_finish+0x370/0x370
Jan  6 20:13:37 localhost kernel: [ 1097.900915]  [<ffffffff81577c3d>] ip_local_deliver_finish+0xdd/0x280
Jan  6 20:13:37 localhost kernel: [ 1097.903716]  [<ffffffff81577fa8>] ip_local_deliver+0x88/0x90
Jan  6 20:13:37 localhost kernel: [ 1097.906502]  [<ffffffff815778fd>] ip_rcv_finish+0x10d/0x370
Jan  6 20:13:37 localhost kernel: [ 1097.909279]  [<ffffffff815781e5>] ip_rcv+0x235/0x300
Jan  6 20:13:37 localhost kernel: [ 1097.912067]  [<ffffffff81613dc7>] ? packet_rcv_spkt+0x47/0x190
Jan  6 20:13:37 localhost kernel: [ 1097.914831]  [<ffffffff81543446>] __netif_receive_skb+0x4d6/0x550
Jan  6 20:13:37 localhost kernel: [ 1097.917624]  [<ffffffff81544230>] netif_receive_skb+0x80/0x90
Jan  6 20:13:37 localhost kernel: [ 1097.920415]  [<ffffffff81536474>] ? __netdev_alloc_skb+0x24/0x50
Jan  6 20:13:37 localhost kernel: [ 1097.923124]  [<ffffffffa00d6e90>] rtl8139_rx+0x150/0x2b0 [8139too]
Jan  6 20:13:37 localhost kernel: [ 1097.925754]  [<ffffffffa00d704a>] rtl8139_poll+0x5a/0xd0 [8139too]
Jan  6 20:13:37 localhost kernel: [ 1097.928274]  [<ffffffff81544bd4>] net_rx_action+0x134/0x290
Jan  6 20:13:37 localhost kernel: [ 1097.930698]  [<ffffffff8103df8b>] ? native_safe_halt+0xb/0x10
Jan  6 20:13:37 localhost kernel: [ 1097.933115]  [<ffffffff8106f6e8>] __do_softirq+0xa8/0x210
Jan  6 20:13:37 localhost kernel: [ 1097.935495]  [<ffffffff810967f5>] ? do_timer+0x25/0x30
Jan  6 20:13:37 localhost kernel: [ 1097.937836]  [<ffffffff81035dc2>] ? ack_apic_level+0x72/0x190
Jan  6 20:13:37 localhost kernel: [ 1097.940163]  [<ffffffff8166782c>] call_softirq+0x1c/0x30
Jan  6 20:13:37 localhost kernel: [ 1097.942464]  [<ffffffff81016305>] do_softirq+0x65/0xa0
Jan  6 20:13:37 localhost kernel: [ 1097.944778]  [<ffffffff8106face>] irq_exit+0x8e/0xb0
Jan  6 20:13:37 localhost kernel: [ 1097.947068]  [<ffffffff816680e3>] do_IRQ+0x63/0xe0
Jan  6 20:13:37 localhost kernel: [ 1097.949327]  [<ffffffff8165d46e>] common_interrupt+0x6e/0x6e
Jan  6 20:13:37 localhost kernel: [ 1097.951597]  <EOI>  [<ffffffff8103df8b>] ? native_safe_halt+0xb/0x10
Jan  6 20:13:37 localhost kernel: [ 1097.953891]  [<ffffffff810900a8>] ? hrtimer_start+0x18/0x20
Jan  6 20:13:37 localhost kernel: [ 1097.956171]  [<ffffffff8101c983>] default_idle+0x53/0x1d0
Jan  6 20:13:37 localhost kernel: [ 1097.958426]  [<ffffffff8101cb5d>] amd_e400_idle+0x5d/0x120
Jan  6 20:13:37 localhost kernel: [ 1097.960704]  [<ffffffff81013236>] cpu_idle+0xd6/0x120
Jan  6 20:13:37 localhost kernel: [ 1097.962970]  [<ffffffff816235ee>] rest_init+0x72/0x74
Jan  6 20:13:37 localhost kernel: [ 1097.965195]  [<ffffffff81cfbc03>] start_kernel+0x3b0/0x3bd
Jan  6 20:13:37 localhost kernel: [ 1097.967421]  [<ffffffff81cfb388>] x86_64_start_reservations+0x132/0x136
Jan  6 20:13:37 localhost kernel: [ 1097.969660]  [<ffffffff81cfb140>] ? early_idt_handlers+0x140/0x140
Jan  6 20:13:37 localhost kernel: [ 1097.971888]  [<ffffffff81cfb459>] x86_64_start_kernel+0xcd/0xdc
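
To list which processes faulted and in what order without reading kern.log by hand, a short script can pair each "general protection fault" header with the "Pid: ..., comm: ..." line that follows it. This is only a sketch: the function name faulting_processes and the two regular expressions are my own, written against the log format shown above, and the sample lines are shortened copies of lines from these traces.

```python
import re

# Oops header, e.g. "general protection fault: 0000 [#12] SMP"
FAULT_RE = re.compile(r"general protection fault: \S+ \[#(\d+)\]")
# Process line, e.g. "Pid: 1305, comm: login Tainted: ..."
COMM_RE = re.compile(r"Pid: (\d+), comm: (\S+)")


def faulting_processes(lines):
    """Return (oops_number, pid, comm) for each general protection fault, in log order."""
    results = []
    pending = None  # oops number whose "Pid:" line has not been seen yet
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            pending = int(match.group(1))
            continue
        match = COMM_RE.search(line)
        if match and pending is not None:
            results.append((pending, int(match.group(1)), match.group(2)))
            pending = None
    return results


# Shortened copies of lines from the traces in this post.
sample = [
    "Jan  6 20:13:35 localhost kernel: [ 1096.184250] general protection fault: 0000 [#1] SMP ",
    "Jan  6 20:13:35 localhost kernel: [ 1096.188008] Pid: 2564, comm: zsh Not tainted 3.2.0-35-generic #55-Ubuntu",
    "Jan  6 20:13:37 localhost kernel: [ 1097.856881] Pid: 0, comm: swapper/0 Tainted: G      D      3.2.0-35-generic #55-Ubuntu",
    "Jan  6 20:19:54 localhost kernel: [ 1475.178245] general protection fault: 0000 [#12] SMP ",
    "Jan  6 20:19:54 localhost kernel: [ 1475.178927] Pid: 1305, comm: login Tainted: G    B D      3.2.0-35-generic #55-Ubuntu",
]

for number, pid, comm in faulting_processes(sample):
    print(f"#{number}: pid {pid} ({comm})")
```

The "Pid: 0, comm: swapper/0" line is skipped because it belongs to the bad page state report rather than to a general protection fault. To run it on the real log, pass an open file instead of the sample list, e.g. faulting_processes(open("/var/log/kern.log", errors="replace")).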

What is going on? What should I do?

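Answer 1

It looks like it was indeed a memory problem. Memtest reported errors on one of the four modules, and since I removed that module the machine has not crashed again. Thanks to everyone for the suggestions.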
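
The traces themselves point the same way, at least by my reading. The bytes marked in the Code: line, <49> 8b 5c 05 00, decode to a load from the address r13 + rax, and in both traces RAX is 0 and R13 is 2665c4979a04b7b8. That value is not a valid ("canonical") x86_64 address, and touching a non-canonical address raises a general protection fault rather than a page fault. The same bad value appearing in both traces suggests that every allocation from that slab cache tripped over the same corrupted pointer. The sketch below checks the register values for canonical form; it assumes 48-bit virtual addressing, and is_canonical is my own helper, not a kernel function.

```python
def is_canonical(addr: int) -> bool:
    """Return True if a 64-bit address is canonical, assuming 48-bit virtual addresses.

    Canonical means bits 63..47 are all equal (all 0 or all 1).
    """
    top = addr >> 47  # the 17 bits 63..47
    return top == 0 or top == 0x1FFFF


# Register values copied from the traces in the question.
registers = {
    "R13": 0x2665C4979A04B7B8,  # base register of the faulting load
    "RDI": 0xFFFF88006D408A00,  # kernel pointer, for comparison
    "R08": 0xFFFF88006FA96DA0,  # kernel pointer, for comparison
    "RBX": 0x00007F202C59D000,  # user-space address from the zsh trace
}

for name, value in registers.items():
    verdict = "canonical" if is_canonical(value) else "NOT canonical"
    print(f"{name} = {value:#018x}: {verdict}")
```

Only R13 fails the check. A corrupted pointer like this fits bad RAM, but on its own it does not rule out a kernel or driver bug, which is why the memtest result is what settles it.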

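Answer 2

I would try an older kernel (2.6.x) first, before touching the hardware. If the Ubuntu 10 kernel is still installed, reboot the machine and run the server on the old kernel. Ubuntu 12 ships a 3.x kernel, I believe, while Ubuntu 10 ships 2.6.x.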

If no 2.6 linux-image is available, you can add the lucid repository under /etc/apt/sources.list.d, run "apt-get update" and "aptitude versions linux-image", and install the 2.6 kernel.

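If nothing changes with the old kernel, you can conclude that the problem is the hardware and not the kernel. If things improve, it is probably a driver or kernel bug. As far as I know, running an older kernel won't harm your system. I installed Lucid's 2.6 kernel and use it with Precise to avoid Intel graphics problems, and my machine runs fine.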

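The only annoyance is that I had to hack "grub" (/etc/grub.d/) to make the 2.6 kernel show up in the grub menu, so that I could edit /etc/default/grub and select the 2.6 kernel as the default. I also have to redo the edits in /etc/grub.d/ repeatedly, because every linux-image update reverts the hacked files in /etc/grub.d/*. (Maybe someone else knows how to make an older kernel the default in grub.)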
