Ubuntu 内核跟踪 - 解决“任务挂起 120 秒”随机崩溃问题

Ubuntu 内核跟踪 - 解决“任务挂起 120 秒”随机崩溃问题

我终于能够从“任务挂起 120 秒”错误中获得跟踪。我们的服务器会在负载峰值时随机崩溃。IO 等待时间似乎很正常。内存/交换很好。这些是基于 Xen 的 Ubuntu 10.04 - 64 位。VPS,有些甚至是“裸机”。

以下是踪迹:

Aug 24 11:13:25 admin1 kernel: [2732107.501749] INFO: task java:29952 blocked for more than 120 seconds.
Aug 24 11:13:25 admin1 kernel: [2732107.501954] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 24 11:13:25 admin1 kernel: [2732107.502223] java          D 0000000000000000     0 29952      1 0x00000000
Aug 24 11:13:25 admin1 kernel: [2732107.502227]  ffff8801d3b5fbb8 0000000000000082 ffffffff81a35800 ffff8801d635be8c
Aug 24 11:13:25 admin1 kernel: [2732107.502230]  0000000000014cc0 ffff88014fcc5b80 0000000000014cc0 ffff88014fcc5f18
Aug 24 11:13:25 admin1 kernel: [2732107.502233]  ffff8801d3b5ffd8 ffff88014fcc5f20 ffff8801d3b5e010 0000000000014cc0
Aug 24 11:13:25 admin1 kernel: [2732107.502236] Call Trace:
Aug 24 11:13:25 admin1 kernel: [2732107.502243]  [<ffffffff81589bd7>] __mutex_lock_slowpath+0xf7/0x180
Aug 24 11:13:25 admin1 kernel: [2732107.502247]  [<ffffffff81589abb>] mutex_lock+0x2b/0x50
Aug 24 11:13:25 admin1 kernel: [2732107.502251]  [<ffffffff8122633a>] __log_wait_for_space+0xda/0x1b0
Aug 24 11:13:25 admin1 kernel: [2732107.502255]  [<ffffffff814fed2f>] ? inet_recvmsg+0x4f/0x80
Aug 24 11:13:25 admin1 kernel: [2732107.502258]  [<ffffffff812220e1>] start_this_handle+0xf1/0x440
Aug 24 11:13:25 admin1 kernel: [2732107.502260]  [<ffffffff812225c1>] ? journal_start+0x81/0x100
Aug 24 11:13:25 admin1 kernel: [2732107.502263]  [<ffffffff812225f5>] journal_start+0xb5/0x100
Aug 24 11:13:25 admin1 kernel: [2732107.502266]  [<ffffffff811d3e71>] ext3_journal_start_sb+0x31/0x60
Aug 24 11:13:25 admin1 kernel: [2732107.502269]  [<ffffffff811ca63d>] ext3_dirty_inode+0x3d/0xa0
Aug 24 11:13:25 admin1 kernel: [2732107.502273]  [<ffffffff811776ca>] __mark_inode_dirty+0x3a/0x1f0
Aug 24 11:13:25 admin1 kernel: [2732107.502275]  [<ffffffff811ca7f5>] ext3_setattr+0x155/0x2c0
Aug 24 11:13:25 admin1 kernel: [2732107.502279]  [<ffffffff810097ce>] ? __switch_to+0x26e/0x320
Aug 24 11:13:25 admin1 kernel: [2732107.502283]  [<ffffffff8116c2c5>] notify_change+0x175/0x2f0
Aug 24 11:13:25 admin1 kernel: [2732107.502285]  [<ffffffff81588781>] ? schedule+0x411/0xa50
Aug 24 11:13:25 admin1 kernel: [2732107.502289]  [<ffffffff81153064>] do_truncate+0x64/0xa0
Aug 24 11:13:25 admin1 kernel: [2732107.502291]  [<ffffffff81153d7b>] ? generic_file_llseek+0x5b/0x70
Aug 24 11:13:25 admin1 kernel: [2732107.502294]  [<ffffffff811532d8>] sys_ftruncate+0xe8/0xf0
Aug 24 11:13:25 admin1 kernel: [2732107.502297]  [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b

另一个就在现在,当运行'unzip'时 - 与上面的跟踪时间完全相同;

Aug 24 11:13:25 admin1 kernel: [2732107.503946] unzip         D 0000000000000000     0 28123      1 0x00000000
Aug 24 11:13:25 admin1 kernel: [2732107.503951]  ffff880031e156e8 0000000000000086 0000000000000000 0000000000000000
Aug 24 11:13:25 admin1 kernel: [2732107.503954]  0000000000014cc0 ffff8801183f5b80 0000000000014cc0 ffff8801183f5f18
Aug 24 11:13:25 admin1 kernel: [2732107.503957]  ffff880031e15fd8 ffff8801183f5f20 ffff880031e14010 0000000000014cc0
Aug 24 11:13:25 admin1 kernel: [2732107.503960] Call Trace:
Aug 24 11:13:25 admin1 kernel: [2732107.503964]  [<ffffffff81227745>] log_wait_commit+0xc5/0x150
Aug 24 11:13:25 admin1 kernel: [2732107.503967]  [<ffffffff81081150>] ? autoremove_wake_function+0x0/0x40
Aug 24 11:13:25 admin1 kernel: [2732107.503969]  [<ffffffff8158b08e>] ? _raw_spin_lock+0xe/0x20
Aug 24 11:13:25 admin1 kernel: [2732107.503972]  [<ffffffff81225ddb>] log_do_checkpoint+0xab/0x530
Aug 24 11:13:25 admin1 kernel: [2732107.503976]  [<ffffffff81036bd6>] ? __ticket_spin_lock+0x16/0x20
Aug 24 11:13:25 admin1 kernel: [2732107.503979]  [<ffffffff81226306>] __log_wait_for_space+0xa6/0x1b0
Aug 24 11:13:25 admin1 kernel: [2732107.503982]  [<ffffffff81070974>] ? mod_timer+0x174/0x2d0
Aug 24 11:13:25 admin1 kernel: [2732107.503985]  [<ffffffff812220e1>] start_this_handle+0xf1/0x440
Aug 24 11:13:25 admin1 kernel: [2732107.503987]  [<ffffffff812225c1>] ? journal_start+0x81/0x100
Aug 24 11:13:25 admin1 kernel: [2732107.503990]  [<ffffffff812225f5>] journal_start+0xb5/0x100
Aug 24 11:13:25 admin1 kernel: [2732107.503993]  [<ffffffff811d3e71>] ext3_journal_start_sb+0x31/0x60
Aug 24 11:13:25 admin1 kernel: [2732107.503995]  [<ffffffff811ccbe2>] ext3_write_begin+0xa2/0x230
Aug 24 11:13:25 admin1 kernel: [2732107.503998]  [<ffffffff8110029e>] generic_file_buffered_write+0xfe/0x250
Aug 24 11:13:25 admin1 kernel: [2732107.504001]  [<ffffffff811dc65f>] ? ext3_xattr_security_get+0x2f/0x40
Aug 24 11:13:25 admin1 kernel: [2732107.504004]  [<ffffffff81102df0>] __generic_file_aio_write+0x240/0x470
Aug 24 11:13:25 admin1 kernel: [2732107.504006]  [<ffffffff81103085>] generic_file_aio_write+0x65/0xd0
Aug 24 11:13:25 admin1 kernel: [2732107.504009]  [<ffffffff8115370a>] do_sync_write+0xda/0x120
Aug 24 11:13:25 admin1 kernel: [2732107.504012]  [<ffffffff812941a8>] ? apparmor_file_permission+0x18/0x20
Aug 24 11:13:25 admin1 kernel: [2732107.504015]  [<ffffffff8126366f>] ? security_file_permission+0x1f/0x80
Aug 24 11:13:25 admin1 kernel: [2732107.504017]  [<ffffffff811539f8>] vfs_write+0xc8/0x190
Aug 24 11:13:25 admin1 kernel: [2732107.504020]  [<ffffffff81154251>] sys_write+0x51/0x90
Aug 24 11:13:25 admin1 kernel: [2732107.504023]  [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b

有人可以帮忙吗?

这是文件系统的死锁吗?

蒂娅。

答案1

这些堆栈跟踪都表明当时正在尝试文件系统写入操作,因此请检查存储相关问题。如果您有网络挂载的文件系统,请检查服务器是否可以访问且正常运行。如果没有,请检查您的磁盘驱动器。在dmesg这种情况下,您应该会看到消息,然后使用智能工具检查驱动器的状态。

相关内容