我终于能够从“任务挂起 120 秒”错误中获得跟踪。我们的服务器会在负载峰值时随机崩溃。IO 等待时间似乎很正常。内存/交换很好。这些是基于 Xen 的 Ubuntu 10.04 - 64 位。VPS,有些甚至是“裸机”。
以下是踪迹:
Aug 24 11:13:25 admin1 kernel: [2732107.501749] INFO: task java:29952 blocked for more than 120 seconds.
Aug 24 11:13:25 admin1 kernel: [2732107.501954] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 24 11:13:25 admin1 kernel: [2732107.502223] java D 0000000000000000 0 29952 1 0x00000000
Aug 24 11:13:25 admin1 kernel: [2732107.502227] ffff8801d3b5fbb8 0000000000000082 ffffffff81a35800 ffff8801d635be8c
Aug 24 11:13:25 admin1 kernel: [2732107.502230] 0000000000014cc0 ffff88014fcc5b80 0000000000014cc0 ffff88014fcc5f18
Aug 24 11:13:25 admin1 kernel: [2732107.502233] ffff8801d3b5ffd8 ffff88014fcc5f20 ffff8801d3b5e010 0000000000014cc0
Aug 24 11:13:25 admin1 kernel: [2732107.502236] Call Trace:
Aug 24 11:13:25 admin1 kernel: [2732107.502243] [<ffffffff81589bd7>] __mutex_lock_slowpath+0xf7/0x180
Aug 24 11:13:25 admin1 kernel: [2732107.502247] [<ffffffff81589abb>] mutex_lock+0x2b/0x50
Aug 24 11:13:25 admin1 kernel: [2732107.502251] [<ffffffff8122633a>] __log_wait_for_space+0xda/0x1b0
Aug 24 11:13:25 admin1 kernel: [2732107.502255] [<ffffffff814fed2f>] ? inet_recvmsg+0x4f/0x80
Aug 24 11:13:25 admin1 kernel: [2732107.502258] [<ffffffff812220e1>] start_this_handle+0xf1/0x440
Aug 24 11:13:25 admin1 kernel: [2732107.502260] [<ffffffff812225c1>] ? journal_start+0x81/0x100
Aug 24 11:13:25 admin1 kernel: [2732107.502263] [<ffffffff812225f5>] journal_start+0xb5/0x100
Aug 24 11:13:25 admin1 kernel: [2732107.502266] [<ffffffff811d3e71>] ext3_journal_start_sb+0x31/0x60
Aug 24 11:13:25 admin1 kernel: [2732107.502269] [<ffffffff811ca63d>] ext3_dirty_inode+0x3d/0xa0
Aug 24 11:13:25 admin1 kernel: [2732107.502273] [<ffffffff811776ca>] __mark_inode_dirty+0x3a/0x1f0
Aug 24 11:13:25 admin1 kernel: [2732107.502275] [<ffffffff811ca7f5>] ext3_setattr+0x155/0x2c0
Aug 24 11:13:25 admin1 kernel: [2732107.502279] [<ffffffff810097ce>] ? __switch_to+0x26e/0x320
Aug 24 11:13:25 admin1 kernel: [2732107.502283] [<ffffffff8116c2c5>] notify_change+0x175/0x2f0
Aug 24 11:13:25 admin1 kernel: [2732107.502285] [<ffffffff81588781>] ? schedule+0x411/0xa50
Aug 24 11:13:25 admin1 kernel: [2732107.502289] [<ffffffff81153064>] do_truncate+0x64/0xa0
Aug 24 11:13:25 admin1 kernel: [2732107.502291] [<ffffffff81153d7b>] ? generic_file_llseek+0x5b/0x70
Aug 24 11:13:25 admin1 kernel: [2732107.502294] [<ffffffff811532d8>] sys_ftruncate+0xe8/0xf0
Aug 24 11:13:25 admin1 kernel: [2732107.502297] [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b
另一个就在现在,当运行'unzip'时 - 与上面的跟踪时间完全相同;
Aug 24 11:13:25 admin1 kernel: [2732107.503946] unzip D 0000000000000000 0 28123 1 0x00000000
Aug 24 11:13:25 admin1 kernel: [2732107.503951] ffff880031e156e8 0000000000000086 0000000000000000 0000000000000000
Aug 24 11:13:25 admin1 kernel: [2732107.503954] 0000000000014cc0 ffff8801183f5b80 0000000000014cc0 ffff8801183f5f18
Aug 24 11:13:25 admin1 kernel: [2732107.503957] ffff880031e15fd8 ffff8801183f5f20 ffff880031e14010 0000000000014cc0
Aug 24 11:13:25 admin1 kernel: [2732107.503960] Call Trace:
Aug 24 11:13:25 admin1 kernel: [2732107.503964] [<ffffffff81227745>] log_wait_commit+0xc5/0x150
Aug 24 11:13:25 admin1 kernel: [2732107.503967] [<ffffffff81081150>] ? autoremove_wake_function+0x0/0x40
Aug 24 11:13:25 admin1 kernel: [2732107.503969] [<ffffffff8158b08e>] ? _raw_spin_lock+0xe/0x20
Aug 24 11:13:25 admin1 kernel: [2732107.503972] [<ffffffff81225ddb>] log_do_checkpoint+0xab/0x530
Aug 24 11:13:25 admin1 kernel: [2732107.503976] [<ffffffff81036bd6>] ? __ticket_spin_lock+0x16/0x20
Aug 24 11:13:25 admin1 kernel: [2732107.503979] [<ffffffff81226306>] __log_wait_for_space+0xa6/0x1b0
Aug 24 11:13:25 admin1 kernel: [2732107.503982] [<ffffffff81070974>] ? mod_timer+0x174/0x2d0
Aug 24 11:13:25 admin1 kernel: [2732107.503985] [<ffffffff812220e1>] start_this_handle+0xf1/0x440
Aug 24 11:13:25 admin1 kernel: [2732107.503987] [<ffffffff812225c1>] ? journal_start+0x81/0x100
Aug 24 11:13:25 admin1 kernel: [2732107.503990] [<ffffffff812225f5>] journal_start+0xb5/0x100
Aug 24 11:13:25 admin1 kernel: [2732107.503993] [<ffffffff811d3e71>] ext3_journal_start_sb+0x31/0x60
Aug 24 11:13:25 admin1 kernel: [2732107.503995] [<ffffffff811ccbe2>] ext3_write_begin+0xa2/0x230
Aug 24 11:13:25 admin1 kernel: [2732107.503998] [<ffffffff8110029e>] generic_file_buffered_write+0xfe/0x250
Aug 24 11:13:25 admin1 kernel: [2732107.504001] [<ffffffff811dc65f>] ? ext3_xattr_security_get+0x2f/0x40
Aug 24 11:13:25 admin1 kernel: [2732107.504004] [<ffffffff81102df0>] __generic_file_aio_write+0x240/0x470
Aug 24 11:13:25 admin1 kernel: [2732107.504006] [<ffffffff81103085>] generic_file_aio_write+0x65/0xd0
Aug 24 11:13:25 admin1 kernel: [2732107.504009] [<ffffffff8115370a>] do_sync_write+0xda/0x120
Aug 24 11:13:25 admin1 kernel: [2732107.504012] [<ffffffff812941a8>] ? apparmor_file_permission+0x18/0x20
Aug 24 11:13:25 admin1 kernel: [2732107.504015] [<ffffffff8126366f>] ? security_file_permission+0x1f/0x80
Aug 24 11:13:25 admin1 kernel: [2732107.504017] [<ffffffff811539f8>] vfs_write+0xc8/0x190
Aug 24 11:13:25 admin1 kernel: [2732107.504020] [<ffffffff81154251>] sys_write+0x51/0x90
Aug 24 11:13:25 admin1 kernel: [2732107.504023] [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b
有人可以帮忙吗?
这是文件系统的死锁吗?
蒂娅。
答案1
这些堆栈跟踪都表明当时正在尝试文件系统写入操作,因此请检查存储相关问题。如果您有网络挂载的文件系统,请检查服务器是否可以访问且正常运行。如果没有,请检查您的磁盘驱动器。在dmesg
这种情况下,您应该会看到消息,然后使用智能工具检查驱动器的状态。