XFS in-memory corruption after hibernation

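I regularly get the following errors on an XFS filesystem that lives on a software RAID-1 array which was later converted to a 3-disk RAID-5. The errors occur only after hibernation, usually either immediately or a few minutes after resume. dmesg reports (full dmesg output here: http://bpaste.net/show/130895/):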

[155389.814032] PM: restore of devices complete after 1700.425 msecs
[155389.814783] Restarting tasks ... done.
[155390.161993] r8168: enp2s0: link up
[155392.181215] r8168: enp2s0: link up
[155398.859967] sd 7:0:0:0: [sdh] No Caching mode page present
[155398.859972] sd 7:0:0:0: [sdh] Assuming drive cache: write through
[155398.876927] sd 7:0:0:0: [sdh] No Caching mode page present
[155398.876932] sd 7:0:0:0: [sdh] Assuming drive cache: write through
[155398.877945]  sdh:
[155690.215471] XFS: Internal error XFS_WANT_CORRUPTED_RETURN at line 342 of file fs/xfs/xfs_alloc.c.  Caller 0xffffffff812049d1

[155690.215478] CPU: 5 PID: 17532 Comm: kworker/5:0 Tainted: P           O 3.10.7-gentoo #1
[155690.215481] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 R2.0, BIOS 0601 07/17/2012
[155690.215490] Workqueue: xfsalloc xfs_bmapi_allocate_worker
[155690.215493]  ffffffff81565b8a 0000000000000071 ffffffff81201c57 ffff880418328000
[155690.215498]  ffff880418328270 ffff8803124ee460 0000000081206839 0000000000000800
[155690.215502]  ffff8803990ffd18 ffff880418328000 0000000000000800 0000000000000800
[155690.215506] Call Trace:
[155690.215514]  [<ffffffff81565b8a>] ? dump_stack+0xd/0x17
[155690.215520]  [<ffffffff81201c57>] ? xfs_alloc_fixup_trees+0x1e7/0x370
[155690.215524]  [<ffffffff812049d1>] ? xfs_alloc_ag_vextent_near+0xa21/0xd90
[155690.215528]  [<ffffffff81204dfd>] ? xfs_alloc_ag_vextent+0xbd/0xf0
[155690.215532]  [<ffffffff81205aa8>] ? xfs_alloc_vextent+0x478/0x800
[155690.215536]  [<ffffffff812139d6>] ? xfs_bmap_btalloc_nullfb+0x316/0x350
[155690.215541]  [<ffffffff8121721a>] ? xfs_bmap_btalloc+0x31a/0x770
[155690.215546]  [<ffffffff810459f8>] ? internal_add_timer+0x18/0x50
[155690.215551]  [<ffffffff810459f8>] ? internal_add_timer+0x18/0x50
[155690.215556]  [<ffffffff81217c4d>] ? __xfs_bmapi_allocate+0xcd/0x2e0
[155690.215560]  [<ffffffff81217e9c>] ? xfs_bmapi_allocate_worker+0x3c/0x70
[155690.215566]  [<ffffffff810535d0>] ? process_one_work+0x150/0x480
[155690.215570]  [<ffffffff81053f3a>] ? manage_workers.isra.26+0x1aa/0x2b0
[155690.215575]  [<ffffffff81054154>] ? worker_thread+0x114/0x370
[155690.215579]  [<ffffffff81054040>] ? manage_workers.isra.26+0x2b0/0x2b0
[155690.215584]  [<ffffffff8105a163>] ? kthread+0xb3/0xc0
[155690.215588]  [<ffffffff81060000>] ? async_run_entry_fn+0xf0/0x120
[155690.215593]  [<ffffffff8105a0b0>] ? kthread_freezable_should_stop+0x60/0x60
[155690.215598]  [<ffffffff815708ec>] ? ret_from_fork+0x7c/0xb0
[155690.215603]  [<ffffffff8105a0b0>] ? kthread_freezable_should_stop+0x60/0x60
[155690.215619] XFS (md1): page discard on page ffffea000d1df580, inode 0x22057716, offset 8323072.
[155720.362810] XFS: Internal error XFS_WANT_CORRUPTED_RETURN at line 342 of file fs/xfs/xfs_alloc.c.  Caller 0xffffffff812049d1

<...> (a big bunch of similar errors skipped)

[156100.313075] CPU: 4 PID: 27035 Comm: kworker/4:2 Tainted: P           O 3.10.7-gentoo #1
[156100.313078] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 R2.0, BIOS 0601 07/17/2012
[156100.313099] Workqueue: xfsalloc xfs_bmapi_allocate_worker
[156100.313103]  ffffffff81565b8a 0000000000000071 ffffffff81201c57 ffff88041a811d00
[156100.313107]  ffff88041a811dd0 0000000000000001 000000007f95bfd8 0000000000000002
[156100.313111]  ffff88017f95bd18 ffff88041a811d00 0000000000000001 0000000000000001
[156100.313115] Call Trace:
[156100.313123]  [<ffffffff81565b8a>] ? dump_stack+0xd/0x17
[156100.313129]  [<ffffffff81201c57>] ? xfs_alloc_fixup_trees+0x1e7/0x370
[156100.313133]  [<ffffffff8120491a>] ? xfs_alloc_ag_vextent_near+0x96a/0xd90
[156100.313138]  [<ffffffff81204dfd>] ? xfs_alloc_ag_vextent+0xbd/0xf0
[156100.313141]  [<ffffffff81205aa8>] ? xfs_alloc_vextent+0x478/0x800
[156100.313146]  [<ffffffff812139d6>] ? xfs_bmap_btalloc_nullfb+0x316/0x350
[156100.313150]  [<ffffffff8121721a>] ? xfs_bmap_btalloc+0x31a/0x770
[156100.313156]  [<ffffffff810459f8>] ? internal_add_timer+0x18/0x50
[156100.313161]  [<ffffffff810459f8>] ? internal_add_timer+0x18/0x50
[156100.313165]  [<ffffffff81217c4d>] ? __xfs_bmapi_allocate+0xcd/0x2e0
[156100.313170]  [<ffffffff81217e9c>] ? xfs_bmapi_allocate_worker+0x3c/0x70
[156100.313176]  [<ffffffff810535d0>] ? process_one_work+0x150/0x480
[156100.313186]  [<ffffffff81054154>] ? worker_thread+0x114/0x370
[156100.313208]  [<ffffffff81054040>] ? manage_workers.isra.26+0x2b0/0x2b0
[156100.313214]  [<ffffffff8105a163>] ? kthread+0xb3/0xc0
[156100.313228]  [<ffffffff81060000>] ? async_run_entry_fn+0xf0/0x120
[156100.313239]  [<ffffffff8105a0b0>] ? kthread_freezable_should_stop+0x60/0x60
[156100.313249]  [<ffffffff815708ec>] ? ret_from_fork+0x7c/0xb0
[156100.313258]  [<ffffffff8105a0b0>] ? kthread_freezable_should_stop+0x60/0x60
[156100.313275] XFS (md1): page discard on page ffffea0008f25340, inode 0x22057716, offset 8499200.
[156155.266439] XFS: Internal error XFS_WANT_CORRUPTED_GOTO at line 1617 of file fs/xfs/xfs_alloc.c.  Caller 0xffffffff81205f1c

[156155.266443] CPU: 4 PID: 32209 Comm: QThread Tainted: P           O 3.10.7-gentoo #1
[156155.266444] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 R2.0, BIOS 0601 07/17/2012
[156155.266446]  ffffffff81565b8a 0000000000000070 ffffffff81202e8c ffff88041b3c3980
[156155.266448]  ffff88041821ee40 0000000000000000 0000000000000003 ffff88041b3c3980
[156155.266449]  ffff880417bf6800 0000000000000000 ffff8801eac77c5c 0000000800000000
[156155.266451] Call Trace:
[156155.266456]  [<ffffffff81565b8a>] ? dump_stack+0xd/0x17
[156155.266460]  [<ffffffff81202e8c>] ? xfs_free_ag_extent+0x53c/0x850
[156155.266461]  [<ffffffff81205f1c>] ? xfs_free_extent+0xec/0x130
[156155.266463]  [<ffffffff8120128e>] ? kmem_zone_alloc+0x5e/0xe0
[156155.266465]  [<ffffffff8121939a>] ? xfs_bmap_finish+0x16a/0x1b0
[156155.266467]  [<ffffffff812396b3>] ? xfs_itruncate_extents+0x103/0x320
[156155.266469]  [<ffffffff811ff4ce>] ? xfs_inactive+0x32e/0x450
[156155.266470]  [<ffffffff811fcb8b>] ? xfs_fs_evict_inode+0x4b/0x130
[156155.266473]  [<ffffffff8112ca87>] ? evict+0xa7/0x1b0
[156155.266476]  [<ffffffff8112143c>] ? do_unlinkat+0x19c/0x1f0
[156155.266477]  [<ffffffff81118f53>] ? SyS_newstat+0x23/0x30
[156155.266480]  [<ffffffff81570992>] ? system_call_fastpath+0x16/0x1b
[156155.266483] XFS (md1): xfs_do_force_shutdown(0x8) called from line 916 of file fs/xfs/xfs_bmap.c.  Return address = 0xffffffff812193d3
[156155.445552] XFS (md1): Corruption of in-memory data detected.  Shutting down filesystem
[156155.445557] XFS (md1): Please umount the filesystem and rectify the problem(s)
[156160.004902] XFS (md1): xfs_log_force: error 5 returned.
[156190.132832] XFS (md1): xfs_log_force: error 5 returned.
[156220.260719] XFS (md1): xfs_log_force: error 5 returned.
[156250.388550] XFS (md1): xfs_log_force: error 5 returned.
[156280.516400] XFS (md1): xfs_log_force: error 5 returned.
[156310.644246] XFS (md1): xfs_log_force: error 5 returned.
[156340.772019] XFS (md1): xfs_log_force: error 5 returned.
[156370.899941] XFS (md1): xfs_log_force: error 5 returned.
[156401.027736] XFS (md1): xfs_log_force: error 5 returned.
[156431.155576] XFS (md1): xfs_log_force: error 5 returned.
[156461.283434] XFS (md1): xfs_log_force: error 5 returned.
[156491.411366] XFS (md1): xfs_log_force: error 5 returned.
[156521.539215] XFS (md1): xfs_log_force: error 5 returned.
[156551.666963] XFS (md1): xfs_log_force: error 5 returned.
[156581.795447] XFS (md1): xfs_log_force: error 5 returned.
[156611.922687] XFS (md1): xfs_log_force: error 5 returned.
[156642.050630] XFS (md1): xfs_log_force: error 5 returned.
[156672.178470] XFS (md1): xfs_log_force: error 5 returned.
[156702.306332] XFS (md1): xfs_log_force: error 5 returned.
[156732.434176] XFS (md1): xfs_log_force: error 5 returned.
[156762.561988] XFS (md1): xfs_log_force: error 5 returned.

The kernel version is 3.10.7; I saw the same error on 3.8.13. Note that md1 is not the only RAID device holding an XFS filesystem: I also keep / on a RAID1 (SSD+HDD).
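To put a number on "immediately or a few minutes after", the delay between resume and the first XFS error can be read off the dmesg timestamps. The following is only a minimal sketch: the function name first_xfs_error_delay is made up for illustration, and it assumes that the "PM: restore of devices complete" line marks the end of resume and that the input is raw dmesg text with bracketed timestamps, as in the log above.

```python
import re

# Matches a raw dmesg line: "[seconds.fraction] message".
TS = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def first_xfs_error_delay(dmesg_text):
    """Return the seconds between the most recent resume marker and the
    first XFS internal error that follows it, or None if no such pair
    is found. (Hypothetical helper, not part of any XFS tooling.)"""
    resume_ts = None
    for line in dmesg_text.splitlines():
        m = TS.match(line)
        if not m:
            continue  # skip blank lines and lines without a timestamp
        ts, msg = float(m.group(1)), m.group(2)
        if msg.startswith("PM: restore of devices complete"):
            # Assumption: this message marks the end of a resume.
            resume_ts = ts
        elif resume_ts is not None and msg.startswith("XFS: Internal error"):
            return ts - resume_ts
    return None
```

Calling it on the text of the log above (for example, the saved output of dmesg) gives a delay of roughly 300 seconds for this particular resume.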

Answer 1

If you have an i915 card, try updating the kernel. There are several reports of KMS on these cards causing memory corruption on hibernate/resume.

You can test with i915.modeset=0 and see whether the problem goes away. If it does, it is probably the same issue, and it should already be fixed in newer kernels.
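Before and after rebooting with that parameter, it can help to confirm whether the i915 driver is in use at all and whether the parameter actually reached the kernel. This is a rough sketch with made-up helper names. It assumes the driver is built as a module (a built-in driver does not appear in /proc/modules) and it only looks for the exact token i915.modeset=0 on the kernel command line.

```python
def i915_loaded(proc_modules_text):
    """True if a module named i915 is listed in /proc/modules content.
    The module name is the first whitespace-separated field of each line."""
    return any(
        line.split()[:1] == ["i915"]
        for line in proc_modules_text.splitlines()
    )


def modeset_disabled(cmdline_text):
    """True if the kernel command line contains the token i915.modeset=0."""
    return "i915.modeset=0" in cmdline_text.split()


def read_text(path):
    """Small helper to read a /proc file as text."""
    with open(path) as f:
        return f.read()
```

On a running system, the inputs would come from read_text("/proc/modules") and read_text("/proc/cmdline").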

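If you don't have i915, I don't know what the problem might be. Try playing with the BIOS/power settings, hibernate without X running to test for corruption related to another graphics card, run memtest86 to check for actual memory problems, and try a more recent or an older kernel. If the problem is reproducible, consider opening a distribution/kernel bug.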

Answer 2

I saw similar call traces on 3.16.5 after resuming from hibernation. I have since switched to ext4 and have not seen any problems.

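My setup used an XFS rootfs and a swap file (not a swap partition). Suspend and hibernate work fine, and resume works too, apart from some call traces after hibernation.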

The system stays semi-usable until moderate file activity starts. The litmus test is to run:

# tail -f /var/log/messages

and then hibernate. After resume the system is unresponsive. With ext4 I see no call traces and the system looks normal.

I suspected a problem with XFS when I saw this:

Dec 20 23:57:58 localhost kernel: [ 9053.841446] general protection fault: 0000 [#2] SMP 
Dec 20 23:57:58 localhost kernel: [ 9053.841514] Modules linked in: vboxnetadp(O) vboxnetflt(O) vboxdrv(O) x86_pkg_temp_thermal
Dec 20 23:57:58 localhost kernel: [ 9053.841618] CPU: 1 PID: 13937 Comm: mozStorage #10 Tainted: G      D    O  3.16.5-gentoo #22
Dec 20 23:57:58 localhost kernel: [ 9053.841702] Hardware name: ASUSTeK COMPUTER INC. G56JK/G56JK, BIOS G56JK.201 05/13/2014
Dec 20 23:57:58 localhost kernel: [ 9053.841782] task: ffff8800c5658880 ti: ffff88011262c000 task.ti: ffff88011262c000
Dec 20 23:57:58 localhost kernel: [ 9053.841854] RIP: 0010:[<ffffffff8113ae68>]  [<ffffffff8113ae68>] kmem_cache_alloc+0x58/0x130
Dec 20 23:57:58 localhost kernel: [ 9053.841947] RSP: 0018:ffff88011262fb60  EFLAGS: 00010282
Dec 20 23:57:58 localhost kernel: [ 9053.842000] RAX: 0000000000000000 RBX: ffffea00021c47c0 RCX: 0000000000180d5c
Dec 20 23:57:58 localhost kernel: [ 9053.842069] RDX: 0000000000180d5b RSI: 0000000000008050 RDI: ffff88012a44ca00
Dec 20 23:57:58 localhost kernel: [ 9053.842138] RBP: ffff88011262fb90 R08: 0000000000016300 R09: ffffea00021c4800
Dec 20 23:57:58 localhost kernel: [ 9053.842208] R10: 0000000000000a95 R11: 0000000000000000 R12: 00f7000000f60000
Dec 20 23:57:58 localhost kernel: [ 9053.842277] R13: 0000000000008050 R14: ffff88012a44ca00 R15: ffffffff81171e4c
Dec 20 23:57:58 localhost kernel: [ 9053.842348] FS:  00007f083e0fc700(0000) GS:ffff88012ee40000(0000) knlGS:0000000000000000
Dec 20 23:57:58 localhost kernel: [ 9053.842427] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 20 23:57:58 localhost kernel: [ 9053.842484] CR2: 00007f08403780e3 CR3: 000000011dc39000 CR4: 00000000001407e0
Dec 20 23:57:58 localhost kernel: [ 9053.842553] Stack:
Dec 20 23:57:58 localhost kernel: [ 9053.842576]  0000000000000483 ffffea00021c47c0 0000000000000000 0000000000001000
Dec 20 23:57:58 localhost kernel: [ 9053.842661]  ffffea00021c47c0 0000000000000000 ffff88011262fba8 ffffffff81171e4c
Dec 20 23:57:58 localhost kernel: [ 9053.842745]  ffff88011262fbd8 ffff88011262fbe8 ffffffff81172102 0000000100000000
Dec 20 23:57:58 localhost kernel: [ 9053.842829] Call Trace:
Dec 20 23:57:58 localhost kernel: [ 9053.842867]  [<ffffffff81171e4c>] alloc_buffer_head+0x1c/0x60
Dec 20 23:57:58 localhost kernel: [ 9053.842926]  [<ffffffff81172102>] alloc_page_buffers+0x32/0xb0
Dec 20 23:57:58 localhost kernel: [ 9053.842987]  [<ffffffff81173179>] create_empty_buffers+0x19/0xc0
Dec 20 23:57:58 localhost kernel: [ 9053.843048]  [<ffffffff81173267>] create_page_buffers+0x47/0x50
Dec 20 23:57:58 localhost kernel: [ 9053.843108]  [<ffffffff811742c8>] __block_write_begin+0x68/0x430
Dec 20 23:57:58 localhost kernel: [ 9053.846472]  [<ffffffff811009d9>] ? lru_cache_add+0x9/0x10
Dec 20 23:57:58 localhost kernel: [ 9053.849840]  [<ffffffff810f39f8>] ? add_to_page_cache_lru+0x48/0x70
Dec 20 23:57:58 localhost kernel: [ 9053.852926]  [<ffffffff811eeef0>] ? __xfs_get_blocks+0x4f0/0x4f0
Dec 20 23:57:58 localhost kernel: [ 9053.855809]  [<ffffffff810f3ca5>] ? pagecache_get_page+0x95/0x1e0
Dec 20 23:57:58 localhost kernel: [ 9053.858662]  [<ffffffff811edbec>] xfs_vm_write_begin+0x4c/0xe0
Dec 20 23:57:58 localhost kernel: [ 9053.861460]  [<ffffffff8114cb61>] ? terminate_walk+0x41/0x50
Dec 20 23:57:58 localhost kernel: [ 9053.863974]  [<ffffffff810f351d>] generic_perform_write+0xbd/0x1b0
Dec 20 23:57:58 localhost kernel: [ 9053.866483]  [<ffffffff811fa3e0>] xfs_file_buffered_aio_write.isra.9+0x100/0x1b0
Dec 20 23:57:58 localhost kernel: [ 9053.869005]  [<ffffffff811fa50e>] xfs_file_write_iter+0x7e/0x120
Dec 20 23:57:58 localhost kernel: [ 9053.871518]  [<ffffffff81142bcc>] new_sync_write+0x7c/0xb0
Dec 20 23:57:58 localhost kernel: [ 9053.874024]  [<ffffffff81143322>] vfs_write+0xb2/0x1f0
Dec 20 23:57:58 localhost kernel: [ 9053.876520]  [<ffffffff81143e71>] SyS_write+0x41/0xb0
Dec 20 23:57:58 localhost kernel: [ 9053.878980]  [<ffffffff81142e43>] ? SyS_lseek+0x43/0xb0
Dec 20 23:57:58 localhost kernel: [ 9053.881405]  [<ffffffff81783e92>] system_call_fastpath+0x16/0x1b
Dec 20 23:57:58 localhost kernel: [ 9053.883610] Code: 8b 06 65 4c 03 04 25 a8 cd 00 00 49 8b 50 08 4d 8b 20 49 8b 40 10 4d 85 e4 74 68 48 85 c0 74 63 49 63 46 20 48 8d 4a 01 4d 8b 06 <49> 8b 1c 04 4c 89 e0 65 49 0f c7 08 0f 94 c0 84 c0 74 c1 49 63 
Dec 20 23:57:58 localhost kernel: [ 9053.886116] RIP  [<ffffffff8113ae68>] kmem_cache_alloc+0x58/0x130
Dec 20 23:57:58 localhost kernel: [ 9053.888473]  RSP <ffff88011262fb60>
Dec 20 23:57:58 localhost kernel: [ 9053.907290] ---[ end trace a68ba9204a0e3b64 ]---

The following, on the other hand, is more common:

Dec 21 13:38:12 localhost kernel: [ 3276.619889] general protection fault: 0000 [#1] SMP 
Dec 21 13:38:12 localhost kernel: [ 3276.619914] Modules linked in: vboxnetadp(O) vboxnetflt(O) vboxdrv(O) x86_pkg_temp_thermal
Dec 21 13:38:12 localhost kernel: [ 3276.619948] CPU: 6 PID: 10285 Comm: udevadm Tainted: G           O  3.16.5-gentoo #22
Dec 21 13:38:12 localhost kernel: [ 3276.619973] Hardware name: ASUSTeK COMPUTER INC. G56JK/G56JK, BIOS G56JK.201 05/13/2014
Dec 21 13:38:12 localhost kernel: [ 3276.619999] task: ffff8801292e2a80 ti: ffff8800c3754000 task.ti: ffff8800c3754000
Dec 21 13:38:12 localhost kernel: [ 3276.620023] RIP: 0010:[<ffffffff8115a6d8>]  [<ffffffff8115a6d8>] __d_lookup_rcu+0x78/0x160
Dec 21 13:38:12 localhost kernel: [ 3276.620054] RSP: 0018:ffff8800c3757c68  EFLAGS: 00010206
Dec 21 13:38:12 localhost kernel: [ 3276.620071] RAX: 0000000000000004 RBX: 006b0000006a0000 RCX: 000000000000000d
Dec 21 13:38:12 localhost kernel: [ 3276.620094] RDX: 0000000000690000 RSI: ffff8800c3757e60 RDI: ffff88012a0be540
Dec 21 13:38:12 localhost kernel: [ 3276.620116] RBP: ffff8800c3757ca8 R08: ffff8800c3757dec R09: ffff8800c3757ccc
Dec 21 13:38:12 localhost kernel: [ 3276.620138] R10: ffff880129a0202f R11: 0000000000000004 R12: ffff88012a0be540
Dec 21 13:38:12 localhost kernel: [ 3276.620160] R13: 006b00000069fff8 R14: 00000004d16f3f17 R15: ffff8800c3757e60
Dec 21 13:38:12 localhost kernel: [ 3276.620183] FS:  00007f50c54097c0(0000) GS:ffff88012ef80000(0000) knlGS:0000000000000000
Dec 21 13:38:12 localhost kernel: [ 3276.620208] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 21 13:38:12 localhost kernel: [ 3276.620226] CR2: 00007fff115e7e10 CR3: 000000009655d000 CR4: 00000000001407e0
Dec 21 13:38:12 localhost kernel: [ 3276.620248] Stack:
Dec 21 13:38:12 localhost kernel: [ 3276.620255]  000280d00000ffff 0000000000000001 0000000000000246 ffff8800c3757df8
Dec 21 13:38:12 localhost kernel: [ 3276.620282]  ffff8800c3757d70 ffff8800c3757df8 ffff8800c3757e50 ffff88012a0be540
Dec 21 13:38:12 localhost kernel: [ 3276.620309]  ffff8800c3757cf8 ffffffff8114cbdf ffff8801292e2a80 ffff8800c4bca020
Dec 21 13:38:12 localhost kernel: [ 3276.620336] Call Trace:
Dec 21 13:38:12 localhost kernel: [ 3276.620348]  [<ffffffff8114cbdf>] lookup_fast+0x3f/0x2c0
Dec 21 13:38:12 localhost kernel: [ 3276.620367]  [<ffffffff81150e02>] do_last+0xa2/0x1130
Dec 21 13:38:12 localhost kernel: [ 3276.620385]  [<ffffffff8114dc89>] ? link_path_walk+0x69/0x890
Dec 21 13:38:12 localhost kernel: [ 3276.620405]  [<ffffffff81151f46>] path_openat+0xb6/0x630
Dec 21 13:38:12 localhost kernel: [ 3276.620423]  [<ffffffff81152c25>] do_filp_open+0x35/0x80
Dec 21 13:38:12 localhost kernel: [ 3276.620442]  [<ffffffff8115ea0d>] ? __alloc_fd+0x7d/0x120
Dec 21 13:38:12 localhost kernel: [ 3276.620460]  [<ffffffff81142513>] do_sys_open+0x123/0x220
Dec 21 13:38:12 localhost kernel: [ 3276.620479]  [<ffffffff81142629>] SyS_open+0x19/0x20
Dec 21 13:38:12 localhost kernel: [ 3276.620497]  [<ffffffff81783e92>] system_call_fastpath+0x16/0x1b
Dec 21 13:38:12 localhost kernel: [ 3276.620516] Code: 83 e3 fe 0f 84 b2 00 00 00 4c 89 f0 48 c1 e8 20 49 89 c3 eb 12 66 0f 1f 44 00 00 48 8b 1b 48 85 db 0f 84 94 00 00 00 4c 8d 6b f8 <8b> 53 fc 4c 39 63 10 75 e7 48 83 7b 08 00 74 e0 83 e2 fe 41 f6 
Dec 21 13:38:12 localhost kernel: [ 3276.620635] RIP  [<ffffffff8115a6d8>] __d_lookup_rcu+0x78/0x160
Dec 21 13:38:12 localhost kernel: [ 3276.620656]  RSP <ffff8800c3757c68>
Dec 21 13:38:12 localhost kernel: [ 3276.627506] ---[ end trace 74da4d08df847136 ]---
