Hung kernel tasks after unclean shutdown of Ceph cluster

I'm running Ceph (created by the rook-ceph operator v0.9.3) on Kubernetes v1.13. After an unclean shutdown of our cluster, some processes randomly end up in uninterruptible sleep. After a while the Kubernetes cluster can no longer schedule new pods. Looking at dmesg, I found the following:

[ 3021.890423] INFO: task tp_fstore_op:22689 blocked for more than 120 seconds.
[ 3021.890456]       Tainted: G           O    4.9.0-8-amd64 #1 Debian 4.9.144-3.1
[ 3021.890480] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3021.890504] tp_fstore_op    D    0 22689  20967 0x00000000
[ 3021.890508]  ffff93c0a5dc0080 0000000000000000 ffff93d137954540 ffff93c1fe8d8980
[ 3021.890510]  ffff93bf42e823c0 ffffb9ae3834b7b0 ffffffff9e0144b9 0000000000008000
[ 3021.890512]  0000000000000040 ffff93c1fe8d8980 ffff93c0a9156300 ffff93d137954540
[ 3021.890515] Call Trace:
[ 3021.890524]  [<ffffffff9e0144b9>] ? __schedule+0x239/0x6f0
[ 3021.890571]  [<ffffffffc0b69321>] ? xfs_reclaim_inode+0x131/0x340 [xfs]
[ 3021.890574]  [<ffffffff9e0149a2>] ? schedule+0x32/0x80
[ 3021.890576]  [<ffffffff9e017d4d>] ? schedule_timeout+0x1dd/0x380
[ 3021.890602]  [<ffffffffc0b8556d>] ? _xfs_log_force_lsn+0x22d/0x320 [xfs]
[ 3021.890613]  [<ffffffff9daf107e>] ? ktime_get+0x3e/0xb0
[ 3021.890635]  [<ffffffffc0b69321>] ? xfs_reclaim_inode+0x131/0x340 [xfs]
[ 3021.890638]  [<ffffffff9e01421d>] ? io_schedule_timeout+0x9d/0x100
[ 3021.890659]  [<ffffffffc0b71e24>] ? __xfs_iunpin_wait+0xd4/0x160 [xfs]
[ 3021.890662]  [<ffffffff9dabd3f0>] ? wake_atomic_t_function+0x60/0x60
[ 3021.890681]  [<ffffffffc0b69321>] ? xfs_reclaim_inode+0x131/0x340 [xfs]
[ 3021.890699]  [<ffffffffc0b6970e>] ? xfs_reclaim_inodes_ag+0x1de/0x300 [xfs]
[ 3021.890702]  [<ffffffff9db91885>] ? node_dirty_ok+0x125/0x170
[ 3021.890704]  [<ffffffff9dd53419>] ? list_del+0x9/0x30
[ 3021.890707]  [<ffffffff9dbe599a>] ? page_is_poisoned+0xa/0x20
[ 3021.890709]  [<ffffffff9db8ba0e>] ? get_page_from_freelist+0x88e/0xb20
[ 3021.890712]  [<ffffffff9daae1ff>] ? select_task_rq_fair+0x51f/0x7e0
[ 3021.890714]  [<ffffffff9daad9d5>] ? select_idle_sibling+0x25/0x330
[ 3021.890716]  [<ffffffff9daa5674>] ? try_to_wake_up+0x54/0x3c0
[ 3021.890734]  [<ffffffffc0b6a771>] ? xfs_reclaim_inodes_nr+0x31/0x40 [xfs]
[ 3021.890736]  [<ffffffff9dc0eed8>] ? super_cache_scan+0x188/0x190
[ 3021.890738]  [<ffffffff9db97a0a>] ? shrink_slab.part.38+0x21a/0x440
[ 3021.890740]  [<ffffffff9db9c3ca>] ? shrink_node+0x10a/0x340
[ 3021.890742]  [<ffffffff9db9c6f1>] ? do_try_to_free_pages+0xf1/0x310
[ 3021.890744]  [<ffffffff9dd38b6a>] ? __next_node_in+0x3a/0x50
[ 3021.890745]  [<ffffffff9db9cb73>] ? try_to_free_mem_cgroup_pages+0xc3/0x1a0
[ 3021.890748]  [<ffffffff9dbfd147>] ? try_charge+0x147/0x6f0
[ 3021.890750]  [<ffffffff9dc01237>] ? mem_cgroup_try_charge+0x67/0x1b0
[ 3021.890752]  [<ffffffff9dbbb1d2>] ? handle_mm_fault+0x10e2/0x1310
[ 3021.890755]  [<ffffffff9dc0ac30>] ? new_sync_write+0xe0/0x130
[ 3021.890758]  [<ffffffff9da622f5>] ? __do_page_fault+0x255/0x4f0
[ 3021.890760]  [<ffffffff9e01a618>] ? page_fault+0x28/0x30
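
While the node is in this state, the stuck processes can be listed directly with standard tools (a general check, not specific to Ceph): state `D` is uninterruptible sleep, and the `wchan` column names the kernel function a task is currently blocked in.

```shell
# List tasks in uninterruptible sleep (state D) together with the
# kernel function they are blocked in (wchan). Run on the affected
# node; the XFS/rbd threads from the traces above should appear here.
ps axo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /^D/'
```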

After this, accessing the RBDs immediately produces similar errors:

[ 3021.890820] INFO: task xfsaild/rbd2:23307 blocked for more than 120 seconds.
[ 3021.890845]       Tainted: G           O    4.9.0-8-amd64 #1 Debian 4.9.144-3.1
[ 3021.890867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3021.890896] xfsaild/rbd2    D    0 23307      2 0x00000000
[ 3021.890898]  ffff93c182e46480 0000000000000000 ffff93d0d3a4ca00 ffff93d1fdb58980
[ 3021.890900]  ffff93d1f6a4a180 ffffb9ae24e07d80 ffffffff9e0144b9 0000000000000246
[ 3021.890903]  00ffffff9dae787d ffff93d1fdb58980 e182622c538e97d5 ffff93d0d3a4ca00
[ 3021.890905] Call Trace:
[ 3021.890909]  [<ffffffff9e0144b9>] ? __schedule+0x239/0x6f0
[ 3021.890911]  [<ffffffff9e0149a2>] ? schedule+0x32/0x80
[ 3021.890948]  [<ffffffffc0b8508c>] ? _xfs_log_force+0x15c/0x2b0 [xfs]
[ 3021.890949]  [<ffffffff9daa5a70>] ? wake_up_q+0x70/0x70
[ 3021.890973]  [<ffffffffc0b92895>] ? xfsaild+0x1a5/0x7a0 [xfs]
[ 3021.890994]  [<ffffffffc0b926f0>] ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs]
[ 3021.890996]  [<ffffffff9da9a5d9>] ? kthread+0xd9/0xf0
[ 3021.890998]  [<ffffffff9e019364>] ? __switch_to_asm+0x34/0x70
[ 3021.891000]  [<ffffffff9da9a500>] ? kthread_park+0x60/0x60
[ 3021.891002]  [<ffffffff9e0193f7>] ? ret_from_fork+0x57/0x70
[ 3021.891004] INFO: task xfsaild/rbd3:23438 blocked for more than 120 seconds.
[ 3021.891027]       Tainted: G           O    4.9.0-8-amd64 #1 Debian 4.9.144-3.1
[ 3021.891050] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3021.891074] xfsaild/rbd3    D    0 23438      2 0x00000000
[ 3021.891075]  ffff93c0fb0464c0 0000000000000000 ffff93d0a88f61c0 ffff93d1fdd18980
[ 3021.891077]  ffff93d1f6a80340 ffffb9ae24e37d80 ffffffff9e0144b9 0000000000000246
[ 3021.891080]  00ffffff9dae787d ffff93d1fdd18980 10168cfc448e06f4 ffff93d0a88f61c0
[ 3021.891081] Call Trace:
[ 3021.891084]  [<ffffffff9e0144b9>] ? __schedule+0x239/0x6f0
[ 3021.891086]  [<ffffffff9e0149a2>] ? schedule+0x32/0x80
[ 3021.891108]  [<ffffffffc0b8508c>] ? _xfs_log_force+0x15c/0x2b0 [xfs]
[ 3021.891109]  [<ffffffff9daa5a70>] ? wake_up_q+0x70/0x70
[ 3021.891130]  [<ffffffffc0b92895>] ? xfsaild+0x1a5/0x7a0 [xfs]
[ 3021.891151]  [<ffffffffc0b926f0>] ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs]
[ 3021.891153]  [<ffffffff9da9a5d9>] ? kthread+0xd9/0xf0
[ 3021.891154]  [<ffffffff9e019364>] ? __switch_to_asm+0x34/0x70
[ 3021.891156]  [<ffffffff9da9a500>] ? kthread_park+0x60/0x60
[ 3021.891158]  [<ffffffff9e0193f7>] ? ret_from_fork+0x57/0x70

There are more errors in dmesg, but they all follow the same pattern: some process attempts an operation on XFS, the kernel task gets stuck, and the process remains in uninterruptible sleep.
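
To get an overview of every affected task without scrolling through the full traces, the hung-task report lines can be filtered out of a saved copy of the log; `dmesg.txt` here is a hypothetical file holding the dmesg output shown above.

```shell
# Extract the name:pid of every task the kernel reported as hung.
# dmesg.txt is a hypothetical capture of the dmesg output above.
grep 'blocked for more than' dmesg.txt \
  | sed -E 's/.*INFO: task ([^ ]+) blocked.*/\1/' \
  | sort -u
```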

Shortly afterwards, libceph reports the OSD as down:

[ 4218.521314] libceph: osd0 down

journalctl reports no further errors.

The unclean shutdown was necessary because of a similar problem: a Kubernetes pod tried to write a file that was too large for the attached volume. The volume is provided by rook-ceph. This is the configuration I use:

The cluster configuration:

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: "ceph/ceph:v13.2.5-20190319"
  dataDirHostPath: "/var/rook/data"
  dashboard:
    enabled: True
    port: 80
    ssl: False
  network:
    hostNetwork: False  # use SDN (Canal) as network
  mon:
    count: 3
    allowMultiplePerNode: True 
  resources:  # http://docs.ceph.com/docs/mimic/start/hardware-recommendations/
    mgr:
      requests:
        cpu: 4
        memory: "2Gi"
      limits:
        cpu: 4
        memory: "2Gi"
    mon:
      requests:
        cpu: 0.5
        memory: "2Gi"
      limits:
        cpu: 0.5
        memory: "2Gi"
    osd:
      requests:
        cpu: 2
        memory: "5Gi"
      limits:
        cpu: 2
        memory: "5Gi"
  storage:
    useAllNodes: False
    nodes:
    - name: "kubernetes-master"  # matches node label: kubernetes.io/hostname
    useAllDevices: False
    directories:
    - path: "/var/rook/filestore"

The BlockPool configuration:

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: volatile-replicapool
  namespace: rook-ceph
spec:
  failureDomain: osd
  replicated:
    size: 1
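
One detail that may matter here: the pool uses `size: 1`, i.e. a single replica per object. If the one OSD holding an object becomes unavailable after the unclean shutdown, the kernel RBD client has no other copy to fall back on, and I/O can block indefinitely, which would match the XFS hangs above. For comparison, a redundant pool (the name is illustrative, and it assumes more than one OSD exists) would look like:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicated-pool  # illustrative name
  namespace: rook-ceph
spec:
  failureDomain: osd
  replicated:
    size: 2  # two copies; I/O can continue if a single OSD is lost
```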

And the StorageClasses:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
   name: ceph-block-development
provisioner: ceph.rook.io/block
parameters:
  blockPool: volatile-replicapool
  clusterNamespace: rook-ceph
  fstype: xfs
reclaimPolicy: Delete
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
   name: ceph-block-production
provisioner: ceph.rook.io/block
parameters:
  blockPool: volatile-replicapool
  clusterNamespace: rook-ceph
  fstype: xfs
reclaimPolicy: Retain

I'm running Linux 4.9.0-8-amd64 #1 SMP Debian 4.9.144-3.1 (2019-02-19) x86_64.

Any pointers on how to debug this would be greatly appreciated.

Thanks in advance.
