I set up a Ceph cluster by following the steps in this document. I have one manager node, one monitor node, and three OSD nodes. The problem is that right after I finished setting up the cluster, ceph health returned HEALTH_OK on all three nodes. However, when I came back to the cluster later, it was no longer healthy. This is the output of the health check:
HEALTH_WARN Reduced data availability: 96 pgs inactive
PG_AVAILABILITY Reduced data availability: 96 pgs inactive
    pg 0.0 is stuck inactive for 35164.889973, current state unknown, last acting []
    pg 0.1 is stuck inactive for 35164.889973, current state unknown, last acting []
    pg 0.2 is stuck inactive for 35164.889973, current state unknown, last acting []
and so on for all the other PGs. I am new to Ceph and I have no idea why this is happening. I am running ceph version 13.2.10 mimic (stable). I searched for an answer, but the other people who seem to have the same problem did not have any node failures. All of my OSDs are reported down; this is the output of ceph -s:
  cluster:
    id:     xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxx
    health: HEALTH_WARN
            Reduced data availability: 96 pgs inactive

  services:
    mon: 1 daemons, quorum server-1
    mgr: server-1(active)
    osd: 3 osds: 0 up, 0 in

  data:
    pools:   2 pools, 96 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             96 unknown
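For reference, this is how the monitor's view of the individual OSDs can be double-checked (standard ceph CLI commands, run from the admin/monitor node):

# Show where each OSD sits in the CRUSH tree and whether it is up/down, in/out
ceph osd tree
# One-line summary of how many OSDs are up and in
ceph osd stat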
I also checked the OSD logs. I still can't tell where the problem is, but these lines suggest that something is wrong with my Ceph version and that I have to upgrade to luminous, even though I already run a newer release:
2021-02-18 22:01:11.994 7fb070e25c00 0 osd.1 14 done with init, starting boot process
2021-02-18 22:01:11.994 7fb070e25c00 1 osd.1 14 start_boot
2021-02-18 22:01:11.998 7fb049add700 -1 osd.1 14 osdmap require_osd_release < luminous; please upgrade to luminous
2021-02-18 22:11:00.706 7fb050aeb700 -1 osd.1 15 osdmap require_osd_release < luminous; please upgrade to luminous
2021-02-18 22:35:52.276 7fb050aeb700 -1 osd.1 16 osdmap require_osd_release < luminous; please upgrade to luminous
2021-02-18 22:36:08.836 7fb050aeb700 -1 osd.1 17 osdmap require_osd_release < luminous; please upgrade to luminous
2021-02-19 04:05:00.895 7fb0512ec700 1 bluestore(/var/lib/ceph/osd/ceph-1) _balance_bluefs_freespace gifting 0x1f00000~100000 to bluefs
2021-02-19 04:05:00.931 7fb0512ec700 1 bluefs add_block_extent bdev 1 0x1f00000~100000
2021-02-19 04:23:51.208 7fb0512ec700 1 bluestore(/var/lib/ceph/osd/ceph-1) _balance_bluefs_freespace gifting 0x2400000~400000 to bluefs
2021-02-19 04:23:51.244 7fb0512ec700 1 bluefs add_block_extent bdev 1 0x2400000~400000
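As far as I understand, the require_osd_release message refers to a flag stored in the osdmap, not to the installed package version. Assuming the standard ceph CLI, the flag can be inspected (and, only if every daemon really runs mimic or newer, raised) like this:

# Show which release the osdmap currently requires
ceph osd dump | grep require_osd_release
# Only applicable when all mons/mgrs/OSDs already run mimic or later
ceph osd require-osd-release mimic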
I also checked the OSD versions with ceph tell osd.* version, and this is the output:
Error ENXIO: problem getting command descriptions from osd.0
osd.0: problem getting command descriptions from osd.0
Error ENXIO: problem getting command descriptions from osd.1
osd.1: problem getting command descriptions from osd.1
Error ENXIO: problem getting command descriptions from osd.2
osd.2: problem getting command descriptions from osd.2
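If I understand correctly, ceph tell needs the OSD daemons to be up and reachable, so the ENXIO errors are consistent with the 0 up, 0 in status above. A monitor-side alternative (daemons that are down simply won't appear in the report) would be:

# Summarise the versions of all daemons the monitor can currently see
ceph versions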
Meanwhile, ceph-osd --version returns ceph version 13.2.10 mimic (stable).
I can't figure out where the problem might be. I also tried systemctl start -l ceph-osd@#, but it didn't help. I don't know what else I can try or why this is happening.
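In case it helps, this is how the OSD services could be checked on each OSD node (the unit names assume a package-based deployment where the units are called ceph-osd@<id>):

# Service state and the most recent journal entries for OSD 0
systemctl status ceph-osd@0
journalctl -u ceph-osd@0 --no-pager -n 50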
Answer 1
I remember running into the same problem a couple of times. Once the problem was iptables: I had forgotten to open the cluster-network ports on the monitor and the OSDs. Another time it was because the failure domain in my crush map was set to host while I was running an all-in-one cluster; setting the crush map failure domain to osd solved it.
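A rough sketch of both fixes, assuming the default Ceph ports (6789 for the monitor, 6800-7300 for the OSD daemons) and placeholder pool/rule names:

# 1) Open the Ceph ports on the monitor and OSD hosts (iptables example)
iptables -A INPUT -p tcp --dport 6789 -j ACCEPT
iptables -A INPUT -p tcp --dport 6800:7300 -j ACCEPT
# 2) For an all-in-one cluster, use a replicated CRUSH rule whose failure domain is osd
ceph osd crush rule create-replicated replicated_osd default osd
ceph osd pool set <pool-name> crush_rule replicated_osd

Remember to persist the iptables rules (e.g. with iptables-save) and to apply the new crush rule to every pool that should tolerate a single-host layout.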