CentOS 7.2, Ceph with 3 OSDs and 1 MON running on the same node. The radosgw and all daemons run on that single node, and everything worked fine. After rebooting the server, all OSDs apparently stopped communicating, and radosgw no longer works; its log shows:
2016-03-09 17:03:30.916678 7fc71bbce880 0 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403), process radosgw, pid 24181
2016-03-09 17:08:30.919245 7fc712da8700 -1 Initialization timeout, failed to initialize
ceph health shows:
HEALTH_WARN 1760 pgs stale; 1760 pgs stuck stale; too many PGs per OSD (1760 > max 300); 2/2 in osds are down
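As an aside, the "too many PGs per OSD" figure is simple arithmetic: the monitor divides the total number of PG replicas by the number of "in" OSDs and warns when the result exceeds the configured limit (300 in this release). A rough sketch of that calculation, assuming a replication size of 2 across the 2 "in" OSDs (the replica size is an assumption, not shown in the output):

```shell
# Per-OSD PG count as the monitor estimates it; total_pgs and in_osds
# come from the ceph health output above, replica_size is an assumption
total_pgs=1760
replica_size=2
in_osds=2

pgs_per_osd=$(( total_pgs * replica_size / in_osds ))
echo "$pgs_per_osd"   # 1760, well above the max of 300
```

With so few OSDs, almost any realistic pool layout would trip this warning; it is separate from (and less urgent than) the stale/down problem.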
and ceph osd tree gives:
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 2.01999 root default
-2 1.01999 host app112
0 1.00000 osd.0 down 1.00000 1.00000
1 0.01999 osd.1 down 0 1.00000
-3 1.00000 host node146
2 1.00000 osd.2 down 1.00000 1.00000
and service ceph status gives:
=== mon.app112 ===
mon.app112: running {"version":"0.94.6"}
=== osd.0 ===
osd.0: running {"version":"0.94.6"}
=== osd.1 ===
osd.1: running {"version":"0.94.6"}
=== osd.2 ===
osd.2: running {"version":"0.94.6"}
And here is service radosgw status:
Redirecting to /bin/systemctl status radosgw.service
● ceph-radosgw.service - LSB: radosgw RESTful rados gateway
Loaded: loaded (/etc/rc.d/init.d/ceph-radosgw)
Active: active (exited) since Wed 2016-03-09 17:03:30 CST; 1 day 23h ago
Docs: man:systemd-sysv-generator(8)
Process: 24134 ExecStop=/etc/rc.d/init.d/ceph-radosgw stop (code=exited, status=0/SUCCESS)
Process: 2890 ExecReload=/etc/rc.d/init.d/ceph-radosgw reload (code=exited, status=0/SUCCESS)
Process: 24153 ExecStart=/etc/rc.d/init.d/ceph-radosgw start (code=exited, status=0/SUCCESS)
Seeing this, I tried sudo /etc/init.d/ceph -a stop osd.1 and start a few times, but the result stayed the same as above:
sudo /etc/init.d/ceph -a stop osd.1
=== osd.1 ===
Stopping Ceph osd.1 on open-kvm-app92...kill 12688...kill 12688...done
sudo /etc/init.d/ceph -a start osd.1
=== osd.1 ===
create-or-move updated item name 'osd.1' weight 0.02 at location {host=open-kvm-app92,root=default} to crush map
Starting Ceph osd.1 on open-kvm-app92...
Running as unit ceph-osd.1.1457684205.040980737.service.
Please help. Thanks.
EDIT: It looks like the mon cannot communicate with the OSDs, even though both daemons are running normally. The OSD log shows:
2016-03-11 17:35:21.649712 7f003c633700 5 osd.0 234 tick
2016-03-11 17:35:22.649982 7f003c633700 5 osd.0 234 tick
2016-03-11 17:35:23.650262 7f003c633700 5 osd.0 234 tick
2016-03-11 17:35:24.650538 7f003c633700 5 osd.0 234 tick
2016-03-11 17:35:25.650807 7f003c633700 5 osd.0 234 tick
2016-03-11 17:35:25.779693 7f0024c96700 5 osd.0 234 heartbeat: osd_stat(6741 MB used, 9119 MB avail, 15861 MB total, peers []/[] op hist [])
2016-03-11 17:35:26.651059 7f003c633700 5 osd.0 234 tick
2016-03-11 17:35:27.651314 7f003c633700 5 osd.0 234 tick
2016-03-11 17:35:28.080165 7f0024c96700 5 osd.0 234 heartbeat: osd_stat(6741 MB used, 9119 MB avail, 15861 MB total, peers []/[] op hist [])
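The telling detail in this log is `peers []/[]`: the OSD reports no heartbeat peers on either its front or back network, which matches the suspicion that the mon and OSDs cannot reach each other. Extracting that field from the sample line above:

```shell
# Sample heartbeat line from the OSD log above
line='2016-03-11 17:35:25.779693 7f0024c96700 5 osd.0 234 heartbeat: osd_stat(6741 MB used, 9119 MB avail, 15861 MB total, peers []/[] op hist [])'

# Pull out the peers field; empty brackets mean no front/back heartbeat peers
peers=$(echo "$line" | grep -o 'peers \[[^]]*\]/\[[^]]*\]')
echo "$peers"   # peers []/[]
```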
Answer 1
I finally figured out the problem. Contrary to Spongman's suggestion, I had to manually change 'type host' to 'type osd' in our crushmap.

After starting rgw, I noticed that the radosgw process was owned by 'root' rather than 'ceph', and ceph -s also showed "100.000% pgs not active".

Searching for "100.000% pgs not active" led me to https://www.cnblogs.com/boshen-hzb/p/13305560.html, which describes the fix: change 'type host' to 'type osd' in the CRUSH map. After that, ceph -s showed HEALTH_OK, the radosgw process ran as 'ceph', and the rgw web service was listening on port 7480.
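For reference, the 'type host' → 'type osd' change is made by decompiling, editing, and re-injecting the CRUSH map. A sketch of the workflow, with file names chosen arbitrarily: the live ceph/crushtool round-trip commands are shown as comments (they need a running cluster), and the edit itself is demonstrated on a sample rule that mirrors the default replicated_ruleset layout.

```shell
# On a live cluster the map round-trip is:
#   ceph osd getcrushmap -o crushmap.bin       # export the binary map
#   crushtool -d crushmap.bin -o crushmap.txt  # decompile to text
#   ... edit crushmap.txt as below ...
#   crushtool -c crushmap.txt -o crushmap.new  # recompile
#   ceph osd setcrushmap -i crushmap.new       # inject the modified map

# Sample decompiled rule (default replicated_ruleset layout):
cat > crushmap.txt <<'EOF'
rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
EOF

# The actual fix: relax the failure domain from 'host' to 'osd', so
# replicas may be placed on different OSDs of the same host
sed -i 's/step chooseleaf firstn 0 type host/step chooseleaf firstn 0 type osd/' crushmap.txt
grep 'chooseleaf' crushmap.txt
```

With only one or two hosts holding OSDs, a 'type host' rule cannot find enough distinct hosts for the replicas, so PGs never go active; relaxing the failure domain to 'osd' lets them activate (at the cost of losing host-level redundancy).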