GlusterFS split-brain problem

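I am running into performance problems with my GlusterFS setup. We rolled out a new version of our application, and suddenly all GlusterFS clients, as well as the master servers, started showing high CPU usage. It has been a real headache. My setup is as follows: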

I have two GlusterFS master servers, version 3.7.4:

[root@gfs1 glusterfs]# gluster volume info

Volume Name: repl-vol
Type: Replicate
Volume ID: 7535cfad-6bb9-4147-9fea-e869e7b8d565
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gfs1.myhost.com:/GlusterFS/repl-data
Brick2: gfs2.myhost.com:/GlusterFS/repl-data
Options Reconfigured:
cluster.self-heal-window-size: 100
performance.cache-max-file-size: 2MB
performance.cache-size: 256MB
performance.write-behind-window-size: 4MB
performance.io-thread-count: 32
cluster.data-self-heal-algorithm: diff
nfs.disable: off

[root@gfs2 ec2-user]# gluster volume info

Volume Name: repl-vol
Type: Replicate
Volume ID: 7535cfad-6bb9-4147-9fea-e869e7b8d565
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gfs1.myhost.com:/GlusterFS/repl-data
Brick2: gfs2.myhost.com:/GlusterFS/repl-data
Options Reconfigured:
cluster.self-heal-window-size: 100
nfs.disable: off
cluster.data-self-heal-algorithm: diff
performance.io-thread-count: 32
performance.write-behind-window-size: 4MB
performance.cache-size: 256MB
performance.cache-max-file-size: 2MB
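
The two listings above show the reconfigured options in a different order, which makes them awkward to compare by eye. A minimal sketch for checking that both nodes report the same values, assuming the text of each `gluster volume info` run has been captured into a string (the function names `parse_reconfigured_options` and `diff_options` are illustrative, not part of GlusterFS):

```python
def parse_reconfigured_options(volume_info_text):
    """Collect the 'key: value' lines that follow 'Options Reconfigured:'."""
    options = {}
    in_options = False
    for raw in volume_info_text.splitlines():
        line = raw.strip()
        if line == "Options Reconfigured:":
            in_options = True
            continue
        if not in_options:
            continue
        if not line:
            # A blank line ends the options section.
            break
        key, sep, value = line.partition(": ")
        if sep:
            options[key] = value
    return options


def diff_options(info_a, info_b):
    """Return {option: (value_a, value_b)} for options that differ."""
    a = parse_reconfigured_options(info_a)
    b = parse_reconfigured_options(info_b)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }
```

An empty result from `diff_options` means the two nodes agree on every reconfigured option, regardless of the order in which they are printed.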

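About 14 clients use GlusterFS. It hosts 1.2 TB of data, almost all of it static content (JS, CSS, images). We have been monitoring sudden spikes in server CPU utilization. Network I/O is also very high, at 125 MB/s to 250 MB/s. I checked the logs and mostly found the following messages, over and over: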

[2015-09-09 03:13:33.797655] E [MSGID: 108008] [afr-self-heal-entry.c:253:afr_selfheal_detect_gfid_and_type_mismatch] 0-repl-vol-replicate-0: Gfid mismatch detected for <3fd13508-b29e-4d52-8c9c-14ccd2f24b9f/100000130641_4.jpg>, ed715d52-4a39-46db-901b-16ae13f01898 on repl-vol-client-1 and 0bc0c058-b6a7-4f0d-9d46-96f7fcded0f3 on repl-vol-client-0. Skipping conservative merge on the file.
[2015-09-09 03:13:36.074219] E [MSGID: 108008] [afr-self-heal-entry.c:253:afr_selfheal_detect_gfid_and_type_mismatch] 0-repl-vol-replicate-0: Gfid mismatch detected for <3fd13508-b29e-4d52-8c9c-14ccd2f24b9f/100000132992_4.jpg>, 8b67cc38-df53-43c7-ad42-b9c616b980b1 on repl-vol-client-1 and 41f393de-9d83-4f52-bfcf-832e31a27a87 on repl-vol-client-0. Skipping conservative merge on the file.
[2015-09-09 03:13:36.076681] E [MSGID: 108008] [afr-self-heal-entry.c:253:afr_selfheal_detect_gfid_and_type_mismatch] 0-repl-vol-replicate-0: Gfid mismatch detected for <3fd13508-b29e-4d52-8c9c-14ccd2f24b9f/100000132995_4.jpg>, b1dd578b-3dfe-43dc-ad3a-d54c86298278 on repl-vol-client-1 and bd7c42b9-575f-46bc-9f56-804994f27ab0 on repl-vol-client-0. Skipping conservative merge on the file.
[2015-09-09 04:00:50.975933] I [MSGID: 108026] [afr-self-heal-entry.c:589:afr_selfheal_entry_do] 0-repl-vol-replicate-0: performing entry selfheal on cc9d0e49-c9ab-4dab-bca4-1c06c8a7a4e3
[2015-09-09 04:00:51.005409] E [MSGID: 108008] [afr-self-heal-entry.c:253:afr_selfheal_detect_gfid_and_type_mismatch] 0-repl-vol-replicate-0: Gfid mismatch detected for <cc9d0e49-c9ab-4dab-bca4-1c06c8a7a4e3/100000160597.jpg>, 68c6fd47-6edc-46fe-8992-2d662bc698e8 on repl-vol-client-1 and 43e1a033-ad08-495b-b762-757cb2f566c0 on repl-vol-client-0. Skipping conservative merge on the file.
[2015-09-09 04:00:51.011467] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-repl-vol-client-0: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
[2015-09-09 04:00:51.014205] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-repl-vol-client-0: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
[2015-09-09 04:00:51.046092] E [MSGID: 108008] [afr-self-heal-entry.c:253:afr_selfheal_detect_gfid_and_type_mismatch] 0-repl-vol-replicate-0: Gfid mismatch detected for <cc9d0e49-c9ab-4dab-bca4-1c06c8a7a4e3/100000160597.jpg>, 68c6fd47-6edc-46fe-8992-2d662bc698e8 on repl-vol-client-1 and 43e1a033-ad08-495b-b762-757cb2f566c0 on repl-vol-client-0. Skipping conservative merge on the file.
[2015-09-09 04:10:53.125065] I [MSGID: 108026] [afr-self-heal-entry.c:589:afr_selfheal_entry_do] 0-repl-vol-replicate-0: performing entry selfheal on cc9d0e49-c9ab-4dab-bca4-1c06c8a7a4e3
[2015-09-09 04:10:53.225256] E [MSGID: 108008] [afr-self-heal-entry.c:253:afr_selfheal_detect_gfid_and_type_mismatch] 0-repl-vol-replicate-0: Gfid mismatch detected for <cc9d0e49-c9ab-4dab-bca4-1c06c8a7a4e3/100000160597.jpg>, 68c6fd47-6edc-46fe-8992-2d662bc698e8 on repl-vol-client-1 and 43e1a033-ad08-495b-b762-757cb2f566c0 on repl-vol-client-0. Skipping conservative merge on the file.
[2015-09-09 04:10:53.232229] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-repl-vol-client-0: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
[2015-09-09 04:10:53.236203] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-repl-vol-client-0: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
[2015-09-09 04:10:53.343344] E [MSGID: 108008] [afr-self-heal-entry.c:253:afr_selfheal_detect_gfid_and_type_mismatch] 0-repl-vol-replicate-0: Gfid mismatch detected for <cc9d0e49-c9ab-4dab-bca4-1c06c8a7a4e3/100000160597.jpg>, 68c6fd47-6edc-46fe-8992-2d662bc698e8 on repl-vol-client-1 and 43e1a033-ad08-495b-b762-757cb2f566c0 on repl-vol-client-0. Skipping conservative merge on the file.
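
Since the same messages repeat, it can help to tally how many distinct files are affected per parent directory. A minimal sketch, assuming the log lines follow exactly the "Gfid mismatch detected for <parent-gfid/name>" format shown above (the regular expression and the function name `summarize_gfid_mismatches` are my own illustration, not a GlusterFS tool):

```python
import re
from collections import Counter

# Matches the "Gfid mismatch" message format seen in the log excerpt above.
MISMATCH_RE = re.compile(
    r"Gfid mismatch detected for <(?P<parent>[0-9a-f-]{36})/(?P<name>[^>]+)>, "
    r"(?P<gfid_a>[0-9a-f-]{36}) on (?P<sub_a>[\w-]+) and "
    r"(?P<gfid_b>[0-9a-f-]{36}) on (?P<sub_b>[\w-]+)"
)


def summarize_gfid_mismatches(lines):
    """Return (Counter of distinct files per parent gfid, total matching lines)."""
    files_by_parent = {}
    total = 0
    for line in lines:
        match = MISMATCH_RE.search(line)
        if match is None:
            continue
        total += 1
        files_by_parent.setdefault(match["parent"], set()).add(match["name"])
    per_dir = Counter({parent: len(names) for parent, names in files_by_parent.items()})
    return per_dir, total
```

The function accepts any iterable of lines, so an open log file can be passed directly. In the excerpt above it would report three distinct files under `3fd13508-...` and a single file under `cc9d0e49-...` that is logged repeatedly.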

The two main errors are "remote operation failed" and "Gfid mismatch". I even tried to resolve the split-brain, but either I did something wrong or it does not work.

Recovery steps:

[root@gfs2 ec2-user]# gluster volume heal repl-vol info split-brain
Brick gfs1.myhost.com:/GlusterFS/repl-data/
<gfid:cc9d0e49-c9ab-4dab-bca4-1c06c8a7a4e3>
/media/klevu_images/1/0
Number of entries in split-brain: 2

Brick gfs2.myhost.com:/GlusterFS/repl-data/
/media/klevu_images/1/0
<gfid:cc9d0e49-c9ab-4dab-bca4-1c06c8a7a4e3>
Number of entries in split-brain: 2
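
To check whether both bricks report the same split-brain entries, the output above can be grouped by brick. A minimal sketch, assuming the output keeps the "Brick ..." / entries / "Number of entries in split-brain: N" layout shown above (the function name `parse_split_brain_info` is hypothetical):

```python
def parse_split_brain_info(text):
    """Map each brick to the set of entries listed under it."""
    entries_by_brick = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Brick "):
            # Start a new section for this brick.
            current = line[len("Brick "):]
            entries_by_brick[current] = set()
        elif current is not None and line and not line.startswith("Number of entries"):
            entries_by_brick[current].add(line)
    return entries_by_brick
```

Comparing the resulting sets shows that, in the output above, both bricks list the same two entries, just in a different order.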

So I simply deleted the files listed above and then tried gluster volume heal repl-data

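I am not sure whether resolving the split-brain will fix my performance problem. Besides, the split-brain keeps coming back. My main goal is to fix the performance problem.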
