Ceph read/write performance problem: reads are fast, writes are slow
Hi, we need to migrate all of our cloud environments to Proxmox, and I am currently evaluating and testing Proxmox + Ceph + OpenStack.
But we are now facing the following difficulties:
- When migrating from VMware vSAN to Ceph, an HDD+SSD setup performs very badly in Ceph; write performance is very poor and nowhere near vSAN.
- On all-flash, Ceph's sequential write performance is worse than a single drive, even worse than a single mechanical disk.
- With an HDD+SSD setup on bcache, Ceph's sequential write performance is still far below a single drive.
Please forgive my poor English.
Test server specs (not that this matters much)
CPU: 2 x Intel® Xeon® E5-2698B v3
Memory: 8 x 16G DDR3
Dual 1Gbit NICs: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411
Disks:
1 x 500G NVMe SAMSUNG MZALQ512HALU-000L1 (also used as the ssd-data thinpool in PVE)
1 x 500G SATA WDC_WD5000AZLX-60K2TA0 (host system disk)
2 x 500G SATA WDC_WD5000AZLX-60K2TA0
1 x 1T SATA ST1000LM035-1RK172
PVE: pve-manager/7.3-4/d69b70d4 (running kernel: 5.15.74-1-pve)
Network configuration:
enp4s0 (OVS port) -> vmbr0 (OVS bridge) -> br0mgmt (192.168.1.3/24, 192.168.1.1)
enp5s0 (OVS port, MTU=9000) -> vmbr1 (OVS bridge, MTU=9000)
vmbr2 (OVS bridge, MTU=9000)
Test VM specs x 3 (all three VMs have identical specs)
CPU: 32 (1 socket, 32 cores) [host]
Memory: 32G
Disks:
1 x local-lvm:vm-101-disk-0,iothread=1,size=32G
2 x ssd-data:vm-101-disk-0,iothread=1,size=120G
Network devices:
net0: bridge=vmbr0,firewall=1
net1: bridge=vmbr2,firewall=1,mtu=1 (Ceph cluster/public network)
net2: bridge=vmbr0,firewall=1
net3: bridge=vmbr0,firewall=1
Network configuration:
ens18 (net0, OVS port) -> vmbr0 (OVS bridge) -> br0mgmt (10.10.1.11/24, 10.10.1.1)
ens19 (net1, OVS port, MTU=9000) -> vmbr1 (OVS bridge, MTU=9000) -> br1ceph (192.168.10.1/24, MTU=9000)
ens20 (net2, network device, active=no)
ens21 (net3, network device, active=no)
Benchmark tools
- fio
- fio-cdm (https://github.com/xlucn/fio-cdm)
For fio-cdm, when run without any arguments, the corresponding fio job file is the following (obtained with 'python fio-cdm -f-'):
[global]
ioengine=libaio
filename=.fio_testmark
directory=/root
size=1073741824.0
direct=1
runtime=5
refill_buffers
norandommap
randrepeat=0
allrandrepeat=0
group_reporting
[seq-read-1m-q8-t1]
rw=read
bs=1m
rwmixread=0
iodepth=8
numjobs=1
loops=5
stonewall
[seq-write-1m-q8-t1]
rw=write
bs=1m
rwmixread=0
iodepth=8
numjobs=1
loops=5
stonewall
[seq-read-1m-q1-t1]
rw=read
bs=1m
rwmixread=0
iodepth=1
numjobs=1
loops=5
stonewall
[seq-write-1m-q1-t1]
rw=write
bs=1m
rwmixread=0
iodepth=1
numjobs=1
loops=5
stonewall
[rnd-read-4k-q32-t16]
rw=randread
bs=4k
rwmixread=0
iodepth=32
numjobs=16
loops=5
stonewall
[rnd-write-4k-q32-t16]
rw=randwrite
bs=4k
rwmixread=0
iodepth=32
numjobs=16
loops=5
stonewall
[rnd-read-4k-q1-t1]
rw=randread
bs=4k
rwmixread=0
iodepth=1
numjobs=1
loops=5
stonewall
[rnd-write-4k-q1-t1]
rw=randwrite
bs=4k
rwmixread=0
iodepth=1
numjobs=1
loops=5
stonewall
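For reference, the same job file can also be saved and run with plain fio (just a sketch; it assumes fio-cdm's -f option writes the generated job file to the given path, '-' meaning stdout, as used above):
# Dump the auto-generated job file and run it with plain fio
python3 ~/fio-cdm/fio-cdm -f- > /root/cdm.fio
fio /root/cdm.fio --output=/root/cdm-results.txt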
Environment setup steps
# prepare tools
root@pve01:~# apt update -y && apt upgrade -y
root@pve01:~# apt install fio git -y
root@pve01:~# git clone https://github.com/xlucn/fio-cdm.git
# create test block
root@pve01:~# rbd create test -s 20G
root@pve01:~# rbd map test
root@pve01:~# mkfs.xfs /dev/rbd0
root@pve01:~# mkdir /mnt/test
root@pve01:/mnt# mount /dev/rbd0 /mnt/test
# start test
root@pve01:/mnt/test# python3 ~/fio-cdm/fio-cdm
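As a sanity check below the RBD/XFS layer, the same pool can also be exercised with rados bench (a sketch only; the pool name rbd matches the image created above):
# Raw RADOS-level benchmark on the same pool, bypassing RBD and XFS
rados bench -p rbd 60 write -b 4M -t 16 --no-cleanup   # 4M writes, 16 in flight
rados bench -p rbd 60 seq -t 16                        # sequential reads of the objects just written
rados bench -p rbd 60 rand -t 16                       # random reads
rados -p rbd cleanup                                   # remove the benchmark objects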
Environment tests
- Network bandwidth
root@pve01:~# apt install iperf3 -y
root@pve01:~# iperf3 -s
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
Accepted connection from 10.10.1.12, port 52968
[ 5] local 10.10.1.11 port 5201 connected to 10.10.1.12 port 52972
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 1.87 GBytes 16.0 Gbits/sec
[ 5] 1.00-2.00 sec 1.92 GBytes 16.5 Gbits/sec
[ 5] 2.00-3.00 sec 1.90 GBytes 16.4 Gbits/sec
[ 5] 3.00-4.00 sec 1.90 GBytes 16.3 Gbits/sec
[ 5] 4.00-5.00 sec 1.85 GBytes 15.9 Gbits/sec
[ 5] 5.00-6.00 sec 1.85 GBytes 15.9 Gbits/sec
[ 5] 6.00-7.00 sec 1.70 GBytes 14.6 Gbits/sec
[ 5] 7.00-8.00 sec 1.75 GBytes 15.0 Gbits/sec
[ 5] 8.00-9.00 sec 1.89 GBytes 16.2 Gbits/sec
[ 5] 9.00-10.00 sec 1.87 GBytes 16.0 Gbits/sec
[ 5] 10.00-10.04 sec 79.9 MBytes 15.9 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate
[ 5] 0.00-10.04 sec 18.6 GBytes 15.9 Gbits/sec receiver
- Jumbo frames
root@pve01:~# ping -M do -s 8000 192.168.10.2
PING 192.168.10.2 (192.168.10.2) 8000(8028) bytes of data.
8008 bytes from 192.168.10.2: icmp_seq=1 ttl=64 time=1.51 ms
8008 bytes from 192.168.10.2: icmp_seq=2 ttl=64 time=0.500 ms
^C
--- 192.168.10.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 0.500/1.007/1.514/0.507 ms
root@pve01:~#
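For reference, a sketch of the client side of the bandwidth test plus a quick latency check on the Ceph network (replicated Ceph writes are bound by round-trip latency as much as by bandwidth); this assumes pve02 carries 10.10.1.12 / 192.168.10.2:
# On pve02: bandwidth to pve01 over the management network
iperf3 -c 10.10.1.11 -t 10
# Round-trip latency on the Ceph network: 100 pings, 10 ms apart, summary only
ping -c 100 -i 0.01 -q 192.168.10.1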
Benchmark categories
- Physical disk benchmark
- Single OSD, single server benchmark
- Multiple OSDs, single server benchmark
- Multiple OSDs, multiple servers benchmark
Benchmark results (neither Ceph nor the OS has been tuned, and no bcache acceleration is used)
1. Physical disk benchmark (test sequence: 4)
Steps.
root@pve1:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 465.8G 0 disk
├─sda1 8:1 0 1007K 0 part
├─sda2 8:2 0 512M 0 part /boot/efi
└─sda3 8:3 0 465.3G 0 part
├─pve-root 253:0 0 96G 0 lvm /
├─pve-data_tmeta 253:1 0 3.5G 0 lvm
│ └─pve-data-tpool 253:3 0 346.2G 0 lvm
│ ├─pve-data 253:4 0 346.2G 1 lvm
│ └─pve-vm--100--disk--0 253:5 0 16G 0 lvm
└─pve-data_tdata 253:2 0 346.2G 0 lvm
└─pve-data-tpool 253:3 0 346.2G 0 lvm
├─pve-data 253:4 0 346.2G 1 lvm
└─pve-vm--100--disk--0 253:5 0 16G 0 lvm
sdb 8:16 0 931.5G 0 disk
sdc 8:32 0 465.8G 0 disk
sdd 8:48 0 465.8G 0 disk
nvme0n1 259:0 0 476.9G 0 disk
root@pve1:~# mkfs.xfs /dev/nvme0n1 -f
root@pve1:~# mkdir /mnt/nvme
root@pve1:~# mount /dev/nvme0n1 /mnt/nvme
root@pve1:~# cd /mnt/nvme/
Results.
root@pve1:/mnt/nvme# python3 ~/fio-cdm/fio-cdm
tests: 5, size: 1.0GiB, target: /mnt/nvme 3.4GiB/476.7GiB
|Name | Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 | 2361.95| 1435.48|
|SEQ1M Q1 T1 | 1629.84| 1262.63|
|RND4K Q32T16| 954.86| 1078.88|
|. IOPS | 233119.53| 263398.08|
|. latency us| 2194.84| 1941.78|
|RND4K Q1 T1 | 55.56| 225.06|
|. IOPS | 13565.49| 54946.21|
|. latency us| 72.76| 16.97|
2. Single OSD, single server benchmark (test sequence: 3)
Modify ceph.conf: set osd_pool_default_min_size and osd_pool_default_size to 1,
then systemctl restart ceph.target and fix any resulting errors.
Steps.
root@pve01:/mnt/test# ceph osd pool get rbd size
size: 2
root@pve01:/mnt/test# ceph config set global mon_allow_pool_size_one true
root@pve01:/mnt/test# ceph osd pool set rbd min_size 1
set pool 2 min_size to 1
root@pve01:/mnt/test# ceph osd pool set rbd size 1 --yes-i-really-mean-it
set pool 2 size to 1
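The ceph -s output below shows only osd.0 up and in; one way to leave a single OSD running (a rough sketch using the standard systemd OSD units, adjust the OSD ids per node):
# Mark every OSD except osd.0 out, then stop its daemon on the owning node
for id in 1 2 3 4 5; do ceph osd out $id; done
systemctl stop ceph-osd@1              # on pve01
systemctl stop ceph-osd@2 ceph-osd@3   # on pve02
systemctl stop ceph-osd@4 ceph-osd@5   # on pve03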
Results
root@pve01:/mnt/test# ceph -s
cluster:
id: 1f3eacc8-2488-4e1a-94bf-7181ee7db522
health: HEALTH_WARN
2 pool(s) have no replicas configured
services:
mon: 3 daemons, quorum pve01,pve02,pve03 (age 17m)
mgr: pve01(active, since 17m), standbys: pve02, pve03
osd: 6 osds: 1 up (since 19s), 1 in (since 96s)
data:
pools: 2 pools, 33 pgs
objects: 281 objects, 1.0 GiB
usage: 1.1 GiB used, 119 GiB / 120 GiB avail
pgs: 33 active+clean
root@pve01:/mnt/test# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.70312 root default
-3 0.23438 host pve01
0 ssd 0.11719 osd.0 up 1.00000 1.00000
1 ssd 0.11719 osd.1 down 0 1.00000
-5 0.23438 host pve02
2 ssd 0.11719 osd.2 down 0 1.00000
3 ssd 0.11719 osd.3 down 0 1.00000
-7 0.23438 host pve03
4 ssd 0.11719 osd.4 down 0 1.00000
5 ssd 0.11719 osd.5 down 0 1.00000
root@pve01:/mnt/test# python3 ~/fio-cdm/fio-cdm
tests: 5, size: 1.0GiB, target: /mnt/test 175.8MiB/20.0GiB
|Name | Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 | 1153.07| 515.29|
|SEQ1M Q1 T1 | 447.35| 142.98|
|RND4K Q32T16| 99.07| 32.19|
|. IOPS | 24186.26| 7859.91|
|. latency us| 21148.94| 65076.23|
|RND4K Q1 T1 | 7.47| 1.48|
|. IOPS | 1823.24| 360.98|
|. latency us| 545.98| 2765.23|
root@pve01:/mnt/test#
3. Multiple OSDs, single server benchmark (test sequence: 2)
Modify the crushmap: change "step chooseleaf firstn 0 type host"
to "step chooseleaf firstn 0 type osd" (one way to apply this is sketched below).
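One way to apply that rule change, as a sketch using the usual crushtool round-trip (file names are arbitrary):
# Export, decompile, edit, recompile and inject the CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
sed -i 's/step chooseleaf firstn 0 type host/step chooseleaf firstn 0 type osd/' crushmap.txt
crushtool -c crushmap.txt -o crushmap.new.bin
ceph osd setcrushmap -i crushmap.new.bin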
OSD tree
root@pve01:/etc/ceph# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.70312 root default
-3 0.23438 host pve01
0 ssd 0.11719 osd.0 up 1.00000 1.00000
1 ssd 0.11719 osd.1 up 1.00000 1.00000
-5 0.23438 host pve02
2 ssd 0.11719 osd.2 down 0 1.00000
3 ssd 0.11719 osd.3 down 0 1.00000
-7 0.23438 host pve03
4 ssd 0.11719 osd.4 down 0 1.00000
5 ssd 0.11719 osd.5 down 0 1.00000
Results
root@pve01:/mnt/test# python3 ~/fio-cdm/fio-cdm
tests: 5, size: 1.0GiB, target: /mnt/test 175.8MiB/20.0GiB
|Name | Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 | 1376.59| 397.29|
|SEQ1M Q1 T1 | 442.74| 111.41|
|RND4K Q32T16| 114.97| 29.08|
|. IOPS | 28068.12| 7099.90|
|. latency us| 18219.04| 72038.06|
|RND4K Q1 T1 | 6.82| 1.04|
|. IOPS | 1665.27| 254.40|
|. latency us| 598.00| 3926.30|
4. Multiple OSDs, multiple servers benchmark (test sequence: 1)
OSD tree
root@pve01:/etc/ceph# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.70312 root default
-3 0.23438 host pve01
0 ssd 0.11719 osd.0 up 1.00000 1.00000
1 ssd 0.11719 osd.1 up 1.00000 1.00000
-5 0.23438 host pve02
2 ssd 0.11719 osd.2 up 1.00000 1.00000
3 ssd 0.11719 osd.3 up 1.00000 1.00000
-7 0.23438 host pve03
4 ssd 0.11719 osd.4 up 1.00000 1.00000
5 ssd 0.11719 osd.5 up 1.00000 1.00000
Results
tests: 5, size: 1.0GiB, target: /mnt/test 175.8MiB/20.0GiB
|Name | Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 | 1527.37| 296.25|
|SEQ1M Q1 T1 | 408.86| 106.43|
|RND4K Q32T16| 189.20| 43.00|
|. IOPS | 46191.94| 10499.01|
|. latency us| 11068.93| 48709.85|
|RND4K Q1 T1 | 4.99| 0.95|
|. IOPS | 1219.16| 232.37|
|. latency us| 817.51| 4299.14|
Conclusions
- There is a huge gap between Ceph's write performance (106.43 MB/s) and the physical disk's write performance (1262.63 MB/s); at RND4K Q1 T1 the NVMe effectively turns into a mechanical hard drive (a quick per-OSD check is sketched right after this list).
- One or more OSDs, on one or more machines, makes little difference to Ceph (maybe my cluster is simply not big enough).
- A three-node Ceph cluster cuts disk read performance roughly in half, and write performance drops to a quarter or even less.
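A quick, fio-independent way to double-check per-OSD write speed and latency (a sketch using two built-in probes):
# Per-OSD raw write check (by default writes 1 GiB in 4 MiB blocks to the OSD's store)
ceph tell osd.0 bench
# Commit/apply latency of all OSDs as currently seen by the cluster
ceph osd perf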
Appendix
1 - Some SSD benchmark results
Micron_1100_MTFDDAK1T0TB SCSI disk device
G:\fio>python "E:\Programing\PycharmProjects\fio-cdm\fio-cdm"
tests: 5, size: 1.0GiB, target: G:\fio 228.2GiB/953.8GiB
|Name | Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 | 363.45| 453.54|
|SEQ1M Q1 T1 | 329.47| 404.09|
|RND4K Q32T16| 196.16| 212.42|
|. IOPS | 47890.44| 51861.48|
|. latency us| 10677.71| 9862.74|
|RND4K Q1 T1 | 20.66| 65.44|
|. IOPS | 5044.79| 15976.40|
|. latency us| 197.04| 61.07|
Samsung MZALQ512HALU-000L1
root@pve1:/mnt/test# python3 ~/fio-cdm/fio-cdm
tests: 5, size: 1.0GiB, target: /mnt/test 3.4GiB/476.7GiB
|Name | Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 | 2358.84| 1476.54|
|SEQ1M Q1 T1 | 1702.19| 1291.18|
|RND4K Q32T16| 955.34| 1070.17|
|. IOPS | 233238.46| 261273.09|
|. latency us| 2193.90| 1957.79|
|RND4K Q1 T1 | 55.04| 229.99|
|. IOPS | 13437.11| 56149.97|
|. latency us| 73.17| 16.65|
2 - bcache
Test results for Ceph on a bcache-accelerated HDD+SSD mixed-disk setup: READ improves noticeably, but WRITE is still very poor.
tests: 5, size: 1.0GiB, target: /mnt/test 104.3MiB/10.0GiB
|Name | Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 | 1652.93| 242.41|
|SEQ1M Q1 T1 | 552.91| 81.16|
|RND4K Q32T16| 429.52| 31.95|
|. IOPS | 104862.76| 7799.72|
|. latency us| 4879.87| 65618.50|
|RND4K Q1 T1 | 13.10| 0.45|
|. IOPS | 3198.16| 110.09|
|. latency us| 310.07| 9077.11|
Even putting multiple OSDs on one disk does not solve the WRITE problem.
Detailed test data: https://www.reddit.com/r/ceph/comments/xnse2j/comment/j6qs57g/?context=3
With VMware vSAN I can easily use SSDs to accelerate HDDs, and I can hardly tell the HDDs are there (I haven't compared in detail, this is just my impression).
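For reference, a minimal bcache writeback setup for the HDD+SSD test would look roughly like this (a sketch only; device names are placeholders, not my actual layout):
# Minimal bcache writeback setup: HDD as backing device, SSD/NVMe partition as cache
apt install bcache-tools -y
make-bcache -B /dev/sdX                                # backing HDD -> /dev/bcache0
make-bcache -C /dev/nvme0n1pY                          # cache device
bcache-super-show /dev/nvme0n1pY | grep cset.uuid      # note the cache-set UUID
echo <cset-uuid> > /sys/block/bcache0/bcache/attach    # attach the cache to the backing device
echo writeback > /sys/block/bcache0/bcache/cache_mode
pveceph osd create /dev/bcache0                        # then build the OSD on top of bcache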
3 - Analysis of other test reports
I analyzed and compared several reports; my summary is below.
Proxmox-VE_Ceph-Benchmark-201802.pdf
Proxmox-VE_Ceph-Benchmark-202009-rev2.pdf
Dell_R730xd_RedHat_Ceph_Performance_SizingGuide_WhitePaper.pdf
micron_9300_and_red_hat_ceph_reference_architecture.pdf
1) - pve 201802
According to the report, the test setup was 6 servers, each with 4 x Samsung SM863 Series 2.5" 240 GB SSD, SATA-3 (6 Gb/s), MLC.
# Samsung SM863 Series, 2.5", 240 GB SSD
# from https://www.samsung.com/us/business/support/owners/product/sm863-series-240gb/
|Name | Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ?M Q? T? | 520.00| 485.00|
|RND4K Q? T? | ?| ?|
|. IOPS | 97000.00| 20000.00|
Results from the report
# 3 Node Cluster/ 4 x Samsung SM863 as OSD per Node
# rados bench 60 write -b 4M -t 16
# rados bench 60 read -t 16 (uses 4M from write)
|Name | Read(MB/s)| Write(MB/s)|
# 10 Gbit Network
|------------|------------|------------|
|SEQ4M Q? T16| 1064.42| 789.12|
# 100 Gbit Network
|------------|------------|------------|
|SEQ4M Q? T16| 3087.82| 1011.63|
It is clear that network bandwidth has a huge impact on performance. Even though a 10 Gbit network is not enough, read and write at least get close to the bandwidth limit. Looking at my own results, however, the write speed is very poor (296.25 MB/s).
2) - pve 202009
According to the report, the test setup was 3 servers, each with 4 x Micron 9300 Max 3.2 TB (MTFDHAL3T2TDR) and 1 x 100 GbE DAC in a full-mesh topology.
# Micron 9300 Max 3.2 TB (MTFDHAL3T2TDR)
|Name | Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ128KQ32T?| 3500.00| 3100.00| (MTFDHAL12T8TDR-1AT1ZABYY-Micron-LBGA-2022.pdf)
|RND4K Q512T?| 3340.00| 840.00| (Estimate according to formula, throughput ~= iops * 4k / 1000)
|. IOPS | 835000.00| 210000.00| (MTFDHAL12T8TDR-1AT1ZABYY-Micron-LBGA-2022.pdf)
|------------|------------|------------|
|RND4K Q1 T1 | | 205.82| (from the report)
|. IOPS | | 51000.00| (from the report)
|. latency ms| | 0.02| (from the report)
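The rows marked "Estimate according to formula" are just IOPS converted to bandwidth (throughput ~= IOPS * block size); for example, for the spec-sheet random-read figure:
# 835,000 IOPS at 4 KB per IO ~= 3340 MB/s, matching the RND4K Q512 estimate above
echo "835000 * 4 / 1000" | bc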
Results from the report
# MULTI-VM WORKLOAD (LINUX)
# I don't understand the difference between Thread and Job, and the queue depth is not identified in the document
|Name | Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ4M Q? T1 | 7176.00| 2581.00| (SEQUENTIAL BANDWIDTH BY NUMBER OF JOBS)
|RND4K Q1 T1 | 86.00| 28.99| (Estimate according to formula)
|. IOPS | 21502.00| 7248.00| (RANDOM IO/S BY NUMBER OF JOBS)
Again, the RND4K Q1 T1 WRITE result is poor: only 7k IOPS, while the physical disk delivers 51k IOPS. I find that hard to accept.
3) - Dell R730xd report
According to the report, the test setup was 5 x storage server, each with 12 HDD + 3 SSD, 3 x replication, 2 x 10GbE NIC.
# Test results extracted from the report
# Figure 8 Throughput/server comparison by using different configurations
|Name | Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ4M Q64T1 | 1150.00| 300.00|
In this case, WRITE in the SEQ4M Q64T1 result is only about 300 MB/s, roughly twice a single SAS drive (2 x 158.16 MB/s with 4M blocks). That is hard for me to believe; it is even faster than my NVMe-based result. But there is another important fact: 12 x 5 = 60 disks deliver only 300 MB/s of sequential write. Isn't that performance loss far too large?
4) - Micron report
According to the report, the test setup was 3 x storage server, each with 10 x Micron 9300 MAX 12.8T, 2 x replication, 2 x 100GbE NIC.
# micron 9300MAX 12.8T (MTFDHAL12T8TDR-1AT1ZABYY) Physical disk benchmark
|Name | Read(MB/s)| Write(MB/s)| (? = parameter not given in the source)
|------------|------------|------------|
|SEQ?M Q? T? | 48360.00| ?| (from the report)
|SEQ128KQ32T?| 3500.00| 3500.00| (MTFDHAL12T8TDR-1AT1ZABYY-Micron-LBGA-2022.pdf)
|RND4K Q512T?| 3400.00| 1240.00| (Estimate according to formula)
|. IOPS | 850000.00| 310000.00| (MTFDHAL12T8TDR-1AT1ZABYY-Micron-LBGA-2022.pdf)
|. latency us| 86.00| 11.00| (MTFDHAL12T8TDR-1AT1ZABYY-Micron-LBGA-2022.pdf)
|------------|------------|------------|
|RND4K Q? T? | 8397.77| 1908.11| (Estimate according to formula)
|. IOPS | 2099444.00| 477029.00| (from the report, Executive Summary)
|. latency ms| 1.50| 6.70| (from the report, Executive Summary)
Results from the report
# (Test results extracted from the report)
|Name | Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|RND4KQ32T100| ?| ?|
|. IOPS | 2099444.00| 477029.00|
|. latency ms| 1.52| 6.71|
# (I don't know if there is a problem reported on the official website. There is no performance loss here)
I have to say Micron's official test platform is far too high-end; SMEs like us cannot afford it. Judging from the results, WRITE is already close to the performance of a single physical disk. Does that mean that with a single node and a single disk, write performance drops to 477k/30 = 15.9k IOPS? If so, that is SATA SSD territory.
4 - More reports of the same problem
- https://forum.proxmox.com/threads/bad-rand-read-write-io-proxmox-ceph.68404/#post-530006
- https://forum.proxmox.com/threads/ceph-performance-with-simple-hardware-slow-writing.96697/#post-529524
- https://forum.proxmox.com/threads/bad-rand-read-write-io-proxmox-ceph.68404/#post-529520
- https://forum.proxmox.com/threads/bad-rand-read-write-io-proxmox-ceph.68404/#post-529486
- https://www.reddit.com/r/ceph/comments/xnse2j/comment/j6qobtv/?utm_source=share&utm_medium=web2x&context=3
- https://www.reddit.com/r/ceph/comments/kioxqx/comment/j6d3sxc/?utm_source=share&utm_medium=web2x&context=3
Finally, what I want to know is:
- How can the Ceph write performance problem be solved? Can Ceph reach the same performance as VMware vSAN?
- The results show that all-flash performs no better than HDD+SSD. If I don't use bcache, how should I fix Ceph's write performance on all-flash?
- Is there a better solution for an HDD+SSD architecture?