
ceph-csi very slow on vm #9754

Open
plano-fwinkler opened this issue Nov 19, 2024 · 3 comments

@plano-fwinkler

Proxmox with Ceph, running Talos as a VM with ceph-csi, is much slower than openebs-hostpath. Are there any modules missing from the kernel? (A way to check the loaded modules is sketched after the environment list below.)

Environment

  • Talos version: 1.8.2
  • Kubernetes version: 1.31.2
  • Platform: proxmox with ceph storage
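
To rule out the missing-module question, the modules loaded on the Talos node can be listed with talosctl (a minimal sketch; the node IP 10.0.0.2 is a placeholder):

```
# List the loaded kernel modules on the Talos node and look for the
# RBD/Ceph client modules (10.0.0.2 is a placeholder node IP).
talosctl -n 10.0.0.2 read /proc/modules | grep -E 'rbd|ceph'

# Kernel messages around RBD mapping can also reveal errors:
talosctl -n 10.0.0.2 dmesg | grep -i rbd
```

That said, a missing rbd module usually makes volume mapping fail outright rather than run slowly.
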
@smira
Member

smira commented Nov 19, 2024

The issue you posted doesn't include any relevant details, such as performance numbers, how you set things up, etc.

Ceph is a complicated subject, and setting it up properly is not trivial.

@plano-fwinkler
Author

We have a Proxmox cluster with 5 nodes and a Ceph cluster running on Proxmox. The Ceph cluster has a 100G NIC.

If I test with kubestr fio:

With the local-path StorageClass (openebs-hostpath):

```
./kubestr fio -s openebs-hostpath
PVC created kubestr-fio-pvc-qqb7w
Pod created kubestr-fio-pod-4z7zc
Running FIO test (default-fio) on StorageClass (openebs-hostpath) with a PVC of Size (100Gi)
Elapsed time- 28.089900025s
FIO test results:

FIO version - fio-3.36
Global options - ioengine=libaio verify=0 direct=1 gtod_reduce=1

JobName: read_iops
blocksize=4K filesize=2G iodepth=64 rw=randread
read:
IOPS=49767.750000 BW(KiB/s)=199087
iops: min=41961 max=61272 avg=49501.585938
bw(KiB/s): min=167847 max=245088 avg=198006.484375

JobName: write_iops
blocksize=4K filesize=2G iodepth=64 rw=randwrite
write:
IOPS=21245.320312 BW(KiB/s)=84993
iops: min=9028 max=39728 avg=35385.707031
bw(KiB/s): min=36112 max=158912 avg=141543.125000

JobName: read_bw
blocksize=128K filesize=2G iodepth=64 rw=randread
read:
IOPS=36891.605469 BW(KiB/s)=4722663
iops: min=31849 max=45298 avg=36709.964844
bw(KiB/s): min=4076761 max=5798144 avg=4698881.500000

JobName: write_bw
blocksize=128k filesize=2G iodepth=64 rw=randwrite
write:
IOPS=33320.179688 BW(KiB/s)=4265520
iops: min=17652 max=40996 avg=33119.656250
bw(KiB/s): min=2259456 max=5247488 avg=4239321.500000

Disk stats (read/write):
sda: ios=1454972/1046364 merge=0/22 ticks=1907168/1466570 in_queue=3393654, util=29.229431%

  - OK
```

And with the ceph-block StorageClass (provisioner rbd.csi.ceph.com):

```
./kubestr fio -s ceph-block
PVC created kubestr-fio-pvc-n7m9z
Pod created kubestr-fio-pod-4jnqw
Running FIO test (default-fio) on StorageClass (ceph-block) with a PVC of Size (100Gi)
Elapsed time- 27.566283667s
FIO test results:

FIO version - fio-3.36
Global options - ioengine=libaio verify=0 direct=1 gtod_reduce=1

JobName: read_iops
blocksize=4K filesize=2G iodepth=64 rw=randread
read:
IOPS=242.109741 BW(KiB/s)=983
iops: min=98 max=496 avg=257.322571
bw(KiB/s): min=392 max=1987 avg=1030.129028

JobName: write_iops
blocksize=4K filesize=2G iodepth=64 rw=randwrite
write:
IOPS=224.676819 BW(KiB/s)=914
iops: min=2 max=768 avg=264.464294
bw(KiB/s): min=8 max=3072 avg=1058.357178

JobName: read_bw
blocksize=128K filesize=2G iodepth=64 rw=randread
read:
IOPS=213.964386 BW(KiB/s)=27884
iops: min=90 max=462 avg=223.967743
bw(KiB/s): min=11520 max=59254 avg=28694.708984

JobName: write_bw
blocksize=128k filesize=2G iodepth=64 rw=randwrite
write:
IOPS=219.214661 BW(KiB/s)=28548
iops: min=4 max=704 avg=258.035706
bw(KiB/s): min=512 max=90112 avg=33048.785156

Disk stats (read/write):
rbd2: ios=8696/8655 merge=0/267 ticks=2245425/1975831 in_queue=4221257, util=99.504547%

  - OK
```

The Talos machine has two NICs. One is used only to communicate with the Ceph monitors.

It's working, but I think it's too slow.

@smira
Member

smira commented Nov 19, 2024

Then you need to dig further to understand why and where the bottleneck is. Ceph block storage will certainly be slower than local storage, since it goes over the network, does replication, etc.
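
For example, the RBD pool can be benchmarked directly from one of the Proxmox/Ceph nodes and the result compared with the in-cluster numbers (a minimal sketch; the pool name kubernetes and the image name bench-img are assumptions):

```
# Raw RBD baseline from a Ceph node, outside Kubernetes.
# Pool "kubernetes" and image "bench-img" are assumed names.
rbd create kubernetes/bench-img --size 10G
rbd bench --io-type write --io-size 4K --io-threads 16 --io-total 1G kubernetes/bench-img
rbd bench --io-type read  --io-size 4K --io-threads 16 --io-total 1G kubernetes/bench-img
rbd rm kubernetes/bench-img
```

If the raw numbers are also low, the problem is on the Ceph side; if they look fine, the VM networking or the CSI/krbd path on the Talos node is the more likely suspect.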

You can watch resource utilization to understand where the bottleneck is.
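
A few generic starting points for that (a sketch, not an exhaustive list; the node IP 10.0.0.2 is a placeholder):

```
# Cluster health and per-OSD commit/apply latency, run against the Ceph cluster:
ceph -s
ceph osd perf

# Live CPU/memory/network view of the Talos node while the fio test runs:
talosctl -n 10.0.0.2 dashboard
```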

We are not aware of anything missing from the Talos side, and we do use Ceph a lot ourselves with Talos.
