
ZFS: create volumes with more than 8k blocksize #128

Closed
ggzengel opened this issue Apr 23, 2020 · 12 comments

Labels
enhancement New feature or request

Comments

@ggzengel

If you have a ZFS RAID-Z pool with ashift=12 and more than 3 data HDDs (plus parity), the blocksize should be larger than 8k.
Here is a thread that describes the problem:
https://forum.proxmox.com/threads/zfs-replica-2x-larger-than-original.49801/

So please add -o volblocksize= when creating the volume.
If you have x data HDDs plus parity, then
blocksize = 2^floor(log2(x)) * 2^ashift

If you have 16 disks with RAIDZ3 and ashift=12, then x = 16 - 3 = 13,
floor(log2(13)) = 3, and so
blocksize = 2^3 * 2^12 = 32768 bytes = 32k
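
For reference, here is a small shell sketch of that calculation (the disk count, parity level, and ashift are just the example values from above):

ndisks=16; parity=3; ashift=12
x=$(( ndisks - parity ))                                        # data disks
p=0; v=$x
while [ "$v" -gt 1 ]; do v=$(( v / 2 )); p=$(( p + 1 )); done   # p = floor(log2(x))
echo "suggested volblocksize: $(( (1 << (p + ashift)) / 1024 ))k"   # prints 32k for this example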

@rp- added the 'enhancement' (New feature or request) label May 22, 2020
@ggzengel
Author

ggzengel commented Aug 6, 2020

Could you please add something like StorDriver/LvcreateOptions, but for zfs create?

@ghernadi
Contributor

Could you please add something like StorDriver/LvcreateOptions, but for zfs create?

Yes. StorDriver/ZfscreateOptions will be included in the next release.

However, even with this property you will have to manually calculate the desired volblocksize and set the mentioned property accordingly.
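
For example, the calculated value could then be applied at the resource-group level (this uses the property-setter syntax that appears later in this thread; the group name zfs_12 and the value 32k are only illustrative):

linstor resource-group set-property zfs_12 StorDriver/ZfscreateOptions "-o volblocksize=32k"

As it turns out further down in this thread, releases before the fix discussed below only recognize the -b spelling when rounding the volume size.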

@ggzengel
Author

Thanks. This will help a lot.

However, even with this property you will have to manually calculate the desired volblocksize and set the mentioned property accordingly.

I have no problem with the calculation. Can you add it to the documentation for others?

@ggzengel
Author

I did a test with a 32G VHD and got:

zfs create -V 33561640KB -o volblocksize=16k zpool1/proxmox/drbd/vm-107-disk-1_00000
Error message:
cannot create 'zpool1/proxmox/drbd/vm-107-disk-1_00000': volume size must be a multiple of volume block size

Why do you add 33,561,640 KiB − 32 × 2^20 KiB = 7208 KiB?
Could you add 7424 KiB (a multiple of 256 KiB) or 8192 KiB (a multiple of 1024 KiB) instead?

If you add extra bytes, does this mean I can't use these volumes as native Proxmox volumes during disaster recovery?
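
The error itself is just arithmetic: the requested size is not a multiple of the requested volblocksize, as a quick shell check shows:

echo $(( 33561640 % 16 ))   # requested size in KiB modulo the 16 KiB volblocksize
# prints 8, so zfs create rejects the size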

# rg lp zfs_12
┊ StorDriver/ZfscreateOptions ┊ -o volblocksize=16k ┊
ERROR REPORT 5F429B8D-F67D5-000005

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Satellite
Version:                            1.8.0
Build ID:                           e56b6c2a80b6d000921a998e3ba4cd1102fbdd39
Build time:                         2020-08-17T13:02:52+00:00
Error time:                         2020-08-23 22:46:09
Node:                               px1.scr-wi.local

============================================================

Reported error:
===============

Description:
    Failed to create zfsvolume
Additional information:
    Command 'zfs create -V 33561640KB -o volblocksize=16k zpool1/proxmox/drbd/vm-107-disk-1_00000' returned with exitcode 1. 

    Standard out: 


    Error message: 
    cannot create 'zpool1/proxmox/drbd/vm-107-disk-1_00000': volume size must be a multiple of volume block size


Category:                           LinStorException
Class name:                         StorageException
Class canonical name:               com.linbit.linstor.storage.StorageException
Generated at:                       Method 'checkExitCode', Source file 'ExtCmdUtils.java', Line #69

Error message:                      Failed to create zfsvolume

Error context:
    An error occurred while processing resource 'Node: 'px1', Rsc: 'vm-107-disk-1''

Call backtrace:

    Method                                   Native Class:Line number
    checkExitCode                            N      com.linbit.extproc.ExtCmdUtils:69
    genericExecutor                          N      com.linbit.linstor.layer.storage.utils.Commands:104
    genericExecutor                          N      com.linbit.linstor.layer.storage.utils.Commands:64
    genericExecutor                          N      com.linbit.linstor.layer.storage.utils.Commands:52
    create                                   N      com.linbit.linstor.layer.storage.zfs.utils.ZfsCommands:86
    createLvImpl                             N      com.linbit.linstor.layer.storage.zfs.ZfsProvider:208
    createLvImpl                             N      com.linbit.linstor.layer.storage.zfs.ZfsProvider:61
    createVolumes                            N      com.linbit.linstor.layer.storage.AbsStorageProvider:387
    process                                  N      com.linbit.linstor.layer.storage.AbsStorageProvider:299
    process                                  N      com.linbit.linstor.layer.storage.StorageLayer:279
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:763
    processChild                             N      com.linbit.linstor.layer.drbd.DrbdLayer:448
    adjustDrbd                               N      com.linbit.linstor.layer.drbd.DrbdLayer:565
    process                                  N      com.linbit.linstor.layer.drbd.DrbdLayer:383
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:763
    processResourcesAndSnapshots             N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:309
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:145
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:258
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:896
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:618
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:535
    run                                      N      java.lang.Thread:834


END OF ERROR REPORT.
update VM 107: -scsi1 ZFS_DRBD_12:32
TASK ERROR: error during cfs-locked 'storage-ZFS_DRBD_12' operation: API Return-Code: 500. Message: Could not create resource definition vm-107-disk-1 from resource group zfs_12, because: [{"ret_code":20447233,"message":"Successfully set property key(s): StorDriver/ZfscreateOptions","obj_refs":{"RscDfn":"vm-107-disk-1","RscGrp":"zfs_12"}},{"ret_code":19922945,"message":"Volume definition with number '0' successfully  created in resource definition 'vm-107-disk-1'.","obj_refs":{"RscGrp":"zfs_12","RscDfn":"vm-107-disk-1","VlmNr":"0"}},{"ret_code":20447233,"message":"New resource definition 'vm-107-disk-1' created.","details":"Resource definition 'vm-107-disk-1' UUID is: 39e74123-008a-4a15-85e8-a4ab894e94ed","obj_refs":{"RscGrp":"zfs_12","UUID":"39e74123-008a-4a15-85e8-a4ab894e94ed","RscDfn":"vm-107-disk-1"}},{"ret_code":20185089,"message":"Successfully set property key(s): StorPoolName","obj_refs":{"RscDfn":"vm-107-disk-1","RscGrp":"zfs_12"}},{"ret_code":20185089,"message":"Successfully set property key(s): StorPoolName","obj_refs":{"RscDfn":"vm-107-disk-1","RscGrp":"zfs_12"}},{"ret_code":21233665,"message":"Resource 'vm-107-disk-1' successfully autoplaced on 2 nodes","details":"Used nodes (storage pool name): 'px1 (zfs_12)', 'px2 (zfs_12)'","obj_refs":{"RscDfn":"vm-107-disk-1","RscGrp":"zfs_12"}},{"ret_code":-4611686018406153242,"message":"(Node: 'px2') Failed to create zfsvolume","details":"Command 'zfs create -V 33561640KB -o volblocksize=16k zpool1/proxmox/drbd/vm-107-disk-1_00000' returned with exitcode 1. \n\nStandard out: \n\n\nError message: \ncannot create 'zpool1/proxmox/drbd/vm-107-disk-1_00000': volume size must be a multiple of volume block size\n\n","error_report_ids":["5F42A019-520BA-000000"],"obj_refs":{"RscDfn":"vm-107-disk-1","RscGrp":"zfs_12"}},{"ret_code":-4611686018406153242,"message":"(Node: 'px1') Failed to create zfsvolume","details":"Command 'zfs create -V 33561640KB -o volblocksize=16k zpool1/proxmox/drbd/vm-107-disk-1_00000' returned with exitcode 1. \n\nStandard out: \n\n\nError message: \ncannot create 'zpool1/proxmox/drbd/vm-107-disk-1_00000': volume size must be a multiple of volume block size\n\n","error_report_ids":["5F429B8D-F67D5-000000"],"obj_refs":{"RscDfn":"vm-107-disk-1","RscGrp":"zfs_12"}}]  at /usr/share/perl5/PVE/Storage/Custom/LINSTORPlugin.pm line 282. 	
PVE::Storage::Custom::LINSTORPlugin::alloc_image("PVE::Storage::Custom::LINSTORPlugin", "ZFS_DRBD_12", HASH(0x55f256945fd0), 107, "raw", undef, 33554432) called at /usr/share/perl5/PVE/Storage.pm line 824 	eval {...} called at /usr/share/perl5/PVE/Storage.pm line 824 	PVE::Storage::__ANON__() called at /usr/share/perl5/PVE/Cluster.pm line 614 	eval {...} called at /usr/share/perl5/PVE/Cluster.pm line 582 	PVE::Cluster::__ANON__("storage-ZFS_DRBD_12", undef, CODE(0x55f24f98f9c0)) called at /usr/share/perl5/PVE/Cluster.pm line 659 	PVE::Cluster::cfs_lock_storage("ZFS_DRBD_12", undef, CODE(0x55f24f98f9c0)) called at /usr/share/perl5/PVE/Storage/Plugin.pm line 461 	PVE::Storage::Plugin::cluster_lock_storage("PVE::Storage::Custom::LINSTORPlugin", "ZFS_DRBD_12", 1, undef, CODE(0x55f24f98f9c0)) called at /usr/share/perl5/PVE/Storage.pm line 829 	PVE::Storage::vdisk_alloc(HASH(0x55f256939fa0), "ZFS_DRBD_12", 107, "raw", undef, 33554432) called at /usr/share/perl5/PVE/API2/Qemu.pm line 188 	PVE::API2::Qemu::__ANON__("scsi1", HASH(0x55f2568b9d48)) called at /usr/share/perl5/PVE/AbstractConfig.pm line 461 	PVE::AbstractConfig::foreach_volume_full("PVE::QemuConfig", HASH(0x55f256a8c238), undef, CODE(0x55f24f98f6d8)) called at /usr/share/perl5/PVE/AbstractConfig.pm line 470 	PVE::AbstractConfig::foreach_volume("PVE::QemuConfig", HASH(0x55f256a8c238), CODE(0x55f24f98f6d8)) called at /usr/share/perl5/PVE/API2/Qemu.pm line 221 	eval {...} called at /usr/share/perl5/PVE/API2/Qemu.pm line 221 	PVE::API2::Qemu::__ANON__(PVE::RPCEnvironment=HASH(0x55f2568adba0), "root\@pam", HASH(0x55f2568a8878), "x86_64", HASH(0x55f256939fa0), 107, undef, HASH(0x55f256a8c238)) called at /usr/share/perl5/PVE/API2/Qemu.pm line 1269 	PVE::API2::Qemu::__ANON__("UPID:px2:00003985:001F27D9:5F42EF9A:qmconfig:107:root\@pam:") called at /usr/share/perl5/PVE/RESTEnvironment.pm line 610 	eval {...} called at /usr/share/perl5/PVE/RESTEnvironment.pm line 601 	PVE::RESTEnvironment::fork_worker(PVE::RPCEnvironment=HASH(0x55f2568adba0), "qmconfig", 107, "root\@pam", CODE(0x55f256a915e8)) called at /usr/share/perl5/PVE/API2/Qemu.pm line 1319 	PVE::API2::Qemu::__ANON__() called at /usr/share/perl5/PVE/AbstractConfig.pm line 285 	PVE::AbstractConfig::__ANON__() called at /usr/share/perl5/PVE/Tools.pm line 213 	eval {...} called at /usr/share/perl5/PVE/Tools.pm line 213 	PVE::Tools::lock_file_full("/var/lock/qemu-server/lock-107.conf", 10, 0, CODE(0x55f2568ae758)) called at /usr/share/perl5/PVE/AbstractConfig.pm line 288 	PVE::AbstractConfig::__ANON__("PVE::QemuConfig", 107, 10, 0, CODE(0x55f256a8c3b8)) called at /usr/share/perl5/PVE/AbstractConfig.pm line 308 	PVE::AbstractConfig::lock_config_full("PVE::QemuConfig", 107, 10, CODE(0x55f256a8c3b8)) called at /usr/share/perl5/PVE/AbstractConfig.pm line 316 	PVE::AbstractConfig::lock_config("PVE::QemuConfig", 107, CODE(0x55f256a8c3b8)) called at /usr/share/perl5/PVE/API2/Qemu.pm line 1348 	PVE::API2::Qemu::__ANON__(HASH(0x55f2568c0d40)) called at /usr/share/perl5/PVE/RESTHandler.pm line 453 	PVE::RESTHandler::handle("PVE::API2::Qemu", HASH(0x55f254894970), HASH(0x55f2568c0d40)) called at /usr/share/perl5/PVE/HTTPServer.pm line 177 	eval {...} called at /usr/share/perl5/PVE/HTTPServer.pm line 140 	PVE::HTTPServer::rest_handler(PVE::HTTPServer=HASH(0x55f2568adc60), "172.19.36.101", "POST", "/nodes/px2/qemu/107/config", HASH(0x55f2568c0f38), HASH(0x55f25690e190), "extjs") called at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 746 	eval {...} called at /usr/share/perl5/PVE/APIServer/AnyEvent.pm 
line 720 	PVE::APIServer::AnyEvent::handle_api2_request(PVE::HTTPServer=HASH(0x55f2568adc60), HASH(0x55f256a4e660), HASH(0x55f2568c0f38), "POST", "/api2/extjs/nodes/px2/qemu/107/config") called at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 974 	eval {...} called at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 966 	PVE::APIServer::AnyEvent::handle_request(PVE::HTTPServer=HASH(0x55f2568adc60), HASH(0x55f256a4e660), HASH(0x55f2568c0f38), "POST", "/api2/extjs/nodes/px2/qemu/107/config") called at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1373 	PVE::APIServer::AnyEvent::__ANON__(AnyEvent::Handle=HASH(0x55f256945c70), "scsi1=ZFS_DRBD_12%3A32&digest=0801d8a753fb783a9d5ba47413ede61"...) called at /usr/lib/x86_64-linux-gnu/perl5/5.28/AnyEvent/Handle.pm line 1505 	AnyEvent::Handle::__ANON__(AnyEvent::Handle=HASH(0x55f256945c70)) called at /usr/lib/x86_64-linux-gnu/perl5/5.28/AnyEvent/Handle.pm line 1315 	AnyEvent::Handle::_drain_rbuf(AnyEvent::Handle=HASH(0x55f256945c70)) called at /usr/lib/x86_64-linux-gnu/perl5/5.28/AnyEvent/Handle.pm line 2015 	AnyEvent::Handle::__ANON__(EV::IO=SCALAR(0x55f256a8c508), 1) called at /usr/lib/x86_64-linux-gnu/perl5/5.28/AnyEvent/Impl/EV.pm line 88 	eval {...} called at /usr/lib/x86_64-linux-gnu/perl5/5.28/AnyEvent/Impl/EV.pm line 88 	AnyEvent::CondVar::Base::_wait(AnyEvent::CondVar=HASH(0x55f255fd2ee0)) called at /usr/lib/x86_64-linux-gnu/perl5/5.28/AnyEvent.pm line 2026 	AnyEvent::CondVar::Base::recv(AnyEvent::CondVar=HASH(0x55f255fd2ee0)) called at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1660 	PVE::APIServer::AnyEvent::run(PVE::HTTPServer=HASH(0x55f2568adc60)) called at /usr/share/perl5/PVE/Service/pvedaemon.pm line 52 	PVE::Service::pvedaemon::run(PVE::Service::pvedaemon=HASH(0x55f2568a8a28)) called at /usr/share/perl5/PVE/Daemon.pm line 171 	eval {...} called at /usr/share/perl5/PVE/Daemon.pm line 171 	PVE::Daemon::__ANON__(PVE::Service::pvedaemon=HASH(0x55f2568a8a28)) called at /usr/share/perl5/PVE/Daemon.pm line 391 	eval {...} called at /usr/share/perl5/PVE/Daemon.pm line 380 	PVE::Daemon::__ANON__(PVE::Service::pvedaemon=HASH(0x55f2568a8a28), undef) called at /usr/share/perl5/PVE/Daemon.pm line 552 	eval {...} called at /usr/share/perl5/PVE/Daemon.pm line 550 	PVE::Daemon::start(PVE::Service::pvedaemon=HASH(0x55f2568a8a28), undef) called at /usr/share/perl5/PVE/Daemon.pm line 661 	PVE::Daemon::__ANON__(HASH(0x55f24f985fd0)) called at /usr/share/perl5/PVE/RESTHandler.pm line 453 	PVE::RESTHandler::handle("PVE::Service::pvedaemon", HASH(0x55f2568a8d70), HASH(0x55f24f985fd0)) called at /usr/share/perl5/PVE/RESTHandler.pm line 865 	eval {...} called at /usr/share/perl5/PVE/RESTHandler.pm line 848 	PVE::RESTHandler::cli_handler("PVE::Service::pvedaemon", "pvedaemon start", "start", ARRAY(0x55f24fcba5d8), ARRAY(0x55f24f9a6050), undef, undef, undef) called at /usr/share/perl5/PVE/CLIHandler.pm line 591 	PVE::CLIHandler::__ANON__(ARRAY(0x55f24f9861f8), CODE(0x55f24fd03108), undef) called at /usr/share/perl5/PVE/CLIHandler.pm line 668 	PVE::CLIHandler::run_cli_handler("PVE::Service::pvedaemon", "prepare", CODE(0x55f24fd03108)) called at /usr/bin/pvedaemon line 27

@ghernadi
Contributor

For some reason, ZFS does not round the volsize up by itself, so LINSTOR has to do it. For that I had to add a check for whether this new ZfscreateOptions property modifies the volblocksize, but by mistake I only added a check for -b, not for -o volblocksize=.

This will be fixed in the next release. Until then, please use -b 16K.
I'll reopen this ticket and leave it open until the fix is verified.
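
For reference, one way to apply that interim workaround is via the resource-group property setter used further down in this thread (zfs_12 is the resource group from the report above):

linstor resource-group set-property zfs_12 StorDriver/ZfscreateOptions "-b 16K"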

If you add extra bytes, does this mean I can't use these volumes as native Proxmox volumes during disaster recovery?

I am not sure what you mean by this.

@ghernadi reopened this Aug 24, 2020
@ggzengel
Author

If you add extra bytes, does this mean I can't use these volumes as native Proxmox volumes during disaster recovery?

I am not sure what you mean by this.

I meant: if you increase the size, you presumably do it for LINSTOR/DRBD metadata. Where do you put that metadata?
If you put the metadata at the front, I can't use the zvol as a native Proxmox zvol.

If something goes wrong with LINSTOR during an upgrade, or the database becomes defective, or anything else, I can still use the zvol natively with zfs rename and by patching the vm*.conf.
It's like using RAID1 disks on plain SATA controllers, which works because the RAID controller writes its metadata at the end of the disks.

@ghernadi
Contributor

If DRBD is using internal metadata, it writes it at the end of the device, as stated in the docs.

@ggzengel
Author

Thanks for the link.
Another workaround is using external metadata, because Proxmox's volsize is always a multiple of 1 GB.
Does this work with ZFS, or do you always increase the volsize?

I normally use 2 Intel Optane drives as ZIL with underlying LVM. So I could use them as a metadata store, too?

@ghernadi
Contributor

Does this work with ZFS, or do you always increase the volsize?

That should work. In LINSTOR, currently only DRBD with internal metadata and the LUKS layer need additional space for metadata (although LUKS always requires a constant 16 MB, which should be fine with ordinary blocksizes :) )

I normally use 2 Intel Optane drives as ZIL with underlying LVM. So I could use them as a metadata store, too?

Yep, sounds like a good idea.

@ggzengel
Author

ggzengel commented Aug 24, 2020

I have now done this with a workaround from #176 and LINBIT/linstor-client/issues/42:

sp c lvm px1 zfs_12_meta VG1
sp c lvm px2 zfs_12_meta VG1
linstor sp sp px1 zfs_12_meta StorDriver/LvcreateOptions "-m 1 VG1 /dev/nvme0n1 /dev/nvme1n1"
linstor sp sp px2 zfs_12_meta StorDriver/LvcreateOptions "-m 1 VG1 /dev/nvme0n1 /dev/nvme1n1"
rg sp zfs_12 StorPoolNameDrbdMeta zfs_12_meta
rg sp zfs_12 DrbdMetaType external
linstor rg sp zfs_12 StorDriver/ZfscreateOptions "-o volblocksize=16k"
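
Spelled out without the CLI abbreviations (sp = storage-pool, c = create, rg = resource-group, sp after a noun = set-property), those commands would read roughly:

linstor storage-pool create lvm px1 zfs_12_meta VG1
linstor storage-pool create lvm px2 zfs_12_meta VG1
linstor storage-pool set-property px1 zfs_12_meta StorDriver/LvcreateOptions "-m 1 VG1 /dev/nvme0n1 /dev/nvme1n1"
linstor storage-pool set-property px2 zfs_12_meta StorDriver/LvcreateOptions "-m 1 VG1 /dev/nvme0n1 /dev/nvme1n1"
linstor resource-group set-property zfs_12 StorPoolNameDrbdMeta zfs_12_meta
linstor resource-group set-property zfs_12 DrbdMetaType external
linstor resource-group set-property zfs_12 StorDriver/ZfscreateOptions "-o volblocksize=16k"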

Here is the reward:
I saved 50% of the space and doubled the speed, because previously ZFS effectively used only half of the vdev (ZFS is strange about this).

# zfs list zpool1/proxmox/drbd/vm-107-disk-1_00000 zpool1/proxmox/drbd/vm-107-disk-2_00000 -o name,volblocksize,used,volsize,refreservation,usedbyrefreservation
NAME                                     VOLBLOCK   USED  VOLSIZE  REFRESERV  USEDREFRESERV
zpool1/proxmox/drbd/vm-107-disk-1_00000       16K  37.1G      32G      37.1G          37.1G
zpool1/proxmox/drbd/vm-107-disk-2_00000        8K  74.1G      32G      74.1G          74.1G
# pvs -o +lv_name | grep 107
  /dev/nvme0n1 VG1 lvm2 a--  <372.61g <317.59g [vm-107-disk-1.meta_00000_rmeta_0] 
  /dev/nvme0n1 VG1 lvm2 a--  <372.61g <317.59g [vm-107-disk-1.meta_00000_rimage_0]
  /dev/nvme0n1 VG1 lvm2 a--  <372.61g <317.59g [vm-107-disk-2.meta_00000_rmeta_0] 
  /dev/nvme0n1 VG1 lvm2 a--  <372.61g <317.59g [vm-107-disk-2.meta_00000_rimage_0]
  /dev/nvme1n1 VG1 lvm2 a--  <372.61g <317.59g [vm-107-disk-1.meta_00000_rmeta_1] 
  /dev/nvme1n1 VG1 lvm2 a--  <372.61g <317.59g [vm-107-disk-1.meta_00000_rimage_1]
  /dev/nvme1n1 VG1 lvm2 a--  <372.61g <317.59g [vm-107-disk-2.meta_00000_rmeta_1] 
  /dev/nvme1n1 VG1 lvm2 a--  <372.61g <317.59g [vm-107-disk-2.meta_00000_rimage_1]

Could somebody from Proxmox (@Fabian-Gruenbichler?) put this in the Proxmox docs for other people?

@Fabian-Gruenbichler

Could somebody from Proxmox (@Fabian-Gruenbichler?) put this in the Proxmox docs for other people?

We have a (not-yet-updated and thus not-yet-merged) patch for our docs covering the general 'raidz + zvol => high space usage overhead with default settings' issue, which we will include in our reference documentation at some point. I don't think we'll add LINSTOR-specific hints to our documentation, as that integration and plugin is not developed by us.

@rp-
Contributor

rp- commented Jun 2, 2022

The ZFS block size can be specified with the following property setter:
linstor storage-pool set-property <node> <pool> StorDriver/ZfscreateOptions "-b 32k"
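
Once a new volume has been created, the effective block size can be verified with zfs get (the dataset name below is just the example from earlier in this thread):

zfs get volblocksize zpool1/proxmox/drbd/vm-107-disk-1_00000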
