Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARTITION_STYLE="msdos" broken on storage with 4k blocks #2

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

dpavlin
Copy link

@dpavlin dpavlin commented Oct 4, 2018

debootstrap on nodes with 4k storage blocks will create invalid partition table which won't boot inside kvm since it expects partition table with 512b blocks.

PARTITION_ALIGNMENT is currently broken and I opted to remove it all together because with 4k blocks it has VERY different behavior if you are using 2048 or 1M as argument. Since this is triggered only when PARTITION_STYLE="msdos" is used, and 1Mb os enough space i decided to hard-code that.

I'm open to suggestions how to fix this better or in cleaner way.

First commit is change which comes from Debian and is required for recent sfdisk versions, I don't know why this repository doesn't have it. It comes from https://sources.debian.org/patches/ganeti-instance-debootstrap/0.16-2.1/

@googlebot
Copy link

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

@dpavlin
Copy link
Author

dpavlin commented Oct 4, 2018

I signed it!

@googlebot
Copy link

So there's good news and bad news.

👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of all the commit author(s), set the cla label to yes (if enabled on your project), and then merge this pull request when appropriate.

@dpavlin
Copy link
Author

dpavlin commented Oct 4, 2018

Oh, it seems that @apoikos first has to submit his part of change because I can't sign licence agreement for him.

@iustin
Copy link
Contributor

iustin commented Oct 4, 2018

@apoikos - good question, do you have any other pending patches in the Debian packages?

@apoikos
Copy link
Member

apoikos commented Oct 4, 2018

@iustin: no, that's the only patch left. I'm fine with including it in this PR.

@apoikos
Copy link
Member

apoikos commented Oct 4, 2018

Of course I mean the only patch left for instance-debootstrap. There are many more for ganeti itself which I'll be upstreaming slowly.

@iustin
Copy link
Contributor

iustin commented Oct 4, 2018

Hmm, I'm not in best position to approve a patch with cla not able to sign it. @apoikos, would you mind much to send your patch separately?

@apoikos
Copy link
Member

apoikos commented Oct 5, 2018

@iustin, submitted #3.

@iustin
Copy link
Contributor

iustin commented Oct 5, 2018

@apoikos, thanks, #3 is merged now. @dpavlin, can you resubmit with just your changes?

@googlebot
Copy link

CLAs look good, thanks!

Since sfdisk doesn't supprot forced chs geometry any more, in situations
where underlaying storage has 4k block, this script will create correct
partition table for host, but not for guest which expects 512b blocks.

This will make machine unbootable, since once kvm starts partition will
be at wrong offset and it won't boot.

Solution is to re-create partition table at end using losetup which has
512b blocks.

Since PARTITION_ALIGNMENT is created using 4k block as base, it will
create filesystem at wrong place if you are using default value of 2048.
This hard-codes host partition table to 1M (which will work on hosts
with both 512b and 4k blocks) and then use 2048 when using 512b blocks.
sfdisk $ARGS --quiet "$1" <<EOF
${PARTITION_ALIGNMENT},,L,*
1M,,L,*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sfdisk man page says:

The default start offset for the first partition  is  1MiB.

What do you think about leaving the value entirely out? And when/if sfdisk defaults change again, this code won't need change too.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If sfdisk changes default, we would need to change at last re-paritition offset (2048), so I would prefer to have exact values here to be on safe side.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand. I'm talking about removing the value, i.e. changing this into:

sfdisk $ARGS --quiet "$1" <<EOF
,,,L,*

Which makes sfdisk enter its value. We wouldn't have any PARTITION_OFFSET value anymore.

@@ -127,6 +127,18 @@ map_disk0() {

unmap_disk0() {
kpartx -d -p- $1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh, why do you reformat the disk in the umap function? This should just unmap any partitions, but not touch the disk.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that it's right place. I don't want to touch disk until kernel has view of partitions via kpartx, so I opted to put it in unmap. I considered pushing it into CLEANUP, but losetup puts losetup -d in CLEANUP in create and using CLEANUP would involve modifying create also, so it seemed to me like cleaner and smaller patch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh, sorry to ask, but do you understand how the sequence of actions happen here? In unmap the OS is already installed. Basically this happens:

  1. format disk
  2. map partitions
  3. install OS
  4. run cleanup, which includes unmap.

Why do you want to repartition in cleanup/unmap?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This re-partition after host has finished with filesystem is the fix for 4k problems.

4k drives need 256 as aligment to create first partition at 1M which turns info offset 2048 with 512b blocks which makes them work in kvm.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote a bit of documentation about it at https://github.com/ffzg/gnt-info/blob/master/doc/deboostrap-4k-msdos.txt

Does this explains it better then my attempts so far?

@@ -161,7 +173,6 @@ fi
: ${VARIANTS_DIR:="@sysconfdir@/ganeti/instance-debootstrap/variants"}
: ${GENERATE_CACHE:="yes"}
: ${CLEAN_CACHE:="14"} # number of days to keep a cache file
: ${PARTITION_ALIGNMENT:=2048} # sectors to align partition (1Mib default)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or you could just change here PARTITION_ALIGNMENT to 1M, and not need any changes - plus it'd remain configurable.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I though about this, but in my setup, I had PARTITION_ALIGMENT set to 2048, so we would also need to somehow update existing configuration or convert number (2048) to 1M (or any value which user specified). This seems to me more fragile, and since nobody noticed that it doesn't work, my assumption is that it's not heavily used, and having only one option to configure instead of two seems like simpler solution. Other than rbd which has 4Mb logical IO, I can't think of other storage which would create optimal layout with hard-coded 1M value, and to be honest, I prefer to have partition at 1M offset even on rbd.

Having said all this, I'm willing to write conversion from number to xM notation and keep PARTITION_ALIGMENT in code if you prefer that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My proposal is to make PARTITION_ALIGNMENT default to (string value) "1m", instead of hardcoding it as 1m in the sfdisk invocation. Why would it need conversion at all?

@iustin
Copy link
Contributor

iustin commented Oct 8, 2018

Replying on the general thread since I guess the individual comments will go away once resolved.

Thanks for the doc you pointed to (https://github.com/ffzg/gnt-info/blob/master/doc/deboostrap-4k-msdos.txt), I understand now what and how you're trying to fix.

If I understand correctly, the root of the problem is that a 4K-sector drive (under Linux) is still exposed as a 512b-sector drive when booted from via KVM. And you plan to do this by repartitioning the first partition to the same absolute byte offset (1M) after it being mounted by losetup since that exposes a 512b sector similar to how kvm expected. In other words, the partition as first created by the initial sfdisk would have 256 sector offset (256*4K=1MB), after redoing over losetup/512b the offset will be 2048 (2048*512b=same 1MB). Correct?

I see a few problems with this approach.

The first one is that this relies on KVM forever using 512b; if at one point it changes and actually picks up the backing device sector size, you'll break existing instances.

Second, you're also relying on losetup matching the same 512b. If it changes behaviour, you'll again break things because it won't necessarily match KVM. You could pass the -b argument to it, but this seems to only be supported in recent kernels… In any case, you should still pass 1MB to the sfdisk-over-losetup invocation, hardcoding 2048 here doesn't make sense.

And last, is that this is forcing everybody to use 512b sectors, even if not using KVM. The fix should at least add a configurable key to say "force 512b-sector partitions".

IMO the proper fix here would be to make KVM pick up the correct block size (see https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg04458.html for example), but if you want to make this purely on the instance-debootstrap side, something along the following lines would be much cleaner:

  • replace 2048 with 1M, or even better, make the value default to empty string, so that sfdisk applies its default offset
  • add a setting to always use losetup, with the appropriate documentation, so instead of partitioning with 4K, formatting, installing, then repartitioning, it simpy wraps everything under losetup so the partitioning is only done once.

What do you think?

@apoikos
Copy link
Member

apoikos commented Oct 10, 2018

I'd like to add a couple of notes:

First of all, this issue affects only setups that use a 4K logical block size. Setups with 4K physical block size and 512-byte logical block size work fine (at least ours does).

The root cause for this issue is that the partition table does not store any information about the sector size; the latter is provided by the disk itself and has to be looked up in order to convert LBA addresses to byte offsets. In our case, the issue arises because we partition the disk on the host (which sees the hardware's 4K logical block size), but go on to use it in the guest (which only sees a 512-byte sector); if we were running debootstrap/partitioning from inside the guest, then no problem would arise — other than the guest trying to do suboptimal 512-byte I/O.

The solution proposed by @dpavlin is to re-partition the device using 512-byte sectors through losetup. This works, but has the following drawbacks:

  • The guest will continue to do suboptimal 512-byte I/O
  • The guest's disk requires special handling to be accessed from inside the host, as kpartx will not work, possibly breaking instance-debootstrap's workflow — for example, instance renames or disk exports will not work as far as I can tell.

Instead of trying to settle this disagreement between guest and host, we could try to make sure that there is no disagreement at all! qemu has supported variable block sizes since 0.13, via the physical_block_size and logical_block_size blockdev options. In theory, all we would need to do is check the backing store's block size — if it's a raw device — using the respective sysfs attributes and add the relevant options to the device in KVM's command line. However, this also has its downsides:

  • It might break existing instances that somehow got a working partition table (e.g. if the OS provider does partitioning from within the instance), so we'd need some way of turning it off
  • I have no idea how this will interact with instance mobility. Imagine for instance a DRBD setup where one side has a 4K backing store and the other one a 512-byte. I'm not sure if DRBD would work in this case, but if it did we might end up using different logical block sizes pre- and post-failover/migration. To really protect against this, we would need to persist the logical block size that was used during instance creation in the instance's config and use it henceforth.

Overall, settling for 512-byte sectors universally is not a bad idea: it's the least common denominator and offers the best compatibility; I guess that's why Qemu does not pass the sector size through by default. To do this however, we must take care to always access the disk using 512-byte sectors, especially from within the host.

@saz
Copy link

saz commented Jul 24, 2024

I'm also running into this issue and things are a bit more complicated.

The fix by @dpavlin works for me, to get instances up and running, but if I need to make any changes on an instance volume, it's broken on hosts with 4k block size:

root@gnt14 /var/log/ganeti/os # kpartx -a /dev/vg0/fc0b9405-bd2a-4f83-8a8e-1c766de2d5a1.disk0
device-mapper: reload ioctl on vg0-fc0b9405--bd2a--4f83--8a8e--1c766de2d5a1.disk0p1 (253:16) failed: Invalid argument
create/reload failed on vg0-fc0b9405--bd2a--4f83--8a8e--1c766de2d5a1.disk0p1

To work around this issue, a loop device can be used:

root@gnt14 /var/log/ganeti/os # losetup --show -f /dev/vg0/fc0b9405-bd2a-4f83-8a8e-1c766de2d5a1.disk0
/dev/loop0
root@gnt14 /var/log/ganeti/os # kpartx -a /dev/loop0
root@gnt14 /var/log/ganeti/os #

Instead of repartition in unmap_disk0, we might just rely on a loop device in all cases, which seems to default to 512 bytes as sector size. This way, we only need to partition the block device once in format_disk0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants