-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARTITION_STYLE="msdos" broken on storage with 4k blocks #2
base: master
Are you sure you want to change the base?
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here (e.g. What to do if you already signed the CLAIndividual signers
Corporate signers
|
I signed it! |
So there's good news and bad news. 👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there. 😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request. Note to project maintainer: This is a terminal state, meaning the |
Oh, it seems that @apoikos first has to submit his part of change because I can't sign licence agreement for him. |
@apoikos - good question, do you have any other pending patches in the Debian packages? |
@iustin: no, that's the only patch left. I'm fine with including it in this PR. |
Of course I mean the only patch left for instance-debootstrap. There are many more for ganeti itself which I'll be upstreaming slowly. |
Hmm, I'm not in best position to approve a patch with cla not able to sign it. @apoikos, would you mind much to send your patch separately? |
13ab06f
to
de37352
Compare
CLAs look good, thanks! |
de37352
to
d228d45
Compare
Since sfdisk doesn't supprot forced chs geometry any more, in situations where underlaying storage has 4k block, this script will create correct partition table for host, but not for guest which expects 512b blocks. This will make machine unbootable, since once kvm starts partition will be at wrong offset and it won't boot. Solution is to re-create partition table at end using losetup which has 512b blocks. Since PARTITION_ALIGNMENT is created using 4k block as base, it will create filesystem at wrong place if you are using default value of 2048. This hard-codes host partition table to 1M (which will work on hosts with both 512b and 4k blocks) and then use 2048 when using 512b blocks.
d228d45
to
ef85e0b
Compare
sfdisk $ARGS --quiet "$1" <<EOF | ||
${PARTITION_ALIGNMENT},,L,* | ||
1M,,L,* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sfdisk man page says:
The default start offset for the first partition is 1MiB.
What do you think about leaving the value entirely out? And when/if sfdisk defaults change again, this code won't need change too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If sfdisk changes default, we would need to change at last re-paritition offset (2048), so I would prefer to have exact values here to be on safe side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand. I'm talking about removing the value, i.e. changing this into:
sfdisk $ARGS --quiet "$1" <<EOF
,,,L,*
Which makes sfdisk enter its value. We wouldn't have any PARTITION_OFFSET value anymore.
@@ -127,6 +127,18 @@ map_disk0() { | |||
|
|||
unmap_disk0() { | |||
kpartx -d -p- $1 | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh, why do you reformat the disk in the umap function? This should just unmap any partitions, but not touch the disk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me that it's right place. I don't want to touch disk until kernel has view of partitions via kpartx, so I opted to put it in unmap. I considered pushing it into CLEANUP, but losetup puts losetup -d in CLEANUP in create and using CLEANUP would involve modifying create also, so it seemed to me like cleaner and smaller patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh, sorry to ask, but do you understand how the sequence of actions happen here? In unmap the OS is already installed. Basically this happens:
- format disk
- map partitions
- install OS
- run cleanup, which includes unmap.
Why do you want to repartition in cleanup/unmap?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This re-partition after host has finished with filesystem is the fix for 4k problems.
4k drives need 256 as aligment to create first partition at 1M which turns info offset 2048 with 512b blocks which makes them work in kvm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote a bit of documentation about it at https://github.com/ffzg/gnt-info/blob/master/doc/deboostrap-4k-msdos.txt
Does this explains it better then my attempts so far?
@@ -161,7 +173,6 @@ fi | |||
: ${VARIANTS_DIR:="@sysconfdir@/ganeti/instance-debootstrap/variants"} | |||
: ${GENERATE_CACHE:="yes"} | |||
: ${CLEAN_CACHE:="14"} # number of days to keep a cache file | |||
: ${PARTITION_ALIGNMENT:=2048} # sectors to align partition (1Mib default) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or you could just change here PARTITION_ALIGNMENT to 1M, and not need any changes - plus it'd remain configurable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I though about this, but in my setup, I had PARTITION_ALIGMENT set to 2048, so we would also need to somehow update existing configuration or convert number (2048) to 1M (or any value which user specified). This seems to me more fragile, and since nobody noticed that it doesn't work, my assumption is that it's not heavily used, and having only one option to configure instead of two seems like simpler solution. Other than rbd which has 4Mb logical IO, I can't think of other storage which would create optimal layout with hard-coded 1M value, and to be honest, I prefer to have partition at 1M offset even on rbd.
Having said all this, I'm willing to write conversion from number to xM notation and keep PARTITION_ALIGMENT in code if you prefer that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My proposal is to make PARTITION_ALIGNMENT default to (string value) "1m", instead of hardcoding it as 1m in the sfdisk invocation. Why would it need conversion at all?
Replying on the general thread since I guess the individual comments will go away once resolved. Thanks for the doc you pointed to (https://github.com/ffzg/gnt-info/blob/master/doc/deboostrap-4k-msdos.txt), I understand now what and how you're trying to fix. If I understand correctly, the root of the problem is that a 4K-sector drive (under Linux) is still exposed as a 512b-sector drive when booted from via KVM. And you plan to do this by repartitioning the first partition to the same absolute byte offset (1M) after it being mounted by losetup since that exposes a 512b sector similar to how kvm expected. In other words, the partition as first created by the initial sfdisk would have 256 sector offset ( I see a few problems with this approach. The first one is that this relies on KVM forever using 512b; if at one point it changes and actually picks up the backing device sector size, you'll break existing instances. Second, you're also relying on losetup matching the same 512b. If it changes behaviour, you'll again break things because it won't necessarily match KVM. You could pass the And last, is that this is forcing everybody to use 512b sectors, even if not using KVM. The fix should at least add a configurable key to say "force 512b-sector partitions". IMO the proper fix here would be to make KVM pick up the correct block size (see https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg04458.html for example), but if you want to make this purely on the instance-debootstrap side, something along the following lines would be much cleaner:
What do you think? |
I'd like to add a couple of notes: First of all, this issue affects only setups that use a 4K logical block size. Setups with 4K physical block size and 512-byte logical block size work fine (at least ours does). The root cause for this issue is that the partition table does not store any information about the sector size; the latter is provided by the disk itself and has to be looked up in order to convert LBA addresses to byte offsets. In our case, the issue arises because we partition the disk on the host (which sees the hardware's 4K logical block size), but go on to use it in the guest (which only sees a 512-byte sector); if we were running debootstrap/partitioning from inside the guest, then no problem would arise — other than the guest trying to do suboptimal 512-byte I/O. The solution proposed by @dpavlin is to re-partition the device using 512-byte sectors through
Instead of trying to settle this disagreement between guest and host, we could try to make sure that there is no disagreement at all! qemu has supported variable block sizes since 0.13, via the
Overall, settling for 512-byte sectors universally is not a bad idea: it's the least common denominator and offers the best compatibility; I guess that's why Qemu does not pass the sector size through by default. To do this however, we must take care to always access the disk using 512-byte sectors, especially from within the host. |
I'm also running into this issue and things are a bit more complicated. The fix by @dpavlin works for me, to get instances up and running, but if I need to make any changes on an instance volume, it's broken on hosts with 4k block size:
To work around this issue, a loop device can be used:
Instead of repartition in |
debootstrap on nodes with 4k storage blocks will create invalid partition table which won't boot inside kvm since it expects partition table with 512b blocks.
PARTITION_ALIGNMENT is currently broken and I opted to remove it all together because with 4k blocks it has VERY different behavior if you are using 2048 or 1M as argument. Since this is triggered only when PARTITION_STYLE="msdos" is used, and 1Mb os enough space i decided to hard-code that.
I'm open to suggestions how to fix this better or in cleaner way.
First commit is change which comes from Debian and is required for recent sfdisk versions, I don't know why this repository doesn't have it. It comes from https://sources.debian.org/patches/ganeti-instance-debootstrap/0.16-2.1/