Boot partition can easily run out of space on upgrade #1247
Comments
related to: coreos/fedora-coreos-docs#410 |
We might want to increase the /boot partition size to 512MB. I vaguely remember that we already talked about that somewhere but I don't remember where. |
A few things here.
I've had a half-baked thought/design for a while that we should split out the Ignition bits from the initramfs into a separate file that lives in e.g. |
The FCOS & RHCOS kernel + initrd currently sit at 97M (94M for FCOS). With a 384M |
Interesting idea. A simpler approach I think is to make it a layered initrd instead in Another half-baked idea I've had is to make libostree use an OSTree repo in |
Entirely removing the firstboot code would make it harder to factory reset or at least a "soft factory reset" that only e.g. reruns ignition files.
Right; if the kernel changes, then the initramfs changes. We can hence only share kernels with different initramfs images. In the case where something in the initramfs changed but not the kernel, in theory at max we could shave (3-1)*kernel = ~24MB which isn't bad, but still 6% of the size of current If we figured out how to move parts of the initramfs to So looking at e.g.:
which is 6.2M gzip'd all on its own. Another thing we ship is
Which gzip's to 2.0M, so even just splitting out the ignition and afterburn binaries into "dynamically loaded initramfs from /usr" gets us 3*8.2=~24.6M of savings in the much more common case where ignition and afterburn versions remain the same across all 3 deployments. Another way to say all this is...the role of the initramfs originally was just to mount the root filesystem. Us running ignition from the initramfs makes sense, but it doesn't mean ignition has to physically live in the initramfs. In the end state our initramfs for example doesn't need to physically contain NetworkManager for example either. Or for that matter, kernel network drivers. |
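The arithmetic above can be sketched as a small calculation. This is only a back-of-the-envelope check using the figures quoted in this thread; the constant and function names are made up for illustration, the two component sizes are assumed to be the gzip'd ignition and afterburn binaries, and filesystem overhead is ignored.

```python
# Figures quoted in the thread, in MiB. All names are illustrative only.
BOOT_SIZE = 384      # current /boot partition size
PAIR_SIZE = 97       # kernel + initrd for one deployment (94 for FCOS)
DEPLOYMENTS = 3      # deployments that can transiently coexist in /boot
COMPONENT_A = 6.2    # first gzip'd binary mentioned (assumed: ignition)
COMPONENT_B = 2.0    # second gzip'd binary mentioned (assumed: afterburn)


def transient_usage(pair_size, deployments=DEPLOYMENTS):
    """Space needed while all deployments' (kernel, initrd) pairs coexist."""
    return pair_size * deployments


def split_savings(*component_sizes, deployments=DEPLOYMENTS):
    """Space saved if the components stop being duplicated per initramfs."""
    return sum(component_sizes) * deployments


used = transient_usage(PAIR_SIZE)
saved = split_savings(COMPONENT_A, COMPONENT_B)
print(f"transient usage: {used}M of {BOOT_SIZE}M ({used / BOOT_SIZE:.0%})")
print(f"savings from splitting out both binaries: ~{saved:.1f}M")
```

The point of the sketch is only that the savings scale with the number of coexisting deployments, which is why a per-initramfs reduction is multiplied by three.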
A tricky thing here though is we would obviously continue to need to ship a single initramfs image that contains ignition for the PXE case, meaning we'd need to have two initramfs images, but ideally the single one is just a concatenation of the split. |
Another thing to investigate is to compress the initrd with zstd. In a quick test:
So that's a 9M * 3 = 27M of savings. Though it's currently only supported in the Fedora kernel. There's also bzip2 and xz, but the decompression lag on boot might not be acceptable.
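For anyone wanting to reproduce this kind of size versus decompression-time comparison, here is a rough sketch. It is not the test that was run above: it only uses compressors from the Python standard library (gzip and xz; zstd is left out because it is not assumed to be available there), and the sample payload is synthetic stand-in data rather than a real initrd. The `compare` helper is hypothetical.

```python
import gzip
import lzma
import os
import time


def compare(payload: bytes):
    """Return {name: (compressed_size, decompress_seconds)} for each codec."""
    codecs = {
        "gzip -9": (lambda d: gzip.compress(d, compresslevel=9), gzip.decompress),
        "xz": (lzma.compress, lzma.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        blob = compress(payload)
        start = time.perf_counter()
        restored = decompress(blob)
        elapsed = time.perf_counter() - start
        if restored != payload:  # round-trip sanity check
            raise ValueError(f"{name} round trip failed")
        results[name] = (len(blob), elapsed)
    return results


# Synthetic stand-in: repetitive text plus some incompressible bytes.
# Substitute the bytes of a real uncompressed initrd cpio for useful numbers.
sample = (b"initramfs " * 50_000) + os.urandom(100_000)
for name, (size, secs) in compare(sample).items():
    print(f"{name:8} {size:>9} bytes  decompress {secs * 1000:.2f} ms")
```

Timings measured this way in userspace are only indicative; what matters for boot is the kernel's own decompressor on the target hardware.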
Hmm, if supporting a soft reset is considered in scope then there's also the fact that the binaries would slowly drift from the base version. Recent discussions about factory reset though have been leaning more towards the harder
I think the idea has a lot of merit. My primary concern would be implementation complexity in an already extremely complex initramfs. |
Previous discussion on the topic also in #855 |
We should also consider xz or a "stronger" compression level for zstd as the decompression time may be negligible considering the size of the initramfs. |
But even if we temporarily fix this issue this way, I think that it's worth giving us a little bit of room by figuring out a way for new installs to use 512M/1G boot partitions. I don't see the size of the kernel or the initrd going down in the future. |
This thread above lists a bunch of stuff we can do to shrink the initramfs. A notable benefit of doing so is booting becomes faster. |
I like those ideas too and we need them to solve the situation for existing installations. However I don't know how much time we will need to make them happen or how complex they will end up to be. Thus if it's reasonably doable to bump the defaults sizes, I think we should consider it too. |
I think we're agreeing here. I would say though that we should split this into two issues:
Conflating them is not good because again I think we must make existing installs work. |
It seems like matching the default from Anaconda would make sense here; it's a huge leap from 384M to 1G, but provides plenty of future proofing. I don't know how that size would play with the non-x86 arches (though a quick look at a RHEL 8.6 |
yes, it is. I think part of the perception (about FCOS at least) is that it has some "minimal" aspects and I think having |
Fair. Maybe it becomes a knob that we can implement, where RHCOS can use a larger size for |
As a counter-argument, Fedora IoT appears to default to 1G for |
This is a proof-of-concept of the idea in coreos/fedora-coreos-tracker#1247 (comment) The role of the initramfs originally was just to mount the root filesystem. Us running ignition from the initramfs makes sense, but it doesn't mean the ignition binary has to physically live in the initramfs. In the end state our initramfs for example doesn't need to physically contain NetworkManager for example either. Or for that matter, kernel network drivers. It just has to have enough code to mount the root filesystem, and neither ignition nor afterburn are needed for that. This clearly adds some nontrivial logic to our already nontrivial initramfs. But, it does shave 9M from each copy of the initramfs, so in the likely case of having (transiently) 3 different versions, we will save 27MB in /boot, which is a good amount.
initrd compression
I ran some tests on my system:
The latter options would also require us to ship (Sadly, it's never that simple.

bootfs size
We wouldn't necessarily have to bump the bootfs all the way to 1 GB. Even if we doubled it, that'd only be 768 MiB. Yes, we'd need to update the Butane template as well, but that should be a small change and there isn't a strong dependency between those steps in either direction.

Moving Ignition out of the initrd
I have serious concerns about the maintainability of our initramfs as is. There are maybe only a couple people who fully understand how it works. The PoC in coreos/fedora-coreos-config#1834 is reasonably small, but I wonder if there will be further consequences for maintenance, such as complicating debugging. That might be worthwhile if it gave us a substantial amount of headroom, but the current PoC saves just 7% of the bootfs, and of course it doesn't reduce our shipping weight at all. We can't further improve this by kicking out NetworkManager and network drivers, since we need those for Tang-bound disk encryption. And I'm not convinced saving 27 MB is worth any increase in initrd complexity. (@cgwalters, your arguments in coreos/cargo-vendor-filterer#22 seem relevant here?) |
Agree that this is two different issues. From the options we have here, it appears that working on zstd compression and increasing the size of For existing installations, the ostree change might be our only reasonably safe option but it would need to be backported to existing systems first before they can update to a version with a bigger kernel. |
I have a rather strong opinion that we must support in-place upgrades for the next (let's say) two years with our current 384M Functional automatic upgrades is a core raison d'être for us - it has to work. The discussion in that issue is really about a situation almost the polar opposite in many ways; cloud storage capacity is...a lot larger. Now I'd restate that...I think ostreedev/ostree#2670 will tide us over for quite a while. It has some drawbacks, but...eh. I also think in the medium term (~year) that we should invest in thinning out the initramfs, and in parallel consider bumping the default size of |
Oh...right. I had not considered this. So to properly handle this we'll need to use the Cincinnati upgrade graph. A major advantage then of shrinking the initramfs is it will Just Work without requiring any changes on existing systems. |
Of the options:
Switching to zstd sounds most promising, but it sucks that it will make `coreos-installer` pick up a non-"pure Rust" dep.
Easiest to see the benefits, but doesn't help existing nodes.
As mentioned by others here, I'm worried here about complexity and would prefer we not do this or at least be VERY careful if we do. I'd prefer not modifying what we ship currently, but rather maybe do some postprocessing on the local node to find identical components in the multiple initramfs images and factor them out into a shared layered initrd. This itself would have risks.
This also worries me, but could probably be done safely in a way that minimizes risk and only if we need it. |
I opened #1465 to handle our long term "increase /boot partition size" goal. For our short term goal (and also the long term goal for updating systems where we won't increase the size of We'll let this ticket track the OSTree autopruning work and will thus be closed out when that lands. |
During the early design of FCOS and RHCOS, we chose a value of 384M for the boot partition. This turned out to be too small: some arches other than x86_64 have larger initrds, kernel binaries, or additional artifacts (like device tree blobs). We'll likely bump the boot partition size in the future, but we don't want to abandon all the nodes deployed with the current size.[[1]] Because stale entries in `/boot` are cleaned up after new entries are written, there is a window in the update process during which the bootfs temporarily must host all the `(kernel, initrd)` pairs for the union of current and new deployments. This patch determines if the bootfs is capable of holding all the pairs. If it can't but it could hold all the pairs from just the new deployments, the outgoing deployments (e.g. rollbacks) are deleted *before* new deployments are written. This is done by updating the bootloader in two steps to maintain atomicity. Since this is a lot of new logic in an important section of the code, this feature is gated for now behind an environment variable (`OSTREE_EXP_AUTO_EARLY_PRUNE`). Once we gain more experience with it, we can consider turning it on by default. This strategy increases the fallibility of the update system since one would no longer be able to rollback to the previous deployment if a bug is present in the bootloader update logic after auto-pruning. This is however mitigated by the fact that the heuristic is opportunistic: the rollback is pruned *only if* it's the only way for the system to update. [1]: coreos/fedora-coreos-tracker#1247 Closes: ostreedev#2670
During the early design of FCOS and RHCOS, we chose a value of 384M for the boot partition. This turned out to be too small: some arches other than x86_64 have larger initrds, kernel binaries, or additional artifacts (like device tree blobs). We'll likely bump the boot partition size in the future, but we don't want to abandon all the nodes deployed with the current size.[[1]] Because stale entries in `/boot` are cleaned up after new entries are written, there is a window in the update process during which the bootfs temporarily must host all the `(kernel, initrd)` pairs for the union of current and new deployments. This patch determines if the bootfs is capable of holding all the pairs. If it can't but it could hold all the pairs from just the new deployments, the outgoing deployments (e.g. rollbacks) are deleted *before* new deployments are written. This is done by updating the bootloader in two steps to maintain atomicity. Since this is a lot of new logic in an important section of the code, this feature is gated for now behind an environment variable (`OSTREE_ENABLE_AUTO_EARLY_PRUNE`). Once we gain more experience with it, we can consider turning it on by default. This strategy increases the fallibility of the update system since one would no longer be able to rollback to the previous deployment if a bug is present in the bootloader update logic after auto-pruning (see [[2]] and following). This is however mitigated by the fact that the heuristic is opportunistic: the rollback is pruned *only if* it's the only way for the system to update. [1]: coreos/fedora-coreos-tracker#1247 [2]: ostreedev#2670 (comment) Closes: ostreedev#2670
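The decision described in this commit message can be sketched roughly as follows. This is a simplified illustration and not the actual OSTree code: the `Plan` enum, the `plan_update` function, and the idea of keying each `(kernel, initrd)` pair by an opaque string are all assumptions made for the example, and the capacity is treated as the total space usable for pairs rather than the currently free space.

```python
from enum import Enum


class Plan(Enum):
    WRITE_THEN_PRUNE = "normal"        # union fits; clean up stale entries afterwards
    PRUNE_THEN_WRITE = "early-prune"   # delete outgoing deployments before writing
    FAIL = "no-space"                  # even the new deployments alone don't fit


def plan_update(bootfs_capacity, current, new):
    """Pick an update strategy.

    `current` and `new` map a hypothetical per-pair key to the size of that
    (kernel, initrd) pair. Pairs shared by both sets are counted once.
    """
    union = {**current, **new}
    if sum(union.values()) <= bootfs_capacity:
        return Plan.WRITE_THEN_PRUNE
    if sum(new.values()) <= bootfs_capacity:
        return Plan.PRUNE_THEN_WRITE
    return Plan.FAIL


# Example: rollback + booted today; booted + pending after the update.
current = {"rollback": 97, "booted": 97}
new = {"booted": 97, "pending": 97}
print(plan_update(384, current, new))  # union is 291, which fits
print(plan_update(250, current, new))  # only the new set (194) fits
print(plan_update(150, current, new))  # nothing fits
```

The opportunistic nature of the heuristic shows up in the ordering of the checks: the rollback is only sacrificed when the union does not fit.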
The PR for OSTree autopruning is ostreedev/ostree#2847 |
The ppc64le arch has been blocked [1] from being fully released because of size limitations in /boot not being able to hold 3 copies of kernel+initramfs because the kernel on ppc64le isn't compressed [2]. Now that OSTree Autopruning [3] has landed let's enable it on ppc64le to unblock ourselves. [1] coreos/fedora-coreos-tracker#987 (comment) [2] coreos/fedora-coreos-tracker#1247 (comment) [3] coreos/fedora-coreos-tracker#1495
While looking at the boot space issue[1], I tested whether stripping binaries saved anything. Currently, dracut tries to strip binaries if `strip` (or `eu-strip`) are available, which isn't the case in FCOS. Adding it as a test, it turns out stripping barely saves anything because we've already split out debug symbols into separate RPMs, and the remaining symbols don't take much space. So let's just tell dracut to stop trying to opportunistically strip anything to be consistent. This then squashes a message emitted about `strip` not being found. [1] coreos/fedora-coreos-tracker#1247
On my system, relative to gzip -9, it roughly halves decompression time at boot while also reducing initrd size. RHCOS 8 can't inherit this because the RHEL 8 kernel doesn't enable CONFIG_RD_ZSTD. This requires coreos-installer to support decoding zstd (for `coreos-installer pxe customize`), which is coreos/coreos-installer#920. For coreos/fedora-coreos-tracker#1247.
In #1247 (comment), we agreed to have this ticket track the auto-pruning work and let #1465 track the longer-term goal of increasing the boot partition size. Auto-pruning has long since landed, so let's close this. |
We have a couple of cases where users hit a
error: Installing kernel: regfile copy: No space left on device
when installing a new kernel. See:
https://discussion.fedoraproject.org/t/installing-custom-kernel-on-fedora-coreos/39138
&
https://bugzilla.redhat.com/show_bug.cgi?id=2104619
There are cases where the user can't easily resize /boot. Should we document/require a larger space for /boot?
Maybe a new troubleshooting entry?
Is there anything we should be doing on rpm-ostree to deal with this?
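As a possible starting point for a troubleshooting entry, here is a minimal sketch of how one might inspect what is taking up space on the boot filesystem. It is not something rpm-ostree provides; the `free_bytes` and `largest_files` helpers are hypothetical, and reading `/boot` may need elevated privileges on a real system.

```python
import os
import shutil


def free_bytes(path="/boot"):
    """Free space, in bytes, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free


def largest_files(root="/boot", limit=5):
    """Return the `limit` largest regular files under `root` as (size, path)."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks so each file is only counted where it really lives.
            if os.path.isfile(full) and not os.path.islink(full):
                sizes.append((os.path.getsize(full), full))
    return sorted(sizes, reverse=True)[:limit]
```

Calling `free_bytes("/boot")` and `largest_files("/boot")` before installing a custom kernel would show whether another kernel and initramfs of similar size can fit.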