Boot partition can easily run out of space on upgrade #1247

Closed
jmarrero opened this issue Jul 6, 2022 · 49 comments

@jmarrero
Member

jmarrero commented Jul 6, 2022

We have a couple of cases where users hit "error: Installing kernel: regfile copy: No space left on device" when installing a new kernel.
See:
https://discussion.fedoraproject.org/t/installing-custom-kernel-on-fedora-coreos/39138
&
https://bugzilla.redhat.com/show_bug.cgi?id=2104619

There are cases where the user can't easily resize /boot. Should we document or require a larger /boot?

Maybe a new troubleshooting entry?

Is there anything we should be doing on rpm-ostree to deal with this?

@jmarrero
Member Author

jmarrero commented Jul 6, 2022

related to: coreos/fedora-coreos-docs#410

@travier
Member

travier commented Jul 7, 2022

We might want to increase the /boot partition size to 512MB. I vaguely remember that we already talked about that somewhere but I don't remember where.

@jmarrero
Member Author

jmarrero commented Jul 7, 2022

@cgwalters
Member

A few things here.

  • Bumping the size isn't going to help anyone with an existing provisioned system; only newly provisioned systems will get the change.
  • One thing we can and probably should do on the ostree side is detect this case and proactively remove the rollback deployment at least - that on its own would probably get us out of the immediate problem
  • It may still be the case that we should bump to 512M but I'd like to see at least a little bit of analysis of the size of the kernel/initramfs - did they actually grow, and by how much and what's there? IOW, is the rate of growth high enough that a year from now 512M will seem small and we're in the same boat again? (Related to this, AIUI Anaconda today uses 1G for /boot for this reason)

I've had a half-baked thought/design for a while that we should split out the Ignition bits from the initramfs into a separate file that lives in e.g. /usr/lib/coreos/initramfs-firstboot.img - then on firstboot, we mount the real root, add all that stuff to the initramfs, then continue. (To make this work sanely we'd still need the systemd units in the initramfs, we're just dynamically loading the binaries). A notable benefit of such a change would be that subsequent boots should be faster because we need to decompress a much smaller initramfs.

@jlebon
Member

jlebon commented Jul 7, 2022

The FCOS & RHCOS kernel + initrd currently sit at 97M (94M for FCOS). With a 384M /boot, that's enough for only 3 (kernel, initrd) pairs, and by default libostree uses all three (staged, booted, and rollback). Even just one additional pinned deployment will blow it up. So I think even if we don't project the kernel or initrd growing by much, we should still grow the bootfs.
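
To make that arithmetic concrete, here is a rough sketch that counts how many (kernel, initrd) pairs fit. It assumes every pair takes the full 97M quoted above and ignores filesystem overhead and anything else stored in /boot, so real headroom is smaller. The function name is made up for illustration.

```python
def pairs_that_fit(bootfs_mib: int, pair_mib: int) -> int:
    """Number of whole (kernel, initrd) pairs that fit, ignoring fs overhead."""
    return bootfs_mib // pair_mib


BOOTFS_MIB = 384          # current /boot size
PAIR_MIB = 97             # kernel + initrd, figure quoted above
DEFAULT_DEPLOYMENTS = 3   # staged, booted, rollback

fits = pairs_that_fit(BOOTFS_MIB, PAIR_MIB)
spare = BOOTFS_MIB - DEFAULT_DEPLOYMENTS * PAIR_MIB

print(fits)                              # 3
print(spare)                             # 93 MiB left, less than one more pair
print(fits >= DEFAULT_DEPLOYMENTS + 1)   # False: a pinned deployment does not fit
```

Under the same assumptions, `pairs_that_fit(512, 97)` gives 5 and `pairs_that_fit(1024, 97)` gives 10.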

@jlebon
Member

jlebon commented Jul 7, 2022

I've had a half-baked thought/design for a while that we should split out the Ignition bits from the initramfs into a separate file that lives in e.g. /usr/lib/coreos/initramfs-firstboot.img - then on firstboot, we mount the real root, add all that stuff to the initramfs, then continue. (To make this work sanely we'd still need the systemd units in the initramfs, we're just dynamically loading the binaries). A notable benefit of such a change would be that subsequent boots should be faster because we need to decompress a much smaller initramfs.

Interesting idea. A simpler approach I think is to make it a layered initrd instead in /boot. I bet we could even have it be another var similar to $ignition_firstboot but which expands to e.g. initrd /boot/coreos/firstboot.img in the BLS. (That wouldn't work on s390x though.) And then delete it on firstboot. We should calculate first how much we'd actually shave off this way.

Another half-baked idea I've had is to make libostree use an OSTree repo in /boot too so we save on duplicate kernels and initrds (though in practice it wouldn't help too much since almost every FCOS/RHCOS update has a kernel update, but it'd help e.g. Silverblue and other faster moving systems).

@cgwalters
Member

I bet we could even have it be another var similar to $ignition_firstboot but which expands to e.g. initrd /boot/coreos/firstboot.img in the BLS. (That wouldn't work on s390x though.) And then delete it on firstboot.

Entirely removing the firstboot code would make it harder to factory reset or at least a "soft factory reset" that only e.g. reruns ignition files.

Another half-baked idea I've had is to make libostree use an OSTree repo in /boot too so we save on duplicate kernels and initrds (though in practice it wouldn't help too much since almost every FCOS/RHCOS update has a kernel update, but it'd help e.g. Silverblue and other faster moving systems).

Right; if the kernel changes, then the initramfs changes. We can hence only share kernels across different initramfs images. In the case where something in the initramfs changed but not the kernel, in theory at most we could shave (3-1)*kernel = ~24MB, which isn't bad, but is still only about 6% of the size of the current /boot.

If we figured out how to move parts of the initramfs to /usr, the space savings in /boot multiply by 3, not 2.

So looking at e.g.:

19M	usr/bin/ignition

which is 6.2M gzip'd all on its own.

Another thing we ship is

4.5M	usr/bin/afterburn

which gzips to 2.0M, so even just splitting out the ignition and afterburn binaries into "dynamically loaded initramfs from /usr" gets us 3*8.2=~24.6M of savings in the much more common case where ignition and afterburn versions remain the same across all 3 deployments.
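
The two estimates above can be reproduced with a small sketch. It assumes a kernel of roughly 12MB (implied by "(3-1)*kernel = ~24MB"), the gzip'd binary sizes quoted above, and 3 deployments; the helper names are hypothetical.

```python
DEPLOYMENTS = 3   # staged, booted, rollback
BOOTFS_MB = 384   # current /boot size


def shared_copy_savings(size_mb: float, copies: int = DEPLOYMENTS) -> float:
    """Savings when `copies` identical files in /boot collapse into one shared copy."""
    return (copies - 1) * size_mb


def moved_to_usr_savings(size_mb: float, copies: int = DEPLOYMENTS) -> float:
    """Savings when the content leaves /boot entirely (e.g. lives under /usr)."""
    return copies * size_mb


kernel_mb = 12.0        # assumption: implied by "(3-1)*kernel = ~24MB"
ignition_gz_mb = 6.2    # gzip'd size quoted above
afterburn_gz_mb = 2.0   # gzip'd size quoted above

kernel_saving = shared_copy_savings(kernel_mb)
split_saving = moved_to_usr_savings(ignition_gz_mb + afterburn_gz_mb)

print(f"shared kernel: {kernel_saving:.1f} MB "
      f"({kernel_saving / BOOTFS_MB:.1%} of /boot)")
print(f"ignition+afterburn out of /boot: {split_saving:.1f} MB "
      f"({split_saving / BOOTFS_MB:.1%} of /boot)")
```

The point of the two helpers is the multiplier: deduplicating inside /boot still leaves one copy there, while moving content out of /boot removes all three.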

Another way to say all this is...the role of the initramfs originally was just to mount the root filesystem. Us running ignition from the initramfs makes sense, but it doesn't mean ignition has to physically live in the initramfs.

In the end state our initramfs doesn't need to physically contain NetworkManager either, for example. Or for that matter, kernel network drivers.

@cgwalters
Member

A tricky thing here though is we would obviously continue to need to ship a single initramfs image that contains ignition for the PXE case, meaning we'd need to have two initramfs images, but ideally the single one is just a concatenation of the split.

@jlebon
Member

jlebon commented Jul 7, 2022

Another thing to investigate is to compress the initrd with zstd. In a quick test:

[root@cosa-devsh tmp]# /usr/lib/dracut/skipcpio /boot/ostree/fedora-coreos-3a5adc56c49a5e3392e03cb3475e92d466680e5a0b0533768365d964bc758843/initramfs-5.18.9-200.fc36.x86_64.img > /var/tmp/initrd.img.gz
[root@cosa-devsh tmp]# gunzip -k /var/tmp/initrd.img.gz
[root@cosa-devsh tmp]# zstd -19 -T0 /var/tmp/initrd.img -o /var/tmp/initrd.img.zstd
[root@cosa-devsh tmp]# ls -lh /var/tmp/initrd*
-rw-r--r--. 1 root root 140M Jul  7 20:55 /var/tmp/initrd.img
-rw-r--r--. 1 root root  77M Jul  7 20:55 /var/tmp/initrd.img.gz
-rw-r--r--. 1 root root  68M Jul  7 20:51 /var/tmp/initrd.img.zstd

So that's 9M * 3 = 27M of savings.

Though it's currently only supported in the Fedora kernel. There's also bzip2 and xz, but the decompression lag on boot might not be acceptable.

I bet we could even have it be another var similar to $ignition_firstboot but which expands to e.g. initrd /boot/coreos/firstboot.img in the BLS. (That wouldn't work on s390x though.) And then delete it on firstboot.

Entirely removing the firstboot code would make it harder to factory reset or at least a "soft factory reset" that only e.g. reruns ignition files.

Hmm, if supporting a soft reset is considered in scope then there's also the fact that the binaries would slowly drift from the base version. Recent discussions about factory reset though have been leaning more towards the harder dd kind.

Another way to say all this is...the role of the initramfs originally was just to mount the root filesystem. Us running ignition from the initramfs makes sense, but it doesn't mean ignition has to physically live in the initramfs.

In the end state our initramfs doesn't need to physically contain NetworkManager either, for example. Or for that matter, kernel network drivers.

I think the idea has a lot of merit. My primary concern would be implementation complexity in an already extremely complex initramfs.

@travier
Member

travier commented Jul 8, 2022

Previous discussion on the topic also in #855

@travier
Member

travier commented Jul 8, 2022

Another thing to investigate is to compress the initrd with zstd. In a quick test:

[root@cosa-devsh tmp]# /usr/lib/dracut/skipcpio /boot/ostree/fedora-coreos-3a5adc56c49a5e3392e03cb3475e92d466680e5a0b0533768365d964bc758843/initramfs-5.18.9-200.fc36.x86_64.img > /var/tmp/initrd.img.gz
[root@cosa-devsh tmp]# gunzip -k /var/tmp/initrd.img.gz
[root@cosa-devsh tmp]# zstd -19 -T0 /var/tmp/initrd.img -o /var/tmp/initrd.img.zstd
[root@cosa-devsh tmp]# ls -lh /var/tmp/initrd*
-rw-r--r--. 1 root root 140M Jul  7 20:55 /var/tmp/initrd.img
-rw-r--r--. 1 root root  77M Jul  7 20:55 /var/tmp/initrd.img.gz
-rw-r--r--. 1 root root  68M Jul  7 20:51 /var/tmp/initrd.img.zstd

So that's 9M * 3 = 27M of savings.

Though it's currently only supported in the Fedora kernel. There's also bzip2 and xz, but the decompression lag on boot might not be acceptable.

We should also consider xz or a "stronger" compression level for zstd as the decompression time may be negligible considering the size of the initramfs.

@travier
Member

travier commented Jul 8, 2022

But even if we temporarily fix this issue this way, I think that it's worth giving us a little bit of room by figuring out a way for new installs to use 512M/1G boot partitions. I don't see the size of the kernel or the initrd going down in the future.

@cgwalters
Member

I don't see the size of the kernel or the initrd going down in the future.

This thread above lists a bunch of stuff we can do to shrink the initramfs. A notable benefit of doing so is booting becomes faster.

@travier
Member

travier commented Jul 8, 2022

I like those ideas too and we need them to solve the situation for existing installations. However I don't know how much time we will need to make them happen or how complex they will end up being. Thus if it's reasonably doable to bump the default sizes, I think we should consider it too.

@cgwalters
Member

I think we're agreeing here.

I would say though that we should split this into two issues:

Conflating them is not good because again I think we must make existing installs work.

@miabbott
Member

miabbott commented Jul 8, 2022

Evaluating bumping the size for the future

It seems like matching the default from Anaconda would make sense here; it's a huge leap from 384M to 1G, but provides plenty of future proofing.

I don't know how that size would play with the non-x86 arches (though a quick look at a RHEL 8.6 ppc64le system shows it is using 1G for /boot)

@dustymabe
Member

it's a huge leap from 384M to 1G

yes, it is.

I think part of the perception (about FCOS at least) is that it has some "minimal" aspects and I think having /boot/ take up 1G is going to be a bit against that.

@miabbott
Member

miabbott commented Jul 8, 2022

I think part of the perception (about FCOS at least) is that it has some "minimal" aspects and I think having /boot/ take up 1G is going to be a bit against that.

Fair. Maybe it becomes a knob that we can implement, where RHCOS can use a larger size for /boot?

@miabbott
Member

miabbott commented Jul 8, 2022

I think part of the perception (about FCOS at least) is that it has some "minimal" aspects and I think having /boot/ take up 1G is going to be a bit against that.

As a counter-argument, Fedora IoT appears to default to 1G for /boot as well

cgwalters added a commit to cgwalters/fedora-coreos-config that referenced this issue Jul 8, 2022
This is a proof-of-concept of the idea in
coreos/fedora-coreos-tracker#1247 (comment)

The role of the initramfs originally was just to mount the root filesystem.
Us running ignition from the initramfs makes sense, but it doesn't mean
the ignition binary has to physically live in the initramfs.

In the end state our initramfs for example doesn't need to physically
contain NetworkManager for example either. Or for that matter, kernel network drivers.

It just has to have enough code to mount the root filesystem, and
neither ignition nor afterburn are needed for that.

This clearly adds some nontrivial logic to our already nontrivial
initramfs.  But, it does shave 9M from each copy of the initramfs,
so in the likely case of having (transiently) 3 different versions,
we will save 27MB in /boot, which is a good amount.
@cgwalters
Member

PoC in coreos/fedora-coreos-config#1834

@bgilbert
Contributor

bgilbert commented Jul 9, 2022

initrd compression

I ran some tests on my system:

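| Algorithm  | initrd size (MiB) | Approx. decompression time at boot (ms) |
|------------|-------------------|-----------------------------------------|
| gzip (-9)  | 104               | 850                                     |
| xz (-6)    | 88                | 3250                                    |
| zstd (-15) | 94                | 330                                     |
| zstd (-19) | 88                | 400                                     |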

The latter options would also require us to ship zstd, which is 1.5 MiB. But overall it seems advisable to switch to zstd on FCOS and RHCOS 9.
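
As a rough way to weigh space against boot time, the sketch below compares each row of the measurements above against the gzip baseline. It assumes 3 copies of the initrd in /boot (staged, booted, rollback) and does not account for the extra 1.5 MiB of shipping zstd; the type and function names are made up for illustration.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    size_mib: int        # compressed initrd size
    decompress_ms: int   # approx. decompression time at boot


# Figures copied from the measurements above.
RESULTS = [
    Result("gzip -9", 104, 850),
    Result("xz -6", 88, 3250),
    Result("zstd -15", 94, 330),
    Result("zstd -19", 88, 400),
]

COPIES = 3  # assumption: staged, booted, rollback


def compare(results, baseline_name="gzip -9", copies=COPIES):
    """Per algorithm: (name, MiB saved in /boot across all copies, ms change at boot)."""
    base = next(r for r in results if r.name == baseline_name)
    return [
        (r.name,
         (base.size_mib - r.size_mib) * copies,
         r.decompress_ms - base.decompress_ms)
        for r in results
    ]


for name, saved_mib, delta_ms in compare(RESULTS):
    print(f"{name:9} saves {saved_mib:3d} MiB, boot decompression {delta_ms:+5d} ms")
```

With these numbers, xz and zstd -19 save the same space, but only zstd also decompresses faster than gzip.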

(Sadly, it's never that simple. coreos-installer pxe customize assumes it can parse the entire initrd, so it'd need zstd support. There's no pure Rust zstd implementation, but the zstd crate can optionally statically link with its own bundled copy of libzstd. In RHCOS we bundle Rust dependencies but not C ones, so the coreos-installer binary on mirror.openshift.com might need to gain a libzstd dependency.)

bootfs size

We wouldn't necessarily have to bump the bootfs all the way to 1 GB. Even if we doubled it, that'd only be 768 MiB. Yes, we'd need to update the Butane template as well, but that should be a small change and there isn't a strong dependency between those steps in either direction.

Moving Ignition out of the initrd

I have serious concerns about the maintainability of our initramfs as is. There are maybe only a couple people who fully understand how it works. The PoC in coreos/fedora-coreos-config#1834 is reasonably small, but I wonder if there will be further consequences for maintenance, such as complicating debugging. That might be worthwhile if it gave us a substantial amount of headroom, but the current PoC saves just 7% of the bootfs, and of course it doesn't reduce our shipping weight at all. We can't further improve this by kicking out NetworkManager and network drivers, since we need those for Tang-bound disk encryption. And I'm not convinced saving 27 MB is worth any increase in initrd complexity. (@cgwalters, your arguments in coreos/cargo-vendor-filterer#22 seem relevant here?)

@travier
Member

travier commented Jul 11, 2022

Agree that this is two different issues. From the options we have here, it appears that working on zstd compression and increasing the size of /boot (optionally for RHCOS only) are the lowest risk, lowest effort right now and would not be "just a temporary hack".

For existing installations, the ostree change might be our only reasonably safe option but it would need to be backported to existing systems first before they can update to a version with a bigger kernel.

@cgwalters
Member

cgwalters commented Jul 11, 2022

And I'm not convinced saving 27 MB is worth any increase in initrd complexity. (@cgwalters, your arguments in coreos/cargo-vendor-filterer#22 seem relevant here?)

I have a rather strong opinion that we must support in-place upgrades for the next (let's say) two years with our current 384M /boot.

Functional automatic upgrades are a core raison d'être for us - they have to work.

The discussion in that issue is really about a situation almost the polar opposite in many ways; cloud storage capacity is...a lot larger.

Now I'd restate that...I think ostreedev/ostree#2670 will tide us over for quite a while. It has some drawbacks, but...eh. I also think in the medium term (~year) that we should invest in thinning out the initramfs, and in parallel consider bumping the default size of /boot.

@cgwalters
Member

cgwalters commented Jul 11, 2022

For existing installations, the ostree change might be our only reasonably safe option but it would need to be backported to existing systems first before they can update to a version with a bigger kernel.

Oh...right. I had not considered this. So to properly handle this we'll need to use the Cincinnati upgrade graph.

A major advantage then of shrinking the initramfs is it will Just Work without requiring any changes on existing systems.

@dustymabe
Member

Of the options:

  • initrd compression

Switching to zstd sounds most promising, but it sucks that it will make coreos-installer pick up a non-"pure Rust" dep.

  • bootfs size

easiest to see the benefits, but doesn't help existing nodes.

  • moving Ignition out of the initrd

As mentioned by others here, I'm worried here about complexity and would prefer we not do this or at least be VERY careful if we do. I'd prefer not modifying what we ship currently, but rather maybe do some postprocessing on the local node to find identical components in the multiple initramfs images and factor them out into a shared layered initrd. This itself would have risks.

  • modifying ostree to aggressively delete rollback deployment

This also worries me, but could probably be done safely in a way that minimizes risk and only if we need it.

@dustymabe
Member

I opened #1465 to handle our long term "increase /boot partition size" goal.

For our short term goal (and also the long term goal for updating systems where we won't increase the size of /boot) we'll rely on OSTree autopruning.

We'll let this ticket track the OSTree autopruning work, and it will thus be closed out when that lands.

@dustymabe dustymabe removed the meeting topics for meetings label Apr 12, 2023
jlebon added a commit to jlebon/ostree that referenced this issue Apr 13, 2023
During the early design of FCOS and RHCOS, we chose a value of 384M
for the boot partition. This turned out to be too small: some arches
other than x86_64 have larger initrds, kernel binaries, or additional
artifacts (like device tree blobs). We'll likely bump the boot partition
size in the future, but we don't want to abandon all the nodes deployed
with the current size.[[1]]

Because stale entries in `/boot` are cleaned up after new entries are
written, there is a window in the update process during which the bootfs
temporarily must host all the `(kernel, initrd)` pairs for the union of
current and new deployments.

This patch determines if the bootfs is capable of holding all the
pairs. If it can't but it could hold all the pairs from just the new
deployments, the outgoing deployments (e.g. rollbacks) are deleted
*before* new deployments are written. This is done by updating the
bootloader in two steps to maintain atomicity.

Since this is a lot of new logic in an important section of the
code, this feature is gated for now behind an environment variable
(`OSTREE_EXP_AUTO_EARLY_PRUNE`). Once we gain more experience with it,
we can consider turning it on by default.

This strategy increases the fallibility of the update system since one
would no longer be able to rollback to the previous deployment if a bug
is present in the bootloader update logic after auto-pruning. This is
however mitigated by the fact that the heuristic is opportunistic: the
rollback is pruned *only if* it's the only way for the system to update.

[1]: coreos/fedora-coreos-tracker#1247

Closes: ostreedev#2670
jlebon added a commit to jlebon/ostree that referenced this issue Apr 14, 2023
During the early design of FCOS and RHCOS, we chose a value of 384M
for the boot partition. This turned out to be too small: some arches
other than x86_64 have larger initrds, kernel binaries, or additional
artifacts (like device tree blobs). We'll likely bump the boot partition
size in the future, but we don't want to abandon all the nodes deployed
with the current size.[[1]]

Because stale entries in `/boot` are cleaned up after new entries are
written, there is a window in the update process during which the bootfs
temporarily must host all the `(kernel, initrd)` pairs for the union of
current and new deployments.

This patch determines if the bootfs is capable of holding all the
pairs. If it can't but it could hold all the pairs from just the new
deployments, the outgoing deployments (e.g. rollbacks) are deleted
*before* new deployments are written. This is done by updating the
bootloader in two steps to maintain atomicity.

Since this is a lot of new logic in an important section of the
code, this feature is gated for now behind an environment variable
(`OSTREE_ENABLE_AUTO_EARLY_PRUNE`). Once we gain more experience with
it, we can consider turning it on by default.

This strategy increases the fallibility of the update system since one
would no longer be able to rollback to the previous deployment if a bug
is present in the bootloader update logic after auto-pruning (see [[2]]
and following). This is however mitigated by the fact that the heuristic
is opportunistic: the rollback is pruned *only if* it's the only way for
the system to update.

[1]: coreos/fedora-coreos-tracker#1247
[2]: ostreedev#2670 (comment)

Closes: ostreedev#2670
@dustymabe
Member

The PR for OSTree autopruning is ostreedev/ostree#2847
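
The decision described in the commit message above can be sketched roughly as follows. This is only an illustration of the heuristic as worded there, not the actual libostree code: the names are hypothetical, sizes are plain numbers in MiB, and the two-step bootloader update that keeps things atomic is not modeled.

```python
from enum import Enum


class Plan(Enum):
    NORMAL = "write new entries, then clean up stale ones"
    EARLY_PRUNE = "delete outgoing deployments first, then write new entries"
    FAIL = "not enough space even after pruning"


def plan_update(bootfs_capacity, current_pairs, new_pairs):
    """Decide how to update /boot.

    current_pairs and new_pairs map a (kernel, initrd) pair id to its size.
    A pair present in both sets is only stored once.
    """
    union = {**current_pairs, **new_pairs}
    if sum(union.values()) <= bootfs_capacity:
        return Plan.NORMAL        # everything fits during the update window
    if sum(new_pairs.values()) <= bootfs_capacity:
        return Plan.EARLY_PRUNE   # fits only once outgoing pairs are gone
    return Plan.FAIL


# 97 MiB pairs fit side by side in 384 MiB, so no early pruning is needed.
print(plan_update(384, {"rollback": 97, "booted": 97},
                  {"booted": 97, "new": 97}))

# Hypothetical larger pairs: the union (390) does not fit, the new set (260) does.
print(plan_update(384, {"rollback": 130, "booted": 130},
                  {"booted": 130, "new": 130}))
```

Because the early prune only triggers when the union does not fit, the rollback is kept whenever the update can succeed without removing it.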

dustymabe added a commit to dustymabe/fedora-coreos-config that referenced this issue May 25, 2023
The ppc64le arch has been blocked [1] from being fully released because
of size limitations in /boot not being able to hold 3 copies of
kernel+initramfs because the kernel on ppc64le isn't compressed [2].
Now that OSTree Autopruning [3] has landed let's enable it on ppc64le
to unblock ourselves.

[1] coreos/fedora-coreos-tracker#987 (comment)
[2] coreos/fedora-coreos-tracker#1247 (comment)
[3] coreos/fedora-coreos-tracker#1495
dustymabe added a commit to dustymabe/fedora-coreos-config that referenced this issue May 31, 2023
The ppc64le arch has been blocked [1] from being fully released because
of size limitations in /boot not being able to hold 3 copies of
kernel+initramfs because the kernel on ppc64le isn't compressed [2].
Now that OSTree Autopruning [3] has landed let's enable it on ppc64le
to unblock ourselves.

[1] coreos/fedora-coreos-tracker#987 (comment)
[2] coreos/fedora-coreos-tracker#1247 (comment)
[3] coreos/fedora-coreos-tracker#1495
dustymabe added a commit to coreos/fedora-coreos-config that referenced this issue May 31, 2023
The ppc64le arch has been blocked [1] from being fully released because
of size limitations in /boot not being able to hold 3 copies of
kernel+initramfs because the kernel on ppc64le isn't compressed [2].
Now that OSTree Autopruning [3] has landed let's enable it on ppc64le
to unblock ourselves.

[1] coreos/fedora-coreos-tracker#987 (comment)
[2] coreos/fedora-coreos-tracker#1247 (comment)
[3] coreos/fedora-coreos-tracker#1495
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
While looking at the boot space issue[1], I tested whether stripping
binaries saved anything. Currently, dracut tries to strip binaries if
`strip` (or `eu-strip`) is available, which isn't the case in FCOS.

When I added it as a test, it turned out stripping barely saves anything
because we've already split out debug symbols into separate RPMs, and
the remaining symbols don't take much space.

So let's just tell dracut to stop trying to opportunistically strip
anything to be consistent. This then squashes a message emitted about
`strip` not being found.

[1] coreos/fedora-coreos-tracker#1247
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
On my system, relative to gzip -9, zstd roughly halves decompression time at
boot while also reducing initrd size.

RHCOS 8 can't inherit this because the RHEL 8 kernel doesn't enable
CONFIG_RD_ZSTD.

This requires coreos-installer to support decoding zstd (for
`coreos-installer pxe customize`), which is
coreos/coreos-installer#920.

For coreos/fedora-coreos-tracker#1247.
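
Since tools that unpack the initrd need to know which decoder to use, a small sketch of format detection by magic bytes may help illustrate the gzip-to-zstd switch. The magic values are the standard ones for each format; the function names are hypothetical, and this is not how coreos-installer is implemented. Note that a real initrd can start with an uncompressed cpio segment (for example early microcode) ahead of the compressed one, which this sketch reports as plain cpio.

```python
# Leading magic bytes for formats commonly used for an initrd.
MAGICS = [
    (b"\x1f\x8b", "gzip"),
    (b"\x28\xb5\x2f\xfd", "zstd"),
    (b"\xfd7zXZ\x00", "xz"),
    (b"070701", "cpio (uncompressed newc)"),
]


def detect_compression(header: bytes) -> str:
    """Return the format name matching the start of header, or 'unknown'."""
    for magic, name in MAGICS:
        if header.startswith(magic):
            return name
    return "unknown"


def detect_file(path) -> str:
    """Read the first few bytes of a file and classify them."""
    with open(path, "rb") as f:
        return detect_compression(f.read(8))
```

For example, `detect_file` could be pointed at an initramfs image under `/boot` to confirm which compressor a given build used.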
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
The ppc64le arch has been blocked [1] from being fully released because
of size limitations in /boot not being able to hold 3 copies of
kernel+initramfs because the kernel on ppc64le isn't compressed [2].
Now that OSTree Autopruning [3] has landed let's enable it on ppc64le
to unblock ourselves.

[1] coreos/fedora-coreos-tracker#987 (comment)
[2] coreos/fedora-coreos-tracker#1247 (comment)
[3] coreos/fedora-coreos-tracker#1495
@jlebon
Member

jlebon commented Jul 4, 2024

In #1247 (comment), we agreed to have this ticket track the auto-pruning work and let #1465 track the longer-term work of increasing the boot partition size. Auto-pruning landed long ago, so let's close this.

@jlebon jlebon closed this as completed Jul 4, 2024