
upgrading to runc 1.1.6 / 1.1.7 breaks #3223

Closed
BenTheElder opened this issue May 12, 2023 · 24 comments · Fixed by #3256
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@BenTheElder
Member

See: #3220, https://kubernetes.slack.com/archives/CEKK1KTN2/p1683851267796889

#3221 and #3222 have test results.

The failure mode is like:

{ failed [FAILED] failed to run command '/agnhost dns-suffix' on pod, stdout: , stderr: , err: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "69cc58960cac2f68567eec2a84011360392a649b12c91f955d6ddfeb6e34b80a": OCI runtime exec failed: exec failed: unable to start container process: error adding pid 78936 to cgroups: failed to write 78936: openat2 /sys/fs/cgroup/unified/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-besteffort.slice/kubelet-kubepods-besteffort-pod60faed4d_72a9_453e_be90_24e96a46d7e5.slice/cri-containerd-d9b097ed59d27ead25448f023222ac459cc26d37682af62785f486008271b361.scope/cgroup.procs: no such file or directory: unknown: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "69cc58960cac2f68567eec2a84011360392a649b12c91f955d6ddfeb6e34b80a": OCI runtime exec failed: exec failed: unable to start container process: error adding pid 78936 to cgroups: failed to write 78936: openat2 /sys/fs/cgroup/unified/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-besteffort.slice/kubelet-kubepods-besteffort-pod60faed4d_72a9_453e_be90_24e96a46d7e5.slice/cri-containerd-d9b097ed59d27ead25448f023222ac459cc26d37682af62785f486008271b361.scope/cgroup.procs: no such file or directory: unknown

This is limited to cgroups v1. It happens in our CI environment, which is made worse by the nesting: host node containerd => dockerd (in CI cluster pod) => kind (running against that nested dockerd).

@BenTheElder BenTheElder added the kind/bug Categorizes issue or PR as related to a bug. label May 12, 2023
@BenTheElder
Member Author

xref: kubernetes/k8s.io#5276 (we'll need to make sure we continue to cover v1 elsewhere for some time)

@BenTheElder
Member Author

BenTheElder commented May 15, 2023

I ran an image built for k8s 1.27.1 with #3221, but on a GKE 1.26 / cgroupv2 node pool, and have seen no issues so far while checking into kubernetes/k8s.io#5276.

Eventually a lot of these headaches will go away when v1 goes away, but not just yet; probably another 1-2 years.

@BenTheElder BenTheElder self-assigned this May 15, 2023
@BenTheElder
Member Author

In the CI nested environment we have:

runc version 1.1.5
commit: v1.1.5-0-gf19387a
spec: 1.0.2-dev
go: go1.19.7
libseccomp: 2.5.1

with docker 23.0.4

However there is also the host CI node level containerd/runc. I don't think I have direct access to the k8s infra CI nodes so it's not quite as easy to confirm the versions there.

@BenTheElder BenTheElder added the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label May 19, 2023
@BenTheElder
Member Author

With #3221 rebased atop #3241 / #3240, runc 1.1.6 is working in Prow CI.

That leaves k8s < 1.24 to consider. Looking into options.

@BenTheElder
Member Author

I think we can just push older k8s versions with a base image from make -C images/base push EXTRA_BUILD_OPT=--build-arg=RUNC_VERSION=v1.1.5 TAG_SUFFIX=_runc-v1.1.5 until we phase them out, and add a release note about this.
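For reference, a rough sketch of how those images would be produced and consumed; the base image tag below is a placeholder, not the actual pushed tag:

# Push a base image variant pinned to runc v1.1.5 (command from the comment above):
make -C images/base push EXTRA_BUILD_OPT=--build-arg=RUNC_VERSION=v1.1.5 TAG_SUFFIX=_runc-v1.1.5

# Then build node images for the older Kubernetes versions against that
# suffixed base (hypothetical tag, substitute the real one):
kind build node-image --base-image kindest/base:<base-tag>_runc-v1.1.5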

@BenTheElder
Member Author

BenTheElder commented May 23, 2023

There are still some less frequent issues with the misc controller in Kubernetes CI. ref: #3250 (comment)

The host runtime is not aware of misc, and probably won't be for a while.

@BenTheElder
Member Author

The trick from @kolyshkin to unmount the misc controller doesn't appear to work even if we add some logic to consider misc unsupported when on cgroupv1 + kubernetes without the kubelet runc update.

Tentatively, systemd discovers that misc is available and enabled on the host kernel via /proc/cgroups and mounts it back after we've removed it.

We currently have a bug where we'd mount it back as well, but even after fixing that and confirming it's not mounted before exec'ing systemd, I see it mounted again later, after the node container comes up.

After inspecting systemd's logic for this, I'm considering bind mounting a modified /proc/cgroups to pretend misc isn't available on cgroups v1 hosts 🙈 We could unmount that shortly after systemd comes up / bootstraps cgroups.
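A minimal sketch of that idea, assuming we filter /proc/cgroups before handing off to systemd (not the actual kind entrypoint code):

# Present a /proc/cgroups without the misc line so systemd doesn't think the
# controller is available on cgroup v1 hosts (sketch only):
grep -v '^misc' /proc/cgroups > /run/kind-fake-proc-cgroups
mount --bind /run/kind-fake-proc-cgroups /proc/cgroups
# ... later, once systemd has come up and bootstrapped cgroups, restore it:
umount /proc/cgroups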

Someday we will only need to support hosts with cgroups v2 and we can phase out most of the nonsense kind employs currently. At least we're always using cgroupns starting with the next release (#3241).

@BenTheElder
Member Author

We should also get around to fixing the horrible dind setup that the main Kubernetes CI is running (which is itself Kubernetes pods), but similarly we're considering whether we can switch that to cgroups v2 first (kubernetes/k8s.io#5276) and just test v1 in GitHub Actions without dind for the remaining users that haven't switched yet.

@medyagh

medyagh commented May 25, 2023

It seems like minikube is already using runc 1.1.7, and we have not faced any issues yet (or not discovered them yet).
I am curious: does Docker Desktop on macOS use cgroup v2?

@BenTheElder do you know a specific OS we could try on to see if it would fail for minikube? The oldest Ubuntu on the free GitHub Actions runners is Ubuntu 20.04 (and minikube GitHub Actions tests run on that), which seems to be cgroup v1.

@BenTheElder
Member Author

I am curious does the docker desktop on MacOs use cgroup v2 ?

Yes.

do you know a specific OS we could try on to see if it would fail for minikube ? the oldest ubuntu on free github action is ubuntu 20.04

You won't see cluster bring-up fail, at least with kind, but once pods have been running for a while things will start to fail (e.g. when running e2e tests, container execs will break).

I'm currently developing with a GCE VM on ubuntu 23.04 but doing:

# manually add to linux cmd: `systemd.unified_cgroup_hierarchy=0`
sudo nano /etc/default/grub
# reboot with this config
sudo update-grub
sudo reboot

This ensures a new enough kernel to have the misc controller while putting the distro back on v1.
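To sanity-check the environment after rebooting, you can confirm the kernel exposes the misc controller at all (the columns are subsys_name / hierarchy / num_cgroups / enabled):

# If the kernel supports the misc cgroup controller it will be listed here:
grep misc /proc/cgroups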

I was planning to follow up with minikube when we had a solution; there have been some other recent patches for always using cgroupns=private, but they're not quite fully baked yet.

For now I'd recommend moving back to 1.1.5; the bug fixes since 1.1.5 are mostly pretty minor currently.

@BenTheElder
Member Author

BenTheElder commented May 25, 2023

If you use Kubernetes without the recent patches updating to runc 1.1.6 (only available in 1.24+ on the latest patch versions), the problems are worse. opencontainers/runc#3849

@BenTheElder
Member Author

Looks like the bind mount hack will do it as expected; I need to do some more testing and cleanup.

(image: HBO Chernobyl "not great, not terrible" meme)

@BenTheElder
Member Author

#3255 should resolve this.

The change itself is a bit messy, so I've outlined the key approach and the core necessary parts in the PR body / comments.

@medyagh

medyagh commented May 25, 2023

An update on the minikube side: we could not reproduce this bug for minikube, even though we have the latest runc version.
We tried it on Ubuntu 20.04 though:

# manually add to linux cmd: `systemd.unified_cgroup_hierarchy=0`
sudo nano /etc/default/grub
# reboot with this config
sudo update-grub
sudo reboot

However, it's worth noting that after doing the above, the mount grep was still showing cgroup v2, so maybe we failed to switch it to cgroup v1.

@BenTheElder
Member Author

To reproduce you also need a new enough kernel to have the misc controller (~5.15, depending on distro patches).

I used 23.04. Also make sure to set systemd.unified_cgroup_hierarchy=0 under GRUB_CMDLINE_LINUX= and save.

bentheelder@cgroups:~$ cat /etc/default/grub
# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
#   info -f grub -n 'Simple configuration'

GRUB_DEFAULT=0
GRUB_TIMEOUT_STYLE=hidden
GRUB_TIMEOUT=0
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=0"

# Uncomment to enable BadRAM filtering, modify to suit your needs
# This works with Linux (no patch required) and with any kernel that obtains
# the memory map information from GRUB (GNU Mach, kernel of FreeBSD ...)
#GRUB_BADRAM="0x01234567,0xfefefefe,0x89abcdef,0xefefefef"

# Uncomment to disable graphical terminal (grub-pc only)
#GRUB_TERMINAL=console

# The resolution used on graphical terminal
# note that you can use only modes which your graphic card supports via VBE
# you can see them in real GRUB with the command `vbeinfo'
#GRUB_GFXMODE=640x480

# Uncomment if you don't want GRUB to pass "root=UUID=xxx" parameter to Linux
#GRUB_DISABLE_LINUX_UUID=true

# Uncomment to disable generation of recovery mode menu entries
#GRUB_DISABLE_RECOVERY="true"

# Uncomment to get a beep at grub start
#GRUB_INIT_TUNE="480 440 1"

@spowelljr
Member

I'll try with an Ubuntu 23.04 machine, previously what I tried was:

$ uname -a
Linux ubuntu-20-agent-5 5.15.0-1034-gcp #42~20.04.1-Ubuntu SMP Thu May 18 05:40:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

$ mount | grep cgroup2
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)

# modify grub
$ cat /etc/default/grub | grep "GRUB_CMDLINE_LINUX"
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=0"

$ sudo update-grub
$ sudo reboot

# cgroup2 still showing?
$ mount | grep cgroup2
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)

$ minikube start --kubernetes-version=v1.23.0

# then tried repro based on https://github.com/opencontainers/runc/issues/3849

@BenTheElder
Member Author

BenTheElder commented May 25, 2023

# cgroup2 still showing?
$ mount | grep cgroup2
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)

If cgroup2 is on /sys/fs/cgroup/unified, this is not cgroup v2 (unified) mode; this is cgroup v1 + v2.

Docker etc. should still use v1 then; systemd calls this "hybrid" mode (https://systemd.io/CGROUP_DELEGATION/).

That's expected. You don't need pure v1 mode. You do need misc enabled though.
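A quick way to tell the modes apart on the host (assuming a standard systemd layout):

# Pure cgroup v2 (unified): /sys/fs/cgroup itself is cgroup2fs.
stat -fc %T /sys/fs/cgroup
# -> cgroup2fs = unified (v2 only)
# -> tmpfs     = legacy or hybrid (v1 controllers mounted under /sys/fs/cgroup)

# Hybrid additionally mounts cgroup2 at /sys/fs/cgroup/unified:
mount | grep 'type cgroup'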

@BenTheElder
Member Author

FWIW: I'm not currently reproducing the issue; what I'm looking for is misc being in use, since I already settled on just disabling misc in v1 (see discussion in #3255). But when I was reproducing it, the reproducer in the runc issue was sufficient.

@BenTheElder
Member Author

I would also recommend considering #3241 while working on the cgroups support.

It has the downside of raising the minimum docker version to 20.10.0 (2.5 years old), but it makes the whole containers-in-containers thing a lot cleaner. We get this by default from all major runtimes with the transition to cgroups v2, but as long as users are on v1, v1 with cgroupns on is a lot better. For kind at least that required some additional fixups; for minikube it might be as simple as adding the --cgroupns=private option to the node containers.
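For illustration, the docker CLI form of that option (available since Docker 20.10); the image and container name here are placeholders, and a real node container needs more flags than this:

# Run a node container with a private cgroup namespace instead of sharing
# the host's cgroup namespace (the cgroup v1 default):
docker run -d --name node-example --privileged --cgroupns=private kindest/node:v1.27.2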

@BenTheElder BenTheElder added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label May 26, 2023
@BenTheElder
Member Author

kind is on runc 1.1.7 now

@alexeadem

alexeadem commented Jun 7, 2023

I see all the changes to fix this happened at the kind node base image. I'm using the image
kindest/node:v1.27.2 (from Docker Hub) and I'm unable to kubectl exec into the pods
with the following error

kubectl exec -it kindnet-9x65w -n kube-system -- bash
error: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "802ffd1ad2b8344badec3079b44b13619109e7b59e1576dd3816d0c2752661af": OCI runtime exec failed: exec failed: unable to start container process: error adding pid 846 to cgroups: failed to write 846: openat2 /sys/fs/cgroup/unified/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-podac27167f_93ea_4991_9fb0_7bd3a9b6c463.slice/cri-containerd-453a4ebc9ef15c15a76e4cea9646bb9638a44ce32c51f300024718822cb05730.scope/cgroup.procs: no such file or directory: unknown

My guess is that the newer kind k8s node images haven't been updated/rebuilt with the new base image?

Is there a workaround at the OS level I can use to make newer k8s versions run without this issue on cgroup v1?

I'm using

sudo grubby --update-kernel=`sudo grubby --default-kernel` --args="systemd.unified_cgroup_hierarchy=0"
mount | grep cgroup2
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)

therefore hybrid mode

OS version

NAME="Fedora Linux"
VERSION="38 (Thirty Eight)"
ID=fedora
VERSION_ID=38
VERSION_CODENAME=""
PLATFORM_ID="platform:f38"
PRETTY_NAME="Fedora Linux 38 (Thirty Eight)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:38"
DEFAULT_HOSTNAME="fedora"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f38/system-administrators-guide/"
SUPPORT_URL="https://ask.fedoraproject.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=38
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=38
SUPPORT_END=2024-05-14
6.2.14-300.fc38.x86_64

@BenTheElder
Member Author

Thanks for the report.

To avoid image changes like this, please use the digests as instructed in the release notes (i.e. @sha256...).
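For example, pinning by digest when creating a cluster; the digest itself is elided here, copy it from the release notes:

# Pin the node image by digest so base image rebuilds can't change what runs:
kind create cluster --image kindest/node:v1.27.2@sha256:<digest-from-release-notes>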

@BenTheElder
Member Author

To fully debug your environment we'll need a full bug template report with the information requested there.

I suspect this is due to cgroup v1 being used for the cluster nodes without cgroupns=private. In kind v0.20.0 we will force it to always use cgroupns=private, but the images were expected to continue to work with cgroupns=host.

My guess is that newer kind k8s nodes images haven't been updated/rebuild with the new base image?

1.27.2 has been, so probably the opposite issue?

In the short term if you use the digest pinning you will be able to use a version predating these base image changes.

Or, you could try the latest kind code at HEAD and see if the cgroupns=private change solves it.
If you're using docker there's a dockerd flag to change the default, or you could switch to cgroups v2 (docker and podman default to cgroupns=private when using cgroups v2 unified).
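If I recall correctly, the daemon-level setting is default-cgroupns-mode (Docker 20.10+); a sketch assuming that key name, and note it overwrites any existing daemon.json:

# Make new containers get a private cgroup namespace by default:
echo '{ "default-cgroupns-mode": "private" }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker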

@alexeadem

alexeadem commented Jun 7, 2023

I'm using qbo, which uses kind node images. I found the docker API equivalent:

"CgroupnsMode": "private",\

https://docs.docker.com/engine/api/v1.43/#tag/Container/operation/ContainerCreate
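Roughly, via the engine API over the local socket; the image name and API version here are placeholders:

# Create a container with a private cgroup namespace via the Docker Engine API:
curl --unix-socket /var/run/docker.sock \
  -H 'Content-Type: application/json' \
  -d '{"Image": "kindest/node:v1.27.2", "HostConfig": {"CgroupnsMode": "private"}}' \
  http://localhost/v1.43/containers/create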

Thanks for the quick response and the help. It is all working fine now and I can do kubectl exec without issues.
