Add bubblewrap-in-kubernetes post
Change-Id: I4f65aa8f2fe7e52614c48de4cee5bd8a19250aad
TristanCacqueray committed Dec 10, 2024
1 parent bd7fdb1 commit c3619ec
Showing 4 changed files with 372 additions and 0 deletions.
151 changes: 151 additions & 0 deletions src/blog-bubblewrap-in-kubernetes-pod-with-procmount.md
@@ -0,0 +1,151 @@
This post explores how to create nested containers securely inside Kubernetes.
In the previous post titled [Recursive namespaces to run containers inside a container][prev-post],
I showed how to create nested containers using a rootless container runtime like Podman.
In this post, I'll demonstrate how to run the same workload with [Kubernetes][k8s].

In two parts, I will present:

- How to run Kubernetes from source.
- The ProcMountType feature to work around the original issue.


## Context and problem statement

The context of this post is to deploy a service named zuul-executor for running CI builds securely inside Kubernetes,
without requiring a privileged security context.

The problem is that this service performs build isolation locally using [Bubblewrap][bwrap],
which is similar to running a container inside a container.


## Run Kubernetes locally

In this section, let's set up Kubernetes locally.
On a fresh Fedora 41 system, install the following requirements:

```ShellSession
$ sudo dnf install -y etcd crio crictl kubectl containernetworking-plugins
$ sudo systemctl start crio
```

Then, start Kubernetes using the *local-up-cluster* script as follows:

```ShellSession
$ mkdir -p ~/src/github.com/kubernetes; cd ~/src/github.com/kubernetes
$ git clone https://github.com/kubernetes/kubernetes/
$ cd kubernetes
$ sudo env CGROUP_DRIVER=systemd CONTAINER_RUNTIME=remote CONTAINER_RUNTIME_ENDPOINT='unix:///var/run/crio/crio.sock' \
./hack/local-up-cluster.sh
...
Local Kubernetes cluster is running. Press Ctrl-C to shut it down.
```

With the cluster running, let's try Bubblewrap in a pod using the following test resource:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-bwrap
spec:
  containers:
    - name: test
      image: quay.io/zuul-ci/zuul-executor
      command: ["/bin/sleep", "infinity"]
      securityContext:
        capabilities:
          add: ["SETFCAP"]
```
> As seen previously, we need *CAP_SETFCAP* to create the user namespace; otherwise, bwrap fails early with the following error:
>
> ```
> bwrap: setting up uid map: Operation not permitted
> ```

Apply the test resource with the following commands:

```ShellSession
$ export KUBECONFIG=/var/run/kubernetes/admin.kubeconfig
$ kubectl apply -f test-bwrap.yaml
$ kubectl exec test-bwrap -- bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session ps afx
bwrap: Can't mount proc on /newroot/proc: Operation not permitted
```

This produces the same error we encountered in the [previous post][prev-post]: the /proc filesystem is tainted in the pod (the container runtime masks some of its paths), which prevents Bubblewrap from mounting a new procfs for the new PID namespace.
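
One way to see what "tainted" means here is to list the mounts that sit on top of /proc inside the pod. The helper below is a minimal sketch: the `list_proc_overmounts` name is hypothetical, and it assumes the image ships `grep`. With the default proc mount, it should print the masked and read-only paths set up by the container runtime.

```shell
# Hypothetical helper: list the mounts layered over /proc in a pod.
# Each line of /proc/self/mountinfo describes one mount; matching on
# " /proc/" selects the entries whose mount point is below /proc.
list_proc_overmounts() {
  pod="$1"
  kubectl exec "$pod" -- grep ' /proc/' /proc/self/mountinfo
}

# Usage: list_proc_overmounts test-bwrap
```

If the command prints nothing (and `grep` exits with a non-zero status), no mount is layered over /proc.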

The next section introduces the *ProcMountType* feature to work around this issue.

## The ProcMountType feature

The *ProcMountType* feature can be enabled by adding the following environment variable to the *local-up-cluster* command: `FEATURE_GATES='UserNamespacesSupport=true,ProcMountType=true'`.
To make use of the new feature, we also need to activate *UserNamespacesSupport*, as explained in the [documentation](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#proc-access).
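
Combined with the earlier command, the cluster can be restarted with both feature gates enabled. This is only a sketch: the `start_cluster` function name is hypothetical, the other variables are the ones already used in the previous section, and it is meant to be run from the Kubernetes checkout.

```shell
# Feature gates named in the paragraph above.
FEATURE_GATES='UserNamespacesSupport=true,ProcMountType=true'

# Hypothetical wrapper around the local-up-cluster invocation shown
# earlier, with FEATURE_GATES added to the environment.
start_cluster() {
  sudo env \
    CGROUP_DRIVER=systemd \
    CONTAINER_RUNTIME=remote \
    CONTAINER_RUNTIME_ENDPOINT='unix:///var/run/crio/crio.sock' \
    FEATURE_GATES="$FEATURE_GATES" \
    ./hack/local-up-cluster.sh
}

# Usage (from ~/src/github.com/kubernetes/kubernetes): start_cluster
```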

With these features, we can update the resource like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-bwrap
spec:
  hostUsers: false
  containers:
    - name: test
      image: quay.io/zuul-ci/zuul-executor
      command: ["/bin/sleep", "infinity"]
      securityContext:
        procMount: Unmasked
        capabilities:
          add: ["SETFCAP"]
```

… using the following commands:

```
$ sudo crictl rm -af; kubectl delete -f ./test-bwrap.yaml && kubectl apply -f ./test-bwrap.yaml
pod/test-bwrap created
$ kubectl exec test-bwrap -- bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session ps afx
bwrap: Can't mount proc on /newroot/proc: Permission denied
```

This time we get a different error, *Permission denied*, which is caused by SELinux. Using *audit2allow*, we can see that the following policy needs to be installed:

```
module nestedcontainers 1.0;

require {
    type proc_t;
    type devpts_t;
    type container_t;
    class filesystem mount;
}

#============= container_t ==============
allow container_t devpts_t:filesystem mount;
allow container_t proc_t:filesystem mount;
```

… which lets us run Bubblewrap inside an unprivileged pod:

```ShellSession
$ sudo semodule -i nestedcontainers.pp
$ kubectl exec test-bwrap -- bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session ps afx
PID TTY STAT TIME COMMAND
1 ? Ss 0:00 bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session --cap-add all --uid 0 ps afx
2 ? R 0:00 ps afx
```

Notice how the `sleep infinity` process is not visible in the ps output, confirming that we are indeed running in a nested container.
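
Note that `semodule -i` expects a compiled policy package. Assuming the policy above is saved as `nestedcontainers.te` (a file name chosen here for illustration), it can be compiled with the standard SELinux tools. The `build_selinux_module` helper below is hypothetical; `checkmodule` and `semodule_package` are typically provided by the `checkpolicy` and `policycoreutils` packages.

```shell
# Hypothetical helper: compile <name>.te into an installable <name>.pp.
build_selinux_module() {
  name="$1"
  # Compile the policy source into a binary module.
  checkmodule -M -m -o "${name}.mod" "${name}.te" || return 1
  # Wrap the binary module into a policy package for semodule.
  semodule_package -o "${name}.pp" -m "${name}.mod"
}

# Usage:
#   build_selinux_module nestedcontainers
#   sudo semodule -i nestedcontainers.pp
```

Alternatively, the `-M` option of `audit2allow` can generate both the source and the package directly from the audit log entries.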

## Conclusion

This post demonstrates that we can run a container inside a container with Kubernetes thanks to the following settings:

- The SETFCAP capability to create the user namespace,
- The ProcMountType and UserNamespacesSupport feature gates to unmask the /proc filesystem, and
- An SELinux policy to enable mounting filesystems inside the new namespace.

[prev-post]: https://www.softwarefactory-project.io/recursive-namespaces-to-run-containers-inside-a-container.html
[k8s]: https://kubernetes.io/
[bwrap]: https://github.com/containers/bubblewrap
18 changes: 18 additions & 0 deletions src/blog-bubblewrap-in-kubernetes-pod-with-procmount.rst
@@ -0,0 +1,18 @@
Secure Bubblewrap inside Kubernetes with ProcMount
##################################################

:date: 2024-12-09
:category: blog
:authors: tristanC

.. raw:: html

   <style type="text/css">
     .literal {
       border-radius: 6px;
       padding: 1px 1px;
       background-color: rgba(27,31,35,.05);
     }
   </style>
11 changes: 11 additions & 0 deletions src/blog-bubblewrap-in-kubernetes-pod-with-procmount.sh
@@ -0,0 +1,11 @@
#! /usr/bin/env nix-shell
#! nix-shell -i bash -p pandoc
#! nix-shell -I nixpkgs=https://github.com/NixOS/nixpkgs/archive/4d2b37a84fad1091b9de401eb450aae66f1a741e.tar.gz

NAME="blog-bubblewrap-in-kubernetes-pod-with-procmount"

pandoc --include-in-header=./$NAME.rst \
-f gfm --reference-links \
-t rst ./$NAME.md -o ../website/content/$NAME.rst

sed -e 's|^.. code::|.. code-block::|' -i ../website/content/$NAME.rst
192 changes: 192 additions & 0 deletions website/content/blog-bubblewrap-in-kubernetes-pod-with-procmount.rst
@@ -0,0 +1,192 @@
Secure Bubblewrap inside Kubernetes with ProcMount
##################################################

:date: 2024-12-09
:category: blog
:authors: tristanC

.. raw:: html

   <style type="text/css">
     .literal {
       border-radius: 6px;
       padding: 1px 1px;
       background-color: rgba(27,31,35,.05);
     }
   </style>

This post explores how to create nested containers securely inside
Kubernetes. In the previous post titled `Recursive namespaces to run
containers inside a container`_, I showed how to create nested containers
using a rootless container runtime like Podman. In this post, I'll
demonstrate how to run the same workload with `Kubernetes`_.

In two parts, I will present:

- How to run Kubernetes from source.
- The ProcMountType feature to work around the original issue.

Context and problem statement
=============================

The context of this post is to deploy a service named zuul-executor for
running CI builds securely inside Kubernetes, without requiring a
privileged security context.

The problem is that this service performs build isolation locally using
`Bubblewrap`_, which is similar to running a container inside a
container.

Run Kubernetes locally
======================

In this section, let's set up Kubernetes locally. On a fresh Fedora 41
system, install the following requirements:

.. code-block:: ShellSession

   $ sudo dnf install -y etcd crio crictl kubectl containernetworking-plugins
   $ sudo systemctl start crio

Then, start Kubernetes using the *local-up-cluster* script as follows:

.. code-block:: ShellSession

   $ mkdir -p ~/src/github.com/kubernetes; cd ~/src/github.com/kubernetes
   $ git clone https://github.com/kubernetes/kubernetes/
   $ cd kubernetes
   $ sudo env CGROUP_DRIVER=systemd CONTAINER_RUNTIME=remote CONTAINER_RUNTIME_ENDPOINT='unix:///var/run/crio/crio.sock' \
     ./hack/local-up-cluster.sh
   ...
   Local Kubernetes cluster is running. Press Ctrl-C to shut it down.

With the cluster running, let's try Bubblewrap in a pod using the
following test resource:

.. code-block:: yaml

   apiVersion: v1
   kind: Pod
   metadata:
     name: test-bwrap
   spec:
     containers:
       - name: test
         image: quay.io/zuul-ci/zuul-executor
         command: ["/bin/sleep", "infinity"]
         securityContext:
           capabilities:
             add: ["SETFCAP"]

..

   As seen previously, we need *CAP_SETFCAP* to create the user
   namespace; otherwise, bwrap fails early with the following error:

   ::

      bwrap: setting up uid map: Operation not permitted

Apply the test resource with the following commands:

.. code-block:: ShellSession

   $ export KUBECONFIG=/var/run/kubernetes/admin.kubeconfig
   $ kubectl apply -f test-bwrap.yaml
   $ kubectl exec test-bwrap -- bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session ps afx
   bwrap: Can't mount proc on /newroot/proc: Operation not permitted

This produces the same error we encountered in the `previous post`_: the
/proc filesystem is tainted in the pod (the container runtime masks some
of its paths), which prevents Bubblewrap from mounting a new procfs for
the new PID namespace.

The next section introduces the *ProcMountType* feature to work around
this issue.

The ProcMountType feature
=========================

The *ProcMountType* feature can be enabled by adding the following
environment variable to the *local-up-cluster* command:
``FEATURE_GATES='UserNamespacesSupport=true,ProcMountType=true'``. To
make use of the new feature, we also need to activate
*UserNamespacesSupport*, as explained in the `documentation`_.

With these features, we can update the resource like this:

.. code-block:: yaml

   apiVersion: v1
   kind: Pod
   metadata:
     name: test-bwrap
   spec:
     hostUsers: false
     containers:
       - name: test
         image: quay.io/zuul-ci/zuul-executor
         command: ["/bin/sleep", "infinity"]
         securityContext:
           procMount: Unmasked
           capabilities:
             add: ["SETFCAP"]

… using the following commands:

::

   $ sudo crictl rm -af; kubectl delete -f ./test-bwrap.yaml && kubectl apply -f ./test-bwrap.yaml
   pod/test-bwrap created
   $ kubectl exec test-bwrap -- bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session ps afx
   bwrap: Can't mount proc on /newroot/proc: Permission denied

This time we get a different error, *Permission denied*, which is caused
by SELinux. Using *audit2allow*, we can see that the following policy
needs to be installed:

::

   module nestedcontainers 1.0;

   require {
       type proc_t;
       type devpts_t;
       type container_t;
       class filesystem mount;
   }

   #============= container_t ==============
   allow container_t devpts_t:filesystem mount;
   allow container_t proc_t:filesystem mount;

… which lets us run Bubblewrap inside an unprivileged pod:

.. code-block:: ShellSession

   $ sudo semodule -i nestedcontainers.pp
   $ kubectl exec test-bwrap -- bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session ps afx
   PID TTY STAT TIME COMMAND
   1 ? Ss 0:00 bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session --cap-add all --uid 0 ps afx
   2 ? R 0:00 ps afx

Notice how the ``sleep infinity`` process is not visible in the ps
output, confirming that we are indeed running in a nested container.

Conclusion
==========

This post demonstrates that we can run a container inside a container
with Kubernetes thanks to the following settings:

- The SETFCAP capability to create the user namespace,
- The ProcMountType and UserNamespacesSupport feature gates to unmask
  the /proc filesystem, and
- An SELinux policy to enable mounting filesystems inside the new
  namespace.

.. _Recursive namespaces to run containers inside a container: https://www.softwarefactory-project.io/recursive-namespaces-to-run-containers-inside-a-container.html
.. _Kubernetes: https://kubernetes.io/
.. _Bubblewrap: https://github.com/containers/bubblewrap
.. _previous post: https://www.softwarefactory-project.io/recursive-namespaces-to-run-containers-inside-a-container.html
.. _documentation: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#proc-access
