-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Change-Id: I4f65aa8f2fe7e52614c48de4cee5bd8a19250aad
- Loading branch information
1 parent
bd7fdb1
commit c3619ec
Showing
4 changed files
with
372 additions
and
0 deletions.
There are no files selected for viewing
151 changes: 151 additions & 0 deletions
151
src/blog-bubblewrap-in-kubernetes-pod-with-procmount.md
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,151 @@ | ||
This post explores how to create nested containers securely inside Kubernetes. | ||
In the previous post titled [Recursive namespaces to run containers inside a container][prev-post] | ||
I showed how to create nested containers using a rootless container runtimes like Podman. | ||
In this post, I'll demonstrate how to run the same workload with [Kubernetes][k8s]. | ||
|
||
In two parts, I will present: | ||
|
||
- How to run Kubernetes from source. | ||
- The ProcMountType feature to work around the original issue. | ||
|
||
|
||
## Context and problem statement | ||
|
||
The context of this post is to deploy a service named zuul-executor for running CI builds securely inside Kubernetes, | ||
without requiring a privileged security context. | ||
|
||
The problem is that this service performs build isolation locally using [Bubblewrap][bwrap], | ||
which is similar to running a container inside a container. | ||
|
||
|
||
## Run kubernetes locally | ||
|
||
In this section, let's set up Kubernetes locally. | ||
On a fresh Fedora 41 system, install the following requirements: | ||
|
||
```ShellSession | ||
$ sudo dnf install -y etcd crio crictl kubectl containernetworking-plugins | ||
$ sudo systemctl start crio | ||
``` | ||
|
||
Then, start Kubernetes using the *local-up-cluster* script as follows: | ||
|
||
```ShellSession | ||
$ mkdir -p ~/src/github.com/kubernetes; cd ~/src/github.com/kubernetes | ||
$ git clone https://github.com/kubernetes/kubernetes/ | ||
$ cd kubernetes | ||
$ sudo env CGROUP_DRIVER=systemd CONTAINER_RUNTIME=remote CONTAINER_RUNTIME_ENDPOINT='unix:///var/run/crio/crio.sock' \ | ||
./hack/local-up-cluster.sh | ||
... | ||
Local Kubernetes cluster is running. Press Ctrl-C to shut it down. | ||
``` | ||
|
||
… using the following test resource: | ||
|
||
```yaml | ||
apiVersion: v1 | ||
kind: Pod | ||
metadata: | ||
name: test-bwrap | ||
spec: | ||
containers: | ||
- name: test | ||
image: quay.io/zuul-ci/zuul-executor | ||
command: ["/bin/sleep", "infinity"] | ||
securityContext: | ||
capabilities: | ||
add: ["SETFCAP"] | ||
``` | ||
> As seen previously, we need *CAP_SETFCAP* to create the user namespace, otherwise bwrap fails early with the following error: | ||
> | ||
> ``` | ||
> bwrap: setting up uid map: Operation not permitted | ||
> ``` | ||
|
||
Apply the test resource with the following commands: | ||
|
||
```ShellSession | ||
$ export KUBECONFIG=/var/run/kubernetes/admin.kubeconfig | ||
$ kubectl apply -f test-bwrap.yaml | ||
$ kubectl exec test-bwrap -- bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session ps afx | ||
bwrap: Can't mount proc on /newroot/proc: Operation not permitted | ||
``` | ||
|
||
This produces the same error we encountered in the [previous post][prev-post]: the /proc filesystem is tainted in the pod, preventing Bubblewrap from being able to create a new procfs for the new PID namespace. | ||
|
||
The next section introduces the *ProcMountType* feature to work around this issue. | ||
|
||
## The ProcMountType feature | ||
|
||
The *ProcMountType* feature can be enabled by adding the following environment variable to the *local-up-cluster*: `FEATURE_GATES='UserNamespacesSupport=true,ProcMountType=true'`. | ||
To make use of the new feature, we also need to activate *UserNamespacesSupport*, as explained in the following [documentation](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#proc-access). | ||
|
||
With these features, we can update the resource like that: | ||
|
||
```yaml | ||
apiVersion: v1 | ||
kind: Pod | ||
metadata: | ||
name: test-bwrap | ||
spec: | ||
hostUsers: false | ||
containers: | ||
- name: test | ||
image: quay.io/zuul-ci/zuul-executor | ||
command: ["/bin/sleep", "infinity"] | ||
securityContext: | ||
procMount: Unmasked | ||
capabilities: | ||
add: ["SETFCAP"] | ||
``` | ||
|
||
… using the following commands: | ||
|
||
``` | ||
$ sudo crictl rm -af; kubectl delete -f ./test-bwrap.yaml && kubectl apply -f ./test-bwrap.yaml | ||
pod/test-bwrap created | ||
$ kubectl exec test-bwrap -- bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session ps afx | ||
bwrap: Can't mount proc on /newroot/proc: Permission denied | ||
``` | ||
|
||
This time we get a new permission denied, which is caused by SELinux. Using *audit2allow*, we can see that the following policy needs to be installed: | ||
|
||
``` | ||
module nestedcontainers 1.0; | ||
require { | ||
type proc_t; | ||
type devpts_t; | ||
type container_t; | ||
class filesystem mount; | ||
} | ||
#============= container_t ============== | ||
allow container_t devpts_t:filesystem mount; | ||
allow container_t proc_t:filesystem mount; | ||
``` | ||
|
||
… which lets us run Bubblewrap inside an unprivileged pod: | ||
|
||
```ShellSession | ||
$ sudo semodule -i nestedcontainers.pp | ||
$ kubectl exec test-bwrap -- bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session ps afx | ||
PID TTY STAT TIME COMMAND | ||
1 ? Ss 0:00 bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session --cap-add all --uid 0 ps afx | ||
2 ? R 0:00 ps afx | ||
``` | ||
|
||
Notice how the `sleep infinity` process is not visible in the ps output, confirming that we are indeed running in a nested container. | ||
|
||
## Conclusion | ||
|
||
This post demonstrates that we can run a container inside a container with Kubernetes thanks to the following settings: | ||
|
||
- The SETFCAP to create the user namespace, | ||
- The ProcMountType and UserNamespacesSupport to unmask the /proc filesystem, and | ||
- A SELinux policy to enable mounting filesystems inside the new namespace. | ||
|
||
[prev-post]: https://www.softwarefactory-project.io/recursive-namespaces-to-run-containers-inside-a-container.html | ||
[k8s]: https://kubernetes.io/ | ||
[bwrap]: https://github.com/containers/bubblewrap |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
Secure Bubblewrap inside Kubernetes with ProcMount | ||
################################################## | ||
|
||
:date: 2024-12-09 | ||
:category: blog | ||
:authors: tristanC | ||
|
||
.. raw:: html | ||
|
||
<style type="text/css"> | ||
.literal { | ||
border-radius: 6px; | ||
padding: 1px 1px; | ||
background-color: rgba(27,31,35,.05); | ||
} | ||
</style> |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
#! /usr/bin/env nix-shell | ||
#! nix-shell -i bash -p pandoc | ||
#! nix-shell -I nixpkgs=https://github.com/NixOS/nixpkgs/archive/4d2b37a84fad1091b9de401eb450aae66f1a741e.tar.gz | ||
|
||
NAME="blog-bubblewrap-in-kubernetes-pod-with-procmount" | ||
|
||
pandoc --include-in-header=./$NAME.rst \ | ||
-f gfm --reference-links \ | ||
-t rst ./$NAME.md -o ../website/content/$NAME.rst | ||
|
||
sed -e 's|^.. code::|.. code-block::|' -i ../website/content/$NAME.rst |
192 changes: 192 additions & 0 deletions
192
website/content/blog-bubblewrap-in-kubernetes-pod-with-procmount.rst
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,192 @@ | ||
Secure Bubblewrap inside Kubernetes with ProcMount | ||
################################################## | ||
|
||
:date: 2024-12-09 | ||
:category: blog | ||
:authors: tristanC | ||
|
||
.. raw:: html | ||
|
||
<style type="text/css"> | ||
.literal { | ||
border-radius: 6px; | ||
padding: 1px 1px; | ||
background-color: rgba(27,31,35,.05); | ||
} | ||
</style> | ||
|
||
This post explores how to create nested containers securely inside | ||
Kubernetes. In the previous post titled `Recursive namespaces to run | ||
containers inside a container`_ I showed how to create nested containers | ||
using a rootless container runtimes like Podman. In this post, I'll | ||
demonstrate how to run the same workload with `Kubernetes`_. | ||
|
||
In two parts, I will present: | ||
|
||
- How to run Kubernetes from source. | ||
- The ProcMountType feature to work around the original issue. | ||
|
||
Context and problem statement | ||
============================= | ||
|
||
The context of this post is to deploy a service named zuul-executor for | ||
running CI builds securely inside Kubernetes, without requiring a | ||
privileged security context. | ||
|
||
The problem is that this service performs build isolation locally using | ||
`Bubblewrap`_, which is similar to running a container inside a | ||
container. | ||
|
||
Run kubernetes locally | ||
====================== | ||
|
||
In this section, let's set up Kubernetes locally. On a fresh Fedora 41 | ||
system, install the following requirements: | ||
|
||
.. code-block:: ShellSession | ||
$ sudo dnf install -y etcd crio crictl kubectl containernetworking-plugins | ||
$ sudo systemctl start crio | ||
Then, start Kubernetes using the *local-up-cluster* script as follows: | ||
|
||
.. code-block:: ShellSession | ||
$ mkdir -p ~/src/github.com/kubernetes; cd ~/src/github.com/kubernetes | ||
$ git clone https://github.com/kubernetes/kubernetes/ | ||
$ cd kubernetes | ||
$ sudo env CGROUP_DRIVER=systemd CONTAINER_RUNTIME=remote CONTAINER_RUNTIME_ENDPOINT='unix:///var/run/crio/crio.sock' \ | ||
./hack/local-up-cluster.sh | ||
... | ||
Local Kubernetes cluster is running. Press Ctrl-C to shut it down. | ||
… using the following test resource: | ||
|
||
.. code-block:: yaml | ||
apiVersion: v1 | ||
kind: Pod | ||
metadata: | ||
name: test-bwrap | ||
spec: | ||
containers: | ||
- name: test | ||
image: quay.io/zuul-ci/zuul-executor | ||
command: ["/bin/sleep", "infinity"] | ||
securityContext: | ||
capabilities: | ||
add: ["SETFCAP"] | ||
.. | ||
As seen previously, we need *CAP_SETFCAP* to create the user | ||
namespace, otherwise bwrap fails early with the following error: | ||
|
||
:: | ||
|
||
bwrap: setting up uid map: Operation not permitted | ||
|
||
Apply the test resource with the following commands: | ||
|
||
.. code-block:: ShellSession | ||
$ export KUBECONFIG=/var/run/kubernetes/admin.kubeconfig | ||
$ kubectl apply -f test-bwrap.yaml | ||
$ kubectl exec test-bwrap -- bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session ps afx | ||
bwrap: Can't mount proc on /newroot/proc: Operation not permitted | ||
This produces the same error we encountered in the `previous post`_: the | ||
/proc filesystem is tainted in the pod, preventing Bubblewrap from being | ||
able to create a new procfs for the new PID namespace. | ||
|
||
The next section introduces the *ProcMountType* feature to work around | ||
this issue. | ||
|
||
The ProcMountType feature | ||
========================= | ||
|
||
The *ProcMountType* feature can be enabled by adding the following | ||
environment variable to the *local-up-cluster*: | ||
``FEATURE_GATES='UserNamespacesSupport=true,ProcMountType=true'``. To | ||
make use of the new feature, we also need to activate | ||
*UserNamespacesSupport*, as explained in the following `documentation`_. | ||
|
||
With these features, we can update the resource like that: | ||
|
||
.. code-block:: yaml | ||
apiVersion: v1 | ||
kind: Pod | ||
metadata: | ||
name: test-bwrap | ||
spec: | ||
hostUsers: false | ||
containers: | ||
- name: test | ||
image: quay.io/zuul-ci/zuul-executor | ||
command: ["/bin/sleep", "infinity"] | ||
securityContext: | ||
procMount: Unmasked | ||
capabilities: | ||
add: ["SETFCAP"] | ||
… using the following commands: | ||
|
||
:: | ||
|
||
$ sudo crictl rm -af; kubectl delete -f ./test-bwrap.yaml && kubectl apply -f ./test-bwrap.yaml | ||
pod/test-bwrap created | ||
$ kubectl exec test-bwrap -- bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session ps afx | ||
bwrap: Can't mount proc on /newroot/proc: Permission denied | ||
|
||
This time we get a new permission denied, which is caused by SELinux. | ||
Using *audit2allow*, we can see that the following policy needs to be | ||
installed: | ||
|
||
:: | ||
|
||
module nestedcontainers 1.0; | ||
|
||
require { | ||
type proc_t; | ||
type devpts_t; | ||
type container_t; | ||
class filesystem mount; | ||
} | ||
|
||
#============= container_t ============== | ||
allow container_t devpts_t:filesystem mount; | ||
allow container_t proc_t:filesystem mount; | ||
|
||
… which lets us run Bubblewrap inside an unprivileged pod: | ||
|
||
.. code-block:: ShellSession | ||
$ sudo semodule -i nestedcontainers.pp | ||
$ kubectl exec test-bwrap -- bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session ps afx | ||
PID TTY STAT TIME COMMAND | ||
1 ? Ss 0:00 bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session --cap-add all --uid 0 ps afx | ||
2 ? R 0:00 ps afx | ||
Notice how the ``sleep infinity`` process is not visible in the ps | ||
output, confirming that we are indeed running in a nested container. | ||
|
||
Conclusion | ||
========== | ||
|
||
This post demonstrates that we can run a container inside a container | ||
with Kubernetes thanks to the following settings: | ||
|
||
- The SETFCAP to create the user namespace, | ||
- The ProcMountType and UserNamespacesSupport to unmask the /proc | ||
filesystem, and | ||
- A SELinux policy to enable mounting filesystems inside the new | ||
namespace. | ||
|
||
.. _Recursive namespaces to run containers inside a container: https://www.softwarefactory-project.io/recursive-namespaces-to-run-containers-inside-a-container.html | ||
.. _Kubernetes: https://kubernetes.io/ | ||
.. _Bubblewrap: https://github.com/containers/bubblewrap | ||
.. _previous post: https://www.softwarefactory-project.io/recursive-namespaces-to-run-containers-inside-a-container.html | ||
.. _documentation: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#proc-access |