Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: glibc extension #473

Merged
merged 1 commit into from
Oct 18, 2024
Merged

feat: glibc extension #473

merged 1 commit into from
Oct 18, 2024

Conversation

jfroy
Copy link
Contributor

@jfroy jfroy commented Sep 18, 2024

This PR adds a glibc extension. The intention is to replace the nvidia extensions entirely and only provide the glibc components required by the nvidia gpu-operator and its components (e.g. the nvidia container toolkit).

The extension is pretty much copied from the package that is in the nvidia extensions, with one major modification: a symbolic link to ldconfig is installed at /sbin/ldconfig. This change allows the nvidia gpu-operator to work without modification*. This change does require a patch to the extension validation logic, which is provided in separate PRs.

@jfroy jfroy changed the title glibc extension feat: glibc extension Sep 18, 2024
@frezbo
Copy link
Member

frezbo commented Sep 18, 2024

This PR adds a glibc extension. The intention is to replace the nvidia extensions entirely and only provide the glibc components required by the nvidia gpu-operator and its components (e.g. the nvidia container toolkit).

The extension is pretty much copied from the package that is in the nvidia extensions, with one major modification: a symbolic link to ldconfig is installed at /sbin/ldconfig. This change allows the nvidia gpu-operator to work without modification*. This change does require a patch to the extension validation logic, which is provided in separate PRs.

* A patch to the nvidia container toolkit is required to replace the shell script wrappers with a Go wrapper. See [Replace shell wrapper with a Go wrapper NVIDIA/nvidia-container-toolkit#700](https://github.com/NVIDIA/nvidia-container-toolkit/pull/700).

This is really cool and looking forward to actually not having to maintain patches or wrappers. What i still don't understand is about kernel modules, would SideroLabs be still shipping them as extensions (I believe that's the case since only machined in talos can load modules)?

README.md Outdated Show resolved Hide resolved
@jfroy
Copy link
Contributor Author

jfroy commented Sep 18, 2024

This PR adds a glibc extension. The intention is to replace the nvidia extensions entirely and only provide the glibc components required by the nvidia gpu-operator and its components (e.g. the nvidia container toolkit).
The extension is pretty much copied from the package that is in the nvidia extensions, with one major modification: a symbolic link to ldconfig is installed at /sbin/ldconfig. This change allows the nvidia gpu-operator to work without modification*. This change does require a patch to the extension validation logic, which is provided in separate PRs.

* A patch to the nvidia container toolkit is required to replace the shell script wrappers with a Go wrapper. See [Replace shell wrapper with a Go wrapper NVIDIA/nvidia-container-toolkit#700](https://github.com/NVIDIA/nvidia-container-toolkit/pull/700).

This is really cool and looking forward to actually not having to maintain patches or wrappers. What i still don't understand is about kernel modules, would SideroLabs be still shipping them as extensions (I believe that's the case since only machined in talos can load modules)?

I am sending more PRs, but this is one of the major changes from an architecture POV: the gpu-operator would be allowed to load and unload kernel modules, which means enabling module unloading in the kernel (see siderolabs/pkgs#1031) and not removing SYS_MODULE from containers (see siderolabs/talos#9339).

Another important note is that only CDI mode (see https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/cdi.html) works on Talos with this patch. The "legacy" runtime hook require more libraries to be present on the system, whereas the CDI hook is a pure Go program that only requires the glibc dynamic loader and /sbin/ldconfig.

@frezbo
Copy link
Member

frezbo commented Sep 18, 2024

Another important note is that only CDI mode (see https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/cdi.html) works on Talos with this patch. The "legacy" runtime hook require more libraries to be present on the system, whereas the CDI hook is a pure Go program that only requires the glibc dynamic loader and /sbin/ldconfig.

We're open to moving to using CDI 👍

@jfroy
Copy link
Contributor Author

jfroy commented Sep 18, 2024

See siderolabs/talos#9339 for main discussion on loading kernel modules.

@jfroy jfroy force-pushed the glibc branch 3 times, most recently from d94cf22 to b4de00e Compare September 21, 2024 06:21
@jfroy
Copy link
Contributor Author

jfroy commented Sep 21, 2024

Patch has been updated to rework the glibc subtree to look like a merged /usr root.

misc/glibc/manifest.yaml Outdated Show resolved Hide resolved
misc/glibc/vars.yaml Outdated Show resolved Hide resolved
jfroy added a commit to jfroy/siderolabs-extensions that referenced this pull request Sep 23, 2024
This patch deprecates the NVIDIA toolkit extension and introduces a new
nvidia-driver extension (in production/lts versions and open
source/proprietary flavors). The NVIDIA container toolkit must be
installed independently, via a future Talos extension, the NVIDIA GPU
Operator, or by the cluster administator.

The extension depends on the new glibc extension (siderolabs#473) and participates
in its filesystem subroot by installing all the NVIDIA components in it.

Finally, the extension runs a service that will bind mount this glibc
subroot at `/run/nvidia/driver` and run the `nvidia-persistenced`
daemon.

This careful setup allows the NVIDIA GPU Operator to utilize this
extension as if it were a traditional NVIDIA driver container.

Signed-off-by: Jean-Francois Roy <[email protected]>
@smira
Copy link
Member

smira commented Sep 26, 2024

I'm going to cut v1.9.0-alpha.0 of Talos, bump the extension validator, and then get back to this PR, thank you for your patience!

@jfroy
Copy link
Contributor Author

jfroy commented Sep 26, 2024

I'm going to cut v1.9.0-alpha.0 of Talos, bump the extension validator, and then get back to this PR, thank you for your patience!

No rush, thank you for considering these PRs!

The extension is mostly copied from the package that is in the nvidia
toolkit extensions, but with a few notable changes.

- A symbolic link to `ldconfig` is installed at `/sbin/ldconfig`. This
allows the nvidia gpu-operator to work without modification. A patch to
the extension validation logic is required to allow the new path.
- The `/usr/local/glibc` subtree is now structured as a [merged `/usr`
root](https://systemd.io/THE_CASE_FOR_THE_USR_MERGE/). This improves
compatbility with ld-linux.so.2 for library discovery.

Signed-off-by: Jean-Francois Roy <[email protected]>
Signed-off-by: Noel Georgi <[email protected]>
@frezbo
Copy link
Member

frezbo commented Oct 18, 2024

@jfroy is this good to go, I see the referenced PR for nvidia-container-toolkit is still not merged 🤔

@jfroy
Copy link
Contributor Author

jfroy commented Oct 18, 2024

@jfroy is this good to go, I see the referenced PR for nvidia-container-toolkit is still not merged 🤔

Yeah I'm working on that internally. The team is just very busy with other work. We are focusing in particular on CDI and DRA. Sidero could package the toolkit in an extension -- the wrappers are created by an installer component that is invoked by the operator and are not an intrinsic part of the toolkit.

In any case, this patch is needed for the CDI hook to run, so it's good to pick up no matter what.

@frezbo
Copy link
Member

frezbo commented Oct 18, 2024

/m

@talos-bot talos-bot merged commit c7eb377 into siderolabs:main Oct 18, 2024
14 checks passed
@frezbo
Copy link
Member

frezbo commented Oct 18, 2024

I'll wait over the weekend and see if our daily runs for nvidia tests works with this change

@jfroy jfroy deleted the glibc branch December 19, 2024 22:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants