NVIDIA iGPU passthrough Support #12525
thanks, your summary sounds correct to me :)

From the NVIDIA Container Toolkit side, our goal is to move to CDI as the mechanism to define what is required to allow access to a named device or resource. In the case of OCI-compliant runtimes, the container edits in a CDI spec have a well-defined mapping to OCI Runtime spec modifications. Note that CDI spec generation is separate from CDI spec consumption: for generation, the NVIDIA Container Toolkit includes an `nvidia-ctk cdi generate` command. The attached image shows a workflow for the generation and consumption of CDI specifications in the context of OCI-compliant runtimes.

In the context of LXD (or other non-OCI-compliant runtimes), what would be required to allow for the injection of NVIDIA devices, including those associated with iGPUs, is support for reading a CDI spec associated with a particular device and applying the required modifications to the container. Note that this would also enable the injection of CDI devices from other vendors that support the specification.
Hello LXD team! I'm from Partner Engineering at Canonical and I'm working on NVIDIA's Tegra line of devices. These are ARM64 devices with an integrated GPU (iGPU) and sometimes an optional discrete GPU (dGPU). We would like to use LXD/LXC with iGPU passthrough for device testing. LXD already supports NVIDIA dGPUs via the `nvidia.runtime=true` flag, but iGPU passthrough is not supported at the moment.

I've done some initial investigation into how this support could be added, and it seems that LXD hands off most of the mounting control to `libnvidia-container`. The call stack as I understand it is as follows:

1. `nvidia.runtime=true` is set; LXD's `driver_lxc.go` does misc checks on the `nvidia` hook and sets the `NVIDIA_VISIBLE_DEVICES` environment variable
2. LXC's `conf.c` runs the shell script `/usr/share/lxc/hooks/nvidia` (in tree `lxc/hooks/nvidia`)
3. The hook invokes the `nvidia-container-cli` program (part of `libnvidia-container`), which performs the mounts (via `NVML`)
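As a rough illustration of the hand-off in the last step, the sketch below assembles the kind of `nvidia-container-cli configure` command line the LXC `nvidia` hook ends up running. This is not the hook's actual code: the exact flag set is an assumption for illustration (the real hook derives its arguments from the LXC environment), and the command is only printed here, not executed.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// buildConfigureArgs sketches (hypothetically) the nvidia-container-cli
// invocation performed by the LXC nvidia hook: given the device list from
// NVIDIA_VISIBLE_DEVICES, the container rootfs, and the container's PID,
// it assembles the argv. The real hook's flags may differ.
func buildConfigureArgs(devices, rootfs string, pid int) []string {
	return []string{
		"nvidia-container-cli",
		"configure",
		"--no-cgroups",           // assumption: LXC manages cgroups itself
		"--device=" + devices,    // e.g. "all" or a comma-separated list
		"--compute", "--utility", // driver capabilities to expose
		fmt.Sprintf("--pid=%d", pid),
		rootfs,
	}
}

func main() {
	devices := os.Getenv("NVIDIA_VISIBLE_DEVICES") // set earlier by LXD
	if devices == "" {
		devices = "all"
	}
	// Hypothetical rootfs path and PID, for illustration only.
	args := buildConfigureArgs(devices, "/var/lib/lxc/c1/rootfs", 1234)
	fmt.Println(strings.Join(args, " "))
}
```

This is the layer that the proposal below would replace: instead of shelling out to `nvidia-container-cli`, the runtime would read a CDI spec and apply the edits itself.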
While doing this investigation, NVIDIA informed me that `libnvidia-container` (providing `nvidia-container-cli`) is in the process of being deprecated (public link) and that the NVIDIA Container Toolkit (providing the `nvidia-ctk` command) is the way forward. I'll also note that NVIDIA is open to supporting this work from their side 🙂

As far as I understand it, the overall scope of work would be to replace `nvidia-container-cli` with `nvidia-ctk`, plus any transitive work that follows from that.