Consider separating `nvidia-ctk hook` from other functions? #435
Comments
Any feedback on this?
This does sound feasible. We would have to provide a migration path for where `nvidia-ctk` is used directly. @klueska what are your thoughts?
The binary is not all that big, and the functionality handled in the hooks part is quite small. I was thinking we could duplicate it: leave it in both `nvidia-ctk` and the new binary for a while.
Hi, coming back to this. I still have interest, and am willing to get it done. Let me know?
Just to make sure I understand -- the proposal is to:

With the primary reason for this being that (at least today) none of the hooks rely on having an NVIDIA driver installed in order to run, and the secondary reason being readability (i.e. pointing to a specific ...). This seems reasonable to me -- with the second argument actually being the stronger one from my perspective, mostly because I don't want to forbid us from creating a hook at some point in the future that does rely on calling into e.g. NVML if necessary.
1. Yes, although I have no opinion on the name; whatever works is fine with me.
2. Yes.
3. Yes. I would extract it all into a Go pkg, and then both can include it and avoid much duplication.
That is a good point. I think it might be easier and cleaner, if we ever get there, to have ...

I will see if I can find time in the next week or two to put together a draft PR.
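The package extraction discussed in the thread could look roughly like the following sketch. All names here are hypothetical and do not reflect the actual nvidia-container-toolkit source layout; the point is only that a single dispatch function, placed in a shared internal package, could be wrapped by both `nvidia-ctk hook` and a new `nvidia-cdi-hook` binary during a duplication/migration period.

```go
// Hypothetical sketch of a shared hook package; names do not
// reflect the real nvidia-container-toolkit source tree.
package main

import (
	"fmt"
	"os"
)

// dispatchHook routes a hook subcommand name to its implementation.
// In a real extraction this would live in a shared internal package
// imported by both binaries, rather than in main.
func dispatchHook(name string, args []string) (string, error) {
	switch name {
	case "chmod", "create-symlinks", "update-ldcache":
		// Real implementations would act on the container rootfs here;
		// this stub just reports what would be run.
		return fmt.Sprintf("would run %q with args %v", name, args), nil
	default:
		return "", fmt.Errorf("unknown hook %q", name)
	}
}

func main() {
	// Both `nvidia-ctk hook <name>` and a split `nvidia-cdi-hook <name>`
	// could reduce to a call like this:
	out, err := dispatchHook("update-ldcache", os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(out)
}
```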
As I work on that open PR - which is ready in theory, although it has issues that are in ...

Would it make sense for these things to be "basic" CDI services? The CDI spec is here; it includes container edit things like ...
@deitch extending the CDI specification to handle these operations as first-class concepts may make sense. The reason these are currently run as hooks is that they depend on the container root, which is not available at the point of CDI spec generation. Could you create an issue against the CDI repository, and I can raise this feature request at a future working group meeting.
Hi Evan, certainly. Just did it, right after you posted your response; see this CDI issue. That doesn't eliminate the need for the PR, which I would like to get through: until the spec extension gets approved (if it gets approved), gets in, and high-level container runtimes like containerd support it, it could be a long time. Do you mind commenting on #474, if you have any understanding of why ...
NVIDIA Container Toolkit 0.16.0 changed the hook arguments in the Container Device Interface specification generated by it [1]. Having the unknown hook arguments show up in the debug logs makes it easier to understand what happened. [1] NVIDIA Container Toolkit commit 179d8655f9b5fce6, NVIDIA/nvidia-container-toolkit@179d8655f9b5fce6, NVIDIA/nvidia-container-toolkit#435, containers#1543
NVIDIA Container Toolkit 0.16.0 changed the hook arguments in the Container Device Interface specification generated by it [1]. Fallout from 649d02f. [1] NVIDIA Container Toolkit commit 179d8655f9b5fce6, NVIDIA/nvidia-container-toolkit@179d8655f9b5fce6, NVIDIA/nvidia-container-toolkit#435, containers#1544
Would the team consider separating `nvidia-ctk hook` from other functions? The rationale is as follows.

The `hook` behaviours - currently `chmod`, `create-symlinks`, and `update-ldcache` - run standard behaviours. They do not depend on anything GPU-specific, on `libnvidia-ml.so`, or on CUDA in any way. The action of generating a CDI spec for a particular device (preparation) and the act of performing `hook` functions (runtime) are distinct, and can be executed at different times. I might even use a corporate or other non-JP OS which is not Ubuntu-based, generate my CDI inside a container, save the CDI YAML, and then have a different process (which does not have access to the correct versions of glibc or `libnvidia-ml.so` or other dependencies) run the hooks.

This would allow me to separate those two functionalities. It also would make the `hook` tool much simpler to build.

One final advantage is understandability. When I run `nvidia-ctk cdi generate`, I have the option `--nvidia-ctk-path <path>`, which really means "what path to `nvidia-ctk` should I place in the CDI YAML?" This can be confusing: aren't I executing `nvidia-ctk` right now? What does it mean to change my path to... myself? If we separate it into, e.g., `nvidia-cdi-hook`, then it becomes easier to understand: `nvidia-ctk cdi generate --nvidia-cdi-hook-path /path/to/nvidia-cdi-hook`.

I would be fine opening a PR for it, but I'm not going to waste the time if the maintainers don't want it.