Add option in toolkit container to enable CDI in runtime #838

Open

cdesiniotis wants to merge 4 commits into main from enable-cdi-toolkit-container

Conversation

@cdesiniotis (Contributor) commented Dec 18, 2024

These changes add an option to enable CDI in container engines (containerd, crio, docker) from the toolkit container.

This can be triggered through the --enable-cdi-in-runtime command-line option or the RUNTIME_ENABLE_CDI environment variable.
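
For context, here is a minimal, self-contained sketch of how such an option might be wired up, assuming the urfave/cli v2 conventions that the flag definitions below appear to use; the exact flag name, env var, and destination field in this PR may differ:

package main

import (
	"fmt"
	"os"

	"github.com/urfave/cli/v2"
)

// options loosely mirrors the toolkit's runtime options struct; the real
// struct and field names differ (see the review discussion below).
type options struct {
	EnableCDI bool
}

func main() {
	opts := options{}
	app := &cli.App{
		Name: "toolkit-sketch",
		Flags: []cli.Flag{
			&cli.BoolFlag{
				// Flag and env-var names follow the PR description; the
				// review below discusses alternatives such as "runtime-enable-cdi".
				Name:        "enable-cdi-in-runtime",
				Usage:       "Enable Container Device Interface (CDI) in the configured runtime",
				Destination: &opts.EnableCDI,
				EnvVars:     []string{"RUNTIME_ENABLE_CDI"},
			},
		},
		Action: func(*cli.Context) error {
			fmt.Printf("enable CDI in runtime: %v\n", opts.EnableCDI)
			return nil
		},
	}
	if err := app.Run(os.Args); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}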

@cdesiniotis force-pushed the enable-cdi-toolkit-container branch 2 times, most recently from 182d161 to 0aaafb4 on December 18, 2024 20:28
@@ -89,6 +90,13 @@ func Flags(opts *Options) []cli.Flag {
EnvVars: []string{"NVIDIA_RUNTIME_SET_AS_DEFAULT", "CONTAINERD_SET_AS_DEFAULT", "DOCKER_SET_AS_DEFAULT"},
Hidden: true,
},
&cli.BoolFlag{
Name: "runtime-enable-cdi",
Contributor

Let the flag be passed as --enable-cdi. The env var and the Options variable can have the Runtime/RUNTIME prefix. This is consistent with the other flags like socket, restart-mode, etc.

Suggested change
Name: "runtime-enable-cdi",
Name: "enable-cdi",

Member

Let's try to be consistent with the option for the nvidia-ctk runtime configure command.

Contributor Author

The problem is that we already have a cdi-enabled / enable-cdi flag defined in the toolkit options which triggers the generation of a CDI spec:

&cli.BoolFlag{
Name: "cdi-enabled",
Aliases: []string{"enable-cdi"},
Usage: "enable the generation of a CDI specification",
Destination: &opts.cdiEnabled,
EnvVars: []string{"CDI_ENABLED", "ENABLE_CDI"},
},

Any ideas on what the name of the new flag should be?

Member

Thanks for the clarification.

One question I have is whether we need a separate flag at all. Would it not make sense for the same flag / env var (e.g. CDI_ENABLED) to also be used to trigger enabling CDI in the runtime? Do older versions of containerd complain about invalid config options?

If we feel we want finer control, then I'm OK with using RUNTIME_ENABLE_CDI or ENABLE_CDI_IN_RUNTIME (maybe the latter).

@@ -89,6 +90,13 @@ func Flags(opts *Options) []cli.Flag {
EnvVars: []string{"NVIDIA_RUNTIME_SET_AS_DEFAULT", "CONTAINERD_SET_AS_DEFAULT", "DOCKER_SET_AS_DEFAULT"},
Hidden: true,
},
&cli.BoolFlag{
Name: "runtime-enable-cdi",
Usage: "Enable CDI in the configured runtime",
Contributor

Suggested change
Usage: "Enable CDI in the configured runtime",
Usage: "Enable Container Device Interface (CDI) in the configured runtime",

@@ -163,3 +163,7 @@ func (c *ConfigV1) GetRuntimeConfig(name string) (engine.RuntimeConfig, error) {
tree: runtimeData,
}, nil
}

func (c *ConfigV1) EnableCDI() {
c.Set("enable_cdi", true)
Member

I think we should be able to cast c to a Config type and call EnableCDI here instead of reimplementing it.
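
For illustration, a sketch of the delegation this comment suggests, under the assumption that ConfigV1 is defined in terms of Config (the types below are simplified stand-ins for the ones in this package):

package containerd

// Config and ConfigV1 are simplified stand-ins for the real containerd
// config types; the assumption is that they share an underlying representation.
type Config map[string]interface{}

type ConfigV1 Config

// EnableCDI sets enable_cdi to true in the config.
func (c *Config) EnableCDI() {
	(*c)["enable_cdi"] = true
}

// EnableCDI delegates to the shared implementation via a type conversion
// instead of reimplementing the setter.
func (c *ConfigV1) EnableCDI() {
	(*Config)(c).EnableCDI()
}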

@@ -24,6 +24,7 @@ type Interface interface {
RemoveRuntime(string) error
Save(string) (int64, error)
GetRuntimeConfig(string) (RuntimeConfig, error)
EnableCDI()
Member

I think Set is only used to enable CDI. Let's remove that from the interface if that's the case.
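
For illustration, the interface shape this would leave, as a sketch only (other methods are omitted and RuntimeConfig is stubbed out here):

package engine

// RuntimeConfig is a placeholder; the real definition lives in this package.
type RuntimeConfig interface{}

// Interface with EnableCDI in place of the general-purpose Set method,
// assuming Set has no remaining callers.
type Interface interface {
	RemoveRuntime(string) error
	Save(string) (int64, error)
	GetRuntimeConfig(string) (RuntimeConfig, error)
	EnableCDI()
}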

SetAsDefault bool
RestartMode string
HostRootMount string
RuntimeEnableCDI bool
Member

I don't agree with the Runtime* prefix here. Runtime in this context refers to the NVIDIA runtime. I would just use EnableCDI here.

elezar previously requested changes Dec 20, 2024

@elezar (Member) left a comment

Some minor comments.

@elezar force-pushed the enable-cdi-toolkit-container branch from 0aaafb4 to d4b8bc5 on January 16, 2025 13:58
Comment on lines 58 to 61
if o.RuntimeEnableCDI {
cfg.Set("features", map[string]bool{"cdi": true})
}

Member

Note that we're already enabling CDI in the Configure call, so there is no need to do it here too.

Suggested change
if o.RuntimeEnableCDI {
cfg.Set("features", map[string]bool{"cdi": true})
}

Comment on lines 135 to 137
if opts.RuntimeEnableCDI {
opts.RuntimeEnableCDI = false
}
Member

We already handle the case that EnableCDI is a no-op in the case of CRI-O.

Suggested change
if opts.RuntimeEnableCDI {
opts.RuntimeEnableCDI = false
}
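
To illustrate what "already handled" could look like on the engine side, a sketch only (the actual CRI-O config type in this repo may differ):

package crio

// Config is a simplified stand-in for the real CRI-O config type.
type Config map[string]interface{}

// EnableCDI is a no-op for CRI-O, so callers such as the toolkit installer
// do not need the special case suggested for removal above.
func (c *Config) EnableCDI() {}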

@@ -103,6 +103,11 @@ func (c Config) DefaultRuntime() string {
return r
}

// EnableCDI sets features.cdi to true in the docker config.
func (c *Config) EnableCDI() {
c.Set("features", map[string]bool{"cdi": true})
Member

Out of scope for this PR: I've just realized that we would overwrite any existing features here.
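
For reference, a possible non-clobbering variant (out of scope here, as noted); this sketch assumes a map-backed config type, which may not match the repo's actual docker Config:

package docker

// Config is a simplified stand-in for the daemon.json representation.
type Config map[string]interface{}

// EnableCDI merges "cdi": true into any existing features map instead of
// replacing the whole map, so other feature flags are preserved.
func (c *Config) EnableCDI() {
	features, ok := (*c)["features"].(map[string]interface{})
	if !ok || features == nil {
		features = make(map[string]interface{})
	}
	features["cdi"] = true
	(*c)["features"] = features
}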

@elezar force-pushed the enable-cdi-toolkit-container branch from d4b8bc5 to edeeaa4 on January 16, 2025 14:39
@elezar force-pushed the enable-cdi-toolkit-container branch from edeeaa4 to 67d6718 on January 27, 2025 12:29
cdesiniotis and others added 4 commits January 27, 2025 15:51
This change adds an EnableCDI method to the container engine config files and updates the 'nvidia-ctk runtime configure' command to use this new method.

Signed-off-by: Christopher Desiniotis <[email protected]>
@elezar force-pushed the enable-cdi-toolkit-container branch from 67d6718 to 2b417c1 on January 27, 2025 14:52
@elezar dismissed their stale review on January 27, 2025 15:01

Changes applied.

@elezar (Member) commented Jan 27, 2025

@klueska as discussed today, I have rebased this PR.
