[Feature Request] Add Aliases in Device spec #86

Open
yeahdongcn opened this issue Oct 24, 2022 · 8 comments

Comments

@yeahdongcn

I'm using Nvidia Container Toolkit to provision GPU-capable containers for AI training. After I shifted to CDI, I could only refer to GPU devices by index rather than by UUID. So I think it would be good if Aliases (or something similar) could be introduced into specs.Device. As an end user, I could then specify a device by either index or UUID. From the Nvidia Container Toolkit perspective, the user experience would become more consistent.

@@ -17,6 +17,7 @@ type Spec struct {
 // Device is a "Device" a container runtime can add to a container
 type Device struct {
        Name           string         `json:"name"`
+       Aliases        []string       `json:"aliases"`
        ContainerEdits ContainerEdits `json:"containerEdits"`
 }
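
To illustrate how a consumer might use the proposed field, here is a minimal sketch. It assumes the `Aliases` field from the diff above; `resolveDevice` is a hypothetical helper, `ContainerEdits` is cut down to a single field, and the UUID string is a placeholder. None of this is part of the actual CDI package.

```go
package main

import "fmt"

// ContainerEdits is reduced to one field for this sketch; the real CDI type has more.
type ContainerEdits struct {
	Env []string `json:"env,omitempty"`
}

// Device mirrors the struct in the diff above, including the proposed Aliases field.
type Device struct {
	Name           string         `json:"name"`
	Aliases        []string       `json:"aliases"`
	ContainerEdits ContainerEdits `json:"containerEdits"`
}

// resolveDevice is a hypothetical helper (not a CDI API): it returns the first
// device whose Name, or any of whose Aliases, equals the requested name.
func resolveDevice(devices []Device, requested string) (Device, bool) {
	for _, d := range devices {
		if d.Name == requested {
			return d, true
		}
		for _, alias := range d.Aliases {
			if alias == requested {
				return d, true
			}
		}
	}
	return Device{}, false
}

func main() {
	devices := []Device{
		// "GPU-example-uuid" is a placeholder, not a real GPU UUID.
		{Name: "0", Aliases: []string{"GPU-example-uuid"}},
	}
	for _, name := range []string{"0", "GPU-example-uuid", "1"} {
		d, ok := resolveDevice(devices, name)
		fmt.Printf("%q: found=%v, canonical name=%q\n", name, ok, d.Name)
	}
}
```

With something like this, one spec entry would answer to both the index and the UUID, so the spec would not need to be regenerated to switch naming schemes.
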
@elezar
Contributor

elezar commented Oct 26, 2022

@yeahdongcn aliases were part of the original proposal but were removed to simplify the API once we started actively developing this. It would definitely be worth including them again.

As a matter of interest, are you working in an environment where you need both simultaneously, or would more flexibility in generating the spec be sufficient to cover your use cases?

@yeahdongcn
Author

yeahdongcn commented Oct 27, 2022

I'm still migrating from the previous version of Nvidia Container Toolkit to the latest one. nvidia-ctk info generate-cdi is configured to run at system startup, since the physical GPU card/slot may change during my tests.

For validation training I prefer to use the index, while for performance tuning I use the UUID (which seems more precise to me). Previously this could be done with --gpu, but now I need 2 versions of nvidia-ctk and have to regenerate the CDI config file whenever I switch between the two kinds of test.

@elezar
Contributor

elezar commented Mar 16, 2023

@yeahdongcn I was just thinking about this and realized that if you generate two specs with nvidia-ctk cdi generate (the final CLI as of the v1.12.0 release), then both device names would be available:

sudo nvidia-ctk cdi generate --device-name-strategy=uuid --output /etc/cdi/nvidia-uuid.yaml
sudo nvidia-ctk cdi generate --device-name-strategy=index --output /etc/cdi/nvidia-index.yaml

(the latter being the default)

When checking the specifications you would appear to have double the number of devices, but since CDI is not intended to restrict or count resources in this way, this should not matter. There will also be duplicate nvidia.com/gpu=all devices, but these should be handled by the CDI package's conflict resolution.

@yeahdongcn
Author

@elezar Thank you for letting me know about the new usage. Do you mean that, after the CDI specs are generated, /etc/cdi/nvidia-uuid.yaml and /etc/cdi/nvidia-index.yaml will both be picked up by Nvidia Container Toolkit? I'll give it a try later.

@elezar
Contributor

elezar commented Mar 17, 2023

Any CDI client (consumer) such as podman, crio, containerd, or the nvidia-container-runtime in CDI mode loads all spec files to determine which valid CDI devices exist. Any of these will therefore have loaded both specs and will see both sets of device names as valid.
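
The effect of loading every spec file can be sketched roughly as follows. This is an illustration only, not the CDI package's loader: the `Spec` type, the `qualifiedNames` helper, and the placeholder UUID are assumptions made for the example, and the `all` device is left out here. Qualified names are built as kind=name, matching the nvidia.com/gpu=all form used in this thread.

```go
package main

import (
	"fmt"
	"sort"
)

// Spec is a hypothetical, stripped-down stand-in for a parsed CDI spec file.
type Spec struct {
	Path    string   // file the spec was read from
	Kind    string   // for example "nvidia.com/gpu"
	Devices []string // device names defined in this file
}

// qualifiedNames returns the sorted union of kind=name entries across all
// specs, which is roughly what a consumer sees after loading every file.
func qualifiedNames(specs []Spec) []string {
	seen := map[string]bool{}
	var names []string
	for _, s := range specs {
		for _, d := range s.Devices {
			q := s.Kind + "=" + d
			if !seen[q] {
				seen[q] = true
				names = append(names, q)
			}
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	specs := []Spec{
		// "GPU-example-uuid" is a placeholder, not a real GPU UUID.
		{Path: "/etc/cdi/nvidia-uuid.yaml", Kind: "nvidia.com/gpu", Devices: []string{"GPU-example-uuid"}},
		{Path: "/etc/cdi/nvidia-index.yaml", Kind: "nvidia.com/gpu", Devices: []string{"0"}},
	}
	for _, name := range qualifiedNames(specs) {
		fmt.Println(name)
	}
}
```

Because the names from both files end up in one set, a request for either the index-based or the UUID-based name would resolve.
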

@elezar
Contributor

elezar commented Mar 21, 2023

@yeahdongcn I have just done a quick test myself, and the duplicate all devices in the two specs generated by the commands above do cause issues when injecting devices. This means that the all device should be removed from at least one of the specs.

We will update our tooling to make this easier.
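
Until the tooling handles this, the conflict can be reasoned about with a small sketch like the one below. It is a hedged illustration rather than how the CDI package actually resolves conflicts: `Spec`, `conflicts`, and `withoutDevice` are hypothetical names, the UUID is a placeholder, and the specs are held in memory instead of being parsed from the YAML files.

```go
package main

import "fmt"

// Spec is a hypothetical, stripped-down stand-in for a parsed CDI spec file.
type Spec struct {
	Path    string
	Kind    string
	Devices []string
}

// conflicts maps each kind=name entry that appears more than once to the
// files that define it.
func conflicts(specs []Spec) map[string][]string {
	owners := map[string][]string{}
	for _, s := range specs {
		for _, d := range s.Devices {
			q := s.Kind + "=" + d
			owners[q] = append(owners[q], s.Path)
		}
	}
	for q, paths := range owners {
		if len(paths) < 2 {
			delete(owners, q) // keep only names with more than one definition
		}
	}
	return owners
}

// withoutDevice returns a copy of s with the named device removed.
func withoutDevice(s Spec, name string) Spec {
	kept := make([]string, 0, len(s.Devices))
	for _, d := range s.Devices {
		if d != name {
			kept = append(kept, d)
		}
	}
	s.Devices = kept
	return s
}

func main() {
	// "GPU-example-uuid" is a placeholder, not a real GPU UUID.
	uuid := Spec{Path: "/etc/cdi/nvidia-uuid.yaml", Kind: "nvidia.com/gpu", Devices: []string{"GPU-example-uuid", "all"}}
	index := Spec{Path: "/etc/cdi/nvidia-index.yaml", Kind: "nvidia.com/gpu", Devices: []string{"0", "all"}}

	for name, paths := range conflicts([]Spec{uuid, index}) {
		fmt.Printf("%s is defined in %v\n", name, paths)
	}

	// Drop "all" from one spec, as suggested above, and check again.
	uuid = withoutDevice(uuid, "all")
	fmt.Println("conflicts remaining:", len(conflicts([]Spec{uuid, index})))
}
```

In practice the same result would come from deleting the all entry from one of the two generated YAML files by hand.
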

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

This issue was automatically closed due to inactivity.

github-actions bot closed this as not planned (stale) on Jun 29, 2024
elezar reopened this on Jan 7, 2025