Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add support for GPU resources #1701

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

miguelvr
Copy link
Contributor

@miguelvr miguelvr commented Dec 2, 2024

Description, Motivation and Context

nvidia.com/gpu is a common resource for many AI applications, but currently, Troubleshoot does not allow users to check the availability of GPUs in their cluster.

I've considered taking a more generic approach with a function resourceName(<resource>), but decided against it because it would either require taking in the units as a parameter resourceName(<resource>, <units>) or default to something like DecimalSI, which could lead to errors for certain resources.

For simplicity, I decided to go with the more explicit approach.

Checklist

  • New and existing tests pass locally with introduced changes.
  • Tests for the changes have been added (for bug fixes / features)
  • The commit message(s) are informative and highlight any breaking changes
  • Any documentation required has been added/updated. For changes to https://troubleshoot.sh/ create a PR here

Does this PR introduce a breaking change?

  • Yes
  • No

@miguelvr miguelvr requested a review from a team as a code owner December 2, 2024 11:59
@marccampbell marccampbell added the type::feature New feature or request label Dec 2, 2024
@banjoh
Copy link
Member

banjoh commented Dec 2, 2024

Thanks for your contribution. We'll review the PR

Copy link
Member

@banjoh banjoh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You took the right direction, but I'm concerned that a GPU device only considers Nvidia hardware. Users who have gpu.intel.com/i915 or gpu.intel.com/xe for example, cannot use this analyser.

@banjoh
Copy link
Member

banjoh commented Dec 2, 2024

You took the right direction, but I'm concerned that a GPU device only considers Nvidia hardware. Users who have gpu.intel.com/i915 or gpu.intel.com/xe for example, cannot use this analyser.

I don't yet have a clean suggestion that works, but I'd explore combining filters and outcomes.

@xavpaice
Copy link
Member

xavpaice commented Dec 2, 2024

@xavpaice
Copy link
Member

xavpaice commented Dec 2, 2024

You took the right direction, but I'm concerned that a GPU device only considers Nvidia hardware. Users who have gpu.intel.com/i915 or gpu.intel.com/xe for example, cannot use this analyser.

I think at this stage we can document the limitation and work with just Nvidia, expanding the scope in a future PR if there's demand.

@miguelvr
Copy link
Contributor Author

miguelvr commented Dec 2, 2024

The more generic approach would be fetching by resource name. However it would have to be a function resourceCapacity(nvidia.com/gpu, DecimalSI)

This would cover any custom resource added by any device plugin.

I started on this, but I was having a hard time getting the regex to work as I wanted it to

@xavpaice
Copy link
Member

xavpaice commented Dec 3, 2024

#1162 is related

@chris-sanders
Copy link
Member

@miguelvr we were reviewing this and issue #1162 from above and wanted to get your input on this approach.

https://docs.google.com/document/d/1thoEmuHpzf961cHHRreNjVB1rC_u8CVZYBCfxYNg8mY/edit?tab=t.0#heading=h.1q5hq5a3s115

I think that would be a bit more flexible to address the other open issue but want to confirm it would resolve your issue as well. I think it's what you're doing, but just a bit more flexible for the future. I'd like you to review that though and let us know what you think :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type::feature New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants