Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace TAS plugin with Kueue #3534

Open
wants to merge 5 commits into
base: develop
Choose a base branch
from

Conversation

annuay-google
Copy link
Contributor

@annuay-google annuay-google commented Jan 14, 2025

What?

  • Replace TAS plugin with Kueue for job scheduling
  • Deprecate TAS plugin (from A3U blueprints) - full module deprecation to follow soon
  • Update Jobset based NCCL test to use Kueue TAS in place of TAS plugin
  • Add default Kueue configuration

Why?

Kueue is the officially supported job scheduling utility for A3 Ultra

Testing

Ran Jobset based NCCL test, verified bandwidth figures

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch #
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines #

@annuay-google annuay-google changed the base branch from main to develop January 14, 2025 22:14
@annuay-google annuay-google added the release-improvements Added to release notes under the "Improvements" heading. label Jan 14, 2025
@annuay-google annuay-google added release-key-new-features Added to release notes under the "Key New Features" heading. and removed release-improvements Added to release notes under the "Improvements" heading. labels Jan 14, 2025
Copy link
Contributor

@ankitkinra ankitkinra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to delete the 2 node nccl test ?

- id: workload-manager-install
source: modules/management/kubectl-apply
use: [a3-ultragpu-cluster]
settings:
kueue:
install: true
version: v0.10.0
config_path: $(vars.kueue_configuration_path)
config_template_vars:
num_gpus: $(a3-ultragpu-pool.total_gpu_count)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is super, so just to confirm if the nodepool size is changed via the toolkit, this config will be updated ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes- this number is fetched as an output of the node pool module. Calculated as allocatable_gpu_per_node * static_node_count. This is done so that user doesn't have to pass a redundant input (num_gpus)

@annuay-google
Copy link
Contributor Author

Do we need to delete the 2 node nccl test ?

Deleting this as well, since we have an n-node test now

@annuay-google
Copy link
Contributor Author

Do we need to delete the 2 node nccl test ?

Deleting this as well, since we have an n-node test now

Deleted

@annuay-google annuay-google added the do-not-merge Block merging of this PR label Jan 14, 2025
@@ -50,6 +50,11 @@ output "allocatable_gpu_per_node" {
value = local.allocatable_gpu_per_node
}

output "total_gpu_count" {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if this should be static_gpu_count. Might have implications down the road for configuring Kueue with DWS Flex, since DWS Flex + Kueue could be configured with the total maximum scale.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge Block merging of this PR release-key-new-features Added to release notes under the "Key New Features" heading.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants