Sharding GPU support #289

Open
wants to merge 2 commits into
base: main

Conversation

etiennedub
Contributor

@etiennedub etiennedub commented Feb 8, 2024

This PR is based on the MIG changes, because the MIG PR slightly changes the way GPUs are configured.

I added one parameter that sets the number of shards for the whole node. The shards are split evenly between the GPUs on the node. Initially, I wanted to set the shard number per GPU, but that was complicated to configure, even more so considering the MIG setup.
This PR adds a new parameter to each infra to set the "shard" number, similarly to the MIG configuration. If we prefer, we could instead set the shard number directly from profile::slurm::base with the hieradata.
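
To illustrate what "evenly split" could mean, here is a minimal Python sketch of dividing a node-wide shard count across the GPUs of a node. It is not the code from this PR or the Puppet PR: the function name `split_shards` is hypothetical, and giving the remainder to the first GPUs when the count does not divide evenly is an assumption.

```python
def split_shards(total_shards: int, gpu_count: int) -> list[int]:
    """Split a node-wide shard count across GPUs as evenly as possible.

    Assumption: when total_shards is not a multiple of gpu_count,
    the leftover shards go to the first GPUs, one each.
    """
    if gpu_count <= 0:
        raise ValueError("gpu_count must be a positive integer")
    if total_shards < 0:
        raise ValueError("total_shards must not be negative")

    # base: shards every GPU gets; extra: GPUs that get one more shard
    base, extra = divmod(total_shards, gpu_count)
    return [base + 1 if i < extra else base for i in range(gpu_count)]


if __name__ == "__main__":
    print(split_shards(8, 4))   # [2, 2, 2, 2]
    print(split_shards(10, 4))  # [3, 3, 2, 2]
```

A single node-wide value keeps the configuration to one number per infra, at the cost of not being able to weight individual GPUs differently.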

Related Puppet PR: ComputeCanada/puppet-magic_castle#322

@cmd-ntrf
Member

cmd-ntrf commented Apr 9, 2024

@etiennedub Can you rebase this and fix the conflicts now that the MIG PRs have been merged?

@etiennedub etiennedub changed the base branch from mig to main April 12, 2024 12:40