-
Notifications
You must be signed in to change notification settings - Fork 78
Update SkyPilot to use volume mounts and added support for multi-node #301
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: ansjindal <[email protected]>
| if cloud == "k8s": | ||
| # VolumeConfig region and zone required even though they are marked as optional | ||
| # validation fails otherwise | ||
| config["cloud"] = "kubernetes" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this value needs to be kubernetes - based on the providers supported list in skypilot
|
Can you rebase on latest main? Skypilot was just updated to 0.10.0 there. |
Signed-off-by: ansjindal <[email protected]>
Signed-off-by: ansjindal <[email protected]>
d6727d8 to
26bec77
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also add some basic unit tests for the codecov CI check to pass? Thanks a lot for your contribution.
nemo_run/core/execution/skypilot.py
Outdated
| launcher = self.launcher | ||
| # Dynamic rendezvous has an error in Skypilot Kubernetes currently | ||
| if ( | ||
| launcher | ||
| and isinstance(launcher, (Torchrun, FaultTolerance)) | ||
| and self.cloud == "kubernetes" | ||
| ): | ||
| launcher.rdzv_backend = "static" | ||
| launcher.rdzv_port = 49500 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need this part anymore, its fixed in the latest version of Skypilot.
Signed-off-by: ansjindal <[email protected]>
Signed-off-by: ansjindal <[email protected]>
|
is there something needed more on this PR? |
volume_mountsto use pre-created 'nemo-workspace' PVC mapping to K8s Volume Mounts.volumesto use volumes in the pod.SkypilotExecutordid not support the launcher API (supports_launcher_transform() = False), preventing it from using torchrun for multi-node distributed training.This PR enables launcher API support in
SkypilotExecutorChanges Made
supports_launcher_transform() -> Truelauncher="torchrun"whennum_nodes > 1.Older PR got closed by mistake: #296
Example:
Some logs: