@ecolternv
Contributor

Description

Add project design document for NVLink and topology-aware scheduling support.

Issue #206

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@ecolternv ecolternv requested a review from a team January 8, 2026 21:36
@ecolternv ecolternv force-pushed the ecolter/nvlink-project-design branch from 8ba0fce to 8ff847e Compare January 14, 2026 20:29
@RyaliNvidia
Contributor

This is a more general question regarding the feature, but once the feature is there, how will you verify that NVLink is performing as expected? What kind of tests can you do to verify that NVLink is actually being used and whether it is performing as well as expected? It could be that there isn't an issue with the implementation and that NVLink is not performing well, but it would be nice to know that.

@RyaliNvidia
Contributor

RyaliNvidia commented Jan 15, 2026

Is NVLink a feature we want users to opt into, so that even if the feature is there, they can choose not to use it? They might want to build confidence that NVLink is improving their performance, so they would want to run side-by-side tests with and without NVLink.

This way, we could get a KPI on how many people are switching to NVLink and on the usefulness of this feature once deployed.

@RyaliNvidia RyaliNvidia reopened this Jan 15, 2026
@RyaliNvidia
Contributor

Another thing to consider is if we can change the resources list to show the zones/racks/whatnot visually so users know why their workflow isn't scheduling even though there are resources available.

@github-actions

github-actions bot commented Jan 26, 2026

PR Preview Action v1.8.1

🚀 View preview at
https://NVIDIA.github.io/OSMO/pr-preview/pr-211/

Built to branch gh-pages/documentation at 2026-01-26 21:48 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@ecolternv
Contributor Author

> This is a more general question regarding the feature, but once the feature is there, how will you verify that NVLink is performing as expected? What kind of tests can you do to verify that NVLink is actually being used and whether it is performing as well as expected? It could be that there isn't an issue with the implementation and that NVLink is not performing well, but it would be nice to know that.

Yeah, we need a more comprehensive test workflow that is sensitive to network speed. Perhaps we can use https://github.com/NVIDIA/nccl-tests instead of doing model inference/training.

I added an open item for this.
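
As a rough illustration of what a bandwidth-sensitive check could look like, here is a minimal sketch that reads the summary line from an nccl-tests run and compares it to a threshold. It assumes the binaries print a line of the form `# Avg bus bandwidth    : <value>` (in GB/s). The function names and the threshold are hypothetical and are not part of OSMO or nccl-tests.

```python
import re
from typing import Optional

# Assumed format of the summary line printed at the end of an nccl-tests run,
# for example: "# Avg bus bandwidth    : 412.37"
_AVG_BUSBW = re.compile(
    r"^#\s*Avg bus bandwidth\s*:\s*([0-9]+(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?)\s*$",
    re.MULTILINE,
)


def parse_avg_bus_bandwidth(output: str) -> Optional[float]:
    """Return the reported average bus bandwidth, or None if the line is absent."""
    match = _AVG_BUSBW.search(output)
    return float(match.group(1)) if match else None


def meets_expectation(output: str, min_gbps: float) -> bool:
    """True if the reported bandwidth reaches the (hypothetical) threshold."""
    bandwidth = parse_avg_bus_bandwidth(output)
    return bandwidth is not None and bandwidth >= min_gbps
```

A test workflow could capture the stdout of the benchmark and fail when `meets_expectation` returns False. The right threshold depends on the hardware and would need to be chosen per cluster.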

@ecolternv
Contributor Author

> Another thing to consider is if we can change the resources list to show the zones/racks/whatnot visually so users know why their workflow isn't scheduling even though there are resources available.

Good idea, but it will take some design to do this right. I've added an open item for this in the topology-aware-scheduling doc.

@ecolternv
Contributor Author

> Is NVLink a feature we want users to opt into, so that even if the feature is there, they can choose not to use it? They might want to build confidence that NVLink is improving their performance, so they would want to run side-by-side tests with and without NVLink.
>
> This way, we could get a KPI on how many people are switching to NVLink and on the usefulness of this feature once deployed.

You can opt out by setting NCCL_MNNVL_ENABLE=0. In that case the compute domain just won't be used, which shouldn't cause any problems as far as I'm aware.
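
For the side-by-side comparison, a minimal sketch of how the two runs could be set up is shown here. It only prepares environment variables and computes a ratio. `nccl_env` and `speedup` are hypothetical helper names, and leaving `NCCL_MNNVL_ENABLE` unset for the "enabled" run is an assumption that NCCL's default behavior is what you want to measure.

```python
import os
from typing import Dict, Mapping, Optional


def nccl_env(use_mnnvl: bool, base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build the environment for one benchmark run.

    With use_mnnvl=False, NCCL_MNNVL_ENABLE is set to "0" (the opt-out above).
    With use_mnnvl=True, the variable is removed so NCCL falls back to its default.
    """
    env = dict(os.environ if base is None else base)
    if use_mnnvl:
        env.pop("NCCL_MNNVL_ENABLE", None)
    else:
        env["NCCL_MNNVL_ENABLE"] = "0"
    return env


def speedup(with_gbps: float, without_gbps: float) -> float:
    """Ratio of measured bandwidth with the feature to bandwidth without it."""
    if without_gbps <= 0:
        raise ValueError("baseline bandwidth must be positive")
    return with_gbps / without_gbps
```

Each environment can be passed to `subprocess.run(..., env=env)` when launching the same benchmark twice, and the two measured bandwidths can then be compared with `speedup`.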
