Add new project proposal to describe nvlink + topology aware scheduling #211

ecolternv · 2026-01-08T21:36:11Z

Description

Add project design document for nvlink + topology aware scheduling support

Issue #206

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

RyaliNvidia · 2026-01-15T00:44:34Z

This is a more general question about the feature: once it is in place, how will you verify that NVLink is performing as expected? What kind of tests can verify that NVLink is actually being used and that it is performing as well as expected? The implementation could be correct while NVLink itself underperforms, and it would be nice to be able to tell those cases apart.

RyaliNvidia · 2026-01-15T00:46:49Z

Is NVLink a feature we want users to opt into, so that even once it is available they can choose not to use it? They might want to build confidence that NVLink is improving their performance, so they would want to run side-by-side tests with and without NVLink.

This way, we could get a KPI on how many people are switching to NVLink and on how useful the feature is once deployed.

projects/PROJ-206-nvlink-support/PROJ-206-topology-aware-scheduling.md

RyaliNvidia · 2026-01-15T00:55:22Z

Another thing to consider is if we can change the resources list to show the zones/racks/whatnot visually so users know why their workflow isn't scheduling even though there are resources available.

github-actions · 2026-01-26T21:24:30Z

PR Preview Action v1.8.1
🚀 View preview at https://NVIDIA.github.io/OSMO/pr-preview/pr-211/
Built to branch `gh-pages/documentation` at 2026-01-26 21:48 UTC. Preview will be ready when the GitHub Pages deployment is complete.

ecolternv · 2026-01-26T21:41:49Z

> This is a more general question about the feature: once it is in place, how will you verify that NVLink is performing as expected? What kind of tests can verify that NVLink is actually being used and that it is performing as well as expected? The implementation could be correct while NVLink itself underperforms, and it would be nice to be able to tell those cases apart.

Yeah, we need a more comprehensive test workflow that is sensitive to network speed. Perhaps we can use https://github.com/NVIDIA/nccl-tests instead of doing model inference/training.

I added an open item for this.
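As a rough illustration of what a bandwidth-sensitive check could look like, the sketch below parses the `# Avg bus bandwidth` summary line that the nccl-tests binaries print at the end of a run and compares it with a threshold. The function names and the threshold value are hypothetical and are not part of the design doc; the all-reduce conversion follows the bus-bandwidth definition in the nccl-tests performance notes.

```python
import re

# Matches the summary line printed at the end of an nccl-tests run,
# e.g. "# Avg bus bandwidth    : 412.337"
_AVG_BUSBW = re.compile(r"^#\s*Avg bus bandwidth\s*:\s*([0-9.eE+-]+)", re.MULTILINE)


def avg_bus_bandwidth(output: str) -> float:
    """Return the average bus bandwidth (GB/s) reported in nccl-tests output."""
    match = _AVG_BUSBW.search(output)
    if match is None:
        raise ValueError("no 'Avg bus bandwidth' line found in output")
    return float(match.group(1))


def allreduce_bus_bandwidth(alg_bw: float, n_ranks: int) -> float:
    """Convert all-reduce algorithm bandwidth to bus bandwidth: algbw * 2(n-1)/n."""
    return alg_bw * 2 * (n_ranks - 1) / n_ranks


def meets_expectation(output: str, min_busbw_gbps: float) -> bool:
    """True if the measured bandwidth reaches the caller-chosen threshold.

    The threshold is an assumption of this sketch; a real value would have to
    come from measurements on the target hardware.
    """
    return avg_bus_bandwidth(output) >= min_busbw_gbps
```

A test workflow could run the benchmark once, feed its stdout to `meets_expectation`, and fail when the result falls below the value expected for an NVLink-connected group.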

ecolternv · 2026-01-26T21:43:59Z

> Another thing to consider is if we can change the resources list to show the zones/racks/whatnot visually so users know why their workflow isn't scheduling even though there are resources available.

Good idea, but it will take some design to do this right. I've added an open item for this in the topology-aware-scheduling doc.
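To make the "resources are available but the workflow still does not schedule" case concrete, here is a minimal sketch that groups free GPUs by rack and produces the kind of explanation a resources view might show. The data model (`(node, rack, free_gpus)` tuples) and the rule that a workflow must fit inside a single rack are assumptions for illustration only and do not come from the design doc.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Node = Tuple[str, str, int]  # (node_name, rack, free_gpus) -- hypothetical shape


def free_gpus_by_rack(nodes: Iterable[Node]) -> Dict[str, int]:
    """Sum free GPUs per rack."""
    totals: Dict[str, int] = defaultdict(int)
    for _node, rack, free in nodes:
        totals[rack] += free
    return dict(totals)


def explain_pending(nodes: Iterable[Node], gpus_needed: int) -> str:
    """Describe whether a request fits in one rack, and if not, why."""
    per_rack = free_gpus_by_rack(nodes)
    total = sum(per_rack.values())
    fits = sorted(rack for rack, free in per_rack.items() if free >= gpus_needed)
    if fits:
        return f"schedulable in rack(s): {', '.join(fits)}"
    if total >= gpus_needed:
        # Enough GPUs overall, but fragmented across racks.
        breakdown = ", ".join(f"{rack}={free}" for rack, free in sorted(per_rack.items()))
        return (f"{total} GPUs free in total, but no single rack has "
                f"{gpus_needed} ({breakdown})")
    return f"only {total} GPUs free, {gpus_needed} requested"
```

With two racks holding 6 free GPUs each, a request for 8 GPUs reports 12 free in total but no rack with 8, which is the situation the comment above describes.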

ecolternv · 2026-01-26T21:46:03Z

> Is NVLink a feature we want users to opt into, so that even once it is available they can choose not to use it? They might want to build confidence that NVLink is improving their performance, so they would want to run side-by-side tests with and without NVLink.
>
> This way, we could get a KPI on how many people are switching to NVLink and on how useful the feature is once deployed.

You can opt out by setting `NCCL_MNNVL_ENABLE=0`. In that case the compute domain just won't be used, which shouldn't cause any problems as far as I'm aware.
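For the side-by-side comparison, one possible approach is to launch the same workload twice with different environments. The sketch below builds those environments; `nccl_env` is a hypothetical helper, and only the `NCCL_MNNVL_ENABLE` and `NCCL_DEBUG` variables come from NCCL itself. Setting `NCCL_DEBUG=INFO` makes NCCL log its initialization details, which can help confirm which transport was selected, though the exact log lines vary by NCCL version.

```python
import os
from typing import Dict, Mapping, Optional


def nccl_env(use_mnnvl: bool, base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment for an NVLink-on or NVLink-off run.

    Hypothetical helper: `base` defaults to the current process environment
    and is never modified.
    """
    env = dict(os.environ if base is None else base)
    if use_mnnvl:
        # Drop any override so NCCL falls back to its default behaviour.
        env.pop("NCCL_MNNVL_ENABLE", None)
    else:
        # Opt out of multi-node NVLink, as described in the comment above.
        env["NCCL_MNNVL_ENABLE"] = "0"
    # Ask NCCL for informational logs so the two runs can be compared.
    env["NCCL_DEBUG"] = "INFO"
    return env
```

Each environment can then be passed to the launcher, for example `subprocess.run(cmd, env=nccl_env(False))` for the baseline run and `nccl_env(True)` for the NVLink run, where `cmd` is whatever benchmark command is being compared.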

ecolternv requested a review from a team January 8, 2026 21:36

ecolternv added 2 commits January 14, 2026 15:29

Add new project proposal to describe nvlink + topology aware scheduling

f0004d1

Split design into two docs

8ff847e

ecolternv force-pushed the ecolter/nvlink-project-design branch from 8ba0fce to 8ff847e Compare January 14, 2026 20:29

RyaliNvidia closed this Jan 15, 2026

RyaliNvidia reopened this Jan 15, 2026

RyaliNvidia reviewed Jan 15, 2026


projects/PROJ-206-nvlink-support/PROJ-206-topology-aware-scheduling.md (outdated)

Finish docs and add some updates from feedback

8328a51

ecolternv temporarily deployed to internal-ci January 26, 2026 21:23 — with GitHub Actions Inactive

Add some open items

d7cf020

ecolternv temporarily deployed to internal-ci January 26, 2026 21:47 — with GitHub Actions Inactive

RyaliNvidia approved these changes Jan 26, 2026


elookpotts-nvidia approved these changes Jan 27, 2026

