Add new project proposal to describe nvlink + topology aware scheduling #211
This is a more general question about the feature: once it is in place, how will you verify that NVLink is performing as expected? What tests can confirm that NVLink is actually being used and that it performs as well as expected? The implementation could be correct while NVLink itself underperforms, and it would be good to be able to tell the two cases apart.
Is NVLink something we want users to opt into, so that even when the feature is available they can choose not to use it? They may want to build confidence that NVLink improves their performance by running side-by-side tests with and without it. That would also give us a KPI on how many users switch to NVLink and how useful the feature is once deployed.
projects/PROJ-206-nvlink-support/PROJ-206-topology-aware-scheduling.md
Another thing to consider: can we change the resources list to show zones, racks, and so on visually, so users can see why their workflow isn't scheduling even though resources are available?
Yeah, we need a more comprehensive test workflow that is sensitive to network speed. Perhaps we can use https://github.com/NVIDIA/nccl-tests instead of model inference/training. I added an open item for this.
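As a rough illustration of what such a network-sensitive check could look like, the sketch below turns an all-reduce timing into a bus bandwidth figure using the correction factor documented by nccl-tests, 2(n-1)/n, and compares it with an expected value. The function names, the expected bandwidth, and the 0.8 tolerance are assumptions made for illustration, not part of the proposal.

```python
def allreduce_bus_bandwidth(message_bytes, elapsed_seconds, num_ranks):
    """Return the all-reduce bus bandwidth in GB/s.

    Algorithm bandwidth is bytes / time; nccl-tests scales it by
    2 * (n - 1) / n for all-reduce to get a rank-count-independent figure.
    """
    if num_ranks < 2 or elapsed_seconds <= 0:
        raise ValueError("need at least 2 ranks and a positive elapsed time")
    algbw_gbps = message_bytes / elapsed_seconds / 1e9
    return algbw_gbps * 2 * (num_ranks - 1) / num_ranks


def link_looks_healthy(measured_gbps, expected_gbps, tolerance=0.8):
    """Hypothetical pass/fail rule: measured must reach a fraction of expected.

    Both expected_gbps and tolerance are placeholders to be chosen per
    hardware generation; they are not values taken from the design doc.
    """
    return measured_gbps >= tolerance * expected_gbps


if __name__ == "__main__":
    # Example: a 1 GB all-reduce across 8 GPUs that took 10 ms.
    busbw = allreduce_bus_bandwidth(1e9, 0.01, 8)
    print(f"bus bandwidth: {busbw:.1f} GB/s")
    print("healthy:", link_looks_healthy(busbw, expected_gbps=200.0))
```

Running the same check on a topology-aware placement and on a placement that spans NVLink domains would give the side-by-side comparison asked for above.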
Good idea, but it will take some design work to do this right. I've added an open item for this in the topology-aware-scheduling doc.
You can opt not to use it by setting |
Description
Add project design document for NVLink + topology-aware scheduling support.
Issue #206
Checklist