Latest skypilot image does not support azure accelerated networking and nccl #4448
Comments
cc @yika-luo
There seem to be some compatibility issues between Azure's accelerated networking and the NVIDIA NCCL build configured on the SkyPilot custom image. I sought help from Azure support, and here's the response:
So the recommendation is to either use another image or switch to other optimized VM types.
Hi @yika-luo, thanks for looking into this! I'm a bit confused, as this was working before with an older SkyPilot image. I have provided instructions above to replicate the older stack.
Some other notes:
Enabling Azure accelerated networking with the latest SkyPilot image breaks the NCCL test.
Enabling accelerated networking:
Updating nccl_test.py for Azure/debugging:
Output:
Note this part which seems to be the root cause:
Interestingly, if I revert to an older image, it works again:
Note that with the older image you might have to add:
before running the test.
Updated diff:
This leads me to think it is something related to the image.
Accelerated networking is needed to obtain reliable high-bandwidth interconnect for jobs such as dtrain.
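When comparing the two images, NCCL's own debug logging can show which network transport it selects at init time. The following is a minimal sketch, not the diff from this issue: the helper name `nccl_debug_env` is hypothetical, and it assumes the test is launched from Python. `NCCL_DEBUG`, `NCCL_DEBUG_SUBSYS`, and `NCCL_SOCKET_IFNAME` are standard NCCL environment variables.

```python
import os


def nccl_debug_env(base_env=None, iface=None):
    """Return a copy of the environment with verbose NCCL logging enabled.

    Hypothetical helper for illustration only.
    """
    # Copy so the caller's environment (or os.environ) is not mutated.
    env = dict(os.environ if base_env is None else base_env)
    # Print NCCL's informational log lines, including transport selection.
    env["NCCL_DEBUG"] = "INFO"
    # Restrict the log to initialization and network subsystems.
    env["NCCL_DEBUG_SUBSYS"] = "INIT,NET"
    if iface is not None:
        # Optionally pin NCCL's socket transport to one interface.
        env["NCCL_SOCKET_IFNAME"] = iface
    return env


# Example launch (assumes nccl_test.py is in the working directory):
#   import subprocess
#   subprocess.run(["python", "nccl_test.py"], env=nccl_debug_env())
```

Running the test once per image with this environment and diffing the NET lines may help narrow down whether the two images pick different transports.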
Version & Commit info:
sky -v: skypilot, version 0.7.0
sky -c: skypilot, commit 3f62588-dirty