fix: nvls all reduce correction factor #239
Open
+11
−1
I was running nccl-tests on a single H100 server (8xH100 SXM) and saw a reported bus BW of 480 GByte/s even though the line rate is 450 GByte/s. I was confused and looked further into how bus BW is calculated, and it seems to be calculated incorrectly for in-network reduction algorithms. According to #212 (comment), the actual correction factor should be
bus_bw = algo_bw * (n-1)/(n+1)
instead of
bus_bw = algo_bw * 2(n-1)/n
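To make the difference between the two factors concrete, here is a minimal Python sketch that evaluates both for a few rank counts. It is not code from this PR or from nccl-tests; the function names `ring_factor`, `nvls_factor`, and `rescale_bus_bw` are hypothetical, and the NVLS factor is taken as stated above from #212 (comment) rather than derived independently.

```python
def ring_factor(n: int) -> float:
    """All-reduce factor currently applied by nccl-tests: 2(n-1)/n."""
    return 2 * (n - 1) / n


def nvls_factor(n: int) -> float:
    """Factor proposed in this PR for NVLS all-reduce: (n-1)/(n+1)."""
    return (n - 1) / (n + 1)


def rescale_bus_bw(reported_bus_bw: float, n: int) -> float:
    """Convert a bus BW reported with the ring factor to the proposed factor.

    Hypothetical helper: recovers algo_bw from the reported value, then
    reapplies the NVLS factor.
    """
    algo_bw = reported_bus_bw / ring_factor(n)
    return algo_bw * nvls_factor(n)


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(f"n={n}: ring={ring_factor(n):.4f} nvls={nvls_factor(n):.4f}")
```

Since both factors multiply the same algo_bw, the reported bus BW changes only by the ratio of the two factors for a given rank count.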
This PR is probably not mergeable, since NCCL_ALGO can be auto-picked or set in /etc/nccl.conf, and there doesn't seem to be an API for seeing which algorithm NCCL has chosen. The correction factors for CollnetDirect and CollnetChain on the IB network probably need to be updated too. But I just wanted to put it here in case anyone else in the community is confused about how bus BW could be about 106% of the peak theoretical line rate.
Command
NCCL_ALGO=NVLS ./build/all_reduce_perf -b 8K -e 8G -f 2 -g 8
Before
After
Factor vs number of ranks
NVLS read/write