clarify that the measurement is unidirectional #189

Open
wants to merge 1 commit into
base: master

@stas00 stas00 commented Dec 20, 2023

This PR adds a clarification that the measurement is done against a unidirectional bandwidth and not duplex.

Otherwise it's not clear to the user whether they should compare the outcome to the unidirectional or duplex advertised performance of the network.

When I run nccl-tests intra-node on A100s with NVLink 3, the best throughput I get is around 235GBps. That has to be read against the 300GBps advertised unidirectional peak and not the 600GBps duplex figure: ~80% checks out, but 40% doesn't.

Thank you.

@stas00
Author

stas00 commented Jan 2, 2024

@AddyLaddy, do you know whom I should tag to get this PR noticed? Thank you!

@AddyLaddy
Collaborator

I'm not sure I'd describe it as "a unidirectional bandwidth". The tests are usually bidirectional, but we don't double the BW figure when it's reported. I'll see if we can make that clearer in the document.

@stas00
Author

stas00 commented Jan 2, 2024

I'd be totally happy with any other phrasing, @AddyLaddy - the key is to be able to know which of the advertised hardware specs to compare with the nccl-tests' outcome.

And what it reports here corresponds to the unidirectional figure: when I run nccl-tests on a 300GBps unidirectional / 600GBps duplex NVLink node, nccl-tests reports about 240GBps with large payloads.

So I must compare the nccl-tests results to 300GBps (~80% throughput) and not 600GBps (40% throughput). Hence: it's unidirectional.

Bottom line: there is a need for clarity of what type of spec the user compares the reports from nccl-tests to.

We know that ~80% of the theoretical spec is normal, and that at 40% something is wrong.
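To make the comparison concrete, here is a minimal sketch of the arithmetic using the numbers from this thread. The helper name `link_efficiency` is hypothetical and is not part of nccl-tests; it only divides the reported busbw by whichever peak figure you pick.

```python
def link_efficiency(measured_busbw_gbps: float, peak_gbps: float) -> float:
    """Return the fraction of the chosen peak that the measured busbw reaches."""
    if peak_gbps <= 0:
        raise ValueError("peak_gbps must be positive")
    return measured_busbw_gbps / peak_gbps


# Numbers quoted in this thread for an NVLink 3 node (GBps).
measured = 240.0            # busbw reported by nccl-tests at large payloads
per_direction_peak = 300.0  # advertised unidirectional peak
duplex_peak = 600.0         # advertised duplex peak

print(f"vs per-direction peak: {link_efficiency(measured, per_direction_peak):.0%}")
print(f"vs duplex peak:        {link_efficiency(measured, duplex_peak):.0%}")
```

The same measurement looks healthy against one spec and broken against the other, which is why the user needs to know which one applies.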

I hope my presentation of the ambiguity/confusion the user faces here is clear.

@AddyLaddy
Collaborator

I think users should run a utility like nvbandwidth if they want to see bidirectional performance numbers that match the advertised NVLink memory bandwidth. NCCL is a collective communication library which does more than just simple memcpy-like data movements. Also, in the comms world it's not that common to quote the bidirectional BW. For example, I wouldn't expect a 400Gbps CX-7 IB card to report 800Gbps when we measure its performance with the perftest utilities.

@stas00
Author

stas00 commented Jan 3, 2024

yes, but we want to measure the nccl collectives performance, which is essential for doing math to know the overhead of collectives. And in my experience so far it's always "less" than the low-level benchmarks.

I have already validated the node's low-level performance with https://github.com/NVIDIA/cuda-samples/tree/master/Samples/5_Domain_Specific/p2pBandwidthLatencyTest which looks similar to https://github.com/NVIDIA/nvbandwidth (for a single node)

For context here is what I'm trying to accomplish: https://github.com/stas00/ml-engineering/tree/master/network#real-network-throughput

Again, the need is very simple. The context: I, the user, am evaluating whether this provider's throughput, or that particular node setup, is sufficient for my ML training needs.

So the first thing I do is run an all_reduce benchmark and I get some number.

Now I need to know which number to compare against in order to tell whether my setup is configured correctly or not.

In the case of benchmarking an intra-node solution, that would be either the uni- or the bi-directional spec reported by the hardware vendor.

e.g. on AWS p4 nodes I have seen cases where a missing env var made the nccl throughput 4x slower, because it wasn't correctly engaging the proprietary EFA network. So it's crucial that I know which spec I need to compare the nccl report with.

For another context please see https://twitter.com/StasBekman/status/1736870712584056911

Is that any more helpful?

Thank you, @AddyLaddy

@stas00
Author

stas00 commented Jan 3, 2024

Another example: IB specs report unidirectional speeds: https://en.wikipedia.org/wiki/InfiniBand#Performance. That qualification wasn't there before. I had to go and modify that wiki page myself because it gave no indication of whether the numbers were duplex or unidirectional. How is the user to know?

(I changed it after researching many other IB specs online and validating that these numbers were indeed uni-directional)

@stas00
Author

stas00 commented Feb 9, 2024

Any chance this task could be resolved and this very essential nuance be documented? You didn't like my proposal, so please kindly suggest a new one here and I will update the PR with it if that makes things easier.

I see that AMD picked up NVIDIA's strategies and also advertises duplex throughput on their MI* GPU offerings.

But nccl-tests reports unidirectional speed, so any nccl-tests user will be puzzled and will complain to their provider about getting only half of the advertised speed. Let's resolve this by telling users loud and clear that they need to compare the output of nccl-tests to the unidirectional peak performance in the hardware spec.

Thank you so much!

@sjeaugey
Member

sjeaugey commented Feb 9, 2024

Network cards have always been specified by their data rate, i.e. the low-level signaling speed on the wire. A 100Mbps network card works at 100Mbps max in each direction. If it's full duplex, then it can do 100Mbps in both directions simultaneously, but despite some attempts to market that as a "200Mbps" network card, vendors have stuck to the signaling rate. So I don't think it was necessary to change the Infiniband wikipedia page, given this is talking about the data rate of the transceiver.
The page mentioned the links were duplex.

As for the NCCL perf tests, since NCCL operations are collective operations, it gets complicated to start adding up the bandwidth going in multiple directions. To go further, we also had users request that we multiply the number by the number of ranks, reflecting the total amount of data transferred per second across all ranks, since some switch vendors advertise the total bandwidth summed over all the ports.

So we have to go with some measurement which makes the most sense to us, and in our case it is the size of the data fed to the collective operation divided by the time (for broadcast, reduce) which is the definition of a bandwidth. We then multiply that algorithm bandwidth by a correction factor for algorithms like Allreduce which need to transfer more (or less) than the data size, and where that factor depends on the number of ranks, so that we get a number we can compare against some HW characteristics, while also comparing e.g. the broadcast bandwidth and the allreduce bandwidth. On many systems, all collectives should have the same peak BusBw (when using similar algorithms). Doubling the allreduce BusBw would mean also doubling e.g. the broadcast BusBw, which didn't seem intuitive to us.
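The definition above can be sketched in a few lines. This is an illustration rather than the nccl-tests source: the function names are hypothetical, GB is assumed to mean 1e9 bytes, and the allreduce correction factor 2(n-1)/n is taken from my reading of the nccl-tests performance notes. Broadcast and reduce use a factor of 1, matching the description above.

```python
def algbw_gbps(size_bytes: float, time_s: float) -> float:
    """Algorithm bandwidth: size of the data fed to the collective divided by time."""
    if time_s <= 0:
        raise ValueError("time_s must be positive")
    return size_bytes / time_s / 1e9  # assumption: 1 GB = 1e9 bytes


def busbw_factor(collective: str, nranks: int) -> float:
    """Correction factor applied to algbw to obtain busbw."""
    if nranks < 2:
        raise ValueError("a collective needs at least 2 ranks")
    if collective in ("broadcast", "reduce"):
        return 1.0  # busbw equals algbw for these collectives
    if collective == "all_reduce":
        # assumption: factor as given in the nccl-tests performance notes
        return 2 * (nranks - 1) / nranks
    raise ValueError(f"collective not covered by this sketch: {collective}")


def busbw_gbps(collective: str, nranks: int, size_bytes: float, time_s: float) -> float:
    """Bus bandwidth: algbw scaled so that it can be compared to a hardware peak."""
    return algbw_gbps(size_bytes, time_s) * busbw_factor(collective, nranks)


# Hypothetical run: 8 ranks, 4e9 bytes, 30 ms per allreduce.
print(f"algbw: {algbw_gbps(4e9, 0.03):.1f} GBps")
print(f"busbw: {busbw_gbps('all_reduce', 8, 4e9, 0.03):.1f} GBps")
```

Because the factor is applied per collective, the busbw of an allreduce and of a broadcast can be compared against the same hardware figure.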

There is probably a way to express that in the documentation -- but "unidirectional" would be more confusing to me as some may think data is only flowing in one direction on each link.

@stas00
Author

stas00 commented Feb 9, 2024

Sylvain, I always appreciate your clear commentary.

I totally understand what you're saying, and in no way was I suggesting that something needs to change on the nccl-tests side.

Yet, since vendors confuse us users, we need help in knowing what we should compare the output of nccl-tests against. I routinely use these tests to detect problems in network setups, and in order to detect these problems I need to know what the ceiling is.

I hear that using uni- vs. bi-directional terminology to tell the user which ceiling to compare against is not intuitive to you. Then please suggest what the term should be.

The context/need is simple. Vendors often advertise the doubled speed because it looks better. How is the user to know what the nccl-tests busbw output should be compared to?

Moreover, people relaying the spec information often forget to qualify what type of speed they are referring to. Someone recently talked to me about AMD MI300X having 896GBps intra-node speed, without any qualification of whether it was duplex or not. It proved to be duplex, but I had to research that. And I needed to translate that to 448GBps to check the efficiency when I test with nccl-tests. I realize it wasn't done out of misdirection: the person I was talking to didn't realize this small detail was very important, and they just remembered the number from the spec page, but not what it meant.
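One way to avoid losing that qualifier is to record it next to the number. Below is a small sketch under that assumption; the `LinkSpec` class and its field names are hypothetical, and the two entries are the duplex figures mentioned in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkSpec:
    """An advertised link speed together with the qualifier that is often dropped."""
    name: str
    advertised_gbps: float
    is_duplex: bool  # True if the advertised number sums both directions

    @property
    def per_direction_gbps(self) -> float:
        # Halve duplex figures so they can be compared with nccl-tests busbw.
        return self.advertised_gbps / 2 if self.is_duplex else self.advertised_gbps


specs = [
    LinkSpec("AMD MI300X intra-node", 896.0, is_duplex=True),
    LinkSpec("NVLink 3 (A100)", 600.0, is_duplex=True),
]

for spec in specs:
    print(f"{spec.name}: {spec.per_direction_gbps:.0f} GBps per direction")
```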

@stas00
Author

stas00 commented Feb 9, 2024

> So I don't think it was necessary to change the Infiniband wikipedia page, given this is talking about the data rate of the transceiver.

FWIW, https://en.wikipedia.org/wiki/PCI_Express#Comparison_table clearly specifies that the numbers are:

"In each direction (each lane is a dual simplex channel)."

[screenshot: snapshot_603]

> The page mentioned the links were duplex.

It does, because yours truly added it. It took me much research to validate that they were duplex indeed, since the original version had no indication whatsoever what those numbers were referring to.

Why does the PCIe wiki page use per-direction speeds, whereas the IB wiki page publishes only duplex numbers? How can users make sense of this when comparing technologies?

@stas00
Author

stas00 commented Feb 9, 2024

I'm working on solving this ambiguity problem by building my own wiki here: https://github.com/stas00/ml-engineering/tree/master/network, which explicitly shows both uni- and bi-directional speeds, e.g. here are PCIe and NVLink:

[screenshot: snapshot_604]
