Broadcast hangs on cyt_rdma #202
Can confirm that I observe similar behaviour when running Allreduce in isolation. I tried to run Allreduce with a size of just 2. The first run succeeded. On the second run, the machine started hanging (I can't even reprogram it anymore).
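For reference, a minimal sketch of what such an isolated 2-element Allreduce could look like through the PyTorch ProcessGroup path; the backend name "ACCL" and the env:// rendezvous are assumptions here, not taken from the actual test code:

```python
# Hedged sketch of an isolated Allreduce with just 2 elements, as described above.
# Assumptions (not from the test sources): the ACCL ProcessGroup is registered as
# backend "ACCL", and rank/world size come from the usual env:// variables.
import torch
import torch.distributed as dist

def main() -> None:
    dist.init_process_group(backend="ACCL", init_method="env://")  # backend name assumed
    rank = dist.get_rank()

    # A buffer of size 2, matching the failing run described in the comment.
    x = torch.full((2,), float(rank), dtype=torch.float32)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {x.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```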
I can also confirm that the issues are not present on the commit before the 196 merge (the "Merge pull request" commit).
I mean to say they were probably introduced in the 196 fix. The commit right before is the 194 merge (01f49d2), on which the issue is not present.
Can you attach your code here? This doesn't look like it's from any of our tests.
It's the test/host/Coyote/runscripts/run.sh with
Does this same test work against the emulator?
I didn't try the equivalent as an isolated testcase, but the emulator works with the ProcessGroup at different sizes and repetitions, while in hardware it shows behaviour like this very quickly.
I observed similar behaviour with other collectives, but thus far I have only reproduced it with Broadcast, so the title may be misleading. I will add comments on similar behaviour with other collectives here later.
Calling Broadcast with 4MB hangs on the second rank.
Rank 0: stdout and stderr logs attached.
Rank 1: stdout and stderr logs attached.
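For context, a hedged repro sketch of the 4 MB Broadcast along the same lines as the Allreduce sketch above; only the 4 MB size and the hang on the second rank come from this report, while the backend name and dtype are assumptions:

```python
# Hedged sketch: broadcast a 4 MB buffer from rank 0; per the report, rank 1 hangs here.
# The backend name "ACCL" is an assumption; 4 MB corresponds to 1,048,576 float32 elements.
import torch
import torch.distributed as dist

dist.init_process_group(backend="ACCL", init_method="env://")
rank = dist.get_rank()

n = (4 * 1024 * 1024) // 4  # number of float32 elements in 4 MB
buf = torch.arange(n, dtype=torch.float32) if rank == 0 else torch.empty(n, dtype=torch.float32)

dist.broadcast(buf, src=0)  # observed to hang on the second rank
print(f"rank {rank}: buf[0] = {buf[0].item()}")

dist.destroy_process_group()
```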
Running smaller Broadcast operations works, even if they are above the rendezvous threshold. When I ran with 128 elements (which is above the threshold), I did break a machine, though (successive bitstream flashing failed), but this might just have been bad luck.
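To make the size dependence concrete, here is a generic illustration (explicitly not ACCL's actual implementation) of why a rendezvous threshold splits behaviour by message size: small messages take an eager path, larger ones go through a handshake first, so a bug on the rendezvous side only surfaces above the threshold.

```python
# Generic sketch (not ACCL's code): protocol selection by message size.
# Messages at or below the threshold are sent eagerly; larger ones first perform
# a rendezvous handshake, so only transfers above the threshold exercise that path.
RENDEZVOUS_THRESHOLD_BYTES = 128  # hypothetical value, for illustration only

def send_message(payload: bytes, post_eager, post_rendezvous) -> str:
    if len(payload) <= RENDEZVOUS_THRESHOLD_BYTES:
        post_eager(payload)       # copied straight into the receiver's eager buffers
        return "eager"
    post_rendezvous(payload)      # ready-to-send / clear-to-send exchange before the data moves
    return "rendezvous"

# Example: an 8-byte payload stays eager, a 4 MB payload takes the rendezvous path.
print(send_message(b"\x00" * 8, lambda p: None, lambda p: None))          # -> eager
print(send_message(b"\x00" * (4 << 20), lambda p: None, lambda p: None))  # -> rendezvous
```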
The other collective I experienced issues with is Allreduce; there I get hangs too, but this might be completely unrelated.
Generally, the errors seem to occur at certain sizes or after a certain number of repetitions. It might just be a delay after which the machine hangs, as I got hangs in instances where there isn't even an ACCL collective running. This happened in conjunction with Allreduce, and I have trouble reproducing it.
I'm running it on the 200-allreduce-hangs... branch, but I had the same behaviour on the 196 merge commit. I'm fairly confident everything worked before the merge of the 196 fix, but I can try to verify that. I was certainly able to run almost all collectives on HW sometime before the 196 issue merge.
Everything works in the simulator, in a variety of scenarios.