Gather wrong order #198
Comments
I've merged the 196 dev branch into dev since the bugs solved there were quite severe. So I'll be handling this as a new bug on dev.
I verified my statement about recompiling test.cpp: when running on the driver and the bitstream generated by the dev branch, and only recompiling test.cpp, the failure still occurs.
@lawirz can you confirm what count you were using for the above failed test?
Count of 24.
I see you set max eager size to 64B, so this 24-float (192B) gather executes with rendezvous. Can you please increase the max eager size to something larger than 192B and rerun the test? Let me know if the problem persists.
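For reference, a minimal sketch of the size check being described here; this is hypothetical and not ACCL's actual implementation. It only illustrates that, with the payload size quoted above (192 B for count 24), a 64 B eager limit forces the rendezvous path, while a limit above 192 B keeps the transfer eager.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical helper, not ACCL code: eager transfer is only possible when the
// payload fits within the configured maximum eager size.
bool uses_rendezvous(std::size_t payload_bytes, std::size_t max_eager_bytes) {
    return payload_bytes > max_eager_bytes;
}

int main() {
    // Numbers quoted in the thread: the count-24 gather moves 192 B per rank.
    std::printf("eager limit   64 B -> rendezvous: %d\n", uses_rendezvous(192, 64));   // 1: rendezvous
    std::printf("eager limit 1024 B -> rendezvous: %d\n", uses_rendezvous(192, 1024)); // 0: eager
    return 0;
}
```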
I now initialized with the max eager size increased as suggested; the exact calls are listed with the results below.
Can you run each one 10 times and report how many fail? Lucian and I had a look at the two versions you are pointing to, and nothing that changed between them (I fixed the TCP session handler) seems capable of causing this break.
Results (1 means the test succeeded) for the two initializations, which differ only in the max eager size (1024 vs the original 64):
accl.get()->initialize(ranks, mpi_rank, mpi_size, 64, 1024, options.seg_size);
accl.get()->initialize(ranks, mpi_rank, mpi_size, 64, 64, options.seg_size);
The result seems to be dependent on utilization. I am currently almost alone on the cluster, and this is the first time I'm getting this many successes; the behaviour I initially observed might just have been due to this effect. The count was still 24. I tried to avoid false negatives due to filesystem errors in the script I used to drive the runs.
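For illustration only, a minimal sketch of the kind of repetition harness being discussed; this is not the reporter's script, which is not reproduced in the issue. The ./run_gather_test.sh wrapper name and its return-zero-on-success convention are assumptions.

```cpp
#include <cstdio>
#include <cstdlib>

int main() {
    const int runs = 10;   // "run each 10 times" as requested above
    int successes = 0;
    for (int i = 0; i < runs; ++i) {
        // Hypothetical wrapper that launches the two-node gather test and
        // returns 0 on success; retries for filesystem errors could go here.
        int rc = std::system("./run_gather_test.sh");
        if (rc == 0) {
            ++successes;   // corresponds to a "1" in the results above
        }
    }
    std::printf("%d/%d runs succeeded\n", successes, runs);
    return 0;
}
```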
Thanks for running these - I don't think this is a utilisation / congestion issue though. The networking stacks ACCL uses (RDMA from Coyote, TCP/IP from EasyNet) both have retransmission if I am not mistaken, so any congestion in the cluster should not cause these issues. Also, such issues are quite low level, so I would expect them to also have an impact on other collectives, not just gather. Could this be some race condition?
Original issue description
This issue concerns the branch to resolve issue 196: https://github.com/Xilinx/ACCL/tree/196-reduceallreduce-issues-on-cyt_rdma
Gather sometimes swaps the outputs of the first and second rank on two-node setups when run on cyt_rdma. The error is not observed in the emulator setup; in HW it happens in only around 50% of runs.
Allgather, on the other hand, does not produce erroneous behaviour.
The error only occurred after recompiling test/host/Coyote/test.cpp; a binary compiled on the previous version, running with a new bitstream, worked.
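To make the symptom concrete, here is a hypothetical verification sketch, not taken from test.cpp: assuming every rank fills its count-float send buffer with its own rank id before the gather, the root can check whether segment i of the gathered result actually holds rank i's data, so a swapped rank 0 / rank 1 segment would be reported.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical check, not part of the ACCL test suite: assumes rank r filled
// its send buffer of `count` floats with the value r, and that the root
// receives all segments contiguously in `result`, in rank order.
bool gather_order_ok(const std::vector<float> &result, int world_size, int count) {
    bool ok = true;
    for (int r = 0; r < world_size; ++r) {
        for (int i = 0; i < count; ++i) {
            float got = result[static_cast<std::size_t>(r) * count + i];
            if (got != static_cast<float>(r)) {
                std::printf("segment %d, element %d: got %.0f, expected %d (segments swapped?)\n",
                            r, i, got, r);
                ok = false;
                break;
            }
        }
    }
    return ok;
}
```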
Attached logs from the reporter: Rank 0 stdout / stderr, Rank 1 stdout / stderr.