Repeated send/recv gets stuck #139
Comments
Hi @Mellich, could you clarify this: "It occurs using the UDP and TCP stack." I can see a mechanism whereby backpressure from the RX pipeline causes the UDP stack to drop packets. This shouldn't happen for the TCP stack though. Please add ILAs to the input and output of the UDP stack to give us visibility into what's going on, and share the waveforms.
I executed the same experiment with another bitstream containing the TCP stack. There, too, the execution gets stuck after several iterations when large message sizes are used.
I have been able to replicate this bug on our infrastructure with the following settings: I have changed @Mellich's code a little in my branch so that it automatically prints some statistics if the send/recv gets stuck. I've attached the statistics that I got from our infrastructure. Most notably, the CMAC numbers seem to match, but the Network Layer numbers don't. Is this maybe a problem with VNx instead of ACCL? (This of course doesn't explain the TCP problems.)
Rank 0:
Rank 1:
I traced this problem back to ACCL applying backpressure into the POE, which causes packet loss with UDP, and also with TCP if RX bypass is enabled. The current workaround is to use TCP and disable RX bypass. I made this configuration the default for now. Keeping the issue open while I debug the cause of the backpressure.
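To illustrate the mechanism described above, here is a minimal toy model of a receive buffer under backpressure. It is only a sketch under stated assumptions: the function `simulate_rx`, its parameters, and the one-packet-per-tick timing are all hypothetical and do not correspond to ACCL, the POE, or VNx internals. It only shows why a full buffer loses packets when the sender is not stalled, and loses none when it is.

```python
from collections import deque


def simulate_rx(num_packets, buffer_depth, drain_every, drop_on_full):
    """Toy model of a receive buffer fed at one packet per tick.

    The consumer removes one packet every `drain_every` ticks, which
    stands in for backpressure from a slow RX pipeline (assumption).
    Returns (delivered, dropped).
    """
    buffer = deque()
    delivered = dropped = 0
    next_packet = 0
    tick = 0
    while next_packet < num_packets or buffer:
        tick += 1
        # Consumer side: drain one packet on every `drain_every`-th tick.
        if buffer and tick % drain_every == 0:
            buffer.popleft()
            delivered += 1
        # Producer side: try to hand over the next packet.
        if next_packet < num_packets:
            if len(buffer) < buffer_depth:
                buffer.append(next_packet)
                next_packet += 1
            elif drop_on_full:
                # No flow control: the packet is lost.
                dropped += 1
                next_packet += 1
            # Otherwise the sender stalls and retries on the next tick.
    return delivered, dropped


if __name__ == "__main__":
    print("no flow control:", simulate_rx(100, 4, 2, drop_on_full=True))
    print("sender stalled: ", simulate_rx(100, 4, 2, drop_on_full=False))
```

With `drop_on_full=True` some packets are dropped once the buffer fills, while with `drop_on_full=False` every packet is eventually delivered, at the cost of stalling the sender.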
Repeated calls of send/recv of the following form get stuck after several iterations on two ranks:
The behavior seems to be non-deterministic and may only appear with large message sizes and high numbers of repetitions.
It happens more frequently for FPGA-to-FPGA communication, as shown in the example, but I also observed it for CPU-to-CPU communication via ACCL. It occurs with both the UDP and the TCP stack.
Setting a sufficiently long sleep between the iterations seems to increase stability.
I modified the XRT tests to show the described behavior here: https://github.com/Mellich/ACCL/blob/f805e8f87a91878228173668553ce25f9b9eaa31/test/host/xrt/test.cpp#L347
Using the branch above, the test gets stuck reliably for me when executed with the following command:
Example dump of CMAC and network layer status of the UDP version after the execution got stuck:
Message size: 1MB
Output of test:
Rank 0:
Rank 1:
All send packets are also received by the network layer of the other rank, so no data seems to get lost over the link. However, there is a discrepancy between sent and received packets on both ranks. Shouldn't the count of packets be the same in this scenario?
The `recv` should block the subsequent send, so rx and tx should stay in balance.
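The expectation that rx and tx stay in balance can be checked with a small counting model. This is a sketch under an assumed pattern (rank 0 sends then receives, rank 1 receives then sends, with equal message sizes); `pingpong_counts` and its parameters are hypothetical and are not taken from the test code linked above.

```python
def pingpong_counts(iterations, packets_per_message, lost_message=None):
    """Count packets per rank for a blocking ping-pong exchange.

    Assumed pattern (hypothetical, not the code from the issue):
    rank 0 sends then receives, rank 1 receives then sends.
    `lost_message` is the index of a rank 0 -> rank 1 message that
    never arrives, after which both ranks block.
    Returns (tx, rx), each a two-element list indexed by rank.
    """
    tx = [0, 0]
    rx = [0, 0]
    for i in range(iterations):
        tx[0] += packets_per_message      # rank 0: send
        if i == lost_message:
            break                         # rank 1 never leaves recv
        rx[1] += packets_per_message      # rank 1: recv completes
        tx[1] += packets_per_message      # rank 1: send
        rx[0] += packets_per_message      # rank 0: recv completes
    return tx, rx


if __name__ == "__main__":
    print(pingpong_counts(10, 16))                  # ([160, 160], [160, 160])
    print(pingpong_counts(10, 16, lost_message=3))  # ([64, 48], [48, 48])
```

In this model the tx and rx counters of a rank never differ by more than one message, even when the exchange hangs. A larger gap in the real counters would therefore point at something the model does not cover, such as packets being dropped or counted at a different layer.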