Broadcast hangs on cyt_rdma #202

lawirz · 2024-06-19T11:55:22Z

I observed similar behaviour with other collectives, but so far I have only reproduced it with broadcast, so the title may be misleading. I will add comments about similar behaviour with other collectives here later.

Calling Broadcast with 4 MB (1048576 elements) hangs on the second rank.

Rank 0

stdout

Arguments: '../accl_on_coyote' '-d' '-f' '-r' '-z' '1' '-y' '5' '-c' '1048576' '-l' './accl_log/fpga' '-p' '1' '-n' '1' 
Running ACCL test in coyote...
Initializing MPI...
Reading MPI rank and size values...
Parsing options
Hardware rdma mode
count:1048576 rxbuf_size:4194304 seg_size:4194304 num_rxbufmem:2
Getting MPI Processor name...
[process 0] rank 0 size 2 alveo-u55c-07.inf.ethz.ch
Testing ACCL base functionality...
10.253.74.92
10.253.74.96
Initializing QP connections...
Exchanging QP...
Local rank 0 sending local QP to remote rank 1
Local rank 0 receiving remote QP from remote rank 1
Queue Pair: id: 1
Local Queue: local: QPN 0x000002, PSN 0xcc1ea0, VADDR 00007fc964a00000, SIZE 00200000, IP 0x0afd4a5c,
Remote Queue: remote: QPN 0x000001, PSN 0x4eed7e, VADDR 00007f2da6c00000, SIZE 00200000, IP 0x0afd4a60,
rank: 0 FPGA IP: afd4a5c
Rendezvous Protocol
sw nop time [us]:93.336
hw nop time [ns]:940
Start bcast test with root 0 ...
Repetition 0
Pass accl barrier
host measured durationUs:252146

ACCL base functionality test completed successfully!

-- STATISTICS - ID: 0
-----------------------------------------------
          Read command FIFO used: 	0
         Write command FIFO used: 	0
                 Host reads sent: 	1
                Host writes sent: 	0
                 Card reads sent: 	0
                Card writes sent: 	0
                 Sync reads sent: 	5
                Sync writes sent: 	0
                     Page faults: 	0


 -- NET STATS QSFP0

RX pkgs: 738
TX pkgs: 1030
ARP RX pkgs: 2
ARP TX pkgs: 2
ICMP RX pkgs: 0
ICMP TX pkgs: 0
TCP RX pkgs: 0
TCP TX pkgs: 0
ROCE RX pkgs: 654
ROCE TX pkgs: 1028
IBV RX pkgs: 646
IBV TX pkgs: 66566
PSN drop cnt: 0
Retrans cnt: 384
TCP session cnt: 0
STRM down: 0

stderr

XRT build version: 2.13.466
Build hash: f5505e402c2ca1ffe45eb6d3a9399b23a0dc8776
Build date: 2022-04-14 17:43:11
Git branch: 2022.1
PID: 92386
UID: 500207
[Wed Jun 19 10:50:52 2024 GMT]
HOST: alveo-u55c-07.inf.ethz.ch
EXE: /pub/scratch/lawirz/XACCL/integrations/pytorch_ddp/accl/test/host/Coyote/accl_on_coyote
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
ACLL DEBUG: aquiring cProc: targetRegion: 0, cPid: 0
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 1
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 2
CCLO HWID: 3009117246 at 0x0
CCLO source commit (first 24b): b35b7c
CCLO Capabilities:
Stack type: RDMA
Internal DMA:True
External DMA:False
Reduction:True
Compression:True
Kernel Streams:True
Debug:False
Doing a soft reset
Configuring Eager RX Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7fc95fe00000, Size: 64
calling offload: 7fc95fe00000, size: 64
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7fc95fc00000, Size: 64
calling offload: 7fc95fc00000, size: 64
Configuring Rendezvous Spare Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fc95f800000, Size: 4194304
calling offload: 7fc95f800000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fc95f400000, Size: 4194304
calling offload: 7fc95f400000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fc95f000000, Size: 4194304
calling offload: 7fc95f000000, size: 4194304
Configuring a communicator
Configuring arithmetic
Configuring collective tuning parameters
CCLO configured
Set timeout
Set max eager size: 64
Set max rendezvous reduce size: 4194304
Accelerator ready!
Communicator 0 (0x40):
local rank: 0 	 number of ranks: 2
> rank 0 (ip 10.253.74.92:5005 ; session 0 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.96:5005 ; session 2 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0

Rank 0 passed last barrier before test!
CCLO address: 0
rx address: 4
Spare RX Buffer 0:	 address: 0x7fc95fe00000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0
Spare RX Buffer 1:	 address: 0x7fc95fc00000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0

CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fc95ec00000, Size: 4194304
Broadcasting data from 0...
Free user buffer from cProc cPid:0, buffer_size:4194304,7fc95ec00000
Communicator 0 (0x40):
local rank: 0 	 number of ranks: 2
> rank 0 (ip 10.253.74.92:5005 ; session 0 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.96:5005 ; session 2 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0

CCLO address: 0
rx address: 4
Spare RX Buffer 0:	 address: 0x7fc95fe00000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0
Spare RX Buffer 1:	 address: 0x7fc95fc00000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0

Removing CCLO object at 0
Doing a soft reset
Free user buffer from cProc cPid:0, buffer_size:64,7fc95fe00000
Free user buffer from cProc cPid:0, buffer_size:64,7fc95fc00000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fc95f800000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fc95f400000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fc95f000000

Rank 1

stdout

Arguments: '../accl_on_coyote' '-d' '-f' '-r' '-z' '1' '-y' '5' '-c' '1048576' '-l' './accl_log/fpga' '-p' '1' '-n' '1' 
Running ACCL test in coyote...
Initializing MPI...
Reading MPI rank and size values...
Parsing options
Hardware rdma mode
count:1048576 rxbuf_size:4194304 seg_size:4194304 num_rxbufmem:2
Getting MPI Processor name...
[process 1] rank 1 size 2 alveo-u55c-08.inf.ethz.ch
Testing ACCL base functionality...
10.253.74.92
10.253.74.96
Initializing QP connections...
Exchanging QP...
Local rank 1 receiving remote QP from remote rank 0
Local rank 1 sending local QP to remote rank 0
Queue Pair: id: 0
Local Queue: local: QPN 0x000001, PSN 0x4eed7e, VADDR 00007f2da6c00000, SIZE 00200000, IP 0x0afd4a60,
Remote Queue: remote: QPN 0x000002, PSN 0xcc1ea0, VADDR 00007fc964a00000, SIZE 00200000, IP 0x0afd4a5c,
rank: 1 FPGA IP: afd4a60
Rendezvous Protocol
sw nop time [us]:86.834
hw nop time [ns]:940
Start bcast test with root 0 ...
Repetition 0
Pass accl barrier

stderr

XRT build version: 2.13.466
Build hash: f5505e402c2ca1ffe45eb6d3a9399b23a0dc8776
Build date: 2022-04-14 17:43:11
Git branch: 2022.1
PID: 90744
UID: 500207
[Wed Jun 19 10:50:52 2024 GMT]
HOST: alveo-u55c-08.inf.ethz.ch
EXE: /pub/scratch/lawirz/XACCL/integrations/pytorch_ddp/accl/test/host/Coyote/accl_on_coyote
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
ACLL DEBUG: aquiring cProc: targetRegion: 0, cPid: 0
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 1
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 2
CCLO HWID: 3009117246 at 0x0
CCLO source commit (first 24b): b35b7c
CCLO Capabilities:
Stack type: RDMA
Internal DMA:True
External DMA:False
Reduction:True
Compression:True
Kernel Streams:True
Debug:False
Doing a soft reset
Configuring Eager RX Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7f2da5e00000, Size: 64
calling offload: 7f2da5e00000, size: 64
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7f2da5c00000, Size: 64
calling offload: 7f2da5c00000, size: 64
Configuring Rendezvous Spare Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f2da5800000, Size: 4194304
calling offload: 7f2da5800000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f2da5400000, Size: 4194304
calling offload: 7f2da5400000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f2da5000000, Size: 4194304
calling offload: 7f2da5000000, size: 4194304
Configuring a communicator
Configuring arithmetic
Configuring collective tuning parameters
CCLO configured
Set timeout
Set max eager size: 64
Set max rendezvous reduce size: 4194304
Accelerator ready!
Communicator 0 (0x40):
local rank: 1 	 number of ranks: 2
> rank 0 (ip 10.253.74.92:5005 ; session 1 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.96:5005 ; session 1 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0

Rank 1 passed last barrier before test!
CCLO address: 0
rx address: 4
Spare RX Buffer 0:	 address: 0x7f2da5e00000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0
Spare RX Buffer 1:	 address: 0x7f2da5c00000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0

CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f2da4c00000, Size: 4194304
Getting broadcast data from 0...
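
To see where rank 1 stalls, the two stdout logs above can be compared mechanically. The following is a minimal sketch that uses short excerpts of those logs pasted inline; the function name `lines_not_reached` is hypothetical and not part of ACCL.

```python
def lines_not_reached(complete_log, stalled_log):
    """Return the lines of complete_log that come after the point where
    stalled_log ends, or None if that point cannot be located."""
    complete = [l.strip() for l in complete_log.splitlines() if l.strip()]
    stalled = [l.strip() for l in stalled_log.splitlines() if l.strip()]
    if not stalled:
        return complete
    last = stalled[-1]
    # The last line may repeat (e.g. once per repetition), so match the
    # same occurrence number in the complete log.
    wanted = stalled.count(last)
    seen = 0
    for i, line in enumerate(complete):
        if line == last:
            seen += 1
            if seen == wanted:
                return complete[i + 1:]
    return None


# Excerpts copied from the stdout of rank 0 and rank 1 in this issue.
RANK0_STDOUT = """Start bcast test with root 0 ...
Repetition 0
Pass accl barrier
host measured durationUs:252146

ACCL base functionality test completed successfully!
"""

RANK1_STDOUT = """Start bcast test with root 0 ...
Repetition 0
Pass accl barrier
"""

for line in lines_not_reached(RANK0_STDOUT, RANK1_STDOUT):
    print("rank 1 never printed:", line)
```

For these excerpts the loop reports the duration line and the completion message, which matches rank 1 passing the barrier and then never returning from the broadcast.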

Running smaller Broadcast operations works, even above the rendezvous threshold. When I ran with 128 elements (which is above the threshold), I broke a machine, though (subsequent bitstream flashing failed); this might just have been bad luck.

The other collective I experienced issues with is allreduce. I get hangs there too, but this might be completely unrelated.

Generally, the errors seem to occur at certain sizes or after a certain number of repetitions. It might just be a delay after which the machine hangs, as I got hangs in instances where no ACCL collective was even running. This happened in conjunction with allreduce, and I have trouble reproducing it.

I'm running it on the 200-allreduce-hangs... branch, but I had the same behaviour on the 196 merge commit. I'm fairly confident everything worked before the merge of the 196-fix, but I can try to verify it. I was certainly able to run almost all collectives on HW some time before I picked up the merge for issue 196.

Everything works in the simulator, in a variety of scenarios.


lawirz · 2024-06-19T13:04:52Z

I can confirm that I observe similar behaviour when running Allreduce in isolation. I tried to run Allreduce with a size of just 2. The first run succeeded. On the second run, the machine started hanging (I can't even reprogram it anymore).

lawirz · 2024-06-19T15:29:10Z

I can also confirm that the issues are not present on the commit before the 196 merge. Merge pull request

quetric · 2024-06-24T12:12:41Z

You linked to #194; do you mean that, or the PR that closed issue #196?

lawirz · 2024-06-24T12:18:13Z

I mean to say they were probably introduced in the 196-fix. The commit right before is the 194 merge (01f49d2), on which the issue is not present.

quetric · 2024-06-24T12:21:30Z

Can you attach your code here? This doesn't look like it's from any of our tests.

lawirz · 2024-06-24T12:24:17Z

It's the test/host/Coyote/runscripts/run.sh with

TEST_MODE=(5) 
N_ELEMENTS=(1048576) # 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576
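
For reference, the element counts in that sweep can be converted to message sizes. This is a minimal sketch under stated assumptions: 4-byte elements (inferred from the logs, which report count:1048576 alongside buffer_size:4194304), the 2 MiB page size printed by CoyoteBuffer, and the "max eager size: 64" value treated as a byte limit above which the rendezvous protocol is used. The helper name `describe` is hypothetical.

```python
# Values copied from the logs in this issue. ELEM_BYTES and the meaning of
# MAX_EAGER (bytes, inclusive limit) are assumptions, not confirmed ACCL facts.
ELEM_BYTES = 4
PAGE_SIZE = 2097152   # page_size reported by the CoyoteBuffer constructor
MAX_EAGER = 64        # "Set max eager size: 64"

# The sweep listed in the run.sh comment: 128, 256, ..., 1048576.
N_ELEMENTS = [2 ** k for k in range(7, 21)]


def describe(count):
    """Return (bytes, pages, protocol) for a message of `count` elements."""
    nbytes = count * ELEM_BYTES
    pages = -(-nbytes // PAGE_SIZE)  # ceiling division
    protocol = "eager" if nbytes <= MAX_EAGER else "rendezvous"
    return nbytes, pages, protocol


for count in N_ELEMENTS:
    nbytes, pages, protocol = describe(count)
    print(f"{count:>8} elements = {nbytes:>8} bytes, {pages} page(s), {protocol}")
```

Under these assumptions every size in the sweep is above the eager limit, and only the largest one (1048576 elements, 4194304 bytes) spans two pages and fills a whole rendezvous buffer.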

quetric · 2024-06-24T12:27:13Z

Does this same test work against the emulator?

lawirz · 2024-06-24T12:29:45Z

I didn't try the equivalent as an isolated test case. But the emulator works with the ProcessGroup at different sizes and repetitions, while hardware shows behaviour like this very quickly.

quetric self-assigned this Jun 24, 2024

quetric added the bug label Jun 24, 2024
