Gather wrong order #198

Open
lawirz opened this issue May 29, 2024 · 9 comments


lawirz commented May 29, 2024

This issue concerns the branch to resolve issue 196: https://github.com/Xilinx/ACCL/tree/196-reduceallreduce-issues-on-cyt_rdma

On two-node setups running on cyt_rdma, gather sometimes swaps the output of the first rank with that of the second. The error is not observed in the emulator setup. In hardware, it happens in only around 50% of runs.

Allgather, on the other hand, doesn't produce erroneous behaviour.

It only occurred after recompiling test/host/Coyote/test.cpp. The binary compiled on the previous version still worked when run with a new bitstream.
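
To make the symptom concrete, here is a minimal sketch of a checker that reports which rank each block of the root's gather buffer appears to come from. The function name `classify_gather_blocks` is hypothetical and not part of test.cpp. The data pattern (rank r contributes the values r*count through r*count+count-1) is an assumption inferred from the error messages in the log below, not taken from the test source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption (inferred from the log, not from test.cpp): rank r contributes
// the values r*count, r*count+1, ..., r*count+count-1.
// For each block of `count` elements in the gathered buffer, returns the rank
// whose pattern the block matches, or -1 if it matches no rank.
std::vector<int> classify_gather_blocks(const std::vector<float>& gathered,
                                        std::size_t count,
                                        std::size_t world_size) {
    assert(count > 0 && gathered.size() == count * world_size);
    std::vector<int> origin(world_size, -1);
    for (std::size_t block = 0; block < world_size; ++block) {
        for (std::size_t rank = 0; rank < world_size; ++rank) {
            bool match = true;
            for (std::size_t i = 0; i < count && match; ++i) {
                // Small integers are exactly representable as float,
                // so an exact comparison is safe here.
                match = gathered[block * count + i] ==
                        static_cast<float>(rank * count + i);
            }
            if (match) {
                origin[block] = static_cast<int>(rank);
                break;
            }
        }
    }
    return origin;
}
```

For a correct two-rank gather this returns {0, 1}. A buffer holding 24..47 followed by 0..23, as the failing run suggests, would give {1, 0}, which distinguishes a clean block swap from corrupted data (any -1 entry).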

Rank 0

stdout
Arguments: '../accl_on_coyote' '-d' '-f' '-r' '-z' '1' '-y' '7' '-c' '24' '-l' './accl_log/fpga' '-p' '1' '-n' '1' 
Running ACCL test in coyote...
Initializing MPI...
Reading MPI rank and size values...
Parsing options
Hardware rdma mode
count:24 rxbuf_size:4194304 seg_size:4194304 num_rxbufmem:2
Getting MPI Processor name...
[process 0] rank 0 size 2 alveo-u55c-04.inf.ethz.ch
Testing ACCL base functionality...
10.253.74.80
10.253.74.92
Initializing QP connections...
Exchanging QP...
Local rank 0 sending local QP to remote rank 1
Local rank 0 receiving remote QP from remote rank 1
Queue Pair: id: 1
Local Queue: local: QPN 0x000002, PSN 0x2aec2a, VADDR 00007f1980200000, SIZE 00200000, IP 0x0afd4a50,
Remote Queue: remote: QPN 0x000001, PSN 0x6f7034, VADDR 00007f431ce00000, SIZE 00200000, IP 0x0afd4a5c,
rank: 0 FPGA IP: afd4a50
Rendezvous Protocol
sw nop time [us]:92.656
hw nop time [ns]:940
Start gather test with root 0...
Repetition 0
Pass accl barrier
host measured durationUs:42.371
1th item is incorrect! (24.000000 != 0.000000)
2th item is incorrect! (25.000000 != 1.000000)
3th item is incorrect! (26.000000 != 2.000000)
4th item is incorrect! (27.000000 != 3.000000)
5th item is incorrect! (28.000000 != 4.000000)
6th item is incorrect! (29.000000 != 5.000000)
7th item is incorrect! (30.000000 != 6.000000)
8th item is incorrect! (31.000000 != 7.000000)
9th item is incorrect! (32.000000 != 8.000000)
10th item is incorrect! (33.000000 != 9.000000)
11th item is incorrect! (34.000000 != 10.000000)
12th item is incorrect! (35.000000 != 11.000000)
13th item is incorrect! (36.000000 != 12.000000)
14th item is incorrect! (37.000000 != 13.000000)
15th item is incorrect! (38.000000 != 14.000000)
16th item is incorrect! (39.000000 != 15.000000)
17th item is incorrect! (40.000000 != 16.000000)
18th item is incorrect! (41.000000 != 17.000000)
19th item is incorrect! (42.000000 != 18.000000)
20th item is incorrect! (43.000000 != 19.000000)
21th item is incorrect! (44.000000 != 20.000000)
22th item is incorrect! (45.000000 != 21.000000)
23th item is incorrect! (46.000000 != 22.000000)
24th item is incorrect! (47.000000 != 23.000000)
1th item is incorrect! (0.000000 != 24.000000)
2th item is incorrect! (1.000000 != 25.000000)
3th item is incorrect! (2.000000 != 26.000000)
4th item is incorrect! (3.000000 != 27.000000)
5th item is incorrect! (4.000000 != 28.000000)
6th item is incorrect! (5.000000 != 29.000000)
7th item is incorrect! (6.000000 != 30.000000)
8th item is incorrect! (7.000000 != 31.000000)
9th item is incorrect! (8.000000 != 32.000000)
10th item is incorrect! (9.000000 != 33.000000)
11th item is incorrect! (10.000000 != 34.000000)
12th item is incorrect! (11.000000 != 35.000000)
13th item is incorrect! (12.000000 != 36.000000)
14th item is incorrect! (13.000000 != 37.000000)
15th item is incorrect! (14.000000 != 38.000000)
16th item is incorrect! (15.000000 != 39.000000)
17th item is incorrect! (16.000000 != 40.000000)
18th item is incorrect! (17.000000 != 41.000000)
19th item is incorrect! (18.000000 != 42.000000)
20th item is incorrect! (19.000000 != 43.000000)
21th item is incorrect! (20.000000 != 44.000000)
22th item is incorrect! (21.000000 != 45.000000)
23th item is incorrect! (22.000000 != 46.000000)
24th item is incorrect! (23.000000 != 47.000000)
48 errors!

ERROR: ACCL base functionality test failed!

STATISTICS - ID: 0
-----------------------------------------------
          Read command FIFO used: 	0
         Write command FIFO used: 	0
                 Host reads sent: 	1
                Host writes sent: 	2
                 Card reads sent: 	1
                Card writes sent: 	1
                 Sync reads sent: 	5
                Sync writes sent: 	0
                     Page faults: 	0


 NET STATS QSFP0

RX pkgs: 50
TX pkgs: 5
ARP RX pkgs: 2
ARP TX pkgs: 2
ICMP RX pkgs: 0
ICMP TX pkgs: 0
TCP RX pkgs: 0
TCP TX pkgs: 0
ROCE RX pkgs: 3
ROCE TX pkgs: 3
IBV RX pkgs: 6
IBV TX pkgs: 4
PSN drop cnt: 0
Retrans cnt: 0
TCP session cnt: 0
STRM down: 0

Finalizing MPI...
Done. Terminating...
stderr
XRT build version: 2.13.466
Build hash: f5505e402c2ca1ffe45eb6d3a9399b23a0dc8776
Build date: 2022-04-14 17:43:11
Git branch: 2022.1
PID: 256891
UID: 500207
[Wed May 29 21:24:18 2024 GMT]
HOST: alveo-u55c-04.inf.ethz.ch
EXE: /pub/scratch/lawirz/XACCL/integrations/pytorch_ddp/accl/test/host/Coyote/accl_on_coyote
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
ACLL DEBUG: aquiring cProc: targetRegion: 0, cPid: 0
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 1
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 2
CCLO HWID: 4147289406 at 0x0
CCLO source commit (first 24b): f7329d
CCLO Capabilities:
Stack type: RDMA
Internal DMA:True
External DMA:False
Reduction:True
Compression:True
Kernel Streams:True
Debug:False
Doing a soft reset
Configuring Eager RX Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7f197f600000, Size: 64
calling offload: 7f197f600000, size: 64
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7f197f400000, Size: 64
calling offload: 7f197f400000, size: 64
Configuring Rendezvous Spare Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f197f000000, Size: 4194304
calling offload: 7f197f000000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f197ec00000, Size: 4194304
calling offload: 7f197ec00000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f197e800000, Size: 4194304
calling offload: 7f197e800000, size: 4194304
Configuring a communicator
Configuring arithmetic
Configuring collective tuning parameters
CCLO configured
Set timeout
Set max eager size: 64
Set max rendezvous reduce size: 4194304
Accelerator ready!
Communicator 0 (0x40):
local rank: 0 	 number of ranks: 2
> rank 0 (ip 10.253.74.80:5005 ; session 0 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.92:5005 ; session 2 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0

Rank 0 passed last barrier before test!
CCLO address: 0
rx address: 4
Spare RX Buffer 0:	 address: 0x7f197f600000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0
Spare RX Buffer 1:	 address: 0x7f197f400000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0

CoyoteBuffer contructor called! page_size:2097152, buffer_size:96,n_pages:1
Allocation successful! Allocated buffer: 7f197e600000, Size: 96
CoyoteBuffer contructor called! page_size:2097152, buffer_size:192,n_pages:1
Allocation successful! Allocated buffer: 7f197e400000, Size: 192
Gather data from 0...
Free user buffer from cProc cPid:0, buffer_size:96,7f197e600000
Free user buffer from cProc cPid:0, buffer_size:192,7f197e400000
Communicator 0 (0x40):
local rank: 0 	 number of ranks: 2
> rank 0 (ip 10.253.74.80:5005 ; session 0 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.92:5005 ; session 2 ; max segment size 4194304) : <- inbound seq number 1, -> outbound seq number 0

CCLO address: 0
rx address: 4
Spare RX Buffer 0:	 address: 0x7f197f600000 	 status: ENQUEUED 	 occupancy: 96/64 	 MPI tag: ffffffff 	 seq: 0 	 src: 1
Spare RX Buffer 1:	 address: 0x7f197f400000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0

Removing CCLO object at 0
Doing a soft reset
Free user buffer from cProc cPid:0, buffer_size:64,7f197f600000
Free user buffer from cProc cPid:0, buffer_size:64,7f197f400000
Free user buffer from cProc cPid:0, buffer_size:4194304,7f197f000000
Free user buffer from cProc cPid:0, buffer_size:4194304,7f197ec00000
Free user buffer from cProc cPid:0, buffer_size:4194304,7f197e80000

Rank 1

stdout
Arguments: '../accl_on_coyote' '-d' '-f' '-r' '-z' '1' '-y' '7' '-c' '24' '-l' './accl_log/fpga' '-p' '1' '-n' '1' 
Running ACCL test in coyote...
Initializing MPI...
Reading MPI rank and size values...
Parsing options
Hardware rdma mode
count:24 rxbuf_size:4194304 seg_size:4194304 num_rxbufmem:2
Getting MPI Processor name...
[process 1] rank 1 size 2 alveo-u55c-07.inf.ethz.ch
Testing ACCL base functionality...
10.253.74.80
10.253.74.92
Initializing QP connections...
Exchanging QP...
Local rank 1 receiving remote QP from remote rank 0
Local rank 1 sending local QP to remote rank 0
Queue Pair: id: 0
Local Queue: local: QPN 0x000001, PSN 0x6f7034, VADDR 00007f431ce00000, SIZE 00200000, IP 0x0afd4a5c,
Remote Queue: remote: QPN 0x000002, PSN 0x2aec2a, VADDR 00007f1980200000, SIZE 00200000, IP 0x0afd4a50,
rank: 1 FPGA IP: afd4a5c
Rendezvous Protocol
sw nop time [us]:73.61
hw nop time [ns]:940
Start gather test with root 0...
Repetition 0
Pass accl barrier
host measured durationUs:91.063

ACCL base functionality test completed successfully!

-- STATISTICS - ID: 0
-----------------------------------------------
          Read command FIFO used: 	0
         Write command FIFO used: 	0
                 Host reads sent: 	1
                Host writes sent: 	0
                 Card reads sent: 	0
                Card writes sent: 	0
                 Sync reads sent: 	5
                Sync writes sent: 	0
                     Page faults: 	0


 -- NET STATS QSFP0

RX pkgs: 48
TX pkgs: 5
ARP RX pkgs: 2
ARP TX pkgs: 2
ICMP RX pkgs: 0
ICMP TX pkgs: 0
TCP RX pkgs: 0
TCP TX pkgs: 0
ROCE RX pkgs: 3
ROCE TX pkgs: 3
IBV RX pkgs: 4
IBV TX pkgs: 6
PSN drop cnt: 0
Retrans cnt: 0
TCP session cnt: 0
STRM down: 0

Finalizing MPI...
Done. Terminating...
stderr
XRT build version: 2.13.466
Build hash: f5505e402c2ca1ffe45eb6d3a9399b23a0dc8776
Build date: 2022-04-14 17:43:11
Git branch: 2022.1
PID: 286334
UID: 500207
[Wed May 29 21:24:18 2024 GMT]
HOST: alveo-u55c-07.inf.ethz.ch
EXE: /pub/scratch/lawirz/XACCL/integrations/pytorch_ddp/accl/test/host/Coyote/accl_on_coyote
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
ACLL DEBUG: aquiring cProc: targetRegion: 0, cPid: 0
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 1
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 2
CCLO HWID: 4147289406 at 0x0
CCLO source commit (first 24b): f7329d
CCLO Capabilities:
Stack type: RDMA
Internal DMA:True
External DMA:False
Reduction:True
Compression:True
Kernel Streams:True
Debug:False
Doing a soft reset
Configuring Eager RX Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7f431c000000, Size: 64
calling offload: 7f431c000000, size: 64
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7f4317e00000, Size: 64
calling offload: 7f4317e00000, size: 64
Configuring Rendezvous Spare Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f4317a00000, Size: 4194304
calling offload: 7f4317a00000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f4317600000, Size: 4194304
calling offload: 7f4317600000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f4317200000, Size: 4194304
calling offload: 7f4317200000, size: 4194304
Configuring a communicator
Configuring arithmetic
Configuring collective tuning parameters
CCLO configured
Set timeout
Set max eager size: 64
Set max rendezvous reduce size: 4194304
Accelerator ready!
Communicator 0 (0x40):
local rank: 1 	 number of ranks: 2
> rank 0 (ip 10.253.74.80:5005 ; session 1 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.92:5005 ; session 1 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0

Rank 1 passed last barrier before test!
CCLO address: 0
rx address: 4
Spare RX Buffer 0:	 address: 0x7f431c000000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0
Spare RX Buffer 1:	 address: 0x7f4317e00000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0

CoyoteBuffer contructor called! page_size:2097152, buffer_size:96,n_pages:1
Allocation successful! Allocated buffer: 7f4317000000, Size: 96
Gather data from 1...
Free user buffer from cProc cPid:0, buffer_size:96,7f4317000000
Communicator 0 (0x40):
local rank: 1 	 number of ranks: 2
> rank 0 (ip 10.253.74.80:5005 ; session 1 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 1
> rank 1 (ip 10.253.74.92:5005 ; session 1 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0

CCLO address: 0
rx address: 4
Spare RX Buffer 0:	 address: 0x7f431c000000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0
Spare RX Buffer 1:	 address: 0x7f4317e00000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0

Removing CCLO object at 0
Doing a soft reset
Free user buffer from cProc cPid:0, buffer_size:64,7f431c000000
Free user buffer from cProc cPid:0, buffer_size:64,7f4317e00000
Free user buffer from cProc cPid:0, buffer_size:4194304,7f4317a00000
Free user buffer from cProc cPid:0, buffer_size:4194304,7f4317600000
Free user buffer from cProc cPid:0, buffer_size:4194304,7f4317200000
Collaborator

quetric commented May 30, 2024

I've merged the 196 dev branch into dev since the bugs solved there were quite severe. So I'll be handling this as a new bug on dev.

quetric changed the title from "Gather wrong order on branch to solve 196" to "Gather wrong order" on May 30, 2024
quetric self-assigned this on May 30, 2024
quetric added the bug (Something isn't working) label on May 30, 2024
Author

lawirz commented Jun 11, 2024

I verified my statement about recompiling test.cpp.

When running on the driver and the bitstream generated by the dev branch and only recompiling test.cpp:

Collaborator

quetric commented Jun 11, 2024

@lawirz can you confirm what count you were using for the above failed test?

Author

lawirz commented Jun 11, 2024

The default count of 16

Count of 24

Collaborator

quetric commented Jun 11, 2024

I see you set max eager size to 64B so this 24-float (192B) gather executes with rendezvous. Can you please increase the max eager size to something larger than 192B, and rerun the test? Let me know if the problem persists.
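
The size arithmetic behind this suggestion can be sketched as follows. The rule in `select_protocol` (send eagerly only when the message fits within the max eager size) is an assumption for illustration; the real decision is made inside the CCLO and may differ. All names here are hypothetical. The byte counts match the logs above, which show a 96-byte send buffer and a 192-byte result buffer on rank 0 for a count of 24.

```cpp
#include <cassert>
#include <cstddef>

enum class Protocol { Eager, Rendezvous };

// Assumed rule, for illustration only: a message is sent eagerly if it fits
// within the configured max eager size; otherwise rendezvous is used.
Protocol select_protocol(std::size_t message_bytes, std::size_t max_eager_bytes) {
    return message_bytes <= max_eager_bytes ? Protocol::Eager
                                            : Protocol::Rendezvous;
}

// Bytes contributed by one rank to a gather of `count` float elements.
std::size_t gather_send_bytes(std::size_t count) {
    return count * sizeof(float);
}

// Bytes in the root's receive buffer for the same gather.
std::size_t gather_recv_bytes(std::size_t count, std::size_t world_size) {
    assert(world_size > 0);
    return count * sizeof(float) * world_size;
}
```

With 4-byte floats, a count of 24 gives 96 B per rank and 192 B at the root for two ranks. Both exceed a 64 B max eager size, so under the assumed rule the transfer falls back to rendezvous either way.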

Author

lawirz commented Jun 13, 2024

I now initialized using:
accl.get()->initialize(ranks, mpi_rank, mpi_size, 64, 1024, options.seg_size);
I still get the error.
It may be sheer chance, but I had to repeat the test 7 times to reproduce the error. Previously I typically got it on the first run, so it might depend on other factors. I only ran the test on the "hls code compatibility with Vitis 2023+" commit around 5 times, so I'm not 100% sure it always works there. Should I run it a few more times there to make sure?

Contributor

bo3z commented Jun 14, 2024

Can you run each 10 times and report how many fail? Lucian and I had a look at the two versions you are pointing to, and nothing seems to have changed between them (I fixed the TCP session handler) that could cause this breakage.

Author

lawirz commented Jun 14, 2024

Results (1 means the test succeeded, 0 means it failed):

accl.get()->initialize(ranks, mpi_rank, mpi_size, 64, 1024, options.seg_size);

  • dev~1
    1111111111
  • docu for vadd example
    1111111111
  • hls code compatibility with Vitis 2023+
    1111111011

accl.get()->initialize(ranks, mpi_rank, mpi_size, 64, 64, options.seg_size);

  • dev~1
    0010111111
  • docu for vadd example
    0111111111
  • hls code compatibility with Vitis 2023+
    1111111111
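
As a rough aid for reading the result strings above, here is a small sketch that tallies runs and failures per string. The names `RunTally` and `tally_runs` are hypothetical and not part of the ACCL test suite.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct RunTally {
    std::size_t runs;      // total number of runs in the string
    std::size_t failures;  // number of runs marked '0'
};

// Counts runs and failures in a result string such as "0010111111",
// where '1' marks a passing run and '0' a failing run.
RunTally tally_runs(const std::string& results) {
    RunTally tally{0, 0};
    for (char c : results) {
        assert(c == '0' || c == '1');  // reject anything but pass/fail marks
        ++tally.runs;
        if (c == '0') {
            ++tally.failures;
        }
    }
    return tally;
}
```

Applied to the strings above, the 64/1024 configuration shows 1 failure in 30 runs and the 64/64 configuration shows 4 failures in 30 runs. With so few runs per commit, these counts are indicative rather than conclusive.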

The result seems to depend on cluster utilization. I am currently almost alone on the cluster, and this is the first time I'm getting this many successes.

The behaviour I initially observed might just have been due to this effect.

The count was still 24.

I tried to avoid false negatives due to filesystem errors.

The script I used:

for i in {1..10};
do
    echo "6 7" | ./run.sh &> /dev/null
    sleep 20
    grep ".*ACCL base functionality test completed successfully.*" accl_log/rank_0_M_7_N_24_H_1_P_1_stdout | wc;
    if grep -q ".*ERROR: ACCL base functionality test failed.*" accl_log/rank_0_M_7_N_24_H_1_P_1_stdout; then
        echo "ERROR found"
    fi
done

Contributor

bo3z commented Jun 17, 2024

Thanks for running these - I don't think this is a utilisation / congestion issue though. The networking stacks ACCL uses (RDMA from Coyote, TCP/IP from EasyNet) both have retransmission if I am not mistaken, so any congestion in the cluster should not cause these issues. Also, such issues are quite low level, so I would expect them to also have an impact on other collectives, not just gather. Could this be some race condition?
