I'm working on the integration of ACCL and OMPC. I'm currently using ACCL's distributed emulation approach to start testing the offloading of computation to Alveo boards in a distributed system, with ACCL as the communication backend.
I've tried some scenarios:
4 nodes: With 3 or more emulated ACCL instances per node (12 or more in total), the application we run gets stuck after some time.
3 nodes: With 4 or more emulated ACCL instances per node (12 or more in total), the application we run gets stuck after some time.
2 nodes: With 6 or more emulated ACCL instances per node (12 or more in total), the application we run gets stuck after some time.
1 node (not distributed): Tested with up to 20 ACCL instances; this works with no problem.
Any scenario with 10 or fewer instances in total works fine (I can't test with 11 instances due to some integration constraints).
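To make the pattern in these scenarios easier to see, here is a small Python sketch that tabulates them. It only restates the numbers reported above; the names `observations` and `total_instances` are hypothetical helpers for illustration and are not part of ACCL or OMPC.

```python
# Smallest failing configuration per node count, plus the single-node run,
# taken from the scenarios listed above: (nodes, instances_per_node, gets_stuck)
observations = [
    (4, 3, True),
    (3, 4, True),
    (2, 6, True),
    (1, 20, False),  # single node, not distributed
]


def total_instances(nodes, per_node):
    """Total number of emulated ACCL instances across all nodes."""
    return nodes * per_node


# Totals at which the multi-node runs start to get stuck
failing_totals = {
    total_instances(nodes, per_node)
    for nodes, per_node, gets_stuck in observations
    if gets_stuck
}

for nodes, per_node, gets_stuck in observations:
    status = "gets stuck" if gets_stuck else "works"
    print(f"{nodes} node(s) x {per_node} instance(s) = "
          f"{total_instances(nodes, per_node)} total: {status}")
```

In every multi-node case the smallest failing configuration has the same total, which suggests the problem may depend on the total instance count across nodes rather than on the per-node count. I have not confirmed this.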