Context - This observation comes from running a simulation of 120 nodes using gRPC for `traditional_fl`. The problem would be less severe when each node interacts with only ~10-20 nodes in any given round.
Issue - Right now the `broadcast` function is implemented by looping over the `send` function, which is a unicast function. This makes `broadcast` effectively serial: each send has to finish before the next one starts, so the total broadcast time grows linearly with the number of receiving nodes.
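To illustrate the serial cost, here is a minimal sketch, not the actual sonar code. `send` is a hypothetical stand-in for the unicast call and only simulates an assumed fixed latency with `time.sleep`; `broadcast_serial` mirrors the loop described above, and `broadcast_threaded` shows the same fan-out through a thread pool for comparison.

```python
import time
from concurrent.futures import ThreadPoolExecutor

SEND_LATENCY_S = 0.01  # assumed per-send latency, for illustration only


def send(node_id: int, payload: bytes) -> int:
    """Hypothetical unicast send: blocks for one simulated round trip."""
    time.sleep(SEND_LATENCY_S)
    return node_id


def broadcast_serial(node_ids, payload):
    """Broadcast as a loop over send: total time grows linearly with node count."""
    return [send(n, payload) for n in node_ids]


def broadcast_threaded(node_ids, payload, max_workers=32):
    """Same fan-out, but the sends overlap inside a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda n: send(n, payload), node_ids))


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

Calling `timed(broadcast_serial, nodes, payload)` and `timed(broadcast_threaded, nodes, payload)` on the same node list gives a rough feel for the gap; with real gRPC calls the numbers will depend on payload size and server capacity.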
Solution - While this can be improved by making the `send` function multi-threaded, I believe a better approach would be to have nodes pull the model updates instead of having the super-node push them to each node. Even with a pull approach, multithreading would still be needed so that nodes that request early wait until the freshest copy of the model weights is available. Furthermore, the server may not respond to a request if too many nodes are already in the request queue, so we will have to implement retry logic. Retry logic is already implemented for the `register` function in https://github.com/aidecentralized/sonar/blob/main/src/utils/communication/grpc/main.py
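A rough sketch of the pull approach, under stated assumptions: `ModelStore`, `ServerBusy`, and `pull_with_retry` are hypothetical names and are not taken from the sonar codebase or from the existing `register` retry logic. The store blocks early requesters on a `threading.Condition` until the requested round is published, rejects requests beyond an assumed in-flight limit, and the client retries with exponential backoff.

```python
import threading
import time


class ServerBusy(Exception):
    """Raised when the (hypothetical) server has too many requests in flight."""


class ModelStore:
    """Hypothetical super-node side store that nodes pull weights from."""

    def __init__(self, max_in_flight: int = 8):
        self._cond = threading.Condition()
        self._round = -1
        self._weights = None
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def publish(self, round_id: int, weights) -> None:
        """Store fresh weights and wake every node waiting for them."""
        with self._cond:
            self._round, self._weights = round_id, weights
            self._cond.notify_all()

    def pull(self, min_round: int, timeout: float = 1.0):
        """Return weights for a round >= min_round, waiting up to `timeout` seconds."""
        if not self._slots.acquire(blocking=False):
            raise ServerBusy("request queue full")
        try:
            with self._cond:
                ready = self._cond.wait_for(lambda: self._round >= min_round, timeout)
                if not ready:
                    raise TimeoutError(f"round {min_round} not published yet")
                return self._round, self._weights
        finally:
            self._slots.release()


def pull_with_retry(store, min_round, retries=5, base_delay=0.05):
    """Retry busy or not-yet-ready pulls with exponential backoff."""
    for attempt in range(retries):
        try:
            return store.pull(min_round)
        except (ServerBusy, TimeoutError):
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

In this sketch the super-node only calls `publish` once per round, so its cost no longer scales with the number of nodes; the load moves to the request queue, which is why the in-flight limit and the retry loop matter.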