New users are not able to join once training started #718

Open
2 of 4 tasks
JulienVig opened this issue Jul 23, 2024 · 0 comments · May be fixed by #775
Labels: bug (Something isn't working), decentralized (For the decentralized setting), federated (For the federated setting)

Comments

Collaborator

JulienVig commented Jul 23, 2024

Currently, participants need to join within a few seconds of each other; otherwise their contributions are dropped (by the server in the federated setting and by the other peers in the decentralized setting).

Federated

  • The first thing to do is to make the server return the current communication round so that a peer can get up to date with the network. This is currently not done, so users who join a bit too late keep trailing behind and sending outdated updates.
  • If the server receives an outdated contribution, it should send back the latest model so that the user can catch up.
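
To make the two federated points concrete, here is a minimal sketch of a server that reports its current round on join and answers an outdated contribution with the latest model. Everything in it is hypothetical: `FederatedServer`, `join`, `receive`, `ServerReply`, `contributionsPerRound` and the plain element-wise averaging are illustrative assumptions, not the project's actual API or aggregation scheme.

```typescript
// Hypothetical sketch: none of these names come from the project's code base.
type Weights = number[];

interface Contribution {
  round: number; // round the peer believes it is contributing to
  weights: Weights;
}

type ServerReply =
  | { kind: "accepted"; round: number }
  | { kind: "outdated"; round: number; latestModel: Weights };

// Element-wise mean, standing in for whatever aggregation is really used.
function average(all: Weights[]): Weights {
  return all[0].map((_, i) => all.reduce((sum, w) => sum + w[i], 0) / all.length);
}

class FederatedServer {
  private round = 0;
  private model: Weights;
  private pending: Weights[] = [];
  private readonly contributionsPerRound: number;

  constructor(initialModel: Weights, contributionsPerRound: number) {
    this.model = initialModel;
    this.contributionsPerRound = contributionsPerRound;
  }

  // A joining peer learns the current round and model right away.
  join(): { round: number; latestModel: Weights } {
    return { round: this.round, latestModel: [...this.model] };
  }

  receive(c: Contribution): ServerReply {
    if (c.round < this.round) {
      // Outdated: do not aggregate, send back what the peer needs to catch up.
      return { kind: "outdated", round: this.round, latestModel: [...this.model] };
    }
    this.pending.push(c.weights);
    if (this.pending.length >= this.contributionsPerRound) {
      this.model = average(this.pending);
      this.pending = [];
      this.round += 1;
    }
    return { kind: "accepted", round: this.round };
  }
}
```

With this shape, a late peer that calls `join()` starts from the server's round instead of round 0, and a peer that still sends a stale update can overwrite its local model and round from the `outdated` reply before training again.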

Decentralized

  • A peer joining the network after training has started doesn't update its round to the current round, so its contributions are always dropped because they are outdated.
  • As soon as minReadyPeer peers have joined a task, collaborative training starts, sometimes so quickly that other peers do not have time to join and contribute before the first round has finished. For example, if minReadyPeer = 3, the first 3 peers start aggregating their contributions as soon as the 3rd one joins. Even if a 4th peer joins right after the 3rd, its contribution may be dropped because the first 3 peers have already moved on to a new round. The previous checkbox should fix this by letting outdated peers catch up to the latest round, but the first round may still finish before every peer that wanted to join has contributed.
    A potential fix is to implement a waiting stage in which peers join a task and signal that they are ready to start training. Concretely, peers click on "Join task". If training is already under way, they catch up to the current round. Otherwise, they enter a waiting room where they can see how many peers are currently waiting. Once there are more than minReadyPeers peers in the waiting room, each peer can press a "Ready" button to signal that it would like training to start now. Once all peers have pressed "Ready", training starts, and later peers can join mid-training without going through the waiting room.
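
The waiting-room flow proposed above could be sketched as a small state machine. This is only an illustration under assumptions: `WaitingRoom`, `JoinResult`, `canPressReady`, `pressReady` and `advanceRound` are hypothetical names, the strict "more than" comparison follows the wording of the proposal (use `>=` if the threshold is meant to be inclusive), and where this state would live in the decentralized setting is left open.

```typescript
// Hypothetical sketch of the proposed waiting room; not the project's API.
type JoinResult =
  | { kind: "catch-up"; round: number } // training already running
  | { kind: "waiting"; waitingPeers: number }; // placed in the waiting room

class WaitingRoom {
  private waiting = new Set<string>();
  private ready = new Set<string>();
  private trainingStarted = false;
  private round = 0;
  private readonly minReadyPeers: number;

  constructor(minReadyPeers: number) {
    this.minReadyPeers = minReadyPeers;
  }

  // "Join task": catch up if training is under way, otherwise wait.
  join(peerId: string): JoinResult {
    if (this.trainingStarted) {
      return { kind: "catch-up", round: this.round };
    }
    this.waiting.add(peerId);
    return { kind: "waiting", waitingPeers: this.waiting.size };
  }

  // The "Ready" button is enabled once more than minReadyPeers peers wait.
  canPressReady(): boolean {
    return !this.trainingStarted && this.waiting.size > this.minReadyPeers;
  }

  // Records a "Ready" press; returns true once training has started.
  pressReady(peerId: string): boolean {
    if (this.canPressReady() && this.waiting.has(peerId)) {
      this.ready.add(peerId);
      // Training starts only when every waiting peer has pressed "Ready".
      if (this.ready.size === this.waiting.size) {
        this.trainingStarted = true;
      }
    }
    return this.trainingStarted;
  }

  // Called at the end of each aggregation round once training runs.
  advanceRound(): void {
    if (this.trainingStarted) this.round += 1;
  }
}
```

A peer that joins while others have already pressed "Ready" is added to the waiting set, so training does not start until that peer is ready as well. Whether that is desirable, or whether a timeout is needed, is a design question the proposal does not settle.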
JulienVig added the bug label Jul 23, 2024
JulienVig added this to the v4.0.0 milestone Jul 23, 2024
JulienVig added the federated and decentralized labels Jul 23, 2024
JulienVig added a commit that referenced this issue Aug 20, 2024
JulienVig linked a pull request Sep 11, 2024 that will close this issue
JulienVig self-assigned this Sep 12, 2024

1 participant