Train stops when a client fails #4346

Open
oabuhamdan opened this issue Oct 21, 2024 · 1 comment
Labels
bug Something isn't working part: examples Add or update a Flower example

Comments

@oabuhamdan

Describe the bug

When a round encounters failures because the gRPC bridge is closed for one of the clients, the whole training stops.
At first, evaluation did not run after fitting, so I disabled evaluation.
Now, if the first round has failures, the second round doesn't start!

Steps/Code to Reproduce

I am using the code example here
https://flower.ai/docs/examples/embedded-devices.html
Most of the time my topology works, but when the gRPC bridge closes (I'm not sure why), the training stops.

Expected Results

The training should continue, ignoring the failed devices, when accept_failures is True. Alternatively, the server should try to create a new gRPC connection (bridge).
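To make the expected behaviour concrete, here is a minimal sketch in plain Python. It does not use or reproduce Flower's code: `FitResult` and `aggregate_round` are hypothetical names invented for illustration, and the weighted average is only an assumed aggregation rule. The point is that a failed client is skipped when `accept_failures` is True, so the round still produces a result.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class FitResult:
    """Hypothetical stand-in for one client's successful fit reply."""

    weights: List[float]
    num_examples: int


def aggregate_round(
    results: Sequence[FitResult],
    failures: Sequence[BaseException],
    accept_failures: bool = True,
) -> Optional[List[float]]:
    """Average the weights of the clients that replied, weighted by example count.

    Returns None when nothing can be aggregated: either no client replied,
    or at least one client failed and failures are not accepted.
    """
    if not results:
        return None
    if failures and not accept_failures:
        return None

    total_examples = sum(r.num_examples for r in results)
    dim = len(results[0].weights)
    return [
        sum(r.weights[i] * r.num_examples for r in results) / total_examples
        for i in range(dim)
    ]


if __name__ == "__main__":
    # Two clients replied; a third failed (for example, its bridge closed).
    replies = [FitResult([1.0, 2.0], 10), FitResult([3.0, 4.0], 30)]
    errors = [RuntimeError("bridge closed")]
    print(aggregate_round(replies, errors, accept_failures=True))
    print(aggregate_round(replies, errors, accept_failures=False))
```

In this sketch a caller that receives None would keep the previous global weights and move on to the next round rather than stop.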

Actual Results

The training doesn't continue when accept_failures is True.

@oabuhamdan oabuhamdan added the bug Something isn't working label Oct 21, 2024
@jafermarq jafermarq added the part: examples Add or update a Flower example label Oct 21, 2024
@jafermarq
Contributor

Hey @oabuhamdan, thanks for opening this issue. That example still needs to be updated to the new way of using Flower (i.e. via flwr run). Most of the other examples have been updated; see, for example, the recently updated https://github.com/adap/flower/tree/main/examples/flower-authentication.

We plan to update the embedded devices example by the end of the week. Are you interested in starting this effort? Do you have some bandwidth? The steps aren't too complex or different from other examples using the Deployment Engine (as the authentication example does). For this we just need:

  • Show users how to launch a SuperLink and SuperExec on a workstation or laptop.
  • Then explain that flower-supernode needs to be executed on the RPis.
  • With the above in place and in an "idling" state, someone runs flwr run . and this starts the Run.

I'd suggest starting with the bare minimum and then adding more features (like the SSL certificates). Using Docker is not needed. We can first focus on RPi devices (and later verify that things work on Jetsons). Feel free to change the models and datasets that are used in the example (but let's keep the workload lightweight if possible).

Let me know if you'd like to start this effort!
