Cannot reconnect Supernode to Superlink when using TLS #4844

Open
d0uwe opened this issue Jan 21, 2025 · 4 comments
Labels
bug Something isn't working

Comments


d0uwe commented Jan 21, 2025

Describe the bug

I am testing with a remote superlink that has TLS and authentication enabled. When I connect a supernode to the superlink, everything works. When I stop the supernode, the superlink logs that the node was deleted. When I start the supernode again, I get the error below. It still connects to the superlink to some degree, because regular pings show up afterwards, but any experiment I submit to the superlink never starts on the supernode that hit this error. I get the same error on both my laptop (mac) and a remote machine (linux) when trying to reconnect the supernode.

Restarting the superlink works around the problem, but then all supernodes have to be reconnected too, which can be inconvenient.

I'm using flwr 1.14.0 on all systems.

Steps/Code to Reproduce

Step 1: Start the superlink with a root certificate and authentication keys
Step 2: Start a supernode and confirm that it connects to the superlink correctly
Step 3: Stop the supernode (Ctrl+C)
Step 4: Wait anywhere from a few seconds to 30 minutes
Step 5: Start the supernode again
Step 6: The error appears
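
The stop/wait/restart cycle in steps 2 to 6 could be scripted roughly as follows. This is a minimal sketch, not part of the original report: the flags are copied from the supernode command under Actual Results, the address and file paths are placeholders, the helper names `run_for` and `reproduce` and their timings are made up for illustration, and it assumes a POSIX system where SIGINT behaves like Ctrl+C.

```python
import signal
import subprocess
import time

# Flags copied from the supernode command in this report; the address and
# file paths are placeholders to replace with your own.
SUPERNODE_CMD = [
    "flower-supernode",
    "--root-certificates", "certificates/ca.crt",
    "--superlink", "your_url:9092",
    "--clientappio-api-address", "0.0.0.0:9099",
    "--node-config=partition-id=1 num-partitions=2",
    "--auth-supernode-private-key", "keys/client_credentials_2",
    "--auth-supernode-public-key", "keys/client_credentials_2.pub",
]


def run_for(cmd, seconds):
    """Start cmd, let it run for `seconds`, then stop it with SIGINT (like Ctrl+C)."""
    proc = subprocess.Popen(cmd)
    try:
        time.sleep(seconds)
    finally:
        proc.send_signal(signal.SIGINT)
        try:
            proc.wait(timeout=30)
        except subprocess.TimeoutExpired:
            # The process ignored SIGINT; force it down so nothing lingers.
            proc.kill()
            proc.wait()
    return proc.returncode


def reproduce(cmd=SUPERNODE_CMD, up_seconds=30, pause_seconds=10):
    """Hypothetical helper: connect, stop, wait, then start again."""
    run_for(cmd, up_seconds)         # steps 2 and 3
    time.sleep(pause_seconds)        # step 4
    return run_for(cmd, up_seconds)  # steps 5 and 6: watch the output for the error
```

Calling `reproduce()` while the superlink from step 1 is running goes through the cycle once; the error, if it occurs, shows up in the output of the second run.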

Expected Results

The supernode successfully connects to the superlink again.

Actual Results

flower-supernode \
    --root-certificates certificates/ca.crt \
    --superlink your_url:9092 \
    --clientappio-api-address 0.0.0.0:9099 \
    --node-config="partition-id=1 num-partitions=2" \
    --auth-supernode-private-key keys/client_credentials_2 \
    --auth-supernode-public-key keys/client_credentials_2.pub
INFO :      Starting Flower SuperNode
INFO :      Starting Flower ClientAppIo gRPC server on 0.0.0.0:9099
Exception in thread Thread-6 (_ping_loop):
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/python@3.11/3.11.11/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/opt/homebrew/Cellar/python@3.11/3.11.11/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "xxx/lib/python3.11/site-packages/flwr/client/heartbeat.py", line 57, in _ping_loop
    retrier.invoke(wrapped_ping)
  File "xxx/lib/python3.11/site-packages/flwr/common/retry_invoker.py", line 276, in invoke
    ret = target(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^
  File "xxx/lib/python3.11/site-packages/flwr/client/heartbeat.py", line 46, in wrapped_ping
    ping_fn()
  File "xxx/lib/python3.11/site-packages/flwr/client/grpc_rere_client/connection.py", line 187, in ping
    raise RuntimeError("Ping failed unexpectedly.")
RuntimeError: Ping failed unexpectedly.
@d0uwe d0uwe added the bug Something isn't working label Jan 21, 2025
@d0uwe d0uwe changed the title Cannot reconnect Supernode to Superlink Cannot reconnect Supernode to Superlink when using TLS Jan 28, 2025
@danieljanes
Member

Thanks for reporting this @d0uwe, we're looking into it

@danieljanes
Member

@d0uwe we found the issue that caused this in Flower 1.14 and are happy to report that it's already fixed on main.

Would you mind testing this with flwr-nightly and reporting back?

@d0uwe
Author

d0uwe commented Jan 30, 2025

Hi @danieljanes,

Thanks for working on this!
I just checked this, but I still run into issues. With flwr 1.14, the following commands work fine, with the superlink hosted on a remote server and the supernode running on my laptop:

flower-superlink \
     --ssl-ca-certfile certificates/ca.crt \
     --ssl-certfile certificates/server.pem \
     --ssl-keyfile certificates/server.key \
     --auth-list-public-keys keys/client_public_keys.csv \
     --auth-superlink-private-key keys/server_credentials \
     --auth-superlink-public-key keys/server_credentials.pub

flower-supernode \
    --root-certificates certificates/ca.crt \
    --superlink hereditary.soil.surf.nl:9092 \
    --clientappio-api-address 0.0.0.0:9094 \
    --node-config="partition-id=0 num-partitions=2" \
    --auth-supernode-private-key keys/client_credentials_1 \
    --auth-supernode-public-key keys/client_credentials_1.pub

When I only swap the virtual environment for one with Flower nightly ('1.15.0.dev20250129') installed (on both the server and my laptop), I get the following output on the superlink:

INFO :      Starting Flower SuperLink
INFO :      Flower Deployment Engine: Starting Exec API on 0.0.0.0:9093
INFO :      Flower ECE: Starting ServerAppIo API (gRPC-rere) on 0.0.0.0:9091
WARNING :   The `--auth-superlink-private-key` and `--auth-superlink-public-key` arguments are deprecated and will be removed in a future release. Node authentication no longer requires these arguments.
INFO :      Node authentication enabled with 20 known public keys
INFO :      Flower ECE: Starting Fleet API (gRPC-rere) on 0.0.0.0:9092
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1738276287.146781 1197763 ssl_transport_security.cc:2140] No match found for server name: <ip of the server>
E0000 00:00:1738276289.461681 1197762 ssl_transport_security.cc:2140] No match found for server name: <ip of the server>
E0000 00:00:1738276291.774890 1197765 ssl_transport_security.cc:2140] No match found for server name: <ip of the server>

The "No match found for server name" errors only showed up once during testing; I'm not sure what triggered them.
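
Since the log line complains about a server name, one thing that may be worth ruling out is a mismatch between the name or IP a client dials and the subjectAltName entries in the server certificate. That cause is an assumption and has not been confirmed in this thread. The sketch below uses only the Python standard library; the helper names `fetch_peer_cert` and `san_entries` are made up for illustration.

```python
import socket
import ssl


def fetch_peer_cert(host, port, cafile):
    """Return the server certificate as the dict produced by getpeercert()."""
    ctx = ssl.create_default_context(cafile=cafile)
    # Verify the chain against our CA but skip the hostname check, so the
    # certificate can still be inspected when the name does not match.
    ctx.check_hostname = False
    # gRPC runs over HTTP/2, so offer "h2" via ALPN during the handshake.
    ctx.set_alpn_protocols(["h2"])
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


def san_entries(cert):
    """Collect DNS and IP subjectAltName values from a getpeercert() dict."""
    return {
        value
        for kind, value in cert.get("subjectAltName", ())
        if kind in ("DNS", "IP Address")
    }
```

For example, `san_entries(fetch_peer_cert("your_url", 9092, "certificates/ca.crt"))` lists the names the certificate covers, which can then be compared with the address passed to `--superlink`.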

And the following error on the supernode:

INFO :      Starting Flower SuperNode
INFO :      Starting Flower ClientAppIo gRPC server on 0.0.0.0:9094
WARNING :   Can't get server public key, SuperLink may be offline
ERROR :     <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAUTHENTICATED
        details = "Missing authentication metadata"
        debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2025-01-30T23:33:20.833794+01:00", grpc_status:16, grpc_message:"Missing authentication metadata"}"
>
INFO :      Disconnect and shut down

I also tried resolving the deprecation warning about --auth-superlink-private-key and --auth-superlink-public-key, but that didn't change anything. Hope this helps!

@EndingCredits

EndingCredits commented Feb 14, 2025

N.B.: I am getting the same client error with the flwr/supernode:nightly Docker image (though 1.14.0 is then installed with pip), with the --insecure flag set in both the superlink and supernode run commands.

(The supernode command is flwr-supernode:latest --superlink="<machine-ip>:9092" --insecure --node-config="site='site1'" --isolation=subprocess, run inside Docker; the superlink is run as flower-superlink --insecure from a venv with whatever the latest Flower is.)

Downgrading the superlink to flwr==1.14.0 seems to fix this.
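
Both reports in this thread involve switching between 1.14.0 and a nightly build, so it may help to confirm which build each environment actually has before comparing results. A minimal sketch using only the standard library follows; the helper name `installed_version` is made up for illustration, and `flwr` and `flwr-nightly` are the distribution names mentioned in this thread.

```python
from importlib import metadata


def installed_version(*dist_names):
    """Return (name, version) for the first installed distribution, else None."""
    for name in dist_names:
        try:
            return name, metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return None
```

Running `print(installed_version("flwr", "flwr-nightly"))` in the superlink environment and again in the supernode environment (or container) shows whether the two sides are on the same build.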
