Adding more client fail #4678

Open · Camille-Molinier opened this issue Dec 11, 2024 · 10 comments

Labels
part: misc framework · state: under review · type: bug

Comments

@Camille-Molinier

Describe the bug

Hello,

I'm trying to run a Flower server and several clients in different terminals. I use a FedAvg server with all default parameters and the client implemented in the tutorial from the docs.
If I run two clients, the process goes well, but if I run three clients, the last one fails.

The only way to avoid the error is to set min_available_clients or min_fit_clients to three, but then I can't run four clients.

Can someone help me understand this error?

Steps/Code to Reproduce

server.py :

import flwr as fl
from flwr.server.strategy import FedAvg

MAX_ROUND = 2

fl.server.start_server(
    server_address="0.0.0.0:8080",
    config=fl.server.ServerConfig(num_rounds=MAX_ROUND),
    strategy=FedAvg()
)

client.py :

from collections import OrderedDict

import flwr as fl
import torch

# Net, load_data, train, test, and DEVICE are assumed to live in the
# local task.py (see the traceback below)
from task import DEVICE, Net, load_data, test, train


def set_parameters(model, parameters):
    # Rebuild a state dict from the NumPy arrays sent by the server
    params_dict = zip(model.state_dict().keys(), parameters)
    state_dict = OrderedDict({k: torch.tensor(v) for k, v in params_dict})
    model.load_state_dict(state_dict, strict=True)

    return model


trainloader, valloader, testloader = load_data()


class FlowerClient(fl.client.NumPyClient):
    def __init__(self):
        super().__init__()
        self.net = Net()

    def get_parameters(self, config):
        return [val.cpu().numpy() for _, val in self.net.state_dict().items()]

    def fit(self, parameters, config):
        set_parameters(self.net, parameters)
        # torch.save(net.state_dict(), f'{self.save_path}/epoch_{self.epoch}')
        # self.epoch += 1
        train(self.net, trainloader, valloader, 1, DEVICE)

        return self.get_parameters({}), len(trainloader.dataset), {}

    def evaluate(self, parameters, config):
        set_parameters(self.net, parameters)
        loss, accuracy = test(self.net, testloader, DEVICE)

        return float(loss), len(testloader.dataset), {'accuracy': accuracy}


fl.client.start_client(
    server_address='127.0.0.1:8080',
    client=FlowerClient().to_client()
)

Dependencies :

cffi==1.17.1
click==8.1.7
contourpy==1.3.1
cryptography==42.0.8
cycler==0.12.1
filelock==3.16.1
flwr==1.12.0
fonttools==4.55.3
fsspec==2024.10.0
grpcio==1.64.3
iterators==0.0.2
Jinja2==3.1.4
kiwisolver==1.4.7
markdown-it-py==3.0.0
MarkupSafe==3.0.2
matplotlib==3.9.3
mdurl==0.1.2
mpmath==1.3.0
networkx==3.4.2
numpy==1.26.4
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
packaging==24.2
pathspec==0.12.1
pillow==11.0.0
protobuf==4.25.5
pycparser==2.22
pycryptodome==3.21.0
Pygments==2.18.0
pyparsing==3.2.0
python-dateutil==2.9.0.post0
rich==13.9.4
setuptools==75.6.0
shellingham==1.5.4
six==1.17.0
sympy==1.13.1
tomli==2.2.1
tomli_w==1.1.0
torch==2.5.1
torchvision==0.20.1
triton==3.1.0
typer==0.12.5
typing_extensions==4.12.2

Expected Results

I want to run an arbitrary number of clients, each in a separate terminal.

Actual Results

INFO :      
INFO :      Received: evaluate message 22c2f4b1-4f12-419b-8db7-fb40ab0885b3
ERROR :     Client raised an exception.
Traceback (most recent call last):
  File "/home/cmolinier/Dev/flower/venv/lib/python3.12/site-packages/flwr/client/app.py", line 536, in start_client_internal
    reply_message = client_app(message=message, context=context)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmolinier/Dev/flower/venv/lib/python3.12/site-packages/flwr/client/client_app.py", line 143, in __call__
    return self._call(message, context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmolinier/Dev/flower/venv/lib/python3.12/site-packages/flwr/client/client_app.py", line 126, in ffn
    out_message = handle_legacy_message_from_msgtype(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmolinier/Dev/flower/venv/lib/python3.12/site-packages/flwr/client/message_handler/message_handler.py", line 136, in handle_legacy_message_from_msgtype
    evaluate_res = maybe_call_evaluate(
                   ^^^^^^^^^^^^^^^^^^^^
  File "/home/cmolinier/Dev/flower/venv/lib/python3.12/site-packages/flwr/client/client.py", line 275, in maybe_call_evaluate
    return client.evaluate(evaluate_ins)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmolinier/Dev/flower/venv/lib/python3.12/site-packages/flwr/client/numpy_client.py", line 283, in _evaluate
    results = self.numpy_client.evaluate(parameters, ins.config)  # type: ignore
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmolinier/Dev/flower/client.py", line 36, in evaluate
    loss, accuracy = test(self.net, testloader, DEVICE)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmolinier/Dev/flower/task.py", line 68, in test
    outputs = net(images)
              ^^^^^^^^^^^
  File "/home/cmolinier/Dev/flower/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmolinier/Dev/flower/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmolinier/Dev/flower/task.py", line 26, in forward
    x = self.pool(F.relu(self.conv1(x)))
                         ^^^^^^^^^^^^^
  File "/home/cmolinier/Dev/flower/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmolinier/Dev/flower/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmolinier/Dev/flower/venv/lib/python3.12/site-packages/torch/nn/modules/conv.py", line 554, in forward
    return self._conv_forward(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmolinier/Dev/flower/venv/lib/python3.12/site-packages/torch/nn/modules/conv.py", line 549, in _conv_forward
    return F.conv2d(
           ^^^^^^^^^
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
Traceback (most recent call last):
  File "/home/cmolinier/Dev/flower/client.py", line 40, in <module>
    fl.client.start_client(
  File "/home/cmolinier/Dev/flower/venv/lib/python3.12/site-packages/flwr/client/app.py", line 180, in start_client
    start_client_internal(
  File "/home/cmolinier/Dev/flower/venv/lib/python3.12/site-packages/flwr/client/app.py", line 543, in start_client_internal
    raise ex
  File "/home/cmolinier/Dev/flower/venv/lib/python3.12/site-packages/flwr/client/app.py", line 536, in start_client_internal
    reply_message = client_app(message=message, context=context)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmolinier/Dev/flower/venv/lib/python3.12/site-packages/flwr/client/client_app.py", line 143, in __call__
    return self._call(message, context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmolinier/Dev/flower/venv/lib/python3.12/site-packages/flwr/client/client_app.py", line 126, in ffn
    out_message = handle_legacy_message_from_msgtype(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmolinier/Dev/flower/venv/lib/python3.12/site-packages/flwr/client/message_handler/message_handler.py", line 136, in handle_legacy_message_from_msgtype
    evaluate_res = maybe_call_evaluate(
                   ^^^^^^^^^^^^^^^^^^^^
  File "/home/cmolinier/Dev/flower/venv/lib/python3.12/site-packages/flwr/client/client.py", line 275, in maybe_call_evaluate
    return client.evaluate(evaluate_ins)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmolinier/Dev/flower/venv/lib/python3.12/site-packages/flwr/client/numpy_client.py", line 283, in _evaluate
    results = self.numpy_client.evaluate(parameters, ins.config)  # type: ignore
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmolinier/Dev/flower/client.py", line 36, in evaluate
    loss, accuracy = test(self.net, testloader, DEVICE)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmolinier/Dev/flower/task.py", line 68, in test
    outputs = net(images)
              ^^^^^^^^^^^
  File "/home/cmolinier/Dev/flower/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmolinier/Dev/flower/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmolinier/Dev/flower/task.py", line 26, in forward
    x = self.pool(F.relu(self.conv1(x)))
                         ^^^^^^^^^^^^^
  File "/home/cmolinier/Dev/flower/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmolinier/Dev/flower/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmolinier/Dev/flower/venv/lib/python3.12/site-packages/torch/nn/modules/conv.py", line 554, in forward
    return self._conv_forward(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cmolinier/Dev/flower/venv/lib/python3.12/site-packages/torch/nn/modules/conv.py", line 549, in _conv_forward
    return F.conv2d(
           ^^^^^^^^^
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

@Camille-Molinier Camille-Molinier added the type: bug Something isn't working label Dec 11, 2024
@WilliamLindskog
Contributor

Hi Camille,

Thanks for raising this. Could you please test this example using the newer flwr versions? You can find the example here: https://github.com/adap/flower/tree/main/examples/quickstart-pytorch.

The code that you are using is no longer supported.

@WilliamLindskog WilliamLindskog added state: under review Currently reviewing issue/PR part: misc framework Issue/PR for general applications for Flower framework. labels Dec 11, 2024
@Camille-Molinier
Author

Hi,

Thanks for your answer. The code you gave me runs perfectly (with Python 3.10).
But I want to run the server and the clients in different terminals, and currently that doesn't work (server_app.py just does nothing when run with python server_app.py).

How can I adapt the PyTorch example to run these separately and see if my original bug reproduces?

@WilliamLindskog
Contributor

Hi,

You could test this example: https://github.com/adap/flower/tree/main/examples/app-pytorch.

It depends on the newer Flower versions and does what I think you want to accomplish. Note that you should try to run it in "Deployment Mode" as specified in the README. You can then also launch it in a real-world setting if that is an end goal. However, it also works to simply open up different terminals.

If this solves your problem, please feel free to close this issue.

@Camille-Molinier
Author

Thanks for your answer.
I've tried this example, and unfortunately it ends with:

WARNING :   UNSUPPORTED FEATURE: The command `flower-server-app` is deprecated and no longer in use. Use the `flwr-serverapp` exclusively instead.

            This is an unsupported feature. It will be removed
            entirely in future versions of Flower.
        
ERROR :     `flower-server-app` used.

And yes, my end goal is to deploy Flower for experiments on different machines or in different Docker containers, if you want more context.

@WilliamLindskog
Contributor

Hm, that is strange. Let me test it later today and get back to you ASAP.

@WilliamLindskog
Contributor

Hi Camille,

I am still working on this; rest assured, I did not forget about this issue.

@Camille-Molinier
Author

Hi William,

Thanks! I'm still trying to solve it too; if I make any progress, I will share it.

@WilliamLindskog
Contributor

WilliamLindskog commented Dec 19, 2024

Hi Camille,

Thank you for your patience. Here are two alternative examples of how to run flwr in different terminals: the embedded-devices example, but with fewer steps, or the quickstart with Docker.

Steps (Quick Way)

Ensure that the flwr CLI is installed on your system, then run flwr new and follow the prompted steps. Don't forget to install the required packages (a sketch of these steps follows).
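
A minimal sketch of that scaffolding step (the project name is a placeholder; flwr new prompts for the framework and username if they are not passed as flags):

# scaffold a new Flower app; "my-flower-app" is a placeholder name
flwr new my-flower-app --framework PyTorch --username flower

# install the generated project and its dependencies
cd my-flower-app
pip install -e .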

Thereafter, you can start the SuperLink:

flower-superlink --insecure

Then we launch the clients, which consist of the SuperNodes, and connect them to the SuperLink. The first SuperNode is launched as follows:

flower-supernode --insecure \
                 --superlink="127.0.0.1:9092" \
                 --clientappio-api-address="0.0.0.0:9094" \
                 --node-config="num-partitions=2 partition-id=0"

Then launch the second one, using a different port for the ClientAppIo API address:

flower-supernode --insecure \
                 --superlink="127.0.0.1:9092" \
                 --clientappio-api-address="0.0.0.0:9095" \
                 --node-config="num-partitions=2 partition-id=1"

Then run the federation with `flwr run . <FEDERATION NAME> --stream`. This command initiates the federated learning run using the configuration specified in your pyproject.toml under the [tool.flwr.federations.embedded-federation](https://github.com/adap/flower/blob/47d7228b04b6c755860bdbe7e9276d5b500fc04a/examples/embedded-devices/pyproject.toml) section. This can be changed to:

[tool.flwr.federations.local-deployment]
address = "127.0.0.1:9093"
insecure = true

You will need to configure the data pipeline to make sure that it fits this deployment.

Prerequisites (Docker):

  • Ensure that the flwr CLI is installed on your system and verify that the Docker daemon is running. On Mac or Linux, you can check with docker info, or start the daemon with sudo systemctl start docker, as sketched below.
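
A quick sanity check, as a sketch (docker info only verifies that the daemon is reachable; the systemctl command applies to Linux with systemd):

# verify that the Docker daemon is reachable
docker info

# on Linux with systemd, start the daemon if it is not running
sudo systemctl start docker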

Steps (Docker Example)

You can then create a new Flower project tailored for PyTorch:

flwr new quickstart-docker --framework PyTorch --username flower

# and navigate to the project directory
cd quickstart-docker

Next, establish a Docker bridge network named flwr-network to facilitate container communication.

docker network create --driver bridge flwr-network

When this is done, you can deploy the SuperLink, which coordinates communication between the server and clients.

docker run --rm \
  -p 9091:9091 -p 9092:9092 -p 9093:9093 \
  --network flwr-network \
  --name superlink \
  --detach \
  flwr/superlink:1.13.1 \
  --insecure \
  --isolation process

Then initiate the SuperNodes, each representing a client in the federated learning setup. For the first SuperNode:

docker run --rm \
  -p 9094:9094 \
  --network flwr-network \
  --name supernode-1 \
  --detach \
  flwr/supernode:1.13.1 \
  --insecure \
  --superlink superlink:9092 \
  --node-config "partition-id=0 num-partitions=2" \
  --clientappio-api-address 0.0.0.0:9094 \
  --isolation process

For the second SuperNode:

docker run --rm \
  -p 9095:9095 \
  --network flwr-network \
  --name supernode-2 \
  --detach \
  flwr/supernode:1.13.1 \
  --insecure \
  --superlink superlink:9092 \
  --node-config "partition-id=1 num-partitions=2" \
  --clientappio-api-address 0.0.0.0:9095 \
  --isolation process

When that is complete, you can build and run a ServerApp. Create a Dockerfile named serverapp.Dockerfile to define the ServerApp image:

FROM flwr/serverapp:1.13.1
WORKDIR /app
COPY pyproject.toml .
RUN sed -i 's/.*flwr\[simulation\].*//' pyproject.toml \
    && python -m pip install -U --no-cache-dir .
ENTRYPOINT ["flwr-serverapp"]

Then build the ServerApp image:

docker build -f serverapp.Dockerfile -t flwr_serverapp:0.0.1 .

Thereafter, run the ServerApp container:

docker run --rm \
  --network flwr-network \
  --name serverapp \
  --detach \
  flwr_serverapp:0.0.1 \
  --insecure \
  --serverappio-api-address superlink:9091

After you have built the ServerApp, you can create a Dockerfile named clientapp.Dockerfile for the ClientApp image:

FROM flwr/clientapp:1.13.1
WORKDIR /app
COPY pyproject.toml .
RUN sed -i 's/.*flwr\[simulation\].*//' pyproject.toml \
    && python -m pip install -U --no-cache-dir .
ENTRYPOINT ["flwr-clientapp"]

Build the ClientApp image:

docker build -f clientapp.Dockerfile -t flwr_clientapp:0.0.1 .

Run the ClientApp containers in other terminals, connecting them to the respective SuperNodes. For the first ClientApp:

docker run --rm \
  --network flwr-network \
  --detach \
  flwr_clientapp:0.0.1 \
  --insecure \
  --clientappio-api-address supernode-1:9094

For the second ClientApp:

docker run --rm \
  --network flwr-network \
  --detach \
  flwr_clientapp:0.0.1 \
  --insecure \
  --clientappio-api-address supernode-2:9095

Now, you want to execute the federated learning run. Add the following configuration to pyproject.toml to define the federation:

[tool.flwr.federations.local-deployment]
address = "127.0.0.1:9093"
insecure = true

Then, initiate the federated learning process:

flwr run . local-deployment --stream

To apply changes, update the Python files accordingly, then rebuild the Docker images and restart the services, as sketched below.
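
For example, a rebuild-and-restart loop might look like this sketch (container names match those used above; the ClientApp containers were started without --name, so they are stopped by image here):

# stop the app containers (the SuperLink and SuperNodes can keep running)
docker stop serverapp
docker ps -q --filter ancestor=flwr_clientapp:0.0.1 | xargs -r docker stop

# rebuild the images with the updated code
docker build -f serverapp.Dockerfile -t flwr_serverapp:0.0.1 .
docker build -f clientapp.Dockerfile -t flwr_clientapp:0.0.1 .

# re-run the containers as shown above, then start a new run
flwr run . local-deployment --stream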

Hope this helps! If so, please go ahead and close this issue.

@WilliamLindskog
Contributor

Hi @Camille-Molinier,
Did any of the above comments help you?

All the best
William

@Camille-Molinier
Author

Hi William,
I'm currently working on your last comment. It looks like it works with 2 clients. Now I need to try to scale up.

I'll get back to you when I've completed this.
Thanks
