What does epsilon do in code? It seems to have caused an overflow #2

Open
yao9261 opened this issue Nov 9, 2021 · 5 comments

yao9261 commented Nov 9, 2021

Hi, I am a graduate student and I am very interested in your thesis. I have some difficulties when trying to run the code.

clustering.py, lines 162-165:

for i in range(n_clusters):
    pop_clusters[i, 0] = i + 1
    for client in np.where(clusters == i + 1)[0]:
        pop_clusters[i, 1] += int(weights[client] * epsilon * n_sampled)

While debugging, I found that some pop_clusters[i, 1] values became negative after this calculation, and I suspect an overflow.
I also don’t understand the role of “epsilon” here. Could you help me understand it?

Author

yao9261 commented Nov 9, 2021

Labs-Federated-Learning-clustered_sampling\py_func\clustering.py:165: RuntimeWarning: overflow encountered in long_scalars
  pop_clusters[i, 1] += int(weights[client] * epsilon * n_sampled)

It did overflow
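
The negative values are what wraparound in a fixed-width integer would look like. The following is a minimal sketch, not the repository's code: it assumes that pop_clusters is a NumPy array with a 32-bit integer dtype (the thread does not confirm this), and the weight, n_sampled, and the helper name accumulate are hypothetical.

```python
import numpy as np

def accumulate(contributions, dtype):
    """Sum integer contributions in a one-element array of the given dtype."""
    counter = np.zeros(1, dtype=dtype)
    for c in contributions:
        # Array addition wraps around silently when the dtype is too narrow.
        counter += np.array([c], dtype=dtype)
    return int(counter[0])

# Hypothetical inputs: three clients of weight 0.01 in one cluster, n_sampled = 10.
weight, n_sampled, epsilon = 0.01, 10, 10**10
contributions = [int(weight * epsilon * n_sampled)] * 3  # 10**9 each

wrapped = accumulate(contributions, np.int32)  # 3 * 10**9 exceeds the 32-bit range
exact = accumulate(contributions, np.int64)
print(wrapped, exact)  # -1294967296 3000000000
```

Each contribution fits in 32 bits on its own; only the running sum wraps, which matches the observation that the cluster total, not a single term, turns negative.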

Contributor

YannFra commented Nov 24, 2021

Hi yao9261, thank you for your interest in this work.

Overflow issue. clustering.py was written for the experimental scenarios discussed in this paper. Does your issue occur with one of these scenarios or a different one? Please give me more details about the inputs of get_clusters_with_alg_2 that lead to your error (linkage_matrix, n_sampled, and weights). You are right that pop_clusters[i, 1] is supposed to be non-negative.
Also, tests/test_clustering.py contains some tests for get_clusters_with_alg_2. They all pass.

epsilon. You are discussing the implementation of Algorithm 2. In our work, we consider that the input is {n_i}, while in get_clusters_with_alg_2 the inputs are the client importances {p_i}, i.e. weights. epsilon is used to convert a client importance into an integer.
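
As an illustration of that conversion, here is a minimal sketch that applies the expression from the quoted line to a few made-up importances. The helper name importance_to_counts and the example values are hypothetical; only the expression int(weights[client] * epsilon * n_sampled) comes from the thread.

```python
import numpy as np

def importance_to_counts(weights, epsilon, n_sampled):
    """Scale each client importance p_i to an integer count."""
    return [int(w * epsilon * n_sampled) for w in weights]

# Hypothetical importances that sum to 1.
weights = np.array([0.5, 0.25, 0.25])
counts = importance_to_counts(weights, epsilon=10**5, n_sampled=2)
print(counts)       # [100000, 50000, 50000]
print(sum(counts))  # 200000, i.e. epsilon * n_sampled
```

In this reading, epsilon acts as a resolution: a larger value keeps more digits of each p_i when truncating to an integer, at the cost of larger totals.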

Author

yao9261 commented Nov 24, 2021

To begin with, thank you for your reply! That means a lot to me.
@YannFra

More details:
My running environment matches requirements.txt.
I run FL.py from PyCharm with these parameters:
dataset = "MNIST_shard"
sampling = "clustered_2"
sim_type = "cosine"
seed = 0
n_SGD = 10
lr = 0.01
decay = 1.0
p = 0.2
force = "True"
During debugging, some pop_clusters[i, 1] became a negative number.
image
Maybe a smaller epsilon can solve the problem?

But it’s okay, now I understand how it works here. I think my error may be caused by my running environment, Win10 and PyCharm.

I am trying to study how to improve FL with Non-IID data through clustering. Your article really inspired me a lot. Thank you very much for your reply again!

Contributor

YannFra commented Nov 25, 2021

I ran FL.py with the parameters you gave and the training went through.
Could you please share the error message you get?
Is the server able to perform a couple of optimization rounds before you get your error message or is get_clusters_with_alg_2 unable to get clusters from the beginning of FL?

We have not been able to isolate your problem yet but epsilon should not be related to it.
Please let me know if there are any new developments.

Thank you for your positive feedback on this work. Feel free to contact me by e-mail (included in the paper) if you want to discuss the theoretical aspect of this work.

Author

yao9261 commented Nov 25, 2021

Overflow will not cause an error. It is just a warning.

Warning message:
Labs-Federated-Learning-clustered_sampling\py_func\clustering.py:165: RuntimeWarning: overflow encountered in long_scalars
  pop_clusters[i, 1] += int(weights[client] * epsilon * n_sampled)

Training still completes after the overflow occurs, but the sorting and selection of clusters become meaningless. In that case, clustered sampling no longer accelerates convergence.

When epsilon = 10^10, overflow occurs from the first round.
When epsilon = 10^5, overflow will not occur and accuracy increases faster.
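
These two observations are consistent with a simple bound. Assuming the importances sum to 1, every cluster total is at most epsilon * n_sampled, so keeping that product within the integer range is sufficient (though not necessary) to avoid overflow. The sketch below computes the resulting limit; the helper name max_safe_epsilon and the value n_sampled = 10 are hypothetical, and the 32-bit dtype is an assumption the thread does not confirm.

```python
import numpy as np

def max_safe_epsilon(n_sampled, dtype):
    """Largest epsilon for which epsilon * n_sampled still fits in dtype."""
    return int(np.iinfo(dtype).max) // n_sampled

n_sampled = 10  # hypothetical value, not stated in the thread
limit32 = max_safe_epsilon(n_sampled, np.int32)
limit64 = max_safe_epsilon(n_sampled, np.int64)
print(limit32)  # 214748364
print(limit64)  # 922337203685477580
```

Under these assumptions, epsilon = 10^5 is far below the 32-bit limit, while epsilon = 10^10 exceeds it but stays well within the 64-bit one.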

Python 3 int should not overflow, so I am also very surprised that this does.
But the debugging results show that an overflow did cause the cluster sorting to fail and convergence to slow down.
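
One possible explanation, which the thread does not confirm: a Python int is unbounded, but once the value is added into a NumPy array it takes the array's fixed-width dtype. NumPy versions before 2.0 use a 32-bit default integer on Windows and a 64-bit one on most Linux and macOS builds, which could explain why the warning appears on Win10 but was not reproduced by the maintainer. Below is a minimal sketch of one way to sidestep the platform default by accumulating in Python ints and storing into an explicit int64 array. The function name cluster_populations and the inputs are hypothetical, and this is not the repository's implementation.

```python
import numpy as np

def cluster_populations(clusters, weights, epsilon, n_sampled):
    """Per-cluster integer totals, with cluster labels starting at 1."""
    n_clusters = int(clusters.max())
    totals = [0] * n_clusters  # Python ints have arbitrary precision
    for client, label in enumerate(clusters):
        totals[int(label) - 1] += int(weights[client] * epsilon * n_sampled)
    # Explicit 64-bit dtype instead of the platform-dependent default integer.
    pop_clusters = np.zeros((n_clusters, 2), dtype=np.int64)
    pop_clusters[:, 0] = np.arange(1, n_clusters + 1)
    pop_clusters[:, 1] = totals
    return pop_clusters

# Hypothetical inputs: four equally weighted clients in two clusters.
clusters = np.array([1, 1, 1, 2])
weights = np.array([0.25, 0.25, 0.25, 0.25])
pop = cluster_populations(clusters, weights, epsilon=10**10, n_sampled=10)
print(pop.tolist())  # [[1, 75000000000], [2, 25000000000]]
```

Both totals exceed the 32-bit range yet stay positive, so sorting the clusters by population remains meaningful even with epsilon = 10^10.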
