Problem trying to run synchrad across multiple nodes #30
Comments
Hi Alberto @delaossa Thanks for your interest in the code! Let me better understand the problem. From what you report, it follows that it runs with MPI on 4 GPUs -- is that on a local machine, or via the same slurm submission but on 1 node instead of 2? In your input script, are you setting …? Currently the MPI part in … Can you provide a bit more details from the error and output logs? I'm also a bit curious about your case -- … |
Hi Igor, Thank you for the fast response! So, I have first tried without …:
stdout: …
stderr: …
with the 4 GPUs running at >99%. Then, I have tried with 2 nodes, 8 GPUs and, while the previous error is gone, there is something else:
stdout: …
stderr: …
and only one GPU does the job... About my particular study: … |
Hello! stdout: …
There are no error messages whatsoever in stderr, but only the GPUs on the first node are used. |
Well, well, it's working great now with this: #28
Thanks, Angel Ferran Pousa, for spotting this detail. stdout: …
|
Great that you've figured that out, Alberto @delaossa ! Andrei @berceanu , we should catch up and discuss completion of #28. There are a few things to fix (interactive start and CPU handling), and let's merge it ASAP. Ping me on Slack when you have time. |
Thanks! Thanks for the offer, Igor: I'll try to catch you on Slack these days so we can discuss this calculation. |
Hello! I would like to follow up on this issue with an update. Last time I reported that …
However, something that I didn't notice then became apparent when I increased the number of particles: … |
Hi Alberto @delaossa Thanks for reporting this -- it's indeed unexpected. I assume you speak of CPU RAM, not the GPU memory? GPU memory consumption should be modest in any case, as it sends tracks one by one, so each card only needs to hold the field grid. So the first question is: are you loading particles into synchrad via an h5 track file (e.g. created by …)? If it's the file method, it's curious, as it should only read the particles assigned to the local process: Lines 212 to 216 in a128c41
If you are giving it a list, it might be a bit confusing, since it'll take a piece of the list for processing but will still need the whole list allocated in each process. This list-input way is not really made for MPI scaling, I guess, but it can probably be improved too. Could you also append an error message for the case which couldn't run? Thanks! |
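The file-vs-list distinction above can be pictured with a small sketch. This is not synchrad's actual code (the real slicing lives in the lines linked above); it just shows the common round-robin way of splitting N tracks across MPI ranks, and why a list input still costs full memory on every rank:

```python
# Hypothetical sketch of per-rank track assignment -- NOT synchrad's
# actual implementation, just the common round-robin pattern.

def local_track_indices(n_tracks, rank, size):
    """Return the track indices handled by process `rank` out of `size`."""
    return list(range(rank, n_tracks, size))

# With a file input, each rank can read only these indices from disk.
# With a list input, every rank still holds the full list in RAM even
# though it processes only its own slice -- which is why memory grows
# with the number of processes per node.
if __name__ == "__main__":
    n_tracks, size = 10, 4
    parts = [local_track_indices(n_tracks, r, size) for r in range(size)]
    print(parts)  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

Every track index lands on exactly one rank, and the load differs by at most one track between ranks.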
Hi Igor! |
OK, in this case I'd suggest to make a file and use it as an input. The file configuration is not really documented, but basically it has two main groups: … There are some more parameters in synchrad/synchrad/converters.py Lines 102 to 127 in a128c41
I think you may skip … |
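As a rough illustration of writing such a track file with h5py -- note that every group and dataset name below (`tracks/<i>/x` etc., `misc`) is an assumption for the sketch, not the documented synchrad format, so check `converters.py` for the real field names:

```python
import numpy as np
import h5py

# Hypothetical track-file writer. All group/dataset names are guesses
# for illustration only; consult synchrad's converters.py for the
# actual layout and required parameters.
def write_tracks(fname, tracks, dt):
    with h5py.File(fname, "w") as f:
        for i, tr in enumerate(tracks):
            g = f.create_group(f"tracks/{i}")
            for key in ("x", "y", "z", "ux", "uy", "uz", "w"):
                g[key] = np.asarray(tr[key])
        f["misc/cdt"] = dt              # assumed name for the step size
        f["misc/N_particles"] = len(tracks)

if __name__ == "__main__":
    steps = 16
    track = {k: np.zeros(steps) for k in ("x", "y", "z", "ux", "uy", "uz", "w")}
    write_tracks("tracks_demo.h5", [track, track], dt=0.05)
```

One group per track keeps the per-rank read pattern simple: each process can open the file and fetch only the track groups assigned to it.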
Hello @hightower8083 and all,
I have been using Synchrad recently to calculate coherent radiation of a beam through an undulator.
Thank you for the code!
For my study, it has become clear that I need more macro-particles for the results to converge, but the simulation already takes about 25 hours on a node with 4 GPUs (A100).
It'd be great to be able to run across multiple nodes to use more GPUs and save some time.
However, I failed on my first try and I am not sure why.
In the submission script, I simply increased the number of requested nodes and adjusted the number of MPI processes to use.
This is an example with 2 nodes:
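(The script itself is not shown here; as a hedged sketch only, a SLURM submission along these lines would request 2 nodes -- the partition, GPU count, and script name are placeholders, not the actual setup used:)

```shell
#!/bin/bash
#SBATCH --nodes=2                 # increased from 1 to 2
#SBATCH --ntasks-per-node=4       # one MPI process per GPU
#SBATCH --gpus-per-node=4         # placeholder: 4x A100 per node
#SBATCH --time=24:00:00           # placeholder walltime

# Launch one MPI process per GPU across both nodes (8 total).
srun python run_synchrad.py       # placeholder script name
```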
The error message is not really helpful to me: …
Am I forgetting anything?
Thank you for your help!