
Problem trying to run synchrad across multiple nodes #30

Open
delaossa opened this issue Feb 2, 2024 · 10 comments

delaossa commented Feb 2, 2024

Hello @hightower8083 and all,

I have been using Synchrad recently to calculate coherent radiation of a beam through an undulator.
Thank you for the code!

For my study, it has become clear that I need more macro-particles to reach convergence of the results, but the simulation already takes about 25 hours on a node with 4 A100 GPUs.
It'd be great to be able to run across multiple nodes to use more GPUs and save some time.
However, I failed on my first try and I am not sure why.

In the submission script, I simply increased the number of requested nodes and adjusted the number of MPI processes to use.
This is an example with 2 nodes:

#!/bin/bash -x
#SBATCH --job-name=synchrad
#SBATCH --partition=mpa
#SBATCH --nodes=2
#SBATCH --constraint="GPUx4&A100"
#SBATCH --time=48:00:00
#SBATCH --output=stdout
#SBATCH --error=stderr
#SBATCH --mail-type=END

# Activate environment
source $HOME/synchrad_env.sh
export PYOPENCL_CTX=':'

mpirun -n 8 python undulator_beam.py

The error message is not really helpful to me:

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 140773 RUNNING AT max-mpag009
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

Am I forgetting anything?

Thank you for your help!

@hightower8083 (Owner) commented

Hi Alberto @delaossa

Thanx for your interest in the code!

Let me better understand the problem. From what you report, it follows that it runs with MPI on 4 GPUs -- is that on a local machine, or via the same SLURM submission but on 1 node instead of 2? In your input script, are you setting the 'ctx' argument to 'mpi'? And why do you need PYOPENCL_CTX=':'?

Currently the MPI part in synchrad is not very developed and we need to rework it (there is #28, but we didn't get it finished yet), so I'm not sure it works out of the box in the multi-node case. It really depends on how SLURM exposes the OpenCL platform on the cluster -- it could work if all GPUs appear in the same platform, but if each node is a separate platform with 4 devices we need to do some nesting.

Can you provide a bit more detail from the error and output logs?

I'm also a bit curious about your case -- 25 h x 4 x A100 seems big even for a coherent case. Physically, for coherent calculations you need one macro-particle per electron, and for real beams this is typically too much, so for coherent calculations I usually look at the features qualitatively and only take as many particles as needed to get low shot noise.


delaossa commented Feb 2, 2024

Hi Igor,

Thank you for the fast response!
Yes, I run on one node with 4 GPUs without problems.
I thought that I needed to set PYOPENCL_CTX=':' to get all the GPUs running and avoid being asked which GPU to select.
But now I understand that this is unnecessary if one uses mpirun and sets the 'ctx' argument to 'mpi'.

So, I have first tried without PYOPENCL_CTX=':' in the 1-node, 4-GPU case, and it works as well as before:

stdout

Running on 4 devices
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB

stderr

mpirun -n 4 python undulator_beam.py
100%|██████████| 250/250 [00:46<00:00,  5.33it/s]
100%|██████████| 250/250 [00:46<00:00,  5.35it/s]

with the 4 GPUs running at >99%.

Then, I have tried with 2 nodes and 8 GPUs; the previous error is gone, but there is something else:

stdout

Running on 8 devices
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  Starting without device:
  Starting without device:
  Starting without device:
Platform: NVIDIA Corporation
Compiler: OpenCL C 1.2
Separate it_range for each track will be used

stderr

mpirun -n 8 python undulator_beam.py
Traceback (most recent call last):
  File "undulator_beam.py", line 230, in <module>
    main()
  File "undulator_beam.py", line 217, in main
    calc.calculate_spectrum(particleTracks=tracks, timeStep=ct.c * dt,
  File "/gpfs/dust/maxwell/user/delaossa/software/synchrad/synchrad/calc.py", line 176, in calculate_spectrum
    self._init_raditaion(comp, nSnaps)
  File "/gpfs/dust/maxwell/user/delaossa/software/synchrad/synchrad/calc.py", line 478, in _init_raditaion
    self.Data['FormFactor'] = arrcl.to_device( self.queue,
AttributeError: 'SynchRad' object has no attribute 'queue'
Traceback (most recent call last):
  File "undulator_beam.py", line 230, in <module>
    main()
  File "undulator_beam.py", line 217, in main
    calc.calculate_spectrum(particleTracks=tracks, timeStep=ct.c * dt,
  File "/gpfs/dust/maxwell/user/delaossa/software/synchrad/synchrad/calc.py", line 176, in calculate_spectrum
    self._init_raditaion(comp, nSnaps)
  File "/gpfs/dust/maxwell/user/delaossa/software/synchrad/synchrad/calc.py", line 478, in _init_raditaion
    self.Data['FormFactor'] = arrcl.to_device( self.queue,
AttributeError: 'SynchRad' object has no attribute 'queue'
Traceback (most recent call last):
  File "undulator_beam.py", line 230, in <module>
    main()
  File "undulator_beam.py", line 217, in main
    calc.calculate_spectrum(particleTracks=tracks, timeStep=ct.c * dt,
  File "/gpfs/dust/maxwell/user/delaossa/software/synchrad/synchrad/calc.py", line 176, in calculate_spectrum
    self._init_raditaion(comp, nSnaps)
  File "/gpfs/dust/maxwell/user/delaossa/software/synchrad/synchrad/calc.py", line 478, in _init_raditaion
    self.Data['FormFactor'] = arrcl.to_device( self.queue,
AttributeError: 'SynchRad' object has no attribute 'queue'
100%|██████████| 125/125 [01:09<00:00,  1.79it/s]

and only one GPU does the job...

About my particular study:
I get the trajectories of 1e6 particles through a 50-period undulator with 64 steps per oscillation.


delaossa commented Feb 2, 2024

Hello!
I have tried @berceanu's branch #28 and the situation improves:

stdout

Running on 8 devices
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
Platform: NVIDIA Corporation
Compiler: OpenCL C 1.2 
Separate it_range for each track will be used
Spectrum is saved to spec_data/spectrum_incoh.h5
Separate it_range for each track will be used
Creating context with args: {'answers': [0, 0]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Creating context with args: {'answers': [0, 1]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Creating context with args: {'answers': [0, 5]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Creating context with args: {'answers': [0, 6]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Creating context with args: {'answers': [0, 2]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Creating context with args: {'answers': [0, 3]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Creating context with args: {'answers': [0, 4]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Creating context with args: {'answers': [0, 7]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Spectrum is saved to spec_data/spectrum_coh.h5

No error messages whatsoever in stderr:

mpirun -n 8 python undulator_beam.py
100%|██████████| 125/125 [01:09<00:00,  1.81it/s]
100%|██████████| 125/125 [01:08<00:00,  1.82it/s]

but only the GPUs on the first node are used.
In stdout I see 8 lines like Creating context with args: {'answers': [0, 7]}, with the first index always 0 and the second going from 0 to 7.
I don't know how this matches your expectations, but it seems to me that 8 MPI processes are created and they use only the 4 GPUs of the first node.


delaossa commented Feb 2, 2024

Well, well, it's working great now with #28!
I just needed to add -ppn 4 to mpirun so it is clear that there are 4 processes per node.

mpirun -n 8 -ppn 4 python undulator_beam.py

Thanks Angel Ferran Pousa for spotting this detail.
And thank you Igor @hightower8083 for the code and the support.
I'd love to talk with you about the particular study that I am dealing with.

stdout

Creating context with args: {'answers': [0, 0]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Running on 8 devices
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
Platform: NVIDIA Corporation
Compiler: OpenCL C 1.2
Separate it_range for each track will be used
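
For reference, the same pinning of four ranks per node can also be requested directly in the batch script through standard SLURM directives instead of mpirun's -ppn (a sketch only; whether srun or mpirun is the right launcher here depends on how MPI is set up on the cluster):

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4      # one MPI rank per GPU

srun python undulator_beam.py    # srun inherits the 4-tasks-per-node layout from SLURM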

@hightower8083 (Owner) commented

Great that you've figured that out, Alberto @delaossa!
Maybe the number of processes per node can also be fixed globally in the partition settings, so it will always work correctly.
I am not familiar with the flags -btype flattop --hghg -- are these also necessary for correct operation?
I'd be interested to see if this big calculation works out, and I can share tips on using the code if necessary =) Ping me on the fbpic Slack and we can chat there.

Andrei @berceanu, we should catch up and discuss completing #28. There are a few things to fix (interactive start and CPU handling), so let's merge it asap. Ping me on Slack when you have time.


delaossa commented Feb 5, 2024

Thanks!
The -btype flattop --hghg flags are arguments for undulator_beam.py.
I will delete them from above to avoid confusion.

Thanks for the offer, Igor: I'll try to catch you on Slack these days so we can discuss this calculation.


delaossa commented Mar 6, 2024

Hello! I would like to follow up on this issue with an update.

Last time I reported that synchrad ran well across multiple nodes (on the DESY Maxwell cluster) when using the -ppn flag, e.g.

mpirun -n 8 -ppn 4 python undulator_beam.py

However, something that I didn't notice then became apparent when I increased the number of particles:
the total memory allocation scales with the number of processes.
So, although the processing time is reduced by this factor (the number of processes), the memory allocation increases by the same factor, which makes it easy to run out of memory for a high number of particles.
For example, the simulation that I was running couldn't go up to 2M particles.
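
As a rough back-of-the-envelope check, using the numbers quoted earlier in this thread (2M particles, a 50-period undulator, 64 steps per oscillation) and assuming six float64 phase-space arrays per track (the actual dtype and layout in undulator_beam.py may differ), a single copy of the full track list is already of order hundreds of GB:

n_particles = 2_000_000              # target number of macro-particles
n_steps = 50 * 64                    # 50 undulator periods x 64 steps per oscillation
bytes_per_track = 6 * n_steps * 8    # x, y, z, ux, uy, uz stored as float64
total_gb = n_particles * bytes_per_track / 1e9
print(f"~{total_gb:.0f} GB per full copy of the track list")
# With list input, every MPI rank holds such a copy, so the per-node RAM demand
# is this number times the ranks per node (4 in the runs above).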

@hightower8083 (Owner) commented

Hi Alberto @delaossa

Thanks for reporting this -- it's indeed unexpected. I assume you mean CPU RAM, not GPU memory? GPU memory consumption should be modest in any case, as the tracks are sent one by one, so each card only needs to hold the field grid.

So the first question is: are you loading particles into synchrad via an h5 track file (e.g. created by tracksFromOPMD), or do you give them as a list?

If it's the file method, that's curious, as it should only read the particles assigned to the local process:

synchrad/synchrad/calc.py

Lines 212 to 216 in a128c41

part_ind = np.arange(Np)[self.rank::self.size]
for ip in part_ind:
    track = [f_tracks[f"tracks/{ip:d}/{cmp}"][()] for cmp in cmps]
    particleTracks.append(track)

If you are giving it a list, it might be a bit confusing: each process takes a piece of the list for processing, but the whole list still needs to be allocated in every process. This list-input way is not really made for MPI scaling, I guess, but it can probably be improved too.
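
To make the contrast concrete, here is a standalone sketch (plain numpy, with example numbers from this thread) of the round-robin assignment used in the snippet above: with file input each rank only reads its own slice of roughly Np/size tracks, whereas with list input each rank still holds all Np tracks in memory before selecting its slice.

import numpy as np

Np, size = 1_000_000, 8                   # total tracks, number of MPI ranks
for rank in range(size):
    part_ind = np.arange(Np)[rank::size]  # same round-robin slice as in calc.py above
    print(rank, part_ind.size)            # each rank is assigned ~Np/size tracks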

Could you also attach the error message for the case that couldn't run?

Thanx!


delaossa commented Mar 6, 2024

Hi Igor!
As you guessed, I pass the tracks as a list to Synchrad.
And yes, it is the CPU RAM that goes over the top.
Thank you!

@hightower8083 (Owner) commented

OK, in this case I'd suggest making a file and passing it as the input via file_tracks=.

The file format is not really documented, but basically it has two main groups, tracks and misc. The tracks group has a subgroup for each particle, each holding the standard set of coordinates, i.e. tracks/particle_number/record, where particle_number is an integer and record is one of x, y, z, ux, uy, uz, w and it_start. The coordinates are 1D arrays, w is a float giving the number of physical electrons, and it_start can be set to 0 if all tracks have the same time sampling.

There are a few more parameters in the misc group, and I suggest you check how this is organized in one of the converters synchrad has, e.g. here:

f[f'tracks/{i_tr:d}/x'] = x
f[f'tracks/{i_tr:d}/y'] = y
if z_is_xi:
    f[f'tracks/{i_tr:d}/z'] = z + c * t
else:
    f[f'tracks/{i_tr:d}/z'] = z
f[f'tracks/{i_tr:d}/ux'] = ux
f[f'tracks/{i_tr:d}/uy'] = uy
f[f'tracks/{i_tr:d}/uz'] = uz
f[f'tracks/{i_tr:d}/w'] = w
f[f'tracks/{i_tr:d}/it_start'] = it_start
it_end_local = it_start + nsteps
if it_end_global < it_end_local:
    it_end_global = it_end_local
if it_start_global > it_start:
    it_start_global = it_start
i_tr += 1
f['misc/cdt'] = cdt
f['misc/cdt_array'] = cdt_array
f['misc/N_particles'] = i_tr
f['misc/it_range'] = np.array([it_start_global, it_end_global])
f['misc/propagation_direction'] = 'z'

I think you may skip the cdt_array, it_range and propagation_direction keys, as they are currently not used.
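
For completeness, a minimal sketch of writing such a file with h5py, assuming all tracks share the same constant time step and that tracks_list is the same list of (x, y, z, ux, uy, uz, w) arrays that was previously passed in directly; the key names follow the converter excerpt above, and the resulting file would then be given to synchrad via the file_tracks= input mentioned earlier:

import h5py

def write_track_file(filename, tracks_list, cdt):
    # Layout: tracks/<particle_number>/<record> plus a misc group, as described above.
    with h5py.File(filename, 'w') as f:
        for i_tr, (x, y, z, ux, uy, uz, w) in enumerate(tracks_list):
            f[f'tracks/{i_tr:d}/x'] = x
            f[f'tracks/{i_tr:d}/y'] = y
            f[f'tracks/{i_tr:d}/z'] = z
            f[f'tracks/{i_tr:d}/ux'] = ux
            f[f'tracks/{i_tr:d}/uy'] = uy
            f[f'tracks/{i_tr:d}/uz'] = uz
            f[f'tracks/{i_tr:d}/w'] = w
            f[f'tracks/{i_tr:d}/it_start'] = 0   # same time sampling for all tracks
        f['misc/cdt'] = cdt                      # constant time step (same units as used for timeStep)
        f['misc/N_particles'] = len(tracks_list)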
