
Problem trying to run synchrad across multiple nodes #30

Open
delaossa opened this issue Feb 2, 2024 · 10 comments

delaossa commented Feb 2, 2024

Hello @hightower8083 and all,

I have been using Synchrad recently to calculate coherent radiation of a beam through an undulator.
Thank you for the code!

For my study, it has become clear that I need more macro-particles to reach convergence of the results, but the simulation already takes about 25 hours on a node with 4 A100 GPUs.
It'd be great to be able to run across multiple nodes to use more GPUs and save some time.
However, I failed on my first try and I am not sure why.

In the submission script, I simply increased the number of requested nodes and adjusted the number of MPI processes to use.
This is an example with 2 nodes:

#!/bin/bash -x
#SBATCH --job-name=synchrad
#SBATCH --partition=mpa
#SBATCH --nodes=2
#SBATCH --constraint="GPUx4&A100"
#SBATCH --time=48:00:00
#SBATCH --output=stdout
#SBATCH --error=stderr
#SBATCH --mail-type=END

# Activate environment
source $HOME/synchrad_env.sh
export PYOPENCL_CTX=':'

mpirun -n 8 python undulator_beam.py

The error message is not really helpful to me:

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 140773 RUNNING AT max-mpag009
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

Am I forgetting anything?

Thank you for your help!

@hightower8083 (Owner) commented

Hi Alberto @delaossa

Thanx for your interest in the code!

Let me better understand the problem. From what you report, it follows that it runs with MPI on 4 GPUs -- is that on a local machine, or via the same SLURM submission but on 1 node instead of 2? In your input script, are you setting the 'ctx' argument to 'mpi'? And why do you need PYOPENCL_CTX=':'?

Currently the MPI part in synchrad is not very developed and we need to rework it (there is #28, but we didn't get it finished yet), so I'm not sure it works out of the box in the multi-node case. It really depends on how SLURM exposes the OpenCL platform on the cluster -- it could work if all GPUs appear in the same platform, but if each node is a separate platform with 4 devices we need to do some nesting.

Can you provide a bit more detail from the error and output logs?

I'm also a bit curious about your case -- 25 h x 4 x A100 seems big even for a coherent case. Physically, for coherent calculations you need one macro-particle per electron, and for real beams this is typically too much, so for coherent calculations I usually look at the features qualitatively and only take as many particles as needed to get low shot noise.


delaossa commented Feb 2, 2024

Hi Igor,

Thank you for the fast response!
Yes, I run on one node with 4 GPUs without problems.
I thought that I needed to set PYOPENCL_CTX=':' to get all the GPUs running and avoid being asked which GPU to select.
But now I understand that this is unnecessary if one uses mpirun and sets the 'ctx' argument to 'mpi'.

So, I have first tried without PYOPENCL_CTX=':' in the 1-node, 4-GPU case, and it works as well as before:

stdout

Running on 4 devices
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB

stderr

mpirun -n 4 python undulator_beam.py
100%|██████████| 250/250 [00:46<00:00,  5.33it/s]
100%|██████████| 250/250 [00:46<00:00,  5.35it/s]

with the 4 GPUs running at >99%.

Then, I have tried with 2 nodes and 8 GPUs; the previous error is gone, but there is something else:

stdout

Running on 8 devices
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  Starting without device:
  Starting without device:
  Starting without device:
Platform: NVIDIA Corporation
Compiler: OpenCL C 1.2
Separate it_range for each track will be used

stderr

mpirun -n 8 python undulator_beam.py
Traceback (most recent call last):
  File "undulator_beam.py", line 230, in <module>
    main()
  File "undulator_beam.py", line 217, in main
    calc.calculate_spectrum(particleTracks=tracks, timeStep=ct.c * dt,
  File "/gpfs/dust/maxwell/user/delaossa/software/synchrad/synchrad/calc.py", line 176, in calculate_spectrum
    self._init_raditaion(comp, nSnaps)
  File "/gpfs/dust/maxwell/user/delaossa/software/synchrad/synchrad/calc.py", line 478, in _init_raditaion
    self.Data['FormFactor'] = arrcl.to_device( self.queue,
AttributeError: 'SynchRad' object has no attribute 'queue'
Traceback (most recent call last):
  File "undulator_beam.py", line 230, in <module>
    main()
  File "undulator_beam.py", line 217, in main
    calc.calculate_spectrum(particleTracks=tracks, timeStep=ct.c * dt,
  File "/gpfs/dust/maxwell/user/delaossa/software/synchrad/synchrad/calc.py", line 176, in calculate_spectrum
    self._init_raditaion(comp, nSnaps)
  File "/gpfs/dust/maxwell/user/delaossa/software/synchrad/synchrad/calc.py", line 478, in _init_raditaion
    self.Data['FormFactor'] = arrcl.to_device( self.queue,
AttributeError: 'SynchRad' object has no attribute 'queue'
Traceback (most recent call last):
  File "undulator_beam.py", line 230, in <module>
    main()
  File "undulator_beam.py", line 217, in main
    calc.calculate_spectrum(particleTracks=tracks, timeStep=ct.c * dt,
  File "/gpfs/dust/maxwell/user/delaossa/software/synchrad/synchrad/calc.py", line 176, in calculate_spectrum
    self._init_raditaion(comp, nSnaps)
  File "/gpfs/dust/maxwell/user/delaossa/software/synchrad/synchrad/calc.py", line 478, in _init_raditaion
    self.Data['FormFactor'] = arrcl.to_device( self.queue,
AttributeError: 'SynchRad' object has no attribute 'queue'
100%|██████████| 125/125 [01:09<00:00,  1.79it/s]

and only one GPU does the job...

About my particular study:
I get the trajectories of 1e6 particles through a 50-period undulator with 64 steps per oscillation.


delaossa commented Feb 2, 2024

Hello!
I have tried @berceanu's branch #28 and the situation improves:

stdout

Running on 8 devices
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
Platform: NVIDIA Corporation
Compiler: OpenCL C 1.2 
Separate it_range for each track will be used
Spectrum is saved to spec_data/spectrum_incoh.h5
Separate it_range for each track will be used
Creating context with args: {'answers': [0, 0]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Creating context with args: {'answers': [0, 1]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Creating context with args: {'answers': [0, 5]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Creating context with args: {'answers': [0, 6]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Creating context with args: {'answers': [0, 2]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Creating context with args: {'answers': [0, 3]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Creating context with args: {'answers': [0, 4]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Creating context with args: {'answers': [0, 7]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Spectrum is saved to spec_data/spectrum_coh.h5

No error messages whatsoever in stderr:

mpirun -n 8 python undulator_beam.py
100%|██████████| 125/125 [01:09<00:00,  1.81it/s]
100%|██████████| 125/125 [01:08<00:00,  1.82it/s]

but only the GPUs on the first node are used.
In stdout I see 8 lines like Creating context with args: {'answers': [0, 7]}, with the first index always 0 and the second going from 0 to 7.
I don't know how this matches your expectations, but it seems to me that 8 MPI processes are created and they use only the 4 GPUs of the first node.


delaossa commented Feb 2, 2024

Well, well, it's working great now with #28!
I just needed to add -ppn 4 to mpirun so it is clear that there are 4 processes per node.

mpirun -n 8 -ppn 4 python undulator_beam.py

Thanks Angel Ferran Pousa for spotting this detail.
And thank you Igor @hightower8083 for the code and the support.
I'd love to talk with you about the particular study that I am dealing with.

stdout

Creating context with args: {'answers': [0, 0]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Running on 8 devices
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
Platform: NVIDIA Corporation
Compiler: OpenCL C 1.2
Separate it_range for each track will be used
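
For reference, the same pinning of four ranks per node can also be requested directly in the batch script through standard SLURM directives instead of mpirun's -ppn (a sketch only; whether srun or mpirun is the right launcher here depends on how MPI is set up on the cluster):

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4      # one MPI rank per GPU

srun python undulator_beam.py    # srun inherits the 4-tasks-per-node layout from SLURM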

@hightower8083 (Owner) commented

Great that you've figured that out, Alberto @delaossa!
Maybe the number of processes per node can also be fixed globally in the partition settings, so it will always work correctly.
I am not familiar with the flags -btype flattop --hghg -- are these also necessary for correct operation?
I'd be interested to see if this big calculation works out, and I can share tips on using the code if necessary =) Ping me on the fbpic Slack and we can chat there.

Andrei @berceanu, we should catch up and discuss completing #28. There are a few things to fix (interactive start and CPU handling), so let's merge it asap. Ping me on Slack when you have time.


delaossa commented Feb 5, 2024

Thanks!
The -btype flattop --hghg flags are arguments for undulator_beam.py.
I will delete them from above to avoid confusion.

Thanks for the offer, Igor: I'll try to catch you on Slack these days so we can discuss this calculation.


delaossa commented Mar 6, 2024

Hello! I would like to follow up on this issue with an update.

Last time I reported that synchrad ran well across multiple nodes (on the DESY Maxwell cluster) when using the -ppn flag, e.g.

mpirun -n 8 -ppn 4 python undulator_beam.py

However, something that I didn't notice then became apparent when I increased the number of particles:
the total memory allocation scales with the number of processes.
So, although the processing time is reduced by this factor (the number of processes), the memory allocation increases by the same factor, which makes it easy to run out of memory for a high number of particles.
For example, the simulation that I was running couldn't go up to 2M particles.
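
As a rough back-of-the-envelope check, using the numbers quoted earlier in this thread (2M particles, a 50-period undulator, 64 steps per oscillation) and assuming six float64 phase-space arrays per track (the actual dtype and layout in undulator_beam.py may differ), a single copy of the full track list is already of order hundreds of GB:

n_particles = 2_000_000              # target number of macro-particles
n_steps = 50 * 64                    # 50 undulator periods x 64 steps per oscillation
bytes_per_track = 6 * n_steps * 8    # x, y, z, ux, uy, uz stored as float64
total_gb = n_particles * bytes_per_track / 1e9
print(f"~{total_gb:.0f} GB per full copy of the track list")
# With list input, every MPI rank holds such a copy, so the per-node RAM demand
# is this number times the ranks per node (4 in the runs above).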

@hightower8083 (Owner) commented

Hi Alberto @delaossa

Thanks for reporting this -- it's indeed unexpected. I assume you mean CPU RAM, not GPU memory? GPU memory consumption should be modest in any case, as the tracks are sent one by one, so each card only needs to hold the field grid.

So the first question is: are you loading particles into synchrad via an h5 track file (e.g. created by tracksFromOPMD), or do you give them as a list?

If it's the file method, that's curious, as it should only read the particles assigned to the local process:

synchrad/synchrad/calc.py

Lines 212 to 216 in a128c41

part_ind = np.arange(Np)[self.rank::self.size]
for ip in part_ind:
    track = [f_tracks[f"tracks/{ip:d}/{cmp}"][()] for cmp in cmps]
    particleTracks.append(track)

If you are giving it a list, it might be a bit confusing: each process takes a piece of the list for processing, but the whole list still needs to be allocated in every process. This list-input way is not really made for MPI scaling, I guess, but it can probably be improved too.
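
To make the contrast concrete, here is a standalone sketch (plain numpy, with example numbers from this thread) of the round-robin assignment used in the snippet above: with file input each rank only reads its own slice of roughly Np/size tracks, whereas with list input each rank still holds all Np tracks in memory before selecting its slice.

import numpy as np

Np, size = 1_000_000, 8                   # total tracks, number of MPI ranks
for rank in range(size):
    part_ind = np.arange(Np)[rank::size]  # same round-robin slice as in calc.py above
    print(rank, part_ind.size)            # each rank is assigned ~Np/size tracks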

Could you also attach the error message for the case that couldn't run?

Thanx!


delaossa commented Mar 6, 2024

Hi Igor!
As you guessed, I pass the tracks as a list to Synchrad.
And yes, it is the CPU RAM that goes over the top.
Thank you!

@hightower8083 (Owner) commented

OK, in this case I'd suggest making a file and passing it as the input via file_tracks=.

The file format is not really documented, but basically it has two main groups, tracks and misc. The tracks group has a subgroup for each particle, each holding the standard set of coordinates, i.e. tracks/particle_number/record, where particle_number is an integer and record is one of x, y, z, ux, uy, uz, w and it_start. The coordinates are 1D arrays, w is a float giving the number of physical electrons, and it_start can be set to 0 if all tracks have the same time sampling.

There are a few more parameters in the misc group, and I suggest you check how this is organized in one of the converters synchrad has, e.g. here:

f[f'tracks/{i_tr:d}/x'] = x
f[f'tracks/{i_tr:d}/y'] = y
if z_is_xi:
    f[f'tracks/{i_tr:d}/z'] = z + c * t
else:
    f[f'tracks/{i_tr:d}/z'] = z
f[f'tracks/{i_tr:d}/ux'] = ux
f[f'tracks/{i_tr:d}/uy'] = uy
f[f'tracks/{i_tr:d}/uz'] = uz
f[f'tracks/{i_tr:d}/w'] = w
f[f'tracks/{i_tr:d}/it_start'] = it_start
it_end_local = it_start + nsteps
if it_end_global < it_end_local:
    it_end_global = it_end_local
if it_start_global > it_start:
    it_start_global = it_start
i_tr += 1
f['misc/cdt'] = cdt
f['misc/cdt_array'] = cdt_array
f['misc/N_particles'] = i_tr
f['misc/it_range'] = np.array([it_start_global, it_end_global])
f['misc/propagation_direction'] = 'z'

I think you may skip the cdt_array, it_range and propagation_direction keys, as they are currently not used.
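
For completeness, a minimal sketch of writing such a file with h5py, assuming all tracks share the same constant time step and that tracks_list is the same list of (x, y, z, ux, uy, uz, w) arrays that was previously passed in directly; the key names follow the converter excerpt above, and the resulting file would then be given to synchrad via the file_tracks= input mentioned earlier:

import h5py

def write_track_file(filename, tracks_list, cdt):
    # Layout: tracks/<particle_number>/<record> plus a misc group, as described above.
    with h5py.File(filename, 'w') as f:
        for i_tr, (x, y, z, ux, uy, uz, w) in enumerate(tracks_list):
            f[f'tracks/{i_tr:d}/x'] = x
            f[f'tracks/{i_tr:d}/y'] = y
            f[f'tracks/{i_tr:d}/z'] = z
            f[f'tracks/{i_tr:d}/ux'] = ux
            f[f'tracks/{i_tr:d}/uy'] = uy
            f[f'tracks/{i_tr:d}/uz'] = uz
            f[f'tracks/{i_tr:d}/w'] = w
            f[f'tracks/{i_tr:d}/it_start'] = 0   # same time sampling for all tracks
        f['misc/cdt'] = cdt                      # constant time step (same units as used for timeStep)
        f['misc/N_particles'] = len(tracks_list)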
