
Joss paper of jaxDecomp #20

Merged
23 commits merged into main from joss-paper on Jul 18, 2024
Conversation

@ASKabalan (Collaborator)

Adding a draft of JOSS paper

@ASKabalan ASKabalan force-pushed the joss-paper branch 2 times, most recently from aa5403d to d053753 Compare July 4, 2024 17:41
@EiffL (Member) commented Jul 7, 2024

Lol, @ASKabalan ^^ so many force pushes, can we forbid the force pushes from now on?

@EiffL (Member) commented Jul 9, 2024

Thanks @ASKabalan for the draft, it's a very good start. I have some high-level comments that I will add here, and I may also leave some specific comments on the markdown file.

My main overarching comment is that this is not a jaxpm paper, it's a jaxDecomp paper. PM simulations are just one potential example of a real-world application, but not the only raison d'être of the library.

  • Motivation: Currently you open the abstract with cosmological simulations, but that is not the right level for this paper. This is a software paper for a distribution library. So I think the story should be different. You can for instance take a look at how mpi4jax structured their abstract: https://joss.theoj.org/papers/10.21105/joss.03419

I think our story here in the abstract could be the following:

  1. JAX has been a powerful tool for scientific computations, not just machine learning (cite e.g. jax-cosmo ;-) or jax-md)
  2. Until very recently general distributed computing (multinode) was not easy in JAX, which hinders the applicability of the framework for HPC tasks.
  3. Some solutions have been proposed in the past for SPMD, in particular mpi4jax. However, mpi4jax has limitations: it is not compatible with the JAX array distribution logic, and it is limited to "small" messages of less than 2 GB.
  4. Over the last year, a huge amount of progress has happened in JAX regarding its native support for SPMD through the unified jax.Array API and the merge of jit and pjit.
  5. However, not all native JAX operations have a specialized distribution strategy, so pjitting a program can currently lead to more communication than necessary for some operations. In particular, the key operation we are concerned with is the 3D FFT.
  6. To alleviate these limitations, we introduce jaxDecomp, a JAX wrapper for the cuDecomp domain decomposition library, which provides JAX primitives with highly efficient CUDA implementations for key operations needed for HPC simulation tasks, namely 3D FFTs and halo exchanges.
  7. Being implemented as JAX primitives, jaxDecomp builds directly on top of the distributed Array strategy in JAX and is compatible with JAX transformations such as jax.grad and jax.jit.
  8. Through cuDecomp, jaxDecomp provides low-level NCCL, CUDA-aware MPI, and NVSHMEM backends for distributed array transpose operations.
  • Statement of Need: Here the main point is that we should motivate why native JAX distribution might not be enough. We can say that for numerical simulations on HPC systems we want to allow for peak performance, and that performance is bottlenecked by inter-GPU communications. While it is technically possible to write, for instance, a distributed FFT in native JAX, our aim here is to go for unbeatable performance by using a highly optimized and dedicated CUDA library as the backend.
    • To do things properly, we might want to compare in the benchmark the performance of a simple distributed FFT op written in native JAX. By that I mean something like this:
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import PositionalSharding

# Create an array
x = jax.random.normal(jax.random.key(0), (32, 32, 32))

# Distribute the array over a 2x2x1 device mesh (last axis kept local)
sharding = PositionalSharding(mesh_utils.create_device_mesh((2, 2, 1)))
x = jax.device_put(x, sharding)

# Perform a 1D FFT along the last dimension, then transpose the array
x = jnp.fft.fft(x).transpose(2, 0, 1)  # [z', x, y]
x = jnp.fft.fft(x).transpose(2, 0, 1)  # [y', z', x]
x = jnp.fft.fft(x)                     # [y', z', x']

If we have such a comparison in the benchmark, we can refer to it here as a statement of need.

Then at the end of the statement, we can mention a real world application, and that's where we can talk about PM simulations for cosmology. We can in particular mention FlowPM (distributed but in TF, so useless) and pmwd (not distributed and so limited to 512 volumes).

  • Implementation:
    1. I think here we want to start by explaining how we build the wrapper around the cuDecomp operations. So, mention that we use the custom_op tool, and maybe mention something about the strategy you have built to preserve the state of cuDecomp between kernel calls.
    2. We want to explain the concept of domain decomposition, explain the pencils and slabs decompositions supported by the library, and explain how one would build a distributed domain in JAX.
    3. Once we have explained the above, we can go more into a description of the key operations, 3D FFTs and halo exchange.

You can add a couple of lines of code to illustrate the API for points 2 and 3 above.
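For point 2, a minimal sketch of building the distributed domain with the standard JAX sharding machinery could look like the following (the mesh shape and array size are placeholders, and it assumes 4 devices):

import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# 2x2 process grid: a pencil decomposition of the 3D field along its first two axes
devices = mesh_utils.create_device_mesh((2, 2))
mesh = Mesh(devices, axis_names=('x', 'y'))
sharding = NamedSharding(mesh, P('x', 'y'))  # the z axis stays local to each device

# Each device holds a (128, 128, 256) pencil of the global (256, 256, 256) array
field = jax.device_put(jnp.zeros((256, 256, 256)), sharding)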

  • Example of Application: Here you can talk about LPT simulations and link to the example script. You can give a little bit of context for what these simulations are, then illustrate with some actual code how one would compute the gravitational potential from a density field using the jaxdecomp library. Something like:
def potential(delta):
  delta_k = pfft3d(delta)                 # distributed forward 3D FFT of the density field
  kvec = ...                              # Fourier-space wavevectors along each axis
  laplace_kernel = 1 / kk                 # inverse Laplacian kernel, with kk the squared wavenumber
  potential_k = delta_k * laplace_kernel  # solve the Poisson equation in Fourier space
  return ipfft3d(potential_k)             # distributed inverse 3D FFT
  • Benchmark: Here I think we could change the legend of the plots. For "JAX", what I think you mean is "single-GPU jnp.fft.fftn"? For the other lines, it's not clear how many GPUs you have used. And as mentioned earlier, we should include a comparison to a native JAX distributed FFT built from transpose and 1D FFT operations... And cross our fingers that this doesn't beat us ^^
    We should also include a point where we reach several nodes, to see how the performance scales between nodes. It will be a function of the interconnect.
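As a side note on methodology, a minimal timing harness for these runs could be as simple as the sketch below (function and array names are placeholders; each call is synchronized with block_until_ready so we time actual execution rather than dispatch):

import time
import jax

def benchmark(fn, x, n_iter=10):
    fn(x).block_until_ready()               # compile and warm up once
    t0 = time.perf_counter()
    for _ in range(n_iter):
        fn(x).block_until_ready()           # synchronize each iteration
    return (time.perf_counter() - t0) / n_iter

# e.g. compare benchmark(pfft3d, field) against benchmark(jnp.fft.fftn, field) on one GPU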


## Distributed Halo Exchange

In a particle mesh simulation, we use the 3D FFT to estimate the force field acting on the particles. The force field is then interpolated to the particles, and the particles are moved accordingly. The particles that are close to the boundary of the local domain need to be updated using data from the neighboring domains. This is done using a halo exchange operation: we pad each slice of the simulation, then perform a halo exchange to update the particles that are close to the boundary of the local domain.
Review comment from a Member:
We shouldn't motivate the halo exchange from the PM simulation; halo exchanges are very common operations in distributed computing: https://wgropp.cs.illinois.edu/courses/cs598-s15/lectures/lecture25.pdf (first result on Google).

Review comment from a Member:

So here I think we just want to explain that cuDecomp allows for the exchange of border regions, which is a pattern necessary to handle border crossing in many types of simulations.
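For illustration only (this is not the jaxDecomp API), the border-exchange pattern itself can be sketched in native JAX with shard_map and lax.ppermute, assuming a periodic 1D slab decomposition in which each local slab already carries pre-allocated halo regions:

import jax
import jax.numpy as jnp
from jax import lax
from jax.experimental import mesh_utils
from jax.experimental.shard_map import shard_map
from jax.sharding import Mesh, PartitionSpec as P
from functools import partial

# One slab per device along the first axis of the field
n_slabs = jax.device_count()
mesh = Mesh(mesh_utils.create_device_mesh((n_slabs,)), axis_names=('x',))
halo = 2  # width of the exchanged border region

@partial(shard_map, mesh=mesh, in_specs=P('x'), out_specs=P('x'))
def halo_exchange(slab):
    # slab is the local halo-padded slice: [left halo | interior | right halo]
    to_right = [(i, (i + 1) % n_slabs) for i in range(n_slabs)]
    to_left = [(i, (i - 1) % n_slabs) for i in range(n_slabs)]
    # Each slab receives its left halo from the left neighbour's right border,
    # and its right halo from the right neighbour's left border (periodic boundaries)
    left_halo = lax.ppermute(slab[-2 * halo:-halo], 'x', to_right)
    right_halo = lax.ppermute(slab[halo:2 * halo], 'x', to_left)
    return jnp.concatenate([left_halo, slab[halo:-halo], right_halo], axis=0)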

@EiffL (Member) commented Jul 9, 2024

Taking a careful look at this figure, I think there might be a problem on the right:
[image]
Not the right colors...

But maybe we also don't want to include the reduction, because it's not clear what that means... In this figure, it looks like you are replacing the border region with the one from the neighboring slice. In general, what one does with the halo region depends on the simulation one runs. So maybe there is no need to include it in this figure.

@ASKabalan (Collaborator, Author)
What's wrong with the colors?

@EiffL (Member) commented Jul 9, 2024

lol, I tried to better highlight one of the issues:
[image]

@ASKabalan (Collaborator, Author)
Oh ..
I am going to remove the reduction step anyway.. it is confusing

@EiffL (Member) commented Jul 9, 2024

Note: I found this previous implementation of a 3D distributed FFT that I had made using xmap:
https://github.com/DifferentiableUniverseInitiative/JaxPM/blob/main/dev/test_pfft.py

@partial(xmap,
         in_axes={  0: 'x', 1: 'y' },
         out_axes=['x', 'y', ...],
         axis_resources={  'x': 'nx',  'y': 'ny' })
@jax.jit
def pfft3d(mesh):
    # [x, y, z]
    mesh = jnp.fft.fft(mesh)  # Transform on z
    mesh = lax.all_to_all(mesh, 'x', 0, 0)  # Now x is exposed, [z,y,x]
    mesh = jnp.fft.fft(mesh)  # Transform on x
    mesh = lax.all_to_all(mesh, 'y', 0, 0)  # Now y is exposed, [z,x,y]
    mesh = jnp.fft.fft(mesh)  # Transform on y
    # [z, x, y]
    return mesh

@partial(xmap,
         in_axes={  0: 'x',  1: 'y' },
         out_axes=['x', 'y', ...],
         axis_resources={  'x': 'nx',  'y': 'ny' })
@jax.jit
def pifft3d(mesh):
    # [z, x, y]
    mesh = jnp.fft.ifft(mesh)  # Transform on y
    mesh = lax.all_to_all(mesh, 'y', 0, 0)  # Now x is exposed, [z,y,x]
    mesh = jnp.fft.ifft(mesh)  # Transform on x
    mesh = lax.all_to_all(mesh, 'x', 0, 0)  # Now z is exposed, [x,y,z]
    mesh = jnp.fft.ifft(mesh)  # Transform on z
    # [x, y, z]
    return mesh

Something like this, but using shard_map, is probably what we want to benchmark jaxDecomp against.
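For reference, a rough shard_map equivalent might look like the sketch below. Treat it as a starting point for the benchmark rather than a verified implementation (the mesh shape and array size are placeholders, and it assumes 4 devices and dimensions divisible by the mesh sizes):

import jax
import jax.numpy as jnp
from jax import lax
from jax.experimental import mesh_utils
from jax.experimental.shard_map import shard_map
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P
from functools import partial

# 2x2 process mesh: pencil decomposition over the first two axes of the field
devices = mesh_utils.create_device_mesh((2, 2))
mesh = Mesh(devices, axis_names=('x', 'y'))

@jax.jit
@partial(shard_map, mesh=mesh,
         in_specs=P('x', 'y', None),
         out_specs=P('y', None, 'x'))
def pfft3d(field):
    # Local block: (Nx/Px, Ny/Py, Nz), with z fully local
    field = jnp.fft.fft(field, axis=2)                                   # FFT along z
    # Redistribute: z becomes sharded over 'x', x becomes fully local
    field = lax.all_to_all(field, 'x', split_axis=2, concat_axis=0, tiled=True)
    field = jnp.fft.fft(field, axis=0)                                   # FFT along x
    # Redistribute: x becomes sharded over 'y', y becomes fully local
    field = lax.all_to_all(field, 'y', split_axis=0, concat_axis=1, tiled=True)
    field = jnp.fft.fft(field, axis=1)                                   # FFT along y
    return field

# Usage: a 3D field sharded over the (x, y) mesh axes
field = jax.device_put(jnp.zeros((256, 256, 256), jnp.complex64),
                       NamedSharding(mesh, P('x', 'y', None)))
field_k = pfft3d(field)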

@EiffL (Member) commented Jul 14, 2024

Could you push the benchmark scripts @ASKabalan when you are back? Curious to see if we can gain a bit of performance.

@ASKabalan (Collaborator, Author) commented Jul 14, 2024

@EiffL
The benchmarks are now on GitHub:
JAX: https://github.com/ASKabalan/jaxdecomp-benchmarks/blob/main/scripts/jaxfft.py
jaxDecomp: https://github.com/ASKabalan/jaxdecomp-benchmarks/blob/main/scripts/pfft3d.py
mpi4jax: https://github.com/ASKabalan/jaxdecomp-benchmarks/blob/main/scripts/mpi4jaxfft.py

I am trying to make MPI4JAX work
Do you want me to put them in the main repo?

@ASKabalan merged commit 067ee89 into main on Jul 18, 2024 (2 checks passed).