pure-MPI implementation in a Podman container #203

moustakas · 2024-12-29T16:34:35Z

This PR implements a pure-MPI version of FastSpecFit, which does not make use of multiprocessing at all. In addition, a given production (e.g., Y3/Loa) can be run entirely out of a Podman container, which gives us full control over the input software stack (see here for details, including the instructions file).

In production, parallelism is controlled by bin/mpi-fastspecfit. The n MPI tasks in the mpi4py.MPI.COMM_WORLD communicator (as set by srun --ntasks=n) are split into int(ceil(n/mp)) sub-communicators, where mp is the desired number of ranks per sub-communicator. The rank-0 tasks of the sub-communicators parallelize over healpixels; in addition, all the ranks in a given sub-communicator parallelize over objects / targets in fastspecfit.fastspec(). In particular, fastspecfit.fastspec() uses comm.send and comm.recv (send and receive) so that each rank only sees the data it needs, which should help prevent memory overflow problems.
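To make the rank arithmetic concrete, here is a minimal sketch in plain Python (no MPI required) of how n tasks could be grouped into int(ceil(n/mp)) sub-communicators, and how healpixels and objects could then be divided. The function names, the round-robin healpixel assignment, and the contiguous object chunks are assumptions for illustration only; the actual logic in bin/mpi-fastspecfit and fastspecfit.fastspec() may differ.

```python
from math import ceil


def plan_subcomms(n, mp):
    """Return (nsub, plan), where plan[rank] = (color, key).

    color is the index of the sub-communicator and key is the rank
    within it. With mpi4py, each task could pass its own pair to
    MPI.COMM_WORLD.Split(color=color, key=key).
    Hypothetical helper, not taken from the PR.
    """
    nsub = int(ceil(n / mp))
    plan = [(rank // mp, rank % mp) for rank in range(n)]
    return nsub, plan


def assign_healpixels(healpixels, nsub):
    """Round-robin healpixels over sub-communicators (assumed scheme).

    Entry i is the work list handled by the rank-0 task of
    sub-communicator i.
    """
    return [healpixels[i::nsub] for i in range(nsub)]


def split_objects(nobj, nranks):
    """Contiguous, near-equal [start, stop) index ranges, one per rank.

    Inside a sub-communicator, rank 0 could send each slice with
    comm.send and the other ranks could pick it up with comm.recv,
    so that no rank holds more objects than it fits.
    """
    base, extra = divmod(nobj, nranks)
    bounds, start = [], 0
    for r in range(nranks):
        stop = start + base + (1 if r < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


if __name__ == "__main__":
    # Example: srun --ntasks=10 with mp=4 ranks per sub-communicator.
    nsub, plan = plan_subcomms(n=10, mp=4)
    print(nsub)                                   # 3
    print(plan[8])                                # (2, 0)
    print(assign_healpixels([100, 101, 102, 103, 104], nsub))
    print(split_objects(10, 4))
```

Note that when n is not a multiple of mp, the last sub-communicator ends up with fewer than mp ranks (two ranks in the example above).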

I'll post timing tests shortly. At the moment, there is one issue that I have not been able to track down, but perhaps others have some ideas.

moustakas added 30 commits December 23, 2024 05:20

abandon MPIPoolExecutor

086a5f1

begin developing a pure-MPI implementation

c705036

planning now MPI-parallelized

d7d22ab

mpi-fastspecfit now distributes work to subcommunicators

1a5547b

working version

827b836

conflicting logging

8500ed0

debugging

e35fe6d

move parse function to top

2e4dd2e

cleanup; better logging

0ff6a0b

begin adding podman container stuff back in

9cb7af0

Merge branch 'main' into pure-MPI

e4ba234

recursively copy the subdirectories in test/data

3323f73

updated containerfile

2e7e1f8

more README; better per-rank timing info

122efa4

install libbz2-dev to fix failing fitsio installation; see esheldon/fitsio#414

2cc298e

blarg, syntax error

f2cb440

add profile keyword to mpi-fastspecfit

67c9b72

install mkl and mkl_fft

2e15d7d

coverage tests also need libbz2

a1a7c8e

install mkl_fft at the end

0c2206c

try using fakeintel.c in container

2ac174f

oops typo [ci skip]

4662cc7

write profile files to outdir-data

cbdf46f

log profiling to stdout [ci skip]

5956400

bug in populate_emtable when nmonte=0

ab02617

always suppress astropy units nanomaggies warning

a79a63b

pass --nmonte, --seed options to mpi-fastspecfit

a86484d

try miniconda in lieu of pip

b18433f

deprecate base container; working version with python 3.12

6d90571

switch to miniforge for speed; call MPI.finalize()

8d6f2e0

moustakas added 7 commits December 28, 2024 22:04

set MKL environment variables in the container

6518b11

capture inf from float32 overflow

98c202e

update change log

311b19c

multiprocessing spawn does not work...but fork does

4fbc581

bug fix: mpi-fastspecfit --merge now works with new data model

e0d2fc6

return 0 on clean exit so all ranks clean up gracefully

afae2a0

default Containerfile excludes mkl_fft

8e508e5

moustakas merged commit 42bf09d into main Dec 31, 2024
14 of 16 checks passed

moustakas deleted the pure-MPI branch December 31, 2024 14:13

moustakas mentioned this pull request Dec 31, 2024

timing tests with and without Podman/MPI #204

Open
