distributed_hydrostatic_turbulence.jl yields NaNs #4068

Open
francispoulin opened this issue Jan 28, 2025 · 23 comments

@francispoulin
Collaborator

I have tried running all the scripts in distributed_simulations and they all failed for me. I thought I would point out the problems I have found here so we can clean them up. Let's start with this one.

distributed_hydrostatic_turbulence.jl

It starts fine but then produces NaNs, which suggests to me that the time step is too large. I reduced the cfl parameter from 0.2 to 0.1, and instead of dying at iteration 200 it died at iteration 6100. Better, but not great.

I am going to try 0.05, but is it a concern that it ran several months ago with these parameters and now it doesn't? Also, why does the cfl need to be so small? I would think that a cfl of 0.2 should be fine for pretty much any simulation.
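For reference, the cfl here enters the script through a TimeStepWizard callback; roughly this pattern (a sketch, with illustrative parameter values rather than the script's exact settings):

    wizard = TimeStepWizard(cfl=0.1, max_change=1.1)
    simulation.callbacks[:wizard] = Callback(wizard, IterationInterval(10))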

@glwagner
Member

I don't think you can actually adapt the time step right now. Try with a fixed time step.
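For example, a fixed step sized from the grid spacing (a sketch; the 0.1 prefactor assumes order-one velocities):

    Δx = minimum_xspacing(grid)
    Δt = 0.1 * Δx
    simulation = Simulation(model; Δt, stop_time=100.0)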

Also the setup looks weird to me:

    model = HydrostaticFreeSurfaceModel(; grid,
                                        momentum_advection = VectorInvariant(vorticity_scheme=WENO(order=9)),
                                        free_surface = SplitExplicitFreeSurface(grid, substeps=10),
                                        tracer_advection = WENO(),
                                        buoyancy = nothing,
                                        coriolis = FPlane(f = 1),
                                        tracers = :c)

This uses WENO for vorticity but second order for everything else, and there is no other closure. Can you modify the physics so that we have a hope the simulation will run? You want to use WENOVectorInvariant. Also, 10 substeps for the split-explicit free surface is probably too few.
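Something along these lines (a sketch; the substep count is a guess to be tuned):

    model = HydrostaticFreeSurfaceModel(; grid,
                                        momentum_advection = WENOVectorInvariant(),
                                        free_surface = SplitExplicitFreeSurface(grid, substeps=30),
                                        tracer_advection = WENO(),
                                        buoyancy = nothing,
                                        coriolis = FPlane(f = 1),
                                        tracers = :c)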

@francispoulin
Collaborator Author

Very good idea! I just noticed that it was proceeding with a time step of 1e-84 before it produced a NaN.

@glwagner
Member

Might make sense to try it in serial and make sure the setup runs before trying to distribute it.

@francispoulin
Collaborator Author

This is my serial version of the hydrostatic script. It fails for me after 100 iterations with NaN in the u field.

Maybe the suggestions that @glwagner made will fix this up and then we can go to the distributed version?

using Oceananigans
using Oceananigans.Models.HydrostaticFreeSurfaceModels: VerticalVorticityField
using Printf
using Statistics
using Oceananigans.BoundaryConditions
using Random
using JLD2
using Oceananigans.ImmersedBoundaries: ActiveCellsIBG, active_interior_map

function run_simulation(nx, ny; topology = (Periodic, Periodic, Bounded))
    grid = RectilinearGrid(CPU(); topology, size = (nx, ny, 10), extent=(4π, 4π, 0.5), halo=(8, 8, 8))
    
    bottom(x, y) = (x > π && x < 3π/2 && y > π/2 && y < 3π/2) ? 1.0 : - grid.Lz - 1.0
    grid = ImmersedBoundaryGrid(grid, GridFittedBottom(bottom); active_cells_map = true)

    model = HydrostaticFreeSurfaceModel(; grid,
                                        momentum_advection = VectorInvariant(vorticity_scheme=WENO(order=9)),
                                        free_surface = SplitExplicitFreeSurface(grid, substeps=10),
                                        tracer_advection = WENO(),
                                        buoyancy = nothing,
                                        coriolis = FPlane(f = 1),
                                        tracers = :c)

    Random.seed!(1234)

    set!(model, u = (x, y, z) -> 1-2rand(), v = (x, y, z) -> 1-2rand())
    
    mask(x, y, z) = x > 3π/2 && x < 5π/2 && y > 3π/2 && y < 5π/2
    c = model.tracers.c

    set!(c, mask)

    u, v, _ = model.velocities
    η = model.free_surface.η
    outputs = merge(model.velocities, model.tracers)

    progress(sim) = @info "Iteration: $(sim.model.clock.iteration), time: $(sim.model.clock.time), Δt: $(sim.Δt)"
    simulation = Simulation(model, Δt=0.02, stop_time=100.0)

    simulation.callbacks[:progress] = Callback(progress, IterationInterval(100))
 
    filepath = "hydrostatic_turbulence"
    simulation.output_writers[:fields] =
        JLD2OutputWriter(model, outputs, filename=filepath, schedule=TimeInterval(0.1),
                         overwrite_existing=true)

    run!(simulation)
end

Nx, Ny = 32, 32

run_simulation(Nx, Ny)

@simone-silvestri
Collaborator

Oh, this validation script is a little old and not up to date. I'll open a PR to correct it.

@francispoulin
Collaborator Author

Thank you @simone-silvestri !

@glwagner
Member

For reference the script is here: https://github.com/CliMA/Oceananigans.jl/blob/main/validation/distributed_simulations/distributed_hydrostatic_turbulence.jl

@francispoulin
Collaborator Author

Any idea when we might get a version of this script working?

@glwagner
Member

@francispoulin the only thing required to make a simulation distributed is to use arch = Distributed(CPU()) or arch = Distributed(GPU()). What are you trying to do?
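For concreteness, the change is just the architecture passed to the grid; a sketch (matching the full example in the next comment):

    arch = Distributed(CPU())   # or Distributed(GPU())
    grid = RectilinearGrid(arch; size = (Nx, Ny, Nz), extent = (4π, 4π, 1), halo = (7, 7, 7))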

@glwagner
Member

glwagner commented Feb 27, 2025

Try this:

using Oceananigans
using MPI
using Random

arch = Distributed(CPU())
Nx = Ny = 512
Nz = 3
grid = RectilinearGrid(arch; size = (Nx, Ny, Nz), extent=(4π, 4π, 1), halo=(7, 7, 7))

model = HydrostaticFreeSurfaceModel(; grid,
                                    momentum_advection = WENOVectorInvariant(),
                                    free_surface = SplitExplicitFreeSurface(grid, substeps=30),
                                    tracer_advection = WENO())

# Scale seed with rank to avoid symmetry
local_rank = MPI.Comm_rank(arch.communicator)
Random.seed!(1234 * (local_rank + 1))
ϵ(x, y, z) = 1 - 2rand()
set!(model, u=ϵ, v=ϵ)

Δx = minimum_xspacing(grid)
Δt = 0.1 * Δx
simulation = Simulation(model; Δt, stop_iteration=1000)

function progress(sim)
    max_u = maximum(abs, sim.model.velocities.u)
    msg = string("Iteration: ", iteration(sim), ", time: ", time(sim),
                 ", max|u|: ", max_u)
    @info msg
    return nothing
end

add_callback!(simulation, progress, IterationInterval(100))

outputs = merge(model.velocities, model.tracers)
filepath = "mpi_hydrostatic_turbulence_rank$(local_rank)"
simulation.output_writers[:fields] = JLD2OutputWriter(model, outputs,
                                                      filename = filepath,
                                                      schedule = IterationInterval(100),
                                                      overwrite_existing = true)

run!(simulation)

To get this running, I put it in a file called test.jl. Then I installed mpiexecjl:

$ julia --project -e 'using MPI; MPI.install_mpiexecjl()'

then I ran it:

$ ~/.julia/bin/mpiexecjl -n 4 julia --project test.jl

This gives me:

$ ~/.julia/bin/mpiexecjl -n 4 julia --project test.jl
[ Info: Oceananigans will use 12 threads
[ Info: Oceananigans will use 12 threads
[ Info: Oceananigans will use 12 threads
[ Info: MPI has not been initialized, so we are calling MPI.Init().
[ Info: MPI has not been initialized, so we are calling MPI.Init().
[ Info: Oceananigans will use 12 threads
[ Info: MPI has not been initialized, so we are calling MPI.Init().
[ Info: MPI has not been initialized, so we are calling MPI.Init().
[ Info: Initializing simulation...
[ Info: Initializing simulation...
[ Info: Initializing simulation...
[ Info: Initializing simulation...
[ Info: Iteration: 0, time: 0.0, max|u|: 0.9999998087078485
[ Info: Iteration: 0, time: 0.0, max|u|: 0.9999957980743734
[ Info: Iteration: 0, time: 0.0, max|u|: 0.9999933228349054
[ Info: Iteration: 0, time: 0.0, max|u|: 0.9999968905665666
[ Info:     ... simulation initialization complete (10.738 seconds)
[ Info:     ... simulation initialization complete (10.906 seconds)
[ Info:     ... simulation initialization complete (11.060 seconds)
[ Info:     ... simulation initialization complete (11.054 seconds)
[ Info: Executing initial time step...
[ Info: Executing initial time step...
[ Info: Executing initial time step...
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (7.569 seconds).
[ Info:     ... initial time step complete (7.574 seconds).
[ Info:     ... initial time step complete (7.574 seconds).
[ Info:     ... initial time step complete (7.574 seconds).
[ Info: Iteration: 100, time: 0.24543692606170325, max|u|: 0.9503706281480298
[ Info: Iteration: 100, time: 0.24543692606170325, max|u|: 0.8385581172199201
[ Info: Iteration: 100, time: 0.24543692606170325, max|u|: 0.8257834433119515
[ Info: Iteration: 100, time: 0.24543692606170325, max|u|: 0.9111077759702791
[ Info: Iteration: 200, time: 0.49087385212340723, max|u|: 0.7082530184408709
[ Info: Iteration: 200, time: 0.49087385212340723, max|u|: 0.6166522753719914
[ Info: Iteration: 200, time: 0.49087385212340723, max|u|: 0.6326766304269028
[ Info: Iteration: 200, time: 0.49087385212340723, max|u|: 0.6696891471532802
[ Info: Iteration: 300, time: 0.7363107781851111, max|u|: 0.519461930377305
[ Info: Iteration: 300, time: 0.7363107781851111, max|u|: 0.6280885299424657
[ Info: Iteration: 300, time: 0.7363107781851111, max|u|: 0.600733174262809
[ Info: Iteration: 300, time: 0.7363107781851111, max|u|: 0.5345450628673847
[ Info: Iteration: 400, time: 0.9817477042468151, max|u|: 0.5214418560211237
[ Info: Iteration: 400, time: 0.9817477042468151, max|u|: 0.4554533513488053
[ Info: Iteration: 400, time: 0.9817477042468151, max|u|: 0.5213253717944439
[ Info: Iteration: 400, time: 0.9817477042468151, max|u|: 0.4854899439616603

@glwagner
Member

Do you specifically need the script mentioned in this PR or are you just trying to run distributed simulations in general?

@francispoulin
Collaborator Author

Thanks @glwagner !

I am looking for an example that runs on either CPUs or GPUs in parallel. I want to start with CPUs. Your example looks great and I'm happy to learn from it.

I tried this on my laptop and two servers. One server is still running, so it might have worked. Any ideas what these errors mean?

The second server failed with this error:

[ Info: MPI has not been initialized, so we are calling MPI.Init().
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_shmem_base_select failed
  --> Returned value -1 instead of OPAL_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    orte_init:startup:internal-failure
But I couldn't open the help file:
    /u/fpoulin/.julia/artifacts/689e8d7a066e74f34108119415bd9e46a2f8b168/share/openmpi/help-orte-runtime: No such file or directory.  Sorry!

My laptop failed with this error:

$ ~/.julia/bin/mpiexecjl -n 4 julia --project test_mpi.jl 
Attempting to use an MPI routine (internal_Comm_size) before initializing or after finalizing MPICH
Attempting to use an MPI routine (internal_Comm_size) before initializing or after finalizing MPICH
Attempting to use an MPI routine (internal_Comm_size) before initializing or after finalizing MPICH
Attempting to use an MPI routine (internal_Comm_size) before initializing or after finalizing MPICH
ERROR: failed process: Process(`/home/fpoulin/.julia/artifacts/e85c0a68e07fee0ee7b19c2abc210b1af2f4771a/bin/mpiexec -n 4 julia test_mpi.jl`, ProcessExited(1)) [1]

Stacktrace:
  [1] pipeline_error
    @ Base ./process.jl:565 [inlined]
  [2] run(::Cmd; wait::Bool)
    @ Base ./process.jl:480
  [3] run(::Cmd)
    @ Base process.jl:477
  [4] (::var"#1#2")(exe::String)
    @ Main none:4
  [5] (::JLLWrappers.var"#2#3"{var"#1#2", String})()
    @ JLLWrappers ~/.julia/packages/JLLWrappers/GfYNv/src/runtime.jl:49
  [6] withenv(::JLLWrappers.var"#2#3"{var"#1#2", String}, ::Pair{String, String}, ::Vararg{Pair{String, String}})
    @ Base env.jl:256
  [7] withenv_executable_wrapper(f::Function, executable_path::String, PATH::String, LIBPATH::String, adjust_PATH::Bool, adjust_LIBPATH::Bool)
    @ JLLWrappers ~/.julia/packages/JLLWrappers/GfYNv/src/runtime.jl:48
  [8] invokelatest(::Any, ::Any, ::Vararg{Any}; kwargs::@Kwargs{})
    @ Base ./essentials.jl:887
  [9] invokelatest(::Any, ::Any, ::Vararg{Any})
    @ Base ./essentials.jl:884
 [10] mpiexec(f::Function; adjust_PATH::Bool, adjust_LIBPATH::Bool)
    @ MPICH_jll ~/.julia/packages/JLLWrappers/GfYNv/src/products/executable_generators.jl:28
 [11] mpiexec(f::Function)
    @ MPICH_jll ~/.julia/packages/JLLWrappers/GfYNv/src/products/executable_generators.jl:25
 [12] top-level scope
    @ none:4

@glwagner
Member

I think there is a problem with your MPI build. It can be tricky. Check out the docs for MPI.jl. Sometimes the best approach is to use MPITrampoline.
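If it comes to switching MPI implementations, MPI.jl handles that through the MPIPreferences package; a sketch (see the MPI.jl configuration docs for the authoritative workflow):

    using MPIPreferences

    # Option 1: use the MPItrampoline-based JLL binary
    MPIPreferences.use_jll_binary("MPItrampoline_jll")

    # Option 2: point MPI.jl at a system-provided MPI instead
    # MPIPreferences.use_system_binary()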

@glwagner
Member

The laptop failure is odd because, if you do not call MPI.Init() manually, you should get an info statement that Init is being called (that line is present in your server log but not in your laptop output).

I suggest trying to get some simple MPI code to run first (not using Oceananigans), and once you are sure MPI is working, then turn to Oceananigans.
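For example, a minimal MPI-only smoke test (no Oceananigans) might look like:

    using MPI

    MPI.Init()
    comm = MPI.COMM_WORLD

    # Each rank reports itself; with -n 4 you should see four distinct lines
    println("Hello from rank $(MPI.Comm_rank(comm)) of $(MPI.Comm_size(comm))")
    MPI.Barrier(comm)

Run it with the same mpiexecjl invocation as above; if every rank prints its own line, the MPI installation itself is fine.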

There is also some info here: #3669

Possibly, we can start a discussion thread for MPI as well.

@francispoulin
Collaborator Author

Thanks @glwagner . I have asked for some help and now have it working on both servers. My laptop is not so important so I will put that on hold for a while.

I tried switching this from a CPU to a GPU and it seems to fail, even on one GPU. Can you confirm that this works for you if you swap CPU for GPU?

@glwagner
Member

glwagner commented Mar 5, 2025

How does it fail?

@glwagner
Member

glwagner commented Mar 5, 2025

I am super swamped but will try to find time for this. It would be really amazing if you could help with the docs on using distributed stuff. We need docs for both the hydrostatic and nonhydrostatic models. One thing we were struggling with is how to doctest; i.e., we cannot run distributed cases within Documenter.

check this out too:

#3698

@glwagner
Member

glwagner commented Mar 5, 2025

Just have faith that everything works because there are tests. Maybe check out the tests to see? We need to figure out how to make it easier to run these things.

@francispoulin
Collaborator Author

Thanks @glwagner . I'm happy to work on the docs for this as I would like to learn how it works, and writing docs seems like the best way to do that.

Some good news! I managed to get it running on one of the two servers. Interesting how we need to have using MPI and then call MPI.Init() before loading Oceananigans. But it does work, so I'm excited about that.
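In other words, the top of the script now starts like this (a sketch of the ordering that worked for me):

    using MPI
    MPI.Init()              # initialize MPI before Oceananigans is loaded

    using Oceananigans

    arch = Distributed(GPU())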

The other server seems to have problems with LLVM.jl, see below.
I know that I updated it this morning and it updated LLVM.jl, but now it's still complaining. Maybe this is an example of me in library hell? I could delete my .julia and start again I suppose.

No problem if you are busy. I think I can look through the examples and narrow down what the problem might be.

[fpoulin@cdr2637 Oceananigans.jl]$ julia --project test_gpu.jl
ERROR: LoadError: LLVM.jl only supports LLVM 15 and later.
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] top-level scope
   @ ~/.julia/packages/LLVM/b3kFs/src/LLVM.jl:40
 [3] include
   @ ./Base.jl:457 [inlined]
 [4] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt128}}, source::String)
   @ Base ./loading.jl:2045
 [5] top-level scope
   @ stdin:3
in expression starting at /home/fpoulin/.julia/packages/LLVM/b3kFs/src/LLVM.jl:1
in expression starting at stdin:3

@glwagner
Member

glwagner commented Mar 5, 2025

Interesting how we need to have using MPI and then call MPI.Init() before loading Oceananigans. But it does work, so I'm excited about that.

Very good to know.

@glwagner
Member

glwagner commented Mar 5, 2025

What version of Julia are you using?

@francispoulin
Collaborator Author

1.10.0 was the version I was using. Should I try 1.11?

@glwagner
Member

glwagner commented Mar 7, 2025

1.10.8
