distributed_hydrostatic_turbulence.jl yields NaNs #4068

Open
francispoulin opened this issue Jan 28, 2025 · 23 comments

@francispoulin
Collaborator

I have tried running all the scripts in distributed_simulations and they all failed for me. I thought I would point out the problems I have found here so we can clean them up. Let's start with this one.

distributed_hydrostatic_turbulence.jl

It starts fine but then produces NaNs, which suggests to me that the time step is too large. I reduced the cfl parameter from 0.2 to 0.1, and instead of dying at iteration 200 it died at iteration 6100. Better, but not great.

I am going to try 0.05, but is it a concern that it ran several months ago with these parameters and now it doesn't? Also, why does the cfl need to be so small? I would think that a cfl of 0.2 should be fine for pretty much any simulation.
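For reference, the cfl here enters the script through a TimeStepWizard callback; roughly this pattern (a sketch, with illustrative parameter values rather than the script's exact settings):

    wizard = TimeStepWizard(cfl=0.1, max_change=1.1)
    simulation.callbacks[:wizard] = Callback(wizard, IterationInterval(10))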

@glwagner
Member

I don't think you can actually adapt the time step right now. Try with a fixed time step.
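For example, a fixed step sized from the grid spacing (a sketch; the 0.1 prefactor assumes order-one velocities):

    Δx = minimum_xspacing(grid)
    Δt = 0.1 * Δx
    simulation = Simulation(model; Δt, stop_time=100.0)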

Also the setup looks weird to me:

    model = HydrostaticFreeSurfaceModel(; grid,
                                        momentum_advection = VectorInvariant(vorticity_scheme=WENO(order=9)),
                                        free_surface = SplitExplicitFreeSurface(grid, substeps=10),
                                        tracer_advection = WENO(),
                                        buoyancy = nothing,
                                        coriolis = FPlane(f = 1),
                                        tracers = :c)

This uses WENO for vorticity but second order for everything else, and there is no other closure. Can you modify the physics so that we have a hope the simulation will run? You want to use WENOVectorInvariant. Also, 10 substeps for the split-explicit free surface is probably too few.
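Something along these lines (a sketch; the substep count is a guess to be tuned):

    model = HydrostaticFreeSurfaceModel(; grid,
                                        momentum_advection = WENOVectorInvariant(),
                                        free_surface = SplitExplicitFreeSurface(grid, substeps=30),
                                        tracer_advection = WENO(),
                                        buoyancy = nothing,
                                        coriolis = FPlane(f = 1),
                                        tracers = :c)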

@francispoulin
Collaborator Author

Very good idea! I just noticed that it was proceeding with a time step of 1e-84 before it produced a NaN.

@glwagner
Member

Might make sense to try it in serial and make sure the setup runs before trying to distribute it.

@francispoulin
Collaborator Author

This is my serial version of the hydrostatic script. It fails for me after 100 iterations with NaN in the u field.

Maybe the suggestions that @glwagner made will fix this up and then we can go to the distributed version?

using Oceananigans
using Oceananigans.Models.HydrostaticFreeSurfaceModels: VerticalVorticityField
using Printf
using Statistics
using Oceananigans.BoundaryConditions
using Random
using JLD2
using Oceananigans.ImmersedBoundaries: ActiveCellsIBG, active_interior_map

function run_simulation(nx, ny; topology = (Periodic, Periodic, Bounded))
    grid = RectilinearGrid(CPU(); topology, size = (nx, ny, 10), extent=(4π, 4π, 0.5), halo=(8, 8, 8))
    
    bottom(x, y) = (x > π && x < 3π/2 && y > π/2 && y < 3π/2) ? 1.0 : - grid.Lz - 1.0
    grid = ImmersedBoundaryGrid(grid, GridFittedBottom(bottom); active_cells_map = true)

    model = HydrostaticFreeSurfaceModel(; grid,
                                        momentum_advection = VectorInvariant(vorticity_scheme=WENO(order=9)),
                                        free_surface = SplitExplicitFreeSurface(grid, substeps=10),
                                        tracer_advection = WENO(),
                                        buoyancy = nothing,
                                        coriolis = FPlane(f = 1),
                                        tracers = :c)

    Random.seed!(1234)

    set!(model, u = (x, y, z) -> 1-2rand(), v = (x, y, z) -> 1-2rand())
    
    mask(x, y, z) = x > 3π/2 && x < 5π/2 && y > 3π/2 && y < 5π/2
    c = model.tracers.c

    set!(c, mask)

    u, v, _ = model.velocities
    η = model.free_surface.η
    outputs = merge(model.velocities, model.tracers)

    progress(sim) = @info "Iteration: $(sim.model.clock.iteration), time: $(sim.model.clock.time), Δt: $(sim.Δt)"
    simulation = Simulation(model, Δt=0.02, stop_time=100.0)

    simulation.callbacks[:progress] = Callback(progress, IterationInterval(100))
 
    filepath = "hydrostatic_turbulence"
    simulation.output_writers[:fields] =
        JLD2OutputWriter(model, outputs, filename=filepath, schedule=TimeInterval(0.1),
                         overwrite_existing=true)

    run!(simulation)
end

Nx, Ny = 32, 32

run_simulation(Nx, Ny)

@simone-silvestri
Collaborator

Oh, this validation script is a little old and not up to date. I'll open a PR to correct it.

@francispoulin
Collaborator Author

Thank you @simone-silvestri !

@glwagner
Member

For reference the script is here: https://github.com/CliMA/Oceananigans.jl/blob/main/validation/distributed_simulations/distributed_hydrostatic_turbulence.jl

@francispoulin
Collaborator Author

Any idea when we might get a version of this script working?

@glwagner
Member

@francispoulin the only thing required to make a simulation distributed is to use arch = Distributed(CPU()) or arch = Distributed(GPU()). What are you trying to do?
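For concreteness, the change is just the architecture passed to the grid; a sketch (matching the full example in the next comment):

    arch = Distributed(CPU())   # or Distributed(GPU())
    grid = RectilinearGrid(arch; size = (Nx, Ny, Nz), extent = (4π, 4π, 1), halo = (7, 7, 7))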

@glwagner
Member

glwagner commented Feb 27, 2025

Try this:

using Oceananigans
using MPI
using Random

arch = Distributed(CPU())
Nx = Ny = 512
Nz = 3
grid = RectilinearGrid(arch; size = (Nx, Ny, Nz), extent=(4π, 4π, 1), halo=(7, 7, 7))

model = HydrostaticFreeSurfaceModel(; grid,
                                    momentum_advection = WENOVectorInvariant(),
                                    free_surface = SplitExplicitFreeSurface(grid, substeps=30),
                                    tracer_advection = WENO())

# Scale seed with rank to avoid symmetry
local_rank = MPI.Comm_rank(arch.communicator)
Random.seed!(1234 * (local_rank + 1))
ϵ(x, y, z) = 1 - 2rand()
set!(model, u=ϵ, v=ϵ)

Δx = minimum_xspacing(grid)
Δt = 0.1 * Δx
simulation = Simulation(model; Δt, stop_iteration=1000)

function progress(sim)
    max_u = maximum(abs, sim.model.velocities.u)
    msg = string("Iteration: ", iteration(sim), ", time: ", time(sim),
                 ", max|u|: ", max_u)
    @info msg
    return nothing
end

add_callback!(simulation, progress, IterationInterval(100))

outputs = merge(model.velocities, model.tracers)
filepath = "mpi_hydrostatic_turbulence_rank$(local_rank)"
simulation.output_writers[:fields] = JLD2OutputWriter(model, outputs,
                                                      filename = filepath,
                                                      schedule = IterationInterval(100),
                                                      overwrite_existing = true)

run!(simulation)

To get this running, I put it in a file called test.jl. Then I installed mpiexecjl:

$ julia --project -e 'using MPI; MPI.install_mpiexecjl()'

then I ran it:

$ ~/.julia/bin/mpiexecjl -n 4 julia --project test.jl

This gives me:

$ ~/.julia/bin/mpiexecjl -n 4 julia --project test.jl
[ Info: Oceananigans will use 12 threads
[ Info: Oceananigans will use 12 threads
[ Info: Oceananigans will use 12 threads
[ Info: MPI has not been initialized, so we are calling MPI.Init().
[ Info: MPI has not been initialized, so we are calling MPI.Init().
[ Info: Oceananigans will use 12 threads
[ Info: MPI has not been initialized, so we are calling MPI.Init().
[ Info: MPI has not been initialized, so we are calling MPI.Init().
[ Info: Initializing simulation...
[ Info: Initializing simulation...
[ Info: Initializing simulation...
[ Info: Initializing simulation...
[ Info: Iteration: 0, time: 0.0, max|u|: 0.9999998087078485
[ Info: Iteration: 0, time: 0.0, max|u|: 0.9999957980743734
[ Info: Iteration: 0, time: 0.0, max|u|: 0.9999933228349054
[ Info: Iteration: 0, time: 0.0, max|u|: 0.9999968905665666
[ Info:     ... simulation initialization complete (10.738 seconds)
[ Info:     ... simulation initialization complete (10.906 seconds)
[ Info:     ... simulation initialization complete (11.060 seconds)
[ Info:     ... simulation initialization complete (11.054 seconds)
[ Info: Executing initial time step...
[ Info: Executing initial time step...
[ Info: Executing initial time step...
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (7.569 seconds).
[ Info:     ... initial time step complete (7.574 seconds).
[ Info:     ... initial time step complete (7.574 seconds).
[ Info:     ... initial time step complete (7.574 seconds).
[ Info: Iteration: 100, time: 0.24543692606170325, max|u|: 0.9503706281480298
[ Info: Iteration: 100, time: 0.24543692606170325, max|u|: 0.8385581172199201
[ Info: Iteration: 100, time: 0.24543692606170325, max|u|: 0.8257834433119515
[ Info: Iteration: 100, time: 0.24543692606170325, max|u|: 0.9111077759702791
[ Info: Iteration: 200, time: 0.49087385212340723, max|u|: 0.7082530184408709
[ Info: Iteration: 200, time: 0.49087385212340723, max|u|: 0.6166522753719914
[ Info: Iteration: 200, time: 0.49087385212340723, max|u|: 0.6326766304269028
[ Info: Iteration: 200, time: 0.49087385212340723, max|u|: 0.6696891471532802
[ Info: Iteration: 300, time: 0.7363107781851111, max|u|: 0.519461930377305
[ Info: Iteration: 300, time: 0.7363107781851111, max|u|: 0.6280885299424657
[ Info: Iteration: 300, time: 0.7363107781851111, max|u|: 0.600733174262809
[ Info: Iteration: 300, time: 0.7363107781851111, max|u|: 0.5345450628673847
[ Info: Iteration: 400, time: 0.9817477042468151, max|u|: 0.5214418560211237
[ Info: Iteration: 400, time: 0.9817477042468151, max|u|: 0.4554533513488053
[ Info: Iteration: 400, time: 0.9817477042468151, max|u|: 0.5213253717944439
[ Info: Iteration: 400, time: 0.9817477042468151, max|u|: 0.4854899439616603

@glwagner
Member

Do you specifically need the script mentioned in this PR or are you just trying to run distributed simulations in general?

@francispoulin
Collaborator Author

Thanks @glwagner !

I am looking for an example that runs on either CPUs or GPUs in parallel. I want to start with CPUs. Your example looks great and I'm happy to learn from it.

I tried this on my laptop and two servers. One server is still running, so it might have worked. Any ideas what these errors mean?

The second server failed with this error:

[ Info: MPI has not been initialized, so we are calling MPI.Init().
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_shmem_base_select failed
  --> Returned value -1 instead of OPAL_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    orte_init:startup:internal-failure
But I couldn't open the help file:
    /u/fpoulin/.julia/artifacts/689e8d7a066e74f34108119415bd9e46a2f8b168/share/openmpi/help-orte-runtime: No such file or directory.  Sorry!

My laptop failed with this error:

$ ~/.julia/bin/mpiexecjl -n 4 julia --project test_mpi.jl 
Attempting to use an MPI routine (internal_Comm_size) before initializing or after finalizing MPICH
Attempting to use an MPI routine (internal_Comm_size) before initializing or after finalizing MPICH
Attempting to use an MPI routine (internal_Comm_size) before initializing or after finalizing MPICH
Attempting to use an MPI routine (internal_Comm_size) before initializing or after finalizing MPICH
ERROR: failed process: Process(`/home/fpoulin/.julia/artifacts/e85c0a68e07fee0ee7b19c2abc210b1af2f4771a/bin/mpiexec -n 4 julia test_mpi.jl`, ProcessExited(1)) [1]

Stacktrace:
  [1] pipeline_error
    @ Base ./process.jl:565 [inlined]
  [2] run(::Cmd; wait::Bool)
    @ Base ./process.jl:480
  [3] run(::Cmd)
    @ Base process.jl:477
  [4] (::var"#1#2")(exe::String)
    @ Main none:4
  [5] (::JLLWrappers.var"#2#3"{var"#1#2", String})()
    @ JLLWrappers ~/.julia/packages/JLLWrappers/GfYNv/src/runtime.jl:49
  [6] withenv(::JLLWrappers.var"#2#3"{var"#1#2", String}, ::Pair{String, String}, ::Vararg{Pair{String, String}})
    @ Base env.jl:256
  [7] withenv_executable_wrapper(f::Function, executable_path::String, PATH::String, LIBPATH::String, adjust_PATH::Bool, adjust_LIBPATH::Bool)
    @ JLLWrappers ~/.julia/packages/JLLWrappers/GfYNv/src/runtime.jl:48
  [8] invokelatest(::Any, ::Any, ::Vararg{Any}; kwargs::@Kwargs{})
    @ Base ./essentials.jl:887
  [9] invokelatest(::Any, ::Any, ::Vararg{Any})
    @ Base ./essentials.jl:884
 [10] mpiexec(f::Function; adjust_PATH::Bool, adjust_LIBPATH::Bool)
    @ MPICH_jll ~/.julia/packages/JLLWrappers/GfYNv/src/products/executable_generators.jl:28
 [11] mpiexec(f::Function)
    @ MPICH_jll ~/.julia/packages/JLLWrappers/GfYNv/src/products/executable_generators.jl:25
 [12] top-level scope
    @ none:4

@glwagner
Member

I think there is a problem with your MPI build. It can be tricky. Check out the docs for MPI.jl. Sometimes the best approach is to use MPITrampoline.
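If it comes to switching MPI implementations, MPI.jl handles that through the MPIPreferences package; a sketch (see the MPI.jl configuration docs for the authoritative workflow):

    using MPIPreferences

    # Option 1: use the MPItrampoline-based JLL binary
    MPIPreferences.use_jll_binary("MPItrampoline_jll")

    # Option 2: point MPI.jl at a system-provided MPI instead
    # MPIPreferences.use_system_binary()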

@glwagner
Member

The laptop failure is odd because, if you do not call MPI.Init() manually, you should get an info statement that Init is being called (that line is present in your server log but not in your laptop output).

I suggest trying to get some simple MPI code to run first (not using Oceananigans), and once you are sure MPI is working, then turn to Oceananigans.
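For example, a minimal MPI-only smoke test (no Oceananigans) might look like:

    using MPI

    MPI.Init()
    comm = MPI.COMM_WORLD

    # Each rank reports itself; with -n 4 you should see four distinct lines
    println("Hello from rank $(MPI.Comm_rank(comm)) of $(MPI.Comm_size(comm))")
    MPI.Barrier(comm)

Run it with the same mpiexecjl invocation as above; if every rank prints its own line, the MPI installation itself is fine.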

There is also some info here: #3669

Possibly, we can start a discussion thread for MPI as well.

@francispoulin
Collaborator Author

Thanks @glwagner . I have asked for some help and now have it working on both servers. My laptop is not so important so I will put that on hold for a while.

I tried switching this from a CPU to a GPU and it seems to fail, even on one GPU. Can you confirm that this works for you if you swap CPU for GPU?

@glwagner
Member

glwagner commented Mar 5, 2025

How does it fail?

@glwagner
Member

glwagner commented Mar 5, 2025

I am super swamped but will try to find time for this. It would be really amazing if you could help with the docs on using distributed stuff. We need docs for both the hydrostatic and nonhydrostatic models. One thing we were struggling with is how to doctest; i.e., we cannot run distributed cases within Documenter.

check this out too:

#3698

@glwagner
Member

glwagner commented Mar 5, 2025

Just have faith that everything works because there are tests. Maybe check out the tests to see? We need to figure out how to make it easier to run these things.

@francispoulin
Collaborator Author

Thanks @glwagner . I'm happy to work on the docs for this as I would like to learn how it works, and writing docs seems like the best way to do that.

Some good news! I managed to get it running on one of the two servers. Interesting how we need to have using MPI and then call MPI.Init() before loading Oceananigans. But it does work, so I'm excited about that.
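In other words, the top of the script now starts like this (a sketch of the ordering that worked for me):

    using MPI
    MPI.Init()              # initialize MPI before Oceananigans is loaded

    using Oceananigans

    arch = Distributed(GPU())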

The other server seems to have problems with LLVM.jl, see below.
I know that I updated it this morning and it updated LLVM.jl, but now it's still complaining. Maybe this is an example of me in library hell? I could delete my .julia and start again I suppose.

No problem if you are busy. I think I can look through the examples and narrow down what the problem might be.

[fpoulin@cdr2637 Oceananigans.jl]$ julia --project test_gpu.jl
ERROR: LoadError: LLVM.jl only supports LLVM 15 and later.
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] top-level scope
   @ ~/.julia/packages/LLVM/b3kFs/src/LLVM.jl:40
 [3] include
   @ ./Base.jl:457 [inlined]
 [4] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt128}}, source::String)
   @ Base ./loading.jl:2045
 [5] top-level scope
   @ stdin:3
in expression starting at /home/fpoulin/.julia/packages/LLVM/b3kFs/src/LLVM.jl:1
in expression starting at stdin:3

@glwagner
Member

glwagner commented Mar 5, 2025

Interesting how we need to have using MPI and then call MPI.Init() before loading Oceananigans. But it does work, so I'm excited about that.

Very good to know.

@glwagner
Member

glwagner commented Mar 5, 2025

What version of Julia are you using?

@francispoulin
Collaborator Author

1.10.0 was the version I was using. Should I try 1.11?

@glwagner
Member

glwagner commented Mar 7, 2025

1.10.8
