distributed_hydrostatic_turbulence.jl yields NaNs #4068
I don't think you can actually adapt the time step right now. Try with a fixed time step. Also, the setup looks weird to me:

```julia
model = HydrostaticFreeSurfaceModel(; grid,
                                      momentum_advection = VectorInvariant(vorticity_scheme=WENO(order=9)),
                                      free_surface = SplitExplicitFreeSurface(grid, substeps=10),
                                      tracer_advection = WENO(),
                                      buoyancy = nothing,
                                      coriolis = FPlane(f = 1),
                                      tracers = :c)
```

This is WENO for vorticity but second order for everything else? And no other closure. Can you modify the physics so that we have a hope the simulation will run? You want to use
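For illustration, here is a sketch of a setup along those lines, with WENO used consistently and a fixed time step. The specific choices (`WENOVectorInvariant()`, 30 substeps, `Δt = 0.1 * Δx`, and the grid parameters) are assumptions borrowed from the working example further down this thread, not a prescription from this comment:

```julia
using Oceananigans

# Grid values are taken from the working example later in the thread (assumptions)
grid = RectilinearGrid(CPU(); size = (512, 512, 3), extent = (4π, 4π, 1), halo = (7, 7, 7))

model = HydrostaticFreeSurfaceModel(; grid,
                                      momentum_advection = WENOVectorInvariant(),
                                      free_surface = SplitExplicitFreeSurface(grid, substeps=30),
                                      tracer_advection = WENO(),
                                      buoyancy = nothing,
                                      coriolis = FPlane(f = 1),
                                      tracers = :c)

# Fixed time step instead of adaptive time stepping
Δt = 0.1 * minimum_xspacing(grid)
simulation = Simulation(model; Δt, stop_iteration=1000)
```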
Very good idea! I just noticed that it was proceeding with a time step of 1e-84 before it produced an
Might make sense to try it in serial and make sure the setup runs before trying to distribute it.
This is my serial version of the hydrostatic script. It fails for me after 100 iterations with `NaN`s. Maybe the suggestions that @glwagner made will fix this up and then we can go to the distributed version.
Oh, this validation is a little old and not up to date. I'll open a PR to correct it.
Thank you @simone-silvestri!
For reference, the script is here: https://github.com/CliMA/Oceananigans.jl/blob/main/validation/distributed_simulations/distributed_hydrostatic_turbulence.jl
Any idea when we might get a version of this script working?
@francispoulin the only thing required to make a simulation distributed is to use a `Distributed` architecture.
Try this:

```julia
using Oceananigans
using MPI
using Random

arch = Distributed(CPU())

Nx = Ny = 512
Nz = 3

grid = RectilinearGrid(arch; size = (Nx, Ny, Nz), extent = (4π, 4π, 1), halo = (7, 7, 7))

model = HydrostaticFreeSurfaceModel(; grid,
                                      momentum_advection = WENOVectorInvariant(),
                                      free_surface = SplitExplicitFreeSurface(grid, substeps=30),
                                      tracer_advection = WENO())

# Scale seed with rank to avoid symmetry
local_rank = MPI.Comm_rank(arch.communicator)
Random.seed!(1234 * (local_rank + 1))

ϵ(x, y, z) = 1 - 2rand()
set!(model, u=ϵ, v=ϵ)

Δx = minimum_xspacing(grid)
Δt = 0.1 * Δx
simulation = Simulation(model; Δt, stop_iteration=1000)

function progress(sim)
    max_u = maximum(abs, sim.model.velocities.u)
    msg = string("Iteration: ", iteration(sim), ", time: ", time(sim),
                 ", max|u|: ", max_u)
    @info msg
    return nothing
end

add_callback!(simulation, progress, IterationInterval(100))

outputs = merge(model.velocities, model.tracers)
filepath = "mpi_hydrostatic_turbulence_rank$(local_rank)"
simulation.output_writers[:fields] = JLD2OutputWriter(model, outputs,
                                                      filename = filepath,
                                                      schedule = IterationInterval(100),
                                                      overwrite_existing = true)
run!(simulation)
```

To get this running, I put it in a file called `test.jl` and installed `mpiexecjl`:

```
$ julia --project -e 'using MPI; MPI.install_mpiexecjl()'
```

Then I ran it with `~/.julia/bin/mpiexecjl julia --project -n 4 test.jl`. This gives me

```
$ ~/.julia/bin/mpiexecjl -n 4 julia --project test.jl
[ Info: Oceananigans will use 12 threads
[ Info: Oceananigans will use 12 threads
[ Info: Oceananigans will use 12 threads
[ Info: MPI has not been initialized, so we are calling MPI.Init().
[ Info: MPI has not been initialized, so we are calling MPI.Init().
[ Info: Oceananigans will use 12 threads
[ Info: MPI has not been initialized, so we are calling MPI.Init().
[ Info: MPI has not been initialized, so we are calling MPI.Init().
[ Info: Initializing simulation...
[ Info: Initializing simulation...
[ Info: Initializing simulation...
[ Info: Initializing simulation...
[ Info: Iteration: 0, time: 0.0, max|u|: 0.9999998087078485
[ Info: Iteration: 0, time: 0.0, max|u|: 0.9999957980743734
[ Info: Iteration: 0, time: 0.0, max|u|: 0.9999933228349054
[ Info: Iteration: 0, time: 0.0, max|u|: 0.9999968905665666
[ Info: ... simulation initialization complete (10.738 seconds)
[ Info: ... simulation initialization complete (10.906 seconds)
[ Info: ... simulation initialization complete (11.060 seconds)
[ Info: ... simulation initialization complete (11.054 seconds)
[ Info: Executing initial time step...
[ Info: Executing initial time step...
[ Info: Executing initial time step...
[ Info: Executing initial time step...
[ Info: ... initial time step complete (7.569 seconds).
[ Info: ... initial time step complete (7.574 seconds).
[ Info: ... initial time step complete (7.574 seconds).
[ Info: ... initial time step complete (7.574 seconds).
[ Info: Iteration: 100, time: 0.24543692606170325, max|u|: 0.9503706281480298
[ Info: Iteration: 100, time: 0.24543692606170325, max|u|: 0.8385581172199201
[ Info: Iteration: 100, time: 0.24543692606170325, max|u|: 0.8257834433119515
[ Info: Iteration: 100, time: 0.24543692606170325, max|u|: 0.9111077759702791
[ Info: Iteration: 200, time: 0.49087385212340723, max|u|: 0.7082530184408709
[ Info: Iteration: 200, time: 0.49087385212340723, max|u|: 0.6166522753719914
[ Info: Iteration: 200, time: 0.49087385212340723, max|u|: 0.6326766304269028
[ Info: Iteration: 200, time: 0.49087385212340723, max|u|: 0.6696891471532802
[ Info: Iteration: 300, time: 0.7363107781851111, max|u|: 0.519461930377305
[ Info: Iteration: 300, time: 0.7363107781851111, max|u|: 0.6280885299424657
[ Info: Iteration: 300, time: 0.7363107781851111, max|u|: 0.600733174262809
[ Info: Iteration: 300, time: 0.7363107781851111, max|u|: 0.5345450628673847
[ Info: Iteration: 400, time: 0.9817477042468151, max|u|: 0.5214418560211237
[ Info: Iteration: 400, time: 0.9817477042468151, max|u|: 0.4554533513488053
[ Info: Iteration: 400, time: 0.9817477042468151, max|u|: 0.5213253717944439
[ Info: Iteration: 400, time: 0.9817477042468151, max|u|: 0.4854899439616603
```
Do you specifically need the script mentioned in this PR, or are you just trying to run distributed simulations in general?
Thanks @glwagner! I am looking for an example that runs in parallel on either CPUs or GPUs, and I want to start with CPUs. Your example looks great and I'm happy to learn from it. I tried this on my laptop and two servers. One server is still running, so it might have worked. Any idea what these errors mean? The second server failed with this error:
My laptop failed with this error:
I think there is a problem with your MPI build. It can be tricky. Check out the docs for MPI.jl. Sometimes the best approach is to use MPITrampoline.
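For reference, MPI.jl selects its MPI binary through the MPIPreferences package; a minimal sketch of switching to the MPItrampoline JLL (or a system MPI) might look like the following. Which option applies depends on your machine, so check the MPI.jl configuration docs:

```julia
# Run once in the project environment, then restart Julia so MPI.jl
# picks up the new preference.
using MPIPreferences

# Option 1: use the MPItrampoline-wrapped MPI shipped as a JLL package
MPIPreferences.use_jll_binary("MPItrampoline_jll")

# Option 2: point MPI.jl at the MPI installation already on the system
# MPIPreferences.use_system_binary()
```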
The laptop is weird because if you do not call
I suggest trying to get some simple MPI code to run first (not using Oceananigans), and once you are sure MPI is working, then turn to Oceananigans. There is also some info here: #3669. Possibly, we can start a discussion thread for MPI as well.
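A minimal MPI-only smoke test along those lines might look like this (hypothetical filename `test_mpi.jl`; run it with `mpiexecjl` the same way as the script above):

```julia
# test_mpi.jl -- checks that MPI itself launches and communicates,
# with no Oceananigans involved.
using MPI

MPI.Init()

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)

println("Hello from rank $rank of $nranks")

MPI.Barrier(comm)
MPI.Finalize()
```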
Thanks @glwagner! I have asked for some help and now have it working on both servers. My laptop is not so important, so I will put that on hold for a while. I tried switching this from a CPU to a GPU and it seems to fail, even on one GPU. Can you confirm that this works for you if you switch CPU with GPU?
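For concreteness, the switch in question is just the architecture line of the script above; whether `using CUDA` is needed depends on the Oceananigans version, so treat this as a sketch:

```julia
using Oceananigans
# using CUDA  # may be required, depending on the Oceananigans version

# Distributed, one GPU per MPI rank:
arch = Distributed(GPU())

# Single GPU, no MPI:
# arch = GPU()
```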
How does it fail?
I am super swamped but will try to find time for this. It would be really amazing if you could help with the docs on using distributed stuff. We need docs for both hydrostatic and nonhydrostatic. One thing we were struggling with is how to doctest; i.e., we cannot run distributed cases within Documenter. Check this out too:
Just have faith that everything works because there are tests. Maybe check out the tests to see? We need to figure out how to make it easier to run these things.
Thanks @glwagner! I'm happy to work on the docs for this as I would like to learn how it works, and writing docs seems like the best way to do that. Some good news! I managed to get it running on one of the two servers. Interesting how we need to have
The other server seems to have problems with LLVM.jl; see below. No problem if you are busy. I think I can look through the examples and narrow down what the problem might be.
Very good to know.
What version of Julia are you using?
1.10.0 was the version I was using. Should I try 1.11?
1.10.8
I have tried running all the scripts in `distributed_simulations` and they all failed for me. I thought I would point out the problems I have found here and we could clean them up. First, let's start with this one: `distributed_hydrostatic_turbulence.jl`.

It starts fine but then I get `NaN`s, which suggests to me that the time step is too large. I reduced the `cfl` parameter from `0.2` to `0.1`, and instead of dying at iteration 200 it died at 6100. Better, but not great. I am going to try `0.05`, but is it a concern that it ran several months ago with these parameters and now it doesn't? Also, why does the `cfl` need to be so small? I would think that a `cfl` of `0.2` should be great for pretty much any simulation.
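For a rough sense of what those `cfl` values mean for the time step, here is a back-of-the-envelope sketch using the 4π-wide, 512-point grid and the max|u| ≈ 1 initial condition from the example above (not the validation script's exact logic):

```julia
# Advective CFL condition: Δt ≲ cfl * Δx / max|u|, so halving cfl halves the admissible Δt.
Δx = 4π / 512        # grid spacing from the example above
max_u = 1.0          # initial max|u| reported in the logs above

for cfl in (0.2, 0.1, 0.05)
    Δt = cfl * Δx / max_u
    println("cfl = $cfl  =>  Δt ≈ $(round(Δt, sigdigits=3))")
end
```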