
Conversation

@navidcy (Member) commented Oct 9, 2025

No description provided.

@navidcy added the "testing 🧪 Tests get priority in case of emergency evacuation" label on Oct 9, 2025
@simone-silvestri (Collaborator)

I am very interested in this. Let's hope it works and we can move on from Julia 1.10.

@simone-silvestri (Collaborator)

I am disabling the Reactant tests for the moment to check whether the rest works.

@simone-silvestri (Collaborator)

If the docs still break because the internal_tide.jl example NaNs, I would try a GridFittedBottom instead of a PartialCellBottom (which is not even tested). If tests then pass, we can merge this PR and move forward with assessing what the problems with partial cells are.
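
For concreteness, the swap in the example would look roughly like this (a minimal sketch; the grid extents and the bottom profile below are placeholders, not the example's actual values):

using Oceananigans
using Oceananigans.Units

# Underlying grid and bottom profile as in examples/internal_tide.jl
# (the numbers below are placeholders, not the example's actual values)
underlying_grid = RectilinearGrid(size = (256, 128),
                                  x = (-1000kilometers, 1000kilometers),
                                  z = (-2kilometers, 0),
                                  halo = (4, 4),
                                  topology = (Periodic, Flat, Bounded))

bottom(x) = -2kilometers + 100meters * exp(-x^2 / (2 * (20kilometers)^2))  # hypothetical bump

# The example currently uses PartialCellBottom(bottom); to test the hypothesis above,
# swap in GridFittedBottom and rerun:
grid = ImmersedBoundaryGrid(underlying_grid, GridFittedBottom(bottom))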

@simone-silvestri (Collaborator)

Seems that we are hitting the same NaN issue on the internal tide example

@navidcy (Member, Author) commented Oct 9, 2025

Seems that we are hitting the same NaN issue on the internal tide example

the ghosts of the past still haunt us....

@simone-silvestri (Collaborator)

Apparently GridFittedBottom also fails...

@simone-silvestri (Collaborator)

If I run the example locally, it works. Why would it error on CI? Do we have a way to reproduce this error locally?

@ali-ramadhan (Member)

If I run the example locally, it works. Why would it error on CI? Do we have a way to reproduce this error locally?

One thing to try might be to run the example locally and on CI using the exact same Manifest.toml, if possible. We can commit a Manifest.toml to this branch for debugging. I can't think of which dependency would lead to such a big difference, but it's one thing we can control for.
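
For reference, a minimal sketch of how to pin and reproduce the environment with the standard Pkg workflow (assuming the Manifest.toml is committed at the project root the example uses):

using Pkg

# On the machine where the example behaves as expected:
Pkg.activate(".")      # the project containing Project.toml
Pkg.resolve()          # writes Manifest.toml with the exact resolved versions
# ...commit Manifest.toml to this branch, then on CI / the other machine:
Pkg.activate(".")
Pkg.instantiate()      # installs exactly the versions recorded in Manifest.toml
Pkg.status()           # print the versions so the two environments can be compared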

@navidcy (Member, Author) commented Oct 16, 2025

From the Julia v1.11 chat I recall that the error was showing up only on Linux, not on macOS?

@giordano (Collaborator)

With this environment (environment.tar.gz, which includes manifests for both v1.11 and v1.12), https://github.com/CliMA/Oceananigans.jl/blob/ea25179c7af868175fc295c2bf1dbfe78ec3cd4f/examples/internal_tide.jl works on macOS but not on Linux (only Ubuntu tested so far), so it doesn't seem to be a matter of package versions.

@giordano (Collaborator) commented Oct 26, 2025

I can make the simulation error early with

diff --git a/src/Diagnostics/nan_checker.jl b/src/Diagnostics/nan_checker.jl
index 57945c5dc..893a9e283 100644
--- a/src/Diagnostics/nan_checker.jl
+++ b/src/Diagnostics/nan_checker.jl
@@ -5,7 +5,7 @@ mutable struct NaNChecker{F}
     erroring :: Bool
 end
 
-NaNChecker(fields) = NaNChecker(fields, false) # default
+NaNChecker(fields) = NaNChecker(fields, true) # default
 default_nan_checker(model) = nothing
 
 function Base.summary(nc::NaNChecker)
@@ -28,7 +28,7 @@ a container with key-value pairs like a dictionary or `NamedTuple`.
 
 If `erroring=true`, the `NaNChecker` will throw an error on NaN detection.
 """
-NaNChecker(; fields, erroring=false) = NaNChecker(fields, erroring)
+NaNChecker(; fields, erroring=true) = NaNChecker(fields, erroring)
 
 hasnan(field::AbstractArray) = any(isnan, parent(field))
 hasnan(model) = hasnan(first(fields(model)))

I presume there's also a way to set the NaNChecker explicitly for a simulation, but I couldn't quickly figure it out.
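
Something like the following might be what I'm after, though I haven't verified it against the current API (the :nan_checker key and the Callback wiring are guesses on my part):

using Oceananigans
using Oceananigans.Diagnostics: NaNChecker

# Check all prognostic fields every iteration and error as soon as a NaN appears
nan_checker = NaNChecker(fields = merge(model.velocities, model.tracers), erroring = true)
simulation.callbacks[:nan_checker] = Callback(nan_checker, IterationInterval(1))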

Can we use a callback to print all the steps to a file, so that we can compare the progress 1:1 on different machines? Presumably we're initially interested in the field u; that's what the error is about, at least:

julia> run!(simulation)
[ Info: Initializing simulation...
[ Info:     ... simulation initialization complete (17.810 ms)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (9.452 ms).
ERROR: time = 60000.0, iteration = 200: NaN found in field u. Aborting simulation.
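
Something along these lines might do for the 1:1 comparison; a rough, untested sketch, with the file name and the choice of statistics picked arbitrarily:

using Printf

# Append a one-line summary of each prognostic field at every iteration,
# so logs from different machines can be diffed directly.
function log_state(sim, filename = "state_log.txt")
    open(filename, "a") do io
        for (name, field) in pairs(merge(sim.model.velocities, sim.model.tracers))
            data = interior(field)
            @printf(io, "iter %6d  %s  min %.17g  max %.17g  mean %.17g\n",
                    iteration(sim), name, minimum(data), maximum(data), sum(data) / length(data))
        end
    end
    return nothing
end

simulation.callbacks[:state_log] = Callback(log_state, IterationInterval(1))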

@giordano (Collaborator) commented Oct 26, 2025

Before run!(simulation) I see

julia> simulation.model.velocities.u
256×1×128 Field{Face, Center, Center} on ImmersedBoundaryGrid on CPU
├── grid: 256×1×128 ImmersedBoundaryGrid{Float64, Periodic, Flat, Bounded} on CPU with 4×0×4 halo
├── boundary conditions: FieldBoundaryConditions
│   └── west: Periodic, east: Periodic, south: Nothing, north: Nothing, bottom: ZeroFlux, top: ZeroFlux, immersed: ZeroFlux
└── data: 264×1×136 OffsetArray(::Array{Float64, 3}, -3:260, 1:1, -3:132) with eltype Float64 with indices -3:260×1:1×-3:132
    └── max=0.281029, min=0.281029, mean=0.281029

on both machines. If I'm looking at the right field, and this summary says enough about it, then they're the same at the beginning. But then on macOS I have

julia> time_step!(simulation); simulation.model.velocities.u
[ Info: Initializing simulation...
[ Info: Iter: 0, time: 0 seconds, wall time: 2.256 minutes, max|w|: 2.089e-03, m s⁻¹
[ Info:     ... simulation initialization complete (887.307 ms)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (128.489 ms).
256×1×128 Field{Face, Center, Center} on ImmersedBoundaryGrid on CPU
├── grid: 256×1×128 ImmersedBoundaryGrid{Float64, Periodic, Flat, Bounded} on CPU with 4×0×4 halo
├── boundary conditions: FieldBoundaryConditions
│   └── west: Periodic, east: Periodic, south: Nothing, north: Nothing, bottom: ZeroFlux, top: ZeroFlux, immersed: ZeroFlux
└── data: 264×1×136 OffsetArray(::Array{Float64, 3}, -3:260, 1:1, -3:132) with eltype Float64 with indices -3:260×1:1×-3:132
    └── max=0.31715, min=0.265116, mean=0.280967

julia> time_step!(simulation); simulation.model.velocities.u
256×1×128 Field{Face, Center, Center} on ImmersedBoundaryGrid on CPU
├── grid: 256×1×128 ImmersedBoundaryGrid{Float64, Periodic, Flat, Bounded} on CPU with 4×0×4 halo
├── boundary conditions: FieldBoundaryConditions
│   └── west: Periodic, east: Periodic, south: Nothing, north: Nothing, bottom: ZeroFlux, top: ZeroFlux, immersed: ZeroFlux
└── data: 264×1×136 OffsetArray(::Array{Float64, 3}, -3:260, 1:1, -3:132) with eltype Float64 with indices -3:260×1:1×-3:132
    └── max=0.335864, min=0.264486, mean=0.280859

and on Ubuntu

julia> time_step!(simulation); simulation.model.velocities.u
[ Info: Initializing simulation...
[ Info: Iter: 0, time: 0 seconds, wall time: 2.391 minutes, max|w|: 2.089e-03, m s⁻¹
[ Info:     ... simulation initialization complete (1.130 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (20.645 ms).
256×1×128 Field{Face, Center, Center} on ImmersedBoundaryGrid on CPU
├── grid: 256×1×128 ImmersedBoundaryGrid{Float64, Periodic, Flat, Bounded} on CPU with 4×0×4 halo
├── boundary conditions: FieldBoundaryConditions
│   └── west: Periodic, east: Periodic, south: Nothing, north: Nothing, bottom: ZeroFlux, top: ZeroFlux, immersed: ZeroFlux
└── data: 264×1×136 OffsetArray(::Array{Float64, 3}, -3:260, 1:1, -3:132) with eltype Float64 with indices -3:260×1:1×-3:132
    └── max=0.31715, min=0.265116, mean=0.280967

julia> time_step!(simulation); simulation.model.velocities.u
256×1×128 Field{Face, Center, Center} on ImmersedBoundaryGrid on CPU
├── grid: 256×1×128 ImmersedBoundaryGrid{Float64, Periodic, Flat, Bounded} on CPU with 4×0×4 halo
├── boundary conditions: FieldBoundaryConditions
│   └── west: Periodic, east: Periodic, south: Nothing, north: Nothing, bottom: ZeroFlux, top: ZeroFlux, immersed: ZeroFlux
└── data: 264×1×136 OffsetArray(::Array{Float64, 3}, -3:260, 1:1, -3:132) with eltype Float64 with indices -3:260×1:1×-3:132
    └── max=0.333478, min=0.264645, mean=0.280863

so there's already a significant divergence after just two time steps.

Update:

julia> time_step!(simulation); simulation.model.velocities.u
[ Info: Initializing simulation...
[ Info: Iter: 0, time: 0 seconds, wall time: 2.269 minutes, max|w|: 2.089e-03, m s⁻¹
[ Info:     ... simulation initialization complete (11.788 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (12.640 seconds).
256×1×128 Field{Face, Center, Center} on ImmersedBoundaryGrid on CPU
├── grid: 256×1×128 ImmersedBoundaryGrid{Float64, Periodic, Flat, Bounded} on CPU with 4×0×4 halo
├── boundary conditions: FieldBoundaryConditions
│   └── west: Periodic, east: Periodic, south: Nothing, north: Nothing, bottom: ZeroFlux, top: ZeroFlux, immersed: ZeroFlux
└── data: 264×1×136 OffsetArray(::Array{Float64, 3}, -3:260, 1:1, -3:132) with eltype Float64 with indices -3:260×1:1×-3:132
    └── max=0.31715, min=0.265116, mean=0.280967

julia> time_step!(simulation); simulation.model.velocities.u
256×1×128 Field{Face, Center, Center} on ImmersedBoundaryGrid on CPU
├── grid: 256×1×128 ImmersedBoundaryGrid{Float64, Periodic, Flat, Bounded} on CPU with 4×0×4 halo
├── boundary conditions: FieldBoundaryConditions
│   └── west: Periodic, east: Periodic, south: Nothing, north: Nothing, bottom: ZeroFlux, top: ZeroFlux, immersed: ZeroFlux
└── data: 264×1×136 OffsetArray(::Array{Float64, 3}, -3:260, 1:1, -3:132) with eltype Float64 with indices -3:260×1:1×-3:132
    └── max=0.335864, min=0.264486, mean=0.280859

The output above is also what I see on Ubuntu with Julia v1.10, and it is consistent with all versions of Julia on macOS.

@giordano (Collaborator) commented Oct 27, 2025

The plot thickens: it works correctly in Julia v1.12 on Ampere eMAG (aarch64) with AlmaLinux 8.10 as the operating system, which rules out an operating system difference. aarch64 is also the architecture on macOS, so I'm starting to suspect there's an architecture dependence. Can someone point me to the operations performed on the u field? Is there any chance you're running any BLAS operation there?

Edit: using the libopenblas shipped with Julia v1.10 in x86-64 Julia v1.12 doesn't fix the issue, so that would rule out BLAS version differences.

Edit 2: it also works on an Intel(R) Xeon(R) CPU X5670 with CentOS 8 Stream, so I'm getting more and more confused, sigh. Pointers to the relevant code would still be very welcome.

@glwagner (Member)

The plot thickens: it works correctly in Julia v1.12 on Ampere eMAG (aarch64) with AlmaLinux 8.10 as the operating system, which rules out an operating system difference. [...] Pointers to the relevant code would still be very welcome.

Nice work so far though!!

The entire time step is a complex chain of operations. I do think it is a good start to save all the fields every time step; we may find that differences arise in one field before another. Note that the NaNChecker checks u only as a proxy for the entire state. Here the prognostic state should be model.velocities (u, v, w) and model.tracers.b.
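
Once both machines save the same outputs, the first divergent field and iteration could be located with something like this (a sketch assuming both runs wrote JLD2 files, here hypothetically named internal_tide_macos.jld2 and internal_tide_linux.jld2, containing u, v, w, and b):

using Oceananigans

for name in ("u", "v", "w", "b")
    ts_a = FieldTimeSeries("internal_tide_macos.jld2", name)
    ts_b = FieldTimeSeries("internal_tide_linux.jld2", name)
    for n in 1:min(length(ts_a.times), length(ts_b.times))
        # Maximum pointwise difference between the two runs at this save index
        Δ = maximum(abs, interior(ts_a[n]) .- interior(ts_b[n]))
        if Δ > 0
            @info "First difference in $name at save index $n: max|Δ| = $Δ"
            break
        end
    end
end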

@glwagner (Member) commented Oct 27, 2025

To save every iteration, change this line

schedule = TimeInterval(save_fields_interval),

to schedule = IterationInterval(1). Also I think we should add v (I see u and w but not v there).

The difference should arise in the very first time step? We could compare those. It seems annoyingly laborious to do this across architectures, but maybe you have good ideas, @giordano, about how to do it efficiently.
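
Concretely, something like this could be added to the example to dump the full prognostic state every iteration (a sketch; the output key, filename, and overwrite flag are placeholders):

simulation.output_writers[:debug_fields] =
    JLD2OutputWriter(model, merge(model.velocities, model.tracers);
                     filename = "internal_tide_debug.jld2",
                     schedule = IterationInterval(1),
                     overwrite_existing = true)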
