
Parallel p4est simulation crashes when ranks have zero elements #1973

Open · benegee opened this issue Jun 10, 2024 · 3 comments
Labels: bug (Something isn't working), parallelization (Related to MPI, threading, tasks etc.), question (Further information is requested)

Comments

@benegee (Contributor) commented Jun 10, 2024

Apparently this has come up before; see the great analysis in #1096.
However, I am still facing the following issue:

Example to reproduce

Take
https://github.com/trixi-framework/Trixi.jl/blob/main/examples/p4est_3d_dgsem/elixir_advection_amr.jl
and apply the following changes:

  • initial_refinement_level = 1
  • tspan = (0.0, 30.0)
  • base_level = 0,
    med_level = 1, med_threshold = 10.0,
    max_level = 2, max_threshold = 50.0

Run on 2 MPI ranks.
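
For reference, and purely as an illustrative sketch, the AMR controller call after these changes would look roughly as follows (the IndicatorMax-based controller and the keyword layout are assumed to match the example elixir and may differ in detail):

amr_controller = ControllerThreeLevel(semi, IndicatorMax(semi, variable = first),
                                      base_level = 0,
                                      med_level = 1, med_threshold = 10.0,
                                      max_level = 2, max_threshold = 50.0)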

Error 1

The simulation fails at

l2_error = zero(func(get_node_vars(u, equations, dg, 1, 1, 1, 1), equations))

because get_node_vars tries to access the first element, which does not exist on one of the two ranks.

This can be avoided by using something like

  l2_error = zero(func(zero(SVector{nvariables(equations), eltype(u)}), equations))

Error 2

Next, the simulation fails at

# Initialize integral with zeros of the right shape
# Pass `zero(SVector{nvariables(equations), eltype(u)})` to `func` since `u` might be empty, if the
# current rank has no elements, see also https://github.com/trixi-framework/Trixi.jl/issues/1096.
integral = zero(func(zero(SVector{nvariables(equations), eltype(u)}), 1, 1, 1, 1,
                     equations, dg, args...))

because the index of the (non-existent) first element is passed to the anonymous function in

integrate_via_indices(u, mesh, equations, dg, cache,
                      du) do u, i, j, k, element, equations, dg, du
    u_node = get_node_vars(u, equations, dg, i, j, k, element)
    du_node = get_node_vars(du, equations, dg, i, j, k, element)
    dot(cons2entropy(u_node, equations), du_node)
end

A quick fix was to replace the body with something like

if nelements(mesh, dg, cache) == 0
    zero(eltype(u))
else
    u_node = get_node_vars(u, equations, dg, i, j, k, element)
    du_node = get_node_vars(du, equations, dg, i, j, k, element)
    dot(cons2entropy(u_node, equations), du_node)
end

Things then work as expected. But how would I solve this properly?

benegee added the bug, question, and parallelization labels on Jun 10, 2024
@benegee (Contributor, Author) commented Jun 19, 2024

Sorry to bother you, @JoshuaLampert @lchristm @ranocha @sloede; I saw you have gone through this before.
Do you have a suggestion for how to fix the issues described above?

@sloede (Member) commented Jun 20, 2024

This is a conceptually hard problem:

  • We want to compute the integral over the output of a somewhat arbitrary function
  • The only way to assess the output type of such a function is by actually calling it
  • To be able to call it, we have to feed it somewhat sensible data (structures) in u
  • This data (structure) for u does not exist on a rank that has zero elements.

My main question right now is: how did this ever work in the first place? Other than that, the only "clean" solution I see at the moment is to exclude zero-element ranks from the integral computations, which would require creating a new MPI communicator. While this is conceptually something that we should be doing anyway (and need to think about at some point in the intermediate future), it is very much non-trivial to implement.
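
A minimal sketch of what excluding empty ranks could look like with MPI.jl, purely as an illustration (the helper name analysis_comm and the n_local_elements argument are made up; Trixi.jl currently runs its reductions on a single global communicator):

using MPI

# Build a communicator containing only the ranks that own at least one element.
# Ranks with zero elements pass `nothing` as the color (i.e. MPI_UNDEFINED) and
# receive MPI.COMM_NULL, so they would simply skip the integral computation.
function analysis_comm(comm::MPI.Comm, n_local_elements::Integer)
    color = n_local_elements > 0 ? 1 : nothing
    return MPI.Comm_split(comm, color, MPI.Comm_rank(comm))
end

The error-norm and integral reductions would then have to run on this sub-communicator, which is precisely what makes the change non-trivial: every collective call in the analysis path has to agree on which communicator it uses, and the sub-communicator would presumably have to be rebuilt whenever AMR changes the partition.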

Any other ideas/thoughts on this?

@lchristm (Member) commented Jun 20, 2024

I am also a bit puzzled as to how this is working so far at all. I agree that the solution proposed by @sloede would be the cleanest and most reasonable. A quick and dirty fix may be something like the following:

Assuming that

  • root is always non-empty (which is true as far as I can tell) and
  • the return types of the functions do not change over time,

we could determine the return types during callback initialization on root and distribute them via MPI.bcast. All processes could store this information somewhere and then use that instead of calling the function like zero(func(something...)).
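
A rough sketch of that idea, assuming root (rank 0) always owns at least one element; the function name query_return_type, the place where the result would be stored, and the exact call signature of func are placeholders:

using MPI

# Determine the integrand's return type once on root, e.g. during callback
# initialization, and broadcast it so that ranks without elements never have
# to call `func` themselves.
function query_return_type(func, u, equations, dg, comm::MPI.Comm)
    T = if MPI.Comm_rank(comm) == 0
        typeof(func(get_node_vars(u, equations, dg, 1, 1, 1, 1), equations))
    else
        nothing
    end
    return MPI.bcast(T, comm; root = 0)
end

Every rank could then initialize its accumulator with zero(T) instead of zero(func(...)), which sidesteps both errors above as long as the return type really does stay constant over the course of the simulation.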
