
Parallel p4est simulation crashes when ranks have zero elements #1973

Open · benegee opened this issue Jun 10, 2024 · 3 comments
Labels: bug (Something isn't working), parallelization (Related to MPI, threading, tasks etc.), question (Further information is requested)

Comments

@benegee (Contributor) commented Jun 10, 2024

Apparently this has come up before; see the great analysis in #1096.
However, I am still facing the following issue:

Example to reproduce

Take
https://github.com/trixi-framework/Trixi.jl/blob/main/examples/p4est_3d_dgsem/elixir_advection_amr.jl
and apply the following changes:

  • initial_refinement_level = 1
  • tspan = (0.0, 30.0)
  • base_level = 0,
    med_level = 1, med_threshold = 10.0,
    max_level = 2, max_threshold = 50.0

Run on 2 MPI ranks.
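
For reference, and purely as an illustrative sketch, the AMR controller call after these changes would look roughly as follows (the IndicatorMax-based controller and the keyword layout are assumed to match the example elixir and may differ in detail):

amr_controller = ControllerThreeLevel(semi, IndicatorMax(semi, variable = first),
                                      base_level = 0,
                                      med_level = 1, med_threshold = 10.0,
                                      max_level = 2, max_threshold = 50.0)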

Error 1

The simulation fails at

l2_error = zero(func(get_node_vars(u, equations, dg, 1, 1, 1, 1), equations))

because get_node_vars tries to access the first element, which does not exist on one of the two ranks.

This can be avoided by using something like

  l2_error = zero(func(zero(SVector{nvariables(equations), eltype(u)}), equations))

Error 2

Next, the simulation fails at

# Initialize integral with zeros of the right shape
# Pass `zero(SVector{nvariables(equations), eltype(u)})` to `func` since `u` might be empty, if the
# current rank has no elements, see also https://github.com/trixi-framework/Trixi.jl/issues/1096.
integral = zero(func(zero(SVector{nvariables(equations), eltype(u)}), 1, 1, 1, 1,
                     equations, dg, args...))

because the index of the (non-existent) first element is passed to the anonymous function in

integrate_via_indices(u, mesh, equations, dg, cache,
                      du) do u, i, j, k, element, equations, dg, du
    u_node = get_node_vars(u, equations, dg, i, j, k, element)
    du_node = get_node_vars(du, equations, dg, i, j, k, element)
    dot(cons2entropy(u_node, equations), du_node)
end

A quick fix was to replace the body with something like

if nelements(mesh, dg, cache) == 0
    zero(eltype(u))
else
    u_node = get_node_vars(u, equations, dg, i, j, k, element)
    du_node = get_node_vars(du, equations, dg, i, j, k, element)
    dot(cons2entropy(u_node, equations), du_node)
end

Things then work as expected. But how would I solve this properly?

benegee added the bug, question, and parallelization labels on Jun 10, 2024
@benegee (Contributor, Author) commented Jun 19, 2024

Sorry to bother you, @JoshuaLampert @lchristm @ranocha @sloede; I saw you have gone through this before.
Do you have a suggestion for how to fix the issues described above?

@sloede (Member) commented Jun 20, 2024

This is a conceptually hard problem:

  • We want to compute the integral over the output of a somewhat arbitrary function
  • The only way to assess the output type of such a function is by actually calling it
  • To be able to call it, we have to feed it somewhat sensible data (structures) in u
  • This data (structure) for u does not exist on a rank that has zero elements.

My main question right now is: how did this ever work in the first place? Other than that, the only "clean" solution I see at the moment is to exclude zero-element ranks from the integral computations, which would require creating a new MPI communicator. While this is conceptually something that we should be doing anyway (and need to think about at some point in the intermediate future), it is very much non-trivial to implement.
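
A minimal sketch of what excluding empty ranks could look like with MPI.jl, purely as an illustration (the helper name analysis_comm and the n_local_elements argument are made up; Trixi.jl currently runs its reductions on a single global communicator):

using MPI

# Build a communicator containing only the ranks that own at least one element.
# Ranks with zero elements pass `nothing` as the color (i.e. MPI_UNDEFINED) and
# receive MPI.COMM_NULL, so they would simply skip the integral computation.
function analysis_comm(comm::MPI.Comm, n_local_elements::Integer)
    color = n_local_elements > 0 ? 1 : nothing
    return MPI.Comm_split(comm, color, MPI.Comm_rank(comm))
end

The error-norm and integral reductions would then have to run on this sub-communicator, which is precisely what makes the change non-trivial: every collective call in the analysis path has to agree on which communicator it uses, and the sub-communicator would presumably have to be rebuilt whenever AMR changes the partition.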

Any other ideas/thoughts on this?

@lchristm (Member) commented Jun 20, 2024

I am also a bit puzzled as to how this is working so far at all. I agree that the solution proposed by @sloede would be the cleanest and most reasonable. A quick and dirty fix may be something like the following:

Assuming that

  • root is always non-empty (which is true as far as I can tell) and
  • the return types of the functions do not change over time,

we could determine the return types during callback initialization on root and distribute them via MPI.bcast. All processes could store this information somewhere and then use that instead of calling the function like zero(func(something...)).
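
A rough sketch of that idea, assuming root (rank 0) always owns at least one element; the function name query_return_type, the place where the result would be stored, and the exact call signature of func are placeholders:

using MPI

# Determine the integrand's return type once on root, e.g. during callback
# initialization, and broadcast it so that ranks without elements never have
# to call `func` themselves.
function query_return_type(func, u, equations, dg, comm::MPI.Comm)
    T = if MPI.Comm_rank(comm) == 0
        typeof(func(get_node_vars(u, equations, dg, 1, 1, 1, 1), equations))
    else
        nothing
    end
    return MPI.bcast(T, comm; root = 0)
end

Every rank could then initialize its accumulator with zero(T) instead of zero(func(...)), which sidesteps both errors above as long as the return type really does stay constant over the course of the simulation.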
