Skip to content

Add function eckit::mpi::unregisterComm#260

Open
wdeconinck wants to merge 2 commits intodevelopfrom
feature/eckit_mpi_unregisterComm
Open

Add function eckit::mpi::unregisterComm#260
wdeconinck wants to merge 2 commits intodevelopfrom
feature/eckit_mpi_unregisterComm

Conversation

@wdeconinck
Copy link
Member

@wdeconinck wdeconinck commented Feb 6, 2026

Description

eckit_mpi contains a registry of MPI communicators to be accessed by name.
Currently it is only possible to remove registered communicators with the function eckit::mpi::deleteComm(name).
This function has the side-effect to also call MPI_Comm_free. That is OK for communicators that are strictly managed by eckit, but not when the MPI communicators are managed by external library, and were added using eckit::mpi::addComm(name, MPI_comm_integer_handle).
Furthermore it is illegal to call MPI_Comm_free on some communicators equivalent to MPI_COMM_WORLD, MPI_COMM_SELF.

With this PR I propose a function eckit::mpi::unregisterComm(name), which is similar to eckit::mpi::deleteComm(name) but which does not call MPI_Comm_free.
I extended the unit-test eckit_test_mpi_addcomm to illustrate the existing problem with eckit::mpi::deleteComm(name), and test the new functionality.

Note that I added ECKIT_MPI_FORCE=serial hardcoded in the test. This should be harmless because ctest should have been calling this test with MPI already. This is added for safety, when just calling the test directly on command-line (e.g. on Mac) without mpiexec.

My use-case here is when interoperating Atlas with libraries and programs that have already set-up MPI communicators,
and needing to clean-up.

Contributor Declaration

By opening this pull request, I affirm the following:

  • All authors agree to the Contributor License Agreement.
  • The code follows the project's coding standards.
  • I have performed self-review and added comments where needed.
  • I have added or updated tests to verify that my changes are effective and functional.
  • I have run all existing tests and confirmed they pass.

🌦️ >> Documentation << 🌦️
https://sites.ecmwf.int/docs/dev-section/eckit/pull-requests/PR-260

@wdeconinck wdeconinck requested review from Ozaq and dsarmany February 6, 2026 14:10
@codecov-commenter
Copy link

codecov-commenter commented Feb 6, 2026

Codecov Report

❌ Patch coverage is 0% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.25%. Comparing base (53ce422) to head (38c3b14).

Files with missing lines Patch % Lines
src/eckit/mpi/Comm.cc 0.00% 14 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           develop     #260   +/-   ##
========================================
  Coverage    66.25%   66.25%           
========================================
  Files         1130     1130           
  Lines        57931    57939    +8     
  Branches      4390     4393    +3     
========================================
+ Hits         38381    38390    +9     
+ Misses       19550    19549    -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@wdeconinck wdeconinck force-pushed the feature/eckit_mpi_unregisterComm branch from 396ad71 to 9761cd1 Compare February 6, 2026 14:21
Copy link
Member

@Ozaq Ozaq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @wdeconinck I have a few minor remarks, please have a look at them. I think I marked nice to haves and must haves accordingly but if you think something is unreasonable let me know!

Comment on lines +196 to +201
if (name == "world") {
throw SeriousBug("Trying to unregister the 'world' Communicator", Here());
}
if (name == "self") {
throw SeriousBug("Trying to unregister the 'self' Communicator", Here());
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are only checking by the registered name, not if the pointer is actually MPI_COMM_WORLD / MPI_COMM_SELF. If I understand it correctly this should be ok because both get registered only when initDefault() is called. In case of externally managed MPI initDefault() will not be called and never be registered on here, yes?

For the sake of defensive coding, I think It could make sense to check if they now actually point to self or world.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be OK to unregister an alias to MPI_COMM_WORLD.
e.g.

MPI_Comm external_comm = ... // could be aliasing MPI_COMM_WORLD or MPI_COMM_SELF or whatever.
// ... Now use this comm in our eckit-mpi-using-algorithm
// PROLOG
std::string initial_comm =  eckit::mpi::comm().name();
eckit::mpi::addComm("external_comm" , MPI_Comm_c2f(external_comm)); 
setCommDefault("external_comm");
// ALGORITHM
// ... eckit::mpi::comm().size() 
// EPILOG
eckit::mpi::setCommDefault(initial_comm);
eckit::mpi::unregisterComm("external_comm");

We don't want to however pass the string "world" or "self" because those are OK to be there.
We also don't want to unregister the currently selected default.

}

void deleteComm(std::string_view name) {
void unregisterComm(std::string_view name, bool free_comm = false) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should not use a default value. Looking at the API the free functions are the public API and this is more or less an internal helper, here I would prefer clarity and explicitness and the free function should always pass the 2nd parameter

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +197 to +200
throw SeriousBug("Trying to unregister the 'world' Communicator", Here());
}
if (name == "self") {
throw SeriousBug("Trying to unregister the 'self' Communicator", Here());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From EcKits perspective this is not a SeriousBug, i.e. an error inside EcKit but a UserError, i.e. someone called something on EcKit with invalid arguments.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not agree, but open for discussion including other maintainers.
It depends on the definition of the User. Is it a developer using eckit, or an end-user of an application using eckit? My understanding is the latter, such that a UserError should be thrown when the user of the final program made a mistake. In that sense the end-user should not be concerned with adding or unregistering comms.
Otherwise the only exception to be exposed outside eckit should be UserError (developer using eckit library), keeping all other exceptions internal for eckit development.

}

void deleteComm(std::string_view name) {
void unregisterComm(std::string_view name, bool free_comm = false) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: For better readability I would avoid the bool here and use a enum to produce easier to parse code (human) but since this is not the public API and the use is rather simple, you can consider it a suggestion and ignore it.

If however you refactor bool parameter to enum you would get for example

unregisterComm(name, FreeComm::Yes);
// -OR-
unregisterComm(name, FreeComm::No);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

/// Unregister and delete specific comm
/// Unregister a communicator and call free() on the comm.
/// @pre Comm is registered in the environment
/// @note Cannot unregister 'world', 'self' or default communicators
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also add @throws + desc

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

/// Unregister a communicator without calling free() on the comm, e.g. if the registered comm is wrapping an MPI
/// communicator managed by an external library
/// @pre Comm is registered in the environment
/// @note Cannot unregister 'world', 'self' or default communicators
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also add @throws + desc

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +696 to +701
if (comm_ == MPI_COMM_WORLD) {
throw SeriousBug("Cannot free MPI_COMM_WORLD communicator");
}
if (comm_ == MPI_COMM_SELF) {
throw SeriousBug("Cannot free MPI_COMM_SELF communicator");
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if SeriousBug is the right exception here, I don't have good overview of the flow that leads to this code. Please think about it for a moment and adjust it to UserError if you think its more appropriate, otherwise let it stay as is

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The flow to arrive here would be to call

eckit::mpi::deleteComm( comm_string )

where comm_string would be 'world', 'self', or an alias to these. Unlike unregisterComm, an alias is not allowed here.

Having these guards allows us to try / catch and add a check in unit-test EXPECT_THROWS(...).
That is because MPI cannot recover from MPI_Comm_free(MPI_COMM_WORLD): it aborts and terminates on all ranks, even though it is guarded with the MPI_CALL( ... ) macro.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants