Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #260 +/- ##
========================================
Coverage 66.25% 66.25%
========================================
Files 1130 1130
Lines 57931 57939 +8
Branches 4390 4393 +3
========================================
+ Hits 38381 38390 +9
+ Misses 19550 19549 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…eteComm, does not call MPI_Comm_free
396ad71 to
9761cd1
Compare
Ozaq
left a comment
There was a problem hiding this comment.
Hi @wdeconinck I have a few minor remarks, please have a look at them. I think I marked nice to haves and must haves accordingly but if you think something is unreasonable let me know!
| if (name == "world") { | ||
| throw SeriousBug("Trying to unregister the 'world' Communicator", Here()); | ||
| } | ||
| if (name == "self") { | ||
| throw SeriousBug("Trying to unregister the 'self' Communicator", Here()); | ||
| } |
There was a problem hiding this comment.
You are only checking by the registered name, not if the pointer is actually MPI_COMM_WORLD / MPI_COMM_SELF. If I understand it correctly this should be ok because both get registered only when initDefault() is called. In case of externally managed MPI initDefault() will not be called and never be registered on here, yes?
For the sake of defensive coding, I think It could make sense to check if they now actually point to self or world.
There was a problem hiding this comment.
It should be OK to unregister an alias to MPI_COMM_WORLD.
e.g.
MPI_Comm external_comm = ... // could be aliasing MPI_COMM_WORLD or MPI_COMM_SELF or whatever.
// ... Now use this comm in our eckit-mpi-using-algorithm
// PROLOG
std::string initial_comm = eckit::mpi::comm().name();
eckit::mpi::addComm("external_comm" , MPI_Comm_c2f(external_comm));
setCommDefault("external_comm");
// ALGORITHM
// ... eckit::mpi::comm().size()
// EPILOG
eckit::mpi::setCommDefault(initial_comm);
eckit::mpi::unregisterComm("external_comm");We don't want to however pass the string "world" or "self" because those are OK to be there.
We also don't want to unregister the currently selected default.
src/eckit/mpi/Comm.cc
Outdated
| } | ||
|
|
||
| void deleteComm(std::string_view name) { | ||
| void unregisterComm(std::string_view name, bool free_comm = false) { |
There was a problem hiding this comment.
I think this should not use a default value. Looking at the API the free functions are the public API and this is more or less an internal helper, here I would prefer clarity and explicitness and the free function should always pass the 2nd parameter
| throw SeriousBug("Trying to unregister the 'world' Communicator", Here()); | ||
| } | ||
| if (name == "self") { | ||
| throw SeriousBug("Trying to unregister the 'self' Communicator", Here()); |
There was a problem hiding this comment.
From EcKits perspective this is not a SeriousBug, i.e. an error inside EcKit but a UserError, i.e. someone called something on EcKit with invalid arguments.
There was a problem hiding this comment.
I would not agree, but open for discussion including other maintainers.
It depends on the definition of the User. Is it a developer using eckit, or an end-user of an application using eckit? My understanding is the latter, such that a UserError should be thrown when the user of the final program made a mistake. In that sense the end-user should not be concerned with adding or unregistering comms.
Otherwise the only exception to be exposed outside eckit should be UserError (developer using eckit library), keeping all other exceptions internal for eckit development.
src/eckit/mpi/Comm.cc
Outdated
| } | ||
|
|
||
| void deleteComm(std::string_view name) { | ||
| void unregisterComm(std::string_view name, bool free_comm = false) { |
There was a problem hiding this comment.
Nitpick: For better readability I would avoid the bool here and use a enum to produce easier to parse code (human) but since this is not the public API and the use is rather simple, you can consider it a suggestion and ignore it.
If however you refactor bool parameter to enum you would get for example
unregisterComm(name, FreeComm::Yes);
// -OR-
unregisterComm(name, FreeComm::No);
| /// Unregister and delete specific comm | ||
| /// Unregister a communicator and call free() on the comm. | ||
| /// @pre Comm is registered in the environment | ||
| /// @note Cannot unregister 'world', 'self' or default communicators |
| /// Unregister a communicator without calling free() on the comm, e.g. if the registered comm is wrapping an MPI | ||
| /// communicator managed by an external library | ||
| /// @pre Comm is registered in the environment | ||
| /// @note Cannot unregister 'world', 'self' or default communicators |
| if (comm_ == MPI_COMM_WORLD) { | ||
| throw SeriousBug("Cannot free MPI_COMM_WORLD communicator"); | ||
| } | ||
| if (comm_ == MPI_COMM_SELF) { | ||
| throw SeriousBug("Cannot free MPI_COMM_SELF communicator"); | ||
| } |
There was a problem hiding this comment.
I am not sure if SeriousBug is the right exception here, I don't have good overview of the flow that leads to this code. Please think about it for a moment and adjust it to UserError if you think its more appropriate, otherwise let it stay as is
There was a problem hiding this comment.
The flow to arrive here would be to call
eckit::mpi::deleteComm( comm_string )where comm_string would be 'world', 'self', or an alias to these. Unlike unregisterComm, an alias is not allowed here.
Having these guards allows us to try / catch and add a check in unit-test EXPECT_THROWS(...).
That is because MPI cannot recover from MPI_Comm_free(MPI_COMM_WORLD): it aborts and terminates on all ranks, even though it is guarded with the MPI_CALL( ... ) macro.
Description
eckit_mpi contains a registry of MPI communicators to be accessed by name.
Currently it is only possible to remove registered communicators with the function
eckit::mpi::deleteComm(name).This function has the side-effect to also call
MPI_Comm_free. That is OK for communicators that are strictly managed by eckit, but not when the MPI communicators are managed by external library, and were added usingeckit::mpi::addComm(name, MPI_comm_integer_handle).Furthermore it is illegal to call
MPI_Comm_freeon some communicators equivalent toMPI_COMM_WORLD,MPI_COMM_SELF.With this PR I propose a function
eckit::mpi::unregisterComm(name), which is similar toeckit::mpi::deleteComm(name)but which does not callMPI_Comm_free.I extended the unit-test
eckit_test_mpi_addcommto illustrate the existing problem witheckit::mpi::deleteComm(name), and test the new functionality.Note that I added
ECKIT_MPI_FORCE=serialhardcoded in the test. This should be harmless because ctest should have been calling this test with MPI already. This is added for safety, when just calling the test directly on command-line (e.g. on Mac) without mpiexec.My use-case here is when interoperating Atlas with libraries and programs that have already set-up MPI communicators,
and needing to clean-up.
Contributor Declaration
By opening this pull request, I affirm the following:
🌦️ >> Documentation << 🌦️
https://sites.ecmwf.int/docs/dev-section/eckit/pull-requests/PR-260