Skip to content

Conversation

@nicolau-manubens
Copy link
Contributor

@nicolau-manubens nicolau-manubens commented Nov 11, 2025

Description

Encapsulates the barrier code in fdb-hammer in a new class. The new code has been verified to work with an fdb-benchmark run in our HPC, mimicking 1 member on 3 writer nodes and 3 "pgen" jobs on 2 reader nodes each. Write ppn was set to 16 and read ppn was set to 32.

Contributor Declaration

By opening this pull request, I affirm the following:

  • All authors agree to the Contributor License Agreement.
  • The code follows the project's coding standards.
  • I have performed self-review and added comments where needed.
  • I have added or updated tests to verify that my changes are effective and functional.
  • I have run all existing tests and confirmed they pass.

🌈🌦️📖🚧 Documentation Z3FDB 🚧📖🌦️🌈
https://sites.ecmwf.int/docs/dev-section/z3fdb/pull-requests/PR-194

🌈🌦️📖🚧 Documentation FDB 🚧📖🌦️🌈
https://sites.ecmwf.int/docs/dev-section/fdb/pull-requests/PR-194

@codecov-commenter
Copy link

codecov-commenter commented Nov 11, 2025

Codecov Report

❌ Patch coverage is 0% with 148 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.45%. Comparing base (24b9d9d) to head (d33d077).

Files with missing lines Patch % Lines
src/fdb5/tools/fdb-hammer.cc 0.00% 148 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #194      +/-   ##
===========================================
- Coverage    72.46%   72.45%   -0.01%     
===========================================
  Files          360      360              
  Lines        21693    21707      +14     
  Branches      2240     2245       +5     
===========================================
+ Hits         15719    15728       +9     
- Misses        5974     5979       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

max_wait_(max_wait),
server_connections_(nodes_.size() ? (nodes_.size() - 1) : 0) {

hostname_ = Main::hostname();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason this is done in the function body, not as an initialiser?

Copy link
Contributor Author

@nicolau-manubens nicolau-manubens Nov 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, fixed

eckit::PathName default_run_path{"/var/run/user"};
default_run_path /= uid_to_str(uid);

eckit::PathName run_path(eckit::Resource<std::string>("$FDB_HAMMER_RUN_PATH", default_run_path));
Copy link
Contributor

@simondsmart simondsmart Nov 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic feels like it should belong in the config object. And then the Barrier object probably should accept a config object as an argument.

At least, the determination of the run directory should. The specific files used matter less.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the barrier code as potentially useful for other purposes than fdb-hammer. If so, it'd be better not to make it depend on fdb-hammer-specific configuration. What about adding a "run_path" argument in the Barrier constuctor, with default value "/var/run/user", which the fdb-hammer code can override with FDB_HAMMER_RUN_PATH?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I've renamed FDB_HAMMER_RUN_PATH to FDB_BARRIER_RUN_PATH, and renamed all fdb-hammer.* files (fifo and lock fiels) to barrier.*

}
}

leader_found = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit odd that by default the loop doesn't loop. Unless we hit a continue somewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully improved by removing the continue and setting leader_found = true where relevant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop was removed altogether while resolving the race condition in leftover PID file handling.


/// the leadr PID is checked to exist. If not, clean up and
/// restart election procedure
if (::kill(leader_pid, 0) != 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can this happen? Surely if the leader PID has died then the whole run is a failure as we no longer have the correct number of nodes?

Or is this code that allows a restart of the tests in the same directory

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to handle the case where barriering processes are abruply terminated without a chance to clean up the PID file.

/// restart election procedure
if (::kill(leader_pid, 0) != 0) {
try {
pid_file_.unlink();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like there is a race condition.

The leader pid doesn't exist, so you prepare to unlink the file. But in the mean time another process has removed the file, and yet another process has become the leader and recreated it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, now addressed with a file lock as you've suggested elsewhere.

}
}

void Barrier::init() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason that this functionality is not called either (a) from the constructor, or (b) lazily from the barrier() function.

Generally best to avoid having an init_ member unless strictly necessary. And the only reason I can see that being necessary is to initialise lazily.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was not called from the constructor to avoid establishign TCP connections and so on as part of the constructor. There was no reason though for this to not be called lazily from barrier(). Now made private and called lazily.

init_ = true;
}

void Barrier::fini() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fini() is not a sensible (English) name for the function.

I don't see why these methods shouldn't be in a destructor, to make the object inherently safe if the surrounding code calsl exceptions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to finalise. This was not put in a destructor as it requires file removal (and potentially throwing) operations.

if (is_leader_process_) {

/// a signal from each peer process in the node is received
std::vector<char> message = {'S', 'I', 'G'};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Message doesn't need dynamic size, and allocation on the heap. Should be a std::array<char, 3>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

// Barrier.block();
barrier(config_.parallel.ppn, config_.parallel.nodelist, config_.parallel.barrierPort,
config_.parallel.barrierMaxWait);
Barrier barrier{(size_t)config_.parallel.ppn, config_.parallel.nodelist, config_.parallel.barrierPort,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code should be:

Barrier(config_).barrier();

Not sure that the init/fini functions add any value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed above, I'd keep the Barrier class independent from fdb-hammer specifics and finalise() seemed necessary.

@@ -1188,9 +1269,11 @@ void FDBHammer::executeWrite() {
if (config_.execution.itt) {
eckit::Timer barrier_timer;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eckit::Timer can be inside the Barrier object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants