-
Notifications
You must be signed in to change notification settings - Fork 13
Fix/fdb 580 fdb hammer encapsulate barrier #194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Fix/fdb 580 fdb hammer encapsulate barrier #194
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #194 +/- ##
===========================================
- Coverage 72.46% 72.45% -0.01%
===========================================
Files 360 360
Lines 21693 21707 +14
Branches 2240 2245 +5
===========================================
+ Hits 15719 15728 +9
- Misses 5974 5979 +5 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
src/fdb5/tools/fdb-hammer.cc
Outdated
| max_wait_(max_wait), | ||
| server_connections_(nodes_.size() ? (nodes_.size() - 1) : 0) { | ||
|
|
||
| hostname_ = Main::hostname(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason this is done in the function body, not as an initialiser?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, fixed
src/fdb5/tools/fdb-hammer.cc
Outdated
| eckit::PathName default_run_path{"/var/run/user"}; | ||
| default_run_path /= uid_to_str(uid); | ||
|
|
||
| eckit::PathName run_path(eckit::Resource<std::string>("$FDB_HAMMER_RUN_PATH", default_run_path)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic feels like it should belong in the config object. And then the Barrier object probably should accept a config object as an argument.
At least, the determination of the run directory should. The specific files used matter less.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the barrier code as potentially useful for other purposes than fdb-hammer. If so, it'd be better not to make it depend on fdb-hammer-specific configuration. What about adding a "run_path" argument in the Barrier constuctor, with default value "/var/run/user", which the fdb-hammer code can override with FDB_HAMMER_RUN_PATH?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now I've renamed FDB_HAMMER_RUN_PATH to FDB_BARRIER_RUN_PATH, and renamed all fdb-hammer.* files (fifo and lock fiels) to barrier.*
src/fdb5/tools/fdb-hammer.cc
Outdated
| } | ||
| } | ||
|
|
||
| leader_found = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit odd that by default the loop doesn't loop. Unless we hit a continue somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully improved by removing the continue and setting leader_found = true where relevant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The loop was removed altogether while resolving the race condition in leftover PID file handling.
src/fdb5/tools/fdb-hammer.cc
Outdated
|
|
||
| /// the leadr PID is checked to exist. If not, clean up and | ||
| /// restart election procedure | ||
| if (::kill(leader_pid, 0) != 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can this happen? Surely if the leader PID has died then the whole run is a failure as we no longer have the correct number of nodes?
Or is this code that allows a restart of the tests in the same directory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to handle the case where barriering processes are abruply terminated without a chance to clean up the PID file.
src/fdb5/tools/fdb-hammer.cc
Outdated
| /// restart election procedure | ||
| if (::kill(leader_pid, 0) != 0) { | ||
| try { | ||
| pid_file_.unlink(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like there is a race condition.
The leader pid doesn't exist, so you prepare to unlink the file. But in the mean time another process has removed the file, and yet another process has become the leader and recreated it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, now addressed with a file lock as you've suggested elsewhere.
| } | ||
| } | ||
|
|
||
| void Barrier::init() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason that this functionality is not called either (a) from the constructor, or (b) lazily from the barrier() function.
Generally best to avoid having an init_ member unless strictly necessary. And the only reason I can see that being necessary is to initialise lazily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was not called from the constructor to avoid establishign TCP connections and so on as part of the constructor. There was no reason though for this to not be called lazily from barrier(). Now made private and called lazily.
src/fdb5/tools/fdb-hammer.cc
Outdated
| init_ = true; | ||
| } | ||
|
|
||
| void Barrier::fini() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fini() is not a sensible (English) name for the function.
I don't see why these methods shouldn't be in a destructor, to make the object inherently safe if the surrounding code calsl exceptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to finalise. This was not put in a destructor as it requires file removal (and potentially throwing) operations.
src/fdb5/tools/fdb-hammer.cc
Outdated
| if (is_leader_process_) { | ||
|
|
||
| /// a signal from each peer process in the node is received | ||
| std::vector<char> message = {'S', 'I', 'G'}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Message doesn't need dynamic size, and allocation on the heap. Should be a std::array<char, 3>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
src/fdb5/tools/fdb-hammer.cc
Outdated
| // Barrier.block(); | ||
| barrier(config_.parallel.ppn, config_.parallel.nodelist, config_.parallel.barrierPort, | ||
| config_.parallel.barrierMaxWait); | ||
| Barrier barrier{(size_t)config_.parallel.ppn, config_.parallel.nodelist, config_.parallel.barrierPort, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code should be:
Barrier(config_).barrier();
Not sure that the init/fini functions add any value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed above, I'd keep the Barrier class independent from fdb-hammer specifics and finalise() seemed necessary.
src/fdb5/tools/fdb-hammer.cc
Outdated
| @@ -1188,9 +1269,11 @@ void FDBHammer::executeWrite() { | |||
| if (config_.execution.itt) { | |||
| eckit::Timer barrier_timer; | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
eckit::Timer can be inside the Barrier object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Description
Encapsulates the barrier code in fdb-hammer in a new class. The new code has been verified to work with an fdb-benchmark run in our HPC, mimicking 1 member on 3 writer nodes and 3 "pgen" jobs on 2 reader nodes each. Write ppn was set to 16 and read ppn was set to 32.
Contributor Declaration
By opening this pull request, I affirm the following:
🌈🌦️📖🚧 Documentation Z3FDB 🚧📖🌦️🌈
https://sites.ecmwf.int/docs/dev-section/z3fdb/pull-requests/PR-194
🌈🌦️📖🚧 Documentation FDB 🚧📖🌦️🌈
https://sites.ecmwf.int/docs/dev-section/fdb/pull-requests/PR-194