Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

resource: only mark nodes lost that transition from online to offline #6676

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

grondo
Copy link
Contributor

@grondo grondo commented Mar 1, 2025

This PR fixes #6675 and adds a set of tests to ensure torpid/lively events don't cause resource removal (shrink).

Thanks for finding this one @garlick!

grondo added 2 commits March 1, 2025 15:19
Problem: The type of event being processed in post_join_leave() when
adjusting monitor->lost (the set of "lost" ranks) is not checked. This
means that any time `post_join_leave()` is called, the `lost` idset is
adjusted, which is incorrect for all events except "online/offline".
The net effect now is that ranks added to the `torpid` group are
removed from `lost`, and ranks that become `lively` are added to it.
A potentially confusing situation!

Only adjust `monitor->lost` when the `monitor->up` idset is being
modified, i.e. when `oldset == monitor->up`

Fixes flux-framework#6675
Problem: No tests ensures tha torpid/lively events do not affect
detection of lost ranks in batch and alloc jobs.

Add a set of tests to t2317-resource-shrink.t that ensure a node going
from lively to torpid and back again doesn't cause the resources in
a subinstance to shrink.
Copy link

codecov bot commented Mar 1, 2025

Codecov Report

Attention: Patch coverage is 75.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 83.86%. Comparing base (5ee36b4) to head (baa4c0e).

Files with missing lines Patch % Lines
src/modules/resource/monitor.c 75.00% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6676   +/-   ##
=======================================
  Coverage   83.86%   83.86%           
=======================================
  Files         534      534           
  Lines       88929    88930    +1     
=======================================
+ Hits        74577    74583    +6     
+ Misses      14352    14347    -5     
Files with missing lines Coverage Δ
src/modules/resource/monitor.c 71.88% <75.00%> (+0.13%) ⬆️

... and 6 files with indirect coverage changes

Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

resource: can a rank be "lost" when it becomes un-torpid?
2 participants