Skip to content

feat: Adding support for shared-requestor #104

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

heyvister1
Copy link
Contributor

@heyvister1 heyvister1 commented Jun 13, 2025

The motivation of the following code change is to utilize same node-maintenance object(s), to efficiently synchronize required node operations, by multiple requestors. It will help to reduce consecutive node operation cycles to a single operation cycle.

First requestor which creates node-maintenance obj is its owner, and will be in-charge of creating/deleting node-maintenance obj.
Second requestor will check for existing node-maintenance obj, if so, it will add itself under spec.additionalRequestors list.

Copy link

copy-pr-bot bot commented Jun 13, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@heyvister1 heyvister1 force-pushed the joint-requestor branch 4 times, most recently from 4b4c42c to e6c5941 Compare June 15, 2025 13:01
@@ -108,6 +108,21 @@ controller manager watchers:
builder.WithPredicates(requestor.NewConditionChangedPredicate(setupLog,
requestorOpts.MaintenanceOPRequestorID))).
```

The requestor mode supports a `joint-requestor` flow where multiple operators can coordinate node maintenance operations:
Copy link
Collaborator

@adrianchiris adrianchiris Jun 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: any chance to modify the terminology to: shared-requestor or shared-node-maintenenace

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider putting this in a sub-section of requestor

Assumptions:
1. Cluster admin, which requires requestor joint mode, needs to make sure that all operators, utilizing maintenance OP, use same upgrade policy specs (same drainSpec).
2. To be able to accommodate both GPU/Network drivers upgrade, `DrainSpec.PodSelector` should be set accordingly (hard-coded).
* podSelector: `nvidia.com/ofed-driver-upgrade-drain.skip!=true,nvidia.com/gpu-driver-upgrade-drain.skip!=true`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just making sure, drain pkg will treat this as an OR right ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be treated as AND condition.
Check the following example (I have tested it)

1. pod "foo" - nvidia.com/ofed-driver-upgrade-drain.skip=true label
2. pod "bar" - nvidia.com/gpu-driver-upgrade-drain.skip=true label
3. pod "baz" - no label

Pod "foo" (has nvidia.com/ofed-driver-upgrade-drain.skip=true):
nvidia.com/ofed-driver-upgrade-drain.skip!=true → FALSE (because it equals true)
nvidia.com/gpu-driver-upgrade-drain.skip!=true → TRUE (because this label doesn't exist, so it's not equal to true)
Result: FALSE AND TRUE = FALSE

Pod "bar" (has nvidia.com/gpu-driver-upgrade-drain.skip=true):
nvidia.com/ofed-driver-upgrade-drain.skip!=true → TRUE (because this label doesn't exist, so it's not equal to true)
nvidia.com/gpu-driver-upgrade-drain.skip!=true → FALSE (because it equals true)
Result: TRUE AND FALSE = FALSE

Pod "baz" (no labels):
nvidia.com/ofed-driver-upgrade-drain.skip!=true → TRUE (because this label doesn't exist, so it's not equal to true)
nvidia.com/gpu-driver-upgrade-drain.skip!=true → TRUE (because this label doesn't exist, so it's not equal to true)
Result: TRUE AND TRUE = TRUE

if _, exists := nm.Labels[GetUpgradeRequestorLabelKey(m.opts.MaintenanceOPRequestorID)]; !exists {
nm.Labels[GetUpgradeRequestorLabelKey(m.opts.MaintenanceOPRequestorID)] = trueString
}
err := m.K8sClient.Patch(ctx, nm, client.MergeFrom(originalNm))
Copy link
Collaborator

@adrianchiris adrianchiris Jun 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you say in comments that optimistic lock should be used (rightfully so), but i dont see you using:

client.MergeFromWithOptions(originalNm, client.MergeFromWithOptimisticLock{})

in patch calls here an below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was missing in my latest commit

@@ -306,6 +324,114 @@ func (m *RequestorNodeStateManagerImpl) ProcessUpgradeRequiredNodes(

return nil
}
func (m *RequestorNodeStateManagerImpl) createOrUpdateNodeMaintenance(ctx context.Context,
nodeState *NodeUpgradeState) error {
// 1. check if nodeMaintenance obj exists.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these comments relevant in createOrUpdateNodeMaintenance ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe need to move some of it to docs ?

@@ -108,6 +108,21 @@ controller manager watchers:
builder.WithPredicates(requestor.NewConditionChangedPredicate(setupLog,
requestorOpts.MaintenanceOPRequestorID))).
```

The requestor mode supports a `joint-requestor` flow where multiple operators can coordinate node maintenance operations:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider putting this in a sub-section of requestor

@@ -438,12 +557,12 @@ func GetRequestorOptsFromEnvs() RequestorOptions {
if os.Getenv("MAINTENANCE_OPERATOR_REQUESTOR_ID") != "" {
opts.MaintenanceOPRequestorID = os.Getenv("MAINTENANCE_OPERATOR_REQUESTOR_ID")
} else {
opts.MaintenanceOPRequestorID = "nvidia.operator.com"
opts.MaintenanceOPRequestorID = DefaultNodeMaintenanceRequestorID
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its not really a great ux IMO to enable this feature when the two env vars are unset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll be checking if DefaultNodeMaintenanceRequestorID is used before using share-requestor flow

@heyvister1 heyvister1 requested a review from adrianchiris June 19, 2025 05:59
@heyvister1 heyvister1 changed the title feat: Adding support for joint requestors feat: Adding support for shared-requestor Jun 19, 2025
@heyvister1
Copy link
Contributor Author

@cdesiniotis, @tariq1890 please help to review this PR

@heyvister1 heyvister1 force-pushed the joint-requestor branch 4 times, most recently from b062823 to 3eaf949 Compare June 19, 2025 14:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants