-
Notifications
You must be signed in to change notification settings - Fork 19
feat: Adding support for shared-requestor #104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
4b4c42c
to
e6c5941
Compare
docs/automatic-ofed-upgrade.md
Outdated
@@ -108,6 +108,21 @@ controller manager watchers: | |||
builder.WithPredicates(requestor.NewConditionChangedPredicate(setupLog, | |||
requestorOpts.MaintenanceOPRequestorID))). | |||
``` | |||
|
|||
The requestor mode supports a `joint-requestor` flow where multiple operators can coordinate node maintenance operations: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: any chance to modify the terminology to: shared-requestor
or shared-node-maintenenace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider putting this in a sub-section of requestor
Assumptions: | ||
1. Cluster admin, which requires requestor joint mode, needs to make sure that all operators, utilizing maintenance OP, use same upgrade policy specs (same drainSpec). | ||
2. To be able to accommodate both GPU/Network drivers upgrade, `DrainSpec.PodSelector` should be set accordingly (hard-coded). | ||
* podSelector: `nvidia.com/ofed-driver-upgrade-drain.skip!=true,nvidia.com/gpu-driver-upgrade-drain.skip!=true` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just making sure, drain pkg will treat this as an OR right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be treated as AND condition.
Check the following example (I have tested it)
1. pod "foo" - nvidia.com/ofed-driver-upgrade-drain.skip=true label
2. pod "bar" - nvidia.com/gpu-driver-upgrade-drain.skip=true label
3. pod "baz" - no label
Pod "foo" (has nvidia.com/ofed-driver-upgrade-drain.skip=true):
nvidia.com/ofed-driver-upgrade-drain.skip!=true → FALSE (because it equals true)
nvidia.com/gpu-driver-upgrade-drain.skip!=true → TRUE (because this label doesn't exist, so it's not equal to true)
Result: FALSE AND TRUE = FALSE
Pod "bar" (has nvidia.com/gpu-driver-upgrade-drain.skip=true):
nvidia.com/ofed-driver-upgrade-drain.skip!=true → TRUE (because this label doesn't exist, so it's not equal to true)
nvidia.com/gpu-driver-upgrade-drain.skip!=true → FALSE (because it equals true)
Result: TRUE AND FALSE = FALSE
Pod "baz" (no labels):
nvidia.com/ofed-driver-upgrade-drain.skip!=true → TRUE (because this label doesn't exist, so it's not equal to true)
nvidia.com/gpu-driver-upgrade-drain.skip!=true → TRUE (because this label doesn't exist, so it's not equal to true)
Result: TRUE AND TRUE = TRUE
pkg/upgrade/upgrade_requestor.go
Outdated
if _, exists := nm.Labels[GetUpgradeRequestorLabelKey(m.opts.MaintenanceOPRequestorID)]; !exists { | ||
nm.Labels[GetUpgradeRequestorLabelKey(m.opts.MaintenanceOPRequestorID)] = trueString | ||
} | ||
err := m.K8sClient.Patch(ctx, nm, client.MergeFrom(originalNm)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you say in comments that optimistic lock should be used (rightfully so), but i dont see you using:
client.MergeFromWithOptions(originalNm, client.MergeFromWithOptimisticLock{})
in patch calls here an below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was missing in my latest commit
pkg/upgrade/upgrade_requestor.go
Outdated
@@ -306,6 +324,114 @@ func (m *RequestorNodeStateManagerImpl) ProcessUpgradeRequiredNodes( | |||
|
|||
return nil | |||
} | |||
func (m *RequestorNodeStateManagerImpl) createOrUpdateNodeMaintenance(ctx context.Context, | |||
nodeState *NodeUpgradeState) error { | |||
// 1. check if nodeMaintenance obj exists. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are these comments relevant in createOrUpdateNodeMaintenance
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe need to move some of it to docs ?
docs/automatic-ofed-upgrade.md
Outdated
@@ -108,6 +108,21 @@ controller manager watchers: | |||
builder.WithPredicates(requestor.NewConditionChangedPredicate(setupLog, | |||
requestorOpts.MaintenanceOPRequestorID))). | |||
``` | |||
|
|||
The requestor mode supports a `joint-requestor` flow where multiple operators can coordinate node maintenance operations: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider putting this in a sub-section of requestor
@@ -438,12 +557,12 @@ func GetRequestorOptsFromEnvs() RequestorOptions { | |||
if os.Getenv("MAINTENANCE_OPERATOR_REQUESTOR_ID") != "" { | |||
opts.MaintenanceOPRequestorID = os.Getenv("MAINTENANCE_OPERATOR_REQUESTOR_ID") | |||
} else { | |||
opts.MaintenanceOPRequestorID = "nvidia.operator.com" | |||
opts.MaintenanceOPRequestorID = DefaultNodeMaintenanceRequestorID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its not really a great ux IMO to enable this feature when the two env vars are unset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll be checking if DefaultNodeMaintenanceRequestorID
is used before using share-requestor
flow
e6c5941
to
ee07667
Compare
@cdesiniotis, @tariq1890 please help to review this PR |
b062823
to
3eaf949
Compare
Signed-off-by: Ido Heyvi <[email protected]>
3eaf949
to
246770e
Compare
The motivation of the following code change is to utilize same node-maintenance object(s), to efficiently synchronize required node operations, by multiple requestors. It will help to reduce consecutive node operation cycles to a single operation cycle.
First requestor which creates node-maintenance obj is its owner, and will be in-charge of creating/deleting node-maintenance obj.
Second requestor will check for existing node-maintenance obj, if so, it will add itself under
spec.additionalRequestors
list.