The NVIDIA Maintenance Operator provides a Kubernetes API (Custom Resource Definition) to allow node maintenance operations in a K8s cluster to be performed in a coordinated manner. It performs common operations to prepare a node for maintenance, such as cordoning and draining the node.

Users/Consumers can request maintenance on a node by creating a NodeMaintenance Custom Resource (CR). The operator then reconciles NodeMaintenance CRs. At a high level, this is the reconcile flow:
- Scheduling - schedule NodeMaintenance to be processed by the operator, taking into account constraints such as the maximal allowed parallel operations.
- Node preparation for maintenance, such as cordon and draining of the node
- Mark NodeMaintenance as Ready (via condition)
- Cleanup on deletion of NodeMaintenance such as node uncordon
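For example, a typical consumer flow might look like the sketch below (the manifest file name is a placeholder; a full NodeMaintenance example is shown later in this document):

```bash
# Request maintenance for a node by creating a NodeMaintenance CR
# (my-node-maintenance.yaml is a placeholder for a manifest like the example further below)
kubectl apply -f my-node-maintenance.yaml

# Wait for the operator to prepare the node and report the Ready condition
kubectl wait --for=condition=Ready --timeout=30m \
  nodemaintenances.maintenance.nvidia.com/my-maintenance-operation

# ... perform the actual maintenance on the node ...

# Delete the CR so the operator cleans up (e.g. uncordons the node)
kubectl delete nodemaintenances.maintenance.nvidia.com/my-maintenance-operation
```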
Prerequisites:

- Kubernetes cluster
```bash
# Clone project
git clone https://github.com/Mellanox/maintenance-operator.git ; cd maintenance-operator

# Install Operator
helm install -n maintenance-operator --create-namespace --set operator.image.tag=latest maintenance-operator ./deployment/maintenance-operator-chart

# View deployed resources
kubectl -n maintenance-operator get all
```
Note: Refer to the helm values documentation for more information.
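For example, the chart's configurable values and their defaults can be listed with the standard `helm show values` command (shown here against the local chart path from the clone step above):

```bash
# List the chart's values and defaults
helm show values ./deployment/maintenance-operator-chart
```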
Alternatively, the chart can be installed directly from the OCI registry:

```bash
helm install -n maintenance-operator --create-namespace maintenance-operator oci://ghcr.io/mellanox/maintenance-operator-chart
```
```bash
# clone project
git clone https://github.com/Mellanox/maintenance-operator.git ; cd maintenance-operator

# build image
IMG=harbor.mellanox.com/cloud-orchestration-dev/adrianc/maintenance-operator:latest make docker-build

# push image
IMG=harbor.mellanox.com/cloud-orchestration-dev/adrianc/maintenance-operator:latest make docker-push

# deploy
IMG=harbor.mellanox.com/cloud-orchestration-dev/adrianc/maintenance-operator:latest make deploy

# undeploy
make undeploy
```
The MaintenanceOperatorConfig CRD is used for operator runtime configuration. For more information, refer to the api-reference.
```yaml
apiVersion: maintenance.nvidia.com/v1alpha1
kind: MaintenanceOperatorConfig
metadata:
  name: default
  namespace: maintenance-operator
spec:
  logLevel: info
  maxParallelOperations: 4
```
In this example we configure the following for the operator:

- Log level (`logLevel`) is set to `info`
- The max number of parallel maintenance operations (`maxParallelOperations`) is set to `4`
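To see which configuration is currently in effect, the deployed MaintenanceOperatorConfig can be inspected with kubectl (the `default` name and `maintenance-operator` namespace below match the example above; adjust them for your deployment):

```bash
# Show the active operator configuration
kubectl -n maintenance-operator get maintenanceoperatorconfigs.maintenance.nvidia.com default -o yaml
```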
The NodeMaintenance CRD is used to request a maintenance operation on a specific K8s node. In addition, it specifies which common (K8s related) operations need to happen in order to prepare the node for maintenance.

Once the node is ready for maintenance, the operator sets the `Ready` condition in the `status` field to `True`.

After the maintenance operation is done by the requestor, the NodeMaintenance CR should be deleted to finish the maintenance operation.

For more information, refer to the api-reference.
```yaml
apiVersion: maintenance.nvidia.com/v1alpha1
kind: NodeMaintenance
metadata:
  name: my-maintenance-operation
  namespace: default
spec:
  requestorID: some.one.acme.com
  nodeName: worker-01
  cordon: true
  waitForPodCompletion:
    podSelector: "app=important"
    timeoutSeconds: 0
  drainSpec:
    force: true
    podSelector: ""
    timeoutSeconds: 0
    deleteEmptyDir: true
    podEvictionFilters:
    - byResourceNameRegex: nvidia.com/gpu-*
    - byResourceNameRegex: nvidia.com/rdma*
```
In this example we request to perform maintenance for node `worker-01`. The following steps will occur before the node is marked as ready for maintenance:

- cordon of the `worker-01` node
- waiting for pods with the `app: important` label to finish
- draining of `worker-01` with the provided `drainSpec`:
  - force draining of pods even if they don't belong to a controller
  - allow draining of pods with emptyDir mounts
  - only drain pods that consume either `nvidia.com/gpu-*` or `nvidia.com/rdma*` resources

Once the node is ready for maintenance, the `Ready` condition will be `True`:
```bash
$ kubectl get nodemaintenances.maintenance.nvidia.com -A
NAME                       NODE        REQUESTOR           READY   PHASE   FAILED
my-maintenance-operation   worker-01   some.one.acme.com   True    Ready
```
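Once the `Ready` condition is `True`, the requestor performs its maintenance work and then deletes the CR to finish the operation, as described above. For example, using the name and namespace from the example:

```bash
# Check the Ready condition of the NodeMaintenance from the example above
kubectl -n default get nodemaintenances.maintenance.nvidia.com my-maintenance-operation \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}{"\n"}'

# After the maintenance work is done, delete the CR so the operator
# cleans up (e.g. uncordons worker-01)
kubectl -n default delete nodemaintenances.maintenance.nvidia.com my-maintenance-operation
```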
The state diagram below describes the lifecycle of a NodeMaintenance object:

```mermaid
stateDiagram-v2
    pending: maintenance request registered, waiting to be scheduled
    scheduled: maintenance request scheduled
    cordon: cordon node
    waitForPodCompletion: wait for specified pods to complete
    draining: node draining
    ready: node ready for maintenance
    requestorFailed: requestor failed the maintenance operations
    [*] --> pending : NodeMaintenance created
    pending --> scheduled : scheduler selected NodeMaintenance for maintenance, add finalizer
    scheduled --> cordon : preparation for cordon completed
    cordon --> waitForPodCompletion : cordon completed
    waitForPodCompletion --> draining : finished waiting for pods
    draining --> ready : drain operation completed successfully, node is ready for maintenance, Ready condition is set to True
    ready --> requestorFailed : requestor has set RequestorFailed condition
    pending --> [*] : object deleted
    scheduled --> [*] : object deleted
    cordon --> [*] : object marked for deletion, cleanup before deletion
    waitForPodCompletion --> [*] : object marked for deletion, cleanup before deletion
    draining --> [*] : object marked for deletion, cleanup before deletion
    ready --> [*] : object marked for deletion, cleanup before deletion
    requestorFailed --> [*] : RequestorFailed condition cleared by requestor or external user, object marked for deletion, cleanup before deletion
```
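The current phase of each NodeMaintenance object is reported in the `PHASE` column shown earlier, so these transitions can be observed with a watch:

```bash
# Watch NodeMaintenance objects move through the phases described above
kubectl get nodemaintenances.maintenance.nvidia.com -A -w
```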
The maintenance operator uses the following scheduling algorithm to choose the next set of nodes to perform maintenance on.
Step 1: Determine the number of slots available for node maintenance scheduling:

1. Get the value of `MaintenanceOperatorConfig.Spec.MaxParallelOperations`
   - This can be an absolute number (e.g., `5`) or a percentage of total nodes (e.g., `"10%"`)
2. Get the current number of nodes that are under maintenance (i.e., nodes that have a `NodeMaintenance` CR which was already scheduled)
3. The number of available slots is determined by subtracting (2) from (1)

Let's mark this value as `availableSlots`.
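As an illustration only (this is a sketch of the arithmetic above, not the operator's implementation, and the rounding behavior for percentages is an assumption):

```bash
# Sketch of Step 1 using example numbers (not the operator's code)
TOTAL_NODES=10
MAX_PARALLEL_OPERATIONS="10%"   # MaintenanceOperatorConfig.Spec.MaxParallelOperations
NODES_UNDER_MAINTENANCE=0       # nodes with an already-scheduled NodeMaintenance CR

# Resolve an absolute number or a percentage of total nodes
# (integer division; the operator's exact rounding may differ)
case "${MAX_PARALLEL_OPERATIONS}" in
  *%) max=$(( TOTAL_NODES * ${MAX_PARALLEL_OPERATIONS%\%} / 100 )) ;;
  *)  max=${MAX_PARALLEL_OPERATIONS} ;;
esac

availableSlots=$(( max - NODES_UNDER_MAINTENANCE ))
echo "availableSlots=${availableSlots}"   # prints 1 for the values above
```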
Step 2: The scheduler respects cluster availability limits by not exceeding `MaintenanceOperatorConfig.Spec.MaxUnavailable`:

1. Calculate the absolute number of nodes that can become unavailable in the cluster based on `MaintenanceOperatorConfig.Spec.MaxUnavailable`
   - This can be an absolute number (e.g., `3`) or a percentage of total nodes (e.g., `"20%"`)
   - If unspecified, no limit is applied
2. Determine the current number of unavailable nodes by summing:
   - The number of nodes that have a `NodeMaintenance` CR in progress
   - The number of nodes that are unschedulable or not ready
3. Determine the number of additional nodes that can become unavailable by subtracting (2) from (1)

Let's mark this value as `canBecomeUnavailable`.
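As a rough way to inspect the inputs of (2) from the command line (an approximation for illustration; the operator computes this internally):

```bash
# Nodes that are currently cordoned (unschedulable)
kubectl get nodes --field-selector spec.unschedulable=true --no-headers | wc -l

# Nodes whose STATUS column is not "Ready" (approximation based on default kubectl output)
kubectl get nodes --no-headers | awk '$2 != "Ready"' | wc -l
```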
Step 3: Determine candidate nodes. These are the nodes that are targeted (via `NodeMaintenance.Spec.NodeName`) by `NodeMaintenance` objects that are pending (not yet scheduled).

Note: A node that is already targeted by a `NodeMaintenance` object in progress will not be part of this list.
Step 4: Determine candidate `NodeMaintenance` objects. These are all `NodeMaintenance` objects that target one of the candidate nodes from Step 3 above and are in the `Pending` state.
Step 5: Rank the candidate objects. Each candidate `NodeMaintenance` is ranked using the following criteria (ordered by priority):

- Prioritizes requestors that already have `NodeMaintenance` objects in progress
- Prioritizes requestors with fewer pending `NodeMaintenance` objects
- Prioritizes by creation time (older before newer)

Higher-ranked objects are scheduled first.
Step 6: Select `NodeMaintenance` objects for scheduling. The scheduler selects objects up to `availableSlots` (from Step 1), while ensuring that:

- No more than `canBecomeUnavailable` additional nodes become unavailable (from Step 2)
- If a target node is already unavailable, it doesn't count against the `canBecomeUnavailable` limit
- Only one `NodeMaintenance` object per node is scheduled (the highest-ranked one wins if multiple exist)

Example: With `availableSlots=3` and `canBecomeUnavailable=1`:

- If 2 requests target already-unavailable nodes and 1 targets an available node → all 3 can be scheduled
- If 3 requests target available nodes → only 1 can be scheduled
Example 1:

- Cluster: 10 nodes (all available)
- Config: `MaxParallelOperations=2`, `MaxUnavailable=5`
- Pending: 5 maintenance requests
- Result: 2 requests scheduled (limited by `MaxParallelOperations`)

Example 2:

- Cluster: 10 nodes (2 already unavailable)
- Config: `MaxParallelOperations=5`, `MaxUnavailable=3`
- Pending: 3 maintenance requests (all targeting available nodes)
- Result: 1 request scheduled (limited by `MaxUnavailable`: 3 - 2 = 1)
To prevent the maintenance operator from scheduling maintenance on additional nodes that would become unavailable, set the `spec.maxUnavailable` field of MaintenanceOperatorConfig to zero.
Note: This allows maintenance to continue on nodes that are already unavailable, but prevents new nodes from becoming unavailable.
Example:

```bash
kubectl -n maintenance-operator patch maintenanceoperatorconfigs.maintenance.nvidia.com default --type merge --patch '{"spec": {"maxUnavailable": 0}}'
```
To completely stop all maintenance operator activity (including ongoing operations), scale down the maintenance operator deployment to zero:
Example:

```bash
kubectl scale --replicas 0 -n maintenance-operator deployment maintenance-operator
```
To resume maintenance operator reconciliation, scale up the deployment:
Example:

```bash
kubectl scale --replicas 1 -n maintenance-operator deployment maintenance-operator
```