
Add DRA driver for IMEX #1143

Draft: cdesiniotis wants to merge 3 commits into main from dra-driver-imex

Conversation

cdesiniotis (Contributor)

No description provided.

cdesiniotis requested a review from klueska on November 28, 2024 at 00:35
@@ -841,6 +843,60 @@ type SandboxDevicePluginSpec struct {
Env []EnvVar `json:"env,omitempty"`
}

// DRADriverSpec defines the properties for the NVIDIA DRA Driver deployment
// TODO: add 'controller' and 'kubeletPlugin' structs to allow for per-component configuration
Member:

One question: Should we expose controller and kubeletPlugin as concepts to the user? These seem like internal details and including them here couples the operator and the DRA driver implementation more tightly.

cdesiniotis (Contributor, Author), Dec 3, 2024:

I don't believe we should at this point in time. But I can see how having the ability to configure the controller / kubeletPlugin independently could be useful. For example, one may need to bump the CPU / memory resources for the controller (and not the plugin) to account for larger clusters. Open to continuing the discussion on this and enumerating the list of fields we want to expose in ClusterPolicy.
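
A rough sketch of how such a per-component split could look if we ever expose it (hypothetical field and type names, not part of this PR):

// DRADriverSpec with a hypothetical per-component split, sketched only for
// discussion; none of these fields exist in the current ClusterPolicy API.
type DRADriverSpec struct {
	// Controller holds configuration specific to the DRA controller Deployment.
	Controller *DRADriverComponentSpec `json:"controller,omitempty"`
	// KubeletPlugin holds configuration specific to the kubelet plugin DaemonSet.
	KubeletPlugin *DRADriverComponentSpec `json:"kubeletPlugin,omitempty"`
}

// DRADriverComponentSpec carries the per-component knobs mentioned above.
type DRADriverComponentSpec struct {
	// Resources allows bumping CPU / memory for one component only,
	// e.g. the controller on large clusters.
	Resources *ResourceRequirements `json:"resources,omitempty"`
	// Env allows overriding environment variables for this component only.
	Env []EnvVar `json:"env,omitempty"`
}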

metadata:
name: nvidia-dra-driver
rules:
# TODO: restrict RBAC for DRA driver
Member:

Has this been done upstream yet?

cdesiniotis (Contributor, Author):

I see @guptaNswati is working on restricting the RBAC upstream: NVIDIA/k8s-dra-driver#219. I will pull in those changes once they are finalized.

- name: controller
image: "FILLED BY THE OPERATOR"
imagePullPolicy: IfNotPresent
command: ["nvidia-dra-controller", "-v", "6"]
Member:

Should we be setting the verbosity here?

set -o allexport
cat /run/nvidia/validations/driver-ready
. /run/nvidia/validations/driver-ready
# TODO: add an alias for DRIVER_ROOT_CTR_PATH in the k8s-dra-driver and remove the below export

Comment on lines 71 to 72
- name: DEVICE_CLASSES
value: imex
Member:

Is it expected that the DEVICE_CLASSES for the controller and the plugin match?

Contributor:

We can set this to any of gpu, mig, and imex, and the plugin will automatically pick it up:

  - name: DEVICE_CLASSES
    value: {{ .Values.deviceClasses | join "," }}

Here, it seems we are only doing IMEX by default.

Comment on lines 63 to 64
- name: MASK_NVIDIA_DRIVER_PARAMS
value: "false"
Member:

Question: Can a user still override this if required?

cdesiniotis (Contributor, Author):

Yes. The user can override this variable in ClusterPolicy with draDriver.env.
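
For example, something along these lines in ClusterPolicy would override the default (a sketch only; it assumes the draDriver field name proposed in this PR and the nvidia.com/v1 API group used by ClusterPolicy):

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  draDriver:
    enabled: true
    env:
      # Overrides the value hard-coded in the DaemonSet manifest.
      - name: MASK_NVIDIA_DRIVER_PARAMS
        value: "true"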

- mountPath: /var/lib/kubelet/plugins_registry
name: plugins-registry
- mountPath: /var/lib/kubelet/plugins
mountPropagation: Bidirectional
Member:

Question: Is bidirectional mount propagation really needed in the plugins folder?


@@ -147,6 +148,8 @@ const (
NvidiaCtrRuntimeCDIPrefixesEnvName = "NVIDIA_CONTAINER_RUNTIME_MODES_CDI_ANNOTATION_PREFIXES"
// CDIEnabledEnvName is the name of the envvar used to enable CDI in the operands
CDIEnabledEnvName = "CDI_ENABLED"
// NvidiaCTKHookPathEnvName is the name of the envvar specifying the path to the 'nvidia-ctk' binary
Member:

Comment is incorrect.

Member:

Note we shouldn't need the nvidia-ctk path in addition to the nvidia-cdi-hook path. They can be used interchangeably.

cdesiniotis (Contributor, Author):

Updated the comment. Will remove this once NVIDIA/k8s-dra-driver#210 is complete.

@@ -1539,6 +1543,55 @@ func TransformSandboxDevicePlugin(obj *appsv1.DaemonSet, config *gpuv1.ClusterPo
return nil
}

// TransformDRADriverPlugin transforms nvidia-dra-driver-plugin daemonset with required config as per ClusterPolicy
func TransformDRADriverPlugin(obj *appsv1.DaemonSet, config *gpuv1.ClusterPolicySpec, n ClusterPolicyController) error {
Member:

Out of scope for this PR: How much merit is there in refactoring these Transform* functions to strip out the common logic that is performed for all containers?

cdesiniotis (Contributor, Author):

There is definitely some merit. We currently apply some common transforms for all DaemonSets here before calling the individual Transform* functions:

// apply common Daemonset configuration that is applicable to all
err := applyCommonDaemonsetConfig(obj, &n.singleton.Spec)
if err != nil {
logger.Error(err, "Failed to apply common Daemonset transformation", "resource", obj.Name)
return err
}
// transform the host-root and host-dev-char volumes if a custom host root is configured with the operator
transformForHostRoot(obj, n.singleton.Spec.HostPaths.RootFS)
// transform the driver-root volume if a custom driver install dir is configured with the operator
transformForDriverInstallDir(obj, n.singleton.Spec.HostPaths.DriverInstallDir)

We could strip out more, but I believe the main issue is that each Transform* function is reading configuration from a different data type. E.g. TransformDRADriverPlugin reads from a struct of type DRADriverSpec while TransformDriver reads from a struct of type DriverSpec.
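
For discussion, one hedged sketch of how the shared container logic could be factored out despite the differing spec types (hypothetical interface and method names; nothing below exists in the codebase today):

// commonSpec is a hypothetical interface capturing the per-component fields
// that every Transform* function currently reads from its own spec type.
type commonSpec interface {
	ImagePath(spec *gpuv1.ClusterPolicySpec) (string, error)
	ImagePullPolicy() corev1.PullPolicy
	EnvVars() []gpuv1.EnvVar
}

// transformCommonContainer applies the transformations shared by all component
// containers, leaving component-specific logic to the individual Transform* functions.
func transformCommonContainer(ctr *corev1.Container, spec *gpuv1.ClusterPolicySpec, component commonSpec) error {
	image, err := component.ImagePath(spec)
	if err != nil {
		return err
	}
	ctr.Image = image
	ctr.ImagePullPolicy = component.ImagePullPolicy()
	for _, env := range component.EnvVars() {
		setContainerEnv(ctr, env.Name, env.Value)
	}
	return nil
}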

}

if config.Toolkit.IsEnabled() {
setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), NvidiaCTKPathEnvName, filepath.Join(config.Toolkit.InstallDir, "toolkit/nvidia-ctk"))
Member:

Let's rather update the driver to also use the nvidia-cdi-hook path. The following should be sufficient:

Suggested change:
- setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), NvidiaCTKPathEnvName, filepath.Join(config.Toolkit.InstallDir, "toolkit/nvidia-ctk"))
+ setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), NvidiaCTKPathEnvName, filepath.Join(config.Toolkit.InstallDir, "toolkit/nvidia-cdi-hook"))

Member:

Created NVIDIA/k8s-dra-driver#210 as a follow-up.

return nil
}

func transformDeployment(obj *appsv1.Deployment, n ClusterPolicyController) error {
Member:

Question: Why is this not a function defined on a ClusterPolicyController?

cdesiniotis (Contributor, Author):

No particular reason other than following the precedent set by the similar transform functions in this file. Is there a good reason for defining this as a method instead?

imagePullPolicy: IfNotPresent
command: ["nvidia-dra-controller", "-v", "6"]
env:
- name: DEVICE_CLASSES
Contributor:

Are we only doing IMEX, not MIG? We also need some error handling for the case when the IMEX daemon is not running; right now we always expect the IMEX daemon to be running.

cdesiniotis (Contributor, Author):

The scope of this PR is to only enable IMEX.

I am not sure what type of error handling you are envisioning, but I believe we need to ensure that the DRA driver daemonset only ever gets scheduled on nodes that are in an IMEX domain. As is, my PR deploys the DRA driver daemonset on all GPU nodes, regardless of whether they are in an IMEX domain. I can look into updating the nodeAffinity to leverage the IMEX domain label that GFD adds.

cdesiniotis (Contributor, Author):

I have updated the nodeAffinity such that the IMEX DRA driver kubelet plugin only gets scheduled on nodes labeled with nvidia.com/gpu.imex-domain.
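
For reference, a minimal sketch of what that nodeAffinity could look like (assuming GFD applies the nvidia.com/gpu.imex-domain label; the exact manifest in the PR may differ):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            # Only schedule the kubelet plugin on nodes that GFD has labeled
            # as belonging to an IMEX domain.
            - key: nvidia.com/gpu.imex-domain
              operator: Exists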


assets/state-dra-driver/0400_deviceclass-imex.yaml (outdated thread, resolved)
type: DirectoryOrCreate
- name: driver-install-dir
hostPath:
path: "/run/nvidia/driver"
Contributor:

We are still doing host-managed drivers for IMEX, so we need to override this path to / when --set driver.enabled=false is used.

cdesiniotis (Contributor, Author):

We mount both /run/nvidia/driver and the host root / into the container to account for both the driver container and host-managed driver scenarios. In the case of host-managed drivers, we set the DRIVER_CTR_ROOT_PATH envvar to /host (the path inside the container) in the container entrypoint script.
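
Roughly, the idea in the entrypoint looks like the following sketch (variable names other than DRIVER_CTR_ROOT_PATH, and the mount paths marked below, are assumptions rather than verbatim values from the PR):

set -o allexport
# The validation file written by the operator validator exports the driver
# location details (assumed to include DRIVER_ROOT).
. /run/nvidia/validations/driver-ready

if [ "${DRIVER_ROOT:-/}" = "/" ]; then
  # Host-managed driver: the host root / is mounted into this container at /host.
  export DRIVER_CTR_ROOT_PATH="/host"
else
  # Driver container: /run/nvidia/driver is mounted into this container
  # (the container mount path shown here is hypothetical).
  export DRIVER_CTR_ROOT_PATH="/driver-root"
fi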

@@ -294,6 +294,17 @@ devicePlugin:
# MPS root path on the host
root: "/run/nvidia/mps"

draDriver:
enabled: true
repository: ghcr.io/nvidia
Contributor:

Should we mirror it to nvcr.io, similar to the other images, and maybe retag with semver versioning?

cdesiniotis (Contributor, Author):

Yes, this is a placeholder for now. We will update this once we publish the DRA driver image to nvcr.io.

// +operator-sdk:gen-csv:customresourcedefinitions.specDescriptors=true
// +operator-sdk:gen-csv:customresourcedefinitions.specDescriptors.displayName="Resource Requirements"
// +operator-sdk:gen-csv:customresourcedefinitions.specDescriptors.x-descriptors="urn:alm:descriptor:com.tectonic.ui:advanced,urn:alm:descriptor:com.tectonic.ui:resourceRequirements"
Resources *ResourceRequirements `json:"resources,omitempty"`
cdesiniotis (Contributor, Author), Dec 3, 2024:

Question -- should we include the resources, args, and env fields at this point in time?

cdesiniotis (Contributor, Author), Dec 19, 2024:

Since the IMEX DRA Driver consists of a controller and a kubeletPlugin, it feels as if I should update this so that users can configure each component independently:

controller:
  resources: {}
  env: []
  
kubeletPlugin:
  resources: {}
  env: []

Signed-off-by: Christopher Desiniotis <[email protected]>
cdesiniotis force-pushed the dra-driver-imex branch 3 times, most recently from 76894a2 to 6bbbd94, on December 16, 2024 at 22:18
@@ -54,7 +54,7 @@ type ClusterPolicySpec struct {
// DevicePlugin component spec
DevicePlugin DevicePluginSpec `json:"devicePlugin"`
// DRADriver component spec
DRADriver DRADriverSpec `json:"draDriver"`
IMEXDRADriver IMEXDRADriverSpec `json:"imexDRADriver"`
Contributor:

This approach seems better.
