I've designed various driver installers for previous work, including [infiniband on AKS](https://github.com/converged-computing/aks-infiniband-install) and more experimental ones like [deploying a Flux instance alongside the Kubelet](https://github.com/converged-computing/flux-distribute). NVIDIA GPU drivers are typically installed in a similar fashion: in the simplest case with the [nvidia device plugin](https://github.com/NVIDIA/k8s-device-plugin/blob/main/deployments/static/nvidia-device-plugin.yml), but NVIDIA has since expanded its software and Kubernetes tooling so much that everything but the kitchen sink is now installed with the [GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#operator-install-guide). Getting this working in user space was uncharted territory, because we had two layers to work through: first from the physical node to the rootless Docker node (control plane or kubelet), and then from that node to the container in a pod deployed by containerd. Even for a single layer of abstraction, I found many unsolved issues on GitHub and no single source of truth for how to do it. Needless to say, I wasn't sure how much complexity would be warranted to get this working, or if I could do it at all.
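For reference, the two install paths mentioned above look roughly like this on a standard (non-rootless) cluster. This is a sketch of the typical commands from the NVIDIA docs, not what we ran in the user-space setup; the device plugin URL is the raw form of the static manifest linked above, and the namespace name is just a convention.

```shell
# Simplest case: deploy the device plugin DaemonSet directly.
# This assumes the driver and container toolkit are already on the node.
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/deployments/static/nvidia-device-plugin.yml

# Kitchen-sink case: install the GPU Operator with Helm, which manages
# the driver, toolkit, device plugin, and more as operands.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator --create-namespace
```

Neither path anticipates the rootless case, where the "node" the DaemonSet lands on is itself a container without direct access to the host's devices.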