This repository contains scripts to demonstrate Cedana features.

Prerequisites:
- A Kubernetes cluster with Cedana installed
- A GPU node with Nvidia GPUs
- Ubuntu 22.04 (the version this demo is tested against)
- GPU drivers and the CUDA toolkit installed. 12.4.1 is the latest version we support as of writing. This step is not required if you are using the NVIDIA driver plugin to manage drivers.
# set up NVIDIA drivers and the CUDA toolkit
# use the link below for the runfile or deb installers on Ubuntu:
# https://developer.nvidia.com/cuda-12-4-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04
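For reference, a minimal sketch of the network-repo route on Ubuntu 22.04; the keyring and package names are assumptions, so confirm them against the archive link above:
# register NVIDIA's CUDA apt repo for Ubuntu 22.04 (keyring version is an assumption)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
# the toolkit alone does not pull in the kernel driver; install that separately if needed
sudo apt-get install -y cuda-toolkit-12-4
# sanity checks (nvcc may need /usr/local/cuda/bin on your PATH)
nvidia-smi
nvcc --version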
- Set up K3s on the cluster (we set it up as the root user to avoid permission issues):
# install k3s using k3sup
curl -sLS https://get.k3sup.dev | sh
sudo install k3sup /usr/local/bin/
# install k3s
k3sup install --local
# you can also pick a channel for the Kubernetes version, or Docker as the
# container runtime, but neither is needed for this demo
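A quick sanity check; this assumes k3sup's default of writing a kubeconfig into the current directory, and that kubectl is installed (k3s also bundles it as `k3s kubectl`):
# point kubectl at the kubeconfig k3sup wrote (root's copy is /etc/rancher/k3s/k3s.yaml)
export KUBECONFIG=$(pwd)/kubeconfig
# the single node should report Ready within a minute or so
kubectl get nodes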
- Set up Cedana using the Helm chart. Use the images given below; their changes had not been merged into main at the time of writing:
# use the kueue-rbac branch if you want to use Kueue features
git clone https://github.com/cedana/cedana-helm-charts --branch feat/kueue-rbac
cd cedana-helm-charts
sudo KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm install cedana ./cedana-helm --create-namespace -n cedana-systems \
--set daemonHelper.image.repository=cedana/cedana-helper-test \
--set daemonHelper.image.tag=feat-upload-bucket \
--set controllerManager.manager.image.repository=cedana/cedana-controller-test \
--set controllerManager.manager.image.tag=feat-kueue \
--set cedanaConfig.cedanaUrl="https://sandbox.cedana.ai" \
--set cedanaConfig.cedanaAuthToken="492de5eb9ebe78c4332b7cd1586f5dd52722397369ca58688009fd728f6946e78cbf51cdb9842e1e4d0621b84aa60810"
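Once the chart is installed, a quick way to confirm the components are running (resource names depend on the chart, so treat this as a sketch):
# the controller manager and helper daemonset should reach Running
sudo KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl get pods -n cedana-systems
sudo KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm list -n cedana-systems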
- Additionally, on k3s, once the helper pod and Cedana are installed, restart your k3s instance to make the Cedana runtime available:
# restart k3s to reload the runtime configs that the helm chart just updated
systemctl restart k3s
# (ideally this reload would happen automatically, but that requires changes to
# how we run our background services so they can restart without unwanted disruption)
# also ensure all pods on k3s restart properly; restart any that are left in an Unknown state
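A rough post-restart check, with placeholder names for any pod you need to kick:
# all pods should settle back to Running/Completed
sudo KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl get pods -A
# force-delete a pod stuck in Unknown/Terminating so it gets rescheduled
sudo KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0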