Upgrade the NVIDIA GPU driver on a Slurm cluster managed with AWS ParallelCluster
An AWS ParallelCluster release comes with a set of AMIs for the supported operating systems and EC2 platforms. Each AMI contains a software stack, including the NVIDIA Drivers, that has been validated at ParallelCluster release time.
Other versions of the NVIDIA Drivers are likely to work with the rest of the software stack, but technical support for them will be limited.
If you wish to upgrade the NVIDIA GPU Driver on your cluster, you can follow this guide.
To upgrade the NVIDIA GPU Driver and CUDA version, it is advised to create a new custom AMI with the new versions via the pcluster build-image command.
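For example, assuming the image configuration shown below is saved as nvidia-upgrade-image.yaml (file name, image ID, and region here are placeholders), the build can be started and then monitored until it completes:

pcluster build-image --image-id nvidia-upgraded-ami --image-configuration nvidia-upgrade-image.yaml --region us-east-1
pcluster describe-image --image-id nvidia-upgraded-ami --region us-east-1

pcluster describe-image reports the image build status and, once the build completes, the ID of the resulting AMI.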
After the custom AMI has been built successfully, you can use it for a new cluster, or update the compute nodes of a running cluster by setting the Scheduling/SlurmQueues/Queue/Image/CustomAmi cluster configuration parameter and running the pcluster update-cluster command.
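As a sketch, assuming a queue named gpu-queue and placeholder values for the cluster name, configuration file, and AMI ID, the relevant part of the cluster configuration and the update command would look like this:

Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: gpu-queue
      Image:
        CustomAmi: ami-0123456789abcdef0  # AMI produced by pcluster build-image
      ComputeResources:
        - Name: g4dn
          InstanceType: g4dn.xlarge
          MinCount: 0
          MaxCount: 4

pcluster update-cluster --cluster-name my-gpu-cluster --cluster-configuration cluster-config.yaml --region us-east-1

Depending on the ParallelCluster version and on what else changes, updating the CustomAmi of a queue may require stopping the compute fleet first with pcluster update-compute-fleet --cluster-name my-gpu-cluster --status STOP_REQUESTED.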
Once the update is applied and the compute nodes have been started with the new custom AMI, verify that the new version of the driver is installed by running the nvidia-smi command.
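Assuming a queue named gpu-queue as in the snippet above, one way to do this from the head node is to run the command on a compute node through Slurm:

srun --partition=gpu-queue --nodes=1 nvidia-smi

The driver and CUDA versions reported in the nvidia-smi header should match the ones installed in the custom AMI.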
To build the custom AMI, you need to reference in the image configuration file a custom component that upgrades both the NVIDIA driver and the CUDA toolkit. Here is a configuration snippet with the custom component:
Image:
  # Due to the large size of files, make sure to have a large enough root volume size.
  RootVolume:
    Size: 50
Build:
  InstanceType: g4dn.xlarge  # instance type with NVIDIA GPUs
  ParentImage: ami-04823729c75214919  # base AMI of your desired OS, e.g. alinux2
  Components:
    - Type: arn
      Value: arn:{{PARTITION}}:imagebuilder:{{REGION}}:{{ACCOUNT_ID}}:component/nvidiacudainstall/1.0.0/1
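The ParentImage above is only an example. One way to look up the ID of the official ParallelCluster AMI to use as the base image for your OS and architecture is the pcluster list-official-images command, for instance:

pcluster list-official-images --os alinux2 --architecture x86_64 --region us-east-1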
The following component document can be used as your custom component (the NVIDIA driver version, CUDA version, and architecture can be adapted to your needs):
name: NvidiaAndCudaInstall
description: Install NVIDIA driver and CUDA toolkit
schemaVersion: 1.0
phases:
  - name: build
    steps:
    - name: InstallNvidia
      action: ExecuteBash
      inputs:
        commands:
          - |
            #!/bin/bash
            set -ex
            
            NVIDIA_DRIVER_VERSION="580.95.05"
            ARCH="x86_64"
            
            # Create temporary directory
            TMP_DIR="/pcluster-tmp/$(date +"%Y-%m-%dT%H-%M-%S")"
            
            COMPILER_PATH="/usr/bin/gcc"
            export CC="${COMPILER_PATH}"
            
            # Download the NVIDIA driver runfile and install it non-interactively (DKMS, open kernel modules)
            NVIDIA_RUNFILE="NVIDIA-Linux-${ARCH}-${NVIDIA_DRIVER_VERSION}.run"
            wget -P "${TMP_DIR}" "https://us.download.nvidia.com/tesla/${NVIDIA_DRIVER_VERSION}/${NVIDIA_RUNFILE}"
            chmod +x "${TMP_DIR}/${NVIDIA_RUNFILE}"
            "${TMP_DIR}/${NVIDIA_RUNFILE}" --silent --dkms --disable-nouveau -m="kernel-open"
            
            # Cleanup
            rm -rf "${TMP_DIR}"
            
    - name: InstallCuda
      action: ExecuteBash
      inputs:
        commands:
          - |
            #!/bin/bash
            set -ex
            
            CUDA_VERSION="13.0.2"
            CUDA_SAMPLES_VERSION="13.0"
            CUDA_RELEASE_NVIDIA_VERSION="580.95.05"
            
            # Create temporary directory
            TMP_DIR="/pcluster-tmp/$(date +"%Y-%m-%dT%H-%M-%S")"
            
            # Download the CUDA toolkit runfile and install it non-interactively
            CUDA_RUNFILE="cuda_${CUDA_VERSION}_${CUDA_RELEASE_NVIDIA_VERSION}_linux.run"
            wget -P "${TMP_DIR}" "https://developer.download.nvidia.com/compute/cuda/${CUDA_VERSION}/local_installers/${CUDA_RUNFILE}"
            chmod +x "${TMP_DIR}/${CUDA_RUNFILE}"
            CUDA_TMP_INSTALL_DIR="${TMP_DIR}/cuda-install"
            mkdir -p "${CUDA_TMP_INSTALL_DIR}"
            "${TMP_DIR}/${CUDA_RUNFILE}" --silent --toolkit --samples --tmpdir="${CUDA_TMP_INSTALL_DIR}"
            
            # Download the CUDA samples from GitHub and extract them under /usr/local
            CUDA_SAMPLES_ARCHIVE="v${CUDA_SAMPLES_VERSION}.tar.gz"
            wget -P "${TMP_DIR}" "https://github.com/NVIDIA/cuda-samples/archive/refs/tags/v${CUDA_SAMPLES_VERSION}.tar.gz"
            tar xf "${TMP_DIR}/${CUDA_SAMPLES_ARCHIVE}" --directory "/usr/local/"
            
            # Cleanup
            rm -rf "${TMP_DIR}"
            
            ## Add CUDA to PATH
            CUDA_PATH="/usr/local/cuda"
            echo "export PATH=${CUDA_PATH}/bin:\${PATH}" > /etc/profile.d/pcluster_cuda.sh
            echo "export LD_LIBRARY_PATH=${CUDA_PATH}/lib64:\${LD_LIBRARY_PATH}" >> /etc/profile.d/pcluster_cuda.sh
            chmod +x /etc/profile.d/pcluster_cuda.sh
            
    - name: Validation
      action: ExecuteBash
      inputs:
        commands:
          - |
            #!/bin/bash
            set -ex
            
            ## Validation
            source /etc/profile.d/pcluster_cuda.sh
            ls -l /usr/local
            which nvcc
            nvcc --version
            which nvidia-smi
            nvidia-smi
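The Components/Value ARN referenced in the image configuration must point to an EC2 Image Builder component containing this document. As a sketch, assuming the document is saved as nvidia-cuda-install.yaml (a placeholder file name), it can be registered with the AWS CLI as follows, and the ARN returned by the call used in the Build/Components section:

aws imagebuilder create-component \
    --name nvidiacudainstall \
    --semantic-version 1.0.0 \
    --platform Linux \
    --data file://nvidia-cuda-install.yaml \
    --region us-east-1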