Vanderbilt Data Science Institute - DGX A100 User Guide

A comprehensive guide to using the DGX A100 systems for authorized users.

The Data Science Institute (DSI) has four DGX A100 systems, now integrated into ACCRE's cluster. Access is provided to participants of DSI projects or those awarded a DSI Compute Grant for DGX.

Infrastructure

The DSI maintains four DGX A100 systems, available in two configurations:

  1. 8 x 40GB A100 GPUs (2 machines, 320GB of total GPU memory each).
  2. 8 x 80GB A100 GPUs (2 machines, 640GB of total GPU memory each).

These machines are interconnected via InfiniBand for multi-GPU and multi-node High-Performance Computing (HPC). Access to the GPUs is allocated on a first-come, first-served basis.

Resource Allocation

  • The DGX systems are shared among DSI Graduate Students, Faculty, Staff, and affiliated lab groups.
  • Job and resource management is handled via SLURM, the cluster's job scheduler.
  • High-demand periods or large resource requests may increase wait times.
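
Because GPUs are allocated first-come, first-served, it can help to check how busy the DGX partition is before requesting resources. A minimal sketch using standard SLURM commands and the interactive_gpu partition name used later in this guide:

    # Jobs currently queued or running on the interactive GPU partition
    squeue --partition=interactive_gpu

    # Node availability for the same partition
    sinfo --partition=interactive_gpu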

Data Management

  • All work is saved to your ACCRE home directory.
  • Upon logging in, you will start in your ACCRE home directory. If you require additional storage, please reach out to us or ACCRE for a custom solution.
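
To get a rough picture of how much space you are using, standard filesystem tools work from any login or DGX shell (a sketch; ACCRE may also provide its own quota-reporting commands):

    # Total size of your home directory
    du -sh $HOME

    # Free space on the filesystem backing your home directory
    df -h $HOME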

Containers

  • Custom Singularity containers are supported.
  • Docker is not available due to security concerns.
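
For example, a container can typically be built by pulling an image from the NVIDIA NGC registry and converting it to a Singularity image file (the tag below is illustrative; pick the release you need):

    # Pull an NGC PyTorch image and save it as a .sif file
    singularity pull pytorch_25.01-py3.sif docker://nvcr.io/nvidia/pytorch:25.01-py3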

For details on access methods, see Accessing the DGXs.

Setup

Requesting an Account

To use the DGX systems, you must request an account by completing the DSI Compute Grant for DGX form. If you have received an email stating you've been provisioned access, you do not need to complete this form.

Accessing the DGXs

There are four primary methods to access the DGX systems:

  • Jupyter Notebooks
  • ACCRE GPU Desktop
  • salloc
  • SLURM batch jobs

Confirming Access

The slurm_resources command will show you what resources you can use. Under Account, you should see dsi_dgx as an option unless you belong to a different research group that has been provisioned with DGX access. If you are a DSI student, you should also see p_dsi. Please reach out to Umang Chaudhry if you do not see either of these accounts.

  • Scroll to the section on Accounts and QOS for accessing the interactive GPU partition.

    • You should see dsi_dgx_iacc under Accounts and dgx_iacc under QOS. Make a note of these two values, as you will need them to request resources.
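
If slurm_resources is not available in your shell, the same association information can be inspected with a standard SLURM command (a sketch; column formatting varies by cluster):

    # List the SLURM accounts and QOS values associated with your user
    sacctmgr show associations user=$USER format=Account,QOS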

Jupyter Notebooks

Jupyter Notebooks offer a straightforward way to access GPUs, though this method is limited to notebook-based workflows. For custom applications or containers, consider using the salloc method.

  1. Visit the ACCRE Visualization Portal: http://viz.accre.vu. Log in with your VUnetID and password.
  2. Select Interactive Apps.
  3. Choose ACCRE JupyterLab.
  4. Provide the duration of your session in hours.
  5. Provide your ACCRE SLURM account (dsi_dgx_iacc).
  6. Select interactive_gpu (GPU accelerated nodes, ready on-demand) as the Partition.
  7. Provide your QOS (Quality of Service) designation (dgx_iacc).
  8. Optionally provide the memory and number of CPU cores you require. If nothing is provided, these default to the specifications of the GPU you request.
  9. Specify the required GPU type: NVIDIA A100-SXM4 (DGX 80 GB) or NVIDIA A100-SXM4 (DGX 40 GB).
  10. Provide the number of GPUs you require.
  11. If using a custom virtual environment or container, provide the necessary information under Advanced Options.
  12. Launch the session. Your session will queue and begin based on resource availability.
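
Once the session starts, you can confirm that the GPUs you requested are visible by opening a terminal from the JupyterLab launcher and running nvidia-smi, for example:

    # Should list one line per GPU allocated to your session
    nvidia-smi --query-gpu=name,memory.total --format=csv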

ACCRE GPU Desktop

ACCRE GPU Desktop offers a virtual desktop environment for interactive GPU workflows.

  1. Visit the ACCRE Visualization Portal: http://viz.accre.vu. Log in with your VUnetID and password.
  2. Select Interactive Apps.
  3. Choose ACCRE GPU Desktop.
  4. Provide the duration of your session in hours.
  5. Provide your ACCRE SLURM account (dsi_dgx_iacc).
  6. Select interactive_gpu (GPU accelerated nodes, ready on-demand) as the Partition.
  7. Provide your QOS (Quality of Service) designation (dgx_iacc).
  8. Optionally provide the memory and number of CPU cores you require. If nothing is provided, these default to the specifications of the GPU you request.
  9. Specify the required GPU type: NVIDIA A100-SXM4 (DGX 80 GB) or NVIDIA A100-SXM4 (DGX 40 GB).
  10. Provide the number of GPUs you require.
  11. Optionally, provide a custom screen resolution.
  12. Launch the session. Your session will queue and start based on availability.
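
Once the desktop session starts, you can open a terminal inside it and work interactively in a Singularity container. A minimal sketch, assuming the shared container image referenced later in this guide:

    # Open an interactive shell inside the container with GPU support (--nv)
    singularity shell --nv /data/p_dsi/singularity-containers/pytorch_25.01-py3.sif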

salloc

The salloc method provides direct shell access to the DGX systems and is ideal for running custom applications or workflows.

  1. Open a terminal and run:
    ssh <VUnetID>@login.accre.vu
  2. Enter your VUnetID password.
  3. List the contents of your ACCRE home directory with ls.
  4. Request a direct shell into the DGX system with the following command:
    salloc --time=1:00:00 --partition=interactive_gpu --account=dsi_dgx_iacc --qos=dgx_iacc --gres=gpu:nvidia_a100-sxm4-40gb:1
    • For 80GB GPUs, use:
      --gres=gpu:nvidia_a100-sxm4-80gb:1
    • Adjust the time and GPU count as needed.
  5. Use nvidia-smi to verify your resources.
  6. Launch your workflows using Singularity containers.
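
For example, once the allocation is granted you can run a containerized script directly from the shell. A minimal sketch, assuming the shared container path used in the batch example later in this guide and a placeholder script name:

    # Run a Python script inside the container with GPU support (--nv)
    singularity exec --nv /data/p_dsi/singularity-containers/pytorch_25.01-py3.sif \
        python /home/vuNetID/my_script.py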

Running Jupyter notebooks via salloc

Running Jupyter notebooks from within a container requires a few extra steps due to the need for port forwarding:

  1. Open a terminal and run:
    ssh <VUnetID>@login.accre.vu
  2. Enter your VUnetID password.
  3. List the contents of your ACCRE home directory with ls.
  4. Request a direct shell into the DGX system with the following command:
    salloc --time=1:00:00 --partition=interactive_gpu --account=dsi_dgx_iacc --qos=dgx_iacc --gres=gpu:nvidia_a100-sxm4-40gb:1
  5. Make note of the machine you landed on (dgx01, dgx02, dgx03, or dgx04).
  6. Navigate to the location of your Singularity container.
  7. Run the following command: singularity exec --nv --bind /home/vuNetID:/home/vuNetID pytorch_25.01-py3.sif jupyter-lab --notebook-dir=/home/vuNetID --ip=0.0.0.0 --no-browser. This starts a JupyterLab session with the workspace bound to your home directory. You can point it at any other directory, provided you have read, write, and execute access to that directory.
  8. Open a NEW terminal window. Keep your previous terminal open and running.
  9. Run the following: ssh <VUnetID>@login.accre.vu -L 8888:<dgx03>:8888 (replace dgx03 with your machine from step 5).
  10. Now, copy the link provided to you by the Jupyter session running on the first terminal window.
  11. Open a browser and paste the link, changing the hostname in the link to localhost. For example:
    Original link: http://hostname:8888/lab?token=8f89a890e5b48ad3a4e08058f7843f0d76e777cbd158071e
    New link: http://localhost:8888/lab?token=8f89a890e5b48ad3a4e08058f7843f0d76e777cbd158071e
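
If port 8888 is already in use on your local machine or on the DGX node (which can happen on shared systems), the same workflow works on another port as long as both commands agree. A sketch using port 8890:

    # On the DGX: start JupyterLab on an alternate port
    singularity exec --nv --bind /home/vuNetID:/home/vuNetID pytorch_25.01-py3.sif \
        jupyter-lab --notebook-dir=/home/vuNetID --ip=0.0.0.0 --port=8890 --no-browser

    # In the second terminal: forward the matching port
    ssh <VUnetID>@login.accre.vu -L 8890:<dgx03>:8890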

SLURM

SLURM batch jobs are recommended for compute-intensive work such as model training; they let the scheduler manage your workload without an interactive session.

  1. Open a terminal and run:

    ssh <VUnetID>@login.accre.vu
  2. Enter your VUnetID password.

  3. Prepare a Python script and upload it to your ACCRE home directory.

  4. Create a SLURM script (e.g., filename.slurm) with the following example:

    #!/bin/bash
    #SBATCH --job-name=stress_test           # Job name
    #SBATCH --output=stress_test.log         # Standard output log file
    #SBATCH --error=stress_test.log          # Standard error log file
    #SBATCH --partition=interactive          # Partition
    #SBATCH --account=dsi_dgx_iacc           # SLURM account
    #SBATCH --qos=dgx_iacc                   # Quality of Service
    #SBATCH --gres=gpu:1                     # Request 1 GPU
    #SBATCH --time=3-00:00:00                # Time limit (days-hh:mm:ss)
    #SBATCH --nodes=1                        # Number of nodes
    #SBATCH --ntasks=1                       # Number of tasks
    #SBATCH --cpus-per-task=6                # CPU cores per task
    #SBATCH --mem=80GB                       # Memory per node
    
    # Load Singularity module if required by your cluster
    #module load singularity
    
    # Define the Singularity container path
    CONTAINER_PATH="/data/p_dsi/singularity-containers/pytorch_25.01-py3.sif"
    
    #singularity shell $CONTAINER_PATH
    
    # Execute the code using the Singularity container
    singularity exec --nv  $CONTAINER_PATH python /home/vuNetID/stress-test.py
  5. Submit your batch job:

    sbatch filename.slurm
  6. Monitor your job status:

    squeue --job <job id>
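
For example, you can also keep an eye on the job with the following standard SLURM and shell commands (the log file name matches the #SBATCH --output line in the script above):

    # List all of your queued and running jobs
    squeue -u $USER

    # Follow the job's output as it runs
    tail -f stress_test.log

    # Cancel the job if needed
    scancel <job id>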

For additional details, refer to the ACCRE Wiki and ACCRE SLURM Training.

Support

For questions, contact Umang Chaudhry via email or the Vanderbilt Data Science Slack. If you are able to start a session but your code itself does not work, please submit an ACCRE helpdesk ticket.
