Vanderbilt Data Science Institute - DGX A100 User Guide

A comprehensive guide to using the DGX A100 systems for authorized users.

The Data Science Institute (DSI) has four DGX A100 systems, now integrated into ACCRE's cluster. Access is provided to participants of DSI projects or those awarded a DSI Compute Grant for DGX.

Infrastructure

The DSI maintains four DGX A100 systems, available in two configurations:

  1. 8 x 40GB A100 GPUs (2 machines, 320GB of total GPU memory each).
  2. 8 x 80GB A100 GPUs (2 machines, 640GB of total GPU memory each).

These machines are interconnected via InfiniBand for multi-GPU and multi-node High-Performance Computing (HPC). Access to the GPUs is allocated on a first-come, first-served basis.

Resource Allocation

  • The DGX systems are shared among DSI Graduate Students, Faculty, Staff, and affiliated lab groups.
  • Job and resource management is handled via SLURM, the cluster's job scheduler.
  • High-demand periods or large resource requests may increase wait times.
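
Because GPUs are allocated first-come, first-served, it can help to check how busy the DGX partition is before requesting resources. A minimal sketch using standard SLURM commands and the interactive_gpu partition name used later in this guide:

    # Jobs currently queued or running on the interactive GPU partition
    squeue --partition=interactive_gpu

    # Node availability for the same partition
    sinfo --partition=interactive_gpu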

Data Management

  • All work is saved to your ACCRE home directory.
  • Upon logging in, you will start in your ACCRE home directory. If you require additional storage, please reach out to us or ACCRE for a custom solution.
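
To get a rough picture of how much space you are using, standard filesystem tools work from any login or DGX shell (a sketch; ACCRE may also provide its own quota-reporting commands):

    # Total size of your home directory
    du -sh $HOME

    # Free space on the filesystem backing your home directory
    df -h $HOME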

Containers

  • Custom Singularity containers are supported.
  • Docker is not available due to security concerns.
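
For example, a container can typically be built by pulling an image from the NVIDIA NGC registry and converting it to a Singularity image file (the tag below is illustrative; pick the release you need):

    # Pull an NGC PyTorch image and save it as a .sif file
    singularity pull pytorch_25.01-py3.sif docker://nvcr.io/nvidia/pytorch:25.01-py3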

For details on access methods, see Accessing the DGXs.

Setup

Requesting an Account

To use the DGX systems, you must request an account by completing the DSI Compute Grant for DGX form. If you have received an email stating you've been provisioned access, you do not need to complete this form.

Accessing the DGXs

There are four primary methods to access the DGX systems:

  • Jupyter Notebooks
  • ACCRE GPU Desktop
  • salloc
  • SLURM batch jobs

Confirming Access

The slurm_resources command will show you what resources you can use. Under Account, you should see dsi_dgx as an option unless you belong to a different research group that has been provisioned with DGX access. If you are a DSI student, you should also see p_dsi. Please reach out to Umang Chaudhry if you do not see either of these accounts.

  • Scroll to the section on Accounts and QOS for accessing the interactive GPU partition.

    • You should see dsi_dgx_iacc under Accounts and dgx_iacc under QOS. Make a note of these two values, as you will need them to request resources.
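
If slurm_resources is not available in your shell, the same association information can be inspected with a standard SLURM command (a sketch; column formatting varies by cluster):

    # List the SLURM accounts and QOS values associated with your user
    sacctmgr show associations user=$USER format=Account,QOS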

Jupyter Notebooks

Jupyter Notebooks offer a straightforward way to access GPUs, though this method is limited to notebook-based workflows. For custom applications or containers, consider using the salloc method.

  1. Visit the ACCRE Visualization Portal: http://viz.accre.vu. Log in with your VUnetID and password.
  2. Select Interactive Apps.
  3. Choose ACCRE JupyterLab.
  4. Provide the duration of your session in hours.
  5. Provide your ACCRE SLURM account (dsi_dgx_iacc).
  6. Select interactive_gpu (GPU accelerated nodes, ready on-demand) as the Partition.
  7. Provide your QOS (Quality of Service) designation (dgx_iacc).
  8. Optionally provide the memory and number of CPU cores you require. If nothing is provided, these default to the specifications of the GPU you request.
  9. Specify the required GPU type: NVIDIA A100-SXM4 (DGX 80 GB) or NVIDIA A100-SXM4 (DGX 40 GB).
  10. Provide the number of GPUs you require.
  11. If using a custom virtual environment or container, provide the necessary information under Advanced Options.
  12. Launch the session. Your session will queue and begin based on resource availability.
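
Once the session starts, you can confirm that the GPUs you requested are visible by opening a terminal from the JupyterLab launcher and running nvidia-smi, for example:

    # Should list one line per GPU allocated to your session
    nvidia-smi --query-gpu=name,memory.total --format=csv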

ACCRE GPU Desktop

ACCRE GPU Desktop offers a virtual desktop environment for interactive GPU workflows.

  1. Visit the ACCRE Visualization Portal: http://viz.accre.vu. Log in with your VUnetID and password.
  2. Select Interactive Apps.
  3. Choose ACCRE GPU Desktop.
  4. Provide the duration of your session in hours.
  5. Provide your ACCRE SLURM account (dsi_dgx_iacc).
  6. Select interactive_gpu (GPU accelerated nodes, ready on-demand) as the Partition.
  7. Provide your QOS (Quality of Service) designation (dgx_iacc).
  8. Optionally provide the memory and number of CPU cores you require. If nothing is provided, these default to the specifications of the GPU you request.
  9. Specify the required GPU type: NVIDIA A100-SXM4 (DGX 80 GB) or NVIDIA A100-SXM4 (DGX 40 GB).
  10. Provide the number of GPUs you require.
  11. Optionally, provide a custom screen resolution.
  12. Launch the session. Your session will queue and start based on availability.
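
Once the desktop session starts, you can open a terminal inside it and work interactively in a Singularity container. A minimal sketch, assuming the shared container image referenced later in this guide:

    # Open an interactive shell inside the container with GPU support (--nv)
    singularity shell --nv /data/p_dsi/singularity-containers/pytorch_25.01-py3.sif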

salloc

The salloc method provides direct shell access to the DGX systems and is ideal for running custom applications or workflows.

  1. Open a terminal and run:
    ssh <VUnetID>@login.accre.vu
  2. Enter your VUnetID password.
  3. List the contents of your ACCRE home directory with ls.
  4. Request a direct shell into the DGX system with the following command:
    salloc --time=1:00:00 --partition=interactive_gpu --account=dsi_dgx_iacc --qos=dgx_iacc --gres=gpu:nvidia_a100-sxm4-40gb:1
    • For 80GB GPUs, use:
      --gres=gpu:nvidia_a100-sxm4-80gb:1
    • Adjust the time and GPU count as needed.
  5. Use nvidia-smi to verify your resources.
  6. Launch your workflows using Singularity containers.
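
For example, once the allocation is granted you can run a containerized script directly from the shell. A minimal sketch, assuming the shared container path used in the batch example later in this guide and a placeholder script name:

    # Run a Python script inside the container with GPU support (--nv)
    singularity exec --nv /data/p_dsi/singularity-containers/pytorch_25.01-py3.sif \
        python /home/vuNetID/my_script.py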

Running Jupyter notebooks via salloc

Running Jupyter notebooks from within a container requires a few extra steps due to the need for port forwarding:

  1. Open a terminal and run:
    ssh <VUnetID>@login.accre.vu
  2. Enter your VUnetID password.
  3. List the contents of your ACCRE home directory with ls.
  4. Request a direct shell into the DGX system with the following command:
    salloc --time=1:00:00 --partition=interactive_gpu --account=dsi_dgx_iacc --qos=dgx_iacc --gres=gpu:nvidia_a100-sxm4-40gb:1
  5. Make note of the machine you landed on (dgx01, dgx02, dgx03, or dgx04).
  6. Navigate to the location of your Singularity container.
  7. Run the following command: singularity exec --nv --bind /home/vuNetID:/home/vuNetID pytorch_25.01-py3.sif jupyter-lab --notebook-dir=/home/vuNetID --ip=0.0.0.0 --no-browser. This starts a JupyterLab session with the workspace bound to your home directory. You can point it at any other directory, provided you have read, write, and execute access to that directory.
  8. Open a NEW terminal window. Keep your previous terminal open and running.
  9. Run the following: ssh <VUnetID>@login.accre.vu -L 8888:<dgx03>:8888 (replace dgx03 with your machine from step 5).
  10. Now, copy the link provided to you by the Jupyter session running on the first terminal window.
  11. Open a browser and paste the link, changing the hostname in the link to localhost. For example:
    Original link: http://hostname:8888/lab?token=8f89a890e5b48ad3a4e08058f7843f0d76e777cbd158071e
    New link: http://localhost:8888/lab?token=8f89a890e5b48ad3a4e08058f7843f0d76e777cbd158071e
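
If port 8888 is already in use on your local machine or on the DGX node (which can happen on shared systems), the same workflow works on another port as long as both commands agree. A sketch using port 8890:

    # On the DGX: start JupyterLab on an alternate port
    singularity exec --nv --bind /home/vuNetID:/home/vuNetID pytorch_25.01-py3.sif \
        jupyter-lab --notebook-dir=/home/vuNetID --ip=0.0.0.0 --port=8890 --no-browser

    # In the second terminal: forward the matching port
    ssh <VUnetID>@login.accre.vu -L 8890:<dgx03>:8890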

SLURM

SLURM batch jobs are recommended for compute-intensive work such as model training; they let the scheduler manage your workload without an interactive session.

  1. Open a terminal and run:

    ssh <VUnetID>@login.accre.vu
  2. Enter your VUnetID password.

  3. Prepare a Python script and upload it to your ACCRE home directory.

  4. Create a SLURM script (e.g., filename.slurm) with the following example:

    #!/bin/bash
    #SBATCH --job-name=stress_test           # Job name
    #SBATCH --output=stress_test.log         # Standard output log file
    #SBATCH --error=stress_test.log          # Standard error log file
    #SBATCH --partition=interactive          # Partition
    #SBATCH --account=dsi_dgx_iacc           # SLURM account
    #SBATCH --qos=dgx_iacc                   # Quality of Service
    #SBATCH --gres=gpu:1                     # Request 1 GPU
    #SBATCH --time=3-00:00:00                # Time limit (days-hh:mm:ss)
    #SBATCH --nodes=1                        # Number of nodes
    #SBATCH --ntasks=1                       # Number of tasks
    #SBATCH --cpus-per-task=6                # CPU cores per task
    #SBATCH --mem=80GB                       # Memory per node
    
    # Load Singularity module if required by your cluster
    #module load singularity
    
    # Define the Singularity container path
    CONTAINER_PATH="/data/p_dsi/singularity-containers/pytorch_25.01-py3.sif"
    
    #singularity shell $CONTAINER_PATH
    
    # Execute the code using the Singularity container
    singularity exec --nv  $CONTAINER_PATH python /home/vuNetID/stress-test.py
  5. Submit your batch job:

    sbatch filename.slurm
  6. Monitor your job status:

    squeue --job <job id>
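
For example, you can also keep an eye on the job with the following standard SLURM and shell commands (the log file name matches the #SBATCH --output line in the script above):

    # List all of your queued and running jobs
    squeue -u $USER

    # Follow the job's output as it runs
    tail -f stress_test.log

    # Cancel the job if needed
    scancel <job id>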

For additional details, refer to the ACCRE Wiki and ACCRE SLURM Training.

Support

For questions, contact Umang Chaudhry via email or the Vanderbilt Data Science Slack. If you are able to start a session but your code itself does not work, please submit an ACCRE helpdesk ticket.
