GPU fryer 🍳

GPU fryer is a tool to stress test GPUs and detect abnormal thermal throttling or performance degradation. It is especially useful for testing GPUs that run ML inference or training workloads, where overall performance is dictated by the slowest GPU in the system.

We use it at Hugging Face 🤗 to monitor our HPC clusters and ensure that all GPUs are running at peak performance.

Quickstart

Use Docker:

docker run --gpus all ghcr.io/huggingface/gpu-fryer:latest 60

Usage

$ gpu-fryer 60  # Run the test for 60 seconds
...
GPU #7:  51494 Gflops/s (min: 50577.53, max: 51677.05, dev: 51493.79)
         Temperature: 48.83°C (min: 47.00, max: 50.00)
         Throttling HW: false, Thermal SW: false, Thermal HW: false
All GPUs seem healthy

Usage: gpu-fryer [OPTIONS] [DURATION_SECS]

Arguments:
  [DURATION_SECS]  Duration in seconds to burn the GPUs [default: 60]

Options:
      --nvml-lib-path <NVML_LIB_PATH>
          Path to NVIDIA Management Library (libnvidia-ml.so) [default: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1]
      --tolerate-software-throttling
          Tolerate software throttling if the TFLOPS are in the acceptable range
      --tflops-tolerance <TFLOPS_TOLERANCE>
          TFLOPS tolerance (%) from the average If the TFLOPS are within this range, test pass [default: 10]
  -h, --help
          Print help
  -V, --version
          Print version

GPU fryer relies on NVIDIA's CUDA toolkit to run the stress test, so make sure that your PATH includes the CUDA libs. NVML is used to monitor each GPU's temperature and throttling state. For non-default installations, use the --nvml-lib-path flag to specify the path to libnvidia-ml.so.

GPU fryer checks for homogeneous performance across all GPUs in the system (if multiple GPUs are present) and reports any performance degradation or thermal throttling. There is currently no absolute performance metric. For reference:

GPU	TFLOPS
NVIDIA H100 80GB HBM3	~51
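
The homogeneity check described above can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the function name find_outliers is hypothetical, and flagging deviations on both sides of the average is an assumption based on the --tflops-tolerance help text.

```rust
/// Returns the indices of GPUs whose measured TFLOPS deviate from the
/// average of all GPUs by more than `tolerance_pct` percent.
/// Hypothetical helper for illustration; not part of gpu-fryer's API.
fn find_outliers(tflops: &[f64], tolerance_pct: f64) -> Vec<usize> {
    if tflops.is_empty() {
        return Vec::new();
    }
    // Average throughput across every GPU in the system.
    let mean = tflops.iter().sum::<f64>() / tflops.len() as f64;
    // Largest absolute deviation that still counts as a pass.
    let allowed = mean * tolerance_pct / 100.0;
    tflops
        .iter()
        .enumerate()
        .filter_map(|(i, &t)| {
            if (t - mean).abs() > allowed {
                Some(i)
            } else {
                None
            }
        })
        .collect()
}
```

With the default tolerance of 10, a GPU measuring 40 TFLOPS next to three GPUs at 51 TFLOPS would be flagged, because the average is 48.25 and the allowed deviation is about 4.8.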

Installation

$ cargo install gpu-fryer

How it works

GPU fryer creates two 8192x8192 matrices and multiplies them using cuBLAS. The test allocates 95% of the GPU memory and writes the results into it in a ring buffer fashion.
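
The arithmetic behind this scheme can be sketched as follows. This is a rough illustration under stated assumptions, not the tool's code: all function names are hypothetical, the element size is a parameter because the data type is not stated here, and the sketch ignores the space taken by the two input matrices.

```rust
/// Floating-point operations in one n x n by n x n multiplication:
/// each of the n*n outputs needs n multiplies and n adds.
fn flops_per_matmul(n: u64) -> u64 {
    2 * n * n * n
}

/// Number of n x n result matrices that fit in 95% of `total_bytes`.
/// `elem_bytes` is an assumption (for example 2 for a 16-bit type).
fn ring_slots(total_bytes: u64, n: u64, elem_bytes: u64) -> u64 {
    let budget = total_bytes * 95 / 100;
    budget / (n * n * elem_bytes)
}

/// Index of the slot that receives the next result, wrapping back to
/// slot 0 after the last one. `slots` must be non-zero.
fn next_slot(current: u64, slots: u64) -> u64 {
    (current + 1) % slots
}

/// Throughput in TFLOPS for `matmuls` multiplications completed in
/// `elapsed_secs` seconds.
fn tflops(matmuls: u64, n: u64, elapsed_secs: f64) -> f64 {
    (matmuls * flops_per_matmul(n)) as f64 / elapsed_secs / 1e12
}
```

One 8192x8192 multiplication costs about 1.1 TFLOP, so a GPU sustaining roughly 51 TFLOPS completes on the order of 47 multiplications per second. Wrapping the slot index lets the test keep writing results for the whole run without allocating more memory.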

Acknowledgements

The awesome GPU Burn, a very similar tool that looks for computational errors instead.
