Compact Automated Reproducible Assessment of Machine Learning (CARAML) is a benchmark framework designed to assess AI workloads on novel accelerators. It has been developed and tested extensively on systems at the Jülich Supercomputing Centre (JSC).
CARAML leverages JUBE, a scripting-based framework for creating benchmark sets, running them across different systems, and evaluating results. Additionally, it includes power/energy measurements through the jpwr tool.
CARAML has been tested on the JURECA-DC EVALUATION PLATFORM, JURECA-DC, JEDI, WEST-AI Nodes and NHR-FAU. These systems include the following accelerators:
| System | Configuration | Tag |
|---------------------------------------------------|---------------------------------------------------|-----------|
| NVIDIA Ampere node (SXM) | 4 × A100 (40GB HBM2e) GPUs | `A100` |
| NVIDIA Hopper node (PCIe) | 4 × H100 (80GB HBM2e) GPUs | `H100` |
| NVIDIA Hopper node (NVLink) | 4 × H100 (94GB HBM2e) GPUs | `WAIH100` |
| NVIDIA Grace-Hopper chip | 1 × GH200 (480GB LPDDR5X, 96GB HBM3) GPU | `GH200` |
| NVIDIA Grace-Hopper node | 4 × GH200 (120GB LPDDR5X, 96GB HBM3) GPUs | `JUPITER` |
| AMD MI300X node | 8 × MI300X (192GB HBM3) GPUs | `MI300X` |
| AMD MI300A node | 4 × MI300A (128GB HBM3) APUs | `MI300A` |
| AMD MI200 node | 4 × MI250 (128GB HBM2e) GPUs | `MI250` |
| Graphcore IPU-POD4 M2000 | 4 × GC200 (512GB DDR4-3200) IPUs | `GC200` |
CARAML currently provides benchmarks implemented in Python:
The image_classification model training benchmark is implemented in PyTorch. It is designed to test image classification models such as ResNet50 on various accelerators. For IPUs, graphcore/examples is used. Performance is measured in images/s and energy is measured in Wh.
Note: Support for the image classification benchmark in TensorFlow has been discontinued.
The LLM-training benchmark is implemented in PyTorch with:
- Megatron-LM at commit `f7727433293427bef04858f67b2889fe9b177d88`, with a patch applied, for NVIDIA
- Megatron-LM-ROCm at commit `21045b59127cd2d5509f1ca27d81fae7b485bd22`, with a patch applied, for AMD
- graphcore/examples (forked version) for Graphcore

Performance is measured in tokens/s and energy is recorded in Wh.
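Since CARAML reports both throughput (images/s or tokens/s) and energy (Wh), a derived energy-efficiency figure such as tokens per Wh can be computed from the two. A minimal sketch (hypothetical helper functions, not part of CARAML itself):

```python
def throughput(items_processed: float, seconds: float) -> float:
    """Throughput in items/s (images/s or tokens/s, depending on benchmark)."""
    return items_processed / seconds

def energy_efficiency(items_processed: float, energy_wh: float) -> float:
    """Derived efficiency in items per Wh, combining the two quantities CARAML reports."""
    return items_processed / energy_wh

# Example: 1.0e6 tokens processed in 500 s while consuming 50 Wh
tps = throughput(1.0e6, 500)                   # 2000.0 tokens/s
tokens_per_wh = energy_efficiency(1.0e6, 50)   # 20000.0 tokens/Wh
```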
To run the benchmarks, install JUBE following the JUBE Installation Documentation setup instructions. The benchmarks are deployed using Apptainer containers and executed using SLURM on the tested accelerators.
- Image Classification: Synthetic data is generated on the host machine for benchmarking. The IPU tag `synthetic` additionally allows synthetic data to be generated directly on the IPU.
- LLM Training: A subset of the OSCAR dataset (790 samples, ~10 MB) is pre-processed using GPT-2 tokenizers. This data is provided in the `llm_data` directory.
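The host-side synthetic data mentioned above amounts to randomly generated batches with the shape the model expects. A rough sketch using NumPy (hypothetical; the benchmark itself is implemented in PyTorch, and `synthetic_batch` is an illustrative name, not a CARAML function):

```python
import numpy as np

def synthetic_batch(batch_size: int = 64, channels: int = 3,
                    height: int = 224, width: int = 224,
                    num_classes: int = 1000, seed: int = 0):
    """Generate one random ImageNet-shaped batch (e.g. ResNet50 input)
    with random integer labels, standing in for real training data."""
    rng = np.random.default_rng(seed)
    images = rng.standard_normal((batch_size, channels, height, width)).astype(np.float32)
    labels = rng.integers(0, num_classes, size=batch_size)
    return images, labels
```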
- Clone the repository and navigate into it:

  ```shell
  git clone https://github.com/FZJ-JSC/CARAML.git
  cd CARAML
  ```
- Modify the `system` and `model` parameters in the respective JUBE configuration file.
- To pull the required container, use the `container` tag as follows:

  ```shell
  jube run {JUBEConfig}.{xml,yaml} --tag container H100
  ```

  Replace `H100` with one of the following as needed: `GH200` (for Arm CPU + H100), `MI250`, `MI300X` or `MI300A` (for AMD), or `GC200` (for Graphcore).
Note: The `container` tag should ideally be used only once at the beginning to pull and set up the container.
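The mapping from system tag to container-pull command can be scripted. A small sketch (hypothetical helper; the tag list is taken from the table above, and the two LLM-training config file names from this README are used as the example configs):

```shell
# Print the container-pull command for a given system tag (hypothetical helper).
pull_cmd() {
  tag="$1"
  case "$tag" in
    A100|H100|WAIH100|GH200|JUPITER|MI250|MI300X|MI300A)
      echo "jube run llm_training/llm_benchmark_nvidia_amd.yaml --tag container $tag" ;;
    GC200)
      echo "jube run llm_training/llm_benchmark_ipu.yaml --tag container $tag" ;;
    *)
      echo "unknown tag: $tag" >&2
      return 1 ;;
  esac
}

pull_cmd H100
pull_cmd GC200
```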
- To run the benchmark with the defined configurations, do:

  ```shell
  jube run image_classification/image_classification_torch_benchmark.xml --tag H100
  ```

  `H100` can be replaced with any tag mentioned in the tested accelerators section.
- After the benchmark has been executed, use `jube continue` to post-process results:

  ```shell
  jube continue image_classification/image_classification_torch_benchmark_run -i last
  ```
- To generate the result, do:

  ```shell
  jube result image_classification/image_classification_torch_benchmark_run -i last
  ```
- To run the benchmark with the defined configurations for the `800M` GPT model with OSCAR data, do:

  ```shell
  jube run llm_training/llm_benchmark_nvidia_amd.yaml --tag 800M A100
  ```

  `A100` can be replaced with any tag mentioned in the tested accelerators section, and `800M` can be replaced with `13B` or `175B` for systems with more node resources.
- To run the benchmark with the defined configurations for the `117M` GPT model on Graphcore with synthetic data, do:

  ```shell
  jube run llm_training/llm_benchmark_ipu.yaml --tag 117M synthetic
  ```

  If the tag `synthetic` is not given, the benchmark will use OSCAR data.
- After the benchmark has been executed, use `jube continue` to post-process results:

  ```shell
  jube continue llm_training/llm_benchmark_{nvidia_amd,ipu}_run -i last
  ```
- To generate the result, do:

  ```shell
  jube result llm_training/llm_benchmark_{nvidia_amd,ipu}_run -i last
  ```
In order to use the PyTorch `torchrun` API on JSC systems, the fixed_torch_run.py fix is required. The fix solves the issue described here. Additionally, the hostname is appended with an `i` to allow communication over InfiniBand, as described here.
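The InfiniBand adjustment amounts to suffixing the node's hostname with `i` before it is used as the rendezvous address. Roughly (a hypothetical sketch of what the fix does, not the actual fixed_torch_run.py code):

```python
def infiniband_hostname(hostname: str) -> str:
    """Append 'i' so that rendezvous traffic resolves to the node's
    InfiniBand interface (JSC hostname convention)."""
    return hostname + "i"

# e.g. a hypothetical JSC node name:
assert infiniband_hostname("jwb0001") == "jwb0001i"
```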
```bibtex
@INPROCEEDINGS{10820809,
  author={John, Chelsea Maria and Nassyr, Stepan and Penke, Carolin and Herten, Andreas},
  booktitle={SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis},
  title={Performance and Power: Systematic Evaluation of AI Workloads on Accelerators with CARAML},
  year={2024},
  pages={1164-1176},
  doi={10.1109/SCW63240.2024.00158}
}
```