MAD (Model Automation and Dashboarding) is a comprehensive AI/ML model automation platform that provides:
- 🏗️ Model Zoo: Curated collection of AI/ML models
- 🚀 Automated Execution: Run models across various GPU architectures
- 📊 Performance Tracking: Historical performance data collection and analysis
- 📈 Dashboard Generation: Visual tracking and reporting capabilities
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

THIS INFORMATION IS PROVIDED “AS IS.” AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.
© 2025 Advanced Micro Devices, Inc. All Rights Reserved.
- Docker installed and running
- Python 3.9 or higher
- GPU drivers (AMD ROCm or NVIDIA CUDA)
1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd MAD
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Run a model:

   ```bash
   madengine run --tags pyt_huggingface_bert
   ```
The madengine CLI (ROCm/madengine) provides a simple interface for running models locally. All models defined in `models.json` can be executed on a Docker host to collect performance results.

Note: Running models with `tools/run_models.py` is deprecated; the script will be removed from the MAD repository soon.
```bash
madengine run [OPTIONS]
```

| Option | Description | Default |
|---|---|---|
| `--tags TAGS` | Tags to filter models (space-separated) | - |
| `--timeout TIMEOUT` | Timeout in seconds | 7200 (2 hours) |
| `--live-output` | Show real-time output | False |
| `--clean-docker-cache` | Rebuild Docker images without cache | False |
| `--keep-alive` | Keep container running after completion | False |
| `--keep-model-dir` | Preserve model directory after run | False |
| `-o OUTPUT, --output OUTPUT` | Output file for results | - |
| `--log-level LOG_LEVEL` | Set logging level | INFO |
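For instance, several of these options can be combined in a single invocation (the tag is the quick-start example; the output filename is arbitrary):

```bash
madengine run --tags pyt_huggingface_bert --live-output --log-level DEBUG -o results.csv
```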
For each model, MAD performs the following steps:
- 🔨 Build: Creates a Docker image named `ci-$(model_name)`
- 🚀 Start: Launches a container named `container_$(model_name)`
- 📥 Clone: Downloads the model repository from the specified URL
- ▶️ Execute: Runs the model script
- 📊 Report: Generates `perf.csv` and `perf.html`
Tags allow you to run specific subsets of models based on their characteristics:
- Framework tags: `pyt`, `tf2`, `ort`
- Model tags: `bert`, `gpt2`, `resnet50`
- Precision tags: `fp16`, `fp32`
- Custom tags: any tag defined in `models.json`
```bash
# Run a specific model
madengine run --tags pyt_huggingface_bert

# Run all PyTorch models
madengine run --tags pyt

# Run multiple tag combinations
madengine run --tags tf2 bert fp32
```
Configure execution timeouts at multiple levels:
- Default: 2 hours (7200 seconds)
- Model-specific: Set the `timeout` field in `models.json`
- Runtime override: Use the `--timeout` command-line option

Note: Setting the timeout to `0` disables it entirely.
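For example, the runtime override looks like this (reusing the quick-start model tag):

```bash
# Override the timeout to 4 hours for this run only
madengine run --tags pyt_huggingface_bert --timeout 14400

# Disable the timeout entirely
madengine run --tags pyt_huggingface_bert --timeout 0
```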
For troubleshooting and development:
```bash
# See real-time logs
madengine run --tags model_name --live-output

# Keep container running for inspection
madengine run --tags model_name --keep-alive

# Rebuild Docker images from scratch
madengine run --tags model_name --clean-docker-cache
```

⚠️ Warning: When using `--keep-alive`, you must manually stop and remove the container before running the same model again.
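A minimal cleanup sketch, assuming the `container_$(model_name)` naming convention described above (the model name here is the quick-start example):

```bash
# Stop and remove the leftover container before re-running the model
docker stop container_pyt_huggingface_bert
docker rm container_pyt_huggingface_bert
```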
Follow these steps to add a new model to the MAD repository:
Follow the naming convention `{framework}_{project}_{workload}`. Examples:

- `tf2_huggingface_gpt2`
- `pyt_torchvision_resnet50`
- `ort_onnx_bert`
Add an entry to `models.json`:

```json
{
    "name": "tf2_bert_large",
    "url": "https://github.com/ROCmSoftwarePlatform/bert",
    "dockerfile": "docker/tf2_bert_large",
    "scripts": "scripts/tf2_bert_large",
    "n_gpus": "4",
    "owner": "[email protected]",
    "training_precision": "fp32",
    "tags": [
        "per_commit",
        "tf2",
        "bert",
        "fp32"
    ],
    "args": ""
}
```
| Field | Required | Description |
|---|---|---|
| `name` | ✅ | Unique model identifier |
| `url` | ✅ | Repository URL to clone |
| `dockerfile` | ✅ | Path to Dockerfile |
| `scripts` | ✅ | Path to script directory |
| `n_gpus` | ✅ | Number of GPUs (`-1` for all available) |
| `owner` | ✅ | Contact email |
| `training_precision` | ✅ | Precision level (fp16, fp32, etc.) |
| `tags` | ✅ | List of tags for categorization |
| `data` | ❌ | Optional data path |
| `timeout` | ❌ | Model-specific timeout override |
| `multiple_results` | ❌ | CSV file for multiple results |
| `args` | ❌ | Additional script arguments |
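For illustration, a hypothetical entry that exercises the optional fields (every name, URL, and value below is made up, and the exact value formats for `data` and `timeout` are assumptions):

```json
{
    "name": "pyt_example_suite",
    "url": "https://github.com/example/example-models",
    "dockerfile": "docker/pyt_example_suite",
    "scripts": "scripts/pyt_example_suite",
    "n_gpus": "-1",
    "owner": "[email protected]",
    "training_precision": "fp16",
    "tags": ["pyt", "fp16"],
    "data": "/data/example",
    "timeout": 14400,
    "multiple_results": "results.csv",
    "args": "--batch-size 8"
}
```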
Create a Dockerfile in the `docker/` directory:

```dockerfile
# CONTEXT {'gpu_vendor': 'AMD', 'guest_os': 'UBUNTU'}
FROM rocm/tensorflow:latest

# Install system dependencies
RUN apt update && apt install -y \
    wget \
    unzip \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
RUN pip install --no-cache-dir \
    pandas \
    numpy

# Download model data
RUN URL=https://example.com/model-data.zip && \
    wget --directory-prefix=/data -c $URL && \
    ZIP_NAME=$(basename $URL) && \
    unzip /data/$ZIP_NAME -d /data && \
    rm /data/$ZIP_NAME

# Set working directory
WORKDIR /workspace
```
Create a script directory in `scripts/` with a `run.sh` file:

```bash
#!/bin/bash
set -e

# Model configuration
MODEL_CONFIG_DIR=/data/model_config
BATCH_SIZE=2
SEQUENCE_LENGTH=512
TRAIN_STEPS=100
WARMUP_STEPS=10
LEARNING_RATE=1e-4

# Prepare data
echo "Preparing training data..."
python3 prepare_data.py \
    --config_dir=$MODEL_CONFIG_DIR \
    --batch_size=$BATCH_SIZE \
    --seq_length=$SEQUENCE_LENGTH

# Train model
echo "Starting model training..."
python3 train_model.py \
    --config_dir=$MODEL_CONFIG_DIR \
    --batch_size=$BATCH_SIZE \
    --max_seq_length=$SEQUENCE_LENGTH \
    --num_train_steps=$TRAIN_STEPS \
    --num_warmup_steps=$WARMUP_STEPS \
    --learning_rate=$LEARNING_RATE \
    2>&1 | tee training.log

# Report performance
echo "Generating performance metrics..."
python3 report_metrics.py
```
Single Result Format: print a single line to stdout:

```python
print(f"performance: {throughput} examples/sec")
```
Multiple Results Format: create a CSV file with the columns `models,performance,metric`:

```csv
models,performance,metric
model_1,156.7,examples/sec
model_2,89.3,tokens/sec
```
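A short Python sketch for writing that file (the filename presumably has to match the model's `multiple_results` field; the rows are placeholders):

```python
import csv

results = [
    ("model_1", 156.7, "examples/sec"),
    ("model_2", 89.3, "tokens/sec"),
]

with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["models", "performance", "metric"])
    writer.writerows(results)
```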
MAD provides system information through environment variables:

| Variable | Description |
|---|---|
| `MAD_SYSTEM_GPU_ARCHITECTURE` | Host GPU architecture |
| `MAD_RUNTIME_NGPUS` | Available GPU count |
Runtime model configuration:

| Variable | Description |
|---|---|
| `MAD_MODEL_NAME` | Model name from `models.json` |
| `MAD_MODEL_NUM_EPOCHS` | Training epochs |
| `MAD_MODEL_BATCH_SIZE` | Batch size |
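For example, a model's `run.sh` could read these variables as follows (the fallback defaults are assumptions, not documented MAD behavior):

```bash
#!/bin/bash
# Use the GPU count detected by MAD, defaulting to 1 if unset
NGPUS=${MAD_RUNTIME_NGPUS:-1}

echo "Running ${MAD_MODEL_NAME} on ${NGPUS} GPU(s)"
echo "GPU architecture: ${MAD_SYSTEM_GPU_ARCHITECTURE}"
echo "Batch size: ${MAD_MODEL_BATCH_SIZE}, epochs: ${MAD_MODEL_NUM_EPOCHS}"
```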