MAD - Model Automation and Dashboarding

Overview

MAD (Model Automation and Dashboarding) is a comprehensive AI/ML model automation platform that provides:

🏗️ Model Zoo: Curated collection of AI/ML models
🚀 Automated Execution: Run models across various GPU architectures
📊 Performance Tracking: Historical performance data collection and analysis
📈 Dashboard Generation: Visual tracking and reporting capabilities

DISCLAIMER

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. THIS INFORMATION IS PROVIDED “AS IS.” AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

Prerequisites

Docker installed and running
Python 3.9 or higher
GPU drivers (AMD ROCm or NVIDIA CUDA)

Quick Start

Clone the repository:
```
git clone <repository-url>
cd MAD
```
Install dependencies:
```
pip install -r requirements.txt
```

Run a model:

```
madengine run --tags pyt_huggingface_bert
```

Usage Guide

Running Models

The madengine CLI (ROCm/madengine) provides a simple interface for running models locally. Any model defined in models.json can be executed on a Docker host to collect performance results.

Note: Running models with tools/run_models.py is no longer recommended, and the script will be removed from the MAD repository soon.

Basic Usage

```
madengine run [OPTIONS]
```

Available Options

| Option | Description | Default |
| --- | --- | --- |
| `--tags TAGS` | Tags to filter models (space-separated) | - |
| `--timeout TIMEOUT` | Timeout in seconds | 7200 (2 hours) |
| `--live-output` | Show real-time output | False |
| `--clean-docker-cache` | Rebuild Docker images without cache | False |
| `--keep-alive` | Keep container running after completion | False |
| `--keep-model-dir` | Preserve model directory after run | False |
| `-o OUTPUT, --output OUTPUT` | Output file for results | - |
| `--log-level LOG_LEVEL` | Set logging level | INFO |

Execution Process

For each model, MAD performs the following steps:

🔨 Build: Creates Docker image named ci-$(model_name)
🚀 Start: Launches container named container_$(model_name)
📥 Clone: Downloads model repository from specified URL
▶️ Execute: Runs the model script
📊 Report: Generates perf.csv and perf.html

Tag Functionality

Tags allow you to run specific subsets of models based on their characteristics:

Framework tags: pyt, tf2, ort
Model tags: bert, gpt2, resnet50
Precision tags: fp16, fp32
Custom tags: Any tag defined in models.json

Examples

```
# Run a specific model
madengine run --tags pyt_huggingface_bert

# Run all PyTorch models
madengine run --tags pyt

# Run multiple tag combinations
madengine run --tags tf2 bert fp32
```
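
How madengine matches tags internally is not described here. The sketch below shows one plausible reading, assuming a model is selected when any requested tag equals its name or appears in its `tags` list. The `select_models` helper and the sample entries are hypothetical and are not part of madengine.

```python
import json


def select_models(models, requested_tags):
    """Return names of models whose name or tags match any requested tag.

    Assumption: union semantics (a single match selects the model).
    """
    wanted = set(requested_tags)
    selected = []
    for model in models:
        labels = set(model.get("tags", [])) | {model["name"]}
        if labels & wanted:
            selected.append(model["name"])
    return selected


# Hypothetical entries shaped like models.json records.
SAMPLE_MODELS = json.loads("""
[
  {"name": "pyt_huggingface_bert", "tags": ["pyt", "bert", "fp16"]},
  {"name": "tf2_bert_large", "tags": ["tf2", "bert", "fp32"]},
  {"name": "pyt_torchvision_resnet50", "tags": ["pyt", "resnet50", "fp32"]}
]
""")

print(select_models(SAMPLE_MODELS, ["pyt"]))
```

If madengine instead requires every tag to match, the intersection test would become `wanted <= labels`; check the tool's own documentation before relying on either behavior.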

Timeout Configuration

Configure execution timeouts at multiple levels:

Default: 2 hours (7200 seconds)
Model-specific: Set timeout field in models.json
Runtime override: Use --timeout command line option

Note: Setting timeout to 0 disables the timeout entirely.
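
The three levels suggest a precedence order. The following is a minimal sketch of that resolution, assuming the command line option wins over the model's `timeout` field, which in turn wins over the default. The `resolve_timeout` function is illustrative only and does not reflect madengine's actual code.

```python
DEFAULT_TIMEOUT_S = 7200  # documented default: 2 hours


def resolve_timeout(cli_timeout=None, model_timeout=None):
    """Pick the effective timeout in seconds, or None when disabled.

    Assumed precedence: --timeout, then the model's timeout field,
    then the default. A value of 0 disables the timeout.
    """
    for candidate in (cli_timeout, model_timeout):
        if candidate is not None:
            chosen = candidate
            break
    else:
        chosen = DEFAULT_TIMEOUT_S
    return None if chosen == 0 else chosen
```

Returning `None` for a disabled timeout fits Python's `subprocess.run(..., timeout=None)`, which waits indefinitely.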

Debugging Options

For troubleshooting and development:

```
# See real-time logs
madengine run --tags model_name --live-output

# Keep container running for inspection
madengine run --tags model_name --keep-alive

# Rebuild Docker images from scratch
madengine run --tags model_name --clean-docker-cache
```

⚠️ Warning: When using --keep-alive, you must manually stop and remove the container before running the same model again.
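
The cleanup can be scripted. This sketch assumes the container is named `container_<model_name>`, following the pattern listed under Execution Process; the real name may differ, so confirm it with `docker ps -a` first. The helper functions are hypothetical, and only the standard `docker rm --force` command is used.

```python
import subprocess


def container_name(model_name):
    """Container name, assuming the container_<model_name> pattern."""
    return f"container_{model_name}"


def cleanup_command(model_name):
    """Build the docker command that stops and removes a leftover container."""
    return ["docker", "rm", "--force", container_name(model_name)]


def remove_leftover_container(model_name):
    """Run the cleanup; returns True if docker reported success.

    Raises FileNotFoundError if the docker executable is not installed.
    """
    result = subprocess.run(
        cleanup_command(model_name), capture_output=True, text=True
    )
    return result.returncode == 0
```

Calling `remove_leftover_container("pyt_huggingface_bert")` before a rerun would clear the container left behind by a previous `--keep-alive` run, provided the naming assumption holds.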

Contributing

Adding New Models

Follow these steps to add a new model to the MAD repository:

Step 1: Create Workload Name

Follow the naming convention: {framework}_{project}_{workload}

Examples:

tf2_huggingface_gpt2
pyt_torchvision_resnet50
ort_onnx_bert
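
A name can be checked against the convention before it is added. The sketch below assumes each part consists of lowercase letters and digits and that the workload part may contain further underscores; these character rules are assumptions, not documented constraints, and `parse_workload_name` is a hypothetical helper.

```python
import re

# Assumed shape: {framework}_{project}_{workload}, lowercase alphanumerics,
# with extra underscores allowed only inside the workload part.
WORKLOAD_NAME = re.compile(
    r"^(?P<framework>[a-z0-9]+)"
    r"_(?P<project>[a-z0-9]+)"
    r"_(?P<workload>[a-z0-9]+(?:_[a-z0-9]+)*)$"
)


def parse_workload_name(name):
    """Split a workload name into its three parts or raise ValueError."""
    match = WORKLOAD_NAME.match(name)
    if match is None:
        raise ValueError(
            f"{name!r} does not follow {{framework}}_{{project}}_{{workload}}"
        )
    return match.groupdict()


print(parse_workload_name("pyt_torchvision_resnet50"))
```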

Step 2: Model Configuration

Add an entry to models.json:

```
{
  "name": "tf2_bert_large",
  "url": "https://github.com/ROCmSoftwarePlatform/bert",
  "dockerfile": "docker/tf2_bert_large",
  "scripts": "scripts/tf2_bert_large",
  "n_gpus": "4",
  "owner": "[email protected]",
  "training_precision": "fp32",
  "tags": [
    "per_commit",
    "tf2",
    "bert",
    "fp32"
  ],
  "args": ""
}
```

Configuration Fields

| Field | Required | Description |
| --- | --- | --- |
| `name` | ✅ | Unique model identifier |
| `url` | ✅ | Repository URL to clone |
| `dockerfile` | ✅ | Path to Dockerfile |
| `scripts` | ✅ | Path to script directory |
| `n_gpus` | ✅ | Number of GPUs (`-1` for all available) |
| `owner` | ✅ | Contact email |
| `training_precision` | ✅ | Precision level (fp16, fp32, etc.) |
| `tags` | ✅ | List of tags for categorization |
| `data` | ❌ | Optional data path |
| `timeout` | ❌ | Model-specific timeout override |
| `multiple_results` | ❌ | CSV file for multiple results |
| `args` | ❌ | Additional script arguments |
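
Before submitting a new entry, it can help to check it against the required fields. The following sketch is a hypothetical pre-submission check, not a tool shipped with MAD; it assumes only what is documented here, namely which fields are required and that `tags` is a list.

```python
REQUIRED_FIELDS = (
    "name", "url", "dockerfile", "scripts",
    "n_gpus", "owner", "training_precision", "tags",
)


def validate_model_entry(entry):
    """Return a list of problems found in one models.json entry (empty if none)."""
    problems = [
        f"missing required field: {field}"
        for field in REQUIRED_FIELDS
        if field not in entry
    ]
    if "tags" in entry and not isinstance(entry["tags"], list):
        problems.append("tags must be a list")
    return problems


# An incomplete entry, used only for illustration.
print(validate_model_entry({"name": "tf2_bert_large", "tags": "tf2"}))
```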

Step 3: Docker Setup

Create a Dockerfile in the docker/ directory:

```
# CONTEXT {'gpu_vendor': 'AMD', 'guest_os': 'UBUNTU'}
FROM rocm/tensorflow:latest

# Install system dependencies (apt-get has a stable CLI for scripted use)
RUN apt-get update && apt-get install -y \
    wget \
    unzip \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
RUN pip install --no-cache-dir \
    pandas \
    numpy

# Download model data, unpack it, and delete the archive
RUN URL=https://example.com/model-data.zip && \
    wget --directory-prefix=/data -c "$URL" && \
    ZIP_NAME=$(basename "$URL") && \
    unzip "/data/$ZIP_NAME" -d /data && \
    rm "/data/$ZIP_NAME"

# Set working directory
WORKDIR /workspace
```

Step 4: Script Implementation

Create a script directory in scripts/ with a run.sh file:

```
#!/bin/bash
# Abort on the first error. pipefail makes a failure in train_model.py
# stop the script even though its output is piped through tee.
set -eo pipefail

# Model configuration
MODEL_CONFIG_DIR=/data/model_config
BATCH_SIZE=2
SEQUENCE_LENGTH=512
TRAIN_STEPS=100
WARMUP_STEPS=10
LEARNING_RATE=1e-4

# Prepare data
echo "Preparing training data..."
python3 prepare_data.py \
    --config_dir="$MODEL_CONFIG_DIR" \
    --batch_size="$BATCH_SIZE" \
    --seq_length="$SEQUENCE_LENGTH"

# Train model (stdout and stderr are shown and saved to training.log)
echo "Starting model training..."
python3 train_model.py \
    --config_dir="$MODEL_CONFIG_DIR" \
    --batch_size="$BATCH_SIZE" \
    --max_seq_length="$SEQUENCE_LENGTH" \
    --num_train_steps="$TRAIN_STEPS" \
    --num_warmup_steps="$WARMUP_STEPS" \
    --learning_rate="$LEARNING_RATE" \
    2>&1 | tee training.log

# Report performance
echo "Generating performance metrics..."
python3 report_metrics.py
```

Performance Reporting

Single Result Format:

```
print(f"performance: {throughput} examples/sec")
```

Multiple Results Format: Create a CSV file with columns: models,performance,metric

```
models,performance,metric
model_1,156.7,examples/sec
model_2,89.3,tokens/sec
```
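
Both formats can be produced from a small reporting script. The sketch below is one way a script such as `report_metrics.py` might build them; the function names are hypothetical, and where the CSV text is written is an assumption (presumably the file named in the `multiple_results` field).

```python
import csv
import io


def single_result_line(value, metric):
    """Format one result the way the single-result format expects."""
    return f"performance: {value} {metric}"


def multiple_results_csv(rows):
    """Render (model, performance, metric) rows as CSV text with a header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["models", "performance", "metric"])
    writer.writerows(rows)
    return buffer.getvalue()


print(single_result_line(156.7, "examples/sec"))
print(multiple_results_csv([
    ("model_1", 156.7, "examples/sec"),
    ("model_2", 89.3, "tokens/sec"),
]), end="")
```

Using the `csv` module rather than string joins keeps the output valid if a model name or metric ever contains a comma.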

Environment Variables

System Variables

MAD provides system information through environment variables:

| Variable | Description |
| --- | --- |
| `MAD_SYSTEM_GPU_ARCHITECTURE` | Host GPU architecture |
| `MAD_RUNTIME_NGPUS` | Available GPU count |

Model Variables

Runtime model configuration:

| Variable | Description |
| --- | --- |
| `MAD_MODEL_NAME` | Model name from `models.json` |
| `MAD_MODEL_NUM_EPOCHS` | Training epochs |
| `MAD_MODEL_BATCH_SIZE` | Batch size |
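
A model script can read these variables defensively, since some may be unset when the script runs outside MAD. This is a minimal sketch that assumes the count, epoch, and batch size variables hold integers, which is not stated above; `read_mad_environment` is a hypothetical helper.

```python
import os


def _optional_int(value):
    """Convert a variable's text to int; None if unset or not an integer."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def read_mad_environment(environ=None):
    """Collect the MAD_* variables into a dict, using None for missing values."""
    env = os.environ if environ is None else environ
    return {
        "gpu_architecture": env.get("MAD_SYSTEM_GPU_ARCHITECTURE"),
        "n_gpus": _optional_int(env.get("MAD_RUNTIME_NGPUS")),
        "model_name": env.get("MAD_MODEL_NAME"),
        "num_epochs": _optional_int(env.get("MAD_MODEL_NUM_EPOCHS")),
        "batch_size": _optional_int(env.get("MAD_MODEL_BATCH_SIZE")),
    }


print(read_mad_environment())
```

Passing a plain dict as `environ` makes the function easy to exercise without changing the real process environment.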
