ROCm/MAD

MAD - Model Automation and Dashboarding

Overview

MAD (Model Automation and Dashboarding) is a comprehensive AI/ML model automation platform that provides:

  • 🏗️ Model Zoo: Curated collection of AI/ML models
  • 🚀 Automated Execution: Run models across various GPU architectures
  • 📊 Performance Tracking: Historical performance data collection and analysis
  • 📈 Dashboard Generation: Visual tracking and reporting capabilities

DISCLAIMER

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

THIS INFORMATION IS PROVIDED "AS IS." AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

© 2025 Advanced Micro Devices, Inc. All Rights Reserved.

Prerequisites

  • Docker installed and running
  • Python 3.9 or higher
  • GPU drivers (AMD ROCm or NVIDIA CUDA)

Quick Start

  1. Clone the repository:

    git clone <repository-url>
    cd MAD
  2. Install dependencies:

    pip install -r requirements.txt
  3. Run a model:

    madengine run --tags pyt_huggingface_bert

Usage Guide

Running Models

The madengine CLI (ROCm/madengine) provides a simple interface for running models locally. Any model defined in models.json can be executed on a Docker host to collect performance results.

Note: Running models with tools/run_models.py is no longer recommended; the script will be removed from the MAD repository soon.

Basic Usage

madengine run [OPTIONS]

Available Options

| Option | Description | Default |
|---|---|---|
| --tags TAGS | Tags to filter models (space-separated) | - |
| --timeout TIMEOUT | Timeout in seconds | 7200 (2 hours) |
| --live-output | Show real-time output | False |
| --clean-docker-cache | Rebuild Docker images without cache | False |
| --keep-alive | Keep container running after completion | False |
| --keep-model-dir | Preserve model directory after run | False |
| -o OUTPUT, --output OUTPUT | Output file for results | - |
| --log-level LOG_LEVEL | Set logging level | INFO |

Execution Process

For each model, MAD performs the following steps:

  1. 🔨 Build: Creates a Docker image named ci-$(model_name)
  2. 🚀 Start: Launches a container named container_$(model_name)
  3. 📥 Clone: Downloads the model repository from the specified URL
  4. ▶️ Execute: Runs the model script
  5. 📊 Report: Generates perf.csv and perf.html

Tag Functionality

Tags allow you to run specific subsets of models based on their characteristics:

  • Framework tags: pyt, tf2, ort
  • Model tags: bert, gpt2, resnet50
  • Precision tags: fp16, fp32
  • Custom tags: Any tag defined in models.json

Examples

# Run a specific model
madengine run --tags pyt_huggingface_bert

# Run all PyTorch models
madengine run --tags pyt

# Run multiple tag combinations
madengine run --tags tf2 bert fp32
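
As a rough illustration of how tag selection could work, the Python sketch below picks every model whose name or tag list contains at least one requested tag. The union ("any tag matches") semantics, the treatment of a model name as a tag, and the `select_models` helper are assumptions made for illustration; madengine's actual matching logic may differ.

```python
def select_models(models, requested_tags):
    """Return models whose name or tags match any requested tag.

    Assumption: a model is selected when ANY requested tag matches
    (union semantics); madengine's real matching rules may differ.
    """
    wanted = set(requested_tags)
    selected = []
    for model in models:
        candidates = set(model.get("tags", [])) | {model["name"]}
        if wanted & candidates:
            selected.append(model)
    return selected


# Hypothetical entries shaped like models.json records
models = [
    {"name": "pyt_huggingface_bert", "tags": ["pyt", "bert", "fp16"]},
    {"name": "tf2_bert_large", "tags": ["per_commit", "tf2", "bert", "fp32"]},
    {"name": "pyt_torchvision_resnet50", "tags": ["pyt", "resnet50", "fp32"]},
]

print([m["name"] for m in select_models(models, ["pyt"])])
```

Under these assumptions, requesting `pyt` selects both PyTorch entries, while requesting a full model name selects only that model.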

Timeout Configuration

Configure execution timeouts at multiple levels:

  1. Default: 2 hours (7200 seconds)
  2. Model-specific: Set timeout field in models.json
  3. Runtime override: Use --timeout command line option

Note: Setting timeout to 0 disables the timeout entirely.
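
The three levels can be pictured as a simple fallback chain. The sketch below is not madengine's code: the `resolve_timeout` helper is hypothetical, and the precedence it encodes (command line first, then the `timeout` field in models.json, then the default) is an assumption inferred from the word "override".

```python
DEFAULT_TIMEOUT_SECONDS = 7200  # documented default: 2 hours


def resolve_timeout(model_entry, cli_timeout=None):
    """Pick the effective timeout in seconds, or None when disabled.

    Assumed precedence: --timeout, then the model's "timeout" field,
    then the default.
    """
    if cli_timeout is not None:
        timeout = cli_timeout
    elif "timeout" in model_entry:
        timeout = model_entry["timeout"]
    else:
        timeout = DEFAULT_TIMEOUT_SECONDS
    timeout = int(timeout)
    # A value of 0 disables the timeout entirely
    return None if timeout == 0 else timeout


print(resolve_timeout({"name": "tf2_bert_large", "timeout": 3600}))
```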

Debugging Options

For troubleshooting and development:

# See real-time logs
madengine run --tags model_name --live-output

# Keep container running for inspection
madengine run --tags model_name --keep-alive

# Rebuild Docker images from scratch
madengine run --tags model_name --clean-docker-cache

⚠️ Warning: When using --keep-alive, you must manually stop and remove the container before running the same model again.

Contributing

Adding New Models

Follow these steps to add a new model to the MAD repository:

Step 1: Create Workload Name

Follow the naming convention: {framework}_{project}_{workload}

Examples:

  • tf2_huggingface_gpt2
  • pyt_torchvision_resnet50
  • ort_onnx_bert
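
A name can be checked against this convention before it is added. The following is a minimal sketch: the `parse_workload_name` helper is hypothetical, and the set of framework prefixes contains only those mentioned in this README, so it is probably incomplete.

```python
KNOWN_FRAMEWORKS = {"pyt", "tf2", "ort"}  # prefixes seen in this README; likely not exhaustive


def parse_workload_name(name):
    """Split a workload name into (framework, project, workload).

    Raises ValueError when the name lacks three non-empty parts or
    starts with a framework prefix this sketch does not know.
    """
    # Split on the first two underscores only, so the workload part may contain more
    parts = name.split("_", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected {{framework}}_{{project}}_{{workload}}, got {name!r}")
    framework, project, workload = parts
    if framework not in KNOWN_FRAMEWORKS:
        raise ValueError(f"unknown framework prefix {framework!r}")
    return framework, project, workload


print(parse_workload_name("pyt_torchvision_resnet50"))
```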

Step 2: Model Configuration

Add an entry to models.json:

{
  "name": "tf2_bert_large",
  "url": "https://github.com/ROCmSoftwarePlatform/bert",
  "dockerfile": "docker/tf2_bert_large",
  "scripts": "scripts/tf2_bert_large",
  "n_gpus": "4",
  "owner": "[email protected]",
  "training_precision": "fp32",
  "tags": [
    "per_commit",
    "tf2",
    "bert",
    "fp32"
  ],
  "args": ""
}

Configuration Fields

| Field | Required | Description |
|---|---|---|
| name | ✅ | Unique model identifier |
| url | ✅ | Repository URL to clone |
| dockerfile | ✅ | Path to Dockerfile |
| scripts | ✅ | Path to script directory |
| n_gpus | ✅ | Number of GPUs (-1 for all available) |
| owner | ✅ | Contact email |
| training_precision | ✅ | Precision level (fp16, fp32, etc.) |
| tags | ✅ | List of tags for categorization |
| data | ❌ | Optional data path |
| timeout | ❌ | Model-specific timeout override |
| multiple_results | ❌ | CSV file for multiple results |
| args | ❌ | Additional script arguments |
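
Before committing a new entry, it can help to confirm that every required field is present. The sketch below is an illustrative check, not a validator shipped with MAD; the `missing_fields` helper is hypothetical, and the sample entry deliberately omits `owner` to show a failing case.

```python
import json

REQUIRED_FIELDS = (
    "name", "url", "dockerfile", "scripts",
    "n_gpus", "owner", "training_precision", "tags",
)


def missing_fields(entry):
    """Return the required fields that are absent from a models.json entry."""
    return [field for field in REQUIRED_FIELDS if field not in entry]


entry = json.loads("""
{
  "name": "tf2_bert_large",
  "url": "https://github.com/ROCmSoftwarePlatform/bert",
  "dockerfile": "docker/tf2_bert_large",
  "scripts": "scripts/tf2_bert_large",
  "n_gpus": "4",
  "training_precision": "fp32",
  "tags": ["per_commit", "tf2", "bert", "fp32"],
  "args": ""
}
""")

print(missing_fields(entry))  # "owner" was left out on purpose
```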

Step 3: Docker Setup

Create a Dockerfile in the docker/ directory:

# CONTEXT {'gpu_vendor': 'AMD', 'guest_os': 'UBUNTU'}
FROM rocm/tensorflow:latest

# Install system dependencies
RUN apt-get update && apt-get install -y \
    wget \
    unzip \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
RUN pip install --no-cache-dir \
    pandas \
    numpy

# Download model data
RUN URL=https://example.com/model-data.zip && \
    wget --directory-prefix=/data -c $URL && \
    ZIP_NAME=$(basename $URL) && \
    unzip /data/$ZIP_NAME -d /data && \
    rm /data/$ZIP_NAME

# Set working directory
WORKDIR /workspace

Step 4: Script Implementation

Create a script directory in scripts/ with a run.sh file:

#!/bin/bash
# Exit on any error; pipefail keeps a failing python3 from being masked by tee
set -eo pipefail

# Model configuration
MODEL_CONFIG_DIR=/data/model_config
BATCH_SIZE=2
SEQUENCE_LENGTH=512
TRAIN_STEPS=100
WARMUP_STEPS=10
LEARNING_RATE=1e-4

# Prepare data
echo "Preparing training data..."
python3 prepare_data.py \
    --config_dir=$MODEL_CONFIG_DIR \
    --batch_size=$BATCH_SIZE \
    --seq_length=$SEQUENCE_LENGTH

# Train model
echo "Starting model training..."
python3 train_model.py \
    --config_dir=$MODEL_CONFIG_DIR \
    --batch_size=$BATCH_SIZE \
    --max_seq_length=$SEQUENCE_LENGTH \
    --num_train_steps=$TRAIN_STEPS \
    --num_warmup_steps=$WARMUP_STEPS \
    --learning_rate=$LEARNING_RATE \
    2>&1 | tee training.log

# Report performance
echo "Generating performance metrics..."
python3 report_metrics.py

Performance Reporting

Single Result Format:

print(f"performance: {throughput} examples/sec")

Multiple Results Format: Create a CSV file with columns: models,performance,metric

models,performance,metric
model_1,156.7,examples/sec
model_2,89.3,tokens/sec
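
Both formats can be produced from a small Python helper, for example inside a script like the report_metrics.py step above. This is a sketch under assumptions: the `report_single` and `write_multiple_results` helpers are hypothetical, and how MAD parses the output beyond the formats shown here is not covered.

```python
import csv
import io


def report_single(throughput, metric="examples/sec"):
    """Print one result in the single-result format and return the line."""
    line = f"performance: {throughput} {metric}"
    print(line)
    return line


def write_multiple_results(handle, rows):
    """Write (model, performance, metric) rows as a multiple-results CSV."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["models", "performance", "metric"])
    writer.writerows(rows)


report_single(156.7)

# Demonstrate with an in-memory buffer; a real script would pass an open file
buffer = io.StringIO()
write_multiple_results(
    buffer,
    [("model_1", 156.7, "examples/sec"), ("model_2", 89.3, "tokens/sec")],
)
print(buffer.getvalue(), end="")
```

In a real run, the CSV would be written to the file named by the multiple_results field of the model's entry, for example with `open(path, "w", newline="")`.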

Environment Variables

System Variables

MAD provides system information through environment variables:

| Variable | Description |
|---|---|
| MAD_SYSTEM_GPU_ARCHITECTURE | Host GPU architecture |
| MAD_RUNTIME_NGPUS | Available GPU count |

Model Variables

Runtime model configuration:

| Variable | Description |
|---|---|
| MAD_MODEL_NAME | Model name from models.json |
| MAD_MODEL_NUM_EPOCHS | Training epochs |
| MAD_MODEL_BATCH_SIZE | Batch size |
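
A model script written in Python might read these variables as follows. This is a minimal sketch: the `read_mad_env` helper is hypothetical, the assumption that counts arrive as decimal strings is not confirmed by this README, and the sample values (including `gfx90a`) are illustrative only.

```python
import os


def read_mad_env(environ=os.environ):
    """Collect the MAD_* variables listed above, leaving unset ones as None."""
    def as_int(value):
        # Assumption: numeric variables arrive as decimal strings
        return int(value) if value not in (None, "") else None

    return {
        "gpu_architecture": environ.get("MAD_SYSTEM_GPU_ARCHITECTURE"),
        "ngpus": as_int(environ.get("MAD_RUNTIME_NGPUS")),
        "model_name": environ.get("MAD_MODEL_NAME"),
        "num_epochs": as_int(environ.get("MAD_MODEL_NUM_EPOCHS")),
        "batch_size": as_int(environ.get("MAD_MODEL_BATCH_SIZE")),
    }


# Example with a hand-built mapping instead of the real environment
print(read_mad_env({"MAD_SYSTEM_GPU_ARCHITECTURE": "gfx90a", "MAD_RUNTIME_NGPUS": "8"}))
```

Passing a plain dictionary keeps the helper easy to test; calling `read_mad_env()` with no argument reads the process environment.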
