Welcome to AI Energy Score! This is an initiative to establish comparable energy efficiency ratings for AI models, helping the industry make informed decisions about sustainability in AI development.
Get started benchmarking AI models in 5 steps:
cd AIEnergyScore
# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install development dependencies (for running scripts on host)
pip install -r requirements-dev.txt
If you plan to test gated models (like Gemma or Llama):
# One-time login
hf auth login
Build the Docker image:
./build.sh
# Quick test with 20 samples (default)
./run_docker.sh --config-name text_generation backend.model=openai/gpt-oss-20b
# Or customize the number of samples
./run_docker.sh -n 100 --config-name text_generation backend.model=openai/gpt-oss-120b
Results are saved in ./results/ with energy data in:
- GPU_ENERGY_WH.txt - Total energy consumption
- GPU_ENERGY_SUMMARY.json - Detailed metrics
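To sanity-check a run programmatically, here is a minimal Python sketch; it assumes GPU_ENERGY_WH.txt contains a single numeric watt-hour value and makes no assumptions about the JSON schema:
import json
from pathlib import Path

results = Path("./results")

# Total energy consumption (assumed to be a single Wh value in the file)
energy_wh = float((results / "GPU_ENERGY_WH.txt").read_text().strip())
print(f"Total GPU energy: {energy_wh:.3f} Wh")

# Detailed metrics; print only the top-level keys rather than assuming a schema
summary = json.loads((results / "GPU_ENERGY_SUMMARY.json").read_text())
print("Summary fields:", sorted(summary))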
Test multiple models automatically from a CSV configuration:
cd AIEnergyScore
# Install development dependencies (if not already done)
source .venv/bin/activate
pip install -r requirements-dev.txt
# Test a single model first (recommended)
python batch_runner.py \
--model-name "gpt-oss-20b" \
--reasoning-state "High" \
--num-prompts 3 \
--output-dir ./test_run
# Run all gpt-oss Class A models (smaller models)
python batch_runner.py \
--model-name gpt-oss \
--num-prompts 10 \
--class A \
--output-dir ./gpt_oss_results
Results are aggregated in batch_results/master_results.csv with detailed logs in batch_results/logs/. See Batch Testing Multiple Models for full documentation.
Next Steps:
AIEnergyScore supports automatic authentication for gated models on HuggingFace! Simply run hf auth login once (Step 2 above), and you'll have seamless access to models like Gemma, Llama, and other restricted models. See Authentication for Gated Models for details.
The Dockerfile provided in this repository is designed for the NVIDIA H100-80GB GPU. If you would like to run benchmarks on other types of hardware, we invite you to take a look at these configuration examples that can be run directly with AI Energy Benchmark. However, evaluations completed on other hardware are not currently compatible or comparable with the rest of the AI Energy Score data.
The Docker image includes both optimum-benchmark and ai_energy_benchmarks. Use the provided build script:
./build.sh
For convenience, use the provided helper script that handles all volume mounts automatically:
cd AIEnergyScore
./run_docker.sh --config-name text_generation backend.model=openai/gpt-oss-20b
The helper script automatically:
- Runs container as current user
- Mounts HuggingFace cache from ~/.cache/huggingface
- Automatically detects and mounts HuggingFace authentication tokens (for gated models)
- Creates and mounts results directory
- Configures proper environment variables
- Defaults to 20 prompts (customize with -n or --num-samples)
Note for Gated Models: If you need to access gated models (like google/gemma-3-4b-pt or Meta Llama models), run hf auth login (or the legacy huggingface-cli login) first. See Authentication for Gated Models for details.
Examples:
# Use the default 20 samples
./run_docker.sh --config-name text_generation backend.model=openai/gpt-oss-20b
# Test with 100 samples
./run_docker.sh -n 100 --config-name text_generation backend.model=openai/gpt-oss-120b
# Full test with 1000 samples
./run_docker.sh --num-samples 1000 --config-name text_generation backend.model=openai/gpt-oss-20b
# Use Optimum backend with HuggingFace optimum-benchmark
BENCHMARK_BACKEND=optimum ./run_docker.sh --config-name text_generation backend.model=openai/gpt-oss-20b
Some models on HuggingFace (e.g., google/gemma-3-1b-pt and Meta Llama models) require authentication. The run_docker.sh script handles authentication automatically using two methods:
Method 1: HuggingFace CLI Login (Recommended)
The easiest way to authenticate is using the HuggingFace CLI:
# One-time setup: login to HuggingFace
huggingface-cli login  # legacy
# or
hf auth login
# Then use normally - token is automatically mounted
./run_docker.sh --config-name text_generation backend.model=google/gemma-3-1b-pt
This creates a token file at ~/.huggingface/token which is automatically detected and mounted by run_docker.sh.
Method 2: HF_TOKEN Environment Variable
Alternatively, you can pass your token explicitly:
# Get your token from https://huggingface.co/settings/tokens
export HF_TOKEN=hf_your_token_here
# Run with token from environment
HF_TOKEN=hf_xxx ./run_docker.sh --config-name text_generation backend.model=google/gemma-3-1b-pt
The run_docker.sh script will display a warning if no authentication is found when you run it.
Alternatively, you can run your benchmark manually. Important: Create the results directory first to avoid permission errors:
# Create results directory with proper permissions
mkdir -p results
# For example:
docker run --gpus all --shm-size 1g \
--user $(id -u):$(id -g) \
-v ~/.cache/huggingface:/home/user/.cache/huggingface \
-v $(pwd)/results:/results \
-e HOME=/home/user \
ai_energy_score \
--config-name text_generation \
scenario.num_samples=3 \
backend.model=openai/gpt-oss-20b
# For gated models, add token file mount and/or HF_TOKEN
docker run --gpus all --shm-size 1g \
--user $(id -u):$(id -g) \
-v ~/.cache/huggingface:/home/user/.cache/huggingface \
-v ~/.huggingface/token:/home/user/.huggingface/token:ro \
-v $(pwd)/results:/results \
-e HOME=/home/user \
-e HF_TOKEN=hf_your_token_here \
ai_energy_score \
--config-name text_generation \
backend.model=google/gemma-3-1b-pt
where my_task is the name of a task with a configuration defined here, my_model is the name of the model you want to test (which needs to be compatible with either the Transformers or Diffusers libraries), and my_processor is the name of the tokenizer/processor to use. In most cases, backend.model and backend.processor are identical, except when a model uses another model's tokenizer (e.g., from a LLaMA model).
The rest of the configuration is explained here.
AIEnergyScore supports multiple benchmark backends for flexibility and validation:
| Backend | Tool | Load Generation | Model Location | Use Case |
|---|---|---|---|---|
| pytorch (default) | ai_energy_benchmarks | ai_energy_benchmarks generates load | Local GPU (in container) | Standard AIEnergyScore benchmarks |
| optimum | optimum-benchmark | optimum-benchmark generates load | Local GPU (in container) | Alternative HuggingFace backend |
| vllm | ai_energy_benchmarks | ai_energy_benchmarks generates load | External vLLM server | Production load testing |
Default Backend (PyTorch):
The default pytorch backend uses the ai_energy_benchmarks framework, which loads models directly from HuggingFace or local paths for inference. This backend provides full control over model configuration including quantization, device mapping, and multi-GPU support. It measures raw model performance without serving overhead, making it ideal for controlled experiments and head-to-head model comparisons. The PyTorch backend automatically handles model sharding across multiple GPUs for large models and supports reasoning-capable models with automatic prompt formatting.
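For intuition only, the sketch below shows the general shape of local model loading with multi-GPU sharding; it is illustrative rather than the framework's actual code, and the dtype and prompt are arbitrary choices:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # precision is an illustrative assumption
    device_map="auto",           # shard across available GPUs for large models
)

inputs = tokenizer("Explain energy efficiency in AI.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))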
# Standard AIEnergyScore benchmark - run as current user with cache mounting
docker run --gpus all --shm-size 1g \
--user $(id -u):$(id -g) \
-v ~/.cache/huggingface:/home/user/.cache/huggingface \
-v $(pwd)/results:/results \
-e HOME=/home/user \
ai_energy_score \
--config-name text_generation \
scenario.num_samples=20 \
backend.model=openai/gpt-oss-120b
Volume mounts:
- ~/.cache/huggingface:/home/user/.cache/huggingface - Reuse local HuggingFace model cache (avoids re-downloading)
- $(pwd)/results:/results - Persist benchmark results to local directory
- --user $(id -u):$(id -g) - Run as current user (not root) for proper file permissions
- -e HOME=/home/user - Set HOME environment variable for HuggingFace cache location
# Use HuggingFace optimum-benchmark backend
docker run --gpus all --shm-size 1g \
--user $(id -u):$(id -g) \
-v ~/.cache/huggingface:/home/user/.cache/huggingface \
-v $(pwd)/results:/results \
-e HOME=/home/user \
-e BENCHMARK_BACKEND=optimum \
ai_energy_score \
--config-name text_generation \
backend.model=openai/gpt-oss-20b
Note: With BENCHMARK_BACKEND=pytorch, ai_energy_benchmarks loads the model and generates inference load directly on the GPU, just like optimum-benchmark.
AIEnergyScore makes it easy to compare the energy efficiency of different models. Here are practical examples:
cd AIEnergyScore
# Benchmark a smaller model (Class A: ~3B parameters)
./run_docker.sh -n 100 --config-name text_generation backend.model=HuggingFaceTB/SmolLM3-3B
# Benchmark a larger model (Class B: ~20B parameters)
./run_docker.sh -n 100 --config-name text_generation backend.model=openai/gpt-oss-20b
# Benchmark Gemma family model
./run_docker.sh -n 100 --config-name text_generation backend.model=google/gemma-3-4b-pt
# Benchmark Qwen family model
./run_docker.sh -n 100 --config-name text_generation backend.model=Qwen/Qwen2.5-Coder-14B
# Benchmark Mistral family model
./run_docker.sh -n 100 --config-name text_generation backend.model=mistralai/Mistral-Nemo-Instruct-2407
# Test with reasoning disabled (fixed 10 tokens)
./run_docker.sh -n 20 \
--config-name text_generation \
backend.model=openai/gpt-oss-20b \
scenario.reasoning=False
# Test with low reasoning effort
./run_docker.sh -n 20 \
--config-name text_generation \
backend.model=openai/gpt-oss-20b \
scenario.reasoning=True \
scenario.reasoning_params.reasoning_effort=low
# Test with medium reasoning effort
./run_docker.sh -n 20 \
--config-name text_generation \
backend.model=openai/gpt-oss-20b \
scenario.reasoning=True \
scenario.reasoning_params.reasoning_effort=medium
# Test with high reasoning effort
./run_docker.sh -n 20 \
--config-name text_generation \
backend.model=openai/gpt-oss-20b \
scenario.reasoning=True \
scenario.reasoning_params.reasoning_effort=high
Note: Reasoning parameters are configured via Hydra command-line overrides. The system automatically detects the model type and applies the appropriate formatting (e.g., Harmony format for gpt-oss models). Legacy config files (text_generation_gptoss_reasoning_*.yaml) are deprecated but still functional for backward compatibility.
After running these benchmarks, results are saved in ./results/ with energy consumption data in GPU_ENERGY_WH.txt and GPU_ENERGY_SUMMARY.json files.
AIEnergyScore uses two separate requirements files:
| File | Purpose | Usage |
|---|---|---|
| requirements.txt | Runtime dependencies for the Docker container | Installed automatically during ./build.sh |
| requirements-dev.txt | Development/deployment tools for the host machine | Install with pip install -r requirements-dev.txt |
Development requirements include:
- huggingface-hub[cli] - Model downloads and authentication
- pandas - Batch runner CSV processing
- pytest, ruff, mypy, black - Testing and code quality tools
- docker, python-dotenv, pyyaml - Container and config management
cd AIEnergyScore
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements-dev.txt
pytest
Use pytest -m e2e to run only the end-to-end suites; omit the marker filter to execute the full test collection.
The batch_runner.py script enables automated testing of multiple models from a CSV configuration file, with support for model-specific parameters and reasoning configurations.
For the PyTorch backend (default), you only need development dependencies installed locally - all AI work runs in Docker:
cd AIEnergyScore
# Create virtual environment and install development dependencies
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements-dev.txt
# Test a single model first (recommended)
python batch_runner.py \
--model-name "gpt-oss-20b" \
--reasoning-state "High" \
--num-prompts 3 \
--output-dir ./test_run
# Run batch tests (uses Docker internally)
python batch_runner.py \
--model-name "gemma" \
--output-dir ./results/gemma \
--num-prompts 10
# Test all gpt-oss models
python batch_runner.py --model-name "gpt-oss" --num-prompts 20
# Test specific model class
python batch_runner.py --class A --num-prompts 50 # Small models
python batch_runner.py --class B --num-prompts 10 # Medium models
python batch_runner.py --class C --num-prompts 5 # Large models
# Filter by reasoning state
python batch_runner.py --reasoning-state "High" --num-prompts 10
# Combine filters
python batch_runner.py \
--model-name gpt-oss \
--reasoning-state "High" \
--num-prompts 10
# Full benchmark run with timestamped output
python batch_runner.py \
--output-dir ./full_results_$(date +%Y%m%d_%H%M%S)
| Option | Description | Default |
|---|---|---|
| --csv | Path to models CSV file | AI Energy Score (Oct 2025) - Models.csv |
| --output-dir | Output directory for results | ./batch_results |
| --backend | Backend type: pytorch, vllm | pytorch |
| --num-prompts | Number of prompts to run | All prompts in dataset |
| --prompts-file | Custom prompts file | HuggingFace dataset |
| --model-name | Filter by model name (substring) | - |
| --class | Filter by model class (A/B/C) | - |
| --reasoning-state | Filter by reasoning state | - |
| --task | Filter by task type (text_gen, etc.) | - |
Results are organized with detailed logs and aggregated metrics:
batch_results/
├── master_results.csv # Aggregated results from all runs
├── logs/ # Debug logs for each model run
│ └── openai_gpt-oss-20b_On_High_*.log
└── individual_runs/ # Detailed per-model results
└── openai_gpt-oss-20b_On_High/
├── benchmark_results.csv
├── GPU_ENERGY_WH.txt
└── GPU_ENERGY_SUMMARY.json
Key Metrics in master_results.csv:
- tokens_per_joule - Energy efficiency (higher = better)
- avg_energy_per_prompt_wh - Energy cost per prompt (lower = better)
- throughput_tokens_per_second - Generation speed
- gpu_energy_wh - Total energy used
- co2_emissions_g - Carbon emissions
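For programmatic comparison across models, here is a minimal pandas sketch using the columns listed above (it assumes batch_runner.py has already produced batch_results/master_results.csv):
import pandas as pd

# Load the aggregated batch results and rank models by energy efficiency
df = pd.read_csv("batch_results/master_results.csv")
ranked = df.sort_values("tokens_per_joule", ascending=False)
print(ranked[["tokens_per_joule", "avg_energy_per_prompt_wh",
              "throughput_tokens_per_second", "gpu_energy_wh"]].head())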
# View aggregated results
cat batch_results/master_results.csv
# View with formatted columns
column -t -s',' batch_results/master_results.csv | less -S
# Check success/failure counts
tail -n +2 batch_results/master_results.csv | \
awk -F',' '{if ($19 == "") print "success"; else print "failed"}' | \
sort | uniq -c
# View debug logs
cat batch_results/logs/*.log
The batch runner automatically configures model-specific parameters:
- gpt-oss models: Harmony formatting with reasoning effort levels
- DeepSeek models: <think> prefix for thinking mode
- Qwen models: enable_thinking parameter
- Hunyuan models: /think prefix
- EXAONE models: Inverted reasoning logic
- Nemotron models: /no_think for reasoning disable
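As a rough illustration of the idea (the batch runner's actual implementation lives in the repository and may differ), the per-family handling amounts to something like:
def format_prompt(model_family: str, prompt: str, reasoning: bool) -> str:
    # Illustrative mapping based on the list above; not the real batch_runner code.
    if model_family == "deepseek" and reasoning:
        return "<think>\n" + prompt      # thinking-mode prefix
    if model_family == "hunyuan" and reasoning:
        return "/think " + prompt        # /think prefix
    if model_family == "nemotron" and not reasoning:
        return "/no_think " + prompt     # disable reasoning
    return prompt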
For the vLLM backend (direct execution), you need the full ai_energy_benchmarks package:
# Install ai_energy_benchmarks from parent directory
pip install -e ../ai_energy_benchmarks[pytorch]
pip install -r requirements.txt
# Start vLLM server
vllm serve openai/gpt-oss-20b --port 8000
# Run with vLLM backend
python batch_runner.py \
--backend vllm \
--endpoint http://localhost:8000/v1 \
--model-name "gpt-oss" \
--num-prompts 10
View available models:
python model_config_parser.py "AI Energy Score (Oct 2025) - Models.csv"
Test what models would run:
python -c "
from model_config_parser import ModelConfigParser
parser = ModelConfigParser('AI Energy Score (Oct 2025) - Models.csv')
configs = parser.parse()
filtered = parser.filter_configs(configs, model_name='gpt-oss')
for c in filtered:
print(f'{c.model_id} - {c.reasoning_state}')
"Missing dependencies:
pip install pandas  # If pandas not installed
Check logs for errors:
# View most recent log
ls -t batch_results/logs/*.log | head -1 | xargs cat
# Terminal 1: Start vLLM server
vllm serve openai/gpt-oss-120b --port 8000
# Terminal 2: Run benchmark (sends requests to external vLLM server)
docker run --gpus all --shm-size 1g \
--user $(id -u):$(id -g) \
-v $(pwd)/results:/results \
-e HOME=/home/user \
-e BENCHMARK_BACKEND=vllm \
-e VLLM_ENDPOINT=http://host.docker.internal:8000/v1 \
ai_energy_score \
--config-name text_generation \
backend.model=openai/gpt-oss-120b
Note: The vLLM backend requires a running vLLM server. The benchmark sends HTTP requests to measure energy under production-like serving conditions.
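Under the hood, the vLLM backend's load generation amounts to HTTP requests against vLLM's OpenAI-compatible API. A hedged client sketch (the endpoint and model mirror the example above; the real benchmark client may differ):
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; the API key is unused locally
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Summarize the benefits of efficient AI."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)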
| Variable | Required | Default | Description |
|---|---|---|---|
| BENCHMARK_BACKEND | No | pytorch | Backend selection: optimum, pytorch, vllm |
| VLLM_ENDPOINT | Yes (for vLLM) | - | vLLM server endpoint (e.g., http://localhost:8000/v1) |
All backends produce compatible output files (GPU_ENERGY_WH.txt, GPU_ENERGY_SUMMARY.json) that can be submitted to the AIEnergyScore portal.
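Since submission involves uploading zipped log files (see the note below), here is a minimal packaging sketch; the portal's exact packaging expectations are not restated here:
import shutil

# Create ai_energy_score_results.zip from the local results directory
shutil.make_archive("ai_energy_score_results", "zip", "./results")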
Warning
It is essential to adhere to the following GPU usage guidelines:
- If the model being tested is classified as a Class A or Class B model (generally models with fewer than 66B parameters, depending on quantization and precision settings), testing must be conducted on a single GPU.
- Running tests on multiple GPUs for these model types will invalidate the results, as it may introduce inconsistencies and misrepresent the model’s actual performance under standard conditions.
Once the benchmarking has been completed, the zipped log files should be uploaded to the Submission Portal. The following terms and conditions will need to be accepted upon upload:
By checking the box below and submitting your energy score data, you confirm and agree to the following:
- Public Data Sharing: You consent to the public sharing of the energy performance data derived from your submission. No additional information related to this model including proprietary configurations will be disclosed.
- Data Integrity: You validate that the log files submitted are accurate, unaltered, and generated directly from testing your model as per the specified benchmarking procedures.
- Model Representation: You verify that the model tested and submitted is representative of the production-level version of the model, including its level of quantization and any other relevant characteristics impacting energy efficiency and performance.
