Welcome to AI Energy Score! This is an initiative to establish comparable energy efficiency ratings for AI models, helping the industry make informed decisions about sustainability in AI development.
Get started benchmarking AI models in 5 steps:
cd AIEnergyScore
# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install development dependencies (for running scripts on host)
pip install -r requirements-dev.txt
If you plan to test gated models (like Gemma or Llama):
# One-time login
hf auth login
Build the Docker image:
./build.sh
# Quick test with 20 samples (default)
./run_docker.sh --config-name text_generation backend.model=openai/gpt-oss-20b
# Or customize the number of samples
./run_docker.sh -n 100 --config-name text_generation backend.model=openai/gpt-oss-120b
Results are saved in ./results/ with energy data in:
- GPU_ENERGY_WH.txt - Total energy consumption
- GPU_ENERGY_SUMMARY.json - Detailed metrics
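To sanity-check a run programmatically, here is a minimal Python sketch; it assumes GPU_ENERGY_WH.txt contains a single numeric watt-hour value and makes no assumptions about the JSON schema:
import json
from pathlib import Path

results = Path("./results")

# Total energy consumption (assumed to be a single Wh value in the file)
energy_wh = float((results / "GPU_ENERGY_WH.txt").read_text().strip())
print(f"Total GPU energy: {energy_wh:.3f} Wh")

# Detailed metrics; print only the top-level keys rather than assuming a schema
summary = json.loads((results / "GPU_ENERGY_SUMMARY.json").read_text())
print("Summary fields:", sorted(summary))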
Test multiple models automatically from a CSV configuration:
cd AIEnergyScore
# Install development dependencies (if not already done)
source .venv/bin/activate
pip install -r requirements-dev.txt
# Test a single model first (recommended)
python batch_runner.py \
--model-name "gpt-oss-20b" \
--reasoning-state "High" \
--num-prompts 3 \
--output-dir ./test_run
# Run all gpt-oss Class A models (smaller models)
python batch_runner.py \
--model-name gpt-oss \
--num-prompts 10 \
--class A \
--output-dir ./gpt_oss_results
Results are aggregated in batch_results/master_results.csv with detailed logs in batch_results/logs/. See Batch Testing Multiple Models for full documentation.
Next Steps:
AIEnergyScore supports automatic authentication for gated models on HuggingFace! Simply run hf auth login once (Step 2 above), and you'll have seamless access to models like Gemma, Llama, and other restricted models. See Authentication for Gated Models for details.
The Dockerfile provided in this repository is designed for the NVIDIA H100-80GB GPU. If you would like to run benchmarks on other types of hardware, we invite you to take a look at these configuration examples that can be run directly with AI Energy Benchmark. However, evaluations completed on other hardware are not currently compatible or comparable with the rest of the AI Energy Score data.
The Docker image includes both optimum-benchmark and ai_energy_benchmarks. Use the provided build script:
./build.sh
For convenience, use the provided helper script that handles all volume mounts automatically:
cd AIEnergyScore
./run_docker.sh --config-name text_generation backend.model=openai/gpt-oss-20b
The helper script automatically:
- Runs container as current user
- Mounts HuggingFace cache from ~/.cache/huggingface
- Automatically detects and mounts HuggingFace authentication tokens (for gated models)
- Creates and mounts results directory
- Configures proper environment variables
- Defaults to 20 prompts (customize with -n or --num-samples)
Note for Gated Models: If you need to access gated models (like google/gemma-3-4b-pt or Meta Llama models), run hf auth login (or the legacy huggingface-cli login) first. See Authentication for Gated Models for details.
Examples:
# Use the default 20 samples
./run_docker.sh --config-name text_generation backend.model=openai/gpt-oss-20b
# Test with 100 samples
./run_docker.sh -n 100 --config-name text_generation backend.model=openai/gpt-oss-120b
# Full test with 1000 samples
./run_docker.sh --num-samples 1000 --config-name text_generation backend.model=openai/gpt-oss-20b
# Use Optimum backend with HuggingFace optimum-benchmark
BENCHMARK_BACKEND=optimum ./run_docker.sh --config-name text_generation backend.model=openai/gpt-oss-20b
Some models on HuggingFace (e.g., google/gemma-3-1b-pt and Meta Llama models) require authentication. The run_docker.sh script handles authentication automatically using two methods:
Method 1: HuggingFace CLI Login (Recommended)
The easiest way to authenticate is using the HuggingFace CLI:
# One-time setup: login to HuggingFace
huggingface-cli login  # legacy
# or
hf auth login
# Then use normally - token is automatically mounted
./run_docker.sh --config-name text_generation backend.model=google/gemma-3-1b-pt
This creates a token file at ~/.huggingface/token which is automatically detected and mounted by run_docker.sh.
Method 2: HF_TOKEN Environment Variable
Alternatively, you can pass your token explicitly:
# Get your token from https://huggingface.co/settings/tokens
export HF_TOKEN=hf_your_token_here
# Run with token from environment
HF_TOKEN=hf_xxx ./run_docker.sh --config-name text_generation backend.model=google/gemma-3-1b-pt
The run_docker.sh script will display a warning if no authentication is found when you run it.
Alternatively, you can run your benchmark manually. Important: Create the results directory first to avoid permission errors:
# Create results directory with proper permissions
mkdir -p results
# For example:
docker run --gpus all --shm-size 1g \
--user $(id -u):$(id -g) \
-v ~/.cache/huggingface:/home/user/.cache/huggingface \
-v $(pwd)/results:/results \
-e HOME=/home/user \
ai_energy_score \
--config-name text_generation \
scenario.num_samples=3 \
backend.model=openai/gpt-oss-20b
# For gated models, add token file mount and/or HF_TOKEN
docker run --gpus all --shm-size 1g \
--user $(id -u):$(id -g) \
-v ~/.cache/huggingface:/home/user/.cache/huggingface \
-v ~/.huggingface/token:/home/user/.huggingface/token:ro \
-v $(pwd)/results:/results \
-e HOME=/home/user \
-e HF_TOKEN=hf_your_token_here \
ai_energy_score \
--config-name text_generation \
backend.model=google/gemma-3-1b-pt
where my_task is the name of a task with a configuration defined here, my_model is the name of the model you want to test (which needs to be compatible with either the Transformers or Diffusers libraries), and my_processor is the name of the tokenizer/processor to use. In most cases, backend.model and backend.processor are identical, except when a model uses another model's tokenizer (e.g., from a LLaMA model).
The rest of the configuration is explained here.
AIEnergyScore supports multiple benchmark backends for flexibility and validation:
| Backend | Tool | Load Generation | Model Location | Use Case |
|---|---|---|---|---|
| pytorch (default) | ai_energy_benchmarks | ai_energy_benchmarks generates load | Local GPU (in container) | Standard AIEnergyScore benchmarks |
| optimum | optimum-benchmark | optimum-benchmark generates load | Local GPU (in container) | Alternative HuggingFace backend |
| vllm | ai_energy_benchmarks | ai_energy_benchmarks generates load | External vLLM server | Production load testing |
Default Backend (PyTorch):
The default pytorch backend uses the ai_energy_benchmarks framework, which loads models directly from HuggingFace or local paths for inference. This backend provides full control over model configuration including quantization, device mapping, and multi-GPU support. It measures raw model performance without serving overhead, making it ideal for controlled experiments and head-to-head model comparisons. The PyTorch backend automatically handles model sharding across multiple GPUs for large models and supports reasoning-capable models with automatic prompt formatting.
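For intuition only, the sketch below shows the general shape of local model loading with multi-GPU sharding; it is illustrative rather than the framework's actual code, and the dtype and prompt are arbitrary choices:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # precision is an illustrative assumption
    device_map="auto",           # shard across available GPUs for large models
)

inputs = tokenizer("Explain energy efficiency in AI.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))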
# Standard AIEnergyScore benchmark - run as current user with cache mounting
docker run --gpus all --shm-size 1g \
--user $(id -u):$(id -g) \
-v ~/.cache/huggingface:/home/user/.cache/huggingface \
-v $(pwd)/results:/results \
-e HOME=/home/user \
ai_energy_score \
--config-name text_generation \
scenario.num_samples=20 \
backend.model=openai/gpt-oss-120b
Volume mounts:
- ~/.cache/huggingface:/home/user/.cache/huggingface - Reuse local HuggingFace model cache (avoids re-downloading)
- $(pwd)/results:/results - Persist benchmark results to local directory
- --user $(id -u):$(id -g) - Run as current user (not root) for proper file permissions
- -e HOME=/home/user - Set HOME environment variable for HuggingFace cache location
# Use HuggingFace optimum-benchmark backend
docker run --gpus all --shm-size 1g \
--user $(id -u):$(id -g) \
-v ~/.cache/huggingface:/home/user/.cache/huggingface \
-v $(pwd)/results:/results \
-e HOME=/home/user \
-e BENCHMARK_BACKEND=optimum \
ai_energy_score \
--config-name text_generation \
backend.model=openai/gpt-oss-20b
Note: With BENCHMARK_BACKEND=pytorch, ai_energy_benchmarks loads the model and generates inference load directly on the GPU, just like optimum-benchmark.
AIEnergyScore makes it easy to compare the energy efficiency of different models. Here are practical examples:
cd AIEnergyScore
# Benchmark a smaller model (Class A: ~3B parameters)
./run_docker.sh -n 100 --config-name text_generation backend.model=HuggingFaceTB/SmolLM3-3B
# Benchmark a larger model (Class B: ~20B parameters)
./run_docker.sh -n 100 --config-name text_generation backend.model=openai/gpt-oss-20b
# Benchmark Gemma family model
./run_docker.sh -n 100 --config-name text_generation backend.model=google/gemma-3-4b-pt
# Benchmark Qwen family model
./run_docker.sh -n 100 --config-name text_generation backend.model=Qwen/Qwen2.5-Coder-14B
# Benchmark Mistral family model
./run_docker.sh -n 100 --config-name text_generation backend.model=mistralai/Mistral-Nemo-Instruct-2407
# Test with reasoning disabled (fixed 10 tokens)
./run_docker.sh -n 20 \
--config-name text_generation \
backend.model=openai/gpt-oss-20b \
scenario.reasoning=False
# Test with low reasoning effort
./run_docker.sh -n 20 \
--config-name text_generation \
backend.model=openai/gpt-oss-20b \
scenario.reasoning=True \
scenario.reasoning_params.reasoning_effort=low
# Test with medium reasoning effort
./run_docker.sh -n 20 \
--config-name text_generation \
backend.model=openai/gpt-oss-20b \
scenario.reasoning=True \
scenario.reasoning_params.reasoning_effort=medium
# Test with high reasoning effort
./run_docker.sh -n 20 \
--config-name text_generation \
backend.model=openai/gpt-oss-20b \
scenario.reasoning=True \
scenario.reasoning_params.reasoning_effort=high
Note: Reasoning parameters are configured via Hydra command-line overrides. The system automatically detects the model type and applies the appropriate formatting (e.g., Harmony format for gpt-oss models). Legacy config files (text_generation_gptoss_reasoning_*.yaml) are deprecated but still functional for backward compatibility.
After running these benchmarks, results are saved in ./results/ with energy consumption data in GPU_ENERGY_WH.txt and GPU_ENERGY_SUMMARY.json files.
AIEnergyScore uses two separate requirements files:
| File | Purpose | Usage |
|---|---|---|
| requirements.txt | Runtime dependencies for the Docker container | Installed automatically during ./build.sh |
| requirements-dev.txt | Development/deployment tools for the host machine | Install with pip install -r requirements-dev.txt |
Development requirements include:
- huggingface-hub[cli] - Model downloads and authentication
- pandas - Batch runner CSV processing
- pytest, ruff, mypy, black - Testing and code quality tools
- docker, python-dotenv, pyyaml - Container and config management
cd AIEnergyScore
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements-dev.txt
pytest
Use pytest -m e2e to run only the end-to-end suites; omit the marker filter to execute the full test collection.
The batch_runner.py script enables automated testing of multiple models from a CSV configuration file, with support for model-specific parameters and reasoning configurations.
For the PyTorch backend (default), you only need development dependencies installed locally - all AI work runs in Docker:
cd AIEnergyScore
# Create virtual environment and install development dependencies
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements-dev.txt
# Test a single model first (recommended)
python batch_runner.py \
--model-name "gpt-oss-20b" \
--reasoning-state "High" \
--num-prompts 3 \
--output-dir ./test_run
# Run batch tests (uses Docker internally)
python batch_runner.py \
--model-name "gemma" \
--output-dir ./results/gemma \
--num-prompts 10
# Test all gpt-oss models
python batch_runner.py --model-name "gpt-oss" --num-prompts 20
# Test specific model class
python batch_runner.py --class A --num-prompts 50 # Small models
python batch_runner.py --class B --num-prompts 10 # Medium models
python batch_runner.py --class C --num-prompts 5 # Large models
# Filter by reasoning state
python batch_runner.py --reasoning-state "High" --num-prompts 10
# Combine filters
python batch_runner.py \
--model-name gpt-oss \
--reasoning-state "High" \
--num-prompts 10
# Full benchmark run with timestamped output
python batch_runner.py \
--output-dir ./full_results_$(date +%Y%m%d_%H%M%S)
| Option | Description | Default |
|---|---|---|
| --csv | Path to models CSV file | AI Energy Score (Oct 2025) - Models.csv |
| --output-dir | Output directory for results | ./batch_results |
| --backend | Backend type: pytorch, vllm | pytorch |
| --num-prompts | Number of prompts to run | All prompts in dataset |
| --prompts-file | Custom prompts file | HuggingFace dataset |
| --model-name | Filter by model name (substring) | - |
| --class | Filter by model class (A/B/C) | - |
| --reasoning-state | Filter by reasoning state | - |
| --task | Filter by task type (text_gen, etc.) | - |
Results are organized with detailed logs and aggregated metrics:
batch_results/
├── master_results.csv # Aggregated results from all runs
├── logs/ # Debug logs for each model run
│ └── openai_gpt-oss-20b_On_High_*.log
└── individual_runs/ # Detailed per-model results
└── openai_gpt-oss-20b_On_High/
├── benchmark_results.csv
├── GPU_ENERGY_WH.txt
└── GPU_ENERGY_SUMMARY.json
Key Metrics in master_results.csv:
- tokens_per_joule - Energy efficiency (higher = better)
- avg_energy_per_prompt_wh - Energy cost per prompt (lower = better)
- throughput_tokens_per_second - Generation speed
- gpu_energy_wh - Total energy used
- co2_emissions_g - Carbon emissions
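For programmatic comparison across models, here is a minimal pandas sketch using the columns listed above (it assumes batch_runner.py has already produced batch_results/master_results.csv):
import pandas as pd

# Load the aggregated batch results and rank models by energy efficiency
df = pd.read_csv("batch_results/master_results.csv")
ranked = df.sort_values("tokens_per_joule", ascending=False)
print(ranked[["tokens_per_joule", "avg_energy_per_prompt_wh",
              "throughput_tokens_per_second", "gpu_energy_wh"]].head())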
# View aggregated results
cat batch_results/master_results.csv
# View with formatted columns
column -t -s',' batch_results/master_results.csv | less -S
# Check success/failure counts
tail -n +2 batch_results/master_results.csv | \
awk -F',' '{if ($19 == "") print "success"; else print "failed"}' | \
sort | uniq -c
# View debug logs
cat batch_results/logs/*.log
The batch runner automatically configures model-specific parameters:
- gpt-oss models: Harmony formatting with reasoning effort levels
- DeepSeek models: <think> prefix for thinking mode
- Qwen models: enable_thinking parameter
- Hunyuan models: /think prefix
- EXAONE models: Inverted reasoning logic
- Nemotron models: /no_think for reasoning disable
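As a rough illustration of the idea (the batch runner's actual implementation lives in the repository and may differ), the per-family handling amounts to something like:
def format_prompt(model_family: str, prompt: str, reasoning: bool) -> str:
    # Illustrative mapping based on the list above; not the real batch_runner code.
    if model_family == "deepseek" and reasoning:
        return "<think>\n" + prompt      # thinking-mode prefix
    if model_family == "hunyuan" and reasoning:
        return "/think " + prompt        # /think prefix
    if model_family == "nemotron" and not reasoning:
        return "/no_think " + prompt     # disable reasoning
    return prompt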
For the vLLM backend (direct execution), you need the full ai_energy_benchmarks package:
# Install ai_energy_benchmarks from parent directory
pip install -e ../ai_energy_benchmarks[pytorch]
pip install -r requirements.txt
# Start vLLM server
vllm serve openai/gpt-oss-20b --port 8000
# Run with vLLM backend
python batch_runner.py \
--backend vllm \
--endpoint http://localhost:8000/v1 \
--model-name "gpt-oss" \
--num-prompts 10
View available models:
python model_config_parser.py "AI Energy Score (Oct 2025) - Models.csv"
Test what models would run:
python -c "
from model_config_parser import ModelConfigParser
parser = ModelConfigParser('AI Energy Score (Oct 2025) - Models.csv')
configs = parser.parse()
filtered = parser.filter_configs(configs, model_name='gpt-oss')
for c in filtered:
print(f'{c.model_id} - {c.reasoning_state}')
"Missing dependencies:
pip install pandas  # If pandas not installed
Check logs for errors:
# View most recent log
ls -t batch_results/logs/*.log | head -1 | xargs cat
# Terminal 1: Start vLLM server
vllm serve openai/gpt-oss-120b --port 8000
# Terminal 2: Run benchmark (sends requests to external vLLM server)
docker run --gpus all --shm-size 1g \
--user $(id -u):$(id -g) \
-v $(pwd)/results:/results \
-e HOME=/home/user \
-e BENCHMARK_BACKEND=vllm \
-e VLLM_ENDPOINT=http://host.docker.internal:8000/v1 \
ai_energy_score \
--config-name text_generation \
backend.model=openai/gpt-oss-120b
Note: The vLLM backend requires a running vLLM server. The benchmark sends HTTP requests to measure energy under production-like serving conditions.
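Under the hood, the vLLM backend's load generation amounts to HTTP requests against vLLM's OpenAI-compatible API. A hedged client sketch (the endpoint and model mirror the example above; the real benchmark client may differ):
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; the API key is unused locally
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Summarize the benefits of efficient AI."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)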
| Variable | Required | Default | Description |
|---|---|---|---|
| BENCHMARK_BACKEND | No | pytorch | Backend selection: optimum, pytorch, vllm |
| VLLM_ENDPOINT | Yes (for vLLM) | - | vLLM server endpoint (e.g., http://localhost:8000/v1) |
All backends produce compatible output files (GPU_ENERGY_WH.txt, GPU_ENERGY_SUMMARY.json) that can be submitted to the AIEnergyScore portal.
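Since submission involves uploading zipped log files (see the note below), here is a minimal packaging sketch; the portal's exact packaging expectations are not restated here:
import shutil

# Create ai_energy_score_results.zip from the local results directory
shutil.make_archive("ai_energy_score_results", "zip", "./results")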
Warning
It is essential to adhere to the following GPU usage guidelines:
- If the model being tested is classified as a Class A or Class B model (generally models with fewer than 66B parameters, depending on quantization and precision settings), testing must be conducted on a single GPU.
- Running tests on multiple GPUs for these model types will invalidate the results, as it may introduce inconsistencies and misrepresent the model’s actual performance under standard conditions.
Once the benchmarking has been completed, the zipped log files should be uploaded to the Submission Portal. The following terms and conditions will need to be accepted upon upload:
By checking the box below and submitting your energy score data, you confirm and agree to the following:
- Public Data Sharing: You consent to the public sharing of the energy performance data derived from your submission. No additional information related to this model including proprietary configurations will be disclosed.
- Data Integrity: You validate that the log files submitted are accurate, unaltered, and generated directly from testing your model as per the specified benchmarking procedures.
- Model Representation: You verify that the model tested and submitted is representative of the production-level version of the model, including its level of quantization and any other relevant characteristics impacting energy efficiency and performance.
