A package for running OELLM CLI workflows across multiple HPC clusters using SLURM job arrays and Singularity containers.
- Schedule evaluations on multiple models and tasks on all clusters ✅ (`oellm schedule-eval ...`)
- Restart failed evaluations (e.g., due to node failures) ✅ (`oellm collect-results ... --reschedule true`)
- Interactive eval job/CSV builder ✅ (`oellm build-csv`)
- Recursively resolve local paths: pass a directory containing models and their nested intermediate checkpoints, and every checkpoint found will be evaluated (see the example after this list)
- Support default task groups (cf. `oellm/task-groups.yaml`)
- Sync and download evaluation results from all clusters via a shared data layer
- Schedule training jobs on all clusters
- Schedule conversions from MegatronLM to HuggingFace
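As an example of the local-path resolution mentioned above, you can point `--models` at an experiment directory instead of a Hub ID, and every checkpoint found underneath is scheduled (the path below is illustrative):

```bash
# Schedules an evaluation for every checkpoint discovered under the directory
# (the path is a placeholder; any directory containing nested checkpoints works)
oellm schedule-eval \
  --models "/scratch/$USER/experiments/my-run" \
  --tasks "hellaswag" \
  --n_shot 0
```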
Prerequisites:
- Install `uv`
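If uv is not already available, it can be installed with its standalone installer, for example:

```bash
# Official uv installer; see https://docs.astral.sh/uv/ for other installation methods
curl -LsSf https://astral.sh/uv/install.sh | sh
```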
```bash
# Install the package
uv tool install git+https://github.com/OpenEuroLLM/oellm-cli.git

# Run evaluations on multiple models and tasks
oellm schedule-eval \
  --models "microsoft/DialoGPT-medium,EleutherAI/pythia-160m" \
  --tasks "hellaswag,mmlu" \
  --n_shot 5
```
This will automatically:
- Detect your current HPC cluster (Leonardo, LUMI, or JURECA)
- Download and cache the specified models and datasets
- Generate a SLURM job array to evaluate all model-task combinations
- Submit the jobs with appropriate cluster-specific resource allocations
If you run into HuggingFace quota issues, make sure you are logged in by setting your `HF_TOKEN` and that you are a member of the OpenEuroLLM organization.
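For example (the token value is a placeholder; create one at https://huggingface.co/settings/tokens):

```bash
# Make your HuggingFace token available for model and dataset downloads
export HF_TOKEN="hf_..."
```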
```bash
oellm interactive-csv
```
This will launch an interactive workflow where you can:
- Add models (HuggingFace Hub or local paths)
- Select evaluation tasks
- Configure n-shot settings
- Preview and save your evaluation configuration
Alternatively, you can schedule directly from a CSV file:
```bash
oellm schedule-eval --eval_csv_path custom_evals.csv
```
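The CSV presumably lists one model-task-shot combination per row. A minimal sketch of what `custom_evals.csv` might contain (the column names are illustrative; generate a file with the interactive builder to see the exact format):

```csv
model,task,n_shot
EleutherAI/pythia-160m,hellaswag,5
EleutherAI/pythia-160m,mmlu,5
```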
Due to the limited space in `$HOME` on JSC clusters, you must set these uv-specific environment variables to avoid running out of space:
```bash
export UV_CACHE_DIR="<some-workspace-dir>/.cache/uv-cache"
export UV_INSTALL_DIR="<some-workspace-dir>/.local"
export UV_PYTHON_INSTALL_DIR="<some-workspace-dir>/.local/share/uv/python"
export UV_TOOL_DIR="<some-workspace-dir>/.cache/uv-tool-cache"
```
You can set these variables in your `.bashrc` or `.zshrc` file, depending on your preferred shell.
E.g., I have a user folder in the `synthlaion` project, so I set the following variables:
```bash
export UV_CACHE_DIR="/p/project1/synthlaion/$USER/.cache/uv-cache"
export UV_INSTALL_DIR="/p/project1/synthlaion/$USER/.local"
export UV_PYTHON_INSTALL_DIR="/p/project1/synthlaion/$USER/.local/share/uv/python"
export UV_TOOL_DIR="/p/project1/synthlaion/$USER/.cache/uv-tool-cache"
```
Install directly from the git repository using uv:
```bash
uv tool install git+https://github.com/OpenEuroLLM/oellm-cli.git
```
This makes the `oellm` command available globally in your shell.
If you've already installed the package, you can run the following command to update it:
```bash
uv tool upgrade oellm
```
If you had previously installed the package from a different source and would like to overwrite it, you can run the following command:
```bash
uv tool install git+https://github.com/OpenEuroLLM/oellm-cli.git --force
```
The `oellm` package orchestrates distributed LLM evaluations through the following workflow:
- Automatically detects the current HPC cluster based on hostname patterns
- Loads cluster-specific configurations from `clusters.yaml`, including:
  - SLURM partition and account settings
  - Shared storage paths for models, datasets, and results
  - GPU allocation and queue limits
  - Singularity container specifications
- Model Handling: Processes both local model checkpoints and Hugging Face Hub models
  - For local paths: Automatically discovers and expands training checkpoint directories
  - For HF models: Pre-downloads to the shared cache (`$HF_HOME`) for offline access on compute nodes
- Dataset Caching: Pre-downloads all evaluation datasets using the lm-evaluation-harness `TaskManager`
- Container Management: Ensures the appropriate Singularity container is available for the target cluster
- Creates a comprehensive CSV manifest of all model-task-shot combinations
- Generates a SLURM batch script from a template with cluster-specific parameters
- Submits a job array where each array task processes a subset of evaluations
- Respects queue limits and current user load to avoid overwhelming the scheduler
- Each SLURM array job runs in a Singularity container (sketched below) with:
  - GPU access (NVIDIA CUDA or AMD ROCm as appropriate)
  - Mounted shared storage for models, datasets, and output
  - Offline execution using pre-cached resources
- Uses `lm-evaluation-harness` for the actual model evaluation
- Outputs results as JSON files
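Conceptually, each array task ends up running something like the following (a simplified sketch, not the actual generated script; `$CONTAINER`, `$MODEL`, `$TASK`, `$N_SHOT`, and the `<timestamp>` segment are placeholders):

```bash
# Simplified sketch of one array task (NVIDIA cluster shown; LUMI would use ROCm bindings)
export HF_HUB_OFFLINE=1 HF_DATASETS_OFFLINE=1    # run fully offline from pre-cached resources
singularity exec --nv --bind "$HF_HOME,$EVAL_OUTPUT_DIR" "$CONTAINER" \
    lm_eval --model hf \
        --model_args "pretrained=$MODEL" \
        --tasks "$TASK" \
        --num_fewshot "$N_SHOT" \
        --output_path "$EVAL_OUTPUT_DIR/$USER/<timestamp>/results"
```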
Results are organized in timestamped directories under `$EVAL_OUTPUT_DIR/$USER/`:
```
2024-01-15-14-30-45/
├── jobs.csv              # Complete evaluation manifest
├── submit_evals.sbatch   # Generated SLURM script
├── slurm_logs/           # SLURM output/error logs
└── results/              # Evaluation JSON outputs
```
Currently supports three HPC clusters:
- LEONARDO - NVIDIA A100 GPUs (CUDA)
- LUMI - AMD MI250X GPUs (ROCm)
- JURECA - NVIDIA A100 GPUs (CUDA)
Each cluster has pre-configured:
- Shared evaluation directories with appropriate quotas
- Optimized Singularity containers with evaluation dependencies
- Account and partition settings for the OpenEuroLLM project
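For reference, a cluster entry in `clusters.yaml` might look roughly like this (all keys and values below are hypothetical; consult the file in the repository for the actual schema):

```yaml
# Hypothetical clusters.yaml entry -- illustrative only, the real schema may differ
leonardo:
  account: <openeurollm-account>       # SLURM account for the OpenEuroLLM project
  partition: <gpu-partition>           # partition with NVIDIA A100 nodes
  container: /path/to/eval-cuda.sif    # Singularity image with evaluation dependencies
  eval_output_dir: /shared/eval        # shared results root ($EVAL_OUTPUT_DIR)
  max_queued_jobs: 100                 # queue limit respected at submission time
```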
Run in download-only mode to prepare resources without submitting jobs:
```bash
oellm schedule-eval --models "EleutherAI/pythia-160m" --tasks "hellaswag" --n_shot 0 --download_only True
```