Welcome to the aihpi-cluster workshop! This hands-on tutorial teaches you how to submit and manage distributed training jobs on SLURM clusters using the aihpi package.
By the end of this workshop, you will:
- Submit single-node and multi-node distributed training jobs
- Understand SLURM job configuration and resource allocation
- Use containers for reproducible training environments
- Integrate real ML frameworks like LlamaFactory
- Monitor and debug your training jobs
- Create custom training workflows

Prerequisites:
- Python ≥ 3.8
- Access to a SLURM cluster with Pyxis/Enroot support
- SSH access to the cluster login node
- Basic familiarity with Python and distributed training concepts
# Clone or download this workshop
git clone <workshop-repo-url> aihpi-cluster-workshop
cd aihpi-cluster-workshop
# Run the setup script (installs aihpi + LlamaFactory)
./setup.sh

IMPORTANT: Before running the examples, update the login_node parameter in each example file:
config = JobConfig(
    # ... other settings ...
    login_node="YOUR.LOGIN.NODE.IP",  # Update this!
)

Replace YOUR.LOGIN.NODE.IP with the IP address of your SLURM login node.
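
A forgotten placeholder is an easy mistake to make, so it can help to check the value before building a JobConfig. The following is a minimal sketch, not part of aihpi: `check_login_node` and `PLACEHOLDER` are hypothetical names, and the check assumes login_node is given as an IP address (as in this guide) rather than a hostname.

```python
import ipaddress

# Placeholder string used in the workshop example files.
PLACEHOLDER = "YOUR.LOGIN.NODE.IP"


def check_login_node(value: str) -> str:
    """Return the address unchanged if it is a valid IP, else raise ValueError.

    Hypothetical helper for illustration; aihpi itself may validate differently.
    """
    if value == PLACEHOLDER:
        raise ValueError("login_node still contains the placeholder value")
    # ip_address raises ValueError for anything that is not valid IPv4/IPv6.
    ipaddress.ip_address(value)
    return value


if __name__ == "__main__":
    print(check_login_node("10.130.0.6"))
```

Calling the helper on the value you pass as `login_node` turns a confusing SSH failure into an immediate, readable error.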
Follow the progressive examples:
# Example 1: Single-node job submission
cd examples/
python 01_single_node.py
# Example 2: Multi-node distributed training  
python 02_distributed.py
# Example 3: LlamaFactory integration
python 03_llamafactory.py
# Example 4: Custom job template
python 04_custom_job.py

aihpi-cluster-workshop/
├── setup.sh              # One-command environment setup
├── README.md             # This guide
├── requirements.txt      # Python dependencies
├── examples/             # Progressive learning examples
│   ├── 01_single_node.py # Start here: Basic job submission
│   ├── 02_distributed.py # Multi-node distributed training
│   ├── 03_llamafactory.py # Real LLM training integration
│   ├── 04_custom_job.py  # Template for your own jobs
│   └── configs/          # Example configuration files
│       └── basic_llama_sft.yaml
├── utils/                # Helpful utilities
│   └── monitor.py        # Job monitoring tool
└── LLaMA-Factory/        # Cloned LlamaFactory repo (after setup)
| Example | Topic | Duration | Key Concepts | 
|---|---|---|---|
| 01 | Single-Node Jobs | 15 min | JobConfig, basic submission, monitoring | 
| 02 | Distributed Training | 20 min | Multi-node, environment variables, containers | 
| 03 | LlamaFactory Integration | 25 min | Real ML workflows, workspace mounting | 
| 04 | Custom Jobs | 15 min | Template for your own research | 
Total Time: ~75 minutes
Learn the basics of job submission:
from aihpi import SlurmJobExecutor, JobConfig
config = JobConfig(
    job_name="my-first-job",
    num_nodes=1,
    gpus_per_node=1,
    walltime="00:10:00",
    partition="aisc",
    login_node="10.130.0.6",  # Your login node IP
)
# my_training_function is your own Python callable containing the training code
executor = SlurmJobExecutor(config)
job = executor.submit_function(my_training_function)

Scale to multiple nodes:
config = JobConfig(
    job_name="distributed-training",
    num_nodes=2,              # Multiple nodes!
    gpus_per_node=1,
    walltime="00:15:00",
    partition="aisc",
    login_node="10.130.0.6",
)
# aihpi automatically sets up:
# - MASTER_ADDR, NODE_RANK, WORLD_SIZE
# - Inter-node communication
# - Distributed coordination
executor = SlurmJobExecutor(config)
job = executor.submit_distributed_training(distributed_function)

Real LLM training:
from pathlib import Path

config = JobConfig(
    job_name="llm-training",
    num_nodes=2,
    gpus_per_node=1,
    workspace_mount=Path("./LLaMA-Factory"),
    # ... container and mount configuration
)
executor = SlurmJobExecutor(config)
job = executor.submit_llamafactory_training("configs/basic_llama_sft.yaml")

| Parameter | Description | Example | 
|---|---|---|
| job_name | Unique job identifier | "my-experiment-v1" | 
| num_nodes | Number of compute nodes | 1 (single), 2+ (distributed) | 
| gpus_per_node | GPUs per node | 1, 2, 4, 8 | 
| walltime | Maximum job duration | "01:30:00" (1.5 hours) | 
| partition | SLURM partition/queue | "aisc", "gpu" | 
| login_node | SSH target IP | "10.130.0.6" | 
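
The walltime string is the parameter most likely to be mistyped. As a sketch of how the "HH:MM:SS" form used in this table maps to a duration, the hypothetical helper below (`parse_walltime` is not an aihpi function) converts it to a `timedelta` and rejects malformed values. It only handles the "HH:MM:SS" form shown here; SLURM itself accepts other time formats that this sketch does not cover.

```python
from datetime import timedelta


def parse_walltime(walltime: str) -> timedelta:
    """Convert an "HH:MM:SS" string such as "01:30:00" into a timedelta.

    Hypothetical helper for illustration only.
    """
    parts = walltime.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected HH:MM:SS, got {walltime!r}")
    # int() raises ValueError on non-numeric fields.
    hours, minutes, seconds = (int(part) for part in parts)
    if hours < 0 or not 0 <= minutes < 60 or not 0 <= seconds < 60:
        raise ValueError(f"field out of range in {walltime!r}")
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


if __name__ == "__main__":
    # The table's example: "01:30:00" is 1.5 hours.
    print(parse_walltime("01:30:00").total_seconds() / 3600)
```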
from aihpi import ContainerConfig
config.container = ContainerConfig(
    name="torch2412",                    # Container image
    mounts=[
        "/data:/workspace/data",         # host:container paths
        "/dev/infiniband:/dev/infiniband" # InfiniBand support
    ]
)

config.env_vars = {
    "PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:128",
    "NCCL_DEBUG": "INFO",
    "MY_EXPERIMENT_NAME": "workshop_v1"
}

# Monitor specific job
python utils/monitor.py 12345
# List all your jobs  
python utils/monitor.py --list
# Stream job logs
python utils/monitor.py --logs 12345

# Check job status
squeue -u $USER
# Detailed job info
scontrol show job 12345
# Job history
sacct -j 12345
# Cancel job
scancel 12345

Jobs create logs in logs/aihpi/:
logs/aihpi/
└── workshop-job_12345_2024-09-09_19-30-45/
    ├── stdout.log    # Job output
    ├── stderr.log    # Error messages
    └── submitit.log  # SLURM submission details
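
When a job fails, the quickest check is usually the end of stderr.log in the newest job directory. The sketch below uses only the standard library and assumes the directory layout shown above; `latest_job_dir` and `tail` are hypothetical helper names, not part of aihpi or utils/monitor.py.

```python
from pathlib import Path
from typing import List, Optional


def latest_job_dir(log_root: Path) -> Optional[Path]:
    """Return the most recently modified job directory under log_root, or None."""
    if not log_root.is_dir():
        return None
    dirs = [p for p in log_root.iterdir() if p.is_dir()]
    return max(dirs, key=lambda p: p.stat().st_mtime, default=None)


def tail(path: Path, n: int = 20) -> List[str]:
    """Return the last n lines of a text file (empty list if it is missing)."""
    if not path.is_file():
        return []
    return path.read_text(errors="replace").splitlines()[-n:]


if __name__ == "__main__":
    job_dir = latest_job_dir(Path("logs/aihpi"))
    if job_dir is None:
        print("No job logs found")
    else:
        for line in tail(job_dir / "stderr.log"):
            print(line)
```

Selecting by modification time avoids parsing the timestamp in the directory name, at the cost of being wrong if an older directory was touched later.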
| Problem | Solution | 
|---|---|
| SSH connection failed | Check the login_node IP address | 
| Job stuck in PENDING | Check partition availability: sinfo | 
| Container not found | Verify container name: enroot list | 
| Out of memory | Reduce batch size or increase nodes | 
| Permission denied | Check file permissions and SSH keys | 

Pre-submission checklist:
- login_node IP is correct and accessible via SSH
- Partition exists and you have access (sinfo)
- Container image available (enroot list)
- Paths exist and are accessible from compute nodes
- SSH keys configured for passwordless access
- Resource limits are reasonable for your partition
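
Part of this checklist can be automated. The sketch below is a hypothetical `preflight` helper (not part of aihpi) that reports whether the command-line tools named above are on the PATH and whether the given paths exist. It only inspects the machine it runs on, so it cannot confirm that paths are visible from compute nodes or that you have access to a partition.

```python
import shutil
from pathlib import Path
from typing import Dict, Iterable


def preflight(commands: Iterable[str], paths: Iterable[str]) -> Dict[str, bool]:
    """Map each required command and path to True if it is available locally."""
    results: Dict[str, bool] = {}
    for cmd in commands:
        # shutil.which returns None when the executable is not on PATH.
        results[f"command:{cmd}"] = shutil.which(cmd) is not None
    for p in paths:
        results[f"path:{p}"] = Path(p).exists()
    return results


if __name__ == "__main__":
    # The path below is the workspace used in the LlamaFactory example.
    report = preflight(["sinfo", "enroot", "ssh"], ["./LLaMA-Factory"])
    for item, ok in report.items():
        print(f"{'OK     ' if ok else 'MISSING'} {item}")
```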

Best practices:
- Start Small: Test with 1 node, short walltime
- Monitor Actively: Check logs and resource usage
- Scale Gradually: Increase resources once working
- Use Containers: For reproducible environments
- Meaningful Names: Use descriptive job names
| Training Type | Nodes | GPUs/Node | Walltime | Memory | 
|---|---|---|---|---|
| Debugging | 1 | 1 | 00:15:00 | 16GB | 
| Small Models | 1-2 | 1-2 | 02:00:00 | 32GB | 
| Large Models | 2-8 | 2-4 | 08:00:00 | 64GB+ | 
| Production | 4-16 | 4-8 | 24:00:00 | 128GB+ | 
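
To compare rows of this table, it can help to express a request in GPU-hours: nodes × GPUs per node × walltime in hours. The `gpu_hours` function below is a hypothetical helper for that arithmetic. It gives the maximum a job can consume if it runs for its full walltime, and how your cluster actually accounts for usage may differ.

```python
def gpu_hours(num_nodes: int, gpus_per_node: int, walltime: str) -> float:
    """Upper bound on GPU-hours for a job with an "HH:MM:SS" walltime."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    wall_hours = hours + minutes / 60 + seconds / 3600
    return num_nodes * gpus_per_node * wall_hours


if __name__ == "__main__":
    # Debugging row: 1 node, 1 GPU, 15 minutes.
    print(gpu_hours(1, 1, "00:15:00"))    # 0.25
    # Upper end of the Production row: 16 nodes, 8 GPUs each, 24 hours.
    print(gpu_hours(16, 8, "24:00:00"))   # 3072.0
```

The gap between these two figures is the reason to debug with the smallest configuration before scaling up.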
- Never commit secrets (API keys, tokens) to code
- Use environment variables for sensitive data
- Respect cluster resources - don't waste compute time
- Follow data policies for datasets and models
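
One way to follow the first two points is to read secrets from the environment at runtime and fail early when they are missing. This is a minimal sketch: `require_secret` is a hypothetical helper, and WANDB_API_KEY is used only as an example of a variable name you might need.

```python
import os


def require_secret(name: str) -> str:
    """Return the value of environment variable `name`; raise if unset or empty."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before submitting")
    return value


if __name__ == "__main__":
    try:
        token = require_secret("WANDB_API_KEY")
        # Never print the secret itself; report only that it was found.
        print("Token loaded, length:", len(token))
    except RuntimeError as err:
        print(err)
```

Because the value never appears in the source file, the script can be committed safely.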
After completing the workshop:
- Adapt Examples: Modify templates for your research
- Explore Advanced Features:
  - Experiment tracking (Weights & Biases, MLflow)
  - Custom containers and environments
  - Advanced SLURM configurations
- Join the Community: Share experiences and get help
- Contribute: Submit bug reports and improvements
- Documentation: Check the main aihpi repository README
- Issues: Report bugs on GitHub
- Questions: Ask on discussion forums
- Contact: Reach out to workshop organizers
Congratulations! You now know how to:
- Submit distributed training jobs with aihpi
- Configure SLURM resources effectively
- Monitor and debug your jobs
- Integrate with real ML frameworks
Happy Training!
This workshop was created for the aihpi-cluster project. For more information, visit the main repository.