This guide documents the integration of AutoExperiment with MegatronLM for automated large-scale language model training experiments. The system orchestrates complex multi-phase training workflows including pretraining, cooldown phases, checkpoint conversion, and evaluation - all managed through declarative YAML configurations.
- AutoExperiment: Orchestration framework that manages job scheduling, monitoring, and workflow execution
- MegatronLM: Distributed training framework for large language models
- Slurm: Job scheduler for HPC cluster management
- Helper Scripts: Auxiliary scripts for checkpoint management and conversion monitoring
- Automated Multi-Phase Training: Supports pretraining and cooldown phases with different hyperparameters
- Dynamic Checkpoint Conversion: Automatically converts Megatron checkpoints to HuggingFace format
- Continuous Evaluation: Runs evaluation benchmarks on converted checkpoints
- Parallel Experiment Management: Handles multiple model sizes and hyperparameter configurations simultaneously
- Fault Tolerance: Includes checkpoint tracking and resume capabilities
The primary configuration file defines the entire experimental setup using a hierarchical structure:
# Job scheduling parameters
cmd: "sbatch {sbatch_script}"
check_interval_secs: 600
# Cluster configuration
PARTITION: booster
ACCOUNT: projectnucleus
TIME: 360
NODES: 1
# Experiment definitions
EXPERIMENTS:
- 10M:
    NUM_LAYERS: 5
    HIDDEN_SIZE: 160
    NUM_ATTN_HEADS: 4
- 25M:
    NUM_LAYERS: 9
    HIDDEN_SIZE: 288
    NUM_ATTN_HEADS: 4
# ... more model sizes

Defines multiple model sizes with their architectural parameters:
- Number of layers, hidden size, attention heads
- FFN hidden size (typically 4x hidden size)
- Model naming convention
- Learning rates with sweep capabilities: `[1e-2, 5e-3, 1e-3, 5e-4, 1e-4]`
- Weight decay coupled to the learning rate so that `LR * WD = 1e-4` (see the sketch below)
- Batch sizes `[4, 16, 32]`, with automatic global batch size calculation
- Warmup iterations based on model size and batch size
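A minimal sketch of how that coupled sweep might be written with the `expr()` mechanism documented later in this guide; the `WD` key name is an assumption, the values match the sweep above:

```yaml
LR: [1e-2, 5e-3, 1e-3, 5e-4, 1e-4]
WD: "expr(1e-4 / {LR})"        # keeps LR * WD = 1e-4 for every point of the sweep
MICRO_BATCH_SIZE: [4, 16, 32]
```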
Two-phase training approach:
- PRETRAIN: Main training phase with a WSD (Warmup-Stable-Decay) schedule
- COOLDOWN: Additional training with extended decay for improved performance
Each phase supports multiple operational modes:
- TRAIN: Actual model training
- CONVERT_EVAL: Checkpoint conversion and evaluation
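As a rough illustration (not taken from the actual config), a mode could simply select which SBATCH template gets submitted, following the same named-list pattern used for `CD_SCALE` and `BIAS` later in this guide; the template filenames are placeholders:

```yaml
MODE:
  - TRAIN: {sbatch_script: "train_template.sbatch"}                # placeholder filename
  - CONVERT_EVAL: {sbatch_script: "convert_eval_template.sbatch"}  # placeholder filename
```

The top-level `cmd: "sbatch {sbatch_script}"` would then submit whichever template the active mode provides.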
convert_helper.sh monitors checkpoint directories and identifies unconverted checkpoints ready for processing.
Key Functions:
- Scans checkpoint directories for completed checkpoints
- Tracks conversion status using marker files
- Prevents duplicate conversions with lock files
- Returns count of pending conversions
Usage:
./convert_helper.sh <RUN_DIR> [MIN_ITER]
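A minimal sketch of the scanning logic, assuming the tracking-directory layout described under checkpoint tracking below; the real convert_helper.sh may differ:

```bash
#!/bin/bash
# Illustrative sketch only -- not the actual convert_helper.sh.
shopt -s nullglob
RUN_DIR="$1"; MIN_ITER="${2:-0}"
pending=0
for ckpt in "$RUN_DIR"/checkpoints/iter_*/; do
    iter=$((10#$(basename "$ckpt" | cut -d_ -f2)))                            # iter_0000100 -> 100
    [ "$iter" -lt "$MIN_ITER" ] && continue                                   # below requested minimum
    [ -f "$RUN_DIR/converted_checkpoints/iter_${iter}.done" ] && continue     # already converted
    [ -f "$RUN_DIR/in_progress_checkpoints/iter_${iter}.lock" ] && continue   # conversion in flight
    pending=$((pending + 1))
done
echo "$pending"   # number of checkpoints still waiting for conversion
```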
cooldown_helper.sh manages checkpoint discovery and symlink creation for cooldown phase initialization.

Key Functions:
- Finds appropriate checkpoint from pretraining phase
- Creates symbolic links for cooldown initialization
- Updates latest checkpoint iteration markers
Usage:
./cooldown_helper.sh <LOGS> <EXP_NAME> <ITER> <CD_SCALE>
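A sketch of the symlink step, assuming the cooldown run lives under `cooldown_s<SCALE>/` as shown in the output structure below; illustrative only:

```bash
#!/bin/bash
# Illustrative sketch only -- not the actual cooldown_helper.sh.
LOGS="$1"; EXP_NAME="$2"; ITER="$3"; CD_SCALE="$4"
SRC="$LOGS/$EXP_NAME/checkpoints"
DST="$LOGS/$EXP_NAME/cooldown_s${CD_SCALE}/checkpoints"
mkdir -p "$DST"

ITER_DIR="iter_$(printf '%07d' "$ITER")"                 # e.g. 100 -> iter_0000100
ln -sfn "$SRC/$ITER_DIR" "$DST/$ITER_DIR"                # reuse the pretraining checkpoint
echo "$ITER" > "$DST/latest_checkpointed_iteration.txt"  # tell Megatron where to resume from
```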
SBATCH template for training jobs with containerized execution.

Features:
- Configures distributed training environment
- Sets up NCCL parameters for multi-node communication
- Manages checkpoint paths and tensorboard logging
- Handles container execution with Singularity/Apptainer
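A heavily compressed sketch of what such a template can look like; the NCCL variables, container image, and Megatron arguments are placeholders rather than the exact values used in this setup:

```bash
#!/bin/bash
#SBATCH --partition={PARTITION}
#SBATCH --account={ACCOUNT}
#SBATCH --nodes={NODES}
#SBATCH --time={TIME}

# Rendezvous and NCCL settings for multi-node communication (placeholder values).
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500
export NCCL_SOCKET_IFNAME=ib0

# Containerized launch; image, entry point, and arguments are illustrative only.
srun apptainer exec --nv megatron.sif \
    python pretrain_gpt.py \
        --save "{CHECKPOINT_PATH}" \
        --tensorboard-dir "{TENSORBOARD_PATH}"
        # ...remaining Megatron arguments are filled in from the YAML config
```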
SBATCH template for checkpoint conversion and evaluation.
Features:
- Converts Megatron checkpoints to HuggingFace format
- Processes ALL unconverted checkpoints in sequence
- Runs evaluation benchmarks on converted models
- Maintains conversion tracking state
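A rough sketch of the conversion-and-evaluation loop; the conversion entry point and evaluation command are assumptions, while the directory and marker names follow the output structure documented below:

```bash
#!/bin/bash
# Illustrative sketch only -- not the actual conversion/evaluation template.
shopt -s nullglob
RUN_DIR="$1"
for ckpt in "$RUN_DIR"/checkpoints/iter_*/; do
    iter=$((10#$(basename "$ckpt" | cut -d_ -f2)))
    done_marker="$RUN_DIR/converted_checkpoints/iter_${iter}.done"
    [ -f "$done_marker" ] && continue                    # skip already-converted checkpoints

    # Megatron -> HuggingFace conversion (script name is a placeholder).
    python convert_megatron_to_hf.py \
        --load "$ckpt" --save "$RUN_DIR/converted_hf/iter_${iter}"

    # Evaluation of the converted model (command and task list are placeholders).
    lm_eval --model hf \
        --model_args pretrained="$RUN_DIR/converted_hf/iter_${iter}" \
        --tasks "$TASKS" \
        --output_path "$RUN_DIR/eval_results/iter_${iter}"

    touch "$done_marker"                                 # record the completed conversion
done
```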
AutoExperiment expands the configuration to create all experiment combinations. For example:
MODEL × LR × BATCH_SIZE × BIAS_SETTING × PHASE × MODE

With the values shown above, the five learning rates, three batch sizes, and two bias settings alone expand to 30 combinations per model size.
graph TD
A[Parse Config] --> B[Generate Experiments]
B --> C[Submit PRETRAIN Jobs]
C --> D[Monitor Training]
D --> E{Training Complete?}
E -->|Yes| F[Trigger Conversion]
E -->|No| D
F --> G[Convert Checkpoints]
G --> H[Run Evaluation]
H --> I{Cooldown Needed?}
I -->|Yes| J[Start Cooldown]
I -->|No| K[Complete]
J --> D
The system maintains three tracking directories:
- `checkpoints/`: Raw Megatron checkpoints
- `converted_checkpoints/`: Marker files for completed conversions
- `in_progress_checkpoints/`: Lock files for ongoing conversions
Start Conditions:
- Training jobs start immediately upon submission
- Conversion jobs use `start_condition_cmd` to check for unconverted checkpoints
- Cooldown jobs wait for specific checkpoint iterations
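For example, a conversion job's start condition could call `convert_helper.sh` and treat a non-zero count of pending checkpoints as "ready to start"; this wiring is a guess at the intent, not a verbatim excerpt from the config, and the template variables are assumptions:

```yaml
# Hypothetical wiring; the exact template variables depend on the experiment definition.
start_condition_cmd: "bash convert_helper.sh {RUN_DIR} {MIN_ITER}"
```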
Termination Conditions:
- Training: Checks for "after training is done" in logs
- Conversion: Monitors both conversion completion and training status
The configuration supports computed values using expr():
GLOBAL_BATCH_SIZE: "expr(int(({NUM_GPUS} * {MICRO_BATCH_SIZE} * {GAS})/{TP}))"
LR_WARMUP_ITERS: "expr(min(2*(12*({HIDDEN_SIZE}**2)*{NUM_LAYERS} + {HIDDEN_SIZE}*{VOCAB_SIZE}) // ({SEQ_LENGTH} * {GLOBAL_BATCH_SIZE}), 5000))"

Multiple cooldown durations are supported, each defined by the total number of tokens in the cooldown phase:
CD_SCALE:
- 100M: {TOTAL_TOKENS_THIS_PHASE: 100_000_000}
- 200M: {TOTAL_TOKENS_THIS_PHASE: 200_000_000}
- 1B: {TOTAL_TOKENS_THIS_PHASE: 1_000_000_000}

Tests models with and without bias terms:
BIAS:
- WITH: {DISABLE_BIAS_LINEAR: "", BIAS_NAME: "with"}
- WITHOUT: {DISABLE_BIAS_LINEAR: "--disable-bias-linear", BIAS_NAME: "without"}

Edit scaling_exps.yaml to define your experiments:
EXPERIMENTS:
- 50M:
    NUM_LAYERS: 12
    HIDDEN_SIZE: 384
    NUM_ATTN_HEADS: 6

Configure hyperparameters:
LR: [1e-3, 5e-4]
MICRO_BATCH_SIZE: [16, 32]
TOTAL_TOKENS_NUM: 2_000_000_000

Then launch the run:

autoexperiment build-and-run scaling_exps.yaml

AutoExperiment will:
- Submit all training jobs
- Monitor completion status
- Trigger conversions automatically
- Launch cooldown phases when ready
- Generate evaluation results
LOGS/
├── <DATASET>_<MODEL>_lr<LR>_b1_<BETA1>_b2_<BETA2>_wd<WD>_w<WARMUP>_n<NODES>_bs<BS>__<BIAS>Bias/
│ ├── checkpoints/
│ │ ├── iter_0000100/
│ │ ├── iter_0000200/
│ │ └── latest_checkpointed_iteration.txt
│ ├── converted_hf/
│ │ ├── iter_100/
│ │ └── iter_200/
│ ├── eval_results/
│ │ ├── iter_100/results.json
│ │ └── iter_200/results.json
│ ├── tensorboard/
│ ├── converted_checkpoints/
│ │ ├── iter_100.done
│ │ └── iter_200.done
│ └── cooldown_s<SCALE>/
│ ├── checkpoints/
│ ├── converted_hf/
│ └── eval_results/
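A few quick checks on a run directory, following the layout above; `$RUN_DIR` here simply stands for one experiment directory:

```bash
RUN_DIR=LOGS/<DATASET>_<MODEL>_.../                            # one experiment directory from the tree above
cat "$RUN_DIR/checkpoints/latest_checkpointed_iteration.txt"   # last saved training iteration
ls  "$RUN_DIR/converted_checkpoints/"                          # .done markers for finished conversions
ls  "$RUN_DIR/eval_results/"                                   # iterations that already have eval results
```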
- Checkpoints Not Converting
  - Verify checkpoint completeness (check for `model_optim_rng.pt`)
  - Ensure tracking directories have write permissions
  - Check SBATCH output logs for errors
- Jobs Not Starting (the generic Slurm checks after this list can help here)
  - Verify `start_condition_cmd` returns expected values
  - Check the Slurm queue and resource availability
  - Ensure helper scripts are executable
- Evaluation Failures
  - Confirm HuggingFace model files were copied correctly
  - Verify tokenizer compatibility
- Cooldown Phase Issues
  - Ensure the checkpoint iteration matches the expected format
  - Verify symlink creation permissions
  - Check checkpoint path resolution
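A few generic Slurm commands that are often useful when jobs sit in the queue; these are standard Slurm tools rather than part of this setup:

```bash
squeue -u "$USER"                                        # jobs currently queued or running
scontrol show job <JOBID>                                # why a pending job is not starting (see "Reason=")
sacct -j <JOBID> --format=JobID,State,Elapsed,ExitCode   # state of a finished or failed job
```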
To add a new model size:

- Define the architecture in the `EXPERIMENTS` section
- Update `CONVERSION` parameters if needed
- Modify the HuggingFace config template in the conversion script

To add a new evaluation task:

- Place evaluation configs in `TASKS_PATH`
- Update the `--tasks` parameter in the evaluation command
- Modify output parsing if needed
To add a new dataset, extend the `DATASET` configuration:
DATASET:
- C4: {DATA_PATH: "...", DATASET_NAME: "c4"}
- PILE: {DATA_PATH: "...", DATASET_NAME: "pile"}