96 changes: 31 additions & 65 deletions examples/speculative_decoding/README.md
@@ -73,14 +73,16 @@ This one-line command runs a minimal example workflow of training and exporting
For small base models that fit in GPU memory, we can collocate them with draft models and train with the following command:

```bash
./launch_train.sh --model $BASE_MODEL \
--output_dir $OUTPUT_DIR \
--data input_conversations/train.jsonl \
--num_epochs $NUM_EPOCH \
--eagle_config eagle_config.json
./launch_train.sh \
--config ../../modelopt_recipes/general/speculative_decoding/eagle3.yaml \
model.model_name_or_path=meta-llama/Llama-3.2-1B \
data.data_path=input_conversations/train.jsonl \
training.output_dir=ckpts/llama-3.2-1b-online
```

FSDP2 is used by default. To enable context parallelism for long-context training, specify `--cp_size n`.
All default training settings live in `eagle3.yaml`; override any field via OmegaConf dotlist arguments on the command line.

To enable context parallelism for long-context training, add `training.cp_size=<N>` to the overrides.
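Conceptually, each dotlist override walks the nested config and sets a leaf value. A minimal pure-Python sketch of that behavior, assuming a plain-dict config (`apply_dotlist` is a hypothetical helper for illustration; OmegaConf additionally performs type coercion and validation):

```python
def apply_dotlist(cfg: dict, overrides: list[str]) -> dict:
    """Apply OmegaConf-style dotlist overrides like 'training.cp_size=2' to a nested dict."""
    for item in overrides:
        key, _, value = item.partition("=")
        node = cfg
        *parents, leaf = key.split(".")
        for part in parents:
            node = node.setdefault(part, {})  # descend, creating nested nodes as needed
        node[leaf] = value  # OmegaConf would also coerce the value's type here
    return cfg

cfg = {"training": {"cp_size": 1, "output_dir": "ckpts/default"}}
apply_dotlist(cfg, ["training.cp_size=2", "training.output_dir=ckpts/llama-3.2-1b-online"])
```

This mirrors how the command-line overrides above land in the `eagle3.yaml` structure: the dotted key selects the nested field, and the value replaces the default.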
The saved ModelOpt checkpoint mirrors the architecture of HF models and can be further optimized through **ModelOpt**, e.g., with PTQ and QAT.

## Training Draft Model with Offline Base Model
@@ -113,15 +113,14 @@ python collect_hidden_states/compute_hidden_states_hf.py \

### Train Draft Model with Dumped Hidden States

Once we finish dumping hidden states, launch offline training with an extra `--offline-data` argument:
Once we finish dumping hidden states, launch offline training with `data.offline_data_path` pointing to the hidden states directory:

```bash
./launch_train.sh --model $BASE_MODEL \
--output_dir $OUTPUT_DIR \
--data $DATA \
--num_epochs $NUM_EPOCH \
--eagle_config eagle_config.json \
--offline-data $HIDDEN_STATES_DIR
./launch_train.sh \
--config ../../modelopt_recipes/general/speculative_decoding/eagle3.yaml \
model.model_name_or_path=meta-llama/Llama-3.2-1B \
data.offline_data_path=$HIDDEN_STATES_DIR \
training.output_dir=ckpts/llama-3.2-1b-offline
```

## Model Validation
@@ -244,13 +245,13 @@ For large scale data generation, please see [SLURM prepare data](SLURM_prepare_d

### Configuring Draft Model

For EAGLE‑1 and EAGLE‑3 we provide a [default model architecture config](https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt/torch/speculative/config.py#L37) in ModelOpt. You can override default settings by providing an additional JSON dict. E.g. To use 2-layer eagle with 8192 intermediate size for MLP, set `eagle_config.json` to:
For EAGLE‑1 and EAGLE‑3 we provide a [default model architecture config](https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt/torch/speculative/config.py#L37) in ModelOpt. You can override default settings via `eagle.eagle_architecture_config` in the YAML. E.g. to use a 2-layer EAGLE head with 8192 intermediate size:

```json
{
    "num_hidden_layers": 2,
    "intermediate_size": 8192
}
```

```yaml
eagle:
  eagle_architecture_config:
    num_hidden_layers: 2
    intermediate_size: 8192
```

### Draft Vocabulary Compression
@@ -263,61 +264,26 @@ python scripts/calibrate_draft_vocab.py --model meta-llama/Llama-3.2-1B-Instruct

This will produce a `d2t.pt` file in `save_dir`, which is the mapping from draft token to target token. During inference, draft tokens can be mapped back to target tokens by `target_token = draft_token + d2t[draft_token]`.
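The offset mapping above can be sketched with toy numbers (the token ids below are made up for illustration; a real `d2t.pt` holds a tensor covering the full draft vocabulary):

```python
# Hypothetical calibration result: the target-token ids kept for the draft
# vocab, so draft id i stands for target id kept_target_ids[i].
kept_target_ids = [7, 2, 42, 13]

# d2t[i] = target_id - draft_id, matching target_token = draft_token + d2t[draft_token]
d2t = [t - i for i, t in enumerate(kept_target_ids)]

def draft_to_target(draft_token: int) -> int:
    """Map a compressed draft-vocab token id back to the base model's vocab."""
    return draft_token + d2t[draft_token]
```

Storing offsets rather than absolute ids keeps the lookup a single add at inference time.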

Then, simply set `{"draft_vocab_size":32000}` in `eagle_config.json` and include `--draft_vocab_cache <path_to_d2t.pt>` when running `./launch_train.sh`. The draft model will use this provided vocab table during training and export.
Then, set `eagle.eagle_architecture_config.draft_vocab_size: 32000` and `data.draft_vocab_cache: <path_to_d2t.pt>` in your YAML. The draft model will use this provided vocab table during training and export.
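A sketch of the corresponding YAML fragment, assuming the `eagle3.yaml` schema used above (field names are taken from the surrounding text, not verified against the runtime):

```yaml
eagle:
  eagle_architecture_config:
    draft_vocab_size: 32000
data:
  draft_vocab_cache: <path_to_d2t.pt>
```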

**Contributor review comment** (⚠️ Potential issue | 🟡 Minor): Use the full nested YAML path for `draft_vocab_size`. The runtime schema nests this under `eagle`, so `eagle_architecture_config.draft_vocab_size` reads like a top-level key; use `eagle.eagle_architecture_config.draft_vocab_size` to match the actual config structure.

### Interact with `modelopt.torch.speculative`

`main.py` provides an example for converting a HF base model for speculative decoding and training it. It consists of a few simple steps.
First, load the base model and tokenizer from Hugging Face:

```python
model = transformers.AutoModelForCausalLM.from_pretrained(
    "<path to your pretrained model>"
)
```

Then, load the default eagle config and make necessary overwrites:

```python
# Load default config
config = {
    "eagle1": EAGLE1_DEFAULT_CFG,
    "eagle3": EAGLE3_DEFAULT_CFG,
}[training_args.mode]["config"]

# Overwrite config with custom config
config["eagle_architecture_config"].update({"<overwrite_keys>": "<overwrite_values>"})

# Mandatory: hidden size, vocab size and max position embeddings must match the base model
config["eagle_architecture_config"].update(
    {
        "hidden_size": model.config.hidden_size,
        "vocab_size": model.config.vocab_size,
        "max_position_embeddings": model.config.max_position_embeddings,
    }
)
```

Then, convert the model to a speculative decoding model:

```python
mtsp.convert(model, [("eagle", config)])
```

This modifies the model in-place with the eagle training forward, making it compatible with the HF trainer:

```python
# Create a trainer
trainer = transformers.Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
trainer._move_model_to_device(model, trainer.args.device)

# Enable HF checkpointing so that the saved model will contain the speculative decoding module
mto.enable_huggingface_checkpointing()

trainer.train(resume_from_checkpoint=checkpoint)
trainer.save_state()
trainer.save_model("<path to the output directory>")
```

`main.py` provides a complete example for converting a HF base model for speculative decoding and training it. The core steps are loading the base model, converting it with an eagle config dict, and training with HF Trainer:

```python
import modelopt.torch.speculative as mtsp

# Convert base model in-place to an EAGLE speculative decoding model
eagle_cfg = {"eagle_decoder_type": "llama", ...}  # fields from EagleConfig
mtsp.convert(model, [("eagle", eagle_cfg)])

# Train with HF Trainer as usual
trainer = transformers.Trainer(model=model, ...)
trainer.train()
trainer.save_model("<output_dir>")
```

See `main.py` for the full example including tokenizer setup, dataset loading, and checkpoint handling.

## Support Matrix

2 changes: 0 additions & 2 deletions examples/speculative_decoding/eagle_config.json

This file was deleted.

2 changes: 1 addition & 1 deletion examples/speculative_decoding/eagle_utils.py
@@ -194,7 +194,7 @@ class EagleTrainingPlot(TrainerCallback):

def __init__(self, ar_validate_steps: int = 1000, estimate_ar: bool = False):
self.ar_validate_steps = ar_validate_steps
if wandb and is_master():
if hasattr(wandb, "init") and is_master():
wandb.init()
self.estimate_ar = estimate_ar

1 change: 0 additions & 1 deletion examples/speculative_decoding/fsdp_config.json

This file was deleted.
