56 commits
- 4afc97d: Move FP8 autocast logic from training scripts into OG2 model (savitha-eng, Mar 13, 2026)
- 5e80225: Add Claude Code agent Lepton submission for headless training demo (savitha-eng, Mar 17, 2026)
- 05dc34f: Fix Claude Code auth: use ANTHROPIC_AUTH_TOKEN for NIM gateway (savitha-eng, Mar 17, 2026)
- c1f9d2a: Run Claude Code as non-root user to allow --dangerously-skip-permissions (savitha-eng, Mar 17, 2026)
- ea976ff: Fix su command: add missing -c flag (savitha-eng, Mar 17, 2026)
- 10e4ea4: Fix model ID: use bedrock-claude-opus-4-6 for NIM gateway (savitha-eng, Mar 17, 2026)
- c9cfe38: Add CUDA bin/lib to non-root user PATH for ptxas (savitha-eng, Mar 17, 2026)
- 2058254: Use su without dash to preserve full NVIDIA container environment (savitha-eng, Mar 17, 2026)
- 78d447c: Fix FP8 autocast double-nesting for first/last BF16 layers (savitha-eng, Mar 17, 2026)
- a58c44a: Port layer-wise FP8/FP4 precision control and quant stats to OG2 recipe (savitha-eng, Mar 17, 2026)
- 9375e9e: Add FP8 Precision Agent guide, strategies, baseline, and Nemotron paper (savitha-eng, Mar 17, 2026)
- 6768d44: Update Lepton submission for FP8 Precision Agent with multi-node support (savitha-eng, Mar 17, 2026)
- 780a00f: Add agent infrastructure modules (daemon, analyzer, journal, metrics,… (savitha-eng, Mar 17, 2026)
- a14c53d: Update guide to use in-repo baseline path and concrete NFS defaults (savitha-eng, Mar 17, 2026)
- 09a065e: Add Lepton job to fix checkpoint directory collisions from old runs (savitha-eng, Mar 17, 2026)
- 9f4a0c9: Fix wandb group: compute run_name ONCE at startup, never change acros… (savitha-eng, Mar 17, 2026)
- cef1136: Fix checkpoint bugs and add warm-start support for agent experiments (savitha-eng, Mar 17, 2026)
- ef6bf7b: Fix agent guide: pin grad_acc_steps, remove logger.frequency override (savitha-eng, Mar 17, 2026)
- bb76a40: Align agent guide with Jonathan's latest: max_checkpoints=4, kill-fir… (savitha-eng, Mar 17, 2026)
- ca784ec: Add fl2 warm-start config (10K checkpoint, 1% tolerance) (savitha-eng, Mar 17, 2026)
- 38aef67: Add research_guided warm-start config (fl2 10K, 1% tolerance) (savitha-eng, Mar 18, 2026)
- b4f4fe4: Disable validation in agent CLI template (savitha-eng, Mar 18, 2026)
- 2c503d0: Fix multi-node: workers poll NFS for launch scripts from rank 0 (savitha-eng, Mar 18, 2026)
- cabe055: Switch torchrun to static mode matching proven Lepton scripts (savitha-eng, Mar 18, 2026)
- a047ec3: Add post-launch verification and validation as downstream signal (savitha-eng, Mar 18, 2026)
- 7030255: Add downstream task evaluation script for llama3_native_te recipe (savitha-eng, Mar 18, 2026)
- 12ee5a2: Fix multi-node training: env var passthrough and barrier-based worker… (savitha-eng, Mar 18, 2026)
- 69105c6: Fix git sync to handle divergent branches on NFS (savitha-eng, Mar 18, 2026)
- 3c350b4: Handle fp8_layers as string from Hydra CLI overrides (savitha-eng, Mar 18, 2026)
- a04a060: Fix checkpoint path confusion and add checkpoint validation (savitha-eng, Mar 18, 2026)
- 3632179: Fix CUDA permissions for claude-agent and add debug logging (savitha-eng, Mar 18, 2026)
- cbd3a86: Fix stale launch files causing workers to start before rank 0 (savitha-eng, Mar 18, 2026)
- 4241b1a: Fix pipe buffering that hides Claude Code output for 10-20 minutes (savitha-eng, Mar 18, 2026)
- 96b8340: Fix single-node training: add pre-built $TORCHRUN_PREFIX env var (savitha-eng, Mar 18, 2026)
- c8f8b38: Fix FP8 config: remove incorrect buffer_size override (500k -> inheri… (savitha-eng, Mar 18, 2026)
- 953014b: Fix base config dataset: use parquet2 with 8 workers and 10k buffer (savitha-eng, Mar 18, 2026)
- 7dd2edd: Add torchrun wrapper that forces correct multi-node flags (savitha-eng, Mar 18, 2026)
- 5f1d271: Add 1-node FP8 gradual strategy demo (BF16 baseline + agent) (savitha-eng, Mar 19, 2026)
- 113a0e2: Hardcode all training params in agent guide torchrun template (savitha-eng, Mar 19, 2026)
- bef15f7: Fix batch params: mbs=1, grad_acc=8 (GBS=64 on 1 node) (savitha-eng, Mar 19, 2026)
- cd3653a: Fix baseline resume: symlink 5k checkpoint into ckpt_dir (savitha-eng, Mar 19, 2026)
- 8609e9f: Fix optimizer betas KeyError after checkpoint resume (savitha-eng, Mar 19, 2026)
- 1fb5285: Disable async checkpoint save in BF16 baseline config (savitha-eng, Mar 19, 2026)
- 9bcb664: Disable all checkpoint saving for BF16 baseline (savitha-eng, Mar 19, 2026)
- bfea10b: Add BF16 baseline metrics for 1-node demo (steps 5100-5900) (savitha-eng, Mar 19, 2026)
- 5525f76: Rename baseline to continuous, fresh ckpt dir, no saves (savitha-eng, Mar 19, 2026)
- ac09dff: Update baseline with clean continuous BF16 run (WandB g2vlxphe) (savitha-eng, Mar 19, 2026)
- 5303da6: Port dataloader & WandB fixes to OG2 FP8 agent guides (savitha-eng, Mar 19, 2026)
- 87e333a: Make baseline logfile path configurable in agent prompt (savitha-eng, Mar 19, 2026)
- fcef207: Fix WandB resume: add +wandb.id for run continuity (savitha-eng, Mar 19, 2026)
- 39c7914: Add FL2 FP8 block scaling quant stats config for diagnostics (savitha-eng, Mar 20, 2026)
- 1343edf: Update lepton config for eden submit script format (savitha-eng, Mar 20, 2026)
- d45bdf0: Fix dataset and checkpoint resume for FL2 quant stats run (savitha-eng, Mar 20, 2026)
- fd72837: Match ESM2 proven fp8_debugging_stats config format (savitha-eng, Mar 20, 2026)
- ff7673b: Fix: initialize nvdlfw_inspect before TE model creation (savitha-eng, Mar 20, 2026)
- 5e93843: Add FL4 quant stats diagnostic run configs (savitha-eng, Mar 20, 2026)
562 changes: 562 additions & 0 deletions bionemo-recipes/recipes/llama3_native_te/eval_downstream.py

@@ -0,0 +1,70 @@
# OpenGenome2 7B - FP8 Refactor Branch Test
# Same settings as og2-7b-fp32mw-pq2-cfi-false-lf100 but:
# - On savitha/og2-fp8-refactor branch (FP8 logic moved into model)
# - FP8 enabled on ALL layers (including first/last, no BF16 override)
# - FP32 master weights
# - No CP (standard FSDP2 only)
#
# Data: /data/opengenome2/parquet2
# 6 nodes H100, THD format, GQA, FP8 + FP32 master weights
# GBS = mbs * grad_acc * dp_size = 1 * 8 * 48 = 384
defaults:
- _self_

job_name: "og2-7b-fp8-refactor-all-fp8-fp32mw"
node_group: "yo-bom-lepton-001"
resource_shape: "gpu.8xh100-sxm"

num_nodes: 6
gpus_per_node: 8
num_train_steps: 182314
micro_batch_size: 1
grad_acc_steps: 8

dataset_path: "/data/opengenome2/parquet2"
data_dir: ""
num_workers: 8
buffer_size: 10000

repo_root: "/data/savithas/bionemo-framework"
code_path: "/data/savithas/bionemo-framework/bionemo-recipes/recipes/opengenome2_llama_native_te"
train_script: "train_fsdp2.py"
hydra_config: "og2_7b_thd_gqa"

git_branch: "savitha/og2-fp8-refactor"

validation_enabled: false

spike_no_more_embedding_init: true
skip_embedding_weight_decay: true
use_megatron_scaled_init: true
use_weight_decay_grouping: true
use_meta_device: false

# FP8 enabled on ALL layers (fp8_first_last_bf16 stays false in base config)
fp8_enabled: true
fp8_recipe: transformer_engine.common.recipe.Float8BlockScaling
fp8_format: E4M3
use_fp32_master_weights: true

logger_frequency: 100

checkpoint_dir: "/data/savithas/checkpoints/og2-7b-fp8-refactor-all-fp8-fp32mw" # pragma: allowlist secret
save_every_n_steps: 5000
async_save: false

wandb_project: "llama3-metagenome-7b"
wandb_name: "og2-7b-fp8-refactor-all-fp8-fp32mw"
wandb_secret: "wandb.savithas" # pragma: allowlist secret

hf_secret: "HUGGING_FACE_HUB_TOKEN.savithas" # pragma: allowlist secret

exclude_nodes:
- node-ip-10-50-80-195
- node-ip-10-50-81-231
- nvidia-lepton093
- nvidia-lepton007

container:
image: "nvcr.io/nvidia/pytorch:25.11-py3"
registry_auth: "lepton-nvidia-cvai-bnmo-trng"
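
The batch-size arithmetic stated in the config header (GBS = mbs * grad_acc * dp_size) can be sanity-checked with a small sketch. The helper `global_batch_size` below is hypothetical, not part of the recipe; it assumes pure data parallelism (standard FSDP2 with no context or tensor parallelism), which is what the config states, so the data-parallel size is simply num_nodes * gpus_per_node.

```python
def global_batch_size(micro_batch_size: int, grad_acc_steps: int,
                      num_nodes: int, gpus_per_node: int) -> int:
    """Effective global batch size under pure data parallelism (no CP/TP)."""
    dp_size = num_nodes * gpus_per_node  # every GPU is a data-parallel rank
    return micro_batch_size * grad_acc_steps * dp_size

# Values from the config above: mbs=1, grad_acc=8, 6 nodes x 8 H100s
print(global_batch_size(1, 8, 6, 8))  # 384, matching the header comment
```

The same formula explains the 1-node demo commit (bef15f7): with mbs=1, grad_acc=8, and a single 8-GPU node, the GBS drops to 1 * 8 * 8 = 64.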