
[DRAFT - do not review] Savitha/claude lepton dev experiment #1525

Draft
savitha-eng wants to merge 45 commits into main from savitha/claude-lepton-dev-experiment

Conversation

@savitha-eng
Collaborator

Description

Usage

TODO: Add code snippet

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

  • ciflow:skip - Skip all CI tests for this PR
  • ciflow:notebooks - Run Jupyter notebooks execution tests for bionemo2
  • ciflow:slow - Run slow single GPU integration tests marked as @pytest.mark.slow for bionemo2
  • ciflow:all - Run all tests (unit tests, slow tests, and notebooks) for bionemo2. This label can be used to enforce running tests for all bionemo2.
  • ciflow:all-recipes - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.

Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline.

For more details, see CONTRIBUTING

Note

By default, only basic unit tests are run. Add the appropriate labels to enable additional test coverage.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

  • If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
    automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
  • If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
    /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Triggering CodeRabbit AI Review

To trigger a code review from CodeRabbit, comment on the pull request with one of its review commands. See https://docs.coderabbit.ai/reference/review-commands for the full list of commands.

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

savitha-eng and others added 8 commits March 13, 2026 23:12
Port the ESM2/llama3 FP8 refactor pattern to OG2: the model now handles
te.autocast and quantized_model_init internally via get_autocast_context(),
so training scripts just call model(**batch) without external FP8 wrappers.

- Add layer_precision and use_quantized_model_init to NVLlamaConfig
- Add get_autocast_context() to NVLlamaModel for per-layer FP8 control
- Pass fp8_recipe to model constructor in train_fsdp2.py and train_fsdp2_cp.py
- Remove external te.autocast from forward pass in both training scripts
- Remove quantized_model_init_kwargs from hydra configs (model handles it)
- Remove te.fp8_autocast wrapper from evaluate_fasta_lm_loss.py
- Add Lepton config for all-FP8 + FP32 master weights experiment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
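A minimal sketch of the pattern this commit describes, using a hypothetical simplified model class (the real NVLlamaModel would wrap TransformerEngine's te.autocast and quantized_model_init inside get_autocast_context(); here the context only records whether quantization is active):

```python
from contextlib import contextmanager

class Model:
    """Hypothetical stand-in for NVLlamaModel: the FP8 recipe is passed to
    the constructor, and the model owns its own autocast context."""

    def __init__(self, fp8_recipe=None):
        self.fp8_recipe = fp8_recipe
        self.quantized = False

    @contextmanager
    def get_autocast_context(self):
        # Real code would yield te.autocast(...) built from self.fp8_recipe;
        # this sketch just records whether FP8 is active.
        self.quantized = self.fp8_recipe is not None
        yield

    def __call__(self, **batch):
        # Training scripts call model(**batch) with no external FP8 wrapper.
        with self.get_autocast_context():
            return {"quantized": self.quantized, "n_inputs": len(batch)}
```

The point of the refactor is that train_fsdp2.py and train_fsdp2_cp.py no longer need to know anything about FP8; they construct the model with a recipe and call it.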
Adds submit script, agent prompt template, and Lepton job config
to run Claude Code autonomously on a GPU node for training.

Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
The NVIDIA LLM gateway requires ANTHROPIC_AUTH_TOKEN (not ANTHROPIC_API_KEY).
Also fix model ID to match gateway format.

Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
Claude Code blocks --dangerously-skip-permissions when running as root.
Create a non-root user and use a wrapper script to avoid quoting issues.

Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
su - (login shell) resets all env vars, losing CUDA, NCCL, HPC-X, etc.
su (no dash) preserves the root environment so the non-root user inherits
everything the NVIDIA container set up.

Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Mar 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Mar 17, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ea3fa6d5-c6aa-4887-9a9d-7f6ef09a78db

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Comment @coderabbitai help to get the list of available commands and usage tips.

savitha-eng and others added 18 commits March 17, 2026 06:54
Align with Jonathan's pattern: outer te.autocast enables FP8 globally,
per-layer get_layer_autocast returns nullcontext for FP8 layers (outer
takes effect) and te.autocast(enabled=False) for BF16 layers (clean
override). The previous approach double-nested te.autocast(enabled=True),
which could corrupt TE's internal autocast state.

Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
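The per-layer selection logic this commit describes can be sketched as follows. This is a simplified illustration, not the repo's actual helper: in the real code the BF16 branch returns te.autocast(enabled=False), which is injected here as a parameter so the sketch stays TransformerEngine-free:

```python
from contextlib import nullcontext

def get_layer_autocast(layer_idx, fp8_layers, make_disabled_autocast):
    """Pick the per-layer context under an outer FP8 te.autocast.

    FP8 layers get nullcontext() so the outer autocast takes effect;
    BF16 layers get a disabling context (te.autocast(enabled=False)
    in the real code) as a clean override, with no double nesting.
    """
    if layer_idx in fp8_layers:
        return nullcontext()
    return make_disabled_autocast()
```

The design choice is that exactly one te.autocast(enabled=True) is ever active, which avoids the state-corruption issue the commit mentions.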
- Replace fp8_first_last_bf16 config with fp8_layers/fp4_layers lists
- Add resolve_layer_precision() in new quantization.py module
- Add FP4 recipe support (NVLlamaConfig, NVLlamaModel, NVLlamaForCausalLM)
- Add set_recipes() for post-FSDP recipe attachment
- Rename fp8_stats_config to quant_stats_config with initialize_quant_stats_logging()
- Update perf_logger.py to handle both old and new config names
- Add fp8_debugging_stats.yaml for TE debug feature config
- Add test_quantization.py with comprehensive tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- OG2_FP8_AGENT_GUIDE.md: Complete agent specification for OG2 FP8 Block Scaling
  (adapted from Jonathan's NVFP4 guide for ESM2)
- OG2_STRATEGY_ENDS_IN.md: Demote from both ends inward strategy
- OG2_STRATEGY_TAIL_IN.md: Demote from output end toward head strategy
- baseline_bf16.json: BF16 baseline metrics (1823 steps from WandB run 8mfsb27t)
- extract_baseline_metrics.py: Script to extract baseline from WandB
- references/NVIDIA-Nemotron-3-Super-Technical-Report.pdf: Research paper

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rewrite claude_agent_prompt.txt to reference OG2_FP8_AGENT_GUIDE.md
  instead of hardcoding a simple training command
- Update submit_claude_agent_lepton.py for multi-node: rank 0 runs Claude,
  other ranks wait for torchrun connections
- Add og2_fp8_agent.yaml config (6-node OG2-7B, 182K steps, ends_in strategy)
- Update claude_agent_demo.yaml with agent config fields

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… control plane)

These modules support the agent's runtime infrastructure for monitoring,
intervention, and metrics collection during FP8 precision training.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- BASELINE_LOGFILE now defaults to ./baseline_bf16.json (co-located)
- WORKSPACE_ROOT/RESULTS_FOLDER use concrete /data/savithas/ NFS paths

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Moves old run checkpoints (15K-35K) out of the way so current runs
can save their 10K checkpoints cleanly without max_checkpoints rotation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…s relaunches

Aligns with Jonathan's fix — wandb group was ambiguous about whether it
gets recomputed on each relaunch. Now explicitly states it's computed once.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- checkpoint.max_checkpoints: 5 -> 2 (avoid old checkpoint collisions)
- checkpoint.ckpt_dir and resume_from_checkpoint marked FIXED (never change)
- Recovery flow now explicitly deletes checkpoints newer than LKG
- Checkpoints stored at /data/savithas/checkpoints/<run_name> (matches job/wandb name)
- Add warm_start config block to prompt template and submission script
- New lepton config: og2_fp8_agent_fl4_warmstart.yaml (fl4 5K checkpoint, ends_in round 4)
- Agent prompt dynamically builds warm-start or fresh-start section from YAML config

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add grad_acc_steps=$GRAD_ACC_STEPS to CLI template (prevents agent from scaling it)
- Remove logger.frequency override (Hydra config has frequency=1, logs every step)
- Add CRITICAL note: agent must use template EXACTLY, do NOT add/modify params
- List grad_acc_steps in FIXED fields section

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…st-failure rule

- checkpoint.max_checkpoints: 2 -> 4 (buffer so LKG isn't auto-deleted before agent acts)
- Add "kill IMMEDIATELY on FIRST failure" rule for multi-step check-in processing
- Add hydra.run.dir to keep Hydra outputs organized
- Recovery step 4: explicit "do NOT change num_train_steps" note

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- New config og2_fp8_agent_fl2_warmstart.yaml: layers 1-2 and 31-32 in
  BF16, layers 3-30 in FP8, demotion_round=2, lkg_step=10000
- Make TOLERANCE_PCT configurable via prompt template (default 5%, fl2
  uses 1%)
- Add explicit BF16/FP8 layer listing to warm-start prompt section so
  the agent knows exactly which layers are in which precision

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Same fl2 checkpoint as ends_in experiment but using research_guided
strategy — agent uses runtime quant stats (underflow %, MSE) to decide
demotion order instead of fixed geometric pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Validation was enabled in the Hydra config (og2_7b_thd_gqa.yaml) and was
adding unnecessary overhead to agent runs. Add validation.enabled=false
to the fixed CLI parameters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Workers were sleeping instead of running torchrun, causing training to
use only 1 node (8 GPUs) instead of 6 nodes (48 GPUs).

Fix: rank 0 (Claude agent) writes numbered launch scripts to
$LAUNCH_DIR/<N>.sh on NFS before each torchrun invocation. Worker nodes
poll this directory every 5 seconds and execute the same command. When
rank 0 kills training, workers exit via NCCL timeout and poll for the
next script.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
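The worker side of the NFS handshake described above can be sketched like this (a hypothetical helper, not the repo's actual script, which is shell-based; file naming follows the $LAUNCH_DIR/&lt;N&gt;.sh convention from the commit message):

```python
import time
from pathlib import Path

def next_launch_script(launch_dir, round_n, poll_s=5, max_polls=None):
    """Poll $LAUNCH_DIR for the numbered launch script of this round.

    Rank 0 writes <round_n>.sh before each torchrun invocation; workers
    poll every poll_s seconds and run the same command when it appears.
    Returns the script text, or None if max_polls is exhausted.
    """
    target = Path(launch_dir) / f"{round_n}.sh"
    polls = 0
    while not target.exists():
        if max_polls is not None and polls >= max_polls:
            return None  # give up (useful for testing; real workers wait forever)
        time.sleep(poll_s)
        polls += 1
    return target.read_text()
```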
Use --node_rank=$NODE_RANK --master_addr --master_port instead of
rdzv mode. This matches submit_og2_lepton_eden.py which has been
running multi-node training successfully. Also clarify in the
Multi-Node Launch Protocol that workers must use single-quoted
heredocs to preserve $NODE_RANK/$MASTER_ADDR as variables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
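The static-rendezvous launch this commit switches to can be sketched as command construction (the flags are torchrun's documented multi-node options; the helper itself is illustrative, not code from the repo):

```python
def build_torchrun_cmd(nnodes, node_rank, master_addr, master_port,
                       nproc_per_node, script, *script_args):
    """Build a torchrun invocation using static --master_addr/--master_port
    rendezvous instead of rdzv mode. NODE_RANK and MASTER_ADDR come from
    the worker environment, which is why worker heredocs must be
    single-quoted so the variables are expanded at run time, not at
    script-generation time."""
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        f"--nproc_per_node={nproc_per_node}",
        script,
        *script_args,
    ]
```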
- Mandatory post-launch self-check: agent verifies GPU count (48),
  grad_acc_steps (8), effective batch size, and resume step. If wrong,
  agent kills and restarts immediately.
- Re-enable validation at 1000-step intervals as a downstream quality
  signal (FP8 paper notes training loss can diverge without hurting
  downstream tasks). Validation is informational only — does not
  trigger rollbacks. Failures are caught by try/except in training.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds eval_downstream.py that runs lm-eval benchmarks (arc_challenge,
arc_easy, boolq, copa, hellaswag, piqa, winogrande) on trained Lingua
1B checkpoints. Supports safetensors, distributed FSDP2 (DCP), and DDP
checkpoint formats. Self-contained checkpoint loading avoids TE version
compatibility issues with checkpoint.py imports.

Made-with: Cursor
savitha-eng force-pushed the savitha/claude-lepton-dev-experiment branch from d37f8c9 to 7030255 on March 18, 2026 01:23
… sync

Two bugs prevented multi-node training:
1. Env vars (MASTER_ADDR, NODE_RANK, NNODES) were lost at the `su claude-agent`
   boundary. Now written to /tmp/training_env.sh and sourced by Claude's wrapper.
2. Workers used independent launch script counters that desynced after kills.
   Now use barrier-based rounds (round_N_ready files) so all workers start together.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
savitha-eng and others added 18 commits March 18, 2026 02:34
Replace `git pull` with `git reset --hard origin/<branch>` so that
force-pushed branches sync correctly on the NFS deploy target.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When fp8_layers is passed via CLI (e.g. fp8_layers='[3,4,...,30]'),
Hydra may parse it as a string instead of a list. Use ast.literal_eval
as fallback to handle both OmegaConf lists and string representations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
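The fallback this commit describes amounts to roughly the following (a sketch of the idea, assuming the hypothetical helper name coerce_layer_list; the real code lives in the quantization module):

```python
import ast

def coerce_layer_list(value):
    """Normalize fp8_layers whether Hydra hands over a list or a string.

    A CLI override like fp8_layers='[3,4,30]' may arrive as the literal
    string "[3,4,30]" rather than a list; fall back to ast.literal_eval,
    which safely parses Python literals without executing code.
    """
    if isinstance(value, str):
        value = ast.literal_eval(value)
    return [int(x) for x in value]
```

ast.literal_eval is the usual safe choice here because, unlike eval, it only accepts literal expressions.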
The Claude agent was using WORKSPACE_ROOT (/data/savithas/agent_runs) instead
of CHECKPOINT_ROOT (/data/savithas/checkpoints) for checkpoint.ckpt_dir,
causing dcp_load to fail with "Connection closed by peer" during warm-start.

Changes:
- Agent prompt: add explicit CHECKPOINT_ROOT vs WORKSPACE_ROOT warning
- Warm-start instructions: use resolved paths instead of $CHECKPOINT_ROOT
- checkpoint.py: validate checkpoint before dcp_load (resolve symlinks,
  check .metadata exists) for clearer error messages
- Guide: warn that checkpoint.ckpt_dir must NOT use WORKSPACE_ROOT

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rank 0's torchrun was crashing at init_process_group because the
claude-agent user (non-root) likely couldn't access /dev/nvidia* devices.

Changes:
- chmod a+rw /dev/nvidia* and /dev/infiniband/* before su claude-agent
- Add claude-agent to video group for GPU access
- Add CUDA sanity check in wrapper (python3 torch.cuda.is_available())
- Log Claude Code output to NFS ($WORKSPACE_ROOT/claude_agent_output.log)
  via tee so we can debug rank 0 issues
- Ensure checkpoint_root and code_path permissions are correct

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The launch directory on NFS persists between jobs with the same job_name.
Old round_1_ready and round_1_args.env files from previous runs caused
workers to immediately start torchrun with stale args before rank 0's
Claude agent had started, leading to init_process_group failures.

Fix: rank 0 cleans old round_*/done files from the launch dir at startup.
Workers wait 5 seconds for cleanup to complete before entering poll loop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace `claude ... | tee $LOG` with direct file redirect + tail -f.
The pipe caused 4KB block buffering, so no output appeared until
enough text accumulated. Now output goes directly to the log file
and tail -f streams it to container logs in real time.

Also adds timestamps and prompt size logging for diagnostics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude was constructing its own torchrun command with --nnodes=1, resulting
in 8 GPUs instead of 48. Fix by exporting a pre-built TORCHRUN_PREFIX env
var with the correct multi-node flags (nnodes, node_rank, master_addr,
master_port) and instructing the agent to always use it.

Changes:
- submit script: add TORCHRUN_PREFIX to /tmp/training_env.sh
- agent guide: replace manual torchrun flag construction with $TORCHRUN_PREFIX
- agent prompt: add TORCHRUN_PREFIX to environment section

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…t 50k)

The FP8 config was overriding buffer_size to 500,000 instead of
inheriting the base config's 50,000. This caused excessive memory
usage during streaming dataset shuffling. Now inherits from
og2_7b_thd_gqa base config (buffer_size=50,000, num_workers=1).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch from JSON metagenomes to pre-chunked parquet2 shards to match
the reference BF16 baseline config (og2_bf16_baseline_metrics). Also
corrects num_workers (1→8) and buffer_size (50k→10k).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the real torchrun binary on rank 0 with a wrapper that strips
any --nnodes/--node_rank/--master_addr/--master_port/--nproc_per_node
flags and injects the correct values from the environment. This makes
multi-node training bulletproof: even if Claude constructs "torchrun
--nnodes=1", the wrapper corrects it to --nnodes=$NNODES.

The wrapper logs stripped/injected flags to stderr for debugging.
Workers are unaffected (separate containers).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
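The flag-correction logic of the wrapper can be sketched in Python (the real wrapper is presumably a shell shim around the torchrun binary; this sketch handles only the `--flag=value` form, and the NNODES/NODE_RANK/MASTER_ADDR/MASTER_PORT env var names are taken from this PR's submit script):

```python
def correct_torchrun_args(argv, env):
    """Drop any caller-supplied multi-node flags and inject the correct
    values from the environment, so even a 'torchrun --nnodes=1' from the
    agent becomes a proper multi-node launch."""
    managed = ("--nnodes", "--node_rank", "--master_addr",
               "--master_port", "--nproc_per_node")
    # Keep everything except the managed flags (handles --flag=value form).
    kept = [a for a in argv if not a.startswith(managed)]
    injected = [
        f"--nnodes={env['NNODES']}",
        f"--node_rank={env['NODE_RANK']}",
        f"--master_addr={env['MASTER_ADDR']}",
        f"--master_port={env['MASTER_PORT']}",
        f"--nproc_per_node={env.get('NPROC_PER_NODE', '8')}",
    ]
    return injected + kept
```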
New files:
- OG2_FP8_1NODE_DEMO_GUIDE.md: Agent guide for gradual FP8 expansion from
  center outward, adapted from NVFP4 guide on og2-fp8-refactor branch
- hydra_config/og2_7b_bf16_1k_from_5k.yaml: BF16 baseline config resuming
  from 5k checkpoint, 1 node (mbs=2, grad_acc=4, GBS=64)
- submit_training_lepton.py: Simple Lepton job submission for non-agent runs
- lepton_configs/og2_bf16_baseline_1node.yaml: Lepton config for BF16 baseline
- lepton_configs/og2_fp8_agent_1node_demo.yaml: Lepton config for agent demo

Modified:
- claude_agent_prompt.txt: Configurable guide filename, single-node instructions
- submit_claude_agent_lepton.py: Gradual strategy warm-start support

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Explicitly pass dataset.buffer_size=10000, dataset.num_workers=8,
dataset.micro_batch_size=2, grad_acc_steps=4, and all other fixed
values as CLI args so Claude cannot use wrong defaults. Only
fp8_config.enabled, fp8_layers, and wandb.name remain as agent-
controlled placeholders.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
micro_batch_size=2 causes OOM on 1 node. Use mbs=1 with
grad_acc_steps=8 to keep GBS=64 (1 × 8 × 8 GPUs).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
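The batch-size arithmetic behind this change, written out (standard data-parallel accounting, no repo-specific assumptions):

```python
def global_batch_size(micro_batch_size, grad_acc_steps, num_gpus):
    """GBS = micro batch size x gradient accumulation steps x
    data-parallel ranks. Halving mbs and doubling grad_acc keeps
    GBS fixed while lowering peak memory."""
    return micro_batch_size * grad_acc_steps * num_gpus
```

So mbs=2 with grad_acc=4 and mbs=1 with grad_acc=8 both give GBS=64 on one 8-GPU node.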
The baseline job was training from scratch because the checkpoint
directory was empty. Add resume_from config + symlink setup in the
container script so the 5k BF16 checkpoint is available for resume.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Save optimizer param group hyperparameters (betas, eps, etc.) before
set_state_dict and re-inject them after, fixing KeyError with newer
PyTorch versions. Ported from savitha/og2-fp8-refactor (f5f1949, 83511af).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
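The save/re-inject pattern this commit ports can be sketched with plain dicts standing in for torch optimizer param_groups (function names here are illustrative, not the repo's):

```python
def snapshot_param_group_hparams(param_groups,
                                 keys=("betas", "eps", "weight_decay")):
    """Capture per-group hyperparameters before set_state_dict."""
    return [{k: g[k] for k in keys if k in g} for g in param_groups]

def restore_param_group_hparams(param_groups, saved):
    """Re-inject hyperparameters that set_state_dict dropped.

    setdefault only fills missing keys, so values that survived the
    state-dict round trip are left untouched.
    """
    for group, keep in zip(param_groups, saved):
        for k, v in keep.items():
            group.setdefault(k, v)
    return param_groups
```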
Training keeps crashing at checkpoint save boundaries (step 5500, 5600)
with async_save=true. Switching to synchronous saves for stability.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Training crashes at every DCP checkpoint save boundary (5500, 5600, 5700)
regardless of sync/async mode. For the baseline we only need WandB metrics,
not the checkpoints. Disable saving entirely to let it run to 6000.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Metrics extracted from 4 WandB runs (bmmijgdt, 9a1cn5ze, 0oyfzc25,
54o6kypu) spanning steps 5001-5999. 9 entries at 100-step intervals
covering warmup and active phase for the FP8 agent demo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>