
[DRAFT - do not review] Savitha/claude lepton dev experiment #1525

Draft
savitha-eng wants to merge 45 commits into main from savitha/claude-lepton-dev-experiment

Conversation

@savitha-eng
Collaborator

Description

Usage

TODO: Add code snippet

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

  • ciflow:skip - Skip all CI tests for this PR
  • ciflow:notebooks - Run Jupyter notebooks execution tests for bionemo2
  • ciflow:slow - Run slow single GPU integration tests marked as @pytest.mark.slow for bionemo2
  • ciflow:all - Run all tests (unit tests, slow tests, and notebooks) for bionemo2. This label can be used to enforce running tests for all bionemo2.
  • ciflow:all-recipes - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.

Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline.

For more details, see CONTRIBUTING

Note

By default, only basic unit tests are run. Add the appropriate labels to enable additional test coverage.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

  • If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
    automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
  • If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
    /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Triggering CodeRabbit AI Review

To trigger a code review from CodeRabbit, comment on the pull request with one of its review commands. See https://docs.coderabbit.ai/reference/review-commands for the full list of commands.

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

savitha-eng and others added 8 commits March 13, 2026 23:12
Port the ESM2/llama3 FP8 refactor pattern to OG2: the model now handles
te.autocast and quantized_model_init internally via get_autocast_context(),
so training scripts just call model(**batch) without external FP8 wrappers.

- Add layer_precision and use_quantized_model_init to NVLlamaConfig
- Add get_autocast_context() to NVLlamaModel for per-layer FP8 control
- Pass fp8_recipe to model constructor in train_fsdp2.py and train_fsdp2_cp.py
- Remove external te.autocast from forward pass in both training scripts
- Remove quantized_model_init_kwargs from hydra configs (model handles it)
- Remove te.fp8_autocast wrapper from evaluate_fasta_lm_loss.py
- Add Lepton config for all-FP8 + FP32 master weights experiment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
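A minimal sketch of the pattern this commit describes, using a hypothetical simplified model class (the real NVLlamaModel would wrap TransformerEngine's te.autocast and quantized_model_init inside get_autocast_context(); here the context only records whether quantization is active):

```python
from contextlib import contextmanager

class Model:
    """Hypothetical stand-in for NVLlamaModel: the FP8 recipe is passed to
    the constructor, and the model owns its own autocast context."""

    def __init__(self, fp8_recipe=None):
        self.fp8_recipe = fp8_recipe
        self.quantized = False

    @contextmanager
    def get_autocast_context(self):
        # Real code would yield te.autocast(...) built from self.fp8_recipe;
        # this sketch just records whether FP8 is active.
        self.quantized = self.fp8_recipe is not None
        yield

    def __call__(self, **batch):
        # Training scripts call model(**batch) with no external FP8 wrapper.
        with self.get_autocast_context():
            return {"quantized": self.quantized, "n_inputs": len(batch)}
```

The point of the refactor is that train_fsdp2.py and train_fsdp2_cp.py no longer need to know anything about FP8; they construct the model with a recipe and call it.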
Adds submit script, agent prompt template, and Lepton job config
to run Claude Code autonomously on a GPU node for training.

Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
The NVIDIA LLM gateway requires ANTHROPIC_AUTH_TOKEN (not ANTHROPIC_API_KEY).
Also fix model ID to match gateway format.

Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
Claude Code blocks --dangerously-skip-permissions when running as root.
Create a non-root user and use a wrapper script to avoid quoting issues.

Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
su - (login shell) resets all env vars, losing CUDA, NCCL, HPC-X, etc.
su (no dash) preserves the root environment so the non-root user inherits
everything the NVIDIA container set up.

Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Mar 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Mar 17, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ea3fa6d5-c6aa-4887-9a9d-7f6ef09a78db

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Comment @coderabbitai help to get the list of available commands and usage tips.

savitha-eng and others added 18 commits March 17, 2026 06:54
Align with Jonathan's pattern: outer te.autocast enables FP8 globally,
per-layer get_layer_autocast returns nullcontext for FP8 layers (outer
takes effect) and te.autocast(enabled=False) for BF16 layers (clean
override). The previous approach double-nested te.autocast(enabled=True),
which could corrupt TE's internal autocast state.

Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
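The per-layer selection logic this commit describes can be sketched as follows. This is a simplified illustration, not the repo's actual helper: in the real code the BF16 branch returns te.autocast(enabled=False), which is injected here as a parameter so the sketch stays TransformerEngine-free:

```python
from contextlib import nullcontext

def get_layer_autocast(layer_idx, fp8_layers, make_disabled_autocast):
    """Pick the per-layer context under an outer FP8 te.autocast.

    FP8 layers get nullcontext() so the outer autocast takes effect;
    BF16 layers get a disabling context (te.autocast(enabled=False)
    in the real code) as a clean override, with no double nesting.
    """
    if layer_idx in fp8_layers:
        return nullcontext()
    return make_disabled_autocast()
```

The design choice is that exactly one te.autocast(enabled=True) is ever active, which avoids the state-corruption issue the commit mentions.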
- Replace fp8_first_last_bf16 config with fp8_layers/fp4_layers lists
- Add resolve_layer_precision() in new quantization.py module
- Add FP4 recipe support (NVLlamaConfig, NVLlamaModel, NVLlamaForCausalLM)
- Add set_recipes() for post-FSDP recipe attachment
- Rename fp8_stats_config to quant_stats_config with initialize_quant_stats_logging()
- Update perf_logger.py to handle both old and new config names
- Add fp8_debugging_stats.yaml for TE debug feature config
- Add test_quantization.py with comprehensive tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- OG2_FP8_AGENT_GUIDE.md: Complete agent specification for OG2 FP8 Block Scaling
  (adapted from Jonathan's NVFP4 guide for ESM2)
- OG2_STRATEGY_ENDS_IN.md: Demote from both ends inward strategy
- OG2_STRATEGY_TAIL_IN.md: Demote from output end toward head strategy
- baseline_bf16.json: BF16 baseline metrics (1823 steps from WandB run 8mfsb27t)
- extract_baseline_metrics.py: Script to extract baseline from WandB
- references/NVIDIA-Nemotron-3-Super-Technical-Report.pdf: Research paper

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rewrite claude_agent_prompt.txt to reference OG2_FP8_AGENT_GUIDE.md
  instead of hardcoding a simple training command
- Update submit_claude_agent_lepton.py for multi-node: rank 0 runs Claude,
  other ranks wait for torchrun connections
- Add og2_fp8_agent.yaml config (6-node OG2-7B, 182K steps, ends_in strategy)
- Update claude_agent_demo.yaml with agent config fields

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… control plane)

These modules support the agent's runtime infrastructure for monitoring,
intervention, and metrics collection during FP8 precision training.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- BASELINE_LOGFILE now defaults to ./baseline_bf16.json (co-located)
- WORKSPACE_ROOT/RESULTS_FOLDER use concrete /data/savithas/ NFS paths

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Moves old run checkpoints (15K-35K) out of the way so current runs
can save their 10K checkpoints cleanly without max_checkpoints rotation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…s relaunches

Aligns with Jonathan's fix — wandb group was ambiguous about whether it
gets recomputed on each relaunch. Now explicitly states it's computed once.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- checkpoint.max_checkpoints: 5 -> 2 (avoid old checkpoint collisions)
- checkpoint.ckpt_dir and resume_from_checkpoint marked FIXED (never change)
- Recovery flow now explicitly deletes checkpoints newer than LKG
- Checkpoints stored at /data/savithas/checkpoints/<run_name> (matches job/wandb name)
- Add warm_start config block to prompt template and submission script
- New lepton config: og2_fp8_agent_fl4_warmstart.yaml (fl4 5K checkpoint, ends_in round 4)
- Agent prompt dynamically builds warm-start or fresh-start section from YAML config

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add grad_acc_steps=$GRAD_ACC_STEPS to CLI template (prevents agent from scaling it)
- Remove logger.frequency override (Hydra config has frequency=1, logs every step)
- Add CRITICAL note: agent must use template EXACTLY, do NOT add/modify params
- List grad_acc_steps in FIXED fields section

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…st-failure rule

- checkpoint.max_checkpoints: 2 -> 4 (buffer so LKG isn't auto-deleted before agent acts)
- Add "kill IMMEDIATELY on FIRST failure" rule for multi-step check-in processing
- Add hydra.run.dir to keep Hydra outputs organized
- Recovery step 4: explicit "do NOT change num_train_steps" note

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- New config og2_fp8_agent_fl2_warmstart.yaml: layers 1-2 and 31-32 in
  BF16, layers 3-30 in FP8, demotion_round=2, lkg_step=10000
- Make TOLERANCE_PCT configurable via prompt template (default 5%, fl2
  uses 1%)
- Add explicit BF16/FP8 layer listing to warm-start prompt section so
  the agent knows exactly which layers are in which precision

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Same fl2 checkpoint as ends_in experiment but using research_guided
strategy — agent uses runtime quant stats (underflow %, MSE) to decide
demotion order instead of fixed geometric pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Validation was enabled in the Hydra config (og2_7b_thd_gqa.yaml) and was
adding unnecessary overhead to agent runs. Add validation.enabled=false
to the fixed CLI parameters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Workers were sleeping instead of running torchrun, causing training to
use only 1 node (8 GPUs) instead of 6 nodes (48 GPUs).

Fix: rank 0 (Claude agent) writes numbered launch scripts to
$LAUNCH_DIR/<N>.sh on NFS before each torchrun invocation. Worker nodes
poll this directory every 5 seconds and execute the same command. When
rank 0 kills training, workers exit via NCCL timeout and poll for the
next script.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
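The worker side of the NFS handshake described above can be sketched like this (a hypothetical helper, not the repo's actual script, which is shell-based; file naming follows the $LAUNCH_DIR/&lt;N&gt;.sh convention from the commit message):

```python
import time
from pathlib import Path

def next_launch_script(launch_dir, round_n, poll_s=5, max_polls=None):
    """Poll $LAUNCH_DIR for the numbered launch script of this round.

    Rank 0 writes <round_n>.sh before each torchrun invocation; workers
    poll every poll_s seconds and run the same command when it appears.
    Returns the script text, or None if max_polls is exhausted.
    """
    target = Path(launch_dir) / f"{round_n}.sh"
    polls = 0
    while not target.exists():
        if max_polls is not None and polls >= max_polls:
            return None  # give up (useful for testing; real workers wait forever)
        time.sleep(poll_s)
        polls += 1
    return target.read_text()
```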
Use --node_rank=$NODE_RANK --master_addr --master_port instead of
rdzv mode. This matches submit_og2_lepton_eden.py which has been
running multi-node training successfully. Also clarify in the
Multi-Node Launch Protocol that workers must use single-quoted
heredocs to preserve $NODE_RANK/$MASTER_ADDR as variables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
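The static-rendezvous launch this commit switches to can be sketched as command construction (the flags are torchrun's documented multi-node options; the helper itself is illustrative, not code from the repo):

```python
def build_torchrun_cmd(nnodes, node_rank, master_addr, master_port,
                       nproc_per_node, script, *script_args):
    """Build a torchrun invocation using static --master_addr/--master_port
    rendezvous instead of rdzv mode. NODE_RANK and MASTER_ADDR come from
    the worker environment, which is why worker heredocs must be
    single-quoted so the variables are expanded at run time, not at
    script-generation time."""
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        f"--nproc_per_node={nproc_per_node}",
        script,
        *script_args,
    ]
```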
- Mandatory post-launch self-check: agent verifies GPU count (48),
  grad_acc_steps (8), effective batch size, and resume step. If wrong,
  agent kills and restarts immediately.
- Re-enable validation at 1000-step intervals as a downstream quality
  signal (FP8 paper notes training loss can diverge without hurting
  downstream tasks). Validation is informational only — does not
  trigger rollbacks. Failures are caught by try/except in training.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds eval_downstream.py that runs lm-eval benchmarks (arc_challenge,
arc_easy, boolq, copa, hellaswag, piqa, winogrande) on trained Lingua
1B checkpoints. Supports safetensors, distributed FSDP2 (DCP), and DDP
checkpoint formats. Self-contained checkpoint loading avoids TE version
compatibility issues with checkpoint.py imports.

Made-with: Cursor
savitha-eng force-pushed the savitha/claude-lepton-dev-experiment branch from d37f8c9 to 7030255 on March 18, 2026 01:23
… sync

Two bugs prevented multi-node training:
1. Env vars (MASTER_ADDR, NODE_RANK, NNODES) were lost at the `su claude-agent`
   boundary. Now written to /tmp/training_env.sh and sourced by Claude's wrapper.
2. Workers used independent launch script counters that desynced after kills.
   Now use barrier-based rounds (round_N_ready files) so all workers start together.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
savitha-eng and others added 18 commits March 18, 2026 02:34
Replace `git pull` with `git reset --hard origin/<branch>` so that
force-pushed branches sync correctly on the NFS deploy target.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When fp8_layers is passed via CLI (e.g. fp8_layers='[3,4,...,30]'),
Hydra may parse it as a string instead of a list. Use ast.literal_eval
as fallback to handle both OmegaConf lists and string representations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
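The fallback this commit describes amounts to roughly the following (a sketch of the idea, assuming the hypothetical helper name coerce_layer_list; the real code lives in the quantization module):

```python
import ast

def coerce_layer_list(value):
    """Normalize fp8_layers whether Hydra hands over a list or a string.

    A CLI override like fp8_layers='[3,4,30]' may arrive as the literal
    string "[3,4,30]" rather than a list; fall back to ast.literal_eval,
    which safely parses Python literals without executing code.
    """
    if isinstance(value, str):
        value = ast.literal_eval(value)
    return [int(x) for x in value]
```

ast.literal_eval is the usual safe choice here because, unlike eval, it only accepts literal expressions.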
The Claude agent was using WORKSPACE_ROOT (/data/savithas/agent_runs) instead
of CHECKPOINT_ROOT (/data/savithas/checkpoints) for checkpoint.ckpt_dir,
causing dcp_load to fail with "Connection closed by peer" during warm-start.

Changes:
- Agent prompt: add explicit CHECKPOINT_ROOT vs WORKSPACE_ROOT warning
- Warm-start instructions: use resolved paths instead of $CHECKPOINT_ROOT
- checkpoint.py: validate checkpoint before dcp_load (resolve symlinks,
  check .metadata exists) for clearer error messages
- Guide: warn that checkpoint.ckpt_dir must NOT use WORKSPACE_ROOT

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rank 0's torchrun was crashing at init_process_group because the
claude-agent user (non-root) likely couldn't access /dev/nvidia* devices.

Changes:
- chmod a+rw /dev/nvidia* and /dev/infiniband/* before su claude-agent
- Add claude-agent to video group for GPU access
- Add CUDA sanity check in wrapper (python3 torch.cuda.is_available())
- Log Claude Code output to NFS ($WORKSPACE_ROOT/claude_agent_output.log)
  via tee so we can debug rank 0 issues
- Ensure checkpoint_root and code_path permissions are correct

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The launch directory on NFS persists between jobs with the same job_name.
Old round_1_ready and round_1_args.env files from previous runs caused
workers to immediately start torchrun with stale args before rank 0's
Claude agent had started, leading to init_process_group failures.

Fix: rank 0 cleans old round_*/done files from the launch dir at startup.
Workers wait 5 seconds for cleanup to complete before entering poll loop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace `claude ... | tee $LOG` with direct file redirect + tail -f.
The pipe caused 4KB block buffering, so no output appeared until
enough text accumulated. Now output goes directly to the log file
and tail -f streams it to container logs in real time.

Also adds timestamps and prompt size logging for diagnostics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude was constructing its own torchrun command with --nnodes=1, resulting
in 8 GPUs instead of 48. Fix by exporting a pre-built TORCHRUN_PREFIX env
var with the correct multi-node flags (nnodes, node_rank, master_addr,
master_port) and instructing the agent to always use it.

Changes:
- submit script: add TORCHRUN_PREFIX to /tmp/training_env.sh
- agent guide: replace manual torchrun flag construction with $TORCHRUN_PREFIX
- agent prompt: add TORCHRUN_PREFIX to environment section

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…t 50k)

The FP8 config was overriding buffer_size to 500,000 instead of
inheriting the base config's 50,000. This caused excessive memory
usage during streaming dataset shuffling. Now inherits from
og2_7b_thd_gqa base config (buffer_size=50,000, num_workers=1).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch from JSON metagenomes to pre-chunked parquet2 shards to match
the reference BF16 baseline config (og2_bf16_baseline_metrics). Also
corrects num_workers (1→8) and buffer_size (50k→10k).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the real torchrun binary on rank 0 with a wrapper that strips
any --nnodes/--node_rank/--master_addr/--master_port/--nproc_per_node
flags and injects the correct values from the environment. This makes
multi-node training bulletproof: even if Claude constructs "torchrun
--nnodes=1", the wrapper corrects it to --nnodes=$NNODES.

The wrapper logs stripped/injected flags to stderr for debugging.
Workers are unaffected (separate containers).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
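The flag-correction logic of the wrapper can be sketched in Python (the real wrapper is presumably a shell shim around the torchrun binary; this sketch handles only the `--flag=value` form, and the NNODES/NODE_RANK/MASTER_ADDR/MASTER_PORT env var names are taken from this PR's submit script):

```python
def correct_torchrun_args(argv, env):
    """Drop any caller-supplied multi-node flags and inject the correct
    values from the environment, so even a 'torchrun --nnodes=1' from the
    agent becomes a proper multi-node launch."""
    managed = ("--nnodes", "--node_rank", "--master_addr",
               "--master_port", "--nproc_per_node")
    # Keep everything except the managed flags (handles --flag=value form).
    kept = [a for a in argv if not a.startswith(managed)]
    injected = [
        f"--nnodes={env['NNODES']}",
        f"--node_rank={env['NODE_RANK']}",
        f"--master_addr={env['MASTER_ADDR']}",
        f"--master_port={env['MASTER_PORT']}",
        f"--nproc_per_node={env.get('NPROC_PER_NODE', '8')}",
    ]
    return injected + kept
```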
New files:
- OG2_FP8_1NODE_DEMO_GUIDE.md: Agent guide for gradual FP8 expansion from
  center outward, adapted from NVFP4 guide on og2-fp8-refactor branch
- hydra_config/og2_7b_bf16_1k_from_5k.yaml: BF16 baseline config resuming
  from 5k checkpoint, 1 node (mbs=2, grad_acc=4, GBS=64)
- submit_training_lepton.py: Simple Lepton job submission for non-agent runs
- lepton_configs/og2_bf16_baseline_1node.yaml: Lepton config for BF16 baseline
- lepton_configs/og2_fp8_agent_1node_demo.yaml: Lepton config for agent demo

Modified:
- claude_agent_prompt.txt: Configurable guide filename, single-node instructions
- submit_claude_agent_lepton.py: Gradual strategy warm-start support

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Explicitly pass dataset.buffer_size=10000, dataset.num_workers=8,
dataset.micro_batch_size=2, grad_acc_steps=4, and all other fixed
values as CLI args so Claude cannot use wrong defaults. Only
fp8_config.enabled, fp8_layers, and wandb.name remain as agent-
controlled placeholders.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
micro_batch_size=2 causes OOM on 1 node. Use mbs=1 with
grad_acc_steps=8 to keep GBS=64 (1 × 8 × 8 GPUs).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
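The batch-size arithmetic behind this change, written out (standard data-parallel accounting, no repo-specific assumptions):

```python
def global_batch_size(micro_batch_size, grad_acc_steps, num_gpus):
    """GBS = micro batch size x gradient accumulation steps x
    data-parallel ranks. Halving mbs and doubling grad_acc keeps
    GBS fixed while lowering peak memory."""
    return micro_batch_size * grad_acc_steps * num_gpus
```

So mbs=2 with grad_acc=4 and mbs=1 with grad_acc=8 both give GBS=64 on one 8-GPU node.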
The baseline job was training from scratch because the checkpoint
directory was empty. Add resume_from config + symlink setup in the
container script so the 5k BF16 checkpoint is available for resume.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Save optimizer param group hyperparameters (betas, eps, etc.) before
set_state_dict and re-inject them after, fixing KeyError with newer
PyTorch versions. Ported from savitha/og2-fp8-refactor (f5f1949, 83511af).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
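The save/re-inject pattern this commit ports can be sketched with plain dicts standing in for torch optimizer param_groups (function names here are illustrative, not the repo's):

```python
def snapshot_param_group_hparams(param_groups,
                                 keys=("betas", "eps", "weight_decay")):
    """Capture per-group hyperparameters before set_state_dict."""
    return [{k: g[k] for k in keys if k in g} for g in param_groups]

def restore_param_group_hparams(param_groups, saved):
    """Re-inject hyperparameters that set_state_dict dropped.

    setdefault only fills missing keys, so values that survived the
    state-dict round trip are left untouched.
    """
    for group, keep in zip(param_groups, saved):
        for k, v in keep.items():
            group.setdefault(k, v)
    return param_groups
```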
Training keeps crashing at checkpoint save boundaries (step 5500, 5600)
with async_save=true. Switching to synchronous saves for stability.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Training crashes at every DCP checkpoint save boundary (5500, 5600, 5700)
regardless of sync/async mode. For the baseline we only need WandB metrics,
not the checkpoints. Disable saving entirely to let it run to 6000.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Metrics extracted from 4 WandB runs (bmmijgdt, 9a1cn5ze, 0oyfzc25,
54o6kypu) spanning steps 5001-5999. 9 entries at 100-step intervals
covering warmup and active phase for the FP8 agent demo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>