[DRAFT - do not review] Savitha/claude lepton dev experiment #1525
Draft
savitha-eng wants to merge 45 commits into main from
Conversation
Port the ESM2/llama3 FP8 refactor pattern to OG2: the model now handles te.autocast and quantized_model_init internally via get_autocast_context(), so training scripts just call model(**batch) without external FP8 wrappers.

- Add layer_precision and use_quantized_model_init to NVLlamaConfig
- Add get_autocast_context() to NVLlamaModel for per-layer FP8 control
- Pass fp8_recipe to model constructor in train_fsdp2.py and train_fsdp2_cp.py
- Remove external te.autocast from forward pass in both training scripts
- Remove quantized_model_init_kwargs from hydra configs (model handles it)
- Remove te.fp8_autocast wrapper from evaluate_fasta_lm_loss.py
- Add Lepton config for all-FP8 + FP32 master weights experiment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
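The refactor described above can be sketched as follows. The names (get_autocast_context, layer_precision) come from the commit message; te.autocast is replaced by a stand-in context manager so the sketch runs without Transformer Engine, and the forward pass is a placeholder.

```python
from contextlib import contextmanager, nullcontext

@contextmanager
def autocast_stub(enabled=True, recipe=None):
    # Stand-in for transformer_engine's te.autocast; the real model
    # would enter TE's FP8 autocast here.
    yield

class NVLlamaModelSketch:
    """Hypothetical sketch: the model owns its quantization context."""

    def __init__(self, layer_precision="fp8", fp8_recipe=None):
        self.layer_precision = layer_precision
        self.fp8_recipe = fp8_recipe

    def get_autocast_context(self):
        # Training scripts never wrap forward() in te.autocast themselves;
        # the model decides internally whether FP8 applies.
        if self.layer_precision == "fp8":
            return autocast_stub(enabled=True, recipe=self.fp8_recipe)
        return nullcontext()

    def __call__(self, **batch):
        with self.get_autocast_context():
            # Placeholder forward pass.
            return sum(len(v) for v in batch.values())

# Training-script side: just model(**batch), no external FP8 wrapper.
model = NVLlamaModelSketch(layer_precision="fp8")
tokens = model(input_ids=[101, 2054, 102])
```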
Adds submit script, agent prompt template, and Lepton job config to run Claude Code autonomously on a GPU node for training. Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
The NVIDIA LLM gateway requires ANTHROPIC_AUTH_TOKEN (not ANTHROPIC_API_KEY). Also fix model ID to match gateway format. Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
Claude Code blocks --dangerously-skip-permissions when running as root. Create a non-root user and use a wrapper script to avoid quoting issues. Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
su - (login shell) resets all env vars, losing CUDA, NCCL, HPC-X, etc. su (no dash) preserves the root environment so the non-root user inherits everything the NVIDIA container set up. Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
Align with Jonathan's pattern: outer te.autocast enables FP8 globally, per-layer get_layer_autocast returns nullcontext for FP8 layers (outer takes effect) and te.autocast(enabled=False) for BF16 layers (clean override). Previous approach double-nested te.autocast(enabled=True) which could corrupt TE's internal autocast state. Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
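A runnable analogue of the layering rule above, with a stub in place of te.autocast. This illustrates the pattern, not the repo's actual code; the helper name get_layer_autocast comes from the commit message.

```python
from contextlib import contextmanager, nullcontext

@contextmanager
def autocast_stub(enabled=True):
    # Stand-in for te.autocast so the pattern runs without TE.
    yield f"autocast(enabled={enabled})"

def get_layer_autocast(layer_idx, fp8_layers):
    # Called inside an outer autocast_stub(enabled=True):
    # - FP8 layers get nullcontext, so the outer FP8 context takes effect
    #   (no double-nested enabled=True, which could corrupt TE's state).
    # - BF16 layers get an explicit enabled=False context: a clean override.
    if layer_idx in fp8_layers:
        return nullcontext()
    return autocast_stub(enabled=False)
```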
- Replace fp8_first_last_bf16 config with fp8_layers/fp4_layers lists
- Add resolve_layer_precision() in new quantization.py module
- Add FP4 recipe support (NVLlamaConfig, NVLlamaModel, NVLlamaForCausalLM)
- Add set_recipes() for post-FSDP recipe attachment
- Rename fp8_stats_config to quant_stats_config with initialize_quant_stats_logging()
- Update perf_logger.py to handle both old and new config names
- Add fp8_debugging_stats.yaml for TE debug feature config
- Add test_quantization.py with comprehensive tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
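A sketch of what a resolve_layer_precision() helper like the one named above might look like; the signature and semantics are assumptions based on the commit message, not the actual quantization.py code.

```python
def resolve_layer_precision(num_layers, fp8_layers=(), fp4_layers=()):
    # Map each 1-indexed layer to "fp8", "fp4", or the "bf16" default,
    # rejecting layers claimed by both lists (hypothetical sketch).
    overlap = set(fp8_layers) & set(fp4_layers)
    if overlap:
        raise ValueError(f"layers in both fp8 and fp4 lists: {sorted(overlap)}")
    fp8, fp4 = set(fp8_layers), set(fp4_layers)
    precision = {}
    for i in range(1, num_layers + 1):
        if i in fp8:
            precision[i] = "fp8"
        elif i in fp4:
            precision[i] = "fp4"
        else:
            precision[i] = "bf16"
    return precision
```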
- OG2_FP8_AGENT_GUIDE.md: Complete agent specification for OG2 FP8 Block Scaling (adapted from Jonathan's NVFP4 guide for ESM2)
- OG2_STRATEGY_ENDS_IN.md: Demote-from-both-ends-inward strategy
- OG2_STRATEGY_TAIL_IN.md: Demote-from-output-end-toward-head strategy
- baseline_bf16.json: BF16 baseline metrics (1823 steps from WandB run 8mfsb27t)
- extract_baseline_metrics.py: Script to extract baseline from WandB
- references/NVIDIA-Nemotron-3-Super-Technical-Report.pdf: Research paper

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rewrite claude_agent_prompt.txt to reference OG2_FP8_AGENT_GUIDE.md instead of hardcoding a simple training command
- Update submit_claude_agent_lepton.py for multi-node: rank 0 runs Claude, other ranks wait for torchrun connections
- Add og2_fp8_agent.yaml config (6-node OG2-7B, 182K steps, ends_in strategy)
- Update claude_agent_demo.yaml with agent config fields

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… control plane) These modules support the agent's runtime infrastructure for monitoring, intervention, and metrics collection during FP8 precision training. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- BASELINE_LOGFILE now defaults to ./baseline_bf16.json (co-located)
- WORKSPACE_ROOT/RESULTS_FOLDER use concrete /data/savithas/ NFS paths

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Moves old run checkpoints (15K-35K) out of the way so current runs can save their 10K checkpoints cleanly without max_checkpoints rotation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…s relaunches Aligns with Jonathan's fix — wandb group was ambiguous about whether it gets recomputed on each relaunch. Now explicitly states it's computed once. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- checkpoint.max_checkpoints: 5 -> 2 (avoid old checkpoint collisions)
- checkpoint.ckpt_dir and resume_from_checkpoint marked FIXED (never change)
- Recovery flow now explicitly deletes checkpoints newer than LKG
- Checkpoints stored at /data/savithas/checkpoints/<run_name> (matches job/wandb name)
- Add warm_start config block to prompt template and submission script
- New lepton config: og2_fp8_agent_fl4_warmstart.yaml (fl4 5K checkpoint, ends_in round 4)
- Agent prompt dynamically builds warm-start or fresh-start section from YAML config

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add grad_acc_steps=$GRAD_ACC_STEPS to CLI template (prevents agent from scaling it)
- Remove logger.frequency override (Hydra config has frequency=1, logs every step)
- Add CRITICAL note: agent must use template EXACTLY, do NOT add/modify params
- List grad_acc_steps in FIXED fields section

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…st-failure rule

- checkpoint.max_checkpoints: 2 -> 4 (buffer so LKG isn't auto-deleted before agent acts)
- Add "kill IMMEDIATELY on FIRST failure" rule for multi-step check-in processing
- Add hydra.run.dir to keep Hydra outputs organized
- Recovery step 4: explicit "do NOT change num_train_steps" note

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- New config og2_fp8_agent_fl2_warmstart.yaml: layers 1-2 and 31-32 in BF16, layers 3-30 in FP8, demotion_round=2, lkg_step=10000
- Make TOLERANCE_PCT configurable via prompt template (default 5%, fl2 uses 1%)
- Add explicit BF16/FP8 layer listing to warm-start prompt section so the agent knows exactly which layers are in which precision

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Same fl2 checkpoint as ends_in experiment but using research_guided strategy — agent uses runtime quant stats (underflow %, MSE) to decide demotion order instead of fixed geometric pattern. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Validation was enabled in the Hydra config (og2_7b_thd_gqa.yaml) and was adding unnecessary overhead to agent runs. Add validation.enabled=false to the fixed CLI parameters.
Workers were sleeping instead of running torchrun, causing training to use only 1 node (8 GPUs) instead of 6 nodes (48 GPUs). Fix: rank 0 (Claude agent) writes numbered launch scripts to $LAUNCH_DIR/<N>.sh on NFS before each torchrun invocation. Worker nodes poll this directory every 5 seconds and execute the same command. When rank 0 kills training, workers exit via NCCL timeout and poll for the next script. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
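The worker-side poll loop described above might look like this sketch. The directory layout and timing are illustrative assumptions, not the actual script.

```python
import os
import time

def next_launch_script(launch_dir, last_round, poll_s=0.1, timeout_s=2.0):
    # Wait for <launch_dir>/<N>.sh where N == last_round + 1, written by
    # rank 0 before each torchrun invocation; return its path, or None
    # if nothing appears before the timeout (sketch of the poll loop).
    target = os.path.join(launch_dir, f"{last_round + 1}.sh")
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if os.path.exists(target):
            return target
        time.sleep(poll_s)
    return None
```

In the real setup the poll interval is 5 seconds and the worker executes the script it finds; after rank 0 kills training, workers fall out via NCCL timeout and re-enter this loop for the next round.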
Use --node_rank=$NODE_RANK --master_addr --master_port instead of rdzv mode. This matches submit_og2_lepton_eden.py which has been running multi-node training successfully. Also clarify in the Multi-Node Launch Protocol that workers must use single-quoted heredocs to preserve $NODE_RANK/$MASTER_ADDR as variables. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Mandatory post-launch self-check: agent verifies GPU count (48), grad_acc_steps (8), effective batch size, and resume step. If wrong, agent kills and restarts immediately.
- Re-enable validation at 1000-step intervals as a downstream quality signal (FP8 paper notes training loss can diverge without hurting downstream tasks). Validation is informational only — does not trigger rollbacks. Failures are caught by try/except in training.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds eval_downstream.py that runs lm-eval benchmarks (arc_challenge, arc_easy, boolq, copa, hellaswag, piqa, winogrande) on trained Lingua 1B checkpoints. Supports safetensors, distributed FSDP2 (DCP), and DDP checkpoint formats. Self-contained checkpoint loading avoids TE version compatibility issues with checkpoint.py imports. Made-with: Cursor
force-pushed from d37f8c9 to 7030255
… sync

Two bugs prevented multi-node training:
1. Env vars (MASTER_ADDR, NODE_RANK, NNODES) were lost at the `su claude-agent` boundary. Now written to /tmp/training_env.sh and sourced by Claude's wrapper.
2. Workers used independent launch-script counters that desynced after kills. Now use barrier-based rounds (round_N_ready files) so all workers start together.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
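A minimal sketch of the env-file handoff used for the first bug, assuming a simple export KEY="value" line format; the real wrapper sources the file from shell rather than parsing it in Python.

```python
def write_training_env(path, env):
    # Rank 0 writes env vars lost at the `su` boundary to a file the
    # wrapper can source (sketch; real path is /tmp/training_env.sh).
    with open(path, "w") as f:
        for key, value in env.items():
            f.write(f'export {key}="{value}"\n')

def read_training_env(path):
    # Python-side reader for the same format, for testing the round trip.
    out = {}
    with open(path) as f:
        for line in f:
            key, value = line.removeprefix("export ").rstrip("\n").split("=", 1)
            out[key] = value.strip('"')
    return out
```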
Replace `git pull` with `git reset --hard origin/<branch>` so that force-pushed branches sync correctly on the NFS deploy target. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When fp8_layers is passed via CLI (e.g. fp8_layers='[3,4,...,30]'), Hydra may parse it as a string instead of a list. Use ast.literal_eval as fallback to handle both OmegaConf lists and string representations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
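The fallback can be as simple as the following; the helper name is hypothetical.

```python
import ast

def coerce_layer_list(value):
    # Hydra may deliver fp8_layers as a real list (OmegaConf) or as the
    # string "[3,4,30]" when passed on the CLI; handle both.
    if isinstance(value, str):
        value = ast.literal_eval(value)
    return [int(x) for x in value]
```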
The Claude agent was using WORKSPACE_ROOT (/data/savithas/agent_runs) instead of CHECKPOINT_ROOT (/data/savithas/checkpoints) for checkpoint.ckpt_dir, causing dcp_load to fail with "Connection closed by peer" during warm-start.

Changes:
- Agent prompt: add explicit CHECKPOINT_ROOT vs WORKSPACE_ROOT warning
- Warm-start instructions: use resolved paths instead of $CHECKPOINT_ROOT
- checkpoint.py: validate checkpoint before dcp_load (resolve symlinks, check .metadata exists) for clearer error messages
- Guide: warn that checkpoint.ckpt_dir must NOT use WORKSPACE_ROOT

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
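The pre-load validation described for checkpoint.py could be sketched like this; the function name and exact checks are assumptions based on the commit message.

```python
import os

def validate_checkpoint_dir(ckpt_dir):
    # Resolve symlinks and require the DCP .metadata file before calling
    # dcp_load, so a wrong ckpt_dir fails with a clear message instead of
    # an opaque "Connection closed by peer" (hypothetical sketch).
    real = os.path.realpath(ckpt_dir)
    if not os.path.isdir(real):
        raise FileNotFoundError(f"checkpoint dir does not exist: {real}")
    meta = os.path.join(real, ".metadata")
    if not os.path.exists(meta):
        raise FileNotFoundError(f"not a DCP checkpoint (missing .metadata): {real}")
    return real
```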
Rank 0's torchrun was crashing at init_process_group because the claude-agent user (non-root) likely couldn't access /dev/nvidia* devices.

Changes:
- chmod a+rw /dev/nvidia* and /dev/infiniband/* before su claude-agent
- Add claude-agent to video group for GPU access
- Add CUDA sanity check in wrapper (python3 torch.cuda.is_available())
- Log Claude Code output to NFS ($WORKSPACE_ROOT/claude_agent_output.log) via tee so we can debug rank 0 issues
- Ensure checkpoint_root and code_path permissions are correct

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The launch directory on NFS persists between jobs with the same job_name. Old round_1_ready and round_1_args.env files from previous runs caused workers to immediately start torchrun with stale args before rank 0's Claude agent had started, leading to init_process_group failures. Fix: rank 0 cleans old round_*/done files from the launch dir at startup. Workers wait 5 seconds for cleanup to complete before entering poll loop. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
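A sketch of the startup cleanup; the glob patterns are assumptions based on the file names mentioned above.

```python
import glob
import os

def clean_launch_dir(launch_dir):
    # Rank 0 removes round_* and done* files left over from a previous
    # job with the same name before workers start polling (sketch).
    removed = []
    for pattern in ("round_*", "done*"):
        for path in glob.glob(os.path.join(launch_dir, pattern)):
            os.remove(path)
            removed.append(os.path.basename(path))
    return sorted(removed)
```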
Replace `claude ... | tee $LOG` with direct file redirect + tail -f. The pipe caused 4KB block buffering, so no output appeared until enough text accumulated. Now output goes directly to the log file and tail -f streams it to container logs in real time. Also adds timestamps and prompt size logging for diagnostics. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude was constructing its own torchrun command with --nnodes=1, resulting in 8 GPUs instead of 48. Fix by exporting a pre-built TORCHRUN_PREFIX env var with the correct multi-node flags (nnodes, node_rank, master_addr, master_port) and instructing the agent to always use it.

Changes:
- submit script: add TORCHRUN_PREFIX to /tmp/training_env.sh
- agent guide: replace manual torchrun flag construction with $TORCHRUN_PREFIX
- agent prompt: add TORCHRUN_PREFIX to environment section

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…t 50k) The FP8 config was overriding buffer_size to 500,000 instead of inheriting the base config's 50,000. This caused excessive memory usage during streaming dataset shuffling. Now inherits from og2_7b_thd_gqa base config (buffer_size=50,000, num_workers=1). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch from JSON metagenomes to pre-chunked parquet2 shards to match the reference BF16 baseline config (og2_bf16_baseline_metrics). Also corrects num_workers (1→8) and buffer_size (50k→10k). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the real torchrun binary on rank 0 with a wrapper that strips any --nnodes/--node_rank/--master_addr/--master_port/--nproc_per_node flags and injects the correct values from the environment. This makes multi-node training bulletproof: even if Claude constructs "torchrun --nnodes=1", the wrapper corrects it to --nnodes=$NNODES. The wrapper logs stripped/injected flags to stderr for debugging. Workers are unaffected (separate containers). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
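The flag-rewriting core of such a wrapper might look like this Python sketch; the actual wrapper may be a shell script, and the env var names are taken from the commits above.

```python
def fix_torchrun_args(argv, env):
    # Strip any user-supplied distributed flags (both --flag=value and
    # bare "--flag value" forms) and inject the correct values from the
    # environment, so "torchrun --nnodes=1" becomes --nnodes=$NNODES.
    managed = ("--nnodes", "--node_rank", "--master_addr",
               "--master_port", "--nproc_per_node")
    kept, skip_next = [], False
    for arg in argv:
        if skip_next:
            skip_next = False
            continue
        if arg.split("=")[0] in managed:
            skip_next = "=" not in arg  # bare flag consumes its value too
            continue
        kept.append(arg)
    injected = [f"--nnodes={env['NNODES']}",
                f"--node_rank={env['NODE_RANK']}",
                f"--master_addr={env['MASTER_ADDR']}",
                f"--master_port={env['MASTER_PORT']}",
                f"--nproc_per_node={env['NPROC_PER_NODE']}"]
    return injected + kept
```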
New files:
- OG2_FP8_1NODE_DEMO_GUIDE.md: Agent guide for gradual FP8 expansion from center outward, adapted from NVFP4 guide on og2-fp8-refactor branch
- hydra_config/og2_7b_bf16_1k_from_5k.yaml: BF16 baseline config resuming from 5k checkpoint, 1 node (mbs=2, grad_acc=4, GBS=64)
- submit_training_lepton.py: Simple Lepton job submission for non-agent runs
- lepton_configs/og2_bf16_baseline_1node.yaml: Lepton config for BF16 baseline
- lepton_configs/og2_fp8_agent_1node_demo.yaml: Lepton config for agent demo

Modified:
- claude_agent_prompt.txt: Configurable guide filename, single-node instructions
- submit_claude_agent_lepton.py: Gradual strategy warm-start support

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Explicitly pass dataset.buffer_size=10000, dataset.num_workers=8, dataset.micro_batch_size=2, grad_acc_steps=4, and all other fixed values as CLI args so Claude cannot use wrong defaults. Only fp8_config.enabled, fp8_layers, and wandb.name remain as agent- controlled placeholders. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
micro_batch_size=2 causes OOM on 1 node. Use mbs=1 with grad_acc_steps=8 to keep GBS=64 (1 × 8 × 8 GPUs). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
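The arithmetic behind keeping GBS=64, as a quick check:

```python
def global_batch_size(micro_batch, grad_acc_steps, num_gpus):
    # GBS = per-GPU micro-batch x gradient-accumulation steps x data-parallel GPUs
    return micro_batch * grad_acc_steps * num_gpus
```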
The baseline job was training from scratch because the checkpoint directory was empty. Add resume_from config + symlink setup in the container script so the 5k BF16 checkpoint is available for resume. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Training keeps crashing at checkpoint save boundaries (step 5500, 5600) with async_save=true. Switching to synchronous saves for stability. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Training crashes at every DCP checkpoint save boundary (5500, 5600, 5700) regardless of sync/async mode. For the baseline we only need WandB metrics, not the checkpoints. Disable saving entirely to let it run to 6000. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Metrics extracted from 4 WandB runs (bmmijgdt, 9a1cn5ze, 0oyfzc25, 54o6kypu) spanning steps 5001-5999. 9 entries at 100-step intervals covering warmup and active phase for the FP8 agent demo. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Description
Usage
Type of changes
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.
Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline. For more details, see CONTRIBUTING.
Note
By default, only basic unit tests are run. Add appropriate labels to enable additional test coverage.
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI runs on NVIDIA's compute resources.
New commits will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123).
An /ok to test comment on the pull request will trigger CI. This will need to be done for each new commit.

Triggering Code Rabbit AI Review
To trigger a code review from CodeRabbit, comment on a pull request with one of these commands:
See https://docs.coderabbit.ai/reference/review-commands for a full list of commands.
Pre-submit Checklist