
Conversation


@jwilber jwilber commented Oct 13, 2025

Adding new geneformer recipe configs, one for each of the 10m, 106m, and 4b models.

Summary by CodeRabbit

  • New Features
    • Added ready-to-use training recipes for Geneformer (native TE) at 10M, 106M, and 4B scales.
    • Presets include GPU topology, precision and TE/FP8/THD flags, batch/step sizes, workers, and MLM probability.
    • Two parallelism variants per recipe: DDP and MFSDP.
    • Built-in experiment tracking with configurable project/group/job type and timestamped run names.
    • Auto-computed total GPU count and launch commands with overrideable tracking, resume, and checkpointing options.

Signed-off-by: Jared Wilber <[email protected]>
@jwilber jwilber self-assigned this Oct 13, 2025
@jwilber jwilber added the ciflow:skip (Skip all CI tests for this PR) label Oct 13, 2025

copy-pr-bot bot commented Oct 13, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Oct 13, 2025

Walkthrough

Adds three Lepton YAML recipes for geneformer_native_te (10m, 106m, 4b) that define job metadata, device/resource shapes, precision/TE/FP8/THD flags, wandb init args, shared training defaults, two product variants (ddp and mfsdp), checkpoint controls, and a torchrun-based run_script.

Changes

Cohort / File(s): Lepton recipes: geneformer_native_te configs
  • ci/lepton/model_convergence/configs/recipes/geneformer_native_te_10m.yaml
  • ci/lepton/model_convergence/configs/recipes/geneformer_native_te_106m.yaml
  • ci/lepton/model_convergence/configs/recipes/geneformer_native_te_4b.yaml
Summary: Adds three recipe YAMLs defining job metadata (node_group, num_nodes, device_type, num_devices, gpu_type, resource_shape), recipe identifiers (recipe_subdir, model_type, variant, framework, precision, te_enabled, fp8_enabled, thd_enabled, extras), computed total_gpus, wandb_init_args, shared training defaults (task_cmd, num_train_steps, micro_batch_size, use_te_layers, use_fp8, num_workers, mlm_probability), checkpoint controls (checkpoint_dir, save_every_n_steps, resume_from_checkpoint), two product variants per recipe (ddp and mfsdp) with distinct wandb_name and job_name, and a run_script that invokes torchrun with per-product config overrides (training/model/wandb/checkpoint flags).
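
To make that structure concrete, below is a rough, illustrative sketch of one such recipe, loosely modeled on the 10m variant. Key names follow the summary above, and a few identifiers (recipe_subdir, the --config-name and +wandb_init_args.group overrides) are taken from the review comments; everything else (resource shape, batch sizes, project/group names, interpolation syntax, run_script contents) is an assumed placeholder rather than the actual settings in the PR.

# Illustrative recipe sketch; values are placeholders, not the PR's actual settings
node_group: example-node-group
num_nodes: 1
device_type: gpu
num_devices: 8
gpu_type: h100
resource_shape: gpu.8xh100

recipe_subdir: geneformer_native_te_mfsdp_fp8
model_type: geneformer
variant: 10m
framework: te
precision: bf16          # assumed
te_enabled: true
fp8_enabled: false
thd_enabled: false
extras: []

total_gpus: 8            # auto-computed as num_nodes * num_devices per the summary

wandb_init_args:
  project: example-project
  group: geneformer_native_te_10m
  job_type: convergence

# shared training defaults
task_cmd: train.py       # assumed placeholder
num_train_steps: 10000
micro_batch_size: 8
use_te_layers: true
use_fp8: false
num_workers: 4
mlm_probability: 0.15

# checkpoint controls
checkpoint_dir: null
save_every_n_steps: 100
resume_from_checkpoint: false

# two product variants with distinct run/job names
products:
  - name: ddp
    wandb_name: geneformer_10m_ddp_<timestamp>
    job_name: geneformer-10m-ddp
  - name: mfsdp
    wandb_name: geneformer_10m_mfsdp_<timestamp>
    job_name: geneformer-10m-mfsdp

run_script: |
  torchrun --nnodes=${num_nodes} --nproc-per-node=${num_devices} ${task_cmd} \
    --config-name 10m.yaml \
    +wandb_init_args.group=${wandb_init_args.group}
  # plus further per-product training/model/wandb/checkpoint overrides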

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Operator
  participant Lepton as Lepton Job
  participant TorchRun
  participant Trainer
  participant WandB as W&B

  Operator->>Lepton: select recipe + product (10m/106m/4b, ddp/mfsdp)
  Lepton->>TorchRun: launch with config overrides (nodes, devices, precision, TE, FP8, MFSDP)
  TorchRun->>Trainer: init model & training (TE, fp16/fp8, mfsdp flag)
  Trainer->>WandB: init(project, group, job_type, name)
  Note over Trainer,WandB: observability started

  loop training steps
    Trainer->>Trainer: forward/backward/opt step
    alt mfsdp enabled
      Trainer->>Trainer: sharded param sync
    else ddp
      Trainer->>Trainer: data-parallel sync
    end
    alt checkpointing enabled
      Trainer->>Trainer: save checkpoint at interval
    end
  end

  Trainer-->>WandB: finalize run
  TorchRun-->>Lepton: exit status

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

I thump and patch beneath the tree,
Three recipes folded just for me.
DDP or shards, flags set with care,
Torchrun whistles through the air.
W&B logs the carrot dreams I keep. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Description Check (⚠️ Warning): The PR description is a single-line statement and does not follow the repository's required template; it lacks sections for a detailed description, a usage example, the type of changes, CI pipeline configuration, and the pre-submit checklist. Without these elements, reviewers cannot assess how to use the new configs or the scope of changes, or verify that necessary tests and documentation updates have been addressed. Resolution: update the PR description to match the provided template by adding a detailed description of the changes and a usage code snippet, and selecting the appropriate type of changes; include any relevant CI pipeline labels and complete the pre-submit checklist with confirmations of local testing, documentation updates, and added or updated tests.
✅ Passed checks (2 passed)
  • Title Check (✅ Passed): The title "Add new geneformer configs" succinctly captures the primary change of introducing new geneformer configuration files and is directly related to the main updates in the changeset. It is concise, clear, and specific enough that a reader scanning the history will understand the nature of the addition without extraneous detail.
  • Docstring Coverage (✅ Passed): No functions found in the changes. Docstring coverage check skipped.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b82e604 and 39a50ec.

📒 Files selected for processing (1)
  • ci/lepton/model_convergence/configs/recipes/geneformer_native_te_4b.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • ci/lepton/model_convergence/configs/recipes/geneformer_native_te_4b.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Analyze (rust)


Signed-off-by: Jared Wilber <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (12)
ci/lepton/model_convergence/configs/recipes/geneformer_native_te_106m.yaml (4)

62-66: Checkpointing disabled but save interval set — align flags

With checkpoint_dir=null, keeping save_every_n_steps=100 can cause confusing behavior depending on trainer logic. Prefer disabling saves when dir is null.

- save_every_n_steps: 100
+ save_every_n_steps: 0

Also applies to: 102-104
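
For reference, the fully aligned block would look roughly like the sketch below (key names come from the recipe summary; the resume_from_checkpoint value shown is an assumption):

# checkpoint controls with saving fully disabled
checkpoint_dir: null
save_every_n_steps: 0          # no save interval when there is nowhere to write
resume_from_checkpoint: false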


54-54: Avoid numeric underscores for compatibility

10_000 may be parsed as a string by some YAML/OmegaConf parsers. Use 10000.

- num_train_steps: 10_000
+ num_train_steps: 10000

20-20: Name/flag mismatch (fp8 in recipe_subdir but fp8_enabled=false)

Either enable fp8, or drop fp8 from the identifier to avoid confusion.

- recipe_subdir: geneformer_native_te_mfsdp_fp8
+ recipe_subdir: geneformer_native_te_mfsdp

Also applies to: 28-29


95-95: Disambiguate WandB group by model size

Optional: append ${config} so ddp/mfsdp runs are grouped per model size across recipes.

-    +wandb_init_args.group=${wandb_init_args.group} \
+    +wandb_init_args.group=${wandb_init_args.group}__${config} \
ci/lepton/model_convergence/configs/recipes/geneformer_native_te_4b.yaml (4)

62-66: Disable save interval when checkpointing is off

Set save_every_n_steps: 0 to avoid mixed signals.

- save_every_n_steps: 100
+ save_every_n_steps: 0

Also applies to: 102-104


54-54: Prefer 10000 over 10_000

Avoid underscore for broader parser compatibility.

- num_train_steps: 10_000
+ num_train_steps: 10000

20-20: Identifier/flag mismatch (fp8 in name vs fp8_enabled=false)

Consider dropping fp8 from recipe_subdir or enabling fp8 if intended.

- recipe_subdir: geneformer_native_te_mfsdp_fp8
+ recipe_subdir: geneformer_native_te_mfsdp

Also applies to: 28-29


95-95: Optional: include ${config} in WandB group

Helps group runs by model size.

-    +wandb_init_args.group=${wandb_init_args.group} \
+    +wandb_init_args.group=${wandb_init_args.group}__${config} \
ci/lepton/model_convergence/configs/recipes/geneformer_native_te_10m.yaml (4)

62-66: If checkpointing is off, set save_every_n_steps to 0

Avoid passing save intervals alongside a null checkpoint_dir.

- save_every_n_steps: 100
+ save_every_n_steps: 0

Also applies to: 102-104


54-54: Use 10000 instead of 10_000

Safer cross-parser behavior.

- num_train_steps: 10_000
+ num_train_steps: 10000

20-20: recipe_subdir mentions fp8 while disabled

Align naming with flags to reduce confusion.

- recipe_subdir: geneformer_native_te_mfsdp_fp8
+ recipe_subdir: geneformer_native_te_mfsdp

Also applies to: 28-29


95-95: Optional: add ${config} to WandB group

Differentiates 10m vs other sizes.

-    +wandb_init_args.group=${wandb_init_args.group} \
+    +wandb_init_args.group=${wandb_init_args.group}__${config} \
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8ff2e4b and 28749ef.

📒 Files selected for processing (3)
  • ci/lepton/model_convergence/configs/recipes/geneformer_native_te_106m.yaml (1 hunks)
  • ci/lepton/model_convergence/configs/recipes/geneformer_native_te_10m.yaml (1 hunks)
  • ci/lepton/model_convergence/configs/recipes/geneformer_native_te_4b.yaml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Analyze (rust)
🔇 Additional comments (6)
ci/lepton/model_convergence/configs/recipes/geneformer_native_te_106m.yaml (2)

93-97: wandb_init_args.mode is defined upstream; ci/lepton/model_convergence/configs/base.yaml sets it to "online", so no action required.


87-90: Referenced model config exists and is loadable by Hydra

Hydra will find hydra_config/106m.yaml (which defaults to model: 106m) and load hydra_config/model/106m.yaml when invoked with --config-name 106m.yaml.
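
Roughly, the layout this describes is sketched below; the path and defaults entry are inferred from the comment rather than copied from the repository:

# recipes/geneformer_native_te_mfsdp_fp8/hydra_config/106m.yaml (sketch)
defaults:
  - model: 106m        # Hydra resolves this to hydra_config/model/106m.yaml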

ci/lepton/model_convergence/configs/recipes/geneformer_native_te_4b.yaml (2)

93-97: wandb_init_args.mode is defined in base.yaml (line 68), so the reference in this recipe is valid.


87-90: 4b.yaml is present and discoverable under the Hydra config directories
Files recipes/geneformer_native_te_mfsdp_fp8/hydra_config/4b.yaml and recipes/geneformer_native_te_mfsdp_fp8/hydra_config/model/4b.yaml exist.

ci/lepton/model_convergence/configs/recipes/geneformer_native_te_10m.yaml (2)

93-97: wandb_init_args.mode is defined in base.yaml (line 68); reference is valid.


87-90: Hydra ‘10m.yaml’ and its model config exist under recipes/geneformer_native_te_mfsdp_fp8/hydra_config, so --config-name 10m.yaml will resolve correctly.
