
Conversation


@jwilber jwilber commented Oct 13, 2025

Adding new geneformer recipe configs, one for each of the 10m, 106m, and 4b models.

Summary by CodeRabbit

  • New Features
    • Added ready-to-use training recipes for Geneformer (native TE) at 10M, 106M, and 4B scales.
    • Presets include GPU topology, precision and TE/FP8/THD flags, batch/step sizes, workers, and MLM probability.
    • Two parallelism variants per recipe: DDP and MFSDP.
    • Built-in experiment tracking with configurable project/group/job type and timestamped run names.
    • Auto-computed total GPU count and launch commands with overrideable tracking, resume, and checkpointing options.

Signed-off-by: Jared Wilber <[email protected]>
@jwilber jwilber self-assigned this Oct 13, 2025
@jwilber jwilber added the ciflow:skip (Skip all CI tests for this PR) label Oct 13, 2025

copy-pr-bot bot commented Oct 13, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Oct 13, 2025

Walkthrough

Adds three Lepton YAML recipes for geneformer_native_te (10m, 106m, 4b) that define job metadata, device/resource shapes, precision/TE/FP8/THD flags, wandb init args, shared training defaults, two product variants (ddp and mfsdp), checkpoint controls, and a torchrun-based run_script.

Changes

Cohort / File(s): Lepton recipes: geneformer_native_te configs
  • ci/lepton/model_convergence/configs/recipes/geneformer_native_te_10m.yaml
  • ci/lepton/model_convergence/configs/recipes/geneformer_native_te_106m.yaml
  • ci/lepton/model_convergence/configs/recipes/geneformer_native_te_4b.yaml
Summary: Adds three recipe YAMLs defining job metadata (node_group, num_nodes, device_type, num_devices, gpu_type, resource_shape), recipe identifiers (recipe_subdir, model_type, variant, framework, precision, te_enabled, fp8_enabled, thd_enabled, extras), computed total_gpus, wandb_init_args, shared training defaults (task_cmd, num_train_steps, micro_batch_size, use_te_layers, use_fp8, num_workers, mlm_probability), checkpoint controls (checkpoint_dir, save_every_n_steps, resume_from_checkpoint), two product variants per recipe (ddp and mfsdp) with distinct wandb_name and job_name, and a run_script that invokes torchrun with per-product config overrides (training/model/wandb/checkpoint flags).
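
To make that structure concrete, below is a rough, illustrative sketch of one such recipe, loosely modeled on the 10m variant. Key names follow the summary above, and a few identifiers (recipe_subdir, the --config-name and +wandb_init_args.group overrides) are taken from the review comments; everything else (resource shape, batch sizes, project/group names, interpolation syntax, run_script contents) is an assumed placeholder rather than the actual settings in the PR.

# Illustrative recipe sketch; values are placeholders, not the PR's actual settings
node_group: example-node-group
num_nodes: 1
device_type: gpu
num_devices: 8
gpu_type: h100
resource_shape: gpu.8xh100

recipe_subdir: geneformer_native_te_mfsdp_fp8
model_type: geneformer
variant: 10m
framework: te
precision: bf16          # assumed
te_enabled: true
fp8_enabled: false
thd_enabled: false
extras: []

total_gpus: 8            # auto-computed as num_nodes * num_devices per the summary

wandb_init_args:
  project: example-project
  group: geneformer_native_te_10m
  job_type: convergence

# shared training defaults
task_cmd: train.py       # assumed placeholder
num_train_steps: 10000
micro_batch_size: 8
use_te_layers: true
use_fp8: false
num_workers: 4
mlm_probability: 0.15

# checkpoint controls
checkpoint_dir: null
save_every_n_steps: 100
resume_from_checkpoint: false

# two product variants with distinct run/job names
products:
  - name: ddp
    wandb_name: geneformer_10m_ddp_<timestamp>
    job_name: geneformer-10m-ddp
  - name: mfsdp
    wandb_name: geneformer_10m_mfsdp_<timestamp>
    job_name: geneformer-10m-mfsdp

run_script: |
  torchrun --nnodes=${num_nodes} --nproc-per-node=${num_devices} ${task_cmd} \
    --config-name 10m.yaml \
    +wandb_init_args.group=${wandb_init_args.group}
  # plus further per-product training/model/wandb/checkpoint overrides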

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Operator
  participant Lepton as Lepton Job
  participant TorchRun
  participant Trainer
  participant WandB as W&B

  Operator->>Lepton: select recipe + product (10m/106m/4b, ddp/mfsdp)
  Lepton->>TorchRun: launch with config overrides (nodes, devices, precision, TE, FP8, MFSDP)
  TorchRun->>Trainer: init model & training (TE, fp16/fp8, mfsdp flag)
  Trainer->>WandB: init(project, group, job_type, name)
  Note over Trainer,WandB: observability started

  loop training steps
    Trainer->>Trainer: forward/backward/opt step
    alt mfsdp enabled
      Trainer->>Trainer: sharded param sync
    else ddp
      Trainer->>Trainer: data-parallel sync
    end
    alt checkpointing enabled
      Trainer->>Trainer: save checkpoint at interval
    end
  end

  Trainer-->>WandB: finalize run
  TorchRun-->>Lepton: exit status

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

I thump and patch beneath the tree,
Three recipes folded just for me.
DDP or shards, flags set with care,
Torchrun whistles through the air.
W&B logs the carrot dreams I keep. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Description Check (⚠️ Warning): The PR description is a single-line statement and does not follow the repository's required template; it lacks sections for a detailed description, a usage example, the type of changes, CI pipeline configuration, and the pre-submit checklist. Without these elements, reviewers cannot assess how to use the new configs or the scope of changes, or verify that necessary tests and documentation updates have been addressed. Resolution: update the PR description to match the provided template by adding a detailed description of the changes and a usage code snippet, and selecting the appropriate type of changes; include any relevant CI pipeline labels and complete the pre-submit checklist with confirmations of local testing, documentation updates, and added or updated tests.
✅ Passed checks (2 passed)
  • Title Check (✅ Passed): The title "Add new geneformer configs" succinctly captures the primary change of introducing new geneformer configuration files and is directly related to the main updates in the changeset. It is concise, clear, and specific enough that a reader scanning the history will understand the nature of the addition without extraneous detail.
  • Docstring Coverage (✅ Passed): No functions found in the changes. Docstring coverage check skipped.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b82e604 and 39a50ec.

📒 Files selected for processing (1)
  • ci/lepton/model_convergence/configs/recipes/geneformer_native_te_4b.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • ci/lepton/model_convergence/configs/recipes/geneformer_native_te_4b.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Analyze (rust)


Signed-off-by: Jared Wilber <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (12)
ci/lepton/model_convergence/configs/recipes/geneformer_native_te_106m.yaml (4)

62-66: Checkpointing disabled but save interval set — align flags

With checkpoint_dir=null, keeping save_every_n_steps=100 can cause confusing behavior depending on trainer logic. Prefer disabling saves when dir is null.

- save_every_n_steps: 100
+ save_every_n_steps: 0

Also applies to: 102-104
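
For reference, the fully aligned block would look roughly like the sketch below (key names come from the recipe summary; the resume_from_checkpoint value shown is an assumption):

# checkpoint controls with saving fully disabled
checkpoint_dir: null
save_every_n_steps: 0          # no save interval when there is nowhere to write
resume_from_checkpoint: false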


54-54: Avoid numeric underscores for compatibility

10_000 may be parsed as a string by some YAML/OmegaConf parsers. Use 10000.

- num_train_steps: 10_000
+ num_train_steps: 10000

20-20: Name/flag mismatch (fp8 in recipe_subdir but fp8_enabled=false)

Either enable fp8, or drop fp8 from the identifier to avoid confusion.

- recipe_subdir: geneformer_native_te_mfsdp_fp8
+ recipe_subdir: geneformer_native_te_mfsdp

Also applies to: 28-29


95-95: Disambiguate WandB group by model size

Optional: append ${config} so ddp/mfsdp runs are grouped per model size across recipes.

-    +wandb_init_args.group=${wandb_init_args.group} \
+    +wandb_init_args.group=${wandb_init_args.group}__${config} \
ci/lepton/model_convergence/configs/recipes/geneformer_native_te_4b.yaml (4)

62-66: Disable save interval when checkpointing is off

Set save_every_n_steps: 0 to avoid mixed signals.

- save_every_n_steps: 100
+ save_every_n_steps: 0

Also applies to: 102-104


54-54: Prefer 10000 over 10_000

Avoid underscore for broader parser compatibility.

- num_train_steps: 10_000
+ num_train_steps: 10000

20-20: Identifier/flag mismatch (fp8 in name vs fp8_enabled=false)

Consider dropping fp8 from recipe_subdir or enabling fp8 if intended.

- recipe_subdir: geneformer_native_te_mfsdp_fp8
+ recipe_subdir: geneformer_native_te_mfsdp

Also applies to: 28-29


95-95: Optional: include ${config} in WandB group

Helps group runs by model size.

-    +wandb_init_args.group=${wandb_init_args.group} \
+    +wandb_init_args.group=${wandb_init_args.group}__${config} \
ci/lepton/model_convergence/configs/recipes/geneformer_native_te_10m.yaml (4)

62-66: If checkpointing is off, set save_every_n_steps to 0

Avoid passing save intervals alongside a null checkpoint_dir.

- save_every_n_steps: 100
+ save_every_n_steps: 0

Also applies to: 102-104


54-54: Use 10000 instead of 10_000

Safer cross-parser behavior.

- num_train_steps: 10_000
+ num_train_steps: 10000

20-20: recipe_subdir mentions fp8 while disabled

Align naming with flags to reduce confusion.

- recipe_subdir: geneformer_native_te_mfsdp_fp8
+ recipe_subdir: geneformer_native_te_mfsdp

Also applies to: 28-29


95-95: Optional: add ${config} to WandB group

Differentiates 10m vs other sizes.

-    +wandb_init_args.group=${wandb_init_args.group} \
+    +wandb_init_args.group=${wandb_init_args.group}__${config} \
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8ff2e4b and 28749ef.

📒 Files selected for processing (3)
  • ci/lepton/model_convergence/configs/recipes/geneformer_native_te_106m.yaml (1 hunks)
  • ci/lepton/model_convergence/configs/recipes/geneformer_native_te_10m.yaml (1 hunks)
  • ci/lepton/model_convergence/configs/recipes/geneformer_native_te_4b.yaml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Analyze (rust)
🔇 Additional comments (6)
ci/lepton/model_convergence/configs/recipes/geneformer_native_te_106m.yaml (2)

93-97: wandb_init_args.mode is defined upstream; ci/lepton/model_convergence/configs/base.yaml sets it to "online", so no action required.


87-90: Referenced model config exists and is loadable by Hydra

Hydra will find hydra_config/106m.yaml (which defaults to model: 106m) and load hydra_config/model/106m.yaml when invoked with --config-name 106m.yaml.
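
Roughly, the layout this describes is sketched below; the path and defaults entry are inferred from the comment rather than copied from the repository:

# recipes/geneformer_native_te_mfsdp_fp8/hydra_config/106m.yaml (sketch)
defaults:
  - model: 106m        # Hydra resolves this to hydra_config/model/106m.yaml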

ci/lepton/model_convergence/configs/recipes/geneformer_native_te_4b.yaml (2)

93-97: wandb_init_args.mode is defined in base.yaml (line 68), so the reference in this recipe is valid.


87-90: 4b.yaml is present and discoverable under the Hydra config directories
Files recipes/geneformer_native_te_mfsdp_fp8/hydra_config/4b.yaml and recipes/geneformer_native_te_mfsdp_fp8/hydra_config/model/4b.yaml exist.

ci/lepton/model_convergence/configs/recipes/geneformer_native_te_10m.yaml (2)

93-97: wandb_init_args.mode is defined in base.yaml (line 68); reference is valid.


87-90: Hydra ‘10m.yaml’ and its model config exist under recipes/geneformer_native_te_mfsdp_fp8/hydra_config, so --config-name 10m.yaml will resolve correctly.
