Add new geneformer configs #1251
base: main
Conversation
Signed-off-by: Jared Wilber <[email protected]>
Walkthrough
Adds three Lepton YAML recipes for geneformer_native_te (10m, 106m, 4b) that define job metadata, device/resource shapes, precision/TE/FP8/THD flags, wandb init args, shared training defaults, two product variants (ddp and mfsdp), checkpoint controls, and a torchrun-based run_script.
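For orientation, the sketch below shows roughly what one of these recipes might look like. It is an assumption-based illustration only: the key names are taken from the walkthrough and the review comments (recipe_subdir, fp8_enabled, num_train_steps, checkpoint_dir, save_every_n_steps, wandb_init_args, products, run_script), while the job metadata, resource shape, and train.py entrypoint are hypothetical placeholders.

```yaml
# Hypothetical sketch of a 10m recipe; values and extra keys are illustrative only.
job_name: geneformer-native-te-10m        # assumed job metadata key
resource_shape: gpu.8xh100                # assumed device/resource shape
recipe_subdir: geneformer_native_te_mfsdp_fp8   # identifier discussed in the review below
fp8_enabled: false                        # precision/TE/FP8 flags live here per the review
num_train_steps: 10000                    # shared training default
checkpoint_dir: null                      # checkpoint controls (review suggests save_every_n_steps: 0 when null)
save_every_n_steps: 100
wandb_init_args:
  project: geneformer-convergence         # assumed project name
  group: geneformer_native_te_10m         # assumed group name
products: [ddp, mfsdp]                    # the two product variants
run_script: |
  torchrun --nnodes ${num_nodes} --nproc-per-node ${gpus_per_node} train.py \
    --config-name 10m.yaml \
    +wandb_init_args.group=${wandb_init_args.group}
```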
Sequence Diagram(s)
sequenceDiagram
autonumber
actor Operator
participant Lepton as Lepton Job
participant TorchRun
participant Trainer
participant WandB as W&B
Operator->>Lepton: select recipe + product (10m/106m/4b, ddp/mfsdp)
Lepton->>TorchRun: launch with config overrides (nodes, devices, precision, TE, FP8, MFS DP)
TorchRun->>Trainer: init model & training (TE, fp16/fp8, mfsdp flag)
Trainer->>WandB: init(project, group, job_type, name)
Note over Trainer,WandB: observability started
loop training steps
Trainer->>Trainer: forward/backward/opt step
alt mfsdp enabled
Trainer->>Trainer: sharded param sync
else ddp
Trainer->>Trainer: data-parallel sync
end
alt checkpointing enabled
Trainer->>Trainer: save checkpoint at interval
end
end
Trainer-->>WandB: finalize run
TorchRun-->>Lepton: exit status
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
📜 Recent review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Signed-off-by: Jared Wilber <[email protected]>
Actionable comments posted: 1
🧹 Nitpick comments (12)
ci/lepton/model_convergence/configs/recipes/geneformer_native_te_106m.yaml (4)
62-66: Checkpointing disabled but save interval set — align flags
With checkpoint_dir=null, keeping save_every_n_steps=100 can cause confusing behavior depending on trainer logic. Prefer disabling saves when dir is null.
- save_every_n_steps: 100
+ save_every_n_steps: 0
Also applies to: 102-104
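A minimal sketch of the aligned pair, assuming the two keys sit next to each other in the recipe:

```yaml
# Keep the save interval consistent with the disabled checkpoint directory.
checkpoint_dir: null
save_every_n_steps: 0   # no periodic saves while checkpointing is off
```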
54-54: Avoid numeric underscores for compatibility
10_000 may be parsed as a string by some YAML/OmegaConf parsers. Use 10000.
- num_train_steps: 10_000
+ num_train_steps: 10000
20-20: Name/flag mismatch (fp8 in recipe_subdir but fp8_enabled=false)
Either enable fp8, or drop fp8 from the identifier to avoid confusion.
- recipe_subdir: geneformer_native_te_mfsdp_fp8
+ recipe_subdir: geneformer_native_te_mfsdp
Also applies to: 28-29
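Either direction keeps the name and the flag consistent; a sketch of both options (key placement assumed):

```yaml
# Option A: drop fp8 from the identifier since the flag is off
recipe_subdir: geneformer_native_te_mfsdp
fp8_enabled: false

# Option B: keep fp8 in the identifier and actually enable it, if that is the intent
recipe_subdir: geneformer_native_te_mfsdp_fp8
fp8_enabled: true
```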
95-95: Disambiguate WandB group by model size
Optional: append ${config} so ddp/mfsdp runs are grouped per model size across recipes.
- +wandb_init_args.group=${wandb_init_args.group} \
+ +wandb_init_args.group=${wandb_init_args.group}__${config} \

ci/lepton/model_convergence/configs/recipes/geneformer_native_te_4b.yaml (4)
62-66: Disable save interval when checkpointing is off
Set save_every_n_steps: 0 to avoid mixed signals.
- save_every_n_steps: 100
+ save_every_n_steps: 0
Also applies to: 102-104
54-54: Prefer 10000 over 10_000
Avoid underscore for broader parser compatibility.
- num_train_steps: 10_000
+ num_train_steps: 10000
20-20: Identifier/flag mismatch (fp8 in name vs fp8_enabled=false)
Consider dropping fp8 from recipe_subdir or enabling fp8 if intended.
- recipe_subdir: geneformer_native_te_mfsdp_fp8
+ recipe_subdir: geneformer_native_te_mfsdp
Also applies to: 28-29
95-95: Optional: include ${config} in WandB group
Helps group runs by model size.
- +wandb_init_args.group=${wandb_init_args.group} \
+ +wandb_init_args.group=${wandb_init_args.group}__${config} \

ci/lepton/model_convergence/configs/recipes/geneformer_native_te_10m.yaml (4)
62-66: If checkpointing is off, set save_every_n_steps to 0
Avoid passing save intervals alongside a null checkpoint_dir.
- save_every_n_steps: 100
+ save_every_n_steps: 0
Also applies to: 102-104
54-54: Use 10000 instead of 10_000
Safer cross-parser behavior.
- num_train_steps: 10_000
+ num_train_steps: 10000
20-20: recipe_subdir mentions fp8 while disabled
Align naming with flags to reduce confusion.
- recipe_subdir: geneformer_native_te_mfsdp_fp8
+ recipe_subdir: geneformer_native_te_mfsdp
Also applies to: 28-29
95-95: Optional: add ${config} to WandB group
Differentiates 10m vs other sizes.
- +wandb_init_args.group=${wandb_init_args.group} \
+ +wandb_init_args.group=${wandb_init_args.group}__${config} \
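For illustration, assuming ${config} resolves to the model size (e.g. 10m) and that the base group name shown here is a placeholder, the suggested override would effectively group runs like this:

```yaml
# Effective W&B grouping with the suggested __${config} suffix (sketch only).
wandb_init_args:
  group: geneformer_native_te__10m   # ddp and mfsdp runs of the same size share this group
```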
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
ci/lepton/model_convergence/configs/recipes/geneformer_native_te_106m.yaml (1 hunks)
ci/lepton/model_convergence/configs/recipes/geneformer_native_te_10m.yaml (1 hunks)
ci/lepton/model_convergence/configs/recipes/geneformer_native_te_4b.yaml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
🔇 Additional comments (6)
ci/lepton/model_convergence/configs/recipes/geneformer_native_te_106m.yaml (2)
93-97: wandb_init_args.mode is defined upstream — ci/lepton/model_convergence/configs/base.yaml sets it to "online", so no action required.
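A sketch of the base.yaml fragment this refers to (surrounding keys are an assumption):

```yaml
# ci/lepton/model_convergence/configs/base.yaml (fragment, per the comment above)
wandb_init_args:
  mode: online   # recipes reference this instead of redefining it
```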
87-90: Referenced model config exists and is loadable by Hydra
Hydra will find hydra_config/106m.yaml (which defaults to model: 106m) and load hydra_config/model/106m.yaml when invoked with --config-name 106m.yaml.

ci/lepton/model_convergence/configs/recipes/geneformer_native_te_4b.yaml (2)
93-97: wandb_init_args.mode is defined in base.yaml (line 68), so the reference in this recipe is valid.
87-90: 4b.yaml is present and discoverable under the Hydra config directories
Files recipes/geneformer_native_te_mfsdp_fp8/hydra_config/4b.yaml and recipes/geneformer_native_te_mfsdp_fp8/hydra_config/model/4b.yaml exist.

ci/lepton/model_convergence/configs/recipes/geneformer_native_te_10m.yaml (2)
93-97: wandb_init_args.mode is defined in base.yaml (line 68); reference is valid.
87-90: Hydra ‘10m.yaml’ and its model config exist under recipes/geneformer_native_te_mfsdp_fp8/hydra_config, so --config-name 10m.yaml will resolve correctly.
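For context, a sketch of the Hydra composition these comments describe, assuming the standard defaults-list mechanism:

```yaml
# recipes/geneformer_native_te_mfsdp_fp8/hydra_config/10m.yaml (sketch)
defaults:
  - model: 10m   # composes hydra_config/model/10m.yaml
  - _self_
# Launching with --config-name 10m.yaml resolves this composition at startup.
```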
Signed-off-by: Jared Wilber <[email protected]>
Signed-off-by: Jared Wilber <[email protected]>
Adding new geneformer scripts, one for each of the 10m, 106m, and 4b models.