[TRTLLM-13120][feat] Cosmos3 Audio Output Support #14827

Merged
bobboli merged 20 commits into NVIDIA:main from NVShreyas:user/shreyasm/cosmos3-audio on Jul 2, 2026
@NVShreyas NVShreyas commented Jun 1, 2026

Collaborator

Summary by CodeRabbit

Release Notes

  • New Features
    • Added audio generation capability to Cosmos3 visual generation pipeline, allowing optional audio output alongside video.
    • A new enable_audio parameter turns on audio generation per inference request and returns the decoded audio waveforms.
    • Pipeline output now includes audio data and corresponding sample rate information.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@NVShreyas NVShreyas requested a review from a team as a code owner June 1, 2026 15:57
@coderabbitai

coderabbitai Bot commented Jun 1, 2026

Contributor

Walkthrough

This PR adds audio generation support to Cosmos3OmniMoTPipeline alongside video. Audio generation is gated by an enable_audio flag. When the flag is set, the pipeline conditionally loads an audio decoder (LatentAutoEncoderV2), keeps a separate scheduler state for audio, embeds and injects audio tokens in the transformer, and decodes the audio and returns it in the pipeline output.

Changes

Cosmos3 Audio Generation Integration

| Layer / File(s) | Summary |
|---|---|
| Configuration & base pipeline updates: tensorrt_llm/_torch/visual_gen/models/cosmos3/defaults.py, tensorrt_llm/_torch/visual_gen/pipeline_registry.py, tensorrt_llm/_torch/visual_gen/pipeline.py | enable_audio boolean parameter added to COSMOS3_EXTRA_SPECS; PipelineComponent.SOUND_TOKENIZER enum member registered; BasePipeline.dtype changed to return model_config.torch_dtype instead of inferring from transformer parameters. |
| Audio decoder module building blocks: tensorrt_llm/_torch/visual_gen/models/cosmos3/modules.py | SnakeBeta sinusoidal activation with trainable frequency/magnitude, weight-normalized convolution wrappers (WNConv1d, WNConvTranspose1d), normalization selection (get_norm_module, apply_parametrization_norm), 1D padding utilities with reflect-mode handling, and NormConvTranspose1d / SConvTranspose1d transposed convolution modules with causal trimming. |
| Sound tokenizer and decoder architecture: tensorrt_llm/_torch/visual_gen/models/cosmos3/sound_tokenizer.py | LatentAutoEncoderV2 decoder wrapping OobleckDecoder with configuration-driven upsampling blocks, residual units, activation selection, and from_pretrained loader that extracts decoder-only weights from safetensors, validates completeness, removes weight norm, and freezes parameters. |
| Transformer audio support and token injection: tensorrt_llm/_torch/visual_gen/models/cosmos3/transformer_cosmos3.py | TransformerOutput dataclass returns video/image/audio/action; Cosmos3VFMTransformer initialization extended with audio projection/embedding layers; audio latents packed/unpacked between tensor layouts; RoPE computed for audio tokens; GEN path refactored with SequenceSharder for token sharding; audio tokens concatenated to sequence; checkpoint loading remaps audio projection keys; dtype casting applied to audio parameters. |
| Pipeline audio generation orchestration: tensorrt_llm/_torch/visual_gen/models/cosmos3/pipeline_cosmos3.py | Conditional audio tokenizer loading; separate audio_scheduler for independent audio/video state; audio latent initialization in forward() when enabled; audio threaded through transformer denoising with dual streams; audio decoding post-denoising; PipelineOutput returns video, frame_rate, audio, and audio_sample_rate when audio is present. |
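The output contract in the last row can be pictured with a small sketch. This is not the actual PipelineOutput class from the PR; `PipelineOutputSketch`, `make_output`, `fake_decode_audio`, and the 48000 Hz sample rate are hypothetical names and values used only to illustrate that the audio fields are populated when enable_audio is set and stay empty otherwise.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass
class PipelineOutputSketch:
    """Hypothetical stand-in for the pipeline output described above."""
    video: Any
    frame_rate: float
    audio: Optional[Any] = None              # decoded waveform, if requested
    audio_sample_rate: Optional[int] = None  # only set together with audio


def make_output(
    video: Any,
    frame_rate: float,
    enable_audio: bool,
    decode_audio: Optional[Callable[[], Tuple[Any, int]]] = None,
) -> PipelineOutputSketch:
    """Attach audio fields only when audio generation was requested."""
    if not enable_audio or decode_audio is None:
        return PipelineOutputSketch(video=video, frame_rate=frame_rate)
    waveform, sample_rate = decode_audio()
    return PipelineOutputSketch(video, frame_rate, waveform, sample_rate)


def fake_decode_audio() -> Tuple[list, int]:
    # Assumption: a decoder returns (waveform, sample_rate); values are made up.
    return [0.0, 0.1, -0.1], 48000


out_video_only = make_output("video-tensor", 24.0, enable_audio=False)
out_with_audio = make_output("video-tensor", 24.0, True, fake_decode_audio)
print(out_video_only)
print(out_with_audio)
```

Keeping both audio fields optional lets callers that never set enable_audio continue to read video and frame_rate unchanged.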

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.74% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ⚠️ Warning | The PR description section is empty, with only template comments and checklist items remaining. No explanation of the issue, solution, or test coverage was provided. | Add a clear description explaining what audio output support entails, why it was added, what tests verify the feature, and confirm checklist items are actually met. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The pull request title clearly and specifically describes the main feature being added: audio output support for the Cosmos3 model. |


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/visual_gen/models/cosmos3/transformer_cosmos3.py (1)

1098-1118: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Normalize the model. prefix before matching audio keys.

The new audio remaps run before Lines 1116-1118 strip a leading model. prefix, so checkpoint entries like model.audio_proj_in.weight and model.audio_modality_embed fall through and get skipped. That leaves the audio path partially unloaded for checkpoints using the same prefix convention as the rest of the transformer weights.
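The ordering problem can be shown with a minimal sketch. It is not the load_weights() code from the PR: `normalize_key` and `split_audio_keys` are hypothetical helpers, and the non-audio key name is invented. The only point is that the "model." prefix is stripped before any audio prefix is tested.

```python
from typing import Iterable, List, Tuple

# Audio key prefixes named in the review comment.
AUDIO_PREFIXES = ("audio_proj_in.", "audio_proj_out.", "audio_modality_embed")


def normalize_key(key: str) -> str:
    """Strip a single leading 'model.' prefix, if present."""
    prefix = "model."
    return key[len(prefix):] if key.startswith(prefix) else key


def split_audio_keys(checkpoint_keys: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Normalize first, then classify keys as audio or non-audio."""
    audio, other = [], []
    for raw in checkpoint_keys:
        k = normalize_key(raw)  # must happen before the audio prefix checks
        (audio if k.startswith(AUDIO_PREFIXES) else other).append(k)
    return audio, other


keys = [
    "model.audio_proj_in.weight",    # prefixed audio key
    "audio_modality_embed",          # unprefixed audio key
    "model.blocks.0.attn.q.weight",  # hypothetical non-audio key
]
audio, other = split_audio_keys(keys)
print(audio)
print(other)
```

With the check order reversed, the first key would not start with any audio prefix and would be treated as a non-audio weight.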

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/visual_gen/models/cosmos3/transformer_cosmos3.py` around
lines 1098 - 1118, The audio-related checkpoint keys (prefixes "audio_proj_in.",
"audio_proj_out.", "audio_modality_embed", "time_embedder.linear") are being
checked before the code strips a leading "model." prefix, so entries like
"model.audio_proj_in.weight" are missed; fix by normalizing the key early (e.g.,
update the variable k by stripping "model." when present) before the audio remap
checks in the same block (ensure the "model." handling around k =
k[len("model.") :] happens before the checks for "audio_proj_in.",
"audio_proj_out.", "audio_modality_embed", and "time_embedder.linear") so all
audio keys with or without the "model." prefix are correctly remapped.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/visual_gen/models/cosmos3/modules.py`:
- Around line 100-112: get_norm_module currently misses the "time_layer_norm"
case and incorrectly uses nn.LayerNorm(module.out_channels) directly (LayerNorm
expects the normalized dimension to be last), causing wrong behavior for tensors
shaped [N,C,T]; update get_norm_module to (1) add a branch for norm ==
"time_layer_norm" that returns a module which applies LayerNorm over the channel
dimension by permuting [N,C,T] -> [N,T,C], applying
nn.LayerNorm(module.out_channels), then permuting back, and (2) fix the existing
"layer_norm" branch to similarly wrap nn.LayerNorm so it normalizes channels
correctly for ConvNd outputs; keep the causal/group-norm check intact and still
assert module is an instance of nn.modules.conv._ConvNd and refer to
get_norm_module and CONV_NORMALIZATIONS when making the changes.
- Around line 115-129: The pad1d function's default mode "zero" is invalid for
torch.nn.functional.pad; update pad1d (function pad1d) to use a valid default
such as mode="constant" (which preserves the existing value parameter semantics)
and ensure any callers expecting "zero" continue to work by using the value
argument; adjust the function signature default and keep the existing
reflect-handling logic intact so F.pad is always invoked with a supported mode.

In `@tensorrt_llm/_torch/visual_gen/models/cosmos3/transformer_cosmos3.py`:
- Around line 686-695: The audio config fallback uses getattr(...,
"<new_field>", pretrained_config.<legacy_field>) which eagerly evaluates the
legacy attribute and can raise AttributeError if legacy keys are missing; change
the audio field assignments in transformer_cosmos3 (audio_dim, audio_latent_fps,
temporal_compression_factor_audio) to safely check for legacy keys (e.g., use
nested getattr or hasattr/try-except to read pretrained_config.sound_* only if
present) so the fallback is evaluated lazily. Also in load_weights(), the
remapping for audio keys (audio_proj_in.*, audio_proj_out.*,
audio_modality_embed) happens before the code strips the leading "model."
prefix, so normalize checkpoint keys by removing the "model." prefix first (or
re-run the audio remap after normalization) so keys like "model.audio_proj_in.*"
correctly match the remapping branches.
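The eager-evaluation pitfall in the getattr fallback can be reproduced in isolation. The sketch below assumes a legacy attribute called `sound_dim`, which is a made-up name standing in for the elided `sound_*` fields, and `get_with_legacy` is a hypothetical helper rather than code from the PR.

```python
from types import SimpleNamespace

_MISSING = object()  # sentinel so that None remains a valid config value


def get_with_legacy(config, new_name: str, legacy_name: str, default=None):
    """Read new_name, falling back to legacy_name only when it is needed."""
    value = getattr(config, new_name, _MISSING)
    if value is not _MISSING:
        return value
    return getattr(config, legacy_name, default)


new_style = SimpleNamespace(audio_dim=64)     # only the new field exists
legacy_style = SimpleNamespace(sound_dim=32)  # only the legacy field exists

# Eager pattern: the default argument is evaluated before getattr runs,
# so a config without the legacy field raises even though audio_dim exists.
try:
    getattr(new_style, "audio_dim", new_style.sound_dim)
    eager_failed = False
except AttributeError:
    eager_failed = True

print(eager_failed)
print(get_with_legacy(new_style, "audio_dim", "sound_dim"))
print(get_with_legacy(legacy_style, "audio_dim", "sound_dim"))
```

A nested `getattr(cfg, new, getattr(cfg, legacy, default))` would also avoid the exception; the sentinel version is used here only because it skips the legacy lookup entirely when the new field is present.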

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ed333cb8-dd36-4030-acfd-3b1c5628f0c3

📥 Commits

Reviewing files that changed from the base of the PR and between 06456e1 and 145755d.

📒 Files selected for processing (7)
  • tensorrt_llm/_torch/visual_gen/models/cosmos3/defaults.py
  • tensorrt_llm/_torch/visual_gen/models/cosmos3/modules.py
  • tensorrt_llm/_torch/visual_gen/models/cosmos3/pipeline_cosmos3.py
  • tensorrt_llm/_torch/visual_gen/models/cosmos3/sound_tokenizer.py
  • tensorrt_llm/_torch/visual_gen/models/cosmos3/transformer_cosmos3.py
  • tensorrt_llm/_torch/visual_gen/pipeline.py
  • tensorrt_llm/_torch/visual_gen/pipeline_registry.py

Comment thread tensorrt_llm/_torch/visual_gen/models/cosmos3/modules.py
Comment thread tensorrt_llm/_torch/visual_gen/models/cosmos3/modules.py Outdated
@NVShreyas NVShreyas changed the title [None][feat] Cosmos3 Audio Output Support [TRTLLM-13120][feat] Cosmos3 Audio Output Support Jun 1, 2026
@NVShreyas NVShreyas force-pushed the user/shreyasm/cosmos3-audio branch 2 times, most recently from 0c3d063 to 8aa0e91 Compare June 5, 2026 18:41
@NVShreyas NVShreyas requested a review from a team as a code owner June 8, 2026 19:37
@NVShreyas NVShreyas requested review from QiJune and Shixiaowei02 June 8, 2026 19:37
@NVShreyas NVShreyas force-pushed the user/shreyasm/cosmos3-audio branch from 350468f to 217f591 Compare June 9, 2026 14:52
Comment thread examples/visual_gen/models/cosmos3_ti2v.py Outdated
Comment thread examples/visual_gen/models/cosmos3/cosmos3_negative_prompt.json
Comment thread tensorrt_llm/_torch/visual_gen/models/cosmos3/pipeline_cosmos3.py Outdated
Comment thread tensorrt_llm/_torch/visual_gen/models/cosmos3/modules.py
Comment thread tensorrt_llm/_torch/visual_gen/models/cosmos3/pipeline_cosmos3.py Outdated
Comment thread examples/visual_gen/configs/cosmos3-nano-1gpu.yaml
Comment thread examples/visual_gen/models/cosmos3/cosmos3.py
Comment thread examples/visual_gen/models/cosmos3_ti2v.py Outdated
Comment thread tensorrt_llm/_torch/visual_gen/models/cosmos3/defaults.py Outdated
@NVShreyas NVShreyas force-pushed the user/shreyasm/cosmos3-audio branch from 16bce31 to 8673740 Compare June 12, 2026 15:43
Comment thread tensorrt_llm/_torch/visual_gen/pipeline.py
@NVShreyas

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #53961 [ run ] triggered by Bot. Commit: b656515 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #53961 [ run ] completed with state FAILURE. Commit: b656515
/LLM/main/L0_MergeRequest_PR pipeline #43052 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@NVShreyas NVShreyas force-pushed the user/shreyasm/cosmos3-audio branch from 178450d to ca7a790 Compare June 22, 2026 14:28
@NVShreyas

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #55043 [ run ] triggered by Bot. Commit: ca7a790 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55043 [ run ] completed with state FAILURE. Commit: ca7a790
/LLM/main/L0_MergeRequest_PR pipeline #44034 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@rahul-steiger-nv rahul-steiger-nv force-pushed the user/shreyasm/cosmos3-audio branch from ca7a790 to 7d46330 Compare June 24, 2026 08:41
@rahul-steiger-nv

Collaborator

/bot run

@tensorrt-cicd

Collaborator

PR_Github #55461 [ run ] triggered by Bot. Commit: 7d46330 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55461 [ run ] completed with state FAILURE. Commit: 7d46330
/LLM/main/L0_MergeRequest_PR pipeline #44389 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@bobboli

bobboli commented Jun 25, 2026

Collaborator

Hi @NVShreyas @rahul-steiger-nv,

15417 will require a small update here, but it should not block this PR.

The scheduler/pipeline logic should still compare against raw t.

15417 only changes the transformer-forward contract:

  • timestep = normalized value for attention/model-agnostic scheduling
  • raw_timestep = raw scheduler value for Cosmos time embedding

So after 15417, this PR should update the transformer call from:

timestep=timestep,
attention_timestep=timestep / self.scheduler.config.num_train_timesteps,

to:

timestep=timestep / self.scheduler.config.num_train_timesteps,
raw_timestep=timestep,

This preserves the 15545 fix because Cosmos time embedding still uses the raw scheduler timestep.

I can make the adjustment after this PR lands, or this PR can rebase after 15417 lands. Either path is fine. Thanks!
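A minimal sketch of the call-site contract described in this comment, under stated assumptions: `build_timestep_kwargs` is a hypothetical helper, not a function in the PR, and 1000 is only an example value for num_train_timesteps. It shows the normalized value going to timestep while the raw scheduler value is passed separately as raw_timestep.

```python
def build_timestep_kwargs(timestep: float, num_train_timesteps: int) -> dict:
    """Build the two timestep arguments for the transformer call.

    timestep     -> normalized value, for attention/model-agnostic scheduling
    raw_timestep -> raw scheduler value, for the Cosmos time embedding
    """
    return {
        "timestep": timestep / num_train_timesteps,
        "raw_timestep": timestep,
    }


# Scheduler and pipeline logic keep comparing against the raw value;
# only the keyword arguments passed to the transformer change.
raw_t = 500.0
kwargs = build_timestep_kwargs(raw_t, num_train_timesteps=1000)
print(kwargs)
```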

@rahul-steiger-nv

Collaborator

> I can make the adjustment after this PR lands, or this PR can rebase after 15417 lands. Either path is fine. Thanks!

Thanks, that sounds good. I’m happy to rebase after 15417 lands and make the small call-site update in this PR.

@rahul-steiger-nv

Collaborator

/bot run

@tensorrt-cicd

Collaborator

PR_Github #55867 [ run ] triggered by Bot. Commit: 7d46330 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #56966 [ run ] completed with state SUCCESS. Commit: 2f9da72
/LLM/main/L0_MergeRequest_PR pipeline #45767 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

NVShreyas added 19 commits July 1, 2026 14:31
Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>
@NVShreyas NVShreyas force-pushed the user/shreyasm/cosmos3-audio branch from 2f9da72 to 0601b25 Compare July 1, 2026 19:31
…lision

- test_cosmos3_example: point script_path at the reorganized
  examples/visual_gen/models/cosmos3/cosmos3.py (was flat cosmos3_ti2v.py)
- _run_forward: accept height/width/guidance_scale overrides so
  test_t2i_smoke no longer passes duplicate keyword arguments to forward()

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>
@NVShreyas

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #57013 [ run ] triggered by Bot. Commit: 5b9dd9e Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #57013 [ run ] completed with state SUCCESS. Commit: 5b9dd9e
/LLM/main/L0_MergeRequest_PR pipeline #45812 completed with status: 'SUCCESS'

CI Report

Link to invocation

@bobboli bobboli enabled auto-merge (squash) July 2, 2026 06:33
Comment thread examples/visual_gen/configs/cosmos3-nano-1gpu.yaml
@bobboli bobboli merged commit f50ca53 into NVIDIA:main Jul 2, 2026
8 checks passed
evezhier pushed a commit to evezhier/TensorRT-LLM that referenced this pull request Jul 2, 2026
Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>