[TRTLLM-13120][feat] Cosmos3 Audio Output Support by NVShreyas · Pull Request #14827 · NVIDIA/TensorRT-LLM

NVShreyas · 2026-06-01T15:57:04Z

Summary by CodeRabbit

Release Notes

New Features
- Added audio generation capability to Cosmos3 visual generation pipeline, allowing optional audio output alongside video.
- New enable_audio parameter enables audio generation per inference request with decoded audio waveforms.
- Pipeline output now includes audio data and corresponding sample rate information.
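
As a rough illustration of how a caller might consume the new output fields, the sketch below encodes a mono waveform and its sample rate as a 16-bit PCM WAV using only the standard library. `FakePipelineOutput` is a hypothetical stand-in, not the real output class; the field names `audio` and `audio_sample_rate` are taken from the walkthrough, while the list-of-floats layout and the `[-1, 1]` value range are assumptions.

```python
import io
import struct
import wave
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakePipelineOutput:
    """Hypothetical stand-in for the pipeline output; not the real class."""
    audio: Optional[List[float]]      # assumed: mono waveform in [-1.0, 1.0]
    audio_sample_rate: Optional[int]  # samples per second


def audio_to_wav_bytes(output: FakePipelineOutput) -> Optional[bytes]:
    """Encode the waveform as 16-bit PCM WAV; return None when audio is absent."""
    if output.audio is None or output.audio_sample_rate is None:
        return None
    # Clamp each sample to [-1, 1] and scale to signed 16-bit integers.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in output.audio]
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes per sample
        wav.setframerate(output.audio_sample_rate)
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return buf.getvalue()
```

When audio is disabled, the helper returns `None`, so callers can branch on the result instead of inspecting the output fields themselves.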

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

- PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
- PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
- Test cases are provided for new code paths (see test instructions).
- If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
- Any new dependencies have been scanned for license and vulnerabilities.
- CODEOWNERS updated if ownership changes.
- Documentation updated as needed.
- Update tava architecture diagram if there is a significant design change in PR.
- The reviewers assigned automatically/manually are appropriate for the PR.
- Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai · 2026-06-01T16:08:05Z

📝 Walkthrough

This PR adds audio generation support to Cosmos3OmniMoTPipeline alongside video. Audio generation is gated by an enable_audio flag, with an audio decoder (LatentAutoEncoderV2) loaded conditionally, separate scheduler state maintained, transformer changes to embed/inject audio tokens, and full audio decoding and return in pipeline output.

Changes

Cosmos3 Audio Generation Integration

| Layer / File(s) | Summary |
| --- | --- |
| Configuration & base pipeline updates: `tensorrt_llm/_torch/visual_gen/models/cosmos3/defaults.py`, `tensorrt_llm/_torch/visual_gen/pipeline_registry.py`, `tensorrt_llm/_torch/visual_gen/pipeline.py` | `enable_audio` boolean parameter added to `COSMOS3_EXTRA_SPECS`; `PipelineComponent.SOUND_TOKENIZER` enum member registered; `BasePipeline.dtype` changed to return `model_config.torch_dtype` instead of inferring from transformer parameters. |
| Audio decoder module building blocks: `tensorrt_llm/_torch/visual_gen/models/cosmos3/modules.py` | `SnakeBeta` sinusoidal activation with trainable frequency/magnitude, weight-normalized convolution wrappers (`WNConv1d`, `WNConvTranspose1d`), normalization selection (`get_norm_module`, `apply_parametrization_norm`), 1D padding utilities with reflect-mode handling, and `NormConvTranspose1d` / `SConvTranspose1d` transposed convolution modules with causal trimming. |
| Sound tokenizer and decoder architecture: `tensorrt_llm/_torch/visual_gen/models/cosmos3/sound_tokenizer.py` | `LatentAutoEncoderV2` decoder wrapping `OobleckDecoder` with configuration-driven upsampling blocks, residual units, activation selection, and `from_pretrained` loader that extracts decoder-only weights from safetensors, validates completeness, removes weight norm, and freezes parameters. |
| Transformer audio support and token injection: `tensorrt_llm/_torch/visual_gen/models/cosmos3/transformer_cosmos3.py` | `TransformerOutput` dataclass returns video/image/audio/action; `Cosmos3VFMTransformer` initialization extended with audio projection/embedding layers; audio latents packed/unpacked between tensor layouts; RoPE computed for audio tokens; GEN path refactored with `SequenceSharder` for token sharding; audio tokens concatenated to sequence; checkpoint loading remaps audio projection keys; dtype casting applied to audio parameters. |
| Pipeline audio generation orchestration: `tensorrt_llm/_torch/visual_gen/models/cosmos3/pipeline_cosmos3.py` | Conditional audio tokenizer loading; separate `audio_scheduler` for independent audio/video state; audio latent initialization in `forward()` when enabled; audio threaded through transformer denoising with dual streams; audio decoding post-denoising; `PipelineOutput` returns `video`, `frame_rate`, `audio`, and `audio_sample_rate` when audio is present. |
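
For readers unfamiliar with the `SnakeBeta` activation named in the table, the scalar sketch below shows the commonly published form, `x + sin²(αx) / β`, where `alpha` sets the frequency and `beta` the magnitude of the periodic term. This is a framework-free illustration only; whether the module in this PR stores the parameters in log scale or uses the same epsilon is not stated in the thread and is assumed here.

```python
import math


def snake_beta(x: float, alpha: float, beta: float, eps: float = 1e-9) -> float:
    """Scalar SnakeBeta: identity plus a periodic term.

    alpha scales the frequency of the sine; beta scales its magnitude.
    eps guards against division by zero (value is an assumption).
    """
    return x + (math.sin(alpha * x) ** 2) / (beta + eps)


def snake_beta_channelwise(xs, alphas, betas):
    """Apply snake_beta per channel, one (alpha, beta) pair per channel.

    xs is a list of channels, each a list of samples over time.
    """
    return [
        [snake_beta(v, a, b) for v in channel]
        for channel, a, b in zip(xs, alphas, betas)
    ]
```

Because the periodic term is non-negative for positive `beta`, the output never falls below the input, which is one way to sanity-check an implementation.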

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.74% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ⚠️ Warning | The PR description section is empty, with only template comments and checklist items remaining. No explanation of the issue, solution, or test coverage was provided. | Add a clear description explaining what audio output support entails, why it was added, what tests verify the feature, and confirm checklist items are actually met. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The pull request title clearly and specifically describes the main feature being added: audio output support for the Cosmos3 model. |


coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tensorrt_llm/_torch/visual_gen/models/cosmos3/transformer_cosmos3.py (1)
1098-1118: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Normalize the `model.` prefix before matching audio keys.

The new audio remaps run before Lines 1116-1118 strip a leading `model.` prefix, so checkpoint entries like `model.audio_proj_in.weight` and `model.audio_modality_embed` fall through and get skipped. That leaves the audio path partially unloaded for checkpoints using the same prefix convention as the rest of the transformer weights.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/visual_gen/models/cosmos3/transformer_cosmos3.py` around
lines 1098 - 1118, The audio-related checkpoint keys (prefixes "audio_proj_in.",
"audio_proj_out.", "audio_modality_embed", "time_embedder.linear") are being
checked before the code strips a leading "model." prefix, so entries like
"model.audio_proj_in.weight" are missed; fix by normalizing the key early (e.g.,
update the variable k by stripping "model." when present) before the audio remap
checks in the same block (ensure the "model." handling around k =
k[len("model.") :] happens before the checks for "audio_proj_in.",
"audio_proj_out.", "audio_modality_embed", and "time_embedder.linear") so all
audio keys with or without the "model." prefix are correctly remapped.
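
The ordering fix described above can be sketched as follows. The prefix strings come from the review comment; the helper names `strip_model_prefix` and `needs_audio_remap` are hypothetical and do not reflect the actual structure of `load_weights()`.

```python
# Prefixes named in the review comment as needing remapping.
REMAPPED_PREFIXES = (
    "audio_proj_in.",
    "audio_proj_out.",
    "audio_modality_embed",
    "time_embedder.linear",
)


def strip_model_prefix(key: str) -> str:
    """Drop a leading 'model.' so later prefix checks see a uniform key."""
    prefix = "model."
    return key[len(prefix):] if key.startswith(prefix) else key


def needs_audio_remap(key: str) -> bool:
    """Normalize first, then match, so prefixed and bare keys behave alike."""
    return strip_model_prefix(key).startswith(REMAPPED_PREFIXES)
```

With this ordering, `model.audio_proj_in.weight` and `audio_proj_in.weight` take the same branch, whereas matching before normalization would skip the prefixed form.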

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/visual_gen/models/cosmos3/modules.py`:
- Around line 100-112: get_norm_module currently misses the "time_layer_norm"
case and incorrectly uses nn.LayerNorm(module.out_channels) directly (LayerNorm
expects the normalized dimension to be last), causing wrong behavior for tensors
shaped [N,C,T]; update get_norm_module to (1) add a branch for norm ==
"time_layer_norm" that returns a module which applies LayerNorm over the channel
dimension by permuting [N,C,T] -> [N,T,C], applying
nn.LayerNorm(module.out_channels), then permuting back, and (2) fix the existing
"layer_norm" branch to similarly wrap nn.LayerNorm so it normalizes channels
correctly for ConvNd outputs; keep the causal/group-norm check intact and still
assert module is an instance of nn.modules.conv._ConvNd and refer to
get_norm_module and CONV_NORMALIZATIONS when making the changes.
- Around line 115-129: The pad1d function's default mode "zero" is invalid for
torch.nn.functional.pad; update pad1d (function pad1d) to use a valid default
such as mode="constant" (which preserves the existing value parameter semantics)
and ensure any callers expecting "zero" continue to work by using the value
argument; adjust the function signature default and keep the existing
reflect-handling logic intact so F.pad is always invoked with a supported mode.
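
To make the `pad1d` point concrete, here is a framework-free sketch on plain Python lists that mirrors the semantics of the `"constant"` and `"reflect"` padding modes. It defaults to `"constant"`, rejects unsupported modes such as `"zero"`, and handles short inputs in reflect mode by temporarily zero-extending the signal. That short-input strategy is an assumption about what "reflect-mode handling" means in the PR, not a copy of its code.

```python
from typing import List, Tuple


def pad1d(
    x: List[float],
    paddings: Tuple[int, int],
    mode: str = "constant",
    value: float = 0.0,
) -> List[float]:
    """Pad a 1D sequence on the left and right."""
    left, right = paddings
    if left < 0 or right < 0:
        raise ValueError("paddings must be non-negative")
    if mode == "constant":
        return [value] * left + list(x) + [value] * right
    if mode == "reflect":
        work = list(x)
        max_pad = max(left, right)
        extra = 0
        # Reflection needs len(work) > pad; zero-extend short inputs first.
        if len(work) <= max_pad:
            extra = max_pad - len(work) + 1
            work = work + [0.0] * extra
        # Mirror around the edges without repeating the edge sample.
        left_part = work[left:0:-1]
        right_part = work[-2:-right - 2:-1]
        padded = left_part + work + right_part
        # Drop the temporary extension so the length is len(x) + left + right.
        return padded[: len(padded) - extra]
    raise ValueError(f"unsupported padding mode: {mode!r}")
```

For example, `pad1d([1, 2, 3, 4], (2, 2), mode="reflect")` yields `[3, 2, 1, 2, 3, 4, 3, 2]`, and a call with `mode="zero"` fails fast instead of reaching the underlying pad routine.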

In `@tensorrt_llm/_torch/visual_gen/models/cosmos3/transformer_cosmos3.py`:
- Around line 686-695: The audio config fallback uses getattr(...,
"<new_field>", pretrained_config.<legacy_field>) which eagerly evaluates the
legacy attribute and can raise AttributeError if legacy keys are missing; change
the audio field assignments in transformer_cosmos3 (audio_dim, audio_latent_fps,
temporal_compression_factor_audio) to safely check for legacy keys (e.g., use
nested getattr or hasattr/try-except to read pretrained_config.sound_* only if
present) so the fallback is evaluated lazily. Also in load_weights(), the
remapping for audio keys (audio_proj_in.*, audio_proj_out.*,
audio_modality_embed) happens before the code strips the leading "model."
prefix, so normalize checkpoint keys by removing the "model." prefix first (or
re-run the audio remap after normalization) so keys like "model.audio_proj_in.*"
correctly match the remapping branches.
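
The eager-evaluation pitfall can be reproduced with a few lines of plain Python. In the sketch below, `get_with_legacy` is a hypothetical helper, and `sound_dim` is an assumed legacy field name standing in for the `sound_*` keys mentioned above; only `audio_dim` is named in the review.

```python
from types import SimpleNamespace
from typing import Any


def get_with_legacy(config: Any, new_name: str, legacy_name: str, default: Any = None) -> Any:
    """Read new_name if present; touch legacy_name only when it is needed."""
    if hasattr(config, new_name):
        return getattr(config, new_name)
    return getattr(config, legacy_name, default)


# A config that has only the new field (no legacy key at all).
new_style = SimpleNamespace(audio_dim=64)

# Eager form: the third argument is evaluated before getattr runs,
# so this raises AttributeError even though audio_dim exists.
#   getattr(new_style, "audio_dim", new_style.sound_dim)

# Lazy form: the legacy attribute is never touched here.
audio_dim = get_with_legacy(new_style, "audio_dim", "sound_dim")
```

The same helper still serves older configs: `get_with_legacy(SimpleNamespace(sound_dim=32), "audio_dim", "sound_dim")` returns the legacy value.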


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ed333cb8-dd36-4030-acfd-3b1c5628f0c3

📥 Commits

Reviewing files that changed from the base of the PR and between 06456e1 and 145755d.

📒 Files selected for processing (7)

tensorrt_llm/_torch/visual_gen/models/cosmos3/defaults.py
tensorrt_llm/_torch/visual_gen/models/cosmos3/modules.py
tensorrt_llm/_torch/visual_gen/models/cosmos3/pipeline_cosmos3.py
tensorrt_llm/_torch/visual_gen/models/cosmos3/sound_tokenizer.py
tensorrt_llm/_torch/visual_gen/models/cosmos3/transformer_cosmos3.py
tensorrt_llm/_torch/visual_gen/pipeline.py
tensorrt_llm/_torch/visual_gen/pipeline_registry.py

NVShreyas · 2026-06-12T21:50:52Z

/bot run

tensorrt-cicd · 2026-06-12T21:57:00Z

PR_Github #53961 [ run ] triggered by Bot. Commit: b656515 Link to invocation

tensorrt-cicd · 2026-06-13T00:13:49Z

PR_Github #53961 [ run ] completed with state FAILURE. Commit: b656515
/LLM/main/L0_MergeRequest_PR pipeline #43052 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

NVShreyas · 2026-06-22T14:50:28Z

/bot run

tensorrt-cicd · 2026-06-22T14:59:59Z

PR_Github #55043 [ run ] triggered by Bot. Commit: ca7a790 Link to invocation

tensorrt-cicd · 2026-06-22T15:11:24Z

PR_Github #55043 [ run ] completed with state FAILURE. Commit: ca7a790
/LLM/main/L0_MergeRequest_PR pipeline #44034 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

rahul-steiger-nv · 2026-06-24T08:41:31Z

/bot run

tensorrt-cicd · 2026-06-24T08:47:03Z

PR_Github #55461 [ run ] triggered by Bot. Commit: 7d46330 Link to invocation

tensorrt-cicd · 2026-06-24T09:03:25Z

PR_Github #55461 [ run ] completed with state FAILURE. Commit: 7d46330
/LLM/main/L0_MergeRequest_PR pipeline #44389 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

bobboli · 2026-06-25T16:37:25Z

Hi @NVShreyas @rahul-steiger-nv,

15417 will require a small update here, but it should not block this PR.

The scheduler/pipeline logic should still compare against raw t.

15417 only changes the transformer-forward contract:

- `timestep` = normalized value for attention/model-agnostic scheduling
- `raw_timestep` = raw scheduler value for Cosmos time embedding

So after 15417, this PR should update the transformer call from:

    timestep=timestep,
    attention_timestep=timestep / self.scheduler.config.num_train_timesteps,

to:

    timestep=timestep / self.scheduler.config.num_train_timesteps,
    raw_timestep=timestep,

This preserves the 15545 fix because Cosmos time embedding still uses the raw scheduler timestep.
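
A minimal sketch of the post-15417 contract, assuming the two keyword names above are the only ones that change. `transformer_timestep_kwargs` is a hypothetical helper for illustration, and the value 1000 in the usage line is a placeholder for `num_train_timesteps`, not a number taken from this PR.

```python
def transformer_timestep_kwargs(t: float, num_train_timesteps: int) -> dict:
    """Build the two timestep arguments for the transformer call.

    timestep     -> normalized to [0, 1] for model-agnostic scheduling
    raw_timestep -> untouched scheduler value for the Cosmos time embedding
    """
    return {
        "timestep": t / num_train_timesteps,
        "raw_timestep": t,
    }


# Scheduler and pipeline logic keep comparing against the raw value t.
kwargs = transformer_timestep_kwargs(500.0, 1000)
```

Keeping the raw value alongside the normalized one means the call site never has to multiply back by `num_train_timesteps`.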

I can make the adjustment after this PR lands, or this PR can rebase after 15417 lands. Either path is fine. Thanks!

rahul-steiger-nv · 2026-06-25T19:04:45Z

> I can make the adjustment after this PR lands, or this PR can rebase after 15417 lands. Either path is fine. Thanks!

Thanks, that sounds good. I’m happy to rebase after 15417 lands and make the small call-site update in this PR.

rahul-steiger-nv · 2026-06-25T19:05:16Z

/bot run

tensorrt-cicd · 2026-06-25T19:12:50Z

PR_Github #55867 [ run ] triggered by Bot. Commit: 7d46330 Link to invocation

tensorrt-cicd · 2026-07-01T19:21:50Z

PR_Github #56966 [ run ] completed with state SUCCESS. Commit: 2f9da72
/LLM/main/L0_MergeRequest_PR pipeline #45767 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

…lision

- test_cosmos3_example: point script_path at the reorganized examples/visual_gen/models/cosmos3/cosmos3.py (was flat cosmos3_ti2v.py)
- _run_forward: accept height/width/guidance_scale overrides so test_t2i_smoke no longer passes duplicate keyword arguments to forward()

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

NVShreyas · 2026-07-01T19:35:55Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-07-01T19:45:31Z

PR_Github #57013 [ run ] triggered by Bot. Commit: 5b9dd9e Link to invocation

tensorrt-cicd · 2026-07-01T23:34:08Z

PR_Github #57013 [ run ] completed with state SUCCESS. Commit: 5b9dd9e
/LLM/main/L0_MergeRequest_PR pipeline #45812 completed with status: 'SUCCESS'

CI Report

Link to invocation

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

NVShreyas requested a review from a team as a code owner June 1, 2026 15:57

NVShreyas added the VisualGen label Jun 1, 2026

github-actions Bot assigned NVShreyas Jun 1, 2026

coderabbitai Bot reviewed Jun 1, 2026

Comment thread tensorrt_llm/_torch/visual_gen/models/cosmos3/modules.py

Comment thread tensorrt_llm/_torch/visual_gen/models/cosmos3/modules.py Outdated

Comment thread tensorrt_llm/_torch/visual_gen/models/cosmos3/transformer_cosmos3.py

NVShreyas changed the title ~~[None][feat] Cosmos3 Audio Output Support~~ [TRTLLM-13120][feat] Cosmos3 Audio Output Support Jun 1, 2026

NVShreyas force-pushed the user/shreyasm/cosmos3-audio branch 2 times, most recently from 0c3d063 to 8aa0e91 Compare June 5, 2026 18:41

NVShreyas requested a review from a team as a code owner June 8, 2026 19:37

NVShreyas requested review from QiJune and Shixiaowei02 June 8, 2026 19:37

NVShreyas force-pushed the user/shreyasm/cosmos3-audio branch from 350468f to 217f591 Compare June 9, 2026 14:52

chang-l reviewed Jun 11, 2026

NVShreyas force-pushed the user/shreyasm/cosmos3-audio branch from 16bce31 to 8673740 Compare June 12, 2026 15:43

NVShreyas commented Jun 12, 2026

Comment thread tensorrt_llm/_torch/visual_gen/pipeline.py

NVShreyas force-pushed the user/shreyasm/cosmos3-audio branch from 178450d to ca7a790 Compare June 22, 2026 14:28

rahul-steiger-nv force-pushed the user/shreyasm/cosmos3-audio branch from ca7a790 to 7d46330 Compare June 24, 2026 08:41

NVShreyas added 19 commits July 1, 2026 14:31

add sound tokenizer

6e616eb

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

full sound pipeline

bc97534

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

working implementation with sound

006eea6

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

simplify - remove encoder

05cb320

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

add enable_sound req param

4d970f6

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

working with new checkpoint

f25540c

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

Fix latent dims for audio

c0ee3a4

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

tests with audio enabled

eb69ef8

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

address coderabbit comments

913d9ed

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

update defaults

924dfab

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

t2i updates

e3fc639

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

use per batch attention with sliced KV

f2a4201

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

guidance interval to base pipeline

3d3cfe9

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

add neg prompt example

dfc092a

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

tests

1f29600

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

address comments

bc08326

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

image url support

ea6a254

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

generalize example script and update docstring

83e7ca2

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

address comments

0601b25

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

NVShreyas force-pushed the user/shreyasm/cosmos3-audio branch from 2f9da72 to 0601b25 Compare July 1, 2026 19:31

bobboli enabled auto-merge (squash) July 2, 2026 06:33

zhenhuaw-me approved these changes Jul 2, 2026

Comment thread examples/visual_gen/configs/cosmos3-nano-1gpu.yaml

bobboli merged commit f50ca53 into NVIDIA:main Jul 2, 2026
8 checks passed

bobboli mentioned this pull request Jul 2, 2026

[None][refactor] Refine Skip Softmax follow-ups #15417

Open

evezhier pushed a commit to evezhier/TensorRT-LLM that referenced this pull request Jul 2, 2026

[TRTLLM-13120][feat] Cosmos3 Audio Output Support (NVIDIA#14827)

267fe3f

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>
