Add Sound Encoder to Cosmos3 by MaciejBalaNV · Pull Request #13911 · huggingface/diffusers

MaciejBalaNV · 2026-06-10T13:21:27Z

What does this PR do?

Fixes # (issue)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline?
Did you read our philosophy doc (important for complex PRs)?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Signed-off-by: Maciej Bala <mbala@nvidia.com>

yiyixuxu · 2026-06-10T20:35:30Z

+    def _disable_encoder(self):
+        self.encoder = None
+        self._encoder_available = False
+        self.register_to_config(encoder_enabled=False)
+
+    def _fix_state_dict_keys_on_load(self, state_dict: OrderedDict) -> None:
+        super()._fix_state_dict_keys_on_load(state_dict)
+        if self.encoder is not None and not any(key.startswith("encoder.") for key in state_dict):
+            self._disable_encoder()
+


why do we need these two methods?

dg845 · 2026-06-11T02:12:07Z

        return hidden_states


+class Cosmos3AudioSnakeBeta(nn.Module):


It looks like the existing Snake1d module implements essentially the same logic as Cosmos3AudioSnakeBeta, could we use it as well for the encoder?
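For reference, the expression under discussion is the snake form that appears in the PR diff, x + sin²(αx) / (β + ε). A minimal sketch as a standalone function, assuming per-channel `alpha` and `beta` of shape (1, C, 1) that are already in linear (not log) scale; the name `snake_beta` is illustrative and is not part of diffusers:

```python
import torch


def snake_beta(hidden_states: torch.Tensor, alpha: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    # Same expression as the line shown in the diff: x + sin^2(alpha * x) / (beta + eps).
    return hidden_states + (beta + 1e-9).reciprocal() * torch.sin(alpha * hidden_states).pow(2)


# Illustrative shapes (assumed): batch of 2, 4 channels, 8 time steps.
x = torch.randn(2, 4, 8)
alpha = torch.ones(1, 4, 1)
beta = torch.ones(1, 4, 1)
y = snake_beta(x, alpha, beta)
```

If both modules reduce to this expression for the same `alpha` and `beta`, the remaining question for reuse would be how each one stores and transforms those parameters before applying it.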

dg845 · 2026-06-11T02:13:23Z

+        return hidden_states + (beta + 1e-9).reciprocal() * torch.sin(alpha * hidden_states).pow(2)
+
+
+class Cosmos3AudioLayerNorm(nn.Module):


Could we potentially reuse the existing diffusers.models.normalization.FP32LayerNorm module here? Like Cosmos3AudioLayerNorm, it also upcasts the weight and bias (if available) to FP32:

diffusers/src/diffusers/models/normalization.py

Lines 87 to 93 in 0cc1cdb

        return F.layer_norm(
            inputs.float(),
            self.normalized_shape,
            self.weight.float() if self.weight is not None else None,
            self.bias.float() if self.bias is not None else None,
            self.eps,
        ).to(origin_dtype)
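The upcasting pattern in that snippet can be sketched as a standalone function. This only illustrates the pattern (compute in FP32, cast back to the caller's dtype) and is not the diffusers class itself; `fp32_layer_norm` is a hypothetical name and the tensor shapes are assumptions:

```python
import torch
import torch.nn.functional as F


def fp32_layer_norm(inputs, normalized_shape, weight=None, bias=None, eps=1e-5):
    # Remember the caller's dtype so the result can be cast back at the end.
    origin_dtype = inputs.dtype
    return F.layer_norm(
        inputs.float(),
        normalized_shape,
        weight.float() if weight is not None else None,
        bias.float() if bias is not None else None,
        eps,
    ).to(origin_dtype)


# Low-precision inputs and affine parameters; normalization itself runs in float32.
x = torch.randn(2, 8, 16, dtype=torch.bfloat16)
weight = torch.ones(16, dtype=torch.bfloat16)
bias = torch.zeros(16, dtype=torch.bfloat16)
out = fp32_layer_norm(x, (16,), weight, bias)
```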

dg845 · 2026-06-11T02:16:48Z

+        self.pwconv2 = (
+            _zero_module(nn.Conv1d(intermediate_dim, hidden_dim, kernel_size=1))
+            if identity_init
+            else nn.Conv1d(intermediate_dim, hidden_dim, kernel_size=1)
+        )


Suggested change

-        self.pwconv2 = (
-            _zero_module(nn.Conv1d(intermediate_dim, hidden_dim, kernel_size=1))
-            if identity_init
-            else nn.Conv1d(intermediate_dim, hidden_dim, kernel_size=1)
-        )
+        self.pwconv2 = nn.Conv1d(intermediate_dim, hidden_dim, kernel_size=1)
+        if identity_init:
+            nn.init.zeros_(self.pwconv2.weight)
+            nn.init.zeros_(self.pwconv2.bias)

I think the above suggestion would be more clear and would allow us to remove the _zero_module helper method, as we prefer not to have too many small methods.
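To make the effect of the inline initialization concrete, here is a small sketch in which a zero-initialized final projection turns a residual block into the identity at initialization. `TinyResidualBlock` is a hypothetical stand-in, not the block from the PR; only the `pwconv2` initialization mirrors the suggestion:

```python
import torch
from torch import nn


class TinyResidualBlock(nn.Module):
    # Hypothetical stand-in for the block under review.
    def __init__(self, hidden_dim: int, intermediate_dim: int, identity_init: bool = True):
        super().__init__()
        self.pwconv1 = nn.Conv1d(hidden_dim, intermediate_dim, kernel_size=1)
        self.act = nn.GELU()
        self.pwconv2 = nn.Conv1d(intermediate_dim, hidden_dim, kernel_size=1)
        if identity_init:
            # Zero the last projection so the residual branch contributes nothing at init.
            nn.init.zeros_(self.pwconv2.weight)
            nn.init.zeros_(self.pwconv2.bias)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.pwconv2(self.act(self.pwconv1(hidden_states)))


block = TinyResidualBlock(hidden_dim=4, intermediate_dim=8)
x = torch.randn(2, 4, 16)
y = block(x)  # equals x at initialization because pwconv2 outputs zeros
```

The zeroed parameters remain trainable, so the block can move away from the identity during training.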

dg845 · 2026-06-11T02:17:49Z

+def _zero_module(module: nn.Module) -> nn.Module:
+    for parameter in module.parameters():
+        parameter.detach().zero_()
+    return module
+
+


Suggested change

-def _zero_module(module: nn.Module) -> nn.Module:
-    for parameter in module.parameters():
-        parameter.detach().zero_()
-    return module

Follow up suggestion to #13911 (comment).

dg845 · 2026-06-11T02:23:37Z

+        if num_channels > 1:
+            audio = audio.reshape(batch_size * num_channels, 1, num_samples)
+
+        with torch.autocast(device_type=audio.device.type, enabled=False):


Would it be possible to remove the autocast region here? We generally prefer not to use autocast regions and I think mixed-precision training would still work correctly without it (since _spectrogram doesn't use any ops that could be dispatched to a lower precision dtype).

dg845 · 2026-06-11T02:25:17Z

        return hidden_state


+class Cosmos3AudioDiagonalGaussianDistribution:


Would it be possible to reuse the existing OobleckDiagonalGaussianDistribution module here? I believe the logic is essentially the same as in Cosmos3AudioDiagonalGaussianDistribution.
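As a point of comparison for what such a class typically provides, here is a minimal diagonal Gaussian sketch. It assumes the encoder output is split along the channel dimension into a mean and a log-variance; that parameterization is an assumption, and both the Oobleck and the Cosmos3 classes may parameterize the spread differently. The class name `DiagonalGaussian` is hypothetical:

```python
import torch


class DiagonalGaussian:
    # Hypothetical minimal version; not the diffusers or Cosmos3 implementation.
    def __init__(self, moments: torch.Tensor):
        # moments: (batch, 2 * latent_channels, length); first half mean, second half log-variance.
        self.mean, self.logvar = moments.chunk(2, dim=1)
        self.std = torch.exp(0.5 * self.logvar)

    def sample(self, generator=None) -> torch.Tensor:
        # Reparameterized sample: mean + std * standard normal noise.
        noise = torch.randn(self.mean.shape, generator=generator, dtype=self.mean.dtype)
        return self.mean + self.std * noise

    def mode(self) -> torch.Tensor:
        return self.mean

    def kl(self) -> torch.Tensor:
        # KL(N(mean, std^2) || N(0, 1)), summed over every non-batch dimension.
        return 0.5 * (self.mean.pow(2) + self.logvar.exp() - 1.0 - self.logvar).sum(dim=(1, 2))


posterior = DiagonalGaussian(torch.zeros(2, 8, 16))
latents = posterior.sample(generator=torch.Generator().manual_seed(0))
```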

dg845 · 2026-06-11T02:37:02Z

+        encoder_dtype = next(self.encoder.parameters()).dtype if self.encoder is not None else hidden_states.dtype
+        moments = self._encode(hidden_states.to(dtype=encoder_dtype))


Suggested change

-        encoder_dtype = next(self.encoder.parameters()).dtype if self.encoder is not None else hidden_states.dtype
+        encoder_dtype = get_parameter_dtype(self.encoder) if self.encoder is not None else hidden_states.dtype
         moments = self._encode(hidden_states.to(dtype=encoder_dtype))

Using diffusers.models.modeling_utils.get_parameter_dtype here is more robust to things like layerwise casting, where the storage dtype (for self.encoder's weights) may differ from the compute dtype (which we want hidden_states to be in).
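The storage-versus-compute distinction can be illustrated with a toy example. The helper below is a hypothetical stand-in and is not the diffusers implementation of `get_parameter_dtype`; the `_toy_compute_dtype` attribute is invented purely to simulate a module whose weights are stored in one dtype while computation is meant to run in another:

```python
import torch
from torch import nn


def resolve_compute_dtype(module: nn.Module) -> torch.dtype:
    # Hypothetical helper: prefer an explicitly recorded compute dtype, if any.
    compute_dtype = getattr(module, "_toy_compute_dtype", None)
    if compute_dtype is not None:
        return compute_dtype
    # Otherwise fall back to the dtype of the first floating-point parameter.
    for parameter in module.parameters():
        if parameter.is_floating_point():
            return parameter.dtype
    raise ValueError("module has no floating-point parameters")


encoder = nn.Conv1d(1, 4, kernel_size=3)
# Simulate weights held in a low-precision storage dtype while compute runs in float32.
encoder.to(torch.float16)
encoder._toy_compute_dtype = torch.float32

storage_dtype = next(encoder.parameters()).dtype  # what the original line would read
compute_dtype = resolve_compute_dtype(encoder)    # what the input should be cast to
```

In this toy setup, reading the dtype from the first parameter would cast the input to the storage dtype rather than the dtype the forward pass is meant to use.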

dg845 · 2026-06-11T02:38:24Z

+        if self.encoder is None or not self._encoder_available:
+            raise ValueError(
+                "This Cosmos3 AVAE sound tokenizer was loaded from decoder-only weights and cannot encode audio. "
+                "Re-convert the AVAE checkpoint with encoder weights to use `encode()`."
+            )


I think it might make more sense to move this check into encode, so that we fail earlier.
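A small sketch of the fail-early placement, using a hypothetical `ToyTokenizer` with invented `_preprocess` and `_encode` steps; only the position of the availability check reflects the suggestion:

```python
class ToyTokenizer:
    # Hypothetical stand-in; not the Cosmos3 tokenizer.
    def __init__(self, encoder=None):
        self.encoder = encoder
        self.preprocess_calls = 0

    def _preprocess(self, audio):
        # Counts invocations so the example can show no work happened before the failure.
        self.preprocess_calls += 1
        return list(audio)

    def _encode(self, audio):
        return self.encoder(audio)

    def encode(self, audio):
        # Check first, so a decoder-only model fails before doing any work.
        if self.encoder is None:
            raise ValueError("loaded from decoder-only weights; cannot encode audio")
        return self._encode(self._preprocess(audio))


decoder_only = ToyTokenizer(encoder=None)
try:
    decoder_only.encode([0.0, 0.1])
    failed_early = False
except ValueError:
    # The error was raised before any preprocessing ran.
    failed_early = decoder_only.preprocess_calls == 0
```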

dg845

Thanks for the PR! Left an initial design review :).

Initial version of sound encoder

0ffee41

Signed-off-by: Maciej Bala <mbala@nvidia.com>

github-actions bot added the models, tests, and size/L (PR with diff > 200 LOC) labels on Jun 10, 2026

yiyixuxu reviewed Jun 10, 2026

yiyixuxu requested a review from dg845 June 10, 2026 20:37

dg845 reviewed Jun 11, 2026

MaciejBalaNV wants to merge 1 commit into huggingface:main from MaciejBalaNV:cosmos3_sound_encoder

3 participants
