[feat] Add JoyAI-Echo multi-shot audio-video generation pipeline by sjq66 · Pull Request #13910 · huggingface/diffusers

sjq66 · 2026-06-10T12:15:07Z

What does this PR do?

We are the JoyAI Team (JD.com), and this is the Diffusers implementation for the JoyAI-Echo model.

GitHub Repository: https://github.com/jd-opensource/JoyAI-Echo
HuggingFace Weights: https://huggingface.co/jdopensource/JoyAI-Echo

Model Overview

JoyAI-Echo is a unified framework for long-form audio-visual generation that supports minute-level multi-shot video creation with synchronized audio, strong temporal consistency, and real-time interaction.

Key Features

🎞️ Minute-level multi-shot stories: generate a sequence of coherent shots from a list of prompts
⚡ DMD-distilled few-step inference: ~7.5× faster than the original pipeline
🔊 Joint audio-video generation: one pipeline produces synchronized video and audio
🧠 Paired cross-modal memory bank: conditions each new shot on prior visual identity and voice context for story-level consistency

Implementation Details

New files added:

src/diffusers/models/transformers/transformer_joyai_echo.py — JoyAIEchoTransformer3DModel, extends LTX2VideoTransformer3DModel with memory mask support for multi-shot generation
src/diffusers/pipelines/joyai_echo/pipeline_joyai_echo.py — JoyAIEchoPipeline, multi-shot pipeline with paired audio-video memory bank
src/diffusers/pipelines/joyai_echo/pipeline_joyai_echo_original_checkpoint.py — JoyAIEchoOriginalCheckpointPipeline for loading original checkpoints
src/diffusers/pipelines/joyai_echo/pipeline_output.py — Output dataclasses
docs/source/en/api/pipelines/joyai_echo.md — API documentation
tests/pipelines/joyai_echo/test_joyai_echo.py — Pipeline tests

Usage example:

import torch
from diffusers import JoyAIEchoPipeline
from diffusers.utils import encode_video

pipe = JoyAIEchoPipeline.from_pretrained("jdopensource/JoyAI-Echo", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

output = pipe(
    [
        "A cinematic opening shot of the protagonist entering a quiet train station.",
        "The same protagonist speaks softly while the camera follows through the platform.",
    ],
    height=736,
    width=1280,
    num_frames=241,
    frame_rate=25.0,
)

for i, (frames, audio) in enumerate(zip(output.frames, output.audio)):
    encode_video(frames[0], fps=25, audio=audio[0].float().cpu(), output_path=f"shot_{i:03d}.mp4")

Tests

$ python -m pytest tests/pipelines/joyai_echo/ -v
tests/pipelines/joyai_echo/test_joyai_echo.py::JoyAIEchoPipelineFastTests::test_multishot_memory_prefix PASSED [100%]
======================== 1 passed in 6.54s =========================

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline?
Did you read our philosophy doc (important for complex PRs)?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case. → [New Pipeline/Model] Add JoyAI-Echo multi-shot audio-video generation pipeline #13909
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Pipelines and pipeline callbacks: @yiyixuxu and @asomoza
Models: @dg845 and @a-r-r-o-w

Add JoyAI-Echo, a unified framework for long-form audio-visual generation that supports minute-level multi-shot video creation with synchronized audio. New files: - JoyAIEchoTransformer3DModel: audiovisual transformer with memory mask support - JoyAIEchoPipeline: multi-shot pipeline with paired audio-video memory bank - JoyAIEchoOriginalCheckpointPipeline: loading from original checkpoints - Pipeline tests and API documentation Key features: - Minute-level multi-shot stories from prompt sequences - DMD-distilled few-step inference (~7.5x faster) - Joint audio-video generation in one pipeline - Paired cross-modal memory bank for story-level consistency Reference: https://github.com/jd-opensource/JoyAI-Echo Weights: https://huggingface.co/jdopensource/JoyAI-Echo

yiyixuxu · 2026-06-10T21:31:44Z

@askserge can you do an initial review here?

yiyixuxu · 2026-06-10T21:34:48Z

@claude can you do a review here?

github-actions · 2026-06-10T21:35:08Z

Claude finished @yiyixuxu's task in 3m 41s

Review of PR #13910: [feat] Add JoyAI-Echo multi-shot audio-video generation pipeline

Read repository review guidelines (AGENTS.md, models.md, pipelines.md, modular.md)
Read transformer model code
Read pipeline code
Read original checkpoint pipeline code
Read pipeline output and init files
Read test code
Trace call paths and check for dead code
Post comprehensive review

Blocking Issues

1. `JoyAIEchoTransformer3DModel` is a near-complete copy of its parent — and never exercised

transformer_joyai_echo.py subclasses LTX2VideoTransformer3DModel and overrides the entire forward() method (~200 lines) to add three mask parameters (audio_self_attention_mask, a2v_cross_attention_mask, v2a_cross_attention_mask). Two problems:

The pipeline never passes any of these masks. In _denoise_shot (line 380–405), the transformer is called without audio_self_attention_mask, a2v_cross_attention_mask, or v2a_cross_attention_mask; they all default to None. The subclass is dead code.
Fragile duplication. The entire parent forward method is copy-pasted. Any future change to LTX2VideoTransformer3DModel.forward() will silently diverge. The better approach is to add these three parameters to the parent's forward() signature (the transformer blocks already accept them — the parent just hardcodes None), removing the need for a subclass entirely.

Per AGENTS.md: "No defensive code, unused code paths, or legacy stubs."

2. `JoyAIEchoOriginalCheckpointPipeline` should not be merged

This pipeline (pipeline_joyai_echo_original_checkpoint.py):

Manipulates sys.path to import from an external repo checkout (ltx-core/src, ltx-pipelines/src, ltx-distillation/src) — this breaks when the external repo isn't present
Imports non-diffusers packages (ltx_distillation.*, torchaudio) without is_X_available() guards
Writes files to disk in __call__ (JSON metadata, MP4, WAV) — diffusers pipelines return outputs, they don't write files
Hardcodes developer paths and imports from the reference repo — per review rules, this is a parity testing harness, not a pipeline

This file belongs in scripts/ or the upstream repository, not in src/diffusers/.

3. Pipeline bypasses the scheduler

JoyAIEchoPipeline.__call__ implements its own denoising loop with explicit sigma scheduling (line 353–432), never calling self.scheduler.set_timesteps() or self.scheduler.step(). Per pipelines.md gotcha #3:

"Don't do this — bypassing the scheduler entirely and rolling your own step"

The scheduler is listed as _optional_components but is never used — it's dead code. The flow matching Euler step logic in _denoise_shot (velocity-to-x0 conversion at line 410–411, noise addition at line 418–419) duplicates what FlowMatchEulerDiscreteScheduler.step() already does. Either integrate with the scheduler or remove it entirely.
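To make the overlap concrete, the arithmetic can be checked with plain floats. This is a minimal sketch, assuming the common flow-matching convention x_sigma = (1 - sigma) * x0 + sigma * noise with velocity v = noise - x0. The function names are illustrative and are not taken from the PR or from diffusers.

```python
# Illustrative only: scalar flow-matching arithmetic, assuming
#   x_sigma = (1 - sigma) * x0 + sigma * noise   and   v = noise - x0.
# Function names are hypothetical, not the PR's or the library's.


def velocity_to_x0(x_sigma, velocity, sigma):
    """Recover the clean-sample estimate from a velocity prediction."""
    return x_sigma - sigma * velocity


def add_flow_noise(x0, noise, sigma):
    """Re-noise a clean-sample estimate to noise level sigma."""
    return (1.0 - sigma) * x0 + sigma * noise


def euler_step(x_sigma, velocity, sigma, next_sigma):
    """Deterministic Euler update along the predicted velocity."""
    return x_sigma + (next_sigma - sigma) * velocity


x0, noise = 0.25, -1.5
sigma, next_sigma = 0.9, 0.5

x_sigma = add_flow_noise(x0, noise, sigma)
velocity = noise - x0  # a perfect prediction, for the sake of the check

recovered_x0 = velocity_to_x0(x_sigma, velocity, sigma)
renoised = add_flow_noise(recovered_x0, noise, next_sigma)
stepped = euler_step(x_sigma, velocity, sigma, next_sigma)
```

When the same noise sample is reused, re-noising the x0 estimate lands on the same point as the Euler update. Drawing fresh noise at every step makes the update stochastic, so a scheduler-based version would need a scheduler configuration that supports that behavior.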

4. `_build_video_memory_attention_mask` is defined but never called

pipeline_joyai_echo.py:206 defines _build_video_memory_attention_mask, but it's never invoked anywhere. In _denoise_shot, video_attention_mask is initialized as None and stays None through the entire loop. If memory masks are needed for correctness, they should be built and passed; if not, the method should be removed.

5. Missing `check_inputs`

The pipeline's __call__ performs no input validation: no height/width divisibility check, no prompt type validation, and no validation of the denoising_sigmas shape beyond a 1D check. Other diffusers pipelines (including the parent LTX2Pipeline) validate inputs thoroughly. At minimum, validate height/width and prompt type.
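A minimal sketch of what such validation could look like follows. The helper is hypothetical: the name, the return value, and in particular the divisor of 32 are assumptions for illustration and are not confirmed by the PR.

```python
# Hypothetical validation helper. The divisor of 32 is an assumption for
# illustration, not a value confirmed by the PR.


def check_inputs(prompt, height, width, spatial_divisor=32):
    """Return the prompts as a list, raising ValueError on invalid inputs."""
    if isinstance(prompt, str):
        prompts = [prompt]
    elif isinstance(prompt, (list, tuple)) and all(isinstance(p, str) for p in prompt):
        prompts = list(prompt)
    else:
        raise ValueError(
            f"`prompt` must be a str or a list of str, got {type(prompt).__name__}."
        )

    if not prompts:
        raise ValueError("`prompt` must contain at least one shot description.")

    if height % spatial_divisor != 0 or width % spatial_divisor != 0:
        raise ValueError(
            f"`height` and `width` must be divisible by {spatial_divisor}, "
            f"got {height} and {width}."
        )
    return prompts
```

Failing early with a ValueError gives callers a readable message instead of a shape error deep inside the denoising loop.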

Non-blocking Issues

6. Mutable default argument

# pipeline_joyai_echo.py:457
callback_on_step_end_tensor_inputs: list[str] = ["latents", "audio_latents"]

Mutable default arguments are a well-known Python antipattern. Use None and set the default inside the method body.
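The hazard is easy to reproduce outside the pipeline. The sketch below uses two hypothetical helper functions (not code from the PR) to show that a list default is created once at function definition time and then shared across calls.

```python
# Hypothetical helpers demonstrating the shared-default hazard.


def collect_shared(name, names=[]):
    # The default list is created once, when the function is defined.
    names.append(name)
    return names


def collect_fresh(name, names=None):
    # A new list is created on every call that omits `names`.
    if names is None:
        names = []
    names.append(name)
    return names


first = collect_shared("latents")
second = collect_shared("audio_latents")
# `first` and `second` are the same list object and hold both names.

fresh_first = collect_fresh("latents")
fresh_second = collect_fresh("audio_latents")
# `fresh_first` and `fresh_second` are independent one-element lists.
```

A function that only reads its default list never triggers the problem, but the None pattern removes the risk if the body is later changed to mutate it.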

7. `batch_size = 1` is hardcoded

In __call__ (line 497), batch_size is always 1, meaning the pipeline doesn't support batch generation. This should be documented in the docstring, or batch support should be added.

8. Custom `_get_execution_device` is unnecessary

The override at line 253–266 duplicates what DiffusionPipeline._execution_device already provides. The try/except is odd — _execution_device is a property that always exists. Remove this override and use self._execution_device directly.

9. `JoyAIEchoMemoryBank` and `JoyAIEchoMemorySlot` export inconsistency

These classes are exported from pipelines/joyai_echo/__init__.py but NOT from diffusers/pipelines/__init__.py or diffusers/__init__.py. Either export them consistently or keep them private (they're internal implementation details).

10. Missing `# Copied from` annotations

_decode_latents borrows heavily from LTX2Pipeline decode logic but has no # Copied from annotation. Per pipelines.md gotcha #5, this means future refactors to the source will diverge silently.

11. Test coverage is minimal

Only one test (test_multishot_memory_prefix) exists, checking token counts and output shapes. No single-shot test, no error case tests, no test for output_type="latent". The test uses fake components that bypass actual computation — while this is fine for a fast test, there should be more coverage.

Suggestions / Additional Info: Dead Code Trace

I traced the call path from JoyAIEchoPipeline.__call__ → _denoise_shot → self.transformer(...) and identified the following likely dead code:

| Code | Status | Reason |
|---|---|---|
| `JoyAIEchoTransformer3DModel.forward()` mask params | Dead | Pipeline never passes `audio_self_attention_mask`, `a2v_cross_attention_mask`, `v2a_cross_attention_mask` |
| `_build_video_memory_attention_mask` | Dead | Defined but never called anywhere |
| `self.scheduler` | Dead | Listed as optional component, never used in `__call__` |
| `_add_flow_noise` | Used | Called in `_denoise_shot` |
| `_repeat_token_timestep` | Used | Called in `_denoise_shot` |
| `_select_memory_video_tokens` | Used | Called in `__call__` |
| `_select_memory_audio_tokens` | Used | Called in `__call__` |
| `JoyAIEchoOriginalCheckpointPipeline` | Should remove | Parity testing harness, not a diffusers pipeline |

Note: The transformer config could in theory differ from defaults, but under the default call path from this pipeline, none of the memory mask parameters are used.

Summary

The core idea of a multi-shot memory-conditioned audio-video pipeline is sound and aligns well with the LTX2 ecosystem. However, several structural issues need attention before merge:

The transformer subclass duplicates the parent's forward and is never exercised — either wire up the masks or remove the subclass
The original checkpoint pipeline is a parity harness, not a diffusers pipeline — remove it
The scheduler is bypassed — integrate with the scheduler or remove it
Add input validation (check_inputs)
Remove dead code (_build_video_memory_attention_mask)

sergereview

🤗 Serge says:

Initial Review — JoyAI-Echo Pipeline

This PR has several significant structural and correctness issues that need to be addressed before it can be merged.

Architecture: Pipeline subclassing violates repo conventions

JoyAIEchoPipeline subclasses LTX2Pipeline — this directly violates the convention in pipelines.md gotcha #4: "Don't use an existing pipeline class to override another inside the core src/ codebase. Each pipeline lives in its own file with its own class." The pipeline should be a standalone class that reuses shared utilities (prep functions, etc.) but not the pipeline class itself.

Correctness: Scheduler is bypassed entirely

_denoise_shot rolls its own Euler step via _add_flow_noise and never calls self.scheduler.step(). This is exactly what pipelines.md gotcha #3 warns against: "Reimplementing what the scheduler already does." The scheduler is registered as an optional component but never used — the denoising loop should use the scheduler.

Correctness: Memory attention masks are never built or passed

The entire justification for JoyAIEchoTransformer3DModel (subclassing LTX2VideoTransformer3DModel) is to pass through audio_self_attention_mask, a2v_cross_attention_mask, and v2a_cross_attention_mask. However:

The pipeline never passes these three mask arguments to the transformer.
_build_video_memory_attention_mask is defined but never called — it's dead code.
video_attention_mask in _denoise_shot is always None (set on line 361, never reassigned).

This means the transformer subclass adds no value over the parent, and the memory bank only contributes latent tokens (via concatenation) without any structural attention masking. Either the masks need to be built and passed, or the subclass should be removed.

Correctness: Transformer subclass is a near-verbatim copy of the parent

The 287-line JoyAIEchoTransformer3DModel.forward() is a copy-paste of LTX2VideoTransformer3DModel.forward() with the only difference being that it passes through the mask parameters instead of hardcoding None. Per AGENTS.md: "No defensive code, unused code paths". If the masks are needed, a cleaner approach would be to make the parent class accept them (it already does at the block level). If they're not needed, delete the subclass.

`JoyAIEchoOriginalCheckpointPipeline` should not be shipped

This class manipulates sys.path to import from the original research repo (ltx-core/src, ltx-pipelines/src, ltx-distillation/src), has hardcoded developer paths, and is essentially a parity harness. Per review rules, parity harnesses and comparison scripts with imports from the reference repo are ephemeral context that shouldn't be in the final PR.

Dead code and unnecessary overrides

_get_execution_device() is redundant — _execution_device is a @property on DiffusionPipeline and will never raise AttributeError. The entire method body after return self._execution_device is dead code.
_build_video_memory_attention_mask is never called.
_as_prompt_list is a trivial one-liner that could be inlined.

Tests

Only one test (test_multishot_memory_prefix) with dummy components that bypass all real logic. The test doesn't exercise the actual JoyAIEchoTransformer3DModel at all.
tests/pipelines/joyai_echo/__init__.py has imports and __all__ — test __init__.py files in this repo are empty.
No check_inputs validation method on the pipeline.

Missing model class attributes

While JoyAIEchoTransformer3DModel inherits _no_split_modules, _repeated_blocks, etc. from the parent, it should be verified that these are correct for the subclass (e.g., _repeated_blocks references LTX2VideoTransformerBlock which is fine since the subclass doesn't add new block types).

Suggestions / Dead Code Analysis

Under the default config and the pipeline's call path:

_build_video_memory_attention_mask is unreachable (never called).
The audio_self_attention_mask, a2v_cross_attention_mask, v2a_cross_attention_mask parameters in the transformer's forward are always None because the pipeline never passes them.
The video_attention_mask variable in _denoise_shot is always None.

model: claude-opus-4-6 · 35 LLM turns · 42 tool calls · 226.0s · 2117185 in / 8068 out tokens

sergereview · 2026-06-10T21:35:50Z

+    return audio_latents[:, start:end].contiguous(), audio_coords[:, :, start:end].contiguous()
+
+
+class JoyAIEchoPipeline(LTX2Pipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):


Subclassing LTX2Pipeline violates the repo convention in pipelines.md gotcha #4: "Don't use an existing pipeline class to override another inside the core src/ codebase. Each pipeline lives in its own file with its own class, even if it shares 90% of __call__ with a sibling."

This should be a standalone pipeline class inheriting from DiffusionPipeline (+ mixins), reusing shared utilities but not the pipeline class itself.

sergereview · 2026-06-10T21:35:50Z

+                    audio_noise = randn_tensor(
+                        audio_latents.shape, generator=generator, device=device, dtype=audio_latents.dtype
+                    )
+                    latents = self._add_flow_noise(pred_video, video_noise, next_sigma).to(dtype=dtype)


The denoising loop bypasses the scheduler entirely and rolls its own Euler step via _add_flow_noise. This is exactly what pipelines.md gotcha #3 warns against:

"don't do this — bypassing the scheduler entirely and rolling your own step"

The scheduler should own the step logic. If the DMD sigma schedule requires a specific scheduler configuration, configure the scheduler accordingly rather than reimplementing the step.

sergereview · 2026-06-10T21:35:50Z

+                audio_model_input = audio_latents
+                video_model_coords = video_coords
+                audio_model_coords = audio_coords
+                video_attention_mask = None


video_attention_mask is set to None here and never reassigned, even when memory tokens are prepended. This means video_self_attention_mask is always None in the transformer call (line 403). The _build_video_memory_attention_mask method (line 206) is defined but never called — it's dead code.

Without attention masks, memory tokens attend freely to target tokens and vice versa with no structural constraint. Is this intentional? If so, the _build_video_memory_attention_mask method and the JoyAIEchoTransformer3DModel subclass (which exists solely to pass through mask parameters) should be removed.
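For illustration, one possible structural constraint is sketched below. The PR's intended mask semantics are not shown in this thread, so the rule used here (memory tokens attend only to other memory tokens, target tokens attend to everything) and the function name are assumptions, not the author's design.

```python
# Hypothetical sketch of a memory-prefix attention mask. The masking rule is an
# assumption chosen for illustration; it is not taken from the PR.


def build_memory_prefix_mask(num_memory, num_target):
    """Return a [query][key] boolean mask; True means attention is allowed."""
    total = num_memory + num_target
    mask = []
    for query in range(total):
        query_is_memory = query < num_memory
        row = []
        for key in range(total):
            key_is_memory = key < num_memory
            # Memory queries see only memory keys; target queries see all keys.
            row.append(key_is_memory or not query_is_memory)
        mask.append(row)
    return mask


mask = build_memory_prefix_mask(num_memory=2, num_target=3)
```

Under this rule the memory rows are True only in the first num_memory columns, while every target row is entirely True. With no mask at all, every entry is effectively True, which is the behavior the comment above describes.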

sergereview · 2026-06-10T21:35:50Z

+logger = logging.get_logger(__name__)
+
+
+class JoyAIEchoTransformer3DModel(LTX2VideoTransformer3DModel):


This 287-line forward is a near-verbatim copy of LTX2VideoTransformer3DModel.forward(). The only difference is passing through audio_self_attention_mask, a2v_cross_attention_mask, and v2a_cross_attention_mask instead of hardcoding None. But the pipeline never passes these masks — they're always None.

Since the block-level forward already accepts these parameters, a cleaner approach (if the masks are actually needed) would be to modify the parent's forward to accept and pass them through, rather than duplicating the entire method. If the masks aren't needed, this subclass should be deleted entirely and the pipeline should use LTX2VideoTransformer3DModel directly.

sergereview · 2026-06-10T21:35:50Z

+from ..pipeline_utils import DiffusionPipeline
+
+
+class JoyAIEchoOriginalCheckpointPipeline(DiffusionPipeline):


This class manipulates sys.path to import from the original research repo (ltx-core/src, ltx-pipelines/src, ltx-distillation/src), has hardcoded developer paths, and is essentially a parity harness. Per the review rules, parity harnesses and comparison scripts with imports from the reference repo are ephemeral context and should not be shipped in the final PR. This file should be removed.

sergereview · 2026-06-10T21:35:50Z

+
+        return self.connectors(prompt_embeds, prompt_attention_mask, padding_side=tokenizer_padding_side)
+
+    def _get_execution_device(self) -> torch.device:


This method is unnecessary. _execution_device is a @property on DiffusionPipeline — it will never raise AttributeError, so the except branch and the fallback loop are dead code. The entire method reduces to return self._execution_device. Just use self._execution_device directly at the call site (line 496) and delete this method.

sergereview · 2026-06-10T21:35:50Z

+        return sigma
+
+    @staticmethod
+    def _build_video_memory_attention_mask(


Dead code — this method is defined but never called anywhere in the pipeline.

sergereview · 2026-06-10T21:35:50Z

@@ -0,0 +1,4 @@
+from .test_joyai_echo import JoyAIEchoPipelineFastTests


Test __init__.py files in this repo are empty (see tests/pipelines/ltx2/__init__.py and others). Remove the imports and __all__.

Suggested change

from .test_joyai_echo import JoyAIEchoPipelineFastTests

sergereview · 2026-06-10T21:35:50Z

+        transformer_outputs_x0: bool = True,
+        attention_kwargs: dict[str, Any] | None = None,
+        callback_on_step_end: Callable[[Any, int, torch.Tensor, dict], dict] | None = None,
+        callback_on_step_end_tensor_inputs: list[str] = ["latents", "audio_latents"],


Mutable default argument. This should be None with a default set inside the method body:

Suggested change

callback_on_step_end_tensor_inputs: list[str] = ["latents", "audio_latents"],

callback_on_step_end_tensor_inputs: list[str] | None = None,

Then inside the method: callback_on_step_end_tensor_inputs = callback_on_step_end_tensor_inputs or ["latents", "audio_latents"]

sergereview · 2026-06-10T21:35:50Z

+                if callback_on_step_end is not None:
+                    callback_kwargs = {}
+                    for name in callback_on_step_end_tensor_inputs or []:
+                        callback_kwargs[name] = locals()[name]


Using locals() to populate callback kwargs is fragile — if a variable is renamed or shadowed, this silently breaks. Prefer explicit dict construction:

callback_kwargs = {}
for name in callback_on_step_end_tensor_inputs or []:
    if name == "latents":
        callback_kwargs[name] = latents
    elif name == "audio_latents":
        callback_kwargs[name] = audio_latents
    elif name == "prompt_embeds":
        callback_kwargs[name] = prompt_embeds

github-actions Bot added the fixes-issue, documentation, models, tests, pipelines, and size/L labels and removed the fixes-issue label · Jun 10, 2026

sergereview Bot suggested changes Jun 10, 2026
