[feat] Add JoyAI-Echo multi-shot audio-video generation pipeline #13910
sjq66 wants to merge 1 commit into
Conversation
Add JoyAI-Echo, a unified framework for long-form audio-visual generation that supports minute-level multi-shot video creation with synchronized audio.

New files:
- JoyAIEchoTransformer3DModel: audiovisual transformer with memory mask support
- JoyAIEchoPipeline: multi-shot pipeline with paired audio-video memory bank
- JoyAIEchoOriginalCheckpointPipeline: loading from original checkpoints
- Pipeline tests and API documentation

Key features:
- Minute-level multi-shot stories from prompt sequences
- DMD-distilled few-step inference (~7.5x faster)
- Joint audio-video generation in one pipeline
- Paired cross-modal memory bank for story-level consistency

Reference: https://github.com/jd-opensource/JoyAI-Echo
Weights: https://huggingface.co/jdopensource/JoyAI-Echo
@askserge can you do an initial review here?
@claude can you do a review here?
Claude finished @yiyixuxu's task in 3m 41s
Review of PR #13910: [feat] Add JoyAI-Echo multi-shot audio-video generation pipeline
Blocking Issues
1.
| Code | Status | Reason |
|---|---|---|
| `JoyAIEchoTransformer3DModel.forward()` mask params | Dead | Pipeline never passes `audio_self_attention_mask`, `a2v_cross_attention_mask`, `v2a_cross_attention_mask` |
| `_build_video_memory_attention_mask` | Dead | Defined but never called anywhere |
| `self.scheduler` | Dead | Listed as optional component, never used in `__call__` |
| `_add_flow_noise` | Used | Called in `_denoise_shot` |
| `_repeat_token_timestep` | Used | Called in `_denoise_shot` |
| `_select_memory_video_tokens` | Used | Called in `__call__` |
| `_select_memory_audio_tokens` | Used | Called in `__call__` |
| `JoyAIEchoOriginalCheckpointPipeline` | Should remove | Parity testing harness, not a diffusers pipeline |
Note: The transformer config could in theory differ from defaults, but under the default call path from this pipeline, none of the memory mask parameters are used.
Summary
The core idea of a multi-shot memory-conditioned audio-video pipeline is sound and aligns well with the LTX2 ecosystem. However, several structural issues need attention before merge:
- The transformer subclass duplicates the parent's `forward` and is never exercised: either wire up the masks or remove the subclass
- The original checkpoint pipeline is a parity harness, not a diffusers pipeline: remove it
- The scheduler is bypassed: integrate with the scheduler or remove it
- Add input validation (`check_inputs`)
- Remove dead code (`_build_video_memory_attention_mask`)
🤗 Serge says:
Initial Review — JoyAI-Echo Pipeline
This PR has several significant structural and correctness issues that need to be addressed before it can be merged.
Architecture: Pipeline subclassing violates repo conventions
JoyAIEchoPipeline subclasses LTX2Pipeline — this directly violates the convention in pipelines.md gotcha #4: "Don't use an existing pipeline class to override another inside the core src/ codebase. Each pipeline lives in its own file with its own class." The pipeline should be a standalone class that reuses shared utilities (prep functions, etc.) but not the pipeline class itself.
Correctness: Scheduler is bypassed entirely
_denoise_shot rolls its own Euler step via _add_flow_noise and never calls self.scheduler.step(). This is exactly what pipelines.md gotcha #3 warns against: "Reimplementing what the scheduler already does." The scheduler is registered as an optional component but never used — the denoising loop should use the scheduler.
Correctness: Memory attention masks are never built or passed
The entire justification for JoyAIEchoTransformer3DModel (subclassing LTX2VideoTransformer3DModel) is to pass through audio_self_attention_mask, a2v_cross_attention_mask, and v2a_cross_attention_mask. However:
- The pipeline never passes these three mask arguments to the transformer.
- `_build_video_memory_attention_mask` is defined but never called; it's dead code.
- `video_attention_mask` in `_denoise_shot` is always `None` (set on line 361, never reassigned).
This means the transformer subclass adds no value over the parent, and the memory bank only contributes latent tokens (via concatenation) without any structural attention masking. Either the masks need to be built and passed, or the subclass should be removed.
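To make "structural attention masking" concrete, here is a minimal sketch of a boolean self-attention mask for a sequence of memory tokens followed by target tokens. The function name `build_memory_prefix_mask` and the masking policy (memory tokens see only memory, target tokens see everything) are assumptions for illustration; they are not taken from `_build_video_memory_attention_mask` in this PR.

```python
def build_memory_prefix_mask(num_memory: int, num_target: int) -> list[list[bool]]:
    """Return mask[q][k] == True where query token q may attend to key token k.

    Hypothetical policy (an assumption, not the PR's actual behavior):
      - memory tokens (the first `num_memory` positions) attend only to memory tokens
      - target tokens attend to both memory and target tokens
    """
    total = num_memory + num_target
    mask = []
    for q in range(total):
        q_is_memory = q < num_memory
        row = []
        for k in range(total):
            k_is_memory = k < num_memory
            # Only the memory -> target direction is blocked.
            row.append(k_is_memory or not q_is_memory)
        mask.append(row)
    return mask


mask = build_memory_prefix_mask(num_memory=2, num_target=3)
```

With a policy like this, the memory prefix stays unaffected by the shot being generated, while the new shot can still read from it. Whether JoyAI-Echo needs that constraint is a question for the authors.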
Correctness: Transformer subclass is a near-verbatim copy of the parent
The 287-line JoyAIEchoTransformer3DModel.forward() is a copy-paste of LTX2VideoTransformer3DModel.forward() with the only difference being that it passes through the mask parameters instead of hardcoding None. Per AGENTS.md: "No defensive code, unused code paths". If the masks are needed, a cleaner approach would be to make the parent class accept them (it already does at the block level). If they're not needed, delete the subclass.
JoyAIEchoOriginalCheckpointPipeline should not be shipped
This class manipulates sys.path to import from the original research repo (ltx-core/src, ltx-pipelines/src, ltx-distillation/src), has hardcoded developer paths, and is essentially a parity harness. Per review rules, parity harnesses and comparison scripts with imports from the reference repo are ephemeral context that shouldn't be in the final PR.
Dead code and unnecessary overrides
- `_get_execution_device()` is redundant: `_execution_device` is a `@property` on `DiffusionPipeline` and will never raise `AttributeError`. The entire method body after `return self._execution_device` is dead code.
- `_build_video_memory_attention_mask` is never called.
- `_as_prompt_list` is a trivial one-liner that could be inlined.
Tests
- Only one test (`test_multishot_memory_prefix`) with dummy components that bypass all real logic. The test doesn't exercise the actual `JoyAIEchoTransformer3DModel` at all.
- `tests/pipelines/joyai_echo/__init__.py` has imports and `__all__`; test `__init__.py` files in this repo are empty.
- No `check_inputs` validation method on the pipeline.
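For the missing validation, something along these lines might be a starting point. This is a standalone sketch, not the method the pipeline should ship: the parameter names, the divisible-by-32 rule, and the allowed callback names are assumptions that would need to be checked against the actual pipeline signature and VAE compression factors.

```python
def check_inputs(
    prompts,
    height,
    width,
    callback_on_step_end_tensor_inputs=None,
    allowed_callback_inputs=("latents", "audio_latents"),
):
    """Hypothetical input validation for a multi-shot pipeline (rules are assumptions)."""
    if isinstance(prompts, str):
        prompts = [prompts]
    if not isinstance(prompts, (list, tuple)) or len(prompts) == 0:
        raise ValueError("`prompts` must be a non-empty string or list of strings.")
    for i, prompt in enumerate(prompts):
        if not isinstance(prompt, str) or not prompt.strip():
            raise ValueError(f"Prompt for shot {i} must be a non-empty string.")

    # Assumed spatial constraint; the real factor depends on the VAE and patch size.
    if height % 32 != 0 or width % 32 != 0:
        raise ValueError(f"`height` and `width` must be divisible by 32, got {height} and {width}.")

    if callback_on_step_end_tensor_inputs is not None:
        unknown = [k for k in callback_on_step_end_tensor_inputs if k not in allowed_callback_inputs]
        if unknown:
            raise ValueError(f"Unsupported callback tensor inputs: {unknown}")

    # Return the normalized prompt list so the caller does not repeat the str -> list step.
    return list(prompts)


shots = check_inputs(["A dog barks in a park.", "The dog runs to its owner."], height=512, width=768)
```

Failing early here gives a clear error per shot instead of a shape mismatch deep inside the denoising loop.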
Missing model class attributes
While JoyAIEchoTransformer3DModel inherits _no_split_modules, _repeated_blocks, etc. from the parent, it should be verified that these are correct for the subclass (e.g., _repeated_blocks references LTX2VideoTransformerBlock which is fine since the subclass doesn't add new block types).
Suggestions / Dead Code Analysis
Under the default config and the pipeline's call path:
- `_build_video_memory_attention_mask` is unreachable (never called).
- The `audio_self_attention_mask`, `a2v_cross_attention_mask`, `v2a_cross_attention_mask` parameters in the transformer's `forward` are always `None` because the pipeline never passes them.
- The `video_attention_mask` variable in `_denoise_shot` is always `None`.
model: claude-opus-4-6 · 35 LLM turns · 42 tool calls · 226.0s · 2117185 in / 8068 out tokens
        return audio_latents[:, start:end].contiguous(), audio_coords[:, :, start:end].contiguous()


    class JoyAIEchoPipeline(LTX2Pipeline, FromSingleFileMixin, LTX2LoraLoaderMixin):
Subclassing LTX2Pipeline violates the repo convention in pipelines.md gotcha #4: "Don't use an existing pipeline class to override another inside the core src/ codebase. Each pipeline lives in its own file with its own class, even if it shares 90% of __call__ with a sibling."
This should be a standalone pipeline class inheriting from DiffusionPipeline (+ mixins), reusing shared utilities but not the pipeline class itself.
    audio_noise = randn_tensor(
        audio_latents.shape, generator=generator, device=device, dtype=audio_latents.dtype
    )
    latents = self._add_flow_noise(pred_video, video_noise, next_sigma).to(dtype=dtype)
The denoising loop bypasses the scheduler entirely and rolls its own Euler step via _add_flow_noise. This is exactly what pipelines.md gotcha #3 warns against:
"don't do this — bypassing the scheduler entirely and rolling your own step"
The scheduler should own the step logic. If the DMD sigma schedule requires a specific scheduler configuration, configure the scheduler accordingly rather than reimplementing the step.
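As a toy illustration of "the scheduler owns the step", the sketch below moves the re-noising math into a scheduler-like object so the loop only delegates. Everything here is hypothetical: `ToyRenoiseScheduler` is not a diffusers class, the sigma values are made up, and it assumes `_add_flow_noise` computes the linear interpolation `(1 - sigma) * x0 + sigma * noise`, which the quoted diff does not confirm.

```python
class ToyRenoiseScheduler:
    """Hypothetical stand-in for a scheduler; not part of diffusers."""

    def __init__(self, sigmas):
        # Sigmas are expected to decrease and end at 0.0 (fully denoised).
        self.sigmas = list(sigmas)

    def step(self, pred_x0, noise, step_index):
        """Move from sigmas[step_index] to sigmas[step_index + 1].

        Assumes the model predicts the clean sample x0 and that re-noising is
        the linear interpolation (1 - sigma) * x0 + sigma * noise.
        """
        next_sigma = self.sigmas[step_index + 1]
        return [(1.0 - next_sigma) * x + next_sigma * n for x, n in zip(pred_x0, noise)]


def denoise(scheduler, predict_x0, latents, noises):
    # The loop holds no step math; it only asks the scheduler for the next latents.
    for i in range(len(scheduler.sigmas) - 1):
        pred_x0 = predict_x0(latents, scheduler.sigmas[i])
        latents = scheduler.step(pred_x0, noises[i], i)
    return latents


scheduler = ToyRenoiseScheduler(sigmas=[1.0, 0.5, 0.0])
result = denoise(
    scheduler,
    predict_x0=lambda latents, sigma: [2.0, 4.0],  # dummy model with a constant prediction
    latents=[0.0, 0.0],
    noises=[[1.0, 1.0], [1.0, 1.0]],
)
```

In the real pipeline the equivalent move would be to express the DMD sigma schedule through the registered scheduler's configuration and call `self.scheduler.step()`, so that swapping schedulers keeps working.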
    audio_model_input = audio_latents
    video_model_coords = video_coords
    audio_model_coords = audio_coords
    video_attention_mask = None
video_attention_mask is set to None here and never reassigned, even when memory tokens are prepended. This means video_self_attention_mask is always None in the transformer call (line 403). The _build_video_memory_attention_mask method (line 206) is defined but never called — it's dead code.
Without attention masks, memory tokens attend freely to target tokens and vice versa with no structural constraint. Is this intentional? If so, the _build_video_memory_attention_mask method and the JoyAIEchoTransformer3DModel subclass (which exists solely to pass through mask parameters) should be removed.
    logger = logging.get_logger(__name__)


    class JoyAIEchoTransformer3DModel(LTX2VideoTransformer3DModel):
This 287-line forward is a near-verbatim copy of LTX2VideoTransformer3DModel.forward(). The only difference is passing through audio_self_attention_mask, a2v_cross_attention_mask, and v2a_cross_attention_mask instead of hardcoding None. But the pipeline never passes these masks — they're always None.
Since the block-level forward already accepts these parameters, a cleaner approach (if the masks are actually needed) would be to modify the parent's forward to accept and pass them through, rather than duplicating the entire method. If the masks aren't needed, this subclass should be deleted entirely and the pipeline should use LTX2VideoTransformer3DModel directly.
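The pass-through idea can be sketched with stand-in classes. `ToyBlock` and `ToyParentModel` are hypothetical and only mirror the shape of the suggestion: the parent `forward` gains optional mask arguments that default to `None`, so existing callers are unaffected and no subclass is needed.

```python
class ToyBlock:
    """Stand-in for a transformer block that already accepts attention masks."""

    def forward(self, hidden, audio_self_attention_mask=None, a2v_cross_attention_mask=None):
        # Report which masks were supplied so the pass-through is observable.
        supplied = (
            ("audio_self", audio_self_attention_mask),
            ("a2v_cross", a2v_cross_attention_mask),
        )
        used = [name for name, mask in supplied if mask is not None]
        return hidden, used


class ToyParentModel:
    """Stand-in for the parent model; masks are optional and forwarded unchanged."""

    def __init__(self, num_blocks=2):
        self.blocks = [ToyBlock() for _ in range(num_blocks)]

    def forward(self, hidden, audio_self_attention_mask=None, a2v_cross_attention_mask=None):
        used = []
        for block in self.blocks:
            hidden, used = block.forward(
                hidden,
                audio_self_attention_mask=audio_self_attention_mask,
                a2v_cross_attention_mask=a2v_cross_attention_mask,
            )
        return hidden, used


model = ToyParentModel()
without_masks = model.forward("hidden")
with_mask = model.forward("hidden", audio_self_attention_mask=[[True]])
```

The default-`None` arguments keep the existing call sites valid, which is why this is less invasive than copying the whole `forward` into a subclass.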
    from ..pipeline_utils import DiffusionPipeline


    class JoyAIEchoOriginalCheckpointPipeline(DiffusionPipeline):
This class manipulates sys.path to import from the original research repo (ltx-core/src, ltx-pipelines/src, ltx-distillation/src), has hardcoded developer paths, and is essentially a parity harness. Per the review rules, parity harnesses and comparison scripts with imports from the reference repo are ephemeral context and should not be shipped in the final PR. This file should be removed.
        return self.connectors(prompt_embeds, prompt_attention_mask, padding_side=tokenizer_padding_side)

    def _get_execution_device(self) -> torch.device:
This method is unnecessary. _execution_device is a @property on DiffusionPipeline — it will never raise AttributeError, so the except branch and the fallback loop are dead code. The entire method reduces to return self._execution_device. Just use self._execution_device directly at the call site (line 496) and delete this method.
        return sigma

    @staticmethod
    def _build_video_memory_attention_mask(
Dead code — this method is defined but never called anywhere in the pipeline.
    @@ -0,0 +1,4 @@
    from .test_joyai_echo import JoyAIEchoPipelineFastTests
Test __init__.py files in this repo are empty (see tests/pipelines/ltx2/__init__.py and others). Remove the imports and __all__.
    -from .test_joyai_echo import JoyAIEchoPipelineFastTests
    transformer_outputs_x0: bool = True,
    attention_kwargs: dict[str, Any] | None = None,
    callback_on_step_end: Callable[[Any, int, torch.Tensor, dict], dict] | None = None,
    callback_on_step_end_tensor_inputs: list[str] = ["latents", "audio_latents"],
Mutable default argument. This should be None with a default set inside the method body:
    -callback_on_step_end_tensor_inputs: list[str] = ["latents", "audio_latents"],
    +callback_on_step_end_tensor_inputs: list[str] | None = None,
Then inside the method: callback_on_step_end_tensor_inputs = callback_on_step_end_tensor_inputs or ["latents", "audio_latents"]
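For context on why the mutable default matters, here is a small standalone demonstration. The function names are made up for the example; the behavior shown is standard Python, where a default value is evaluated once at function definition time and then shared across calls.

```python
def call_with_mutable_default(extra, tensor_inputs=["latents", "audio_latents"]):
    # The default list is created once, when the function is defined,
    # so this append leaks into every later call that omits the argument.
    tensor_inputs.append(extra)
    return tensor_inputs


def call_with_none_default(extra, tensor_inputs=None):
    # A fresh list is created on every call that omits the argument.
    if tensor_inputs is None:
        tensor_inputs = ["latents", "audio_latents"]
    tensor_inputs.append(extra)
    return tensor_inputs


first = call_with_mutable_default("prompt_embeds")
second = call_with_mutable_default("prompt_embeds")  # same list object as `first`

safe_first = call_with_none_default("prompt_embeds")
safe_second = call_with_none_default("prompt_embeds")  # independent list
```

One design note: the `x or default` form also replaces an explicitly passed empty list with the default, whereas the `is None` check keeps an empty list as "no tensor inputs". Either is workable as long as the choice is deliberate.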
    if callback_on_step_end is not None:
        callback_kwargs = {}
        for name in callback_on_step_end_tensor_inputs or []:
            callback_kwargs[name] = locals()[name]
Using locals() to populate callback kwargs is fragile — if a variable is renamed or shadowed, this silently breaks. Prefer explicit dict construction:
    callback_kwargs = {}
    for name in callback_on_step_end_tensor_inputs or []:
        if name == "latents":
            callback_kwargs[name] = latents
        elif name == "audio_latents":
            callback_kwargs[name] = audio_latents
        elif name == "prompt_embeds":
            callback_kwargs[name] = prompt_embeds
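A more compact variant of the same idea is to build one explicit mapping and select from it, which also surfaces unknown names instead of silently skipping them. `build_callback_kwargs` is a hypothetical helper written for this comment, and the three tensor names are simply the ones used in the snippet above.

```python
def build_callback_kwargs(requested, latents, audio_latents, prompt_embeds):
    """Select callback tensors from an explicit mapping instead of locals()."""
    # Every exposed tensor is listed here, so a rename shows up as a NameError
    # at this line rather than as a missing key at callback time.
    available = {
        "latents": latents,
        "audio_latents": audio_latents,
        "prompt_embeds": prompt_embeds,
    }
    unknown = [name for name in requested if name not in available]
    if unknown:
        raise ValueError(f"Unsupported callback tensor inputs: {unknown}")
    return {name: available[name] for name in requested}


# Plain strings stand in for tensors to keep the example dependency-free.
callback_kwargs = build_callback_kwargs(
    ["latents", "audio_latents"],
    latents="video-latents",
    audio_latents="audio-latents",
    prompt_embeds="embeds",
)
```

The explicit mapping doubles as documentation of which tensors a callback is allowed to request.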
What does this PR do?
We are the JoyAI Team (JD.com), and this is the Diffusers implementation for the JoyAI-Echo model.
Fixes #13909
Model Overview
JoyAI-Echo is a unified framework for long-form audio-visual generation that supports minute-level multi-shot video creation with synchronized audio, strong temporal consistency, and real-time interaction.
Key Features
Implementation Details
New files added:
- `src/diffusers/models/transformers/transformer_joyai_echo.py`: `JoyAIEchoTransformer3DModel`, extends LTX2VideoTransformer3DModel with memory mask support for multi-shot generation
- `src/diffusers/pipelines/joyai_echo/pipeline_joyai_echo.py`: `JoyAIEchoPipeline`, multi-shot pipeline with paired audio-video memory bank
- `src/diffusers/pipelines/joyai_echo/pipeline_joyai_echo_original_checkpoint.py`: `JoyAIEchoOriginalCheckpointPipeline` for loading original checkpoints
- `src/diffusers/pipelines/joyai_echo/pipeline_output.py`: Output dataclasses
- `docs/source/en/api/pipelines/joyai_echo.md`: API documentation
- `tests/pipelines/joyai_echo/test_joyai_echo.py`: Pipeline tests

Usage example:
Tests
Before submitting
Who can review?