
Conversation

@dg845 (Collaborator) commented Jan 30, 2026

What does this PR do?

This PR improves the diffusers.pipelines.ltx2.export_utils.encode_video function for exporting videos with audio by allowing it to accept input types besides torch.Tensor, such as np.ndarray and List[PIL.Image.Image]. This makes the function easier to use, since users will no longer have to do special processing such as

# Pipeline `output_type="np"`
video = (video * 255).round().astype("uint8")
video = torch.from_numpy(video)

before calling encode_video; the PR will handle this logic inside encode_video. (Note that using the above processing will still work after the change, but will no longer be necessary.)
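
For example, with an output_type="np" video and its accompanying audio waveform, the intended usage after this PR is roughly the following. This is a minimal sketch: only output_path, sampling_rate, and video_chunks_number are confirmed by the new docstring, so the positional argument order and the audio argument are assumptions, and the arrays below are stand-ins for real pipeline outputs.

import numpy as np

from diffusers.pipelines.ltx2.export_utils import encode_video

# Stand-ins for pipeline outputs: frames from output_type="np" (floats in [0, 1])
# and a 5-second audio waveform at 24 kHz. Shapes are illustrative only.
video = np.random.rand(121, 512, 768, 3).astype(np.float32)
audio = np.random.randn(24000 * 5).astype(np.float32)

# No manual uint8 conversion or torch.from_numpy call is needed anymore.
encode_video(video, audio, sampling_rate=24000, output_path="output.mp4")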

Changelist

  1. Support np.ndarray and List[PIL.Image.Image] video inputs for encode_video, which are assumed to be outputs from pipelines using output_type values of "np" and "pil", respectively.
  2. Add a docstring for encode_video.
  3. Update LTX 2 examples to use encode_video without any special processing.
  4. Support chunked video encoding, in line with the official LTX 2 export code at https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-pipelines/src/ltx_pipelines/utils/media_io.py.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sayakpaul
@yiyixuxu

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@dg845 dg845 requested a review from sayakpaul January 30, 2026 01:40
@sayakpaul (Member) left a comment

Thanks! Left some comments.

video_np = video.cpu().numpy()
"""
Encodes a video with audio using the PyAV library. Based on code from the original LTX-2 repo:
https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-pipelines/src/ltx_pipelines/utils/media_io.py
Member

(nit): let's put a permalink.

        The sampling rate of the audio waveform. For LTX 2, this is typically 24000 (24 kHz).
    output_path (`str`):
        The path to save the encoded video to.
    video_chunks_number (`int`, *optional*, defaults to `1`):
Member

When is this option helpful?

Collaborator Author

The original LTX-2 code will use a video_chunks_number calculated from the video VAE tiling config, for example in two stage inference:

https://github.com/Lightricks/LTX-2/blob/4f410820b198e05074a1e92de793e3b59e9ab5a0/packages/ltx-pipelines/src/ltx_pipelines/ti2vid_two_stages.py#L257

For the default num_frames value of 121 and default tiling config TilingConfig.default(), I believe this works out to 3 chunks. The idea seems to be that the chunks correspond to each tiled stride when decoding.

In practice, I haven't had any issues with the current code, which is equivalent to just using one chunk. I don't fully understand why the original code supports it; my guess is that it is useful for very long videos or under compute constraints.
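
For intuition, a chunked export just iterates over frame-axis slices of the decoded video and encodes them one at a time. A minimal sketch of that idea (not the actual LTX-2 tiling logic, which derives the chunk boundaries from the VAE tiling config):

import torch

def chunk_video(video: torch.Tensor, video_chunks_number: int):
    # `video` is assumed to be a (num_frames, height, width, 3) uint8 tensor;
    # torch.chunk splits it along the frame axis into roughly equal pieces.
    yield from torch.chunk(video, video_chunks_number, dim=0)

video = torch.zeros(121, 64, 64, 3, dtype=torch.uint8)
print([chunk.shape[0] for chunk in chunk_video(video, 3)])  # [41, 41, 39]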

Collaborator Author

See #13057 (comment) for discussion about some complications for supporting video_chunks_number.

        video = torch.from_numpy(video)
    elif isinstance(video, np.ndarray):
        # Pipeline output_type="np"
        video = (video * 255).round().astype("uint8")
Member

Should we check for the value range before doing this? Just to be safe.

for video_chunk in tqdm(all_tiles(first_chunk, video), total=video_chunks_number):
    video_chunk_cpu = video_chunk.to("cpu").numpy()
    for frame_array in video_chunk_cpu:
        frame = av.VideoFrame.from_ndarray(frame_array, format="rgb24")
Member

Should we let the users control this format? 👀

@dg845 (Collaborator Author) commented Jan 30, 2026

I think we could allow the users to specify the format, but this would be in tension with value checking as suggested in #13057 (comment): for example, if we always convert denormalized inputs with values in $[0, 1]$ to uint8 values in $\{0, 1, \ldots, 255\}$, that would probably make it difficult to support a variety of formats.

We could conditionally convert based on the supplied video_format, but my understanding is that there are a lot of video formats, and I don't think we can anticipate all of the use cases that users may have. So I think we could support a video_format argument with a "use at your own risk" caveat:

    elif isinstance(video, np.ndarray):
        # Pipeline output_type="np"
        is_denormalized = np.logical_and(np.zeros_like(video) <= video, video <= np.ones_like(video))
        if np.all(is_denormalized) and video_format == "rgb24":
            video = (video * 255).round().astype("uint8")
        else:
            logger.warning(
                f"The video will be encoded using the input `video` values as-is with format {video_format}. Make sure"
                " the values are in the proper range for the supplied format."
            )
        video = torch.from_numpy(video)

An alternative would be to only support "rgb24" as the original LTX-2 code does with the idea that power users can use their own video encoding code if they have a different use case.

EDIT: the right terminology here might be "pixel format" rather than "video format".

Member

An alternative would be to only support "rgb24" as the original LTX-2 code does with the idea that power users can use their own video encoding code if they have a different use case.

Okay let's go with this.

    yield first_chunk
    yield from tiles_generator

for video_chunk in tqdm(all_tiles(first_chunk, video), total=video_chunks_number):
Member

WDYT of getting rid of all_tiles() and doing it like so?

from itertools import chain

for video_chunk in tqdm(chain([first_chunk], video), total=video_chunks_number):
    video_chunk_cpu = video_chunk.to("cpu").numpy()
    for frame_array in video_chunk_cpu:
        frame = av.VideoFrame.from_ndarray(frame_array, format="rgb24")
        for packet in stream.encode(frame):
            container.mux(packet)

Collaborator Author

This does the right thing but appears not to work well with tqdm, which doesn't update properly from the chain object:

 33%|████████████████████████████████▋                                                                 | 1/3 [00:04<00:09,  4.57s/it]

Member

Okay!

@dg845 (Collaborator Author) commented Jan 30, 2026

Actually, I think #13057 (comment) is wrong - as we generally supply a single torch.Tensor to encode_video (for example from a pipeline output), this line creates an iterator with one element:

if isinstance(video, torch.Tensor):
    video = iter([video])
first_chunk = next(video)

So when we call next(video) on the following line, the video iterator is exhausted. Even if we set video_chunks_number > 1 in this case, our for loop through first_chunk and video will only yield one element in total, whether that's using all_tiles or chain. Thus, the progress bar will end up being wrong since we tell tqdm that we have video_chunks_number > 1 elements when we in fact only have one.
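
A tiny repro of that exhaustion (the tensor shape is an arbitrary stand-in):

import torch

video = torch.zeros(8, 64, 64, 3, dtype=torch.uint8)  # a single decoded tensor
video = iter([video])        # the wrapping applied to torch.Tensor inputs
first_chunk = next(video)    # consumes the only element
print(len(list(video)))      # 0 -> the loop only ever sees `first_chunk`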

I think the underlying difference is that the original LTX 2 code will return an iterator over decoded tiles when performing tiled VAE decoding, whereas we will return the whole decoded output as a single tensor with the tiles stitched back together. So maybe it doesn't make sense to support video_chunks_number as this will only work well when we supply an Iterator[torch.Tensor] to encode_video (in the current implementation).
