
[Request] Optimize HunyuanVideo Inference Speed with ParaAttention #10383

Open

chengzeyi opened this issue Dec 25, 2024 · 5 comments

@chengzeyi (Contributor) commented Dec 25, 2024

Hi guys,

First and foremost, I would like to commend you for the incredible work on the diffusers library. It has been an invaluable resource for my projects.

I am writing to suggest an enhancement to the inference speed of the HunyuanVideo model. We have found that ParaAttention can significantly speed up HunyuanVideo inference: it provides context-parallel attention that works with torch.compile and supports both Ulysses-style and ring-style parallelism. I hope we can add a doc or introduction on how to make HunyuanVideo in diffusers run faster with ParaAttention. Besides HunyuanVideo, FLUX, Mochi, and CogVideoX are also supported.
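For intuition, the ring-style context parallelism mentioned above can be sketched in a single process: each simulated rank owns one block of K/V, and the partial attention results are merged with an online (running max / running sum) softmax, so the blockwise output matches full attention exactly. This NumPy sketch is only an illustration of the idea, not ParaAttention's actual implementation:

```python
import numpy as np

def full_attention(q, k, v):
    # Reference: standard softmax attention over the whole sequence.
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

def ring_attention(q, k, v, world_size):
    # Single-process simulation of ring-style context parallelism:
    # each "rank" owns a slice of K/V, and partial results are combined
    # with an online softmax (as in FlashAttention / RingAttention).
    k_blocks = np.array_split(k, world_size)
    v_blocks = np.array_split(v, world_size)
    scale = 1.0 / np.sqrt(q.shape[-1])
    m = np.full((q.shape[0], 1), -np.inf)  # running row-wise max
    l = np.zeros((q.shape[0], 1))          # running softmax denominator
    acc = np.zeros_like(q)                 # running weighted sum of V
    for kb, vb in zip(k_blocks, v_blocks):  # one hop per ring step
        s = q @ kb.T * scale
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        p = np.exp(s - m_new)
        correction = np.exp(m - m_new)     # rescale old partials to new max
        l = l * correction + p.sum(axis=-1, keepdims=True)
        acc = acc * correction + p @ vb
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
assert np.allclose(ring_attention(q, k, v, world_size=4), full_attention(q, k, v))
```

In the real distributed setting, each ring step is a peer-to-peer exchange of K/V blocks between neighboring GPUs rather than a slice of a local array.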

Steps to Optimize HunyuanVideo Inference with ParaAttention:

Install ParaAttention:

pip3 install para-attn
# Or visit https://github.com/chengzeyi/ParaAttention.git to see detailed instructions

Example Script:

Here is an example script to run HunyuanVideo with ParaAttention:

import torch
import torch.distributed as dist
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

dist.init_process_group()

# Work around a cuDNN scaled-dot-product-attention failure seen with this setup:
# [rank1]: RuntimeError: Expected mha_graph->execute(handle, variant_pack, workspace_ptr.get()).is_good() to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)
torch.backends.cuda.enable_cudnn_sdp(False)

model_id = "tencent/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
    revision="refs/pr/18",
)
pipe = HunyuanVideoPipeline.from_pretrained(
    model_id,
    transformer=transformer,
    torch_dtype=torch.float16,
    revision="refs/pr/18",
).to(f"cuda:{dist.get_rank()}")

pipe.vae.enable_tiling(
    # Uncomment these smaller tile sizes to make it runnable on GPUs with 48GB memory:
    # tile_sample_min_height=128,
    # tile_sample_stride_height=96,
    # tile_sample_min_width=128,
    # tile_sample_stride_width=96,
    # tile_sample_min_num_frames=32,
    # tile_sample_stride_num_frames=24,
)

from para_attn.context_parallel import init_context_parallel_mesh
from para_attn.context_parallel.diffusers_adapters import parallelize_pipe
from para_attn.parallel_vae.diffusers_adapters import parallelize_vae

mesh = init_context_parallel_mesh(
    pipe.device.type,
)
parallelize_pipe(
    pipe,
    mesh=mesh,
)
parallelize_vae(pipe.vae, mesh=mesh._flatten())

# pipe.enable_model_cpu_offload(gpu_id=dist.get_rank())

# torch._inductor.config.reorder_for_compute_comm_overlap = True
# pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune-no-cudagraphs")

output = pipe(
    prompt="A cat walks on the grass, realistic",
    height=720,
    width=1280,
    num_frames=129,
    num_inference_steps=30,
    output_type="pil" if dist.get_rank() == 0 else "pt",
).frames[0]

if dist.get_rank() == 0:
    print("Saving video to hunyuan_video.mp4")
    export_to_video(output, "hunyuan_video.mp4", fps=15)

dist.destroy_process_group()

Save the above code to run_hunyuan_video.py and run it with torchrun:

torchrun --nproc_per_node=2 run_hunyuan_video.py
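For completeness, the Ulysses-style parallelism mentioned earlier can also be simulated in one process: inputs start sequence-sharded across ranks, an all-to-all regroups them so each rank holds all tokens for a subset of heads, each rank runs ordinary attention on its heads, and a second all-to-all restores sequence sharding. In this NumPy sketch the all-to-alls reduce to slicing global arrays; it illustrates the communication pattern, not ParaAttention's real code:

```python
import numpy as np

def attention(q, k, v):
    # Standard softmax attention, batched over heads: inputs are (H, N, D).
    s = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

def ulysses_attention(q, k, v, world_size):
    # Single-process simulation: before attention, each rank holds all heads
    # for 1/world_size of the tokens; an all-to-all trades that for all
    # tokens of 1/world_size of the heads. Here the all-to-all is just
    # slicing the global (H, N, D) arrays along the head axis.
    h_per_rank = q.shape[0] // world_size
    out = np.empty_like(q)
    for rank in range(world_size):
        sl = slice(rank * h_per_rank, (rank + 1) * h_per_rank)
        # Each rank computes exact attention for its own heads over the full sequence.
        out[sl] = attention(q[sl], k[sl], v[sl])
    return out  # a reverse all-to-all would re-shard this along the sequence

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 32, 8)) for _ in range(3))
assert np.allclose(ulysses_attention(q, k, v, world_size=2), attention(q, k, v))
```

Because each head's attention is computed exactly, Ulysses-style parallelism is mathematically equivalent to single-GPU attention; only the communication pattern changes.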

The generated video on 2x H100:

(video attachment: hunyuan_video.mp4)

By following these steps, users can leverage ParaAttention to achieve faster inference times with HunyuanVideo on multiple GPUs.

Thank you for considering this suggestion. I believe it could greatly benefit the community and enhance the performance of HunyuanVideo. Please let me know if there are any questions or further clarifications needed.

@chengzeyi chengzeyi changed the title Optimize HunyuanVideo Inference Speed with ParaAttention [Request] Optimize HunyuanVideo Inference Speed with ParaAttention Dec 25, 2024
@a-r-r-o-w (Member)

Thanks for the kind words @chengzeyi ☺️

Great work building this, and it would be really cool to mention ParaAttention (similar to how we have dedicated doc pages for xDiT, DeepCache, and others). Apart from it being an extremely fast inference solution, it is a very valuable educational resource due to the simplicity of the implementations (I've personally learnt a lot from the codebase, so thank you so much!).

cc @yiyixuxu @stevhliu

@stevhliu (Member)

As @a-r-r-o-w mentioned, feel free to create a dedicated doc page for ParaAttention and you can ping me on it! 🤗

@yiyixuxu (Collaborator) commented Jan 2, 2025

@chengzeyi thanks for opening the issue!
I agree with @stevhliu and @a-r-r-o-w; we would love to promote ParaAttention with diffusers.

Would you be willing to open a PR for the doc page? It is best if it comes from the author. If not, let us know and we can help as well.

@chengzeyi (Contributor, Author)

> @chengzeyi thanks for opening the issue! agree with @stevhliu and @a-r-r-o-w, we would love to promote ParaAttention with diffusers
>
> Would you be willing to open a PR for the doc page? it is best if it's coming from the author. If not, let us know. we can help as well

Thank you! I am currently writing a doc about how to optimize video model inference. This could be a good start.

@chengzeyi (Contributor, Author)

@stevhliu @yiyixuxu

https://github.com/chengzeyi/ParaAttention/blob/main/doc/fastest_hunyuan_video.md

I wrote a tutorial on how to optimize HunyuanVideo inference with ParaAttention.
I hope to add more details and cover more models like FLUX!
