
Commit 8332ece (parent: 401d7d7)

Merge pull request #1034 from modelscope/video_as_prompt

Video as prompt

13 files changed (+635, −16 lines)

README.md (3 additions, 0 deletions)

@@ -237,6 +237,7 @@ save_video(video, "video1.mp4", fps=15, quality=5)
 |[DiffSynth-Studio/Wan2.1-1.3b-speedcontrol-v1](https://modelscope.cn/models/DiffSynth-Studio/Wan2.1-1.3b-speedcontrol-v1)|`motion_bucket_id`|[code](./examples/wanvideo/model_inference/Wan2.1-1.3b-speedcontrol-v1.py)|[code](./examples/wanvideo/model_training/full/Wan2.1-1.3b-speedcontrol-v1.sh)|[code](./examples/wanvideo/model_training/validate_full/Wan2.1-1.3b-speedcontrol-v1.py)|[code](./examples/wanvideo/model_training/lora/Wan2.1-1.3b-speedcontrol-v1.sh)|[code](./examples/wanvideo/model_training/validate_lora/Wan2.1-1.3b-speedcontrol-v1.py)|
 |[krea/krea-realtime-video](https://www.modelscope.cn/models/krea/krea-realtime-video)||[code](./examples/wanvideo/model_inference/krea-realtime-video.py)|[code](./examples/wanvideo/model_training/full/krea-realtime-video.sh)|[code](./examples/wanvideo/model_training/validate_full/krea-realtime-video.py)|[code](./examples/wanvideo/model_training/lora/krea-realtime-video.sh)|[code](./examples/wanvideo/model_training/validate_lora/krea-realtime-video.py)|
 |[meituan-longcat/LongCat-Video](https://www.modelscope.cn/models/meituan-longcat/LongCat-Video)|`longcat_video`|[code](./examples/wanvideo/model_inference/LongCat-Video.py)|[code](./examples/wanvideo/model_training/full/LongCat-Video.sh)|[code](./examples/wanvideo/model_training/validate_full/LongCat-Video.py)|[code](./examples/wanvideo/model_training/lora/LongCat-Video.sh)|[code](./examples/wanvideo/model_training/validate_lora/LongCat-Video.py)|
+|[ByteDance/Video-As-Prompt-Wan2.1-14B](https://modelscope.cn/models/ByteDance/Video-As-Prompt-Wan2.1-14B)|`vap_video`, `vap_prompt`|[code](./examples/wanvideo/model_inference/Video-As-Prompt-Wan2.1-14B.py)|[code](./examples/wanvideo/model_training/full/Video-As-Prompt-Wan2.1-14B.sh)|[code](./examples/wanvideo/model_training/validate_full/Video-As-Prompt-Wan2.1-14B.py)|[code](./examples/wanvideo/model_training/lora/Video-As-Prompt-Wan2.1-14B.sh)|[code](./examples/wanvideo/model_training/validate_lora/Video-As-Prompt-Wan2.1-14B.py)|
 
 </details>

@@ -387,6 +388,8 @@ https://github.com/Artiprocher/DiffSynth-Studio/assets/35051019/59fb2f7b-8de0-44
 
 ## Update History
 
+- **November 4, 2025**: We support [ByteDance/Video-As-Prompt-Wan2.1-14B](https://modelscope.cn/models/ByteDance/Video-As-Prompt-Wan2.1-14B) model, which is trained on Wan 2.1 and enables motion generation conditioned on reference videos.
+
 - **October 30, 2025**: We support [meituan-longcat/LongCat-Video](https://www.modelscope.cn/models/meituan-longcat/LongCat-Video) model, which enables text-to-video, image-to-video, and video continuation capabilities. This model adopts Wan's framework for both inference and training in this project.
 
 - **October 27, 2025**: We support [krea/krea-realtime-video](https://www.modelscope.cn/models/krea/krea-realtime-video) model, further expanding Wan's ecosystem.

README_zh.md (3 additions, 0 deletions; Chinese content translated below)

@@ -237,6 +237,7 @@ save_video(video, "video1.mp4", fps=15, quality=5)
 |[DiffSynth-Studio/Wan2.1-1.3b-speedcontrol-v1](https://modelscope.cn/models/DiffSynth-Studio/Wan2.1-1.3b-speedcontrol-v1)|`motion_bucket_id`|[code](./examples/wanvideo/model_inference/Wan2.1-1.3b-speedcontrol-v1.py)|[code](./examples/wanvideo/model_training/full/Wan2.1-1.3b-speedcontrol-v1.sh)|[code](./examples/wanvideo/model_training/validate_full/Wan2.1-1.3b-speedcontrol-v1.py)|[code](./examples/wanvideo/model_training/lora/Wan2.1-1.3b-speedcontrol-v1.sh)|[code](./examples/wanvideo/model_training/validate_lora/Wan2.1-1.3b-speedcontrol-v1.py)|
 |[krea/krea-realtime-video](https://www.modelscope.cn/models/krea/krea-realtime-video)||[code](./examples/wanvideo/model_inference/krea-realtime-video.py)|[code](./examples/wanvideo/model_training/full/krea-realtime-video.sh)|[code](./examples/wanvideo/model_training/validate_full/krea-realtime-video.py)|[code](./examples/wanvideo/model_training/lora/krea-realtime-video.sh)|[code](./examples/wanvideo/model_training/validate_lora/krea-realtime-video.py)|
 |[meituan-longcat/LongCat-Video](https://www.modelscope.cn/models/meituan-longcat/LongCat-Video)|`longcat_video`|[code](./examples/wanvideo/model_inference/LongCat-Video.py)|[code](./examples/wanvideo/model_training/full/LongCat-Video.sh)|[code](./examples/wanvideo/model_training/validate_full/LongCat-Video.py)|[code](./examples/wanvideo/model_training/lora/LongCat-Video.sh)|[code](./examples/wanvideo/model_training/validate_lora/LongCat-Video.py)|
+|[ByteDance/Video-As-Prompt-Wan2.1-14B](https://modelscope.cn/models/ByteDance/Video-As-Prompt-Wan2.1-14B)|`vap_video`, `vap_prompt`|[code](./examples/wanvideo/model_inference/Video-As-Prompt-Wan2.1-14B.py)|[code](./examples/wanvideo/model_training/full/Video-As-Prompt-Wan2.1-14B.sh)|[code](./examples/wanvideo/model_training/validate_full/Video-As-Prompt-Wan2.1-14B.py)|[code](./examples/wanvideo/model_training/lora/Video-As-Prompt-Wan2.1-14B.sh)|[code](./examples/wanvideo/model_training/validate_lora/Video-As-Prompt-Wan2.1-14B.py)|
 
 </details>

@@ -403,6 +404,8 @@ https://github.com/Artiprocher/DiffSynth-Studio/assets/35051019/59fb2f7b-8de0-44
 
 ## Update History
 
+- **November 4, 2025**: Added support for the [ByteDance/Video-As-Prompt-Wan2.1-14B](https://modelscope.cn/models/ByteDance/Video-As-Prompt-Wan2.1-14B) model, which is trained on Wan 2.1 and generates motion conditioned on a reference video.
+
 - **October 30, 2025**: Added support for the [meituan-longcat/LongCat-Video](https://www.modelscope.cn/models/meituan-longcat/LongCat-Video) model, which supports text-to-video, image-to-video, and video continuation. In this project it reuses Wan's framework for inference and training.
 
 - **October 27, 2025**: Added support for the [krea/krea-realtime-video](https://www.modelscope.cn/models/krea/krea-realtime-video) model, a new addition to the Wan ecosystem.

diffsynth/configs/model_config.py (2 additions, 0 deletions)

@@ -64,6 +64,7 @@
 from ..models.wan_video_vace import VaceWanModel
 from ..models.wav2vec import WanS2VAudioEncoder
 from ..models.wan_video_animate_adapter import WanAnimateAdapter
+from ..models.wan_video_mot import MotWanModel
 
 from ..models.step1x_connector import Qwen2Connector
 

@@ -157,6 +158,7 @@
 (None, "2267d489f0ceb9f21836532952852ee5", ["wan_video_dit"], [WanModel], "civitai"),
 (None, "5ec04e02b42d2580483ad69f4e76346a", ["wan_video_dit"], [WanModel], "civitai"),
 (None, "47dbeab5e560db3180adf51dc0232fb1", ["wan_video_dit"], [WanModel], "civitai"),
+(None, "5f90e66a0672219f12d9a626c8c21f61", ["wan_video_dit", "wan_video_vap"], [WanModel, MotWanModel], "diffusers"),
 (None, "a61453409b67cd3246cf0c3bebad47ba", ["wan_video_dit", "wan_video_vace"], [WanModel, VaceWanModel], "civitai"),
 (None, "7a513e1f257a861512b1afd387a8ecd9", ["wan_video_dit", "wan_video_vace"], [WanModel, VaceWanModel], "civitai"),
 (None, "cb104773c6c2cb6df4f9529ad5c60d0b", ["wan_video_dit"], [WanModel], "diffusers"),
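Each entry in this table maps an md5 hash of a checkpoint's parameter names to the model classes to instantiate, so an architecture can be detected from the state dict alone. A minimal sketch of how such key hashing might work; the real `hash_state_dict_keys` in DiffSynth-Studio may include shapes or other details, so treat this as an assumption:

```python
import hashlib


def hash_state_dict_keys(state_dict, with_shape=True):
    """Fingerprint a checkpoint by hashing its (sorted) parameter names.

    Two checkpoints of the same architecture share parameter names even
    when the weights differ, so the digest identifies the model family.
    """
    keys = []
    for name, param in state_dict.items():
        entry = name
        # Optionally bake tensor shapes into the fingerprint as well.
        if with_shape and hasattr(param, "shape"):
            entry += ":" + ",".join(str(s) for s in param.shape)
        keys.append(entry)
    # Sort so that iteration order of the dict does not affect the hash.
    return hashlib.md5("\n".join(sorted(keys)).encode("utf-8")).hexdigest()
```

With a scheme like this, the new `"5f90e66a0672219f12d9a626c8c21f61"` entry simply records the fingerprint of checkpoints that contain both the Wan DiT weights and the new Video-As-Prompt (`MotWanModel`) weights.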

diffsynth/models/wan_video_dit.py (28 additions, 1 deletion)

@@ -437,6 +437,11 @@ def from_diffusers(self, state_dict):
 "blocks.0.attn2.to_q.weight": "blocks.0.cross_attn.q.weight",
 "blocks.0.attn2.to_v.bias": "blocks.0.cross_attn.v.bias",
 "blocks.0.attn2.to_v.weight": "blocks.0.cross_attn.v.weight",
+"blocks.0.attn2.add_k_proj.bias": "blocks.0.cross_attn.k_img.bias",
+"blocks.0.attn2.add_k_proj.weight": "blocks.0.cross_attn.k_img.weight",
+"blocks.0.attn2.add_v_proj.bias": "blocks.0.cross_attn.v_img.bias",
+"blocks.0.attn2.add_v_proj.weight": "blocks.0.cross_attn.v_img.weight",
+"blocks.0.attn2.norm_added_k.weight": "blocks.0.cross_attn.norm_k_img.weight",
 "blocks.0.ffn.net.0.proj.bias": "blocks.0.ffn.0.bias",
 "blocks.0.ffn.net.0.proj.weight": "blocks.0.ffn.0.weight",
 "blocks.0.ffn.net.2.bias": "blocks.0.ffn.2.bias",

@@ -454,6 +459,14 @@ def from_diffusers(self, state_dict):
 "condition_embedder.time_embedder.linear_2.weight": "time_embedding.2.weight",
 "condition_embedder.time_proj.bias": "time_projection.1.bias",
 "condition_embedder.time_proj.weight": "time_projection.1.weight",
+"condition_embedder.image_embedder.ff.net.0.proj.bias": "img_emb.proj.1.bias",
+"condition_embedder.image_embedder.ff.net.0.proj.weight": "img_emb.proj.1.weight",
+"condition_embedder.image_embedder.ff.net.2.bias": "img_emb.proj.3.bias",
+"condition_embedder.image_embedder.ff.net.2.weight": "img_emb.proj.3.weight",
+"condition_embedder.image_embedder.norm1.bias": "img_emb.proj.0.bias",
+"condition_embedder.image_embedder.norm1.weight": "img_emb.proj.0.weight",
+"condition_embedder.image_embedder.norm2.bias": "img_emb.proj.4.bias",
+"condition_embedder.image_embedder.norm2.weight": "img_emb.proj.4.weight",
 "patch_embedding.bias": "patch_embedding.bias",
 "patch_embedding.weight": "patch_embedding.weight",
 "scale_shift_table": "head.modulation",

@@ -470,7 +483,7 @@ def from_diffusers(self, state_dict):
 name_ = rename_dict[name_]
 name_ = ".".join(name_.split(".")[:1] + [name.split(".")[1]] + name_.split(".")[2:])
 state_dict_[name_] = param
-if hash_state_dict_keys(state_dict) == "cb104773c6c2cb6df4f9529ad5c60d0b":
+if hash_state_dict_keys(state_dict_) == "cb104773c6c2cb6df4f9529ad5c60d0b":
 config = {
 "model_type": "t2v",
 "patch_size": (1, 2, 2),

@@ -488,6 +501,20 @@ def from_diffusers(self, state_dict):
 "cross_attn_norm": True,
 "eps": 1e-6,
 }
+elif hash_state_dict_keys(state_dict_) == "6bfcfb3b342cb286ce886889d519a77e":
+config = {
+"has_image_input": True,
+"patch_size": [1, 2, 2],
+"in_dim": 36,
+"dim": 5120,
+"ffn_dim": 13824,
+"freq_dim": 256,
+"text_dim": 4096,
+"out_dim": 16,
+"num_heads": 40,
+"num_layers": 40,
+"eps": 1e-6
+}
 else:
 config = {}
 return state_dict_, config
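The rename table above only spells out the block-0 key names; the `".".join(...)` line generalizes them to every transformer block by looking up the block-0 template and splicing the real block index back into the renamed key. A standalone sketch of that logic, under the assumption that this is how the lookup works (the helper name `rename_diffusers_key` is hypothetical):

```python
def rename_diffusers_key(name: str, rename_dict: dict) -> str:
    """Map a Diffusers-style key to this repo's naming using a block-0 table.

    Keys like "blocks.7.attn2.to_q.weight" are not in the table directly;
    only the "blocks.0.*" variants are. Normalize the index to 0, look up
    the target name, then splice the original index back in.
    """
    parts = name.split(".")
    if parts[0] == "blocks" and len(parts) > 2:
        template = ".".join(parts[:1] + ["0"] + parts[2:])
        if template in rename_dict:
            target = rename_dict[template].split(".")
            return ".".join(target[:1] + [parts[1]] + target[2:])
    # Non-block keys (e.g. "patch_embedding.bias") are looked up verbatim.
    return rename_dict.get(name, name)
```

Under this reading, the new `add_k_proj`/`add_v_proj` entries make the image cross-attention weights of all 40 blocks convert, not just block 0, and the `state_dict` → `state_dict_` change in the last hunk fixes the hash check to fingerprint the *converted* keys rather than the original Diffusers ones.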
