I combined your code with diffusers stable diffusion and trained a model #10

lxj616 · 2022-12-27T09:38:35Z

https://github.com/lxj616/make-a-stable-diffusion-video

Used your Pseudo3DConv and pseudo 3d attention here:

https://github.com/lxj616/make-a-stable-diffusion-video/blob/main/src/diffusers/models/resnet_pseudo3d.py#L8
https://github.com/lxj616/make-a-stable-diffusion-video/blob/main/src/diffusers/models/attention_pseudo3d.py#L432

Thank you for opensource the Pseudo3D code, it seems to be working

hxngiee · 2022-12-27T11:24:23Z

Hi, @lxj616
Thanks for your interesting Make A Stable Diffusion Video
I wonder how to train your Pretrained toy model
Could you explain a little bit more how to trained it?

lxj616 · 2022-12-27T11:49:06Z

@hxngiee I trained the model using examples/research_projects/dreambooth_inpaint/train_dreambooth_inpaint.py from thedarkzeno or patil-suraj

And because we are doing video, load the dataset as (b, c, f, h, w) instead of (b, c, h, w), and everything else are taken care of by the original script author, about how to do fp16/accelerate/8-bit adam please view the README of the dreambooth subfolder, they are mostly out of the box usable

If you need more explanations, I could also share my train_dreambooth.py, but I wrote very messy bad code and does not even rename it LOL, lots of hardcoding hacky tricks, I guess you'll end up rewrite the original train_dreambooth_inpaint.py and that's faster than to debug mine

hxngiee · 2022-12-27T13:19:14Z

@lxj616 cool
Is this method fine-tuning approach which applying a model trained with image to video? or Using video model and data from the training stage?
Actually I don't read the dreambooth paper closely

lxj616 · 2022-12-27T14:23:00Z

@hxngiee It's the text2image model with new temporal layers, the text2image model is the stable diffusion, the new layer needs to be trained similar to dreambooth example, since you ask how to train a model in diffusers ...

Not finetuning the text2image model, the backbone is frozen, only training the new layers
And the new layers deal with video only, not optimizing on image generation

maybe you wish to read train_dreambooth_inpaint.py to understand how to train this video model, but don't get the ideas wrong, we are talking make-a-video not dreambooth

hxngiee · 2022-12-27T14:41:11Z

@lxj616 Thank you for your reply. I understand what you did. To make a video, adding temporal consistency layer and train them similar to dreambooth.

Pseudo3DConv and pseudo 3d attention were effective in training video diffusion model.

Thanks for sharing your finding and I will look at the code closely!

lucidrains · 2022-12-27T17:04:15Z

@lxj616 nice! yea, i still need to complete https://github.com/lucidrains/classifier-free-guidance-pytorch , and then integrate this all into dalle2-pytorch

should be all complete in early january. if you cannot wait, try using this

lucidrains · 2022-12-27T17:07:08Z

@lxj616 cat is looking quite majestic 🤣 let's make it move in 2023

Samge0 · 2022-12-29T03:21:19Z

nice~

chavinlo · 2022-12-29T07:08:43Z

Amazing job.

chavinlo · 2022-12-29T07:13:17Z

If you need more explanations, I could also share my train_dreambooth.py

Could it be possible to do so?

lxj616 · 2022-12-29T07:32:36Z

@chavinlo I dropped my messy script at https://gist.github.com/lxj616/5134368f44aca837304530695ee100ea

But I bet it would be quicker if you modify the original train_dreambooth.py from diffusers than to debug mine, I barely make it run on my specific environment, it has 99.9% chance not gonna run on your system LOL

chavinlo · 2022-12-29T19:28:45Z

@chavinlo I dropped my messy script at https://gist.github.com/lxj616/5134368f44aca837304530695ee100ea

But I bet it would be quicker if you modify the original train_dreambooth.py from diffusers than to debug mine, I barely make it run on my specific environment, it has 99.9% chance not gonna run on your system LOL

Thanks. Could it be also possible to release the webdataset making code?
and you just used the CLIPTextModel encoder (from text_encoder folder) to create the txt_embed on the npz right?

chavinlo · 2022-12-29T20:47:21Z

I've read your blog about VRAM limitations. If you need more compute, I can give you an A100 to experiment.

lxj616 · 2022-12-30T02:47:53Z

@chavinlo Thanks for asking but 24GB is enough for testing if I pre-compute the embeddings and save them into webdataset, since I see you got A100 (perhaps 40GB vram), you do not need my webdataset making code, you can just load a video and vae encode it on the fly (which is much more easier to use), my webdataset making is actually done in python interactive shell and did not log a python script because I thought it was one-time thing per dataset, I may need to log everything down on my next attempt ...

chavinlo · 2022-12-30T08:10:48Z

@lxj616 Thanks. One more question, in the preprocess function, you treat npz as if it had all the videos? because it itterates through it, and adds all the frames of npz_i into f8_list, and keeps doing it until there are no more npz_i left? Finally, does example['f8'] contains all the video's frames? or just a single video's frame?

lxj616 · 2022-12-30T09:12:17Z

@chavinlo One npz contains all video frames of one single video, the loop is dealing with a batch, and the final example['f8'] is a batch of video frames with shape (b c f h w), where f is the frame length

chavinlo · 2023-01-05T02:02:48Z

@lxj616 Hello again, I got training working with bs 1 and 25 frames. Although I had to convert the model to BFloat16, because I got OOM with fp32 (80GB+), and loss=nan with fp16. I see that you mentioned that you used fp16 and 8bit. How did you managed to use them? I can't use 8bit with my current setup because it won't work with bf16.

chavinlo · 2023-01-05T02:03:14Z

Also, bf16 uses 44GB, but when using grad checkpointing, it decreases to 11GB

lxj616 · 2023-01-05T14:44:05Z

@chavinlo Hmmm... I never met this problem, I just use the original code and when running accelerate config choose fp16 and everything seems to work, maybe, there were some warnings said [INFO] [stage_1_and_2.py:1769:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0 but I don't know what does it mean, BTW I didn't change anything for 8-bit adam, maybe it does not work and I didn't discover ....

chavinlo · 2023-01-13T18:00:14Z

Training with 240 frames... it's really big. 74GB of VRAM usage WITH gradient checkpointing and bf16

lxj616 · 2023-01-14T02:44:57Z

Training with 240 frames... it's really big. 74GB of VRAM usage WITH gradient checkpointing and bf16

For 13.73s/it , if you train on webvid-10m, you would need 4.35 years to finish one epoch, and 20 epochs for 87 years, vram is not actually a problem I guess ...

vukrosic · 2023-01-14T05:36:39Z

@lxj616 @lucidrains @Samge0 @hxngiee @chavinlo Hello, I'm starting a startup using lxj616's Make-a-stable-diffusion-video repository as one of the models for the text2video product, similar to what MidJourney does with text2image.

Our long term goal is to allow anybody to create a Hollywood movie in 1 hour. If it succeeds, it could be one of the biggest companies in the world.

If any of you are interested in becoming a cofounder for an equal split of the company, I've explained our short and long term plans at https://youtu.be/lbhUB1GyYZE

chavinlo · 2023-01-14T05:56:20Z

Training with 240 frames... it's really big. 74GB of VRAM usage WITH gradient checkpointing and bf16

For 13.73s/it , if you train on webvid-10m, you would need 4.35 years to finish one epoch, and 20 epochs for 87 years, vram is not actually a problem I guess ...

This run broke the model somehow...
It has some weird glitch effect, I will upload the gif tomorrow

Now am trying to train it again on a dance video dataset from TikTok but am not sure if its going to work, 6000 steps and the humans do maintain most of their body shape but don't move.

Could it be possible for you to join my disocrd server or hmu so we can discuss further? I do plan on extending training to some bigger and more general dataset.

https://discord.gg/Fy66AEwC

chavinlo · 2023-01-14T14:04:54Z

This is after 20,000 steps. No upgrades
Maybe you could post the script or how you initialized the extra layers? I saw that you use the inpainting model (9 channels)

lxj616 · 2023-01-15T02:34:54Z

This is after 20,000 steps. No upgrades Maybe you could post the script or how you initialized the extra layers? I saw that you use the inpainting model (9 channels)

There could be a lot of things went wrong, I don't know what exact leads to this output

If you train at 240 frames, you should always inference for 240 frames, unless your video is some action loop that cycles again and again
Real person is extremely hard for image/video generation, try landscape/cloudscape what stable-diffusion does best (if stable diffusion itself can not generate good enough people image, how is it possible for video)
I don't know how large is your dataset, did you try training one single video to see if the model/code goes wrong ? In my case I already tested my model can fully remember one single video (12 frames) in less than 200 steps before my toy model experiment, if you worry about the correctness of the code, try one video first, then go for the stars

I'm active at LAION discord server, I tried to join your discord link, it leads to a japanese anime server with lots of people free chatting on loads of different matters, so I quit the server LOL

lxj616 · 2023-01-15T02:48:21Z

@lxj616 @lucidrains @Samge0 @hxngiee @chavinlo Hello, I'm starting a startup using lxj616's Make-a-stable-diffusion-video repository as one of the models for the text2video product, similar to what MidJourney does with text2image.

Our long term goal is to allow anybody to create a Hollywood movie in 1 hour. If it succeeds, it could be one of the biggest companies in the world.

If any of you are interested in becoming a cofounder for an equal split of the company, I've explained our short and long term plans at https://youtu.be/lbhUB1GyYZE

Hello to you too, I don't know how to reply to you because there are many things you might wish to dig in and learn further before boldly go on a long adventure, I saw your comment 14 days ago asking what pretrained_model_name_or_path to use, and honestly I don't think I can answer that in simple words too, for it's not that simple as you might think, however you are welcome to ask and please understand we can not reply you every time if we don't know how to respond properly, like this time, and last time maybe

vukrosic · 2023-01-15T04:39:36Z

Thank you for the reply. All logic lead me to the fact that I should learn it as well. Honestly it makes me a bit mad that I will need 6 months of every day to learn all of this, but it is what it is. I made the CNN minist digit recognition, so that's something.

…

On Sun, Jan 15, 2023, 03:48 lxj616 ***@***.***> wrote: @lxj616 <https://github.com/lxj616> @lucidrains <https://github.com/lucidrains> @Samge0 <https://github.com/Samge0> @hxngiee <https://github.com/hxngiee> @chavinlo <https://github.com/chavinlo> Hello, I'm starting a startup using lxj616's Make-a-stable-diffusion-video repository as one of the models for the text2video product, similar to what MidJourney does with text2image. Our long term goal is to allow anybody to *create a Hollywood movie in 1 hour*. If it succeeds, it could be one of the biggest companies in the world. If any of you are interested in becoming a cofounder for an equal split of the company, I've explained our short and long term plans at https://youtu.be/lbhUB1GyYZE Hello to you too, I don't know how to reply to you because there are many things you might wish to dig in and learn further before boldly go on a long adventure, I saw your comment 14 days ago asking what pretrained_model_name_or_path to use, and honestly I don't think I can answer that in simple words too, for it's not that simple as you might think, however you are welcome to ask and please understand we can not reply you every time if we don't know how to respond properly, like this time, and last time maybe — Reply to this email directly, view it on GitHub <#10 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ARXZ4MG6TZQUZLOVEMAXQY3WSNQQDANCNFSM6AAAAAATKIXZFQ> . You are receiving this because you commented.Message ID: ***@***.***>

chavinlo · 2023-01-16T00:39:54Z

Thank you for the reply. All logic lead me to the fact that I should learn it as well. Honestly it makes me a bit mad that I will need 6 months of every day to learn all of this, but it is what it is. I made the CNN minist digit recognition, so that's something.
…
On Sun, Jan 15, 2023, 03:48 lxj616 @.> wrote: @lxj616 https://github.com/lxj616 @lucidrains https://github.com/lucidrains @Samge0 https://github.com/Samge0 @hxngiee https://github.com/hxngiee @chavinlo https://github.com/chavinlo Hello, I'm starting a startup using lxj616's Make-a-stable-diffusion-video repository as one of the models for the text2video product, similar to what MidJourney does with text2image. Our long term goal is to allow anybody to create a Hollywood movie in 1 hour. If it succeeds, it could be one of the biggest companies in the world. If any of you are interested in becoming a cofounder for an equal split of the company, I've explained our short and long term plans at https://youtu.be/lbhUB1GyYZE Hello to you too, I don't know how to reply to you because there are many things you might wish to dig in and learn further before boldly go on a long adventure, I saw your comment 14 days ago asking what pretrained_model_name_or_path to use, and honestly I don't think I can answer that in simple words too, for it's not that simple as you might think, however you are welcome to ask and please understand we can not reply you every time if we don't know how to respond properly, like this time, and last time maybe — Reply to this email directly, view it on GitHub <#10 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/ARXZ4MG6TZQUZLOVEMAXQY3WSNQQDANCNFSM6AAAAAATKIXZFQ . You are receiving this because you commented.Message ID: @.>

Not really though
I tinkered with basic cnns and YOLOs back in early '21, then completely forgot about it and came back in august '22 when sd got leaked. I don't exactly know how everything works but I do know enough to do what I want, and I think you can learn what's necessary in under a month.

vukrosic · 2023-01-16T14:05:19Z

Oh, if I'm able to start making something in a month that would be very interesting.

chavinlo · 2023-02-08T21:06:18Z

It's been a while since I posted here. Since my last response, and after many trials and errors (and tremendous help by lopho) I got it trained on a small dance dataset.

It has some problems due to how the dataloader was written.
I am also running another test but this time frozen just like OP's and the dataloader fixed.

Ar0Kim · 2023-02-09T07:52:11Z

It's been a while since I posted here. Since my last response, and after many trials and errors (and tremendous help by lopho) I got it trained on a small dance dataset.

It has some problems due to how the dataloader was written. I am also running another test but this time frozen just like OP's and the dataloader fixed.

Thank you for sharing your video I want to make a video like yours and I'm a newbie here and didn't get how to start..It seems like I have to train this model right? could you tell me in detail please I'd be very grateful.

lucidrains · 2023-02-18T20:49:58Z

It's been a while since I posted here. Since my last response, and after many trials and errors (and tremendous help by lopho) I got it trained on a small dance dataset.

It has some problems due to how the dataloader was written. I am also running another test but this time frozen just like OP's and the dataloader fixed.

omg, this looks great! congratulation with the training!

don't be surprised if your inbox gets flooded with video / AI founders 😆

fhlt · 2023-02-28T07:07:04Z

@lxj616 Can you share the prompt used for training timelapse?

lxj616 · 2023-02-28T07:36:24Z

@lxj616 Can you share the prompt used for training timelapse?

landscape cloudscape photo (for landscape videos)
cityscape cloudscape photo (for city videos)

tasinislam21 · 2023-03-30T19:57:55Z

It's been a while since I posted here. Since my last response, and after many trials and errors (and tremendous help by lopho) I got it trained on a small dance dataset.

It has some problems due to how the dataloader was written. I am also running another test but this time frozen just like OP's and the dataloader fixed.

Would you be able to share your training code?

lxj616 · 2023-04-19T04:14:26Z

@lucidrains

Nvidia just recently published https://arxiv.org/pdf/2304.08818.pdf

And in "3.1.1 Temporal Autoencoder Finetuning" they claimed to "finetune vae decoder on video data with a (patch-wise) temporal discriminator built from 3D convolutions"

This could reduce flickering artifacts as they claim

Since you are the top awesome AI expert among the opensource community, could you make a opensource demo implementation on this even only a few important lines ?

You are the only guy I know who can do this, sorry if I bother you and thanks in advance

chavinlo · 2023-04-19T05:32:34Z

@lucidrains

Nvidia just recently published https://arxiv.org/pdf/2304.08818.pdf

+1 They also mentioned using both a larger parameter size and a temporal superresolution based on Stable Diffusion 2.0 superresolution

For the latter, I think they used SDXL

tasinislam21 · 2023-04-19T10:33:28Z

or built from 3D convolutions"

This could reduce flickering artifacts as they claim

Since you are the top awesome AI expert among the opensource community, could you make a opensource demo implementation on this even only a few important lines ?

You are the only guy I know who can do this, sorry if I bother you and thanks in advance

It's interesting; they are treating diffusion models like GANs. They used a discriminator to train them.

lopho · 2023-04-20T03:38:04Z

or built from 3D convolutions"
This could reduce flickering artifacts as they claim
Since you are the top awesome AI expert among the opensource community, could you make a opensource demo implementation on this even only a few important lines ?
You are the only guy I know who can do this, sorry if I bother you and thanks in advance

It's interesting; they are treating diffusion models like GANs. They used a discriminator to train them.

The VAE is trained with a discriminator, which is how it is normally trained, not the diffusion model.
If you are interested, here is the training step for the VAE of sd1.x, which optimizes both the vae and the discriminator in a two-step manner:
https://github.com/pesser/stable-diffusion/blob/57eea7dfc2cdd8cadae77ab1c391f956d46f69bd/ldm/models/autoencoder.py#L351

I combined your code with diffusers stable diffusion and trained a model #10

I combined your code with diffusers stable diffusion and trained a model #10

Comments

lxj616 commented Dec 27, 2022

hxngiee commented Dec 27, 2022

lxj616 commented Dec 27, 2022

hxngiee commented Dec 27, 2022

lxj616 commented Dec 27, 2022

hxngiee commented Dec 27, 2022

lucidrains commented Dec 27, 2022 • edited Loading

lucidrains commented Dec 27, 2022

Samge0 commented Dec 29, 2022

chavinlo commented Dec 29, 2022

chavinlo commented Dec 29, 2022

lxj616 commented Dec 29, 2022

chavinlo commented Dec 29, 2022

chavinlo commented Dec 29, 2022

lxj616 commented Dec 30, 2022

chavinlo commented Dec 30, 2022

lxj616 commented Dec 30, 2022

chavinlo commented Jan 5, 2023

chavinlo commented Jan 5, 2023

lxj616 commented Jan 5, 2023

chavinlo commented Jan 13, 2023

lxj616 commented Jan 14, 2023

vukrosic commented Jan 14, 2023

chavinlo commented Jan 14, 2023 • edited Loading

chavinlo commented Jan 14, 2023 • edited Loading

lxj616 commented Jan 15, 2023

lxj616 commented Jan 15, 2023

vukrosic commented Jan 15, 2023 via email

chavinlo commented Jan 16, 2023

vukrosic commented Jan 16, 2023

chavinlo commented Feb 8, 2023

Ar0Kim commented Feb 9, 2023

lucidrains commented Feb 18, 2023

fhlt commented Feb 28, 2023

lxj616 commented Feb 28, 2023

tasinislam21 commented Mar 30, 2023

lxj616 commented Apr 19, 2023

chavinlo commented Apr 19, 2023

tasinislam21 commented Apr 19, 2023

lopho commented Apr 20, 2023

lucidrains commented Dec 27, 2022 •

edited

Loading

chavinlo commented Jan 14, 2023 •

edited

Loading

chavinlo commented Jan 14, 2023 •

edited

Loading