I combined your code with diffusers stable diffusion and trained a model #10

lxj616 opened this issue Dec 27, 2022 · 39 comments

lxj616 commented Dec 27, 2022

[attached GIF: gif_small]

https://github.com/lxj616/make-a-stable-diffusion-video

I used your Pseudo3DConv and pseudo-3D attention here:

https://github.com/lxj616/make-a-stable-diffusion-video/blob/main/src/diffusers/models/resnet_pseudo3d.py#L8
https://github.com/lxj616/make-a-stable-diffusion-video/blob/main/src/diffusers/models/attention_pseudo3d.py#L432

Thank you for open-sourcing the Pseudo3D code; it seems to be working.

hxngiee commented Dec 27, 2022

Hi @lxj616,
Thanks for your interesting Make A Stable Diffusion Video.
I wonder how the pretrained toy model was trained. Could you explain a little bit more about how you trained it?

lxj616 commented Dec 27, 2022

@hxngiee I trained the model using examples/research_projects/dreambooth_inpaint/train_dreambooth_inpaint.py from thedarkzeno or patil-suraj

Because we are doing video, load the dataset as (b, c, f, h, w) instead of (b, c, h, w); everything else is taken care of by the original script. For fp16/accelerate/8-bit Adam, please see the README of the dreambooth subfolder; those options mostly work out of the box.
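
A minimal sketch of that shape change, assuming each dataset example already yields a frames tensor (the pixel_values field name is hypothetical, not taken from the actual script):

```python
import torch

def collate_fn(examples):
    # For images the stacked batch would be (b, c, h, w); for video each
    # example is assumed to be (c, f, h, w), so the batch becomes (b, c, f, h, w).
    videos = torch.stack([example["pixel_values"] for example in examples])
    return {"pixel_values": videos}
```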

If you need more explanations, I could also share my train_dreambooth.py, but I wrote very messy code and didn't even rename the file LOL; there are lots of hardcoded hacky tricks. I guess you'll end up rewriting the original train_dreambooth_inpaint.py, and that would be faster than debugging mine.

hxngiee commented Dec 27, 2022

@lxj616 Cool.
Is this a fine-tuning approach that adapts a model trained on images to video, or is a video model trained on video data from the start?
Actually, I haven't read the DreamBooth paper closely.

lxj616 commented Dec 27, 2022

@hxngiee It's the text2image model with new temporal layers; the text2image model is Stable Diffusion, and the new layers need to be trained similarly to the dreambooth example (since you asked how to train a model in diffusers).

This is not finetuning the text2image model: the backbone is frozen and only the new layers are trained.
The new layers deal with video only; image generation is not optimized.
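
A minimal sketch of that setup, assuming unet is the pseudo-3D UNet and that the new layers can be picked out by name (the "temporal" substring filter is an assumption for illustration, not the actual module names):

```python
import torch

def freeze_backbone_train_temporal(unet: torch.nn.Module, lr: float = 1e-5):
    # Freeze the pretrained text2image backbone entirely ...
    unet.requires_grad_(False)
    # ... then unfreeze only the newly added temporal layers and optimize them.
    temporal_params = []
    for name, param in unet.named_parameters():
        if "temporal" in name:  # assumed naming of the pseudo-3D conv/attention layers
            param.requires_grad = True
            temporal_params.append(param)
    return torch.optim.AdamW(temporal_params, lr=lr)
```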

You may wish to read train_dreambooth_inpaint.py to understand how to train this video model, but don't get the ideas mixed up: we are talking about Make-A-Video, not DreamBooth.

hxngiee commented Dec 27, 2022

@lxj616 Thank you for your reply. I understand what you did: to make a video, you add temporal consistency layers and train them similarly to DreamBooth.

So Pseudo3DConv and pseudo-3D attention were effective for training a video diffusion model.

Thanks for sharing your findings; I will look at the code closely!

lucidrains (Owner) commented Dec 27, 2022

@lxj616 nice! yea, i still need to complete https://github.com/lucidrains/classifier-free-guidance-pytorch , and then integrate this all into dalle2-pytorch

should be all complete in early january. if you cannot wait, try using this

@lucidrains (Owner)

@lxj616 cat is looking quite majestic 🤣 let's make it move in 2023

Samge0 commented Dec 29, 2022

nice~

@chavinlo

Amazing job.

@chavinlo

If you need more explanations, I could also share my train_dreambooth.py

Would it be possible to do so?

lxj616 commented Dec 29, 2022

@chavinlo I dropped my messy script at https://gist.github.com/lxj616/5134368f44aca837304530695ee100ea

But I bet it would be quicker to modify the original train_dreambooth.py from diffusers than to debug mine; I barely got it running in my specific environment, and there's a 99.9% chance it won't run on your system LOL

@chavinlo

@chavinlo I dropped my messy script at https://gist.github.com/lxj616/5134368f44aca837304530695ee100ea

But I bet it would be quicker to modify the original train_dreambooth.py from diffusers than to debug mine; I barely got it running in my specific environment, and there's a 99.9% chance it won't run on your system LOL

Thanks. Would it also be possible to release the webdataset-making code?
And you just used the CLIPTextModel encoder (from the text_encoder folder) to create the txt_embed in the npz, right?
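
For reference, precomputing a prompt embedding with the checkpoint's CLIPTextModel would look roughly like this; whether it matches lxj616's actual preprocessing is exactly what the question above asks, and the checkpoint name and txt_embed key are assumptions:

```python
import numpy as np
import torch
from transformers import CLIPTextModel, CLIPTokenizer

model_path = "runwayml/stable-diffusion-v1-5"  # assumed base checkpoint
tokenizer = CLIPTokenizer.from_pretrained(model_path, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_path, subfolder="text_encoder")

@torch.no_grad()
def embed_prompt(prompt: str) -> np.ndarray:
    tokens = tokenizer(prompt, padding="max_length",
                       max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors="pt")
    # last_hidden_state, shape (1, 77, 768) for SD v1 text encoders
    return text_encoder(tokens.input_ids)[0].numpy()

# e.g. np.savez("video_0001.npz", txt_embed=embed_prompt("landscape cloudscape photo"), ...)
```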

@chavinlo

I've read your blog post about VRAM limitations. If you need more compute, I can give you an A100 to experiment with.

lxj616 commented Dec 30, 2022

@chavinlo Thanks for asking, but 24GB is enough for testing if I pre-compute the embeddings and save them into a webdataset. Since you have an A100 (perhaps 40GB of VRAM), you don't need my webdataset-making code; you can just load a video and VAE-encode it on the fly (which is much easier to use). My webdataset creation was actually done in the Python interactive shell, and I didn't keep a script because I thought it was a one-time thing per dataset; I may need to log everything down on my next attempt ...
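
A rough sketch of that on-the-fly encoding, assuming an SD v1-style VAE and frames already scaled to [-1, 1] with shape (f, c, h, w); the checkpoint name is an assumption:

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")
vae.requires_grad_(False)

@torch.no_grad()
def encode_clip(frames: torch.Tensor) -> torch.Tensor:
    # frames: (f, c, h, w) in [-1, 1] -> latents: (1, c_latent, f, h/8, w/8)
    latents = vae.encode(frames).latent_dist.sample() * 0.18215  # SD v1 scaling factor
    return latents.permute(1, 0, 2, 3).unsqueeze(0)
```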

@chavinlo

@lxj616 Thanks. One more question: in the preprocess function, do you treat the npz as if it contained all the videos? It iterates through it, adds all the frames of npz_i into f8_list, and keeps going until there are no more npz_i left. Finally, does example['f8'] contain all the videos' frames, or just a single video's frames?

lxj616 commented Dec 30, 2022

@chavinlo One npz contains all the video frames of one single video; the loop is dealing with a batch, and the final example['f8'] is a batch of video frames with shape (b, c, f, h, w), where f is the number of frames.
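
As a sketch, stacking such per-video npz files into a batch might look like the following; the f8 key comes from the discussion above, while the stored (f, c, h, w) layout is an assumption:

```python
import numpy as np
import torch

def load_batch(npz_paths):
    videos = []
    for path in npz_paths:              # one npz per video in the batch
        frames = np.load(path)["f8"]    # assumed shape (f, c, h, w)
        videos.append(torch.from_numpy(frames).permute(1, 0, 2, 3))  # -> (c, f, h, w)
    return {"f8": torch.stack(videos)}  # -> (b, c, f, h, w)
```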

chavinlo commented Jan 5, 2023

@lxj616 Hello again, I got training working with batch size 1 and 25 frames, although I had to convert the model to bfloat16 because I got OOM with fp32 (80GB+) and loss=nan with fp16. I see that you mentioned using fp16 and 8-bit Adam. How did you manage to use them? I can't use 8-bit with my current setup because it won't work with bf16.

chavinlo commented Jan 5, 2023

Also, bf16 uses 44GB, but when using grad checkpointing, it decreases to 11GB
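
For reference, that combination is usually just a couple of lines in a diffusers-style script; whether the pseudo-3D UNet keeps the stock enable_gradient_checkpointing method is an assumption:

```python
import torch

def enable_memory_savers(unet) -> None:
    unet.to(dtype=torch.bfloat16)         # bf16 weights, as described above
    unet.enable_gradient_checkpointing()  # diffusers ModelMixin API: recompute
                                          # activations in backward to save memory
```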

lxj616 commented Jan 5, 2023

@chavinlo Hmmm... I never ran into this problem. I just used the original code, chose fp16 when running accelerate config, and everything seemed to work. There were some warnings like [INFO] [stage_1_and_2.py:1769:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0, but I don't know what they mean. BTW I didn't change anything for 8-bit Adam; maybe it doesn't actually work and I just didn't notice ...
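
For context, the dreambooth scripts wire those two options up roughly as below (params_to_optimize stands in for the trainable temporal parameters). The quoted deepspeed OVERFLOW message is just the dynamic fp16 loss scaler skipping a step and lowering the scale, which is harmless as long as it does not happen on every step.

```python
import bitsandbytes as bnb
from accelerate import Accelerator

# Mixed precision comes from accelerate (what `accelerate config` sets up when
# you choose fp16); 8-bit Adam comes from bitsandbytes, as in the dreambooth scripts.
accelerator = Accelerator(mixed_precision="fp16", gradient_accumulation_steps=1)
optimizer = bnb.optim.AdamW8bit(params_to_optimize, lr=1e-5)
# unet, optimizer, dataloader = accelerator.prepare(unet, optimizer, dataloader)
```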

@chavinlo

Training with 240 frames... it's really big. 74GB of VRAM usage WITH gradient checkpointing and bf16
[screenshot]

lxj616 commented Jan 14, 2023

Training with 240 frames... it's really big. 74GB of VRAM usage WITH gradient checkpointing and bf16

At 13.73 s/it, if you train on WebVid-10M you would need 4.35 years to finish one epoch, and about 87 years for 20 epochs; VRAM is not actually the problem, I guess ...
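
Back-of-the-envelope, assuming roughly 10M clips and one clip per iteration:

```python
seconds_per_step = 13.73
steps_per_epoch = 10_000_000  # ~one WebVid-10M clip per step at batch size 1
years_per_epoch = seconds_per_step * steps_per_epoch / (3600 * 24 * 365)
print(f"{years_per_epoch:.2f} years/epoch, {20 * years_per_epoch:.0f} years for 20 epochs")
# -> 4.35 years/epoch, 87 years for 20 epochs
```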

@vukrosic

@lxj616 @lucidrains @Samge0 @hxngiee @chavinlo Hello, I'm starting a startup using lxj616's Make-a-stable-diffusion-video repository as one of the models for the text2video product, similar to what MidJourney does with text2image.

Our long term goal is to allow anybody to create a Hollywood movie in 1 hour. If it succeeds, it could be one of the biggest companies in the world.

If any of you are interested in becoming a cofounder for an equal split of the company, I've explained our short and long term plans at https://youtu.be/lbhUB1GyYZE

chavinlo commented Jan 14, 2023

Training with 240 frames... it's really big. 74GB of VRAM usage WITH gradient checkpointing and bf16

At 13.73 s/it, if you train on WebVid-10M you would need 4.35 years to finish one epoch, and about 87 years for 20 epochs; VRAM is not actually the problem, I guess ...

This run broke the model somehow...
It has some weird glitch effect, I will upload the gif tomorrow

Now I am trying to train it again on a dance video dataset from TikTok, but I'm not sure if it's going to work; at 6000 steps the humans maintain most of their body shape but don't move.

Would it be possible for you to join my Discord server or hit me up so we can discuss further? I plan on extending training to a bigger and more general dataset.

https://discord.gg/Fy66AEwC

chavinlo commented Jan 14, 2023

[attached GIF: gif_v26400_0]
This is after 20,000 steps. No improvement.
Maybe you could post the script, or explain how you initialized the extra layers? I saw that you used the inpainting model (9 channels).

lxj616 commented Jan 15, 2023

[attached GIF: gif_v26400_0] This is after 20,000 steps. No improvement. Maybe you could post the script, or explain how you initialized the extra layers? I saw that you used the inpainting model (9 channels).

There are a lot of things that could have gone wrong; I don't know exactly what led to this output:

  1. If you train at 240 frames, you should always run inference at 240 frames, unless your video is an action loop that cycles again and again.
  2. Real people are extremely hard for image/video generation; try landscape/cloudscape, which is what stable-diffusion does best (if stable diffusion itself cannot generate good enough images of people, how could the video model?).
  3. I don't know how large your dataset is; did you try training on one single video to see whether the model/code goes wrong? In my case, I verified that my model could fully memorize one single video (12 frames) in less than 200 steps before my toy model experiment. If you worry about the correctness of the code, try one video first, then go for the stars.

I'm active on the LAION Discord server. I tried to join your Discord link, but it leads to a Japanese anime server with lots of people chatting about all sorts of things, so I left the server LOL

lxj616 commented Jan 15, 2023

@lxj616 @lucidrains @Samge0 @hxngiee @chavinlo Hello, I'm starting a startup using lxj616's Make-a-stable-diffusion-video repository as one of the models for the text2video product, similar to what MidJourney does with text2image.

Our long term goal is to allow anybody to create a Hollywood movie in 1 hour. If it succeeds, it could be one of the biggest companies in the world.

If any of you are interested in becoming a cofounder for an equal split of the company, I've explained our short and long term plans at https://youtu.be/lbhUB1GyYZE

Hello to you too. I don't know how to reply, because there are many things you might wish to dig into and learn before boldly going on a long adventure. I saw your comment 14 days ago asking what pretrained_model_name_or_path to use, and honestly I can't answer that in simple words either, for it's not as simple as you might think. You are welcome to ask, but please understand that we can't reply every time if we don't know how to respond properly, like this time, and maybe last time.

vukrosic commented Jan 15, 2023 via email

@chavinlo

Thank you for the reply. All logic leads me to the fact that I should learn it as well. Honestly it makes me a bit mad that I will need 6 months of every day to learn all of this, but it is what it is. I made the CNN MNIST digit recognition, so that's something.

Not really, though.
I tinkered with basic CNNs and YOLOs back in early '21, then completely forgot about it and came back in August '22 when SD got leaked. I don't know exactly how everything works, but I do know enough to do what I want, and I think you can learn what's necessary in under a month.

@vukrosic

Oh, if I'm able to start making something in a month, that would be very interesting.

chavinlo commented Feb 8, 2023

It's been a while since I posted here. Since my last response, and after much trial and error (and tremendous help from lopho), I got it trained on a small dance dataset.
[attached GIFs: output_7, output_2]

It has some problems due to how the dataloader was written.
I am also running another test, but this time with the backbone frozen just like OP's and with the dataloader fixed.

Ar0Kim commented Feb 9, 2023

It's been a while since I posted here. Since my last response, and after much trial and error (and tremendous help from lopho), I got it trained on a small dance dataset. [attached GIFs: output_7, output_2]

It has some problems due to how the dataloader was written. I am also running another test, but this time with the backbone frozen just like OP's and with the dataloader fixed.

Thank you for sharing your video. I want to make a video like yours, but I'm a newbie here and don't know how to start. It seems like I have to train this model, right? Could you explain it in detail, please? I'd be very grateful.

@lucidrains (Owner)

It's been a while since I posted here. Since my last response, and after much trial and error (and tremendous help from lopho), I got it trained on a small dance dataset. [attached GIFs: output_7, output_2]

It has some problems due to how the dataloader was written. I am also running another test, but this time with the backbone frozen just like OP's and with the dataloader fixed.

omg, this looks great! congratulations on the training!

don't be surprised if your inbox gets flooded with video / AI founders 😆

fhlt commented Feb 28, 2023

@lxj616 Can you share the prompt used for training timelapse?

lxj616 commented Feb 28, 2023

@lxj616 Can you share the prompt used for training timelapse?

landscape cloudscape photo (for landscape videos)
cityscape cloudscape photo (for city videos)

@tasinislam21

It's been a while since I posted here. Since my last response, and after much trial and error (and tremendous help from lopho), I got it trained on a small dance dataset. [attached GIFs: output_7, output_2]

It has some problems due to how the dataloader was written. I am also running another test, but this time with the backbone frozen just like OP's and with the dataloader fixed.

Would you be able to share your training code?

lxj616 commented Apr 19, 2023

@lucidrains

Nvidia just recently published https://arxiv.org/pdf/2304.08818.pdf

And in "3.1.1 Temporal Autoencoder Finetuning" they claimed to "finetune vae decoder on video data with a (patch-wise) temporal discriminator built from 3D convolutions"

This could reduce flickering artifacts as they claim
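
A minimal sketch of what such a patch-wise 3D-convolutional discriminator could look like (an illustration based on that one-line description, not Nvidia's actual architecture):

```python
import torch
import torch.nn as nn

class TemporalPatchDiscriminator(nn.Module):
    """PatchGAN-style discriminator over (b, c, f, h, w) clips using 3D convs."""
    def __init__(self, in_channels: int = 3, base: int = 64):
        super().__init__()
        def block(cin, cout, stride):
            return nn.Sequential(
                nn.Conv3d(cin, cout, kernel_size=3, stride=stride, padding=1),
                nn.GroupNorm(8, cout),
                nn.SiLU(),
            )
        self.net = nn.Sequential(
            block(in_channels, base, (1, 2, 2)),
            block(base, base * 2, (2, 2, 2)),
            block(base * 2, base * 4, (2, 2, 2)),
            nn.Conv3d(base * 4, 1, kernel_size=3, padding=1),  # per-patch real/fake logits
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (b, c, f, h, w) -> patch logit map (b, 1, f', h', w')
        return self.net(video)
```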

Since you are the most awesome AI expert in the open-source community, could you make an open-source demo implementation of this, even if it's only a few important lines?

You're the only person I know who can do this; sorry to bother you, and thanks in advance.

@chavinlo

@lucidrains

Nvidia just recently published https://arxiv.org/pdf/2304.08818.pdf

+1. They also mentioned using both a larger parameter count and a temporal super-resolution model based on the Stable Diffusion 2.0 super-resolution model.

For the latter, I think they used SDXL

@tasinislam21

"finetune vae decoder on video data with a (patch-wise) temporal discriminator built from 3D convolutions"

This could reduce flickering artifacts as they claim

It's interesting; they are treating diffusion models like GANs. They used a discriminator to train them.

lopho (Contributor) commented Apr 20, 2023

"finetune vae decoder on video data with a (patch-wise) temporal discriminator built from 3D convolutions"
This could reduce flickering artifacts as they claim

It's interesting; they are treating diffusion models like GANs. They used a discriminator to train them.

The VAE is trained with a discriminator, which is how it is normally trained; it is not the diffusion model that gets a discriminator.
If you are interested, here is the training step for the VAE of SD 1.x, which optimizes both the VAE and the discriminator in a two-step manner:
https://github.com/pesser/stable-diffusion/blob/57eea7dfc2cdd8cadae77ab1c391f956d46f69bd/ldm/models/autoencoder.py#L351
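
Schematically, that two-step update looks like the sketch below; vae, discriminator, and the losses are simplified stand-ins, not the actual ldm code (which also uses a perceptual loss, an adaptive generator weight, and a discriminator warm-up):

```python
import torch
import torch.nn.functional as F

def train_step(vae, discriminator, opt_vae, opt_disc, clip):
    # clip: a batch of videos (b, c, f, h, w), or images (b, c, h, w) in the original
    recon = vae(clip)

    # Step 0: update the autoencoder (reconstruction + fool the discriminator)
    g_loss = F.mse_loss(recon, clip) - discriminator(recon).mean()
    opt_vae.zero_grad()
    g_loss.backward()
    opt_vae.step()

    # Step 1: update the discriminator (hinge loss on real vs. reconstructed)
    logits_real = discriminator(clip)
    logits_fake = discriminator(recon.detach())
    d_loss = F.relu(1.0 - logits_real).mean() + F.relu(1.0 + logits_fake).mean()
    opt_disc.zero_grad()
    d_loss.backward()
    opt_disc.step()
    return g_loss.item(), d_loss.item()
```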
