Image generation with deepspeed --fp16 #394
The VQGAN simply won't work in 16-bit precision unfortunately. Converting only the torch modules of dalle which aren't VQGAN, and then forcing autocasting to fp32 for the vqgan mitigates this issue and still gives a similar/the same speedup. It also fixes the issue where you couldn't actually decode when training in fp16 mode and had to wait until after to upconvert your checkpoint to 32 bit. edit: Okay here's a little more due diligence https://wandb.ai/afiaka87/vqgan_precision_fixes it's probably wise to test this on distributed as well. another edit: |
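The approach described in the comment above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual diff: `ToyVAE`, `ToyDALLE`, and `half_except` are hypothetical stand-ins, and the sketch assumes PyTorch 1.10 or newer so that `torch.autocast` is available. Everything except the VAE submodule is cast to fp16, and the decode path disables autocast and upcasts its input so the VAE always runs in fp32.

```python
import torch
from torch import nn


class ToyVAE(nn.Module):
    """Hypothetical stand-in for the VQGAN, which must stay in fp32."""

    def __init__(self):
        super().__init__()
        self.decoder = nn.Linear(8, 8)

    def forward(self, codes):
        return self.decoder(codes)


class ToyDALLE(nn.Module):
    """Hypothetical stand-in for the DALLE wrapper that owns the VAE."""

    def __init__(self):
        super().__init__()
        self.vae = ToyVAE()
        self.transformer = nn.Linear(8, 8)

    def decode(self, hidden):
        # Turn autocast off for this region and upcast the input, so the
        # VAE sees fp32 tensors even when the caller works in fp16.
        with torch.autocast(device_type=hidden.device.type, enabled=False):
            return self.vae(hidden.float())


def half_except(model, skip):
    """Cast every direct child of `model` to fp16 except the `skip` module."""
    for child in model.children():
        if child is not skip:
            child.half()
    return model


model = ToyDALLE()
half_except(model, skip=model.vae)
```

After this, `model.transformer` holds fp16 weights while `model.vae` keeps fp32 weights, which is also why VRAM usage is higher than with a fully fp16 model.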
Wow, amazing! Is that really enough to make it work? |
please test it! but i think so yes. |
@janEbert end of an era, eh? |
@lucidrains @rom1504 This has some stability issues surrounding the top_k function, I think. In native PyTorch, without DeepSpeed to auto-skip NaNs, training can break after a while. This was alleviated quite a bit by using both the It's probably good to disable by default using the @lucidrains Do you have plans for continued progress on the repository here? I mostly just wanted to push this up because it had been bothering me so much, but I'm curious if you still intend to create a NUWA repo? Perhaps a clean start? |
@afiaka87 hey! thanks for reporting on the stable and sandwich norm. that lines up with my experiences. could you point to the line of code for I think this repository is mostly feature complete, and has mostly fulfilled its objective given the successful release of ruDALL-E. what other features would you like to see? I could also add the scaled cosine similarity attention from SwinV2 for additional stability (https://arxiv.org/abs/2111.09883). That's the only remaining thing I can think of |
@afiaka87
maybe we need a specific version of something? |
ok, I confirm this code is working with torch 1.10. However, one drawback is that it increases VRAM usage (because the vqgan is now loaded as float32 instead of float16) |
https://wandb.ai/rom1504/laion_subset/reports/DALLE-dino--VmlldzoxMjg5OTgz here's the experiment with it, which is now nicely displaying samples |
Did you train long enough to see any NaN/Inf errors? I intend to disable it by default by using the context manager inside just the training loop, like so:
with autocast(enabled=args.amp and not using_deepspeed):
    loss = dalle(..)
# backprop
# zero gradients
# ... |
I started the training 5 min ago, so no, I don't know. What do you intend to disable? |
Sorry, this only affects non-distributed PyTorch. Are you using 16-bit precision with DeepSpeed? My current impl of mixed precision for PyTorch was enabled by default. Due to stability issues I've decided to make it optional. edit: @rom1504 for more context - I think DeepSpeed's automatic NaN skipping sidesteps the issue. |
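For readers unfamiliar with the NaN skipping mentioned above, here is a toy, framework-free sketch of the general idea behind dynamic loss scaling: when any gradient is non-finite, the optimizer step is skipped and the loss scale is reduced. `SkipOnOverflow` and `sgd_step` are hypothetical names for illustration only; this is not DeepSpeed's implementation, and the default values are assumptions.

```python
import math


class SkipOnOverflow:
    """Toy dynamic loss scaler (illustrative only, not DeepSpeed's code)."""

    def __init__(self, scale=2.0 ** 16, backoff=0.5, min_scale=1.0):
        self.scale = scale
        self.backoff = backoff
        self.min_scale = min_scale
        self.skipped = 0

    def should_step(self, grads):
        """Return True if all gradients are finite; otherwise back off."""
        if all(math.isfinite(g) for g in grads):
            return True
        # Overflow or NaN: shrink the scale and tell the caller to skip.
        self.scale = max(self.scale * self.backoff, self.min_scale)
        self.skipped += 1
        return False


def sgd_step(params, grads, lr, scaler):
    """Plain SGD on scaled gradients, skipping the update on NaN/Inf."""
    if not scaler.should_step(grads):
        return list(params)  # parameters stay unchanged for this step
    # Gradients came from a scaled loss, so unscale before applying them.
    return [p - lr * g / scaler.scale for p, g in zip(params, grads)]
```

A training loop without some equivalent guard applies the NaN gradients directly, which would explain why plain PyTorch runs can break where DeepSpeed runs keep going.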
oh shoot, i wasn't aware of this issue. do you want to see if 1.1.5 fixes this? https://github.com/lucidrains/DALLE-pytorch/releases/tag/1.1.5 |
yes |
@lucidrains Yes that stabilized the training thanks |
Hm - it looks like the dtype specifier isn't available on PyTorch LTS. Must be new. I don't know of another way to solve the issue (for DeepSpeed), however. It would be nice to provide a preprocessor that pre-encodes vqgan encodings to numpy files. @rom1504 didn't the training@home team put a bunch of LAION encoded via the gumbel vqgan on Hugging Face? |
Yes several people did that, but nobody packaged a vqgan inference script properly, it would be useful to do |
I think it's ok to depend on torch 1.10 ; just need to say it in the readme |
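If the dependency on torch 1.10 is only documented in the readme, a small runtime guard could make the failure mode clearer. The sketch below is one possible way to do that, assuming (as this thread suggests) that the autocast `dtype` argument first appears in 1.10; `version_tuple` and `supports_autocast_dtype` are hypothetical helper names, not part of the repository.

```python
import re


def version_tuple(version):
    """Parse the leading 'major.minor' of a version string like '1.10.0+cu113'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def supports_autocast_dtype(torch_version):
    """True when the given torch version is at least 1.10.

    Assumption from the thread: the autocast dtype specifier is missing
    on the 1.8.x LTS line and present from 1.10 onward.
    """
    return version_tuple(torch_version) >= (1, 10)
```

In practice this would be called as `supports_autocast_dtype(torch.__version__)`. Comparing integer tuples rather than raw strings matters here, because `"1.8" > "1.10"` when compared as text.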
I am not sure why exactly but this increases the vram use a lot in multi gpu mode |
Yes, I would be somewhat more comfortable with a hard requirement on Pytorch 1.10 if it didn't also mean a harsh decision of only CUDA 11.3 (unavailable for my operating system presently) or all the way back to CUDA 10.2. This is relevant for DeepSpeed support as many of their fused operations have very strange support for CUDA versioning that I haven't quite worked out (and seems to change with each change to their main branch). For instance, by choosing this scheme I can comfortably use 16-bit with stage 3 - but attempting to use the DeepSpeed "FusedAdam" on my CPU results in a failed compilation if I don't use Pytorch 1.8.2 with CUDA 11.1; but then of course I can't use the casting features and am stuck again with 32-bit inference. tl;dr - Forcing a Pytorch version is forcing a CUDA version which isn't always an option for a variety of setups. @rom1504 - With regard to your comment about VRAM usage increasing; that's not good! I guess it probably has to do with DeepSpeed anticipating 16-bit precision for external parameters - perhaps to the point of using algorithms which have tradeoffs for 32-bit. That sounds very challenging to actually debug though. At any rate, I think leaving this as an open PR is maybe a good way to inform people that it's possible and that there are some known issues. We could also close this PR and use a pinned issue? @lucidrains whichever you feel works best |
I'm considering a package similar to your clip-retrieval repo centered around "encode your raw dataset first, use e.g. |
Big congratulations on finally fixing this! Although if I understand correctly, doesn't this simply "disable" FP16 mode for the erroneous functions? :D Sorry I never found the time to work on the DeepSpeed stuff further. From what I learned about DeepSpeed, we would've had to rewrite the model classes a bit in order to make everything work (i.e. have the |