The result looks distorted despite my effort to ensure the inputs to the unet match the diffusers reference code. I am still investigating the cause.
Adapted from https://github.com/harubaru/waifu-diffusion/blob/main/trainer/diffusers_trainer.py, with the following additions:
- SDXL support
- FP32 for VAE
- Designed for booru tags
Expects a dataset of images with matching caption files ending in .txt,
e.g. danbooru2021/0000/1000.jpg and danbooru2021/0000/1000.txt.
The content of the .txt file looks something like:
bad aesthetic,gen:panties,gen:oekaki,char:amano_misao_(battle_programmer_shirase),art:haganemaru_kennosuke,meta:lowres,gen:open_mouth,gen:panty_pull,gen:white_panties,gen:school_uniform,gen:1girl,copy:battle_programmer_shirase,gen:underwear,gen:blush,gen:jaggy_line,gen:long_hair,gen:solo
i.e. comma-separated tags in <category>:<tag> form; the category can be shortened to save space.
Aesthetic tags (e.g. "bad aesthetic") are taken from waifu diffusion.
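The dataset layout and caption format above can be handled with a small helper. A minimal sketch, assuming stdlib only; the function names and the category alias table are illustrative, not taken from the actual trainer:

```python
from pathlib import Path

# Hypothetical alias table: maps shortened category prefixes back to
# full booru category names. The real script may use different names.
CATEGORY_ALIASES = {
    "gen": "general",
    "char": "character",
    "copy": "copyright",
    "art": "artist",
    "meta": "meta",
}

def parse_caption(text):
    """Split a comma-separated caption into (category, tag) pairs.

    Tags without a ':' prefix (e.g. 'bad aesthetic') get category None.
    """
    pairs = []
    for raw in text.split(","):
        raw = raw.strip()
        if not raw:
            continue
        if ":" in raw:
            cat, tag = raw.split(":", 1)
            pairs.append((CATEGORY_ALIASES.get(cat, cat), tag))
        else:
            pairs.append((None, raw))
    return pairs

def iter_image_caption_pairs(root):
    """Yield (image_path, caption_text) for every .jpg with a sidecar .txt."""
    for img in Path(root).rglob("*.jpg"):
        txt = img.with_suffix(".txt")
        if txt.exists():
            yield img, txt.read_text(encoding="utf-8").strip()
```

For the example caption above, `parse_caption` would return pairs like `("general", "1girl")` and `("character", "amano_misao_(battle_programmer_shirase)")`.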
44GB of VRAM is used for a batch size of 2 at bucket resolution 896x896.