Before running the training code, install the following training dependencies.
# paddlepaddle-gpu>=2.6.0
python -m pip install paddlepaddle-gpu==2.6.0 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
pip install -r requirements.txt
Training the Large-DiT (DiT-LLaMA) model requires fused_ln. Install the custom op located in this directory with `python setup.py install`.
The extracted ImageNet feature dataset should be organized as follows:

├── data  # the output path we specified
    ├── fastdit_imagenet256
        ├── imagenet256_features
        ├── imagenet256_labels
We provide a download link:
wget https://bj.bcebos.com/v1/paddlenlp/datasets/paddlemix/fastdit_features/fastdit_imagenet256.tar
- For the feature extraction pipeline, please refer to fast-DiT.
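After downloading, extract the archive into `./data` so that it matches the layout above. Below is a minimal Python sketch of this step (equivalent to `tar -xf fastdit_imagenet256.tar -C ./data`), assuming the archive sits in the current working directory:

```python
# Extract the downloaded features into ./data (the archive path is an assumption).
import os
import tarfile

os.makedirs("data", exist_ok=True)
with tarfile.open("fastdit_imagenet256.tar") as archive:
    archive.extractall(path="data")
```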
Tips:
- With the default total batch_size=256, FP32 training uses 42 GB of GPU memory per card.
- With the default total batch_size=256, FP16 training uses 21 GB of GPU memory per card.
You can directly run `sh 0_run_train_dit_trainer.sh`, or run the following script:
TRAINING_MODEL_RESUME="None"
TRAINER_INSTANCES='127.0.0.1'
MASTER='127.0.0.1:8080'
TRAINERS_NUM=1 # nnodes, machine num
TRAINING_GPUS_PER_NODE=8 # nproc_per_node
DP_DEGREE=1 # dp_parallel_degree
MP_DEGREE=1 # tensor_parallel_degree
SHARDING_DEGREE=8 # sharding_parallel_degree
# real dp_parallel_degree = nnodes * nproc_per_node / tensor_parallel_degree / sharding_parallel_degree
# Please make sure: nnodes * nproc_per_node >= tensor_parallel_degree * sharding_parallel_degree
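# With the defaults in this script: 1 node * 8 GPUs / tensor_parallel_degree(1) / sharding_parallel_degree(8) = dp_parallel_degree 1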
config_file=config/DiT_XL_patch2.json
OUTPUT_DIR=./output_trainer/DiT_XL_patch2_trainer
feature_path=./data/fastdit_imagenet256
per_device_train_batch_size=32
gradient_accumulation_steps=1
num_workers=8
max_steps=7000000
logging_steps=20
save_steps=5000
image_logging_steps=-1
seed=0
max_grad_norm=-1
USE_AMP=True
FP16_OPT_LEVEL="O2"
enable_tensorboard=True
recompute=True
enable_xformers=True
transformer_engine_backend=False
use_fp8=False # This option takes effect only when transformer_engine_backend=True
TRAINING_PYTHON="python -m paddle.distributed.launch --master ${MASTER} --nnodes ${TRAINERS_NUM} --nproc_per_node ${TRAINING_GPUS_PER_NODE} --ips ${TRAINER_INSTANCES}"
${TRAINING_PYTHON} train_image_generation_trainer.py \
--do_train \
--feature_path ${feature_path} \
--output_dir ${OUTPUT_DIR} \
--per_device_train_batch_size ${per_device_train_batch_size} \
--gradient_accumulation_steps ${gradient_accumulation_steps} \
--learning_rate 1e-4 \
--weight_decay 0.0 \
--max_steps ${max_steps} \
--lr_scheduler_type "constant" \
--warmup_steps 0 \
--image_logging_steps ${image_logging_steps} \
--logging_dir ${OUTPUT_DIR}/tb_log \
--logging_steps ${logging_steps} \
--save_steps ${save_steps} \
--save_total_limit 50 \
--dataloader_num_workers ${num_workers} \
--vae_name_or_path stabilityai/sd-vae-ft-mse \
--config_file ${config_file} \
--num_inference_steps 25 \
--use_ema True \
--max_grad_norm ${max_grad_norm} \
--overwrite_output_dir True \
--disable_tqdm True \
--fp16_opt_level ${FP16_OPT_LEVEL} \
--seed ${seed} \
--recompute ${recompute} \
--enable_xformers_memory_efficient_attention ${enable_xformers} \
--bf16 ${USE_AMP} \
--dp_degree ${DP_DEGREE} \
--tensor_parallel_degree ${MP_DEGREE} \
--sharding_parallel_degree ${SHARDING_DEGREE} \
--sharding "stage1" \
--hybrid_parallel_topo_order "sharding_first" \
--amp_master_grad 1 \
--pipeline_parallel_degree 1 \
--sep_parallel_degree 1 \
--transformer_engine_backend ${transformer_engine_backend} \
--use_fp8 ${use_fp8}
Note that GPU memory usage is about 21 GB per card (per_device_train_batch_size 32 × 8 GPUs × gradient_accumulation_steps 1 gives the default total batch_size of 256).
You can directly run `sh 1_run_train_dit_notrainer.sh`, or run the following script:
config_file=config/DiT_XL_patch2.json
results_dir=./output_notrainer/DiT_XL_patch2_notrainer
feature_path=./data/fastdit_imagenet256
image_size=256
global_batch_size=256
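# with 8 GPUs this corresponds to 256 / 8 = 32 samples per card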
num_workers=8
epochs=1400
logging_steps=50
save_steps=5000
global_seed=0
python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" \
train_image_generation_notrainer.py \
--image_size ${image_size} \
--config_file ${config_file} \
--feature_path ${feature_path} \
--results_dir ${results_dir} \
--epochs ${epochs} \
--global_seed ${global_seed} \
--global_batch_size ${global_batch_size} \
--num_workers ${num_workers} \
--log_every ${logging_steps} \
    --ckpt_every ${save_steps}
Same as in 1.3.1; GPU memory usage differs little from the previous manual-parallelism approach.
You can directly run `sh 0_run_train_dit_trainer_auto.sh`, or run the following script:
FLAGS_enable_pir_api=true
TRAINING_MODEL_RESUME="None"
TRAINER_INSTANCES='127.0.0.1'
MASTER='127.0.0.1:8080'
TRAINERS_NUM=1 # nnodes, machine num
TRAINING_GPUS_PER_NODE=8 # nproc_per_node
DP_DEGREE=1 # data_parallel_degree
MP_DEGREE=1 # tensor_parallel_degree
PP_DEGREE=1 # pipeline_parallel_degree
SHARDING_DEGREE=8 # sharding_parallel_degree
# real dp_parallel_degree = nnodes * nproc_per_node / tensor_parallel_degree / sharding_parallel_degree
# Please make sure: nnodes * nproc_per_node >= tensor_parallel_degree * sharding_parallel_degree
config_file=config/DiT_XL_patch2.json
OUTPUT_DIR=./output_trainer/DiT_XL_patch2_auto_trainer
feature_path=./data/fastdit_imagenet256
per_device_train_batch_size=32
gradient_accumulation_steps=1
resolution=256
num_workers=8
max_steps=7000000
logging_steps=1
save_steps=5000
image_logging_steps=-1
seed=0
max_grad_norm=-1
USE_AMP=True
FP16_OPT_LEVEL="O2"
enable_tensorboard=True
recompute=True
enable_xformers=True
to_static=0 # whether to use dynamic-to-static training
TRAINING_PYTHON="python -m paddle.distributed.launch --master ${MASTER} --nnodes ${TRAINERS_NUM} --nproc_per_node ${TRAINING_GPUS_PER_NODE} --ips ${TRAINER_INSTANCES}"
${TRAINING_PYTHON} train_image_generation_trainer_auto.py \
--do_train \
--feature_path ${feature_path} \
--output_dir ${OUTPUT_DIR} \
--per_device_train_batch_size ${per_device_train_batch_size} \
--gradient_accumulation_steps ${gradient_accumulation_steps} \
--learning_rate 1e-4 \
--weight_decay 0.0 \
--resolution ${resolution} \
--max_steps ${max_steps} \
--lr_scheduler_type "constant" \
--warmup_steps 0 \
--image_logging_steps ${image_logging_steps} \
--logging_dir ${OUTPUT_DIR}/tb_log \
--logging_steps ${logging_steps} \
--save_steps ${save_steps} \
--save_total_limit 50 \
--dataloader_num_workers ${num_workers} \
--vae_name_or_path stabilityai/sd-vae-ft-mse \
--config_file ${config_file} \
--num_inference_steps 25 \
--use_ema True \
--max_grad_norm ${max_grad_norm} \
--overwrite_output_dir True \
--disable_tqdm True \
--fp16_opt_level ${FP16_OPT_LEVEL} \
--seed ${seed} \
--recompute ${recompute} \
--enable_xformers_memory_efficient_attention ${enable_xformers} \
--bf16 ${USE_AMP} \
--amp_master_grad 1 \
--dp_degree ${DP_DEGREE} \
--tensor_parallel_degree ${MP_DEGREE} \
--pipeline_parallel_degree ${PP_DEGREE} \
--sharding_parallel_degree ${SHARDING_DEGREE} \
--sharding "stage1" \
--sharding_parallel_config "enable_stage1_overlap enable_stage1_tensor_fusion" \
--hybrid_parallel_topo_order "sharding_first" \
--sep_parallel_degree 1 \
--enable_auto_parallel 1 \
--to_static $to_static
You can directly run `sh 4_run_train_largedit_3b_trainer_auto.sh`, or run the following script:
FLAGS_enable_pir_api=true
TRAINING_MODEL_RESUME="None"
TRAINER_INSTANCES='127.0.0.1'
MASTER='127.0.0.1:8080'
TRAINERS_NUM=1 # nnodes, machine num
TRAINING_GPUS_PER_NODE=8 # nproc_per_node
DP_DEGREE=1 # data_parallel_degree
MP_DEGREE=4 # tensor_parallel_degree
PP_DEGREE=1 # pipeline_parallel_degree
SHARDING_DEGREE=2 # sharding_parallel_degree
# real dp_parallel_degree = nnodes * nproc_per_node / tensor_parallel_degree / sharding_parallel_degree
# Please make sure: nnodes * nproc_per_node >= tensor_parallel_degree * sharding_parallel_degree
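# With the settings above: 1 node * 8 GPUs / tensor_parallel_degree(4) / sharding_parallel_degree(2) = dp_parallel_degree 1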
config_file=config/LargeDiT_3B_patch2.json
OUTPUT_DIR=./output_trainer/LargeDiT_3B_patch2_auto_trainer
feature_path=./data/fastdit_imagenet256
per_device_train_batch_size=32
gradient_accumulation_steps=1
resolution=256
num_workers=8
max_steps=7000000
logging_steps=1
save_steps=5000
image_logging_steps=-1
seed=0
max_grad_norm=2.0
USE_AMP=True
FP16_OPT_LEVEL="O2"
enable_tensorboard=True
recompute=True
enable_xformers=True
to_static=0 # whether to use dynamic-to-static training
TRAINING_PYTHON="python -m paddle.distributed.launch --master ${MASTER} --nnodes ${TRAINERS_NUM} --nproc_per_node ${TRAINING_GPUS_PER_NODE} --ips ${TRAINER_INSTANCES}"
${TRAINING_PYTHON} train_image_generation_trainer_auto.py \
--do_train \
--feature_path ${feature_path} \
--output_dir ${OUTPUT_DIR} \
--per_device_train_batch_size ${per_device_train_batch_size} \
--gradient_accumulation_steps ${gradient_accumulation_steps} \
--learning_rate 1e-4 \
--resolution ${resolution} \
--weight_decay 0.0 \
--max_steps ${max_steps} \
--lr_scheduler_type "constant" \
--warmup_steps 0 \
--image_logging_steps ${image_logging_steps} \
--logging_dir ${OUTPUT_DIR}/tb_log \
--logging_steps ${logging_steps} \
--save_steps ${save_steps} \
--save_total_limit 50 \
--dataloader_num_workers ${num_workers} \
--vae_name_or_path stabilityai/sd-vae-ft-mse \
--config_file ${config_file} \
--num_inference_steps 25 \
--use_ema True \
--max_grad_norm ${max_grad_norm} \
--overwrite_output_dir True \
--disable_tqdm True \
--fp16_opt_level ${FP16_OPT_LEVEL} \
--seed ${seed} \
--recompute ${recompute} \
--enable_xformers_memory_efficient_attention ${enable_xformers} \
--bf16 ${USE_AMP} \
--amp_master_grad 1 \
--dp_degree ${DP_DEGREE} \
--tensor_parallel_degree ${MP_DEGREE} \
--pipeline_parallel_degree ${PP_DEGREE} \
--sharding_parallel_degree ${SHARDING_DEGREE} \
--sharding "stage1" \
--sharding_parallel_config "enable_stage1_overlap enable_stage1_tensor_fusion" \
--hybrid_parallel_topo_order "sharding_first" \
--sep_parallel_degree 1 \
--enable_auto_parallel 1 \
--to_static $to_static
You can directly run `python infer_demo_dit.py`, `python infer_demo_sit.py`, `python infer_demo_largedit_3b.py`, or `python infer_demo_largedit_7b.py`. Once training is finished, the trained model weights are saved under `output_dir`. Note that DiT inference can use the `DiTPipeline` from ppdiffusers, while SiT inference does not yet support a generation Pipeline.
For DiT, `tools/convert_dit_to_ppdiffusers.py` can be used to generate the Pipeline used for inference:

python tools/convert_dit_to_ppdiffusers.py
The exported model directory structure is as follows:
├── DiT_XL_2_256  # the output path we specified
    ├── model_index.json
    ├── scheduler
    │   └── scheduler_config.json
    ├── transformer
    │   ├── config.json
    │   └── model_state.pdparams
    └── vae
        ├── config.json
        └── model_state.pdparams
Note that the generated model_index.json must contain an "id2label" dictionary mapping the 1000 ImageNet class ids to labels; if it is missing, manually copy the mapping from tools/ImageNet_id2label.json into it.
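Below is a minimal Python sketch of this manual step. It assumes the converted pipeline lives in `./DiT_XL_2_256` and that `tools/ImageNet_id2label.json` holds the 1000-class id-to-label dictionary (possibly nested under an `id2label` key); adjust the paths to your setup.

```python
# Sketch: copy the ImageNet id2label mapping into model_index.json.
# Paths and the JSON layout of ImageNet_id2label.json are assumptions.
import json

with open("tools/ImageNet_id2label.json", "r", encoding="utf-8") as f:
    id2label = json.load(f)
# Unwrap if the mapping is nested under an "id2label" key.
if isinstance(id2label, dict) and "id2label" in id2label:
    id2label = id2label["id2label"]

index_path = "./DiT_XL_2_256/model_index.json"
with open(index_path, "r", encoding="utf-8") as f:
    model_index = json.load(f)

model_index["id2label"] = id2label

with open(index_path, "w", encoding="utf-8") as f:
    json.dump(model_index, f, indent=2, ensure_ascii=False)
```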
After the Pipeline weights have been generated, we can run inference with the following code.
import paddle
from paddlenlp.trainer import set_seed
from ppdiffusers import DDIMScheduler, DiTPipeline
dtype = paddle.float32
pipe = DiTPipeline.from_pretrained("./DiT_XL_2_256", paddle_dtype=dtype)
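# swap the pipeline's scheduler for DDIM to enable fast 25-step sampling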
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
words = ["golden retriever"] # class_ids [207]
class_ids = pipe.get_label_ids(words)
set_seed(42)
generator = paddle.Generator().manual_seed(0)
image = pipe(class_labels=class_ids, num_inference_steps=25, generator=generator).images[0]
image.save("result_DiT_golden_retriever.png")
- Paddle Inference provides a high-performance inference implementation for the DiT model, improving inference performance by 80%+. Environment setup:
# install the develop version of paddle
python -m pip install --pre paddlepaddle-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/cu123/
# install triton and adapt it for paddle
python -m pip install triton
python -m pip install git+https://github.com/zhoutianzi666/UseTritonInPaddle.git
python -c "import use_triton_in_paddle; use_triton_in_paddle.make_triton_compatible_with_paddle()"
One-command inference:
python ppdiffusers/examples/inference/class_conditional_image_generation-dit.py --inference_optimize 1
- Performance measured on an NVIDIA A100-SXM4-40GB:
| Paddle Inference | TensorRT-LLM | Paddle dynamic graph |
| --- | --- | --- |
| 219 ms | 242 ms | 1200 ms |
@article{Peebles2022DiT,
title={Scalable Diffusion Models with Transformers},
author={William Peebles and Saining Xie},
year={2022},
journal={arXiv preprint arXiv:2212.09748},
}