Describe the bug
❕It does work without --train_text_encoder. It seems there might be a memory leak or another issue with training the text encoders using the current script / model.
❓Does it make sense that the model uses over 80 GB of VRAM? (A rough estimate is sketched below.)
❓Do you have any recommendations for decreasing VRAM usage, other than:
- 8-bit Adam
- Mixed precision (fp16)
- xformers (which doesn't work with SD3.5)
💡Idea:
After successfully training with the Kohya-ss scripts (Relevant Repo), I have deduced that the issue might be that the DreamBooth script here is not using 8-bit Adam properly; the flag is either being ignored or there is a bug in the implementation itself. The reason is that, with otherwise seemingly identical parameters in Kohya-ss, the single parameter that had a massive effect on VRAM and caused a surge was not using the Adam8Bit optimizer. (See the memory-breakdown sketch after the logs.)
Kohya-ss Parameters for reference 📝
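A rough back-of-the-envelope check on the 80 GB question, using approximate parameter counts that are not from the report (SD3.5 Large transformer ~8B, T5-XXL ~4.7B, CLIP-L ~0.12B, OpenCLIP bigG ~0.7B), suggests that training everything in fp32 plausibly exceeds 80 GiB even with 8-bit Adam:

```python
# Rough VRAM estimate for full fine-tuning; all parameter counts are approximate.
GIB = 1024 ** 3

param_counts = {
    "mmdit_transformer": 8.0e9,  # SD3.5 Large transformer (~8B, approximate)
    "t5_xxl": 4.7e9,             # T5-XXL text encoder (approximate)
    "clip_l": 0.12e9,            # CLIP-L text encoder (approximate)
    "clip_bigG": 0.7e9,          # OpenCLIP bigG text encoder (approximate)
}

BYTES_WEIGHTS = 4    # fp32 weights (the log below shows "Mixed precision type: no")
BYTES_GRADS = 4      # fp32 gradients for every trainable parameter
BYTES_ADAM8BIT = 2   # two uint8 state tensors per parameter with 8-bit Adam

total_params = sum(param_counts.values())
weights = total_params * BYTES_WEIGHTS
grads = total_params * BYTES_GRADS        # --train_text_encoder makes everything trainable
opt_states = total_params * BYTES_ADAM8BIT

print(f"weights:    {weights / GIB:6.1f} GiB")
print(f"gradients:  {grads / GIB:6.1f} GiB")
print(f"8-bit Adam: {opt_states / GIB:6.1f} GiB")
print(f"total:      {(weights + grads + opt_states) / GIB:6.1f} GiB (before activations)")
```

The exact figures depend on which modules the script keeps in fp32 versus fp16 and on activation overhead, but adding gradients and optimizer states for the ~5.5B text-encoder parameters on top of the ~8B transformer can plausibly push an 80 GiB card over the edge.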
Reproduction
We are running the following command in Jupyter Notebook (the full accelerate launch command appears in the CalledProcessError at the end of the logs):
Logs
2024-12-02 12:36:35.615846: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1733142995.629356 226993 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1733142995.633681 226993 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
12/02/2024 12:36:39 - INFO - __main__ - Distributed environment: DistributedType.NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: no
You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers
You are using a model of type clip_text_model to instantiate a model of type. This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type. This is not supported for all configurations of models and can yield errors.
You are using a model of type t5 to instantiate a model of type. This is not supported for all configurations of models and can yield errors.
{'base_shift', 'max_image_seq_len', 'max_shift', 'base_image_seq_len', 'invert_sigmas', 'use_dynamic_shifting'} was not found in config. Values will be initialized to default values.
Downloading shards: 100%|███████████████████████| 2/2 [00:00<00:00, 3450.68it/s]
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:03<00:00, 1.73s/it]
Fetching 2 files: 100%|█████████████████████████| 2/2 [00:00<00:00, 7476.48it/s]
{'dual_attention_layers'} was not found in config. Values will be initialized to default values.
12/02/2024 12:37:04 - INFO - __main__ - ***** Running training *****
12/02/2024 12:37:04 - INFO - __main__ - Num examples = 1
12/02/2024 12:37:04 - INFO - __main__ - Num batches each epoch = 1
12/02/2024 12:37:04 - INFO - __main__ - Num Epochs = 800
12/02/2024 12:37:04 - INFO - __main__ - Instantaneous batch size per device = 1
12/02/2024 12:37:04 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 2
12/02/2024 12:37:04 - INFO - __main__ - Gradient Accumulation steps = 2
12/02/2024 12:37:04 - INFO - __main__ - Total optimization steps = 800
Steps: 0%|| 0/800 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/azureuser/Picturethis/Dima/train_dreambooth_sd3.py", line 1811, in <module>
main(args)
File "/home/azureuser/Picturethis/Dima/train_dreambooth_sd3.py", line 1666, in main
optimizer.step()
File "/home/azureuser/mambaforge/envs/picturevenv/lib/python3.11/site-packages/accelerate/optimizer.py", line 171, in step
self.optimizer.step(closure)
File "/home/azureuser/mambaforge/envs/picturevenv/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 137, in wrapper
return func.__get__(opt, opt.__class__)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/azureuser/mambaforge/envs/picturevenv/lib/python3.11/site-packages/torch/optim/optimizer.py", line 487, in wrapper
out = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/azureuser/mambaforge/envs/picturevenv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/azureuser/mambaforge/envs/picturevenv/lib/python3.11/site-packages/bitsandbytes/optim/optimizer.py", line 288, in step
self.init_state(group, p, gindex, pindex)
File "/home/azureuser/mambaforge/envs/picturevenv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/azureuser/mambaforge/envs/picturevenv/lib/python3.11/site-packages/bitsandbytes/optim/optimizer.py", line 474, in init_state
state["state2"] = self.get_state_buffer(p, dtype=torch.uint8)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/azureuser/mambaforge/envs/picturevenv/lib/python3.11/site-packages/bitsandbytes/optim/optimizer.py", line 328, in get_state_buffer
return torch.zeros_like(p, dtype=dtype, device=p.device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 79.15 GiB of which 10.62 MiB is free. Process 68964 has 530.00 MiB memory in use. Including non-PyTorch memory, this process has 78.45 GiB memory in use. Of the allocated memory 75.60 GiB is allocated by PyTorch, and 2.35 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Steps: 0%|| 0/800 [00:02<?, ?it/s]
Traceback (most recent call last):
File "/home/azureuser/mambaforge/envs/picturevenv/bin/accelerate", line 8, insys.exit(main())
^^^^^^
File "/home/azureuser/mambaforge/envs/picturevenv/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/azureuser/mambaforge/envs/picturevenv/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1168, in launch_command
simple_launcher(args)
File "/home/azureuser/mambaforge/envs/picturevenv/lib/python3.11/site-packages/accelerate/commands/launch.py", line 763, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/azureuser/mambaforge/envs/picturevenv/bin/python3.11', 'train_dreambooth_sd3.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-3.5-large', '--output_dir=sd_outputs', '--instance_data_dir=ogo', '--instance_prompt=the face of ogo person', '--resolution=512', '--train_batch_size=1', '--gradient_accumulation_steps=2', '--gradient_checkpointing', '--checkpointing_steps=200', '--learning_rate=2e-6', '--text_encoder_lr=1e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--max_train_steps=800', '--seed=0', '--use_8bit_adam']' returned non-zero exit status 1.
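The traceback shows the OOM happens inside bitsandbytes' init_state, so the 8-bit optimizer is at least being constructed; what is less clear is what already occupies the ~75 GiB at that point. A minimal debugging sketch (a hypothetical helper, not part of train_dreambooth_sd3.py, assuming access to the script's modules and optimizer) that could be called right before optimizer.step(), and again after the first step on a configuration that survives it:

```python
import torch


def log_cuda_memory_breakdown(named_modules: dict[str, torch.nn.Module],
                              optimizer: torch.optim.Optimizer) -> None:
    """Print a rough breakdown of parameter, gradient and optimizer-state memory."""
    gib = 1024 ** 3
    for name, module in named_modules.items():
        params = sum(p.numel() * p.element_size() for p in module.parameters())
        grads = sum(p.grad.numel() * p.grad.element_size()
                    for p in module.parameters() if p.grad is not None)
        print(f"{name}: params {params / gib:.2f} GiB, grads {grads / gib:.2f} GiB")
    # Optimizer state is allocated lazily, so this is only non-zero after the first step().
    opt_bytes = sum(t.numel() * t.element_size()
                    for state in optimizer.state.values()
                    for t in state.values() if torch.is_tensor(t))
    print(f"optimizer state ({type(optimizer).__name__}): {opt_bytes / gib:.2f} GiB")
    print(f"torch allocated: {torch.cuda.memory_allocated() / gib:.2f} GiB, "
          f"reserved: {torch.cuda.memory_reserved() / gib:.2f} GiB")
```

If the 8-bit path were silently skipped, the optimizer state would come out at roughly 8 bytes per trainable parameter instead of about 2, which would quickly confirm or rule out the idea above. Note also that the error message reports only ~2.35 GiB reserved-but-unallocated, so the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True suggestion is unlikely to help much here; the 75.6 GiB allocated by PyTorch is real usage.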
System Info
System 🖥️
A100 Azure Remote Server.
Running the code from Jupyter Notebook.
Libraries 📚
Who can help?
No response
kohya-ss has a separate option to enable training t5xxl, whereas --train_text_encoder in train_dreambooth_sd3.py enables training for all text encoders; this may account for the difference in usage if the other parameters are the same. We could consider a similar option to enable t5xxl training separately from CLIP.
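A minimal sketch of what such a split could look like; the train_text_encoder_t5 flag is hypothetical and the function signature is an assumption rather than the script's actual structure:

```python
import bitsandbytes as bnb
import torch


def build_optimizer(transformer: torch.nn.Module,
                    clip_encoders: list[torch.nn.Module],
                    t5_encoder: torch.nn.Module,
                    train_text_encoder: bool,
                    train_text_encoder_t5: bool,  # hypothetical new flag
                    lr: float) -> bnb.optim.AdamW8bit:
    """Freeze T5-XXL unless explicitly requested, so it adds no gradient or optimizer-state memory."""
    for enc in clip_encoders:
        enc.requires_grad_(train_text_encoder)
    t5_encoder.requires_grad_(train_text_encoder and train_text_encoder_t5)

    # Only trainable parameters reach the optimizer, so frozen modules
    # contribute neither gradients nor 8-bit Adam state buffers.
    params = [p
              for module in (transformer, *clip_encoders, t5_encoder)
              for p in module.parameters()
              if p.requires_grad]
    return bnb.optim.AdamW8bit(params, lr=lr)
```

Freezing T5-XXL (~4.7B parameters) removes its gradients and optimizer states from the budget, on the order of 20 to 30 GiB in this fp32 configuration, while the CLIP encoders can still be fine-tuned.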