VRAM info #4

Open
C00reNUT opened this issue Nov 22, 2024 · 19 comments

@C00reNUT

A small passage about VRAM requirements would be nice :)

@nitinmukesh

Yeah. I also want to know how much VRAM is required for inference.

@i-amgeek

Same question. It would be good to know the VRAM usage for various dimensions.

@ivanstepanovftw

8 GiB is not enough 😿

@x4080

x4080 commented Nov 24, 2024

Even 16 GB is not enough.

@DsnTgr

DsnTgr commented Nov 24, 2024

Even 24 GB is not enough.

@joseph16388

We need an 8-bit version.

@WangRongsheng

For reference:

[image]

@x4080

x4080 commented Nov 24, 2024

Does it need at least 32 GB? Quantization, anyone?

@KT313

KT313 commented Nov 25, 2024

I modified the inference script so it runs with a maximum of 15264 MiB of VRAM (according to nvtop; inference done at 512x768 with 100 frames). You may need to close anything else that uses VRAM if you're using a 16 GiB GPU, but it should work.

I put the modified files here: https://github.com/KT313/LTX_Video_better_vram

It should work if you just drag and drop the files into your LTX-Video folder.

It works by offloading everything that is not currently needed in VRAM to CPU memory during each of the inference steps.
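
For anyone curious, a minimal sketch of that offloading pattern (the helper below is hypothetical, not the actual code from the linked repo):

    import torch

    def run_offloaded(module: torch.nn.Module, *inputs: torch.Tensor) -> torch.Tensor:
        """Move a module to the GPU only for its forward pass, then back to CPU."""
        module.to("cuda")
        with torch.no_grad():
            out = module(*(x.to("cuda") for x in inputs))
        module.to("cpu")
        torch.cuda.empty_cache()  # release the cached blocks the module just used
        return out

Only the module that is currently running occupies VRAM; everything else waits in CPU memory, at the cost of extra host-to-device copies on every step.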

@x4080

x4080 commented Nov 25, 2024

@KT313 Cool, I'll try your solution.
Edit: It works. Will it need more VRAM if more frames are generated?
Edit 2: It only works the first time, and then it shows this error:

ValueError: Cannot generate a cpu tensor from a generator of type cuda.

Edit 3: Now it works again when using a suggested resolution (previously I was testing at 384x672; it works at 512x768 with 30 frames, and I repeated it). I don't know why the error above happened, though.

Edit 4: The error above appears again when using 60 frames, so maybe it is an OOM error.

@KT313

KT313 commented Nov 26, 2024

@x4080
I made some modifications here so the tensors should get generated on the generator's device (cuda): https://github.com/KT313/LTX_Video_better_vram/tree/test
I cannot test it at the moment, so let me know if that works better.

And regarding your first edit: yes, since the size of the latent tensor (which basically contains the video) depends on the resolution (height x width x frames, plus a bit extra from padding), increasing the frame count makes the tensor larger, which needs more VRAM. But compared to the VRAM needed for the UNet model, the tensor itself is quite small, so you should be able to increase the frame count a bit without issues.
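
To put rough numbers on that, here is a back-of-the-envelope estimate of the latent tensor size; the compression factors and channel count below are assumptions for illustration, not values taken from the repo:

    # Assumed VAE compression: 32x spatial, 8x temporal, 128 latent channels,
    # stored in bfloat16 (2 bytes per element). Illustrative values only.
    height, width, frames, channels = 512, 768, 100, 128
    elements = channels * (height // 32) * (width // 32) * (frames // 8 + 1)
    print(f"latent ~ {elements * 2 / 2**20:.1f} MiB")  # about 1.2 MiB, tiny next to the model weights

Under these assumptions, even doubling the frame count only adds on the order of a megabyte to the latent itself; the model weights (and activations) dominate VRAM use.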

@MarcosRodrigoT

MarcosRodrigoT commented Nov 26, 2024

> @x4080 I made some modifications here so the tensors should get generated on the generator's device (cuda): https://github.com/KT313/LTX_Video_better_vram/tree/test I cannot test it at the moment, so let me know if that works better.
>
> And regarding your first edit: yes, since the size of the latent tensor (which basically contains the video) depends on the resolution (height x width x frames, plus a bit extra from padding), increasing the frame count makes the tensor larger, which needs more VRAM. But compared to the VRAM needed for the UNet model, the tensor itself is quite small, so you should be able to increase the frame count a bit without issues.

First of all, thank you for implementing this so that it takes less VRAM. I have tried it out a couple of times (with a resolution of 704x480 and 257 frames) and it works like a charm, using only around 16 GB on a 4090 GPU. However, it randomly throws an error related to "cpu" and "cuda" tensors. Re-running the script usually works, so it is not a big deal.

This was the error:

Traceback (most recent call last):
  File "/home/mrt/Projects/LTX-Video/inference.py", line 452, in <module>
    main()
  File "/home/mrt/Projects/LTX-Video/inference.py", line 356, in main
    images = pipeline(
  File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/mrt/Projects/LTX-Video/ltx_video/pipelines/pipeline_ltx_video.py", line 1039, in __call__
    noise_pred = self.transformer(
  File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mrt/Projects/LTX-Video/ltx_video/models/transformers/transformer3d.py", line 419, in forward
    encoder_hidden_states = self.caption_projection(encoder_hidden_states)
  File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/diffusers/models/embeddings.py", line 1607, in forward
    hidden_states = self.linear_1(caption)
  File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 125, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)

@x4080

x4080 commented Nov 26, 2024

@MarcosRodrigoT Do you use the new test file from @KT313, or the previous one?
@KT313 Is your new test code for multiple GPUs?

Edit: I tried the test file and it handles more frames than the previous one, but I saw the same error; I retried it and somehow it worked. What is really going on? Why does restarting the command work?

Edit 2: @KT313 Maybe this line is causing the CUDA/CPU inconsistency? (in inference.py)

    if torch.cuda.is_available() and args.disable_load_needed_only:
        pipeline = pipeline.to("cuda")

Edit 4: I think it works better if the above is replaced with just

    pipeline = pipeline.to("cuda")

to prevent

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)

@KT313

KT313 commented Nov 27, 2024

@x4080
I changed the code on the test branch to

    if torch.cuda.is_available():
        pipeline = pipeline.to("cuda")

as you suggested. You might be able to get away with less than 16 GiB if you don't load the whole pipeline to CUDA at the start, and instead load only the text encoder first, unload it, and then load the UNet, but that would require more experimenting, so if your suggestion works it's the easiest fix for now.
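
A rough sketch of that staged-loading idea (untested; encode_prompt is an assumed helper name, adjust it to whatever the pipeline actually exposes):

    import gc
    import torch

    def encode_then_swap(pipeline, prompt):
        # Stage 1: only the text encoder visits the GPU.
        pipeline.text_encoder.to("cuda")
        embeds = pipeline.encode_prompt(prompt)  # assumed helper, not verified
        pipeline.text_encoder.to("cpu")
        gc.collect()
        torch.cuda.empty_cache()

        # Stage 2: the denoising transformer gets the VRAM that was just freed.
        pipeline.transformer.to("cuda")
        return embeds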

I tried it on a single GPU only (4090). Not sure about multi-GPU, but the original code also doesn't have anything that specifically hints at multi-GPU, at least not in the parts that I modified.

@x4080

x4080 commented Nov 27, 2024

@KT313 thanks

@KT313

KT313 commented Nov 28, 2024

By the way, just for future readers: you might be able to get away with something as low as 6 or 8 GB if the text embedding is done on the CPU or separately somehow. The generation model itself should only need about 4-5 GiB when loaded in bfloat16 (2 bytes per parameter), plus some extra for the latent video tensor.
Most of the VRAM currently gets taken up by the text-embedding model, which is comparatively huge. If the text is embedded on the CPU it might be pretty slow, though.
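
The 2-bytes-per-parameter arithmetic, with ballpark parameter counts (assumptions, not measured):

    def weight_gib(num_params: float, bytes_per_param: int = 2) -> float:
        """Approximate weight memory for a model at the given precision."""
        return num_params * bytes_per_param / 2**30

    print(f"{weight_gib(2e9):.1f} GiB")    # ~3.7 GiB for a ~2B-parameter generation model
    print(f"{weight_gib(4.7e9):.1f} GiB")  # ~8.8 GiB for a T5-XXL-class text encoder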

@anujsinha72094

anujsinha72094 commented Nov 28, 2024

@KT313 I tried with width: 1280, height: 704, num_frames: 201, fps: 16.
The video is fine up to 160 frames, but the last 41 frames are not good; there is noise in those frames. Why?

@KT313

KT313 commented Nov 28, 2024

@anujsinha72094
That's pretty unlikely to be related to the changes I made, lol.

@able2608

able2608 commented Dec 1, 2024

> By the way, just for future readers: you might be able to get away with something as low as 6 or 8 GB if the text embedding is done on the CPU or separately somehow. The generation model itself should only need about 4-5 GiB when loaded in bfloat16 (2 bytes per parameter), plus some extra for the latent video tensor. Most of the VRAM currently gets taken up by the text-embedding model, which is comparatively huge. If the text is embedded on the CPU it might be pretty slow, though.

It seems that under the hood this uses PixArt-alpha's text encoder, which is T5 XXL version 1.1. GGUF versions of these models already exist, and Flux from Black Forest Labs also uses them. I have been able to generate images with Flux using such a setup (loading T5 in GGUF form and offloading it after text encoding) on a laptop GPU with 6 GB of VRAM and 16 GB of RAM. Perhaps this method could reduce the memory requirements a lot (to at least be able to run it on limited resources).

PS: Technically, T5 XXL and T5 XXL v1.1 have some differences besides training strategy, mainly in the activation function and in parameter sharing between the embedding and classification layers. I have not tested whether this changes memory usage, but since those differences are relatively minor, I think the experience with T5 XXL can be extrapolated.

Edit: It seems that the ComfyUI integration uses separate nodes for loading the text encoder and the diffusion model. A good starting point would be to replace the official repo's text encoder loader with the GGUF CLIP loader provided by city96's GGUF nodes and see whether it works. For those having trouble finding the GGUF loaders, the repo is here: https://github.com/city96/ComfyUI-GGUF
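
To see why a GGUF-quantized encoder makes the 6 GB setup plausible, a quick estimate (the bits-per-weight figure for a 4-bit quant is an assumption):

    # Rough size of a T5-XXL-class encoder (~4.7B parameters) at two precisions.
    params = 4.7e9
    print(f"bf16:   {params * 16 / 8 / 2**30:.1f} GiB")  # ~8.8 GiB, too big for 6 GB of VRAM
    print(f"Q4-ish: {params * 4.5 / 8 / 2**30:.1f} GiB")  # ~2.5 GiB, plausible on a 6 GB laptop GPU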
