Will it support CPU offloading? #578

Open
fzyzcjy opened this issue Jul 30, 2024 · 5 comments

fzyzcjy commented Jul 30, 2024

Hi, thanks for the great library! I have heard people say that EXL2 is very fast, but I would like to run a 70B Llama model on a 24 GB 4090 card, and it cannot fit into the GPU even with e.g. 4-bit quantization. So I wonder: is there some theoretical limitation in EXL2/GPTQ, or has it just not been implemented yet? Thanks!

@turboderp (Owner)

PRs are welcome. 🤷 I just have too many feature requests and too little time.

fzyzcjy commented Jul 31, 2024

I see. I may not have enough time to do this in the near future (I have my own open-source library to maintain, e.g. https://github.com/fzyzcjy/flutter_rust_bridge, as well as research and other projects), but I am looking forward to the feature. Thank you again for the great work!

@grimulkan

Technically, I think it is possible. You just need to call module.load() before module.forward(...), followed by module.unload(), layer by layer, while keeping the cache in VRAM (you can even use a quantized cache, and a chunked forward with multiple passes if you have a long context length). It will be terribly slow for token-by-token inference, though.
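
A minimal sketch of what that loop could look like, assuming the modules expose load(), unload() and forward() roughly as named above; treat it as pseudocode rather than the exact ExLlamaV2 API:

```python
import torch

@torch.inference_mode()
def offloaded_forward(model, hidden_states, cache=None):
    # Walk the module list, keeping only one module's weights in VRAM at a time.
    # `model.modules` and load()/unload()/forward() follow the names used in this
    # thread; the real ExLlamaV2 signatures may differ.
    x = hidden_states
    for module in model.modules:
        module.load()                 # copy this module's weights to the GPU
        x = module.forward(x, cache)  # the (possibly quantized) KV cache stays in VRAM
        module.unload()               # free the weights before loading the next module
    return x
```

Weight transfers dominate here, so every generated token would pay roughly one full pass of the model's weights over PCIe, which is why this is so slow token by token.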

ro99 commented Aug 8, 2024

Maybe this would be good to have for scenarios where the model almost fits in VRAM? I don't know how much of a performance penalty we would see in that case, though.

I'm willing to do a PR but, to be honest, I'm not sure where to begin. I'd appreciate any directions.

@grimulkan

You can look at the streaming example in test_inference.py to start with. It is also easy to modify the non-streaming forward function to include module loading/unloading the way the streaming version does, so that it supports more features like chunked forward.

But those are for perplexity eval or offline forward passes. Even with a single layer stuck on the CPU, I think token-by-token inference would still be pretty bad: you'd basically have to load/unload entire layer(s) for every token generated. At that point, why not just use CPU inference?
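
For the offline/chunked case, a hedged sketch (same hypothetical names as the pseudocode above) of why perplexity eval or batch forward passes tolerate offloading much better: each layer is loaded once and every chunk is pushed through it before unloading, so the transfer cost is paid per layer rather than per token.

```python
import torch

@torch.inference_mode()
def offloaded_prefill(model, chunks, cache=None):
    # chunks: hidden states pre-split along the sequence dimension.
    states = list(chunks)
    for module in model.modules:
        module.load()                    # pay the CPU->GPU transfer once per layer...
        for i, x in enumerate(states):   # ...and reuse the weights for every chunk
            # Chunks are processed in order, so a per-layer KV cache can grow
            # correctly as earlier chunks are consumed (details omitted here).
            states[i] = module.forward(x, cache)
        module.unload()
    return states
```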
