Will it support CPU offloading? #578

Open
fzyzcjy opened this issue Jul 30, 2024 · 5 comments

fzyzcjy commented Jul 30, 2024

Hi, thanks for the great library! I have heard people say that EXL2 is very fast, but I would like to run a 70B Llama model on a 24 GB 4090 card, and it cannot fit into the GPU even with e.g. 4-bit quantization. So I wonder: is there some theoretical limitation in EXL2/GPTQ, or has it just not been implemented yet? Thanks!

@turboderp (Owner)

PRs are welcome. 🤷 I just have too many feature requests and too little time.

fzyzcjy commented Jul 31, 2024

I see. I may not have enough time to do this in the near future (I have my own open-source library to maintain, e.g. https://github.com/fzyzcjy/flutter_rust_bridge, as well as research and other projects), but I am looking forward to the feature. Thank you again for the great work!

@grimulkan

Technically, I think it is possible. You just need to call module.load() before module.forward(...), followed by module.unload(), layer by layer, while keeping the cache in VRAM (you can even use a quantized cache, and a chunked forward with multiple passes if you have a long context length). It will be terribly slow for token-by-token inference, though.
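
A minimal sketch of what that loop could look like, assuming the modules expose load(), unload() and forward() roughly as named above; treat it as pseudocode rather than the exact ExLlamaV2 API:

```python
import torch

@torch.inference_mode()
def offloaded_forward(model, hidden_states, cache=None):
    # Walk the module list, keeping only one module's weights in VRAM at a time.
    # `model.modules` and load()/unload()/forward() follow the names used in this
    # thread; the real ExLlamaV2 signatures may differ.
    x = hidden_states
    for module in model.modules:
        module.load()                 # copy this module's weights to the GPU
        x = module.forward(x, cache)  # the (possibly quantized) KV cache stays in VRAM
        module.unload()               # free the weights before loading the next module
    return x
```

Weight transfers dominate here, so every generated token would pay roughly one full pass of the model's weights over PCIe, which is why this is so slow token by token.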

ro99 commented Aug 8, 2024

Maybe this would be good to have for scenarios where the model almost fits in VRAM? I don't know how much of a performance penalty we would see in that case, though.

I'm willing to do a PR but, to be honest, I'm not sure where to begin. I'd appreciate any directions.

@grimulkan

You can look at the streaming example in test_inference.py to start with. It is also easy to modify the non-streaming forward function to include module loading/unloading the way the streaming version does, so that it supports more features like chunked forward.

But those are for perplexity eval or offline forward passes. Even with a single layer stuck on the CPU, I think token-by-token inference would still be pretty bad: you'd basically have to load/unload entire layer(s) for every token generated. At that point, why not just use CPU inference?
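
For the offline/chunked case, a hedged sketch (same hypothetical names as the pseudocode above) of why perplexity eval or batch forward passes tolerate offloading much better: each layer is loaded once and every chunk is pushed through it before unloading, so the transfer cost is paid per layer rather than per token.

```python
import torch

@torch.inference_mode()
def offloaded_prefill(model, chunks, cache=None):
    # chunks: hidden states pre-split along the sequence dimension.
    states = list(chunks)
    for module in model.modules:
        module.load()                    # pay the CPU->GPU transfer once per layer...
        for i, x in enumerate(states):   # ...and reuse the weights for every chunk
            # Chunks are processed in order, so a per-layer KV cache can grow
            # correctly as earlier chunks are consumed (details omitted here).
            states[i] = module.forward(x, cache)
        module.unload()
    return states
```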
