[RFC] Multithread stream #1570

Draft
wants to merge 3 commits into
base: main

Conversation

awni (Member) commented Nov 7, 2024

A possible implementation of multithreaded streams for the CPU.

I'm not entirely sure (1) whether we should add it and (2) whether this is a good way to do it. There are some niche cases where it's useful to have a multi-threaded stream (like quantizing on a CPU-only machine).
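For readers unfamiliar with the idea, here is a minimal, generic sketch of what a "multithreaded stream" can look like: a queue of tasks drained by several worker threads, with a `synchronize` call that blocks until everything enqueued so far has finished. This is not the code in this PR and not MLX's API. `ThreadedStream`, `enqueue`, `synchronize`, `close`, and `layer_scale` are hypothetical names chosen for illustration, and the per-layer work is a stand-in for a real quantization kernel.

```python
import queue
import threading


class ThreadedStream:
    """Toy stream (hypothetical): tasks are queued and run by N worker threads."""

    def __init__(self, num_threads):
        self._tasks = queue.Queue()
        self._workers = [
            threading.Thread(target=self._run, daemon=True)
            for _ in range(num_threads)
        ]
        for worker in self._workers:
            worker.start()

    def _run(self):
        while True:
            task = self._tasks.get()
            try:
                if task is None:  # sentinel: shut this worker down
                    return
                task()
            finally:
                self._tasks.task_done()

    def enqueue(self, fn, *args):
        # Tasks must be independent: workers may run them in any order.
        self._tasks.put(lambda: fn(*args))

    def synchronize(self):
        # Block until every task enqueued so far has completed.
        self._tasks.join()

    def close(self):
        for _ in self._workers:
            self._tasks.put(None)
        for worker in self._workers:
            worker.join()


def layer_scale(out, index, weights):
    # Stand-in for per-layer quantization work: compute the max magnitude.
    out[index] = max(abs(w) for w in weights)


layers = [[float(i), 0.5, -1.0] for i in range(8)]
results = [None] * len(layers)

stream = ThreadedStream(num_threads=4)
for i, weights in enumerate(layers):
    stream.enqueue(layer_scale, results, i, weights)
stream.synchronize()
stream.close()
print(results)
```

Each layer writes to its own slot in `results`, so the tasks need no locking. In pure Python the GIL prevents real parallel speedup for work like this; the sketch only shows the scheduling shape, whereas a native implementation would run the kernels concurrently.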

On an M3 Max quantizing 8B Llama 3 to 4-bit with the CPU:

| threads | time (seconds) |
| --- | --- |
| 1 | 46.94 |
| 2 | 30.27 |
| 4 | 20.52 |
| 8 | 16.39 |

Interestingly using the CPU-only build is much faster 🤔

| threads | time (seconds) |
| --- | --- |
| 1 | 36.27 |
| 2 | 20.09 |
| 4 | 12.41 |
| 8 | 8.89 |
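To make the scaling easier to compare, the timings from the two tables above can be turned into speedup (single-thread time divided by N-thread time) and parallel efficiency (speedup divided by N). The snippet below is only arithmetic on the reported numbers. The dictionary names `gpu_build` and `cpu_only_build` and the helper `scaling` are labels chosen here, not anything from the PR.

```python
# Timings in seconds, copied from the two tables, keyed by thread count.
gpu_build = {1: 46.94, 2: 30.27, 4: 20.52, 8: 16.39}      # first table
cpu_only_build = {1: 36.27, 2: 20.09, 4: 12.41, 8: 8.89}  # second table


def scaling(times):
    """Return {threads: (speedup, efficiency)} relative to the 1-thread time."""
    base = times[1]
    return {n: (base / t, base / t / n) for n, t in times.items()}


for name, times in [("GPU build", gpu_build), ("CPU-only build", cpu_only_build)]:
    print(name)
    for n, (speedup, efficiency) in scaling(times).items():
        print(f"  {n} threads: {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
```

At 8 threads this works out to roughly a 2.9x speedup for the first table and roughly 4.1x for the CPU-only build, so the CPU-only build is both faster in absolute terms and scales better with thread count.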

With #1578 and 8 threads, the CPU quantization time is 4.6 seconds, which is pretty good for CPU only. That is almost as fast as the GPU, which takes 3.03 seconds on the M3 Max.

awni (Member, Author) commented Nov 8, 2024

> Interestingly using the CPU-only build is much faster

It turns out this is related to the memory cache in the GPU build and the BFS using far too much RAM. I see better times (similar to the CPU-only build) when using the width-limited BFS.

awni force-pushed the multithread_stream branch from 77cbd3f to 4eeae4b on November 9, 2024 00:37