Curious about Exllama+TP #571
It would be a substantial amount of work. It's complicated somewhat by the EXL2 tensor format, but that's not a showstopper; I believe aphrodite-engine already has a TP solution for EXL2 models. The main problem is that I don't have the time for it. There's never a moment where there isn't something else that calls for my attention, like new architectures dropping constantly, and to make it really efficient would probably require a complete rewrite of the inference engine in C++ or Rust.

And of course there's the question of who this would be for, exactly. As in, how many users even have multiple GPUs, let alone the connectivity (server motherboards, server CPUs) for all the synchronization not to become a bottleneck and defeat the purpose anyway? And how many of those users wouldn't be better served by something like vLLM anyway? You mention cache quantization, for instance, which isn't generally a concern in very large deployments but rather something you'd turn to when you have limited resources, as on a consumer motherboard with one or two GPUs where TP isn't really relevant anyway. All in all, I'm skeptical that all that effort would end up serving more than a very narrow band of users, with very high-end hardware but not so high-end that they don't already have lots of other options.
I do think there's a gap between systems that cost perhaps tens of thousands of USD vs. 6 figures, and this might fit in between, but I understand your position. And yes, it is a niche user base. Mainly the application is getting fast single-node inference at max context on 405B without data center GPUs (server platform, though that's much cheaper than 8 x100 GPUs). You provide this already, but TP could be faster. Cache (and weight) quantization is needed for max context in that case.

I was mainly asking how much work it would take for an independent project based on the exllama kernels and the EXL2 format, but it sounds like a lot of work. I've written Llama TP inference implementations in bare PyTorch and it wasn't staggeringly difficult, but it isn't as efficient as exllama, didn't use any custom C++/CUDA kernels (other than third-party ones like flash-attn), and nothing was quantized. I thought that if it was a simple matter of changing just the matmuls and dealing with the synchronization, it might not be months of work (for me, not you). But from what you said, it sounds like I'd be better off checking out aphrodite; I didn't know they supported EXL2.
Thinking about this a bit more. Would appreciate any thoughts. I may not have explained this well in my original post, but the request is NOT to have exllama support TP. It's just some questions about whether the building blocks to build a TP inference pipeline exist in exllama currently (if someone wanted to build one for whatever reason). Re-opening just to indicate the clarification of scope (I don't expect any actual action).
Any thoughts on why this approach would be much harder than I'm thinking? Maybe I'm missing something about where kernels would need to be rewritten, or underestimating the importance of some of the kernel fusion in exllama. Again, this is work for me (or any other user), not necessarily the exllama dev :)
There's a unique complication to EXL2, which is the act-order permutation (the matmul kernel actually performs x @ W = x[:, perm] @ W[perm, :]), but that amounts to replacing a few scatter operations with broadcasts and probably doesn't add that much overhead. Also, depending on whether you do row or column splits, you might have to add striding support to the kernels; they currently assume the inputs and outputs are all contiguous.

Then you need to worry about streams and multithreading, which just hasn't been taken into account in any of the existing code. It all runs on the default stream in a single thread, both on the Python side and in the C++ modules that do most of the operations of an attention or FFN block in pure C++/CUDA to avoid Python becoming a CPU bottleneck for small models (disregard what people might say about Python being fast in theory; reality begs to differ). All of that would have to be reworked. You'd need a new loader as well to distribute tensors across GPUs, which isn't necessarily hard, it's just a lot of boilerplate.

Broadcast/scatter/gather operations are themselves nontrivial, since you need to really minimize the latency for TP to even make sense. The state has to be synchronized at least four times per layer, possibly a few more because of the permutation, so you can't afford more than a few microseconds per operation. Ideally you'd want to work it into graphs and use a library like NCCL, all of which is complicated to just tack onto the existing codebase. Ideally you'd also want this to be robust and user-friendly. I'm not terribly interested in a solution that only works if you have 2^n identical GPUs, all of which have to be headless, or that requires specific kernel flags in order to work, or whatever.

So yeah, it's definitely not impossible, and in principle it's simple. But it just turns into a lot of work in practice, and I personally don't have infinite time.
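For reference, a tiny PyTorch check of the act-order identity mentioned above (illustrative only, not exllama code): permuting the input columns and the weight rows by the same permutation leaves the product unchanged.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 64)
W = torch.randn(64, 128)
perm = torch.randperm(64)

ref = x @ W
permuted = x[:, perm] @ W[perm, :]   # same sum over k, just visited in a different order
print(torch.allclose(ref, permuted, atol=1e-5))  # True
```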
No problem, just need to pick your brain, nothing more.
I was thinking W would be split into W1 and W2, either row-wise or column-wise depending on the op (e.g., up/gate proj could be row split, down proj could be column split). This would be done at the quant stage itself, so it ties the quant to the number of GPUs (but it can be an arbitrary number, not necessarily a power of 2). So GPU1 does x @ W1 (i.e., x[:, perm] @ W1[perm, :]), since W1 was basically independently quantized, and GPU2 does x @ W2. The weights are basically totally independent in terms of quantization, act order, etc., and are treated as separate operations (you probably lose some quantization efficiency this way). After the matmuls, we either all-gather, or, if continuing to process (e.g., down proj), keep going with the matmuls and do an all-reduce at the end to compute the final result. The scatter/gather ops are all unquantized, since they only exchange results, not weights (except initially syncing the KV cache itself, I guess). Would that not work (especially the part about splitting matrices and then quantizing separately)?
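A minimal fp16 sketch of that flow (my reading of the proposal, not exllama's API): each rank holds its own shard of the MLP weights, which could then be quantized independently, and only unquantized activations are exchanged, with a single all-reduce recovering the full down-proj output. Weights here are laid out [in, out], so up/gate are split along the output dimension and down along the input dimension; tp_mlp_forward and the shard names are made up for illustration.

```python
import torch
import torch.distributed as dist

def tp_mlp_forward(x, w_gate_shard, w_up_shard, w_down_shard):
    # x:                       [tokens, d_model], replicated on every rank
    # w_gate_shard, w_up_shard: [d_model, d_ff // world_size]  (each shard quantized on its own)
    # w_down_shard:             [d_ff // world_size, d_model]
    h = torch.nn.functional.silu(x @ w_gate_shard) * (x @ w_up_shard)
    partial = h @ w_down_shard      # each rank produces a partial sum of the output
    dist.all_reduce(partial)        # sum the partials -> full MLP output on every rank
    return partial
```

The all-reduce only moves tokens × d_model fp16 values per MLP, regardless of how the weight shards were quantized, which matches the "only exchange results, not weights" point above.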
That miiight be okay... Each GPU only really needs a single default compute stream, and the synchronization streams can run separately and sync with the default stream. Are you doing anything intensive in the C++ part that will block the Python calling thread until the CUDA operations are completed? If not, then maybe we really just need the jobs submitted to the CUDA scheduler. For a larger model, the entire thing could even be run in a single Python thread (with synchronization I/O running on its own threads, which won't block Python), as long as we can queue up the CUDA operations and the CUDA operations are not too tiny, and still get decent throughput. I think this will work great while training and probably works fine for prompt ingestion too, but maybe your point is that the CUDA ops are really very tiny for inference, and the launch/scheduling overhead from a single thread will be the showstopper?
In the above comment, perhaps you're implying that it will be a showstopper. Would that be true for a larger model as well (you have twice the embed dim in 405B, and the FFN parts are massive)? You're right that if we can't pull the individual matmuls and the flash attention calls out of your C++ kernel into Python without a huge penalty, it becomes a giant pain.
Possibly we could skip the permutation part if my understanding of the W1 | W2 split is correct, but it would still need a few syncs, yes, and it's probably not possible to hide that latency via overlap because there isn't enough compute work. Each sync is something like embed dim * fp16 bytes of transfer for the current token, and obviously way slower than HBM access. Have to work out whether it is worth it or not: for N GPUs, you basically get N times the benefit for the weights in HBM accesses, but pay the penalty of slow ~5*dmodel transfers. For the FFNs, I think it is worth it; have to think about the rest.
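To put rough numbers on that, a back-of-the-envelope with assumed figures (d_model = 16384 and 126 layers for 405B, ~5 syncs per layer, fp16 activations, ~25 GB/s effective PCIe Gen4 x16 throughput):

```python
d_model = 16384                 # assumed: Llama 3.1 405B hidden size
layers = 126                    # assumed: 405B layer count
syncs_per_layer = 5             # per the ~5*dmodel estimate above
bytes_per_sync = d_model * 2    # fp16 activations for one decoded token

total_bytes = bytes_per_sync * syncs_per_layer * layers
pcie_bps = 25e9                 # assumed effective PCIe Gen4 x16 rate (bytes/s)

print(f"{total_bytes / 1e6:.1f} MB moved per token")        # ~20.6 MB
print(f"{total_bytes / pcie_bps * 1e3:.2f} ms per token")   # ~0.83 ms, bandwidth only
```

That's only the bandwidth term; per-operation latency (a few µs each, hundreds of times per token) can easily dominate it, which is the point made earlier in the thread.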
Yeah, that's a pain in the butt, I agree (and I have a secret desire to support more than just NCCL so it works on Windows without WSL, which is even more complicated). But let's ignore that bit. I'm wondering mainly about the compute operations and whether there's a big roadblock there.
So I've been working on a tacked-on, half-hearted TP implementation for the last week or so, and I do have it running for at least L3 and Mistral. I went with a simple column-parallel approach and an extra all-gather in both attn and MLP layers for simplicity at first (and for the sake of the permuted tensors). While this does add a bit of overhead, there's really very minimal time spent moving data around, according to Nsight.

What I'm mostly struggling with is the CPU overhead. Having to build n CUDA streams instead of one is painful. Python isn't a bottleneck for large models generally, but when you more than triple the number of operations in each module and waste precious microseconds every time you want to schedule recording an event and so on, it really adds up. So, for L3-70B 6.0bpw on 3 GPUs I can get up to about 50% utilization, which is better than the 33% baseline, and 50% more tokens/second at the end of the day, but far from the 200% increase it should theoretically approach.

The bottleneck is, once again, Python. So I need to move more stuff into extension code, and possibly add some multithreading on that side, since the CUDA runtime looks like it will become the next bottleneck after that. (I.e., I can squeeze out some of the empty space on the "CUDA API" line, but kernel launches still take a couple of µs each, and moving to graphs there likely wouldn't help.) Launching flash-attn from Python takes something like 15 µs, so doing that four times per layer over 80 layers is 5 ms of latency all on its own. At 20 t/s, 10% of the CPU time is spent just calling the flash-attn function, so you quickly hit a ceiling with all the other operations that also have to happen during the forward pass. Of course it might be possible to call it from the C++ extension, but I'm not sure the flash-attn C++ ABI is stable, so that could get kinda ugly.

Anyway, yeah, just an update. I'll probably push an experimental branch very soon. I'm doing all the broadcasting/gathering via pinned buffers and synchronizing with events, so I don't see a reason it wouldn't work on Windows.
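For anyone curious what the pinned-buffer + event pattern looks like in plain PyTorch, here is a minimal sketch (illustrative only, not the actual exllamav2 implementation; assumes at least two CUDA devices): device 0 stages the hidden state into page-locked host memory on a copy stream, the other devices wait on an event and pull it from there, and each default compute stream waits on its copy stream.

```python
import torch

devices = [torch.device(f"cuda:{i}") for i in range(torch.cuda.device_count())]
hidden = torch.randn(1, 8192, dtype=torch.float16, device=devices[0])

# Page-locked host buffer used as the staging area (no P2P or NCCL required)
host_buf = torch.empty(hidden.shape, dtype=hidden.dtype, device="cpu", pin_memory=True)

copy_streams = [torch.cuda.Stream(device=d) for d in devices]
d2h_done = torch.cuda.Event()

# Device 0 -> host on its copy stream
with torch.cuda.stream(copy_streams[0]):
    host_buf.copy_(hidden, non_blocking=True)
    d2h_done.record(copy_streams[0])

# Host -> every other device, after the D2H copy has finished
replicas = [hidden]
for dev, stream in zip(devices[1:], copy_streams[1:]):
    with torch.cuda.stream(stream):
        stream.wait_event(d2h_done)
        replicas.append(host_buf.to(dev, non_blocking=True))

# Make each device's default stream wait for its copy before compute touches the replica
for dev, stream in zip(devices, copy_streams):
    torch.cuda.current_stream(dev).wait_stream(stream)
```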
Excellent news! Will be happy to test any experimental branch: scaling with a large number of GPUs, on Windows and/or Linux.
Maybe multiprocessing from the same parent process is an option for building the multiple streams from Python?
Yeah, that's what I was thinking would be enough to avoid NCCL (assuming you used this for all scatter/gather), as well as avoid multiple Python processes running, but pinned buffers have bitten me on Windows when they're too big. Windows is weird with non-paged memory.
Well, you pretty concretely answered my question about how much Python & launch latency matter :) Guess my naive view of running everything with Python multithreading won't be good enough if you're blocked by more than just IO (and the GIL-free multithreading in 3.13 is still quite unstable).
Finally got around to testing Llama 3.1 405B on 8 x Ada 6000 @ PCIe Gen4 x8 (6-bit EXL2 quant). [Non-TP vs. TP speed table not reproduced.] Speeds are similar with or without Q4 cache. TP works fine with existing wrappers with minor modification too (e.g., Ooba's text-generation-webui & exllama_hf). Output text seems identical with or without TP as far as I can tell. Thanks for the work turbo! I believe this is currently the fastest way to run 405B quantized single-batch inference. Are you planning on working on this any further? Is it still mostly CPU bottlenecked (relative to the theoretical 8x)?
I am planning on improving it further, but I keep getting caught up in other things. The main bottlenecks right now are PCIe bandwidth and CPU speed.

Bandwidth is an issue because the synchronization approach isn't optimal. This is proving difficult because I don't want to make a solution tailored for enterprise hardware where you can rely on P2P capability, even splits across 2^n GPUs and so on. NCCL offers an all-gather function, for instance, but it can only work on tensors that are precisely evenly split between GPUs, which eliminates a lot of the flexibility the current approach offers. Potentially I could make do with only all-reduce operations, and theoretically that should be faster, but I haven't found a way to make it actually faster in a single-process setting without P2P.

The CPU bottleneck comes from the CUDA API functions not being reentrant. I.e., when you spread the workload over 8 GPUs you end up with 8x as many kernel launches, and spreading those over 8 threads does nothing to improve performance since only one thread can interact with the CUDA runtime at any given time. I'm not sure what the solution would be, but it's probably writing a whole new executor that spawns a process per device so each process can have its own CUDA runtime.

Loading is expected to be a bit slower since it still loads each tensor on a single GPU and then splits it from there to the others, so you have the same disk access pattern but a little more time sending data over the bus.
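To illustrate the even-split constraint: NCCL's all-gather primitive assumes every rank contributes a chunk of the same size, so uneven column splits typically get handled by padding each shard up to the largest one, gathering, and trimming, at the cost of extra copies and bandwidth. A hedged sketch using torch.distributed (uneven_all_gather and split_sizes are made-up names, not exllamav2 API):

```python
import torch
import torch.distributed as dist

def uneven_all_gather(shard: torch.Tensor, split_sizes: list[int]) -> torch.Tensor:
    """Gather unevenly sized column shards by padding to the largest shard."""
    max_cols = max(split_sizes)

    # Pad this rank's shard so every rank contributes an equally sized chunk
    padded = torch.zeros(shard.shape[0], max_cols, dtype=shard.dtype, device=shard.device)
    padded[:, : shard.shape[1]] = shard

    # Equal-sized chunks are what the underlying NCCL all-gather expects
    gathered = [torch.empty_like(padded) for _ in split_sizes]
    dist.all_gather(gathered, padded)

    # Trim the padding and reassemble the full hidden state
    return torch.cat([g[:, :n] for g, n in zip(gathered, split_sizes)], dim=-1)
```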
Thanks. Personally, I'd be glad to have an optional P2P version, even if it's only for power-of-2 GPU counts and not for Windows, if it doesn't result in a lot of code duplication on your side. I'll test again with x16 PCIe and see how much impact that has.
TP with 8 GPUs at PCIe Gen4 x16 on a single NUMA node: [results table not reproduced]. Basically the same, despite twice the P2P/H2D/D2H bandwidth. So the bottleneck is somewhere else, like CPU maybe, or in the P2P software overhead? I should add that all this is with the v2 version of Llama 3.1 405B (with the extra attn heads removed), and I set [setting not reproduced].

EDIT: Looking closer, it looks like most of the communication is actually happening via the host and pinned buffers. Is that correct? It could be that I am maxing out my DRAM bandwidth, and that's why it didn't speed up despite the interface bandwidth doubling. Any interest in using something like [elided]?
I tested with >8 GPUs on one node, and there may actually be an advantage to using pinned buffers and host-based transfers instead of P2P: it bypasses the limit of 8 GPUs per peer group. However, the initial loading process, where the first GPU loads and chunks the tensors and passes them to the other GPUs, seems to go through P2P and crashes for more than 8 GPUs. Mentioning it here in case anyone else tries it and wonders what is going on. I'll try to prototype fixes for both this and the P2P to see if it is worth turbo's time to address, and report here.
Fixing loads for > 8 GPUs was easy: [one-line patch not reproduced].
TP with 10 x Ada 6000s, each at PCIe Gen4 x16, on a single NUMA node: [results table not reproduced]. That's actually slower than 8 GPUs. Come to think of it, 8 GPUs might even be slower than 6 GPUs, but I can't test that because the model won't fit. So one of the other factors doesn't scale. Still, it seems worth it to at least support TP on more than 8 GPUs in case it should ever make sense (e.g., to fit a larger model), and it's a 1-line change. @turboderp can you integrate the above fix? Or entertain a tiny PR? The 8-GPU peer limit is an old CUDA limitation, and I don't see NVIDIA moving away from it in the near future. Getting P2P to work for inference will take some more work; will try that next (it will have the 8-GPU limit).

EDIT: Also tried with 12 GPUs: [results not reproduced].
After monkeying with it a bit, I found that [finding not reproduced]. I didn't try prototyping [elided]. So... that's that. The bottlenecks are maybe not DRAM or PCIe bandwidth anymore, but the other things turbo mentioned. Maybe with NVLink it makes a difference, but if you have that, you likely have multiple 80 GB GPUs and have no need for Exllama.
How hard would it be to write an inference engine based on exllama that supported tensor parallelism, using the existing building blocks?
Assume the quantized weight tensors would need to be split across the GPUs (either column- or row-wise), and that the non-quantized pieces (hidden tensors) and any KV cache chunks (which could be quantized) are replicated/exchanged via P2P like in any other TP implementation. I think the actual attention computation would be like any other TP, so it's mainly the MLP and QKV projections.
Would EXL2 quantization of weights need to be done differently with the splitting in mind?
Would it be possible for a user to implement inference with existing functions in exllama?
Assuming single-batch inference, and ignoring prompt ingestion time, do you think it would be worth it to improve tok/s? Basically, you'd get better effective VRAM bandwidth (rough ceiling arithmetic sketched after these questions).
Other solutions like vLLM do this, but they don't have as many quantization options for weights and KV cache, and I don't think they can run on Windows (for whatever that's worth). And they may not be optimized for single-batch inference.
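For the "effective VRAM bandwidth" point, a back-of-the-envelope ceiling under assumed numbers (405B at 6 bpw, ~960 GB/s per Ada 6000-class GPU, 8 GPUs; ignores KV cache reads, communication, and CPU overhead):

```python
params = 405e9
bits_per_weight = 6
weight_bytes = params * bits_per_weight / 8   # ~304 GB of weights read once per decoded token

gpu_bw = 960e9                                # assumed per-GPU memory bandwidth (bytes/s)
n_gpus = 8

# Layer-split (non-TP): only one GPU is reading weights at any moment
print(gpu_bw / weight_bytes)                  # ~3.2 t/s ceiling

# Tensor parallel: all GPUs read their shards concurrently
print(n_gpus * gpu_bw / weight_bytes)         # ~25 t/s ceiling
```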