Curious about Exllama+TP #571
It would be a substantial amount of work. It's complicated somewhat by the EXL2 tensor format, but that's not a showstopper; I believe aphrodite-engine already has a TP solution for EXL2 models. The main problem is that I don't have the time for it. There's never a moment where there isn't something else that calls for my attention, like new architectures dropping constantly, and to make it really efficient would probably require a complete rewrite of the inference engine in C++ or Rust.

And of course there's the question of who this would be for, exactly. As in, how many users even have multiple GPUs, let alone the connectivity (server motherboards, server CPUs) for all the synchronization not to become a bottleneck and defeat the purpose anyway? And how many of those users wouldn't be better served by something like vLLM anyway? You mention cache quantization, for instance, which isn't generally a concern in very large deployments but rather something you'd turn to when you have limited resources, as on a consumer motherboard with one or two GPUs where TP isn't really relevant anyway. All in all, I'm skeptical that all that effort would end up serving more than a very narrow band of users, with very high-end hardware but not so high-end that they don't already have lots of other options.
I do think there's a gap between systems that cost perhaps tens of thousands of USD vs. 6 figures, and this might fit in between, but I understand your position. And yes, it is a niche user base. Mainly the application is getting fast single-node inference at max context on 405B without data center GPUs (server platform, though that's much cheaper than 8 x100 GPUs). You provide this already, but TP could be faster. Cache (and weight) quantization is needed for max context in that case.

I was mainly asking how much work it would take for an independent project based on the exllama kernels and the EXL2 format, but it sounds like a lot of work. I've written Llama TP inference implementations in bare PyTorch and it wasn't staggeringly difficult, but it isn't as efficient as exllama, didn't use any custom C++/CUDA kernels (other than third-party ones like flash-attn), and nothing was quantized. I thought that if it was a simple matter of changing just the matmuls and dealing with the synchronization, it might not be months of work (for me, not you). But from what you said, it sounds like I'd be better off checking out aphrodite; I didn't know they supported EXL2.
Thinking about this a bit more. Would appreciate any thoughts. I may not have explained this well in my original post, but the request is NOT to have exllama support TP. It's just some questions about whether the building blocks to build a TP inference pipeline exist in exllama currently (if someone wanted to build one for whatever reason). Re-opening just to indicate the clarification of scope (I don't expect any actual action).
Any thoughts on why this approach would be much harder than I'm thinking? Maybe I'm missing something about where kernels would need to be rewritten, or underestimating the importance of some of the kernel fusion in exllama. Again, this is work for me (or any other user), not necessarily the exllama dev :)
There's a unique complication to EXL2, which is the act-order permutation (the matmul kernel actually performs x @ W = x[:, perm] @ W[perm, :]), but that amounts to replacing a few scatter operations with broadcasts and probably doesn't add that much overhead. Also, depending on whether you do row or column splits, you might have to add striding support to the kernels; they currently assume the inputs and outputs are all contiguous.

Then you need to worry about streams and multithreading, which just hasn't been taken into account in any of the existing code. It all runs on the default stream in a single thread, both on the Python side and in the C++ modules that do most of the operations of an attention or FFN block in pure C++/CUDA to avoid Python becoming a CPU bottleneck for small models (disregard what people might say about Python being fast in theory; reality begs to differ). All of that would have to be reworked. You'd need a new loader as well to distribute tensors across GPUs, which isn't necessarily hard, it's just a lot of boilerplate.

Broadcast/scatter/gather operations are themselves nontrivial, since you need to really minimize the latency for TP to even make sense. The state has to be synchronized at least four times per layer, possibly a few more because of the permutation, so you can't afford more than a few microseconds per operation. Ideally you'd want to work it into graphs and use a library like NCCL, all of which is complicated to just tack onto the existing codebase. Ideally you'd also want this to be robust and user-friendly. I'm not terribly interested in a solution that only works if you have 2^n identical GPUs, all of which have to be headless, or that requires specific kernel flags in order to work, or whatever.

So yeah, it's definitely not impossible, and in principle it's simple. But it just turns into a lot of work in practice, and I personally don't have infinite time.
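For reference, a tiny PyTorch check of the act-order identity mentioned above (illustrative only, not exllama code): permuting the input columns and the weight rows by the same permutation leaves the product unchanged.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 64)
W = torch.randn(64, 128)
perm = torch.randperm(64)

ref = x @ W
permuted = x[:, perm] @ W[perm, :]   # same sum over k, just visited in a different order
print(torch.allclose(ref, permuted, atol=1e-5))  # True
```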
No problem, just need to pick your brain, nothing more.
I was thinking W would be split into W1 and W2, either row-wise or column-wise depending on the op (e.g., up/gate proj could be row split, down proj could be column split). This would be done at the quant stage itself, so it ties the quant to the number of GPUs (but it can be an arbitrary number, not necessarily a power of 2). So GPU1 does x @ W1 (i.e., x[:, perm] @ W1[perm, :]), since W1 was basically independently quantized, and GPU2 does x @ W2. The weights are basically totally independent in terms of quantization, act order, etc., and are treated as separate operations (you probably lose some quantization efficiency this way). After the matmuls, we either all-gather, or, if continuing to process (e.g., down proj), keep going with the matmuls and do an all-reduce at the end to compute the final result. The scatter/gather ops are all unquantized, since they only exchange results, not weights (except initially syncing the KV cache itself, I guess). Would that not work (especially the part about splitting matrices and then quantizing separately)?
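A minimal fp16 sketch of that flow (my reading of the proposal, not exllama's API): each rank holds its own shard of the MLP weights, which could then be quantized independently, and only unquantized activations are exchanged, with a single all-reduce recovering the full down-proj output. Weights here are laid out [in, out], so up/gate are split along the output dimension and down along the input dimension; tp_mlp_forward and the shard names are made up for illustration.

```python
import torch
import torch.distributed as dist

def tp_mlp_forward(x, w_gate_shard, w_up_shard, w_down_shard):
    # x:                       [tokens, d_model], replicated on every rank
    # w_gate_shard, w_up_shard: [d_model, d_ff // world_size]  (each shard quantized on its own)
    # w_down_shard:             [d_ff // world_size, d_model]
    h = torch.nn.functional.silu(x @ w_gate_shard) * (x @ w_up_shard)
    partial = h @ w_down_shard      # each rank produces a partial sum of the output
    dist.all_reduce(partial)        # sum the partials -> full MLP output on every rank
    return partial
```

The all-reduce only moves tokens × d_model fp16 values per MLP, regardless of how the weight shards were quantized, which matches the "only exchange results, not weights" point above.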
That miiight be okay... Each GPU only really needs a single default compute stream, and the synchronization streams can run separately and sync with the default stream. Are you doing anything intensive in the C++ part that will block the Python calling thread until the CUDA operations are completed? If not, then maybe we really just need the jobs submitted to the CUDA scheduler. For a larger model, the entire thing could even be run in a single Python thread (with synchronization I/O running on its own threads, which won't block Python), as long as we can queue up the CUDA operations and the CUDA operations are not too tiny, and still get decent throughput. I think this will work great while training and probably works fine for prompt ingestion too, but maybe your point is that the CUDA ops are really very tiny for inference, and the launch/scheduling overhead from a single thread will be the showstopper?
In the above comment, perhaps you're implying that it will be a showstopper. Would that be true for a larger model as well (you have twice the embed dim in 405B, and the FFN parts are massive)? You're right that if we can't pull the individual matmuls and the flash attention calls out of your C++ kernel into Python without a huge penalty, it becomes a giant pain.
Possibly we could skip the permutation part if my understanding of the W1 | W2 split is correct, but it would still need a few syncs, yes, and it's probably not possible to hide that latency via overlap because there isn't enough compute work. Each sync is something like embed dim * fp16 bytes of transfer for the current token, and obviously way slower than HBM access. Have to work out whether it is worth it or not: for N GPUs, you basically get N times the benefit for the weights in HBM accesses, but pay the penalty of slow ~5*dmodel transfers. For the FFNs, I think it is worth it; have to think about the rest.
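To put rough numbers on that, a back-of-the-envelope with assumed figures (d_model = 16384 and 126 layers for 405B, ~5 syncs per layer, fp16 activations, ~25 GB/s effective PCIe Gen4 x16 throughput):

```python
d_model = 16384                 # assumed: Llama 3.1 405B hidden size
layers = 126                    # assumed: 405B layer count
syncs_per_layer = 5             # per the ~5*dmodel estimate above
bytes_per_sync = d_model * 2    # fp16 activations for one decoded token

total_bytes = bytes_per_sync * syncs_per_layer * layers
pcie_bps = 25e9                 # assumed effective PCIe Gen4 x16 rate (bytes/s)

print(f"{total_bytes / 1e6:.1f} MB moved per token")        # ~20.6 MB
print(f"{total_bytes / pcie_bps * 1e3:.2f} ms per token")   # ~0.83 ms, bandwidth only
```

That's only the bandwidth term; per-operation latency (a few µs each, hundreds of times per token) can easily dominate it, which is the point made earlier in the thread.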
Yeah, that's a pain in the butt, I agree (and I have a secret desire to support more than just NCCL so it works on Windows without WSL, which is even more complicated). But let's ignore that bit. I'm wondering mainly about the compute operations and whether there's a big roadblock there.
So I've been working on a tacked-on, half-hearted TP implementation for the last week or so, and I do have it running for at least L3 and Mistral. I went with a simple column-parallel approach and an extra all-gather in both attn and MLP layers for simplicity at first (and for the sake of the permuted tensors). While this does add a bit of overhead, there's really very minimal time spent moving data around, according to Nsight.

What I'm mostly struggling with is the CPU overhead. Having to build n CUDA streams instead of one is painful. Python isn't a bottleneck for large models generally, but when you more than triple the number of operations in each module and waste precious microseconds every time you want to schedule recording an event and so on, it really adds up. So, for L3-70B 6.0bpw on 3 GPUs I can get up to about 50% utilization, which is better than the 33% baseline, and 50% more tokens/second at the end of the day, but far from the 200% increase it should theoretically approach.

The bottleneck is, once again, Python. So I need to move more stuff into extension code, and possibly add some multithreading on that side, since the CUDA runtime looks like it will become the next bottleneck after that. (I.e., I can squeeze out some of the empty space on the "CUDA API" line, but kernel launches still take a couple of µs each, and moving to graphs there likely wouldn't help.) Launching flash-attn from Python takes something like 15 µs, so doing that four times per layer over 80 layers is 5 ms of latency all on its own. At 20 t/s, 10% of the CPU time is spent just calling the flash-attn function, so you quickly hit a ceiling with all the other operations that also have to happen during the forward pass. Of course it might be possible to call it from the C++ extension, but I'm not sure the flash-attn C++ ABI is stable, so that could get kinda ugly.

Anyway, yeah, just an update. I'll probably push an experimental branch very soon. I'm doing all the broadcasting/gathering via pinned buffers and synchronizing with events, so I don't see a reason it wouldn't work on Windows.
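For anyone curious what the pinned-buffer + event pattern looks like in plain PyTorch, here is a minimal sketch (illustrative only, not the actual exllamav2 implementation; assumes at least two CUDA devices): device 0 stages the hidden state into page-locked host memory on a copy stream, the other devices wait on an event and pull it from there, and each default compute stream waits on its copy stream.

```python
import torch

devices = [torch.device(f"cuda:{i}") for i in range(torch.cuda.device_count())]
hidden = torch.randn(1, 8192, dtype=torch.float16, device=devices[0])

# Page-locked host buffer used as the staging area (no P2P or NCCL required)
host_buf = torch.empty(hidden.shape, dtype=hidden.dtype, device="cpu", pin_memory=True)

copy_streams = [torch.cuda.Stream(device=d) for d in devices]
d2h_done = torch.cuda.Event()

# Device 0 -> host on its copy stream
with torch.cuda.stream(copy_streams[0]):
    host_buf.copy_(hidden, non_blocking=True)
    d2h_done.record(copy_streams[0])

# Host -> every other device, after the D2H copy has finished
replicas = [hidden]
for dev, stream in zip(devices[1:], copy_streams[1:]):
    with torch.cuda.stream(stream):
        stream.wait_event(d2h_done)
        replicas.append(host_buf.to(dev, non_blocking=True))

# Make each device's default stream wait for its copy before compute touches the replica
for dev, stream in zip(devices, copy_streams):
    torch.cuda.current_stream(dev).wait_stream(stream)
```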
Excellent news! Will be happy to test any experimental branch: scaling with a large number of GPUs, on Windows and/or Linux.
Maybe multiprocessing from the same parent process is an option for building the multiple streams from Python?
Yeah, that's what I was thinking would be enough to avoid NCCL (assuming you used this for all scatter/gather), as well as avoid multiple Python processes running, but pinned buffers have bitten me on Windows when they're too big. Windows is weird with non-paged memory.
Well, you pretty concretely answered my question about how much Python & launch latency matter :) Guess my naive view of running everything with Python multithreading won't be good enough if you're blocked by more than just IO (and the GIL-free multithreading in 3.13 is still quite unstable).
Finally got around to testing Llama 3.1 405B on 8 x Ada 6000 @ PCIe Gen4 x8 (6-bit EXL2 quant). [Non-TP vs. TP speed table not reproduced.] Speeds are similar with or without Q4 cache. TP works fine with existing wrappers with minor modification too (e.g., Ooba's text-generation-webui & exllama_hf). Output text seems identical with or without TP as far as I can tell. Thanks for the work turbo! I believe this is currently the fastest way to run 405B quantized single-batch inference. Are you planning on working on this any further? Is it still mostly CPU bottlenecked (relative to the theoretical 8x)?
I am planning on improving it further, but I keep getting caught up in other things. The main bottlenecks right now are PCIe bandwidth and CPU speed.

Bandwidth is an issue because the synchronization approach isn't optimal. This is proving difficult because I don't want to make a solution tailored for enterprise hardware where you can rely on P2P capability, even splits across 2^n GPUs and so on. NCCL offers an all-gather function, for instance, but it can only work on tensors that are precisely evenly split between GPUs, which eliminates a lot of the flexibility the current approach offers. Potentially I could make do with only all-reduce operations, and theoretically that should be faster, but I haven't found a way to make it actually faster in a single-process setting without P2P.

The CPU bottleneck comes from the CUDA API functions not being reentrant. I.e., when you spread the workload over 8 GPUs you end up with 8x as many kernel launches, and spreading those over 8 threads does nothing to improve performance since only one thread can interact with the CUDA runtime at any given time. I'm not sure what the solution would be, but it's probably writing a whole new executor that spawns a process per device so each process can have its own CUDA runtime.

Loading is expected to be a bit slower since it still loads each tensor on a single GPU and then splits it from there to the others, so you have the same disk access pattern but a little more time sending data over the bus.
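To illustrate the even-split constraint: NCCL's all-gather primitive assumes every rank contributes a chunk of the same size, so uneven column splits typically get handled by padding each shard up to the largest one, gathering, and trimming, at the cost of extra copies and bandwidth. A hedged sketch using torch.distributed (uneven_all_gather and split_sizes are made-up names, not exllamav2 API):

```python
import torch
import torch.distributed as dist

def uneven_all_gather(shard: torch.Tensor, split_sizes: list[int]) -> torch.Tensor:
    """Gather unevenly sized column shards by padding to the largest shard."""
    max_cols = max(split_sizes)

    # Pad this rank's shard so every rank contributes an equally sized chunk
    padded = torch.zeros(shard.shape[0], max_cols, dtype=shard.dtype, device=shard.device)
    padded[:, : shard.shape[1]] = shard

    # Equal-sized chunks are what the underlying NCCL all-gather expects
    gathered = [torch.empty_like(padded) for _ in split_sizes]
    dist.all_gather(gathered, padded)

    # Trim the padding and reassemble the full hidden state
    return torch.cat([g[:, :n] for g, n in zip(gathered, split_sizes)], dim=-1)
```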
Thanks. Personally, I'd be glad to have an optional P2P version, even if it's only for power-of-2 GPU counts and not for Windows, if it doesn't result in a lot of code duplication on your side. I'll test again with x16 PCIe and see how much impact that has.
TP with 8 GPUs at PCIe Gen4 x16 on a single NUMA node: [results table not reproduced]. Basically the same, despite twice the P2P/H2D/D2H bandwidth. So the bottleneck is somewhere else, like CPU maybe, or in the P2P software overhead? I should add that all this is with the v2 version of Llama 3.1 405B (with the extra attn heads removed), and I set [setting not reproduced].

EDIT: Looking closer, it looks like most of the communication is actually happening via the host and pinned buffers. Is that correct? It could be that I am maxing out my DRAM bandwidth, and that's why it didn't speed up despite the interface bandwidth doubling. Any interest in using something like [elided]?
I tested with >8 GPUs on one node, and there may actually be an advantage to using pinned buffers and host-based transfers instead of P2P: it bypasses the limit of 8 GPUs per peer group. However, the initial loading process, where the first GPU loads and chunks the tensors and passes them to the other GPUs, seems to go through P2P and crashes for more than 8 GPUs. Mentioning it here in case anyone else tries it and wonders what is going on. I'll try to prototype fixes for both this and the P2P to see if it is worth turbo's time to address, and report here.
Fixing loads for > 8 GPUs was easy: [one-line patch not reproduced].
TP with 10 x Ada 6000s, each at PCIe Gen4 x16, on a single NUMA node: [results table not reproduced]. That's actually slower than 8 GPUs. Come to think of it, 8 GPUs might even be slower than 6 GPUs, but I can't test that because the model won't fit. So one of the other factors doesn't scale. Still, it seems worth it to at least support TP on more than 8 GPUs in case it should ever make sense (e.g., to fit a larger model), and it's a 1-line change. @turboderp can you integrate the above fix? Or entertain a tiny PR? The 8-GPU peer limit is an old CUDA limitation, and I don't see NVIDIA moving away from it in the near future. Getting P2P to work for inference will take some more work; will try that next (it will have the 8-GPU limit).

EDIT: Also tried with 12 GPUs: [results not reproduced].
After monkeying with it a bit, I found that [finding not reproduced]. I didn't try prototyping [elided]. So... that's that. The bottlenecks are maybe not DRAM or PCIe bandwidth anymore, but the other things turbo mentioned. Maybe with NVLink it makes a difference, but if you have that, you likely have multiple 80 GB GPUs and have no need for Exllama.
How hard would it be to write an inference engine based on exllama that supported tensor parallelism, using the existing building blocks?
Assume the quantized weight tensors would need to be split across the GPUs (either column- or row-wise), and that the non-quantized pieces (hidden tensors) and any KV cache chunks (which could be quantized) are replicated/exchanged via P2P like in any other TP implementation. I think the actual attention computation would be like any other TP, so it's mainly the MLP and QKV projections.
Would EXL2 quantization of weights need to be done differently with the splitting in mind?
Would it be possible for a user to implement inference with existing functions in exllama?
Assuming single-batch inference, and ignoring prompt ingestion time, do you think it would be worth it to improve tok/s? Basically, you'd get better effective VRAM bandwidth (rough ceiling arithmetic sketched after these questions).
Other solutions like vLLM do this, but they don't have as many quantization options for weights and KV cache, and I don't think they can run on Windows (for whatever that's worth). And they may not be optimized for single-batch inference.
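For the "effective VRAM bandwidth" point, a back-of-the-envelope ceiling under assumed numbers (405B at 6 bpw, ~960 GB/s per Ada 6000-class GPU, 8 GPUs; ignores KV cache reads, communication, and CPU overhead):

```python
params = 405e9
bits_per_weight = 6
weight_bytes = params * bits_per_weight / 8   # ~304 GB of weights read once per decoded token

gpu_bw = 960e9                                # assumed per-GPU memory bandwidth (bytes/s)
n_gpus = 8

# Layer-split (non-TP): only one GPU is reading weights at any moment
print(gpu_bw / weight_bytes)                  # ~3.2 t/s ceiling

# Tensor parallel: all GPUs read their shards concurrently
print(n_gpus * gpu_bw / weight_bytes)         # ~25 t/s ceiling
```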