Tensor parallel in distributed inference #10118
MohmedMonsef asked in Q&A:
The documentation recommends using tensor parallelism for the single-node, multi-GPU case, "if your model is too large to fit in a single GPU." My question is: why is tensor parallelism preferred over pipeline parallelism in this setup, even though tensor parallelism involves more communication? What are the specific advantages of using tensor parallelism here?
Answer: Within a node, networking is generally fast. The added communication overhead of TP is therefore much less of a concern than the gains: with TP all GPUs work on every layer at once, and TP has batching efficiencies that PP lacks.
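One way to make the batching-efficiency point concrete is the standard pipeline-bubble estimate (a back-of-envelope sketch, not vLLM-specific; $p$ and $m$ are symbols introduced here): with $p$ pipeline stages and $m$ microbatches in flight, the fraction of GPU time spent idle is roughly

$$\text{idle fraction} \approx \frac{p - 1}{m + p - 1}$$

At the low batch sizes typical of online inference ($m = 1$), this is $(p-1)/p$: on a 4-GPU pipeline, about 75% of GPU time is spent waiting. TP has no such bubble, since all GPUs cooperate on every layer of every token.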
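For reference, here is a minimal sketch of intra-node TP with vLLM, assuming a single node with 4 GPUs (the model name is a placeholder; pick any model that needs more than one GPU):

```python
from vllm import LLM, SamplingParams

# Shard every layer's weights across the 4 GPUs in this node (tensor parallelism).
# For pipeline parallelism you would set pipeline_parallel_size instead.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model name
    tensor_parallel_size=4,
)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

With this setup, each layer's matrix multiplies are split across the 4 GPUs and the partial results are combined with an all-reduce over the intra-node interconnect, which is exactly the communication the answer above says is cheap within a node.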