Merge branch 'mike/add_encoder_doc' into 'main'
Add Encoder-Decoder Parallelism Documentation

See merge request ADLR/megatron-lm!2086
ko3n1g committed Sep 11, 2024
2 parents db0fc33 + f218582 commit fe1640a
Showing 2 changed files with 55 additions and 0 deletions.
54 changes: 54 additions & 0 deletions docs/source/api-guide/encoder_decoder_parallelism.rst
@@ -0,0 +1,54 @@
encoder-decoder-parallelism package
===================================

Mcore (as of 0.9) supports heterogeneous parallelism for encoder-decoder models.
In particular, the user can now specify the amount of tensor and pipeline parallelism in the encoder
and have it be distinct from that in the decoder.

Submodules
----------

Encoder Pipeline Parallelism
----------------------------

Supported in: T5, LLaVa.

The new argument for encoder pipelining is `--encoder-pipeline-model-parallel-size`. It is completely distinct
from the usual pipelining argument, `--pipeline-model-parallel-size`, which, for encoder-decoder models,
controls the amount of pipelining in the decoder only.

The total amount of pipelining in an encoder-decoder model is the sum of these two arguments. By default, the amount of
encoder pipelining is 0 and the amount of decoder pipelining is 1, meaning that the encoder & decoder share a single pipeline rank.
If `--pipeline-model-parallel-size` > 1, then the amount of encoder pipeline parallelism must be specified and must be greater than 0,
because the encoder and decoder can no longer share a pipeline rank.
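
The rank accounting can be made concrete with a short sketch. This is illustrative Python only; the helper
name is invented for this document and is not Mcore's actual validation code:

.. code-block:: python

   def total_pipeline_ranks(encoder_pp: int, decoder_pp: int) -> int:
       """Total pipeline stages: encoder stages plus decoder stages."""
       # Defaults: encoder_pp = 0, decoder_pp = 1 -> the encoder and
       # decoder share a single pipeline rank.
       if decoder_pp > 1 and encoder_pp == 0:
           raise ValueError(
               "With more than one decoder pipeline stage, the encoder "
               "cannot share a rank with the decoder; set "
               "--encoder-pipeline-model-parallel-size > 0.")
       return encoder_pp + decoder_pp

   print(total_pipeline_ranks(0, 1))  # 1: encoder & decoder share one rank
   print(total_pipeline_ranks(1, 2))  # 3: one encoder stage, two decoder stages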

Encoder Tensor Parallelism
--------------------------

Supported in: LLaVa.

Since we expect encoders to be much smaller than decoders, we also give users the ability to give the encoder a different amount of
tensor parallelism than the decoder. This is achieved with the argument `--encoder-tensor-model-parallel-size`. To use this option, you must
be using encoder pipeline parallelism (i.e., `--encoder-pipeline-model-parallel-size` > 0).

Unlike encoder pipeline parallelism, which is unrestricted by the amount of decoder pipeline parallelism, the encoder's tensor
parallelism must be less than or equal to the decoder's. In short, within p2p_communication.py, we send the activations of one
encoder rank to several decoder ranks; correspondingly, we add support for summing the gradients from those several
(downstream) decoder ranks back on the encoder rank. We have not yet seen quantization-related degradation from summing these
gradient tensors together; it could happen in very large models.
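
As an illustration of the fan-out this implies, here is a minimal, hypothetical sketch assuming a simple contiguous rank
layout with `tp` divisible by `etp` (the actual mapping logic lives in p2p_communication.py and may differ):

.. code-block:: python

   def decoder_ranks_for_encoder_rank(enc_rank, etp, tp):
       """Map one encoder tensor-parallel rank to its downstream decoder ranks."""
       assert etp <= tp and tp % etp == 0
       fan_out = tp // etp
       return [enc_rank * fan_out + i for i in range(fan_out)]

   def encoder_grad(enc_rank, etp, tp, decoder_grads):
       """Sum the gradients arriving from all downstream decoder ranks."""
       ranks = decoder_ranks_for_encoder_rank(enc_rank, etp, tp)
       total = decoder_grads[ranks[0]]
       for r in ranks[1:]:
           total = total + decoder_grads[r]  # summed, per the note above
       return total

   # Example: etp=2, tp=4 -> encoder rank 0 feeds decoder ranks 0 and 1.
   print(decoder_ranks_for_encoder_rank(0, etp=2, tp=4))  # [0, 1]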


Number of GPUs Required
-----------------------

The total number of GPUs required to train a model when these options are enabled is:

`dp * etp * epp * cp + dp * tp * pp * cp`

where:

* `dp`: amount of data parallelism (the same for the encoder & decoder)
* `etp` / `tp`: amount of tensor parallelism in the encoder / decoder
* `epp` / `pp`: amount of pipeline parallelism in the encoder / decoder
* `cp`: amount of context parallelism (as with `dp`, the same for the encoder & decoder)

The default value of `--encoder-tensor-model-parallel-size` is 0; in that case, the encoder is constructed with the same amount of tensor parallelism as the decoder (i.e., `etp = tp`).
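
A worked example of the formula above (the values are illustrative):

.. code-block:: python

   def total_gpus(dp, etp, epp, cp, tp, pp):
       encoder_gpus = dp * etp * epp * cp
       decoder_gpus = dp * tp * pp * cp
       return encoder_gpus + decoder_gpus

   # e.g. dp=2, cp=1; encoder: etp=1, epp=1; decoder: tp=4, pp=2
   print(total_gpus(dp=2, etp=1, epp=1, cp=1, tp=4, pp=2))  # 2 + 16 = 18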
1 change: 1 addition & 0 deletions docs/source/api-guide/index.rst
@@ -17,3 +17,4 @@ API Guide
datasets
num_microbatches_calculator
optimizer_param_scheduler
encoder_decoder_parallelism
