Merge branch 'mike/add_encoder_doc' into 'main'
Add Encoder-Decoder Parallelism Documentation

See merge request ADLR/megatron-lm!2086
ko3n1g committed Sep 11, 2024
2 parents db0fc33 + f218582 commit fe1640a
Showing 2 changed files with 55 additions and 0 deletions.
54 changes: 54 additions & 0 deletions docs/source/api-guide/encoder_decoder_parallelism.rst
@@ -0,0 +1,54 @@
encoder-decoder-parallelism package
===================================

Mcore (as of 0.9) supports heterogeneous parallelism for encoder-decoder models.
In particular, the user can now specify the amount of tensor and pipeline parallelism in the encoder
and have it be distinct from that in the decoder.

Submodules
----------

Encoder Pipeline Parallelism
----------------------------

Supported in: T5, LLaVa.

The new argument for encoder pipelining is `--encoder-pipeline-model-parallel-size`. It is completely distinct
from the usual pipelining argument, `--pipeline-model-parallel-size`, which, for encoder-decoder models,
controls the amount of pipelining in the decoder only.

The total amount of pipelining in an encoder-decoder model is the sum of these two arguments. By default, the amount of
encoder pipelining is 0 and the amount of decoder pipelining is 1, meaning that the encoder & decoder share a single pipeline rank.
If `--pipeline-model-parallel-size` > 1, then the amount of encoder pipeline parallelism must be specified and must be greater than 0,
because the encoder and decoder can no longer share a pipeline rank.
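
The rank accounting can be made concrete with a short sketch. This is illustrative Python only; the helper
name is invented for this document and is not Mcore's actual validation code:

.. code-block:: python

   def total_pipeline_ranks(encoder_pp: int, decoder_pp: int) -> int:
       """Total pipeline stages: encoder stages plus decoder stages."""
       # Defaults: encoder_pp = 0, decoder_pp = 1 -> the encoder and
       # decoder share a single pipeline rank.
       if decoder_pp > 1 and encoder_pp == 0:
           raise ValueError(
               "With more than one decoder pipeline stage, the encoder "
               "cannot share a rank with the decoder; set "
               "--encoder-pipeline-model-parallel-size > 0.")
       return encoder_pp + decoder_pp

   print(total_pipeline_ranks(0, 1))  # 1: encoder & decoder share one rank
   print(total_pipeline_ranks(1, 2))  # 3: one encoder stage, two decoder stages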

Encoder Tensor Parallelism
--------------------------

Supported in: LLaVa.

Since we expect encoders to be much smaller than decoders, we also give users the ability to give the encoder a different amount of
tensor parallelism than the decoder. This is achieved with the argument `--encoder-tensor-model-parallel-size`. To use this option, you must
be using encoder pipeline parallelism (i.e., `--encoder-pipeline-model-parallel-size` > 0).

Unlike encoder pipeline parallelism, which is unrestricted by the amount of decoder pipeline parallelism, the encoder's tensor
parallelism must be less than or equal to the decoder's. In short, within p2p_communication.py, we send the activations of one
encoder rank to several decoder ranks; correspondingly, we add support for summing the gradients from those several
(downstream) decoder ranks back on the encoder rank. We have not yet seen quantization-related degradation from summing these
gradient tensors together; it could happen in very large models.
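
As an illustration of the fan-out this implies, here is a minimal, hypothetical sketch assuming a simple contiguous rank
layout with `tp` divisible by `etp` (the actual mapping logic lives in p2p_communication.py and may differ):

.. code-block:: python

   def decoder_ranks_for_encoder_rank(enc_rank, etp, tp):
       """Map one encoder tensor-parallel rank to its downstream decoder ranks."""
       assert etp <= tp and tp % etp == 0
       fan_out = tp // etp
       return [enc_rank * fan_out + i for i in range(fan_out)]

   def encoder_grad(enc_rank, etp, tp, decoder_grads):
       """Sum the gradients arriving from all downstream decoder ranks."""
       ranks = decoder_ranks_for_encoder_rank(enc_rank, etp, tp)
       total = decoder_grads[ranks[0]]
       for r in ranks[1:]:
           total = total + decoder_grads[r]  # summed, per the note above
       return total

   # Example: etp=2, tp=4 -> encoder rank 0 feeds decoder ranks 0 and 1.
   print(decoder_ranks_for_encoder_rank(0, etp=2, tp=4))  # [0, 1]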


Number of GPUs Required
-----------------------

The total number of GPUs required to train a model when these options are enabled is:

`dp * etp * epp * cp + dp * tp * pp * cp`

where:

* `dp`: amount of data parallelism (the same for the encoder & decoder)
* `etp` / `tp`: amount of tensor parallelism in the encoder / decoder
* `epp` / `pp`: amount of pipeline parallelism in the encoder / decoder
* `cp`: amount of context parallelism (as with `dp`, the same for the encoder & decoder)

The default value of `--encoder-tensor-model-parallel-size` is 0; in that case, the encoder is constructed with the same amount of tensor parallelism as the decoder (i.e., `etp = tp`).
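
A worked example of the formula above (the values are illustrative):

.. code-block:: python

   def total_gpus(dp, etp, epp, cp, tp, pp):
       encoder_gpus = dp * etp * epp * cp
       decoder_gpus = dp * tp * pp * cp
       return encoder_gpus + decoder_gpus

   # e.g. dp=2, cp=1; encoder: etp=1, epp=1; decoder: tp=4, pp=2
   print(total_gpus(dp=2, etp=1, epp=1, cp=1, tp=4, pp=2))  # 2 + 16 = 18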
1 change: 1 addition & 0 deletions docs/source/api-guide/index.rst
@@ -17,3 +17,4 @@ API Guide
datasets
num_microbatches_calculator
optimizer_param_scheduler
encoder_decoder_parallelism
