Merge branch 'mike/add_encoder_doc' into 'main'

Add Encoder-Decoder Parallelism Documentation

See merge request ADLR/megatron-lm!2086
encoder-decoder-parallelism package
===================================

Mcore (as of 0.9) supports heterogeneous parallelism for encoder-decoder models. In particular, the user can now specify the amount of tensor and pipeline parallelism in the encoder and have it be distinct from that in the decoder.
Submodules
----------

Encoder Pipeline Parallelism
----------------------------

Supported in: T5, LLaVa.
The new argument for encoder pipelining is `--encoder-pipeline-model-parallel-size`. It is completely distinct from the usual argument that controls pipelining, `--pipeline-model-parallel-size`, which, in the context of encoder-decoder models, controls only the amount of pipelining in the decoder.

The total amount of pipelining in an encoder-decoder model is the sum of these two arguments. By default, the amount of encoder pipelining is 0 and the amount of decoder pipelining is 1, meaning that the encoder & decoder share a single pipeline rank. If `--pipeline-model-parallel-size` > 1, then the amount of encoder pipelining must be specified and must be greater than 0, because pipeline ranks can no longer be shared between the encoder and decoder.
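The rules above can be sketched as a small helper. This is an illustrative sketch, not Megatron-LM's actual code: the function and parameter names are hypothetical, but the checks mirror the constraints described in this section.

```python
# Hypothetical sketch of the pipelining rules above; Megatron-LM performs
# the equivalent validation internally when it parses its arguments.

def total_pipeline_stages(encoder_pp: int, decoder_pp: int) -> int:
    """Total pipelining is the sum of --encoder-pipeline-model-parallel-size
    (encoder_pp, default 0) and --pipeline-model-parallel-size (decoder_pp,
    default 1)."""
    if decoder_pp > 1 and encoder_pp < 1:
        # With more than one decoder pipeline rank, the encoder can no longer
        # share a rank with the decoder, so it needs its own pipeline stage(s).
        raise ValueError(
            "--encoder-pipeline-model-parallel-size must be > 0 when "
            "--pipeline-model-parallel-size > 1"
        )
    return encoder_pp + decoder_pp

# Defaults: encoder and decoder share the single pipeline rank.
assert total_pipeline_stages(0, 1) == 1
```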
Encoder Tensor Parallelism
--------------------------

Supported in: LLaVa.
Since we expect encoders to be much smaller than decoders, we also give users the ability to set a different amount of tensor parallelism in the encoder than in the decoder. This is achieved with the argument `--encoder-tensor-model-parallel-size`. To use this option, you must also be using encoder pipeline parallelism (i.e., `--encoder-pipeline-model-parallel-size` > 0).
Unlike encoder pipeline parallelism, which is unrestricted by the amount of decoder pipeline parallelism, we only allow encoders to have less than or the same amount of tensor parallelism as the decoder. In summary, within p2p_communication.py we have to send the activations of one encoder rank to several decoder ranks; correspondingly, we have to add support for summing the gradients arriving from those several (downstream) decoder ranks on the encoder rank. We have not yet seen a quantization-related degradation from summing these gradient tensors together; it could happen in very large models.
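A minimal plain-Python sketch of the forward/backward pattern described above, with lists standing in for tensors. The function names are hypothetical, for illustration only, and are not Megatron-LM's actual `p2p_communication.py` API.

```python
# Sketch of the fan-out/fan-in pattern: with encoder TP smaller than decoder
# TP, one encoder rank feeds several decoder ranks in the forward pass, and
# sums the gradients arriving back from them in the backward pass.
# (Lists stand in for tensors; names are hypothetical.)

def forward_fanout(encoder_activation, num_decoder_ranks):
    # Each downstream decoder rank receives a copy of the same activation.
    return [list(encoder_activation) for _ in range(num_decoder_ranks)]

def backward_fanin(decoder_grads):
    # The encoder rank sums, element-wise, the gradients arriving from its
    # decoder ranks. Summing many such tensors is where, in principle, the
    # numerical degradation mentioned above could appear in very large models.
    return [sum(g) for g in zip(*decoder_grads)]

acts = [1, 2, 3]
copies = forward_fanout(acts, 2)      # one encoder rank -> 2 decoder ranks
grads = [[1, 2, 3], [4, 5, 6]]        # one gradient per decoder rank
assert backward_fanin(grads) == [5, 7, 9]
```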
Number of GPUs Required
-----------------------

The total number of GPUs required to train a model with these options enabled is:
    dp * etp * epp * cp + dp * tp * pp * cp

where:

- dp: amount of data parallelism (this is the same for the encoder & decoder)
- [e]tp: amount of tensor parallelism
- [e]pp: amount of pipeline parallelism
- cp: amount of context parallelism (as with dp, this is the same for the encoder & decoder)
The default value of `--encoder-tensor-model-parallel-size` is 0; in that case, we use the decoder's amount of tensor parallelism to construct the encoder.
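The formula above can be written directly as a small function. This is an illustrative sketch (the function name is hypothetical), and the example values are arbitrary.

```python
# Sketch of the GPU-count formula above: encoder GPUs plus decoder GPUs,
# with dp and cp shared by both halves of the model.

def total_gpus(dp, tp, pp, cp, etp, epp):
    encoder_gpus = dp * etp * epp * cp   # dp * etp * epp * cp
    decoder_gpus = dp * tp * pp * cp     # dp * tp * pp * cp
    return encoder_gpus + decoder_gpus

# Example: dp=2, cp=1; decoder with tp=4, pp=2; a smaller encoder with
# etp=2, epp=1 -> 2*2*1*1 + 2*4*2*1 = 4 + 16 = 20 GPUs.
assert total_gpus(dp=2, tp=4, pp=2, cp=1, etp=2, epp=1) == 20
```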
The API Guide toctree (`index.rst`) now also lists the new page:

    datasets
    num_microbatches_calculator
    optimizer_param_scheduler
    encoder_decoder_parallelism