[PyTorch] Distributed intermediate/activation tensors for FSDP #687
Conversation
Force-pushed from d18b49f to 71f696b
fsdp_states, fsdp_modules = _get_fsdp_states_with_modules(fsdp_root)
for state, module in zip(fsdp_states, fsdp_modules):
    if _is_te_module(module):
        setattr(module, "fsdp_wrapepd", True)
typo
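For reference, a later commit in this PR ("Fixed typo in attribute name") corrects the flagged line; the attribute name below is inferred from context rather than quoted from the final diff:

```python
# Inferred correction of the line flagged above; "fsdp_wrapped" is the assumed final name.
if _is_te_module(module):
    setattr(module, "fsdp_wrapped", True)
```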
Force-pushed from 1b822f4 to 9d8a7f5
Force-pushed from 0267b08 to d62330d
/te-ci pytorch
Force-pushed from b260d50 to 8c6e9b7
/te-ci pytorch
Force-pushed from 7cb919b to 518df99
/te-ci pytorch
/te-ci pytorch
LGTM
@@ -856,3 +865,110 @@ def allreduce(
    handle = torch.distributed.all_reduce(input_, group=tp_group, async_op=async_op)

    return input_, handle


def _fsdp_scatter_tensors(
Interesting that the linter doesn't complain about the missing docstring. Is this due to the `__all__` decl, or just the internal-function naming convention with `_func_name`? Either way, I think this is good practice going forward, instead of adding filler docs!
Yes, I believe PyLint ignores some warnings by default for internal functions designated with the leading underscore.
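For context, a helper like `_fsdp_scatter_tensors` conceptually replaces each saved tensor's storage with just the calling rank's shard, remembering the full shapes so a matching gather can restore them before the backward pass. The sketch below is purely illustrative (a flatten-pad-chunk scheme with assumed names), not the implementation added in this PR:

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F


def _fsdp_scatter_tensors(fsdp_group, *tensors):
    # Illustrative sketch only: shrink each tensor's storage to this rank's shard
    # and return the original shapes so a matching gather can rebuild the full
    # tensors right before they are needed in the backward pass.
    shapes = []
    world = dist.get_world_size(fsdp_group)
    rank = dist.get_rank(fsdp_group)
    for t in tensors:
        if not isinstance(t, torch.Tensor):
            shapes.append(None)
            continue
        shapes.append(t.data.shape)
        flat = t.data.reshape(-1)
        pad = (world - flat.numel() % world) % world  # pad so the split is even
        if pad:
            flat = F.pad(flat, (0, pad))
        t.data = flat.chunk(world)[rank].clone()  # keep only the local shard
    return shapes
```

A matching gather counterpart would all-gather the shards over the same group and reshape them back to the recorded shapes.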
/te-ci pytorch
LGTM
te-ci
/te-ci
/te-ci pytorch
Pipeline 15637774
[PyTorch] Distributed intermediate/activation tensors for FSDP (…A#687)
* New TE wrapper for PyTorch FullyShardedDataParallel to make TE modules distribute their activations after the forward pass and gather them before the backward pass
* simplified TE module setup for FSDP comms
* FSDP scatter/gather for tensors saved into autograd ctx now working for base TE modules
* make sure activation recompute disables FSDP scatter/gather
* make sure Fp8 weight buffers are sharded at the end of the backward pass and gathered before forward
* Fixed typo in attribute name
* fixed bug in finding FSDP-wrapped TE modules
* fixed typo in fp8 weight tensor name
* fixed incorrect # of gradients
* Added fp8 amax gradient hook tensor to the parameter reset
* get rid of erroneous dummy tensor leftover from incorrect rebase
* Linting fixes
* fixing git snafu and removing debug statements

Signed-off-by: Alp Dener <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
`torch.distributed.fsdp.FullyShardedDataParallel` cannot scatter/gather the intermediate/activation tensors that TE modules pack into the autograd context at the end of their forward passes, so globally sized activation and Fp8 weight tensors stay in memory.

This PR provides a `te.distributed.prepare_te_modules_for_fsdp(fsdp_root)` API that inserts references to the correct FSDP process group into the FSDP-wrapped TE modules of a given model. The TE modules then use these process groups to scatter the intermediate/activation tensors at the end of the forward pass, before packing them into the autograd context. The same tensors are gathered at the beginning of the backward pass, before compute.
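As a rough usage sketch (the model, dimensions, and FSDP wrapping options below are assumptions for illustration, not taken from this PR):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
import transformer_engine.pytorch as te

# Assumes torch.distributed is already initialized (e.g. launched via torchrun)
# and the current CUDA device has been set for this rank.
model = torch.nn.Sequential(*[te.LayerNormMLP(1024, 4096) for _ in range(3)]).cuda()
fsdp_root = FSDP(model, use_orig_params=True)

# Hand each FSDP-wrapped TE module a reference to its FSDP process group so it can
# scatter saved activations after the forward pass and gather them before backward.
te.distributed.prepare_te_modules_for_fsdp(fsdp_root)
```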
Using `te.distributed.checkpoint()` turns off these scatters/gathers, avoiding unnecessary communication for tensors that have to be recomputed anyway.

`nn.Sequential( 3 x te.LayerNormMLP )` before Fp8/intermediate sharding:

`nn.Sequential( 3 x te.LayerNormMLP )` after Fp8/intermediate sharding:
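A forward/backward step with the prepared model might then look like the sketch below (the batch shape and default fp8 recipe are assumptions). When the forward pass runs under activation recompute via `te.distributed.checkpoint()` instead, the scatter/gather is skipped as described above.

```python
# One illustrative step with the prepared FSDP model from the sketch above.
inp = torch.randn(2048, 1024, device="cuda", requires_grad=True)

with te.fp8_autocast(enabled=True):  # default recipe assumed; requires fp8-capable GPUs
    out = fsdp_root(inp)             # saved activations are scattered after forward

loss = out.float().sum()
loss.backward()                      # activation shards are gathered back before compute
```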