[PyTorch] Adding TP overlap support for `te.Linear` with `parallel_mode="column"` #1343

denera · 2024-11-20T21:52:04Z

Description

te.Linear currently supports TP overlap only with parallel_mode="row", where it overlaps the reduce-scatter in the forward pass and the all-gather with dgrad in the backward pass.

This PR adds new options to enable all-gather overlap in the forward pass, and reduce-scatter overlap with dgrad in the backward pass, when parallel_mode="column".
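As a rough summary of the description above, the pairing between parallel mode, pass, and overlapped collective can be written as a small lookup table. This is only an illustrative sketch: `TP_OVERLAP_BY_MODE` and `overlapped_collective` are hypothetical names and are not part of Transformer Engine.

```python
# Hypothetical lookup table summarizing the PR description; not a Transformer Engine API.
TP_OVERLAP_BY_MODE = {
    # Existing behavior: reduce-scatter in forward, all-gather with dgrad in backward.
    "row": {"forward": "reduce-scatter", "backward": "all-gather"},
    # Added by this PR: all-gather in forward, reduce-scatter with dgrad in backward.
    "column": {"forward": "all-gather", "backward": "reduce-scatter"},
}


def overlapped_collective(parallel_mode: str, pass_name: str) -> str:
    """Return the collective that is overlapped for a given mode and pass."""
    try:
        return TP_OVERLAP_BY_MODE[parallel_mode][pass_name]
    except KeyError:
        raise ValueError(
            f"No TP overlap described for parallel_mode={parallel_mode!r}, "
            f"pass={pass_name!r}"
        ) from None
```

The column-parallel entries mirror the row-parallel ones, which is why the PR can reuse the same two collectives in the opposite passes.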

Fixes #1312

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactor

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

denera · 2024-11-20T21:54:05Z

/te-ci pytorch L1

timmoon10

Overall LGTM, pending CI.

timmoon10 · 2024-11-21T23:24:48Z

transformer_engine/pytorch/module/linear.py

        ub_overlap_ag: bool = False,
+        ub_overlap_rs: bool = False,
+        ub_bulk_dgrad: bool = False,
+        ub_bulk_wgrad: bool = False,
        ub_name: Optional[str] = None,


We should seriously consider deprecating these UB options and just passing in a dict. The UB interface is unstable and will likely remain so for some time. A dict would be better for backward compatibility (reinterpret old options) and forward compatibility (ignore unknown options). This would be especially helpful for Mcore integration.

For example, the operation-based API passes in UB options with a dict:

TransformerEngine/transformer_engine/pytorch/ops/basic/basic_linear.py

Line 105 in 6b98768

userbuffers_options: Optional[dict[str, Any]] = None,
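A minimal sketch of what the dict-based approach could look like, assuming legacy `ub_*` keyword names are mapped onto shorter keys and unrecognized keys are dropped. The function `normalize_ub_options` and the key names `overlap_ag`, `overlap_rs`, `bulk_dgrad`, `bulk_wgrad`, and `name` are hypothetical; only the `ub_*` flag names come from the diff above, and the real `userbuffers_options` schema may differ.

```python
from typing import Any, Dict, Optional

# Hypothetical defaults for the normalized option set.
_DEFAULTS: Dict[str, Any] = {
    "overlap_ag": False,
    "overlap_rs": False,
    "bulk_dgrad": False,
    "bulk_wgrad": False,
    "name": None,
}

# Backward compatibility: reinterpret the old keyword names from te.Linear.
_LEGACY_KEYS: Dict[str, str] = {
    "ub_overlap_ag": "overlap_ag",
    "ub_overlap_rs": "overlap_rs",
    "ub_bulk_dgrad": "bulk_dgrad",
    "ub_bulk_wgrad": "bulk_wgrad",
    "ub_name": "name",
}


def normalize_ub_options(options: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Return a complete option dict, accepting old names and skipping unknown ones."""
    normalized = dict(_DEFAULTS)
    for key, value in (options or {}).items():
        key = _LEGACY_KEYS.get(key, key)
        if key in normalized:
            normalized[key] = value
        # Forward compatibility: keys this version does not know are ignored.
    return normalized
```

With this shape, a caller written against a newer or older option set keeps working, for example `normalize_ub_options({"ub_overlap_ag": True, "some_future_flag": 1})` enables only the all-gather overlap.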

timmoon10 · 2024-11-21T23:30:11Z

transformer_engine/pytorch/module/linear.py

+        assert not (self.ub_overlap_rs_fprop and self.ub_overlap_ag_fprop), "Internal TE error!"
+        assert not (self.ub_overlap_ag_dgrad and self.ub_overlap_rs_dgrad), "Internal TE error!"
+        assert not (
+            self.ub_overlap_rs_dgrad and (self.ub_bulk_dgrad or self.ub_bulk_wgrad)
+        ), "Internal TE error!"


More descriptive error messages would be helpful.
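One possible shape for more descriptive messages, assuming the same three mutual-exclusion rules as the asserts quoted above. The helper `check_tp_overlap_flags` is hypothetical and is not the code that was eventually merged; later commits in this PR revised both the messages and the asserts themselves.

```python
def check_tp_overlap_flags(
    *,
    overlap_rs_fprop: bool = False,
    overlap_ag_fprop: bool = False,
    overlap_ag_dgrad: bool = False,
    overlap_rs_dgrad: bool = False,
    bulk_dgrad: bool = False,
    bulk_wgrad: bool = False,
) -> None:
    """Raise ValueError naming the conflicting flags instead of a generic message."""
    if overlap_rs_fprop and overlap_ag_fprop:
        raise ValueError(
            "Cannot overlap both reduce-scatter and all-gather in the forward pass "
            "(overlap_rs_fprop and overlap_ag_fprop are both set)."
        )
    if overlap_ag_dgrad and overlap_rs_dgrad:
        raise ValueError(
            "Cannot overlap both all-gather and reduce-scatter with dgrad "
            "(overlap_ag_dgrad and overlap_rs_dgrad are both set)."
        )
    if overlap_rs_dgrad and (bulk_dgrad or bulk_wgrad):
        raise ValueError(
            "Reduce-scatter overlap with dgrad cannot be combined with bulk overlaps "
            f"(bulk_dgrad={bulk_dgrad}, bulk_wgrad={bulk_wgrad})."
        )
```

Naming the offending flags in the message lets a user see which combination was inferred without reading the module source.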

ksivaman

LGTM, much needed

denera · 2024-12-17T21:30:12Z

/te-ci pytorch L1

denera · 2024-12-18T00:42:29Z

/te-ci pytorch L1

denera added the enhancement and 1.13.0 labels Nov 20, 2024

denera requested review from timmoon10 and ksivaman November 20, 2024 21:52

denera self-assigned this Nov 20, 2024

denera force-pushed the linear-tp-overlap-ag-fprop-rs-dgrad branch from 90458d4 to 4e3e61a Compare November 20, 2024 21:53

timmoon10 approved these changes Nov 21, 2024

ksivaman approved these changes Nov 27, 2024

denera and others added 5 commits December 17, 2024 20:47

support AG overlap in sequence-parallel Linear forward and RS overlap…

0d7770b

… in sequence-parallel Linear backward Signed-off-by: Alp Dener <[email protected]>

implemented TP overlap support for column-parallel te.Linear

89426cc

Signed-off-by: Alp Dener <[email protected]>

fixed backward pass for te.Linear column-parallel with TP overlap, up…

4a8d55b

…dated unit tests Signed-off-by: Alp Dener <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

1d9b943

for more information, see https://pre-commit.ci

improved error messages for internal failure to infer TP overlap opti…

360c127

…ons in te.Linear Signed-off-by: Alp Dener <[email protected]>

denera force-pushed the linear-tp-overlap-ag-fprop-rs-dgrad branch from 3951993 to 360c127 Compare December 17, 2024 20:48

denera added 1.14.0 and removed 1.13.0 labels Dec 17, 2024

fixed linting errors

3afa7c1

Signed-off-by: Alp Dener <[email protected]>

denera and others added 2 commits December 18, 2024 00:40

fixed incorrect TP overlap option asserts

bb0a330

Signed-off-by: Alp Dener <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

744a96f

for more information, see https://pre-commit.ci
