add PrepareFloat8ModuleInput for sequence parallel #275

wanchaol · 2024-06-09T19:23:30Z

When applying Sequence Parallel to a module with multiple linear layers for input projection, we often want to transform the input from Shard to Replicate once (a single allgather) and then reuse the allgathered result. For fp8, the cast needs to happen before the Shard -> Replicate transform so that the allgather itself runs in fp8.

This PR subclasses PrepareModuleInput to add the fp8 casting logic, so that we run an fp8 allgather instead of a bf16 allgather followed by a cast for computation.
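As a rough illustration of why the cast order matters, here is a back-of-the-envelope sketch of the allgather payload per rank. It assumes 2 bytes per bf16 element and 1 byte per fp8 element, ignores any extra communication for scales, and uses made-up shapes; `allgather_recv_bytes` is a hypothetical helper, not part of float8_experimental.

```python
# Bytes per element for the two dtypes discussed in this PR.
BYTES_PER_ELEMENT = {"bf16": 2, "fp8": 1}


def allgather_recv_bytes(local_numel: int, world_size: int, dtype: str) -> int:
    """Bytes one rank receives from its peers in a single allgather.

    Each rank holds `local_numel` elements and receives the shards of the
    other `world_size - 1` ranks.
    """
    return local_numel * (world_size - 1) * BYTES_PER_ELEMENT[dtype]


# Hypothetical shapes: a local sequence shard of 2048 tokens with hidden
# size 4096, gathered across a 4-way TP group.
local_numel = 2048 * 4096
tp_degree = 4

bf16_bytes = allgather_recv_bytes(local_numel, tp_degree, "bf16")
fp8_bytes = allgather_recv_bytes(local_numel, tp_degree, "fp8")

print(f"bf16 allgather: {bf16_bytes / 2**20:.0f} MiB received per rank")
print(f"fp8 allgather:  {fp8_bytes / 2**20:.0f} MiB received per rank")
```

Under these assumptions, casting before the allgather halves the bytes on the wire, and the gathered fp8 tensor can be reused by every linear that consumes the same input.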

It also adjusts the test cases to cover the real FFN case for sequence parallel.

torchtitan perf benchmarks (8 H100 devgpu, Llama3 8b, 2-way DP, 4-way TP):

eager (with no fp8 allgather): 3265 wps
eager (with fp8 allgather, this PR): 3900 wps
compile (without fp8 allgather): 5850 wps
compile (with fp8 allgather): 6592 wps, with 37% MFU on H100

So even in eager mode we get around a 20% perf improvement with every allgather running in fp8, and compiled with fp8 allgather, throughput is more than double the eager baseline without fp8 allgather (102% more WPS) :)
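For reference, the percentages can be recomputed from the WPS numbers listed above. This is only arithmetic on the reported figures; the dictionary keys and the `pct_gain` helper are names made up for this sketch.

```python
# Reported torchtitan throughput in words per second (WPS).
wps = {
    "eager_no_fp8_allgather": 3265,
    "eager_fp8_allgather": 3900,
    "compile_no_fp8_allgather": 5850,
    "compile_fp8_allgather": 6592,
}


def pct_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return (new / base - 1.0) * 100.0


# fp8 allgather vs. no fp8 allgather, both in eager mode.
eager_gain = pct_gain(wps["eager_fp8_allgather"], wps["eager_no_fp8_allgather"])
# fp8 allgather vs. no fp8 allgather, both compiled.
compile_gain = pct_gain(wps["compile_fp8_allgather"], wps["compile_no_fp8_allgather"])
# Compiled fp8 allgather vs. the eager baseline without fp8 allgather.
total_gain = pct_gain(wps["compile_fp8_allgather"], wps["eager_no_fp8_allgather"])

print(f"eager:   +{eager_gain:.1f}%")
print(f"compile: +{compile_gain:.1f}%")
print(f"compile fp8 allgather vs. eager baseline: +{total_gain:.1f}%")
```

The "more than doubled" figure is therefore relative to the eager run without fp8 allgather; within compile mode alone, the gain from fp8 allgather is smaller.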

facebook-github-bot · 2024-06-10T05:34:48Z

@wanchaol has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

float8_experimental/float8_tensor_parallel.py

vkuzo · 2024-06-10T21:30:36Z

float8_experimental/float8_tensor_parallel.py

@@ -109,3 +114,94 @@ def _apply(self, module: nn.Module, device_mesh: DeviceMesh) -> nn.Module:
            )

        return super()._apply(module, device_mesh)
+
+
+class PrepareFloat8ModuleInput(PrepareModuleInput):


nit: maybe we can have e4m3 in the name, and maybe add a TODO to support the AMD version of e4m3 eventually?

maybe a quick docblock to explain that this is ensuring the float8 cast happens before the all-gather if there are multiple float8 users of the input activation?

> maybe we can have e4m3 in the name, and maybe add a TODO to support the AMD version of e4m3 eventually

What are your thoughts on these two choices? 1. Make e4m3 appear in the name of this class. 2. Make this class constructor take an additional fp8 dtype argument, i.e. `float8_dtype=torch.float8_e4m3fn`, defaulting to this e4m3fn dtype; later we can support the AMD version of e4m3 by passing a different `float8_dtype` arg.

> make this class constructor take an additional argument of fp8 dtype

sgtm

vkuzo · 2024-06-10T21:35:13Z

float8_experimental/float8_tensor_parallel.py

+
+        # search for ScaledMM configs for all the submodules and make sure they are the same
+        fwd_linear_config = None
+        for mod in module.modules():


WDYT about something like the following to avoid the logic below?

* PrepareFloat8ModuleInput takes a ScaledMMConfig constructor argument
* Float8DynamicLinear has logic where, if the input is already a Float8Tensor, there is a check to verify the config matches

I thought about this option too. My concern is that it would make the API diverge from the TP API offered in core, making the switch between fp8 and bf16 harder.

Also, I think users would need to know how to construct the ScaledMMConfig, which basically makes ScaledMMConfig a public-facing API. I wasn't sure whether this is something we want.

> this basically make ScaledMMConfig be a public facing API

Yeah, good point, that's not intended to be a user facing thing. How about something like requiring a name of the module to get the config from?

I think the user API of the current code is great (no extra args), but the restriction that all submodules in the module need the same config is not ideal. If we are ok with changing that later, the current API sgtm.

I think this makes sense! Let me draft the changes to accept a module_fqn to get the scaled mm config from. My current thinking on how we could approach this:

* We add a fwd_config_module_fqn arg to the constructor so that the user can specify which module's config to take.
* This arg is optional: if the user doesn't pass it in, we still do the search and require that all configs in this specific module are the same.
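The lookup described above could be sketched roughly as follows. This is a plain-Python illustration, not the code in this PR: `FakeScaledMMConfig`, `FakeModule`, and `resolve_fwd_config` are hypothetical stand-ins, and the config fields are placeholders rather than the real ScaledMMConfig definition. Only the `fwd_config_module_fqn` argument name comes from the discussion.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class FakeScaledMMConfig:
    # Hypothetical stand-in for ScaledMMConfig; fields are placeholders.
    emulate: bool = False
    use_fast_accum: bool = False


@dataclass
class FakeModule:
    # Maps submodule FQN -> forward config; None marks a non-float8 submodule.
    configs: Dict[str, Optional[FakeScaledMMConfig]] = field(default_factory=dict)


def resolve_fwd_config(
    module: FakeModule, fwd_config_module_fqn: Optional[str] = None
) -> FakeScaledMMConfig:
    """Pick the forward config used to cast the module input to fp8."""
    if fwd_config_module_fqn is not None:
        # Explicit path: take the config from the submodule the user named.
        cfg = module.configs.get(fwd_config_module_fqn)
        if cfg is None:
            raise ValueError(f"no float8 config found at {fwd_config_module_fqn!r}")
        return cfg

    # Default path: search every submodule and require a single shared config.
    found: Optional[FakeScaledMMConfig] = None
    for fqn, cfg in module.configs.items():
        if cfg is None:
            continue
        if found is None:
            found = cfg
        elif cfg != found:
            raise ValueError(f"config of {fqn!r} differs from the other submodules")
    if found is None:
        raise ValueError("no float8 submodule found")
    return found


# Usage: an FFN-like module where w1 and w3 consume the same input.
shared = FakeScaledMMConfig()
ffn = FakeModule({"w1": shared, "w3": shared, "act": None})
print(resolve_fwd_config(ffn))

mixed = FakeModule({"w1": shared, "w3": FakeScaledMMConfig(use_fast_accum=True)})
print(resolve_fwd_config(mixed, fwd_config_module_fqn="w3"))
```

With no argument the behaviour matches the current restriction (all configs must agree), while passing `fwd_config_module_fqn` lifts that restriction without exposing ScaledMMConfig construction to the user.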

float8_experimental/float8_tensor_parallel.py

test/test_dtensor.py

vkuzo

awesome!

facebook-github-bot · 2024-06-12T19:42:18Z

@wanchaol has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-06-12T22:27:54Z

@wanchaol merged this pull request in 7c7cbae.

This PR is a follow-up to enable fp8 allgather in TP after these PRs landed:

* pytorch/pytorch#128431
* pytorch-labs/float8_experimental#275

One needs to update their pytorch/float8_experimental to include those changes in order to train with fp8. Since fp8 is not enabled as part of our integration tests yet, there should be no issues on CI or for training runs that do not use fp8.

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 9, 2024

lint fixes

3edc3ec

wanchaol requested review from vkuzo, awgu, drisspg and weifengpy June 10, 2024 17:31

vkuzo reviewed Jun 10, 2024

vkuzo reviewed Jun 10, 2024

wanchaol added 2 commits June 11, 2024 22:39

address comments from vkuzo

757be52

lint

77c2353

vkuzo reviewed Jun 12, 2024

vkuzo reviewed Jun 12, 2024

vkuzo approved these changes Jun 12, 2024

wanchaol added 2 commits June 12, 2024 12:39

add more docs/comment and tests

6d04ad9

lint

edec777

facebook-github-bot closed this in 7c7cbae Jun 12, 2024

facebook-github-bot added the Merged label Jun 12, 2024

wanchaol mentioned this pull request Jun 12, 2024

enable TP fp8 allgather with PrepareFloat8ModuleInput pytorch/torchtitan#393

Merged
