
Tensor parallelism #374


Merged

merged 26 commits from tensor_parallelism into main on Jul 22, 2025

Conversation

@le1nux (Member) commented Jun 11, 2025

What does this PR do?

This PR adds support for Tensor Parallelism (including Sequence Parallelism).
Additionally, this PR adds a debugging toolkit that tracks input and output tensors during the forward pass, gradients during the backward pass, and weight tensors.
Tensors can be either normal Tensors or DTensors.
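For context, here is a minimal sketch of the kind of tensor-parallel (plus sequence-parallel) layout this enables, using PyTorch's DTensor-based `parallelize_module` API. The module names, sizes, and mesh shape are illustrative assumptions and not the plan actually used in this PR.

```python
# Minimal sketch of a tensor-parallel (+ sequence-parallel) plan with PyTorch DTensor.
# Module names and sizes are illustrative placeholders, not Modalities module paths.
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    SequenceParallel,
    parallelize_module,
)


class MLPBlock(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.fc_in = nn.Linear(d_model, d_ff)
        self.act = nn.GELU()
        self.fc_out = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.fc_out(self.act(self.fc_in(self.norm(x))))


# Assumes torch.distributed is already initialized with 4 ranks (e.g. via torchrun).
tp_mesh = init_device_mesh("cuda", (4,), mesh_dim_names=("tp",))
block = MLPBlock().cuda()

parallelize_module(
    block,
    tp_mesh,
    {
        # Sequence parallelism: activations entering the norm are sharded on the sequence dim.
        "norm": SequenceParallel(),
        # Column-parallel linear: gathers the sequence shards, shards the output features.
        "fc_in": ColwiseParallel(input_layouts=Shard(1)),
        # Row-parallel linear: reduce-scatters the result back into sequence shards.
        "fc_out": RowwiseParallel(output_layouts=Shard(1)),
    },
)
```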

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)
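Regarding the debugging toolkit mentioned in the description, here is a rough sketch of the tensor-tracking idea based on standard PyTorch module hooks; `DebugTracker` and its method names are hypothetical and not the API introduced by this PR.

```python
# Illustrative sketch of a forward/backward tensor tracker using standard PyTorch hooks.
# DebugTracker is a hypothetical name, not the toolkit actually added by this PR.
import torch
import torch.nn as nn
from torch.distributed.tensor import DTensor


def _describe(t) -> str:
    # DTensors additionally carry shard/replicate placements on a device mesh.
    if isinstance(t, DTensor):
        return f"DTensor(shape={tuple(t.shape)}, placements={t.placements})"
    if isinstance(t, torch.Tensor):
        return f"Tensor(shape={tuple(t.shape)})"
    return type(t).__name__


class DebugTracker:
    def __init__(self, model: nn.Module):
        self.records: list[str] = []
        for name, module in model.named_modules():
            module.register_forward_hook(self._forward_hook(name))
            module.register_full_backward_hook(self._backward_hook(name))

    def _forward_hook(self, name):
        def hook(module, inputs, output):
            ins = [_describe(i) for i in inputs]
            self.records.append(f"[fwd] {name}: in={ins} out={_describe(output)}")
        return hook

    def _backward_hook(self, name):
        def hook(module, grad_input, grad_output):
            grads = [_describe(g) for g in grad_output if g is not None]
            self.records.append(f"[bwd] {name}: grad_out={grads}")
        return hook

    def log_weights(self, model: nn.Module):
        # Weight tensors may be plain Parameters or DTensor-backed Parameters after sharding.
        for name, param in model.named_parameters():
            self.records.append(f"[weight] {name}: {_describe(param)}")
```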

@le1nux le1nux changed the base branch from main to fsdp2_activation_checkpointing June 11, 2025 21:46
@le1nux le1nux marked this pull request as ready for review July 16, 2025 11:57
@le1nux le1nux requested review from flxst and rrutmann July 16, 2025 11:57
@flxst (Member) left a comment
Great work!

I think there is a problem regarding TP + GELU, see my comment in src/modalities/models/model_factory.py.

@rrutmann (Collaborator) left a comment

Looks good to me, I just requested some minor changes in the test

@le1nux (Member, Author) commented Jul 18, 2025

> Great work!
>
> I think there is a problem regarding TP + GELU, see my comment in src/modalities/models/model_factory.py.

I have added GELU support now for TP (including unit tests)
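The thread does not show the underlying GELU issue, but as a minimal sketch of why an element-wise GELU composes with the column-/row-parallel MLP split (assuming the standard Megatron-style layout and PyTorch's DTensor API; shapes and mesh size below are made up):

```python
# Sketch: GELU is element-wise, so it can be applied directly to the feature-sharded
# activation produced by a column-parallel linear, before the row-parallel linear.
# Assumes torch.distributed is initialized (e.g. via torchrun) with 4 ranks.
import torch
import torch.nn.functional as F
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

mesh = init_device_mesh("cuda", (4,))
hidden = torch.randn(8, 128, 2048, device="cuda")         # (batch, seq, d_ff) intermediate activation
hidden_tp = distribute_tensor(hidden, mesh, [Shard(2)])   # column-parallel output: sharded on d_ff
out = F.gelu(hidden_tp, approximate="tanh")               # point-wise op, no communication needed
assert out.placements == (Shard(2),)                      # sharding is preserved for the row-parallel linear
```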

@le1nux le1nux requested review from rrutmann and flxst July 18, 2025 08:41
@flxst (Member) left a comment

Changes look good! Only some minor issues left, see comments.

@rrutmann (Collaborator) left a comment

LGTM :)

@le1nux le1nux requested review from flxst and rrutmann July 18, 2025 13:50
@flxst (Member) left a comment

LGTM

Base automatically changed from fsdp2_activation_checkpointing to main July 22, 2025 14:19
@le1nux le1nux merged commit 1e4d28e into main Jul 22, 2025
4 of 6 checks passed
@le1nux le1nux deleted the tensor_parallelism branch July 22, 2025 21:31