[Paddle] Add TP overlap #443

Merged: 3 commits, Nov 23, 2023


Tom-Zheng
Contributor

This PR overlaps the tensor-parallel (TP) backward allreduce kernel with the GEMM kernel, so that communication and computation run concurrently.
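To illustrate the general idea of overlapping a backward allreduce with a GEMM, here is a minimal pure-Python sketch. It is not the code in this PR and uses no Paddle or Transformer Engine APIs: the helper names (`matmul`, `all_reduce_sum`, `backward_overlapped`) are hypothetical, the "ranks" are simulated as list entries, and a thread pool stands in for an asynchronous communication stream. It assumes a column-parallel linear layer, where the input gradient must be summed across ranks while each weight gradient is purely local.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul(a, b):
    """Plain-Python product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(r) for r in zip(*m)]


def all_reduce_sum(partials):
    """Stand-in for a TP allreduce: element-wise sum over all simulated ranks."""
    rows, cols = len(partials[0]), len(partials[0][0])
    return [[sum(p[i][j] for p in partials) for j in range(cols)] for i in range(rows)]


def backward_overlapped(x, w_shards, dy_shards, pool):
    """Backward of a simulated column-parallel linear layer Y_r = X @ W_r.

    The input gradient dX = sum_r dY_r @ W_r^T needs an allreduce, whereas
    each weight gradient dW_r = X^T @ dY_r is local to its rank.
    """
    # Each simulated rank computes its partial input gradient.
    partial_dgrads = [matmul(dy, transpose(w)) for dy, w in zip(dy_shards, w_shards)]

    # Launch the allreduce asynchronously (assumption: a worker thread plays
    # the role of a separate communication stream).
    handle = pool.submit(all_reduce_sum, partial_dgrads)

    # While the allreduce is in flight, run the independent weight-gradient GEMMs.
    wgrads = [matmul(transpose(x), dy) for dy in dy_shards]

    # Wait for the communication only when its result is actually needed.
    dgrad = handle.result()
    return dgrad, wgrads
```

The design point is the ordering: the allreduce is issued before the weight-gradient GEMM rather than after it, and the wait is deferred until the reduced input gradient is consumed. Whether the two really run concurrently on a GPU depends on the framework's stream and scheduling behavior, which this sketch does not model.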

@Tom-Zheng
Contributor Author

Adding @jeng1220 for visibility.

@zlsh80826
Collaborator

/te-ci

@Tom-Zheng Tom-Zheng marked this pull request as draft September 25, 2023 04:06
Signed-off-by: Tian Zheng (Engrg-Hardware 1) <[email protected]>
@Tom-Zheng Tom-Zheng force-pushed the tizheng/add_tp_comm_async branch from f8bb800 to 3301d4e Compare November 14, 2023 02:42
@Tom-Zheng Tom-Zheng marked this pull request as ready for review November 14, 2023 02:43
@Tom-Zheng
Contributor Author

@jeng1220 Ready for review

@mingxu1067
Collaborator

/te-ci paddle

@jeng1220
Contributor

jeng1220 commented Nov 21, 2023

LGTM.

@timmoon10 ,
All tests passed.
Could you please merge this branch if everything looks good?

@timmoon10
Collaborator

/te-ci paddle

@jeng1220
Contributor

The te-ci/L0_paddle_unittest--L40_1GPU check didn't pass because the dataset download failed:

E RuntimeError: Cannot download https://dataset.bj.bcebos.com/mnist/train-images-idx3-ubyte.gz within retry limit 3
/usr/local/lib/python3.10/dist-packages/paddle/dataset/common.py:93: RuntimeError
=========================== short test summary info ============================
FAILED ../../examples/paddle/mnist/test_single_gpu_mnist.py::TestMNIST::test_te_bf16
FAILED ../../examples/paddle/mnist/test_single_gpu_mnist.py::TestMNIST::test_te_fp8
FAILED ../../examples/paddle/mnist/test_single_gpu_mnist.py::TestMNIST::test_te_fp8_calibration
======================== 3 failed in 364.53s (0:06:04) =========================

@timmoon10,
The failure was a network issue and is unrelated to this PR.
I'd like to merge this PR first; we will then look for a better way to handle the dataset download problem.
(I just tried downloading the dataset on my local machine and it worked fine, so the network issue may be region-specific.)

@timmoon10 timmoon10 merged commit 666539f into NVIDIA:main Nov 23, 2023
9 checks passed
5 participants