
Improving communication overlap for the case of multi kernel queue usage #1308

Merged
13 commits merged into NVIDIA:main on Dec 2, 2024

Conversation

youngeunkwon0405
Collaborator

Description

The current TP overlap implementation relies on a single kernel queue to configure launch ordering and control compute-communication overlap; the overlap fails when multiple kernel queues are used.

This PR enforces launch ordering between the communication kernel and the compute kernel using the LaunchCompletionEvent feature to ensure the overlap.

This feature is specific to Hopper and applies only to bulk overlap cases.
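
For illustration, here is a minimal sketch (not the actual Transformer Engine implementation) of how a launch-completion event can enforce this ordering. It assumes CUDA 12.3 or newer and a Hopper device; the kernel names, launch dimensions, and streams below are hypothetical.

```cpp
// Sketch only: ensure the compute (GEMM) kernel cannot start before the
// communication kernel has fully launched, even when the two streams map to
// different hardware kernel queues.
#include <cuda_runtime.h>

__global__ void comm_kernel() { /* bulk communication work */ }
__global__ void gemm_kernel() { /* compute work */ }

void launch_with_ordering(cudaStream_t comm_stream, cudaStream_t compute_stream) {
  // Event that fires once every block of comm_kernel has *begun* executing,
  // i.e. when the launch has completed, not when the kernel has finished.
  cudaEvent_t launch_done;
  cudaEventCreateWithFlags(&launch_done, cudaEventDisableTiming);

  cudaLaunchAttribute attr{};
  attr.id = cudaLaunchAttributeLaunchCompletionEvent;  // requires CUDA >= 12.3
  attr.val.launchCompletionEvent.event = launch_done;
  attr.val.launchCompletionEvent.flags = 0;

  cudaLaunchConfig_t cfg{};
  cfg.gridDim = dim3(16);
  cfg.blockDim = dim3(128);
  cfg.dynamicSmemBytes = 0;
  cfg.stream = comm_stream;
  cfg.attrs = &attr;
  cfg.numAttrs = 1;

  // Launch the communication kernel with the launch-completion-event attribute.
  cudaLaunchKernelEx(&cfg, comm_kernel);

  // The compute stream waits only for the *launch* of comm_kernel, so the
  // communication kernel is resident on the SMs before the GEMM starts,
  // regardless of which hardware queue each stream maps to.
  cudaStreamWaitEvent(compute_stream, launch_done, 0);
  gemm_kernel<<<dim3(16), dim3(128), 0, compute_stream>>>();

  cudaEventDestroy(launch_done);
}
```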

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactor

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@youngeunkwon0405
Collaborator Author

@erhoo82 Hi Sangkug, this is a PR for launch ordering work. Could you please assign a reviewer?

erhoo82 requested review from denera and erhoo82 on November 2, 2024, 20:30
@denera
Collaborator

denera commented Nov 4, 2024

@youngeunkwon0405 The TP overlap unit tests explicitly set CUDA_DEVICE_MAX_CONNECTIONS=1 in tests/pytorch/distributed/test_comm_gemm_overlap.py:43. Could you update this to not set the environment variable for Hopper so the changes in this PR are tested in our CI?

Also please launch the L1 tests with /te-ci pytorch L1 when you update the unit tests. Thanks!

@youngeunkwon0405
Collaborator Author

youngeunkwon0405 commented Nov 7, 2024

> @youngeunkwon0405 The TP overlap unit tests explicitly set CUDA_DEVICE_MAX_CONNECTIONS=1 in tests/pytorch/distributed/test_comm_gemm_overlap.py:43. Could you update this to not set the environment variable for Hopper so the changes in this PR are tested in our CI?
>
> Also please launch the L1 tests with /te-ci pytorch L1 when you update the unit tests. Thanks!

Hi @denera, I have updated the test_comm_gemm_overlap.py file in the latest commit. Does it meet your expectations?

Also, could you please elaborate on the following? I am new to writing tests and to the CI process.

> please launch the L1 tests with /te-ci pytorch L1 when you update the unit tests.

I have run only the modified test case, and the result is below.
============================= test session starts ==============================
platform linux -- Python 3.10.12, pytest-8.1.1, pluggy-1.5.0 -- /usr/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/workspace/.hypothesis/examples')
rootdir: /lustre/fsw/coreai_dlalgo_llm/youngeunk/nemo/nemo.dev/mount/TransformerEngine-youngeunk
plugins: xdoctest-1.0.2, typeguard-4.3.0, xdist-3.6.1, shard-0.1.2, rerunfailures-14.0, mock-3.14.0, flakefinder-1.1.0, hypothesis-5.35.1, hydra-core-1.3.2, anyio-4.4.0
collecting ... collected 6 items
Running 6 items in this shard: tests/pytorch/distributed/test_comm_gemm_overlap.py::test_bulk_overlaps[ALL-GATHER - BF16 - 1 connections], tests/pytorch/distributed/test_comm_gemm_overlap.py::test_bulk_overlaps[REDUCE-SCATTER - BF16 - 1 connections], tests/pytorch/distributed/test_comm_gemm_overlap.py::test_bulk_overlaps[REDUCE-SCATTER - FP8 - 1 connections], tests/pytorch/distributed/test_comm_gemm_overlap.py::test_bulk_overlaps[ALL-GATHER - BF16 - 8 connections], tests/pytorch/distributed/test_comm_gemm_overlap.py::test_bulk_overlaps[REDUCE-SCATTER - BF16 - 8 connections], tests/pytorch/distributed/test_comm_gemm_overlap.py::test_bulk_overlaps[REDUCE-SCATTER - FP8 - 8 connections]

../lustre/fsw/coreai_dlalgo_llm/youngeunk/nemo/nemo.dev/mount/TransformerEngine-youngeunk/tests/pytorch/distributed/test_comm_gemm_overlap.py::test_bulk_overlaps[ALL-GATHER - BF16 - 1 connections] PASSED
../lustre/fsw/coreai_dlalgo_llm/youngeunk/nemo/nemo.dev/mount/TransformerEngine-youngeunk/tests/pytorch/distributed/test_comm_gemm_overlap.py::test_bulk_overlaps[REDUCE-SCATTER - BF16 - 1 connections] PASSED
../lustre/fsw/coreai_dlalgo_llm/youngeunk/nemo/nemo.dev/mount/TransformerEngine-youngeunk/tests/pytorch/distributed/test_comm_gemm_overlap.py::test_bulk_overlaps[REDUCE-SCATTER - FP8 - 1 connections] PASSED
../lustre/fsw/coreai_dlalgo_llm/youngeunk/nemo/nemo.dev/mount/TransformerEngine-youngeunk/tests/pytorch/distributed/test_comm_gemm_overlap.py::test_bulk_overlaps[ALL-GATHER - BF16 - 8 connections] PASSED
../lustre/fsw/coreai_dlalgo_llm/youngeunk/nemo/nemo.dev/mount/TransformerEngine-youngeunk/tests/pytorch/distributed/test_comm_gemm_overlap.py::test_bulk_overlaps[REDUCE-SCATTER - BF16 - 8 connections] PASSED
../lustre/fsw/coreai_dlalgo_llm/youngeunk/nemo/nemo.dev/mount/TransformerEngine-youngeunk/tests/pytorch/distributed/test_comm_gemm_overlap.py::test_bulk_overlaps[REDUCE-SCATTER - FP8 - 8 connections] PASSED

========================= 6 passed in 93.35s (0:01:33) =========================
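
For context on the "1 connections" vs. "8 connections" cases in the log above, here is a minimal, hypothetical C++ sketch of how the number of kernel queues is controlled (the helper name is made up and this is not the test's actual mechanism). CUDA_DEVICE_MAX_CONNECTIONS is read when the CUDA context is created, so it must be set before the first CUDA call in the process.

```cpp
// Hypothetical helper showing how the number of kernel queues is selected.
#include <cstdlib>
#include <string>
#include <cuda_runtime.h>

void init_with_connections(int num_connections) {
  // With 1 connection, launches from all streams funnel through a single
  // hardware queue, so host-side launch order fixes device-side order (the
  // behavior the previous overlap scheme relied on). With the default of 8,
  // streams may map to different queues and the ordering has to be enforced
  // explicitly, e.g. with the launch-completion event added in this PR.
  setenv("CUDA_DEVICE_MAX_CONNECTIONS",
         std::to_string(num_connections).c_str(), /*overwrite=*/1);
  cudaFree(nullptr);  // force CUDA context creation after the variable is set
}
```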

@denera
Collaborator

denera commented Nov 14, 2024

/te-ci pytorch L1

@denera
Collaborator

denera left a comment

LGTM, pending rebase on latest TE/main and clean CI results.

@youngeunkwon0405
Collaborator Author

@denera Rebased onto the latest main. Could you please let me know what the next step would be?

@denera
Collaborator

denera commented Nov 15, 2024

/te-ci pytorch L1

2 similar comments
@youngeunkwon0405
Collaborator Author

/te-ci pytorch L1

@youngeunkwon0405
Collaborator Author

/te-ci pytorch L1

@denera
Collaborator

denera commented Nov 21, 2024

/te-ci pytorch L0 L1

@denera
Collaborator

denera left a comment

LGTM! We can merge, pending clean CI results. Your CI permissions might not have taken effect yet because I don't see any pipelines for this PR from your trigger. I triggered it again and the pass/fail should show up on GitHub when it's done.

@denera
Collaborator

denera commented Nov 22, 2024

/te-ci pytorch L0 L1

denera merged commit 64126aa into NVIDIA:main on Dec 2, 2024
13 of 14 checks passed
phu0ngng pushed a commit to phu0ngng/TransformerEngine that referenced this pull request on Dec 3, 2024:
Improving communication overlap for the case of multi kernel queue usage (NVIDIA#1308)

* draft implementation

Signed-off-by: Youngeun Kwon <[email protected]>

* compile error fix

Signed-off-by: Youngeun Kwon <[email protected]>

* fix compile error

Signed-off-by: Youngeun Kwon <[email protected]>

* remove print

Signed-off-by: Youngeun Kwon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Edit comments

Signed-off-by: Youngeun Kwon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* edit the bulk-overlap test case

Signed-off-by: Youngeun Kwon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add version guard

Signed-off-by: Youngeun Kwon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add runtime version guard

Signed-off-by: Youngeun Kwon <[email protected]>

* fix the version guard

Signed-off-by: Youngeun Kwon <[email protected]>

---------

Signed-off-by: Youngeun Kwon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>