Add Gloo communication to turbomind #3362

irexyc · 2025-03-28T03:47:40Z

Usage

A two-node example with tp=2, dp=1, and device_num=2:

node 0

export CUDA_VISIBLE_DEVICES=0
lmdeploy serve api_server \
    Qwen/Qwen2.5-7B-Instruct \
    --server-port 29200 \
    --tp 2 \
    --dp 1 \
    --nnodes 2 \
    --node-rank 0 \
    --dist-init-addr 127.0.0.1:7888

node 1

export CUDA_VISIBLE_DEVICES=1
lmdeploy serve api_server \
    Qwen/Qwen2.5-7B-Instruct \
    --server-port 29201 \
    --tp 2 \
    --dp 1 \
    --nnodes 2 \
    --node-rank 1 \
    --dist-init-addr 127.0.0.1:7888
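The two commands above differ only in the visible GPU, the server port, and the node rank, so they can be generated from one template. The following is a minimal sketch, not part of lmdeploy: the variable names and the `launch_cmd` helper are hypothetical, and the mapping of node rank to GPU index and port offset is an assumption that only mirrors this example (both "nodes" share one host at 127.0.0.1 with one GPU each). It uses only the flags shown above and prints the commands instead of running them.

```shell
# Hypothetical settings copied from the example above.
NNODES=2
TP=2
DP=1
BASE_PORT=29200
DIST_INIT_ADDR=127.0.0.1:7888
MODEL=Qwen/Qwen2.5-7B-Instruct

# launch_cmd RANK: print the launch command for node RANK.
# Assumption: GPU index == node rank and port == BASE_PORT + rank,
# which holds for the single-host example but not for real clusters.
launch_cmd() {
    printf 'CUDA_VISIBLE_DEVICES=%s lmdeploy serve api_server %s --server-port %s --tp %s --dp %s --nnodes %s --node-rank %s --dist-init-addr %s\n' \
        "$1" "$MODEL" "$((BASE_PORT + $1))" "$TP" "$DP" "$NNODES" "$1" "$DIST_INIT_ADDR"
}

# Print one command per node rank.
rank=0
while [ "$rank" -lt "$NNODES" ]; do
    launch_cmd "$rank"
    rank=$((rank + 1))
done
```

On a real multi-machine setup, `DIST_INIT_ADDR` would presumably point at an address reachable from every node, and each machine would set its own `CUDA_VISIBLE_DEVICES`.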

irexyc · 2025-04-10T12:53:45Z

oc evaluate diff.csv

src/turbomind/comm/gloo/CMakeLists.txt

src/turbomind/comm/gloo/gloo_comm.cc

lmdeploy/turbomind/turbomind.py

src/turbomind/triton_backend/llama/LlamaTritonModel.cc

lmdeploy/serve/async_engine.py

src/turbomind/core/serdes.h

* init gloo support
* use pytorch tcpstore
* update gateway and support setting devices
* fix build
* use tm cfg instead of env
* fix dp
* fix lint
* fix build
* fix ci
* update gloo version to match pytorch/v2.8.0-rc4
* simplify devices setup
* change the size of engine_params_ to device_per_node
* use dist_init_addr for init
* remove unused
* update
* optimize serialization
* buffer management
* fix wait
* remove constraint that each node must have attn_dp
* add hybrid comm & optimize broadcast
* add test & benchmark code
* add ibverbs transport
* remove grammar deps in irrelevant cmakelists
* use serdes
* hide hostcomm implementation details
* skip serialize buffer of Request.outputs
* fix try_pop
* use default 30mins timeout
* support loading model with 512 experts
* remove unused
* remove ex archive
* use is_loading static var
* fix dummy node logic
* use large timeout for broadcast request
* add comments to metrics
* use hybrid comm as default for multi nodes
* update inter comm split in hybrid comm
* remove unused
* fix lint

init gloo support

3899022

irexyc added the WIP label Mar 28, 2025

irexyc added 4 commits April 2, 2025 13:12

use pytorch tcpstore

916e44b

update gateway and support setting devices

bd5be0f

fix build

ecc2623

use tm cfg instead of env

edb4dfe

irexyc changed the title ~~[WIP] Add Gloo communication to turobmind~~ Add Gloo communication to turobmind Apr 10, 2025

irexyc removed the WIP label Apr 10, 2025

lvhan028 added the enhancement New feature or request label Apr 10, 2025

irexyc added 4 commits April 23, 2025 12:34

Merge remote-tracking branch 'origin/main' into gloo-comm

216ad15

fix dp

9b8d8a7

fix lint

22569cb

fix build

eb03cbf

lvhan028 requested a review from lzhangzz April 24, 2025 03:45

Merge remote-tracking branch 'origin/main' into gloo-comm

e0f2409

lvhan028 reviewed May 12, 2025

src/turbomind/comm/gloo/CMakeLists.txt (outdated, resolved)

lvhan028 reviewed May 12, 2025

src/turbomind/comm/gloo/gloo_comm.cc (outdated, resolved)

lvhan028 reviewed May 12, 2025

lmdeploy/turbomind/turbomind.py (outdated, resolved)

lzhangzz reviewed Jun 27, 2025

irexyc added 10 commits July 11, 2025 05:39

Merge remote-tracking branch 'origin/main' into gloo-comm

aca00e1

fix ci

e78156d

update gloo version to match pytorch/v2.8.0-rc4

e06f256

Merge remote-tracking branch 'github/main' into gloo-comm

dce9eb7

simplify devices setup

63173f1

Merge remote-tracking branch 'github/main' into gloo-comm

386b411

change the size of engine_params_ to device_per_node

ef343f9

use dist_init_addr for init

875d747

remove unused

723d14d

update

24627d0

irexyc added 13 commits November 26, 2025 03:39

Merge remote-tracking branch 'github/main' into gloo-comm

6851447

optimize serialization

6121732

buffer management

889aba5

fix wait

083bfd4

remove constraint that each node must have attn_dp

be25f2d

add hybrid comm & optimize broadcast

6c77a59

add test & benchmark code

8d67213

add ibverbs transport

f5e815f

remove grammar deps in irrelevant cmakelists

983f41e

use serdes

29c27b0

hide hostcomm implementation details

fcb49cf

skip serialize buffer of Request.outputs

ec4f386

fix try_pop

fdd1438

lzhangzz changed the title ~~Add Gloo communication to turobmind~~ Add Gloo communication to turbomind Dec 11, 2025

irexyc added 2 commits December 12, 2025 05:52

use default 30mins timeout

d635c5e

support loading model with 512 experts

2dfb965

lzhangzz reviewed Dec 18, 2025

irexyc added 10 commits December 18, 2025 08:20

remove unused

d4bdd69

remove ex archive

62338b2

use is_loading static var

fb889df

fix dummy node logic

eccf76c

use large timeout for broadcast request

c6c77d7

add comments to metrics

f786edd

use hybrid comm as default for multi nodes

1a94ae1

update inter comm split in hybrid comm

b457e1b

remove unused

b387512

fix lint

b16d9ba

lzhangzz approved these changes Dec 22, 2025

lvhan028 merged commit def3052 into InternLM:main Dec 25, 2025
9 checks passed

3 participants
