Replies: 2 comments
-
BTW the command is:

sudo docker run \
--rm \
--name tt \
--gpus all \
--net=host \
--shm-size=1g \
--ulimit memlock=-1 \
-p 20000:20000 \
-p 30000:30000 \
-v $DATA_DIR/db24/workspace/llm_weights:/llm_weights \
--env "GLOO_SOCKET_IFNAME=bond0" \
--env "NCCL_SOCKET_IFNAME=bond0" \
--env "NCCL_DEBUG=TRACE" \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path /llm_weights/Meta-Llama-3.1-405B-Instruct-AWQ-INT4 --host 0.0.0.0 --port 30000 --tp 4 --nccl-init-addr node0:20000 --nnodes 4 --node-rank $NR --disable-cuda-graph --mem-frac 0.9
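
The same command runs on every node, with $NR set to that node's rank (0-3) and node0 reachable from all nodes. Once the rank-0 server is up, a quick smoke test could look like the sketch below (assuming sglang's native /generate endpoint on the host/port from the command above; adjust to your setup):

```python
# Minimal smoke test for the multi-node deployment (a sketch, assuming the
# server's native /generate endpoint; host/port taken from the command above).
import requests

resp = requests.post(
    "http://node0:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"max_new_tokens": 32, "temperature": 0},
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json())
```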
-
We are working on implementing pipeline parallelism. It will be available soon.
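
In the meantime, here is a toy sketch of the GPipe-style idea the question refers to (not sglang's implementation; `Stage` and `run_pipeline` are names invented for illustration): each node holds a contiguous slice of layers, and the batch is split into micro-batches that flow stage to stage, so only activations cross node boundaries.

```python
# Toy illustration of GPipe-style pipeline parallelism (NOT sglang's code).
# Each Stage stands for the contiguous slice of layers hosted on one node;
# micro-batches flow stage -> stage, so only activations (not weights)
# need to cross node boundaries.
from typing import Callable, List

class Stage:
    def __init__(self, forward: Callable[[List[float]], List[float]]):
        self.forward = forward  # this node's slice of the model

def run_pipeline(stages: List[Stage], batch: List[float],
                 num_microbatches: int = 4) -> List[float]:
    size = max(1, len(batch) // num_microbatches)
    microbatches = [batch[i:i + size] for i in range(0, len(batch), size)]
    outputs: List[float] = []
    for mb in microbatches:      # in real GPipe these overlap across stages
        x = mb
        for stage in stages:     # activation hand-off between nodes
            x = stage.forward(x)
        outputs.extend(x)
    return outputs

if __name__ == "__main__":
    # Four "nodes", each applying a stand-in for its slice of layers.
    stages = [Stage(lambda xs, k=k: [v + k for v in xs]) for k in range(4)]
    print(run_pipeline(stages, [float(v) for v in range(8)]))
```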
-
Currently I run Llama 3.1 405B AWQ-INT4 on 4 nodes, each with 4× 40 GB GPUs.

First I used `--tp=16`, which breaks the GEMMs into smaller blocks, but the inference speed is very slow (Decode batch. #running-req: 1, #token: 2451, token usage: 0.01, gen throughput (token/s): 4.23, #queue-req: 0).

Then I tried `--tp=4`, but loading the model ran out of GPU memory. What I was trying to achieve was tensor parallelism within each node's 4 GPUs over NVLink, with activations passed across nodes.

Next I tried `--tp=8` and found that each node only used 2 GPU cards, so I guess `--tp` with multiple nodes indicates how many cards to use in total.

I'm not sure whether sglang supports this scenario: tensor parallelism within each node, with the model split into stages across nodes in the GPipe or PipeDream style. If this scenario is supported, how do I configure it?

Putting the scenario into a diagram:
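
For reference, a rough back-of-the-envelope estimate (a sketch assuming ~405B parameters at ~0.5 bytes each for the 4-bit weights, ignoring KV cache, activations, and quantization metadata) matches what I observed: `--tp=4` cannot hold the weights on 40 GB cards, while `--tp=16` fits but sends activations across nodes at every layer.

```python
# Rough per-GPU weight footprint for a ~405B-parameter model in 4-bit (AWQ INT4).
# Assumptions (sketch only): ~0.5 bytes/param, ignoring KV cache, activations,
# and quantization scales/zero-points; GB treated as GiB for this estimate.
PARAMS = 405e9
BYTES_PER_PARAM = 0.5   # 4-bit weights
GPU_MEM_GB = 40         # each card in this setup

weights_gib = PARAMS * BYTES_PER_PARAM / 2**30   # ~189 GiB total
for tp in (4, 8, 16):
    per_gpu = weights_gib / tp   # tensor parallelism shards every weight matrix over tp GPUs
    status = "fits" if per_gpu < GPU_MEM_GB else "exceeds"
    print(f"tp={tp:2d}: ~{per_gpu:5.1f} GiB of weights per GPU ({status} a {GPU_MEM_GB} GB card)")
# tp=4  -> ~47 GiB per GPU: more than 40 GB, hence the out-of-memory on load.
# tp=16 -> ~12 GiB per GPU: fits, but every layer's GEMM now spans all 4 nodes,
#          so activations cross the inter-node network at every layer (slow decode).
```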