[Bug] Llama 405B FP8 causes OOM on 16xA40 #1439

Open · sumukshashidhar opened this issue Sep 16, 2024 · 1 comment

sumukshashidhar commented Sep 16, 2024

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I'm trying to run Llama-405B in FP8 on 4 nodes of 4x A40, each GPU with ~44GB of VRAM. Theoretically speaking, this is plenty of VRAM for 405B in FP8, given that I only need about 405GB of VRAM total, +/- 50GB, during inference. However, I keep running into OOM errors after all the safetensors checkpoints have been loaded, which does not make much sense to me.

I've tried reducing the cache, the context length, etc., but that does not seem to have much of an effect on the OOM.
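
For reference, here is a rough back-of-envelope estimate of the per-GPU weight footprint (my own numbers: ~405B parameters at 1 byte per parameter in FP8, split evenly across tp=16, ignoring activations and any non-quantized layers):

405B params x 1 byte (FP8)   ~= 405 GB of weights
405 GB / 16 GPUs             ~= ~25.3 GB of weights per GPU
~44 GB per A40 - 25.3 GB     ~= ~19 GB per GPU left for KV cache, activations, and runtime overhead

So on paper each GPU should have headroom, which is why the OOM after checkpoint loading is surprising.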

Reproduction

On each of the nodes, I run the following command, varying --node-rank per node (the command below is from the rank-3 node; the rank-0 variant is shown after it for completeness):

GLOO_SOCKET_IFNAME=eno12399np0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 16 --nccl-init-addr 172.22.224.17:20000 --nnodes 4 --node-rank 3 --disable-cuda-graph --kv-cache-dtype fp8_e5m2 --chunked-prefill-size 1024 --mem-fraction-static 0.9 --disable-disk-cache
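
The rank-0 variant of the same launch (only --node-rank differs; the GLOO_SOCKET_IFNAME value may of course differ per machine):

GLOO_SOCKET_IFNAME=eno12399np0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 16 --nccl-init-addr 172.22.224.17:20000 --nnodes 4 --node-rank 0 --disable-cuda-graph --kv-cache-dtype fp8_e5m2 --chunked-prefill-size 1024 --mem-fraction-static 0.9 --disable-disk-cache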

Environment

I have 4 nodes of 4x A40 each for distributed inference, linked by a 25GbE backbone. All of them have the same environment; one of them is detailed below:

Python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA A40
GPU 0,1,2,3 Compute Capability: 8.6
CUDA_HOME: None
PyTorch: 2.4.0+cu121
sglang: 0.3.1
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.44.2
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.5
fastapi: 0.114.2
hf_transfer: 0.1.8
huggingface_hub: 0.24.7
interegular: 0.3.3
packaging: 24.1
PIL: 10.4.0
psutil: 6.0.0
pydantic: 2.9.1
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 26.2.0
vllm: 0.5.5
multipart: 0.0.9
openai: 1.45.1
anthropic: 0.34.2
NVIDIA Topology: 
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV4     SYS     SYS     0,2,4,6,8,10    0               N/A
GPU1    NV4      X      SYS     SYS     0,2,4,6,8,10    0               N/A
GPU2    SYS     SYS      X      NV4     1,3,5,7,9,11    1               N/A
GPU3    SYS     SYS     NV4      X      1,3,5,7,9,11    1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 1024
merrymercy (Contributor) commented Sep 22, 2024

  1. Could you show the full log?
  2. Reduce --mem-fraction-static to prevent OOM, instead of increasing it (you have set it to 0.9); the default value is 0.8 in this case (see the adjusted command below).
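
For example, the same launch with only the memory fraction lowered back to the 0.8 default (the exact value here is illustrative; go lower if it still OOMs, and keep the per-node --node-rank as before):

GLOO_SOCKET_IFNAME=eno12399np0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 16 --nccl-init-addr 172.22.224.17:20000 --nnodes 4 --node-rank 3 --disable-cuda-graph --kv-cache-dtype fp8_e5m2 --chunked-prefill-size 1024 --mem-fraction-static 0.8 --disable-disk-cache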
