Replies: 2 comments
-
Can you share your command? Thank you!
-
I discovered an interesting thing: by placing the MoE experts' weights in the CPU's pinned memory during loading, the Triton kernels for FusedMoE can operate directly on the pinned CPU tensors. This allows the DeepSeek-R1 model to run on a GPU with only 40GB of VRAM. I successfully ran the AWQ model on an A800 GPU, although the performance was relatively poor. The model is cognitivecomputations/DeepSeek-R1-AWQ.
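For context, here is a minimal sketch of the idea, not the actual patch: `torch.Tensor.pin_memory()` places a copy of a tensor in page-locked host memory, which CUDA (and therefore Triton) kernels can read over PCIe via zero-copy when unified virtual addressing is available. The `"experts"` substring match and the helper name are hypothetical assumptions about how the expert weights are identified:

```python
import torch

def pin_moe_expert_weights(model: torch.nn.Module) -> None:
    """Keep MoE expert weights in pinned (page-locked) host memory.

    Pinned CPU tensors remain addressable from the GPU (zero-copy over
    PCIe under unified virtual addressing), so a Triton FusedMoE kernel
    can read them without first copying each expert into VRAM. The
    "experts" substring check below is a hypothetical naming convention.
    """
    for name, param in model.named_parameters():
        if "experts" in name:
            # pin_memory() returns a page-locked host copy of the tensor
            param.data = param.data.cpu().pin_memory()
```

Since every expert read then goes over PCIe (tens of GB/s) rather than HBM (TB/s), decode throughput drops sharply, which would explain the relatively poor performance noted above.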