
CUDA-graph-compatible releasing and resuming KV cache and model weight memory #2630

Open · fzyzcjy wants to merge 122 commits into base: main
Conversation

@fzyzcjy (Contributor) commented on Dec 28, 2024

Related: #2542 and #2583

Outdated Content

The test will currently fail because it relies on LD_PRELOAD (to intercept and change the behavior of cudaMalloc and cudaFree). If the general logic looks good, I will update this PR to handle that part (e.g., by setting LD_PRELOAD automatically when creating the backend process).
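For reference, a minimal sketch of what "setting LD_PRELOAD automatically" could look like; the function name and argument handling here are hypothetical, not the PR's actual code:

```python
# Hypothetical sketch: prepend the torch_memory_saver shared library to
# LD_PRELOAD before spawning the backend process, so its cudaMalloc/cudaFree
# hooks are loaded ahead of the CUDA runtime resolving those symbols.
import os
import subprocess

def launch_backend(cmd: list, saver_lib: str) -> subprocess.Popen:
    env = dict(os.environ)
    existing = env.get("LD_PRELOAD", "")
    env["LD_PRELOAD"] = saver_lib + (":" + existing if existing else "")
    return subprocess.Popen(cmd, env=env)
```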

How to execute it

Suppose this branch of SGLang is checked out at /path/to/sglang. Then, inside SGLang's Docker container, execute the following:

```bash
# install torch_memory_saver (currently installed from source; it will be published on pip later)
git clone https://github.com/fzyzcjy/torch_memory_saver
(cd torch_memory_saver && make reinstall)

cd /path/to/sglang
PYTHONPATH=$(pwd)/python LD_PRELOAD=/sgl-workspace/torch_memory_saver/torch_memory_saver_cpp.cpython-310-x86_64-linux-gnu.so python3 test/srt/test_release_gpu_occupation.py
```

Expected results are shown below: the x-axis is time, and red indicates memory consumption. The dip in the middle is caused by temporarily releasing the KV cache memory.

[figure: GPU memory consumption over time]

What's changed

Though the PR seems large, most of it is boilerplate.

Core:

  • Wrap allocations inside with primary_memory_saver.region(): TokenToKVPool.k_buffers/v_buffers, ModelRunner.model, ReqToTokenPool.req_to_token
  • Call primary_memory_saver.pause()/.resume() in scheduler.py, from Scheduler.release_gpu_occupation/resume_gpu_occupation (see the sketch after this list)
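A minimal sketch of how these pieces fit together; the import path, exact torch_memory_saver API, and tensor shape are assumptions based on the PR description, not verbatim code from this branch:

```python
# Illustrative sketch, not the PR's actual code.
import torch
from torch_memory_saver import torch_memory_saver as primary_memory_saver

# Allocations made inside region() are tagged so their physical pages can be
# released later while their virtual addresses stay fixed, which is what keeps
# captured CUDA graphs valid across pause/resume.
with primary_memory_saver.region():
    k_buffer = torch.empty(1 << 20, device="cuda")  # stands in for a KV cache buffer

primary_memory_saver.pause()   # release physical GPU memory, keep virtual mapping
primary_memory_saver.resume()  # re-map physical memory at the same virtual addresses
```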

Others:

  • Add Engine.release_gpu_occupation/resume_gpu_occupation: this requires adding several request structs such as ReleaseGPUOccupationReqInput, plus forwarding boilerplate in Engine/TokenizerManager (see the usage sketch after this list)
  • Add a base class BaseCausalLM: this requires changing every model's parent class
  • Add tests
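A hypothetical usage sketch of the new Engine methods; the constructor arguments, model path, and generate call are illustrative, while the two occupation methods come from the PR description:

```python
# Hypothetical usage of the Engine API added in this PR.
import sglang as sgl

engine = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")  # illustrative args
print(engine.generate("Hello")["text"])

engine.release_gpu_occupation()   # free KV cache and model weight memory
# ... the GPU is now free for another workload, e.g. an RLHF training step ...
engine.resume_gpu_occupation()    # re-acquire the memory and continue serving
print(engine.generate("Hello again")["text"])
```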

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@fzyzcjy mentioned this pull request on Dec 31, 2024
@fzyzcjy changed the title from "Allow release memory and later resume (compatible with CUDA graph)" to "CUDA-graph-compatible releasing and resuming of KV cache and model weight memory" on Dec 31, 2024
@fzyzcjy changed the title from "CUDA-graph-compatible releasing and resuming of KV cache and model weight memory" to "CUDA-graph-compatible releasing and resuming KV cache and model weight memory" on Dec 31, 2024