
CUDA-graph-compatible releasing and resuming KV cache and model weight memory #2630

Open · fzyzcjy wants to merge 122 commits into base: main
Conversation

@fzyzcjy (Contributor) commented on Dec 28, 2024

Related: #2542 and #2583

Outdated Content

The test will currently fail because it relies on LD_PRELOAD (to intercept and change the behavior of cudaMalloc and cudaFree). If the general logic looks good, I will update this PR to handle that part (e.g., by setting LD_PRELOAD automatically when creating the backend process).
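For reference, a minimal sketch of what "setting LD_PRELOAD automatically" could look like; the function name and argument handling here are hypothetical, not the PR's actual code:

```python
# Hypothetical sketch: prepend the torch_memory_saver shared library to
# LD_PRELOAD before spawning the backend process, so its cudaMalloc/cudaFree
# hooks are loaded ahead of the CUDA runtime resolving those symbols.
import os
import subprocess

def launch_backend(cmd: list, saver_lib: str) -> subprocess.Popen:
    env = dict(os.environ)
    existing = env.get("LD_PRELOAD", "")
    env["LD_PRELOAD"] = saver_lib + (":" + existing if existing else "")
    return subprocess.Popen(cmd, env=env)
```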

How to execute it

Suppose this branch of SGLang is checked out at /path/to/sglang. Then, inside SGLang's Docker container, execute the following:

```bash
# install torch_memory_saver (currently installed from source; it will be published on pip later)
git clone https://github.com/fzyzcjy/torch_memory_saver
(cd torch_memory_saver && make reinstall)

cd /path/to/sglang
PYTHONPATH=$(pwd)/python LD_PRELOAD=/sgl-workspace/torch_memory_saver/torch_memory_saver_cpp.cpython-310-x86_64-linux-gnu.so python3 test/srt/test_release_gpu_occupation.py
```

Expected results are shown below: the x-axis is time, and red indicates memory consumption. The dip in the middle is caused by temporarily releasing the KV cache memory.

[figure: GPU memory consumption over time]

What's changed

Though the PR seems large, most of it is boilerplate.

Core:

  • Wrap allocations inside with primary_memory_saver.region(): TokenToKVPool.k_buffers/v_buffers, ModelRunner.model, ReqToTokenPool.req_to_token
  • Call primary_memory_saver.pause()/.resume() in scheduler.py, from Scheduler.release_gpu_occupation/resume_gpu_occupation (see the sketch after this list)
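A minimal sketch of how these pieces fit together; the import path, exact torch_memory_saver API, and tensor shape are assumptions based on the PR description, not verbatim code from this branch:

```python
# Illustrative sketch, not the PR's actual code.
import torch
from torch_memory_saver import torch_memory_saver as primary_memory_saver

# Allocations made inside region() are tagged so their physical pages can be
# released later while their virtual addresses stay fixed, which is what keeps
# captured CUDA graphs valid across pause/resume.
with primary_memory_saver.region():
    k_buffer = torch.empty(1 << 20, device="cuda")  # stands in for a KV cache buffer

primary_memory_saver.pause()   # release physical GPU memory, keep virtual mapping
primary_memory_saver.resume()  # re-map physical memory at the same virtual addresses
```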

Others:

  • Add Engine.release_gpu_occupation/resume_gpu_occupation: this requires adding several request structs such as ReleaseGPUOccupationReqInput, plus forwarding boilerplate in Engine/TokenizerManager (see the usage sketch after this list)
  • Add a base class BaseCausalLM: this requires changing every model's parent class
  • Add tests
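A hypothetical usage sketch of the new Engine methods; the constructor arguments, model path, and generate call are illustrative, while the two occupation methods come from the PR description:

```python
# Hypothetical usage of the Engine API added in this PR.
import sglang as sgl

engine = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")  # illustrative args
print(engine.generate("Hello")["text"])

engine.release_gpu_occupation()   # free KV cache and model weight memory
# ... the GPU is now free for another workload, e.g. an RLHF training step ...
engine.resume_gpu_occupation()    # re-acquire the memory and continue serving
print(engine.generate("Hello again")["text"])
```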

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@fzyzcjy mentioned this pull request on Dec 31, 2024
@fzyzcjy changed the title from "Allow release memory and later resume (compatible with CUDA graph)" to "CUDA-graph-compatible releasing and resuming of KV cache and model weight memory" on Dec 31, 2024
@fzyzcjy changed the title from "CUDA-graph-compatible releasing and resuming of KV cache and model weight memory" to "CUDA-graph-compatible releasing and resuming KV cache and model weight memory" on Dec 31, 2024