
[Feature] Proposal: Releasing SGLang memory when idle #2583

Open · fzyzcjy opened this issue Dec 26, 2024 · 12 comments

@fzyzcjy
Contributor

fzyzcjy commented Dec 26, 2024

Proposal 1: Release KV cache when engine is idle

When using SGLang for generation in a training pipeline (such as PPO), SGLang currently holds a lot of GPU memory during the phase where the HuggingFace model runs forward/backward, even though it does not use that memory. It would be great to make SGLang use as little memory as possible when it is idle.

Example use cases:

  • Suppose we run OpenRLHF on 8xH100; currently we may allocate 4xH100 for vLLM/SGLang and the other 4xH100 for the HF models (thanks @zhaochenyang20 for providing this usage scenario).
    • If we make SGLang use little memory when idle, then we can run the same experiment on half the number of GPUs (4xH100) by putting the SGLang engines on the same GPUs as the HF models.
  • Suppose we run PPO on 1xH100 for a 7B model with Adam offloading (thanks @zhaochenyang20 for providing this usage scenario). Then policy (7B x 2 bytes) + critic (7B x 2 bytes) + ref (7B x 2 bytes) + reward (7B x 2 bytes) already takes 56 GB. SGLang currently needs another 7B x 2 bytes = 14 GB for the weights plus some memory for the KV cache, so it may not be easy to fit everything on the 80 GB card.
    • If we implement proposal 1 and proposal 2, we will have roughly 24 GB of room for HF model forward/backward, and 24 GB of room for SGLang to do generation. (We may have more if we quantize the ref & reward models, though it is not clear whether that will work.)
  • Suppose we run OpenRLHF on 1x4090 for a 0.5B model; then memory is also very limited, similar to the 1xH100 & 7B case.
    • If the proposals are successfully implemented, we may be able to run in such scenarios.

One potential memory optimization is to release the KV cache:

  • When the training pipeline does not need SGLang (e.g. while doing HF model forward/backward in PPO), put SGLang into a "paused" mode, and later "resume" it when we need SGLang to do generation.
  • When SGLang enters "paused" mode, release the KV cache (link to hacky experiment) by simply deleting the tensors.
  • When SGLang later "resumes", re-create the KV cache tensors (a minimal sketch is given after this proposal).

I will open a PR for this as soon as I have some time (hopefully soon).
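To make the idea concrete, here is a minimal sketch of pause/resume, assuming a simplified per-layer K/V buffer layout; the class and method names are illustrative and do not correspond to SGLang's actual memory pool:

```python
import torch


class PausableKVCache:
    """Illustrative sketch only (not SGLang's real KV pool): per-layer K/V
    buffers that are dropped on pause() and re-allocated empty on resume()."""

    def __init__(self, num_layers, max_tokens, num_heads, head_dim,
                 dtype=torch.bfloat16, device="cuda"):
        self.num_layers = num_layers
        self.buf_shape = (max_tokens, num_heads, head_dim)
        self.dtype, self.device = dtype, device
        self.k_buffers = self.v_buffers = None
        self.resume()

    def pause(self):
        # Drop the tensors and hand the freed blocks back from PyTorch's caching
        # allocator to the driver, so other work on the same GPU (e.g. the HF
        # forward/backward) can use the memory.
        self.k_buffers = self.v_buffers = None
        torch.cuda.empty_cache()

    def resume(self):
        # Re-create empty KV buffers; any previously cached prefixes are lost.
        self.k_buffers = [torch.empty(self.buf_shape, dtype=self.dtype, device=self.device)
                          for _ in range(self.num_layers)]
        self.v_buffers = [torch.empty(self.buf_shape, dtype=self.dtype, device=self.device)
                          for _ in range(self.num_layers)]
```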

Proposal 2: Release model weights when engine is paused

Another part of the memory occupied by SGLang is the model weights. Thus one potential approach is:

  • When SGLang is paused, delete the model weights (e.g. maybe via model.to('meta'), not tested) to release memory.
  • When SGLang is resumed, re-create empty model weights (e.g. via model.to_empty(device='cuda')); see the sketch after this list.
  • Then, users should call update_weight to provide new weights to SGLang.
    • This is not an extra overhead, because in some RLHF pipelines we already need to call update_weight before a generate so that it uses the latest weights instead of outdated ones.
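A rough sketch of this, assuming the untested to('meta') / to_empty route mentioned above; update_weight simply stands for whatever weight-update path the user already calls:

```python
import torch


def release_weights(model: torch.nn.Module) -> torch.nn.Module:
    # Move parameters/buffers to the 'meta' device: shapes and dtypes are kept,
    # but the CUDA storages are freed (untested idea, as noted above).
    model = model.to("meta")
    torch.cuda.empty_cache()
    return model


def restore_weights(model: torch.nn.Module) -> torch.nn.Module:
    # Re-materialize uninitialized parameters on the GPU. Their values are
    # garbage at this point, so update_weight must run before any generate.
    return model.to_empty(device="cuda")
```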

Proposal 3: Update SGLang model weights when on same GPU

Currently, when we do update_weight to copy the HF model weights into the SGLang model weights, it seems we use a torch broadcast operation. However, when users put the HuggingFace model and the SGLang model on the same GPU, it may be possible to use a more lightweight mechanism that avoids the overhead of the broadcast.

To be more specific:

  • Initialization
    • Users provide their HF model to the SGLang Engine.
    • SGLang shares the tensors of this model with the SGLang runtime process.
  • Weight update
    • Users trigger an "update weight from the previously provided HF model" operation.
    • The SGLang runtime process reads the aforementioned tensors to update the SGLang model weights.

This is just a rough draft and there are more details to work out. For example, if the tensor objects in the HF model can change, then we may need to send the new tensors across processes again.
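As a hedged illustration of the "share once, then read on demand" idea, the sketch below relies on the fact that CUDA tensors passed through torch.multiprocessing are shared via CUDA IPC handles rather than copied; the process layout and function names are placeholders, not SGLang's runtime code:

```python
import torch
import torch.multiprocessing as mp


def sglang_runtime(queue):
    # Placeholder for the SGLang runtime process. The tensors received here
    # share the same CUDA storage as the HF model in the parent process, so an
    # "update weight" is just a device-to-device copy on the same GPU.
    shared = queue.get()  # dict[str, cuda Tensor], zero-copy via CUDA IPC
    engine_weights = {name: torch.empty_like(t) for name, t in shared.items()}
    for name, src in shared.items():
        engine_weights[name].copy_(src)  # no broadcast, no host round-trip


if __name__ == "__main__":
    mp.set_start_method("spawn")
    # Stand-in for the user's HF model weights living on the same GPU.
    hf_weights = {"layers.0.weight": torch.randn(4096, 4096, device="cuda")}
    queue = mp.Queue()
    runtime = mp.Process(target=sglang_runtime, args=(queue,))
    runtime.start()
    queue.put(hf_weights)  # only IPC handles cross the process boundary
    runtime.join()         # keep hf_weights alive while the runtime reads them
```

If the training step later re-creates its parameter tensors (i.e. new storages), the handles would need to be re-shared, which is exactly the caveat in the paragraph above.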

Related: #2542
cc @zhaochenyang20

@fzyzcjy fzyzcjy changed the title [Feature] More detailed proposal of releasing SGLang memory when idle [Feature] Proposal: Releasing SGLang memory when idle Dec 26, 2024
@zhaochenyang20
Collaborator

If we make SGLang use little memory when idle, then we can run the same experiment on half the number of GPUs (4xH100) by putting those SGLang engines on the same GPUs as HF models.

I am not sure how we can save half of the GPUs. That is, can we allocate the inference engine and the training engine on the same GPU? This means that when the weight update ends, we should release the VRAM of training and give it to the inference engine. When sampling ends, we should release the VRAM of inference and give it to the training engine.

@zhaochenyang20
Collaborator

zhaochenyang20 commented Dec 26, 2024

If we implement proposal 1 and proposal 2, we will have roughly 24 GB of room for HF model forward/backward, and 24 GB of room for SGLang to do generation. (We may have more if we quantize the ref & reward models, though it is not clear whether that will work.)

I am curious: could we also release some of the VRAM of the training engine? For example, releasing the reward model and the reference model, since they are fixed and we can always reload them, even from disk. For the policy model and the critic model, I don't think we can offload them the same way, since their weights are updated and we would at least need a torch.save to put the weights on disk; or could we offload them to CPU?

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 26, 2024

I am not sure how we can save half of the GPUs. That is, can we allocate the inference engine and the training engine on the same GPU?

Yes, I think so.

This means that when the weight update ends, we should release the VRAM of training and give it to the inference engine.

Yes, we can release the VRAM of activations, gradients, etc. (We may not be able to release the model memory, but that does not look large.)

When sampling ends, we should release the VRAM of inference and give it to the training engine.

Yes

I am curious: could we also release some of the VRAM of the training engine? For example, releasing the reward model and the reference model, since they are fixed and we can always reload them, even from disk. For the policy model and the critic model, I don't think we can offload them the same way, since their weights are updated and we would at least need a torch.save to put the weights on disk; or could we offload them to CPU?

Surely yes, but I am worried that reloading from CPU may not be fast, since it is bound by CPU-GPU bandwidth (reloading from disk is much slower). I have not run experiments on H100 before, so to be honest I do not know the numbers. If such a reload happens infrequently, then we can surely do that :)
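For reference, a sketch of the CPU-offload path being discussed (pinned host memory plus the same to('meta') / to_empty trick as in proposal 2). As a rough, unmeasured estimate: ~14 GB of bf16 weights for a 7B model over ~25-50 GB/s of effective PCIe bandwidth is on the order of a few hundred milliseconds per direction, so the cost mostly depends on how often the reload happens.

```python
import torch


def offload_to_cpu(model: torch.nn.Module) -> dict:
    # Copy every parameter/buffer into pinned host memory, then free the GPU copy.
    cpu_state = {name: t.detach().to("cpu").pin_memory()
                 for name, t in model.state_dict().items()}
    model.to("meta")
    torch.cuda.empty_cache()
    return cpu_state


def reload_from_cpu(model: torch.nn.Module, cpu_state: dict) -> torch.nn.Module:
    # Re-materialize the parameters on the GPU and copy the saved values back;
    # non_blocking=True lets the H2D copies overlap because the source is pinned.
    model = model.to_empty(device="cuda")
    model.load_state_dict({k: v.to("cuda", non_blocking=True) for k, v in cpu_state.items()})
    torch.cuda.synchronize()
    return model
```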

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 26, 2024

Another idea: we may even be able to delete the HF policy model weights during the generation phase, because the weights are already inside the SGLang model. To be more specific:

  • Train phase - run forward/backward on HF model
  • Update SGLang model weight from HF model
  • Delete HF model weight tensors to free space
  • Generation phase - run SGLang generate
  • Update HF model weight from SGLang model
  • Delete SGLang model weight tensor to free space
  • ... loop again

If the model weights are in bf16, then this is no problem; if they are in fp32, we may need some extra work.

There is some engineering work though: right now we only have the "hf to sglang" weight conversion logic, and we would have to implement the other direction as well.
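A very rough sketch of one iteration of this loop. All engine methods used here (update_weights_from_hf, export_weights_to_hf, release_weights) are hypothetical placeholders, handling of optimizer state is omitted, and the export step is exactly the missing "sglang to hf" direction mentioned above:

```python
import torch


def train_then_generate(hf_model, optimizer, sglang_engine, batch, prompts):
    # 1. Train phase: the HF model holds the only full copy of the policy.
    loss = hf_model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # 2. Push weights into SGLang, then drop the HF copy to free VRAM.
    sglang_engine.update_weights_from_hf(hf_model)   # hypothetical API
    hf_model = hf_model.to("meta")
    torch.cuda.empty_cache()

    # 3. Generation phase: SGLang holds the only copy of the policy.
    rollouts = sglang_engine.generate(prompts)

    # 4. Copy weights back into a re-materialized HF model and drop the
    #    SGLang copy (this is the missing "sglang to hf" direction).
    hf_model = hf_model.to_empty(device="cuda")
    sglang_engine.export_weights_to_hf(hf_model)     # hypothetical API
    sglang_engine.release_weights()                  # hypothetical API
    return hf_model, rollouts
```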

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 26, 2024

Btw, have you tried a quantized ref model and reward model (to save memory)? I am wondering whether that is possible, or whether the performance would be severely degraded...

@zhaochenyang20
Collaborator

Surely yes, but I am worried that reloading from CPU may not be fast, since it is bound by CPU-GPU bandwidth (reloading from disk is much slower). I have not run experiments on H100 before, so to be honest I do not know the numbers. If such a reload happens infrequently, then we can surely do that :)

We can give it a try. Do you need H100 access right now? We can provide it to you immediately.

@zhaochenyang20
Collaborator

Btw, have you tried a quantized ref model and reward model (to save memory)? I am wondering whether that is possible, or whether the performance would be severely degraded...

Not sure. As far as I know, we rarely do this.

@zhaochenyang20
Collaborator

Another idea: we may even be able to delete the HF policy model weights during the generation phase, because the weights are already inside the SGLang model. To be more specific:

  • Train phase - run forward/backward on HF model
  • Update SGLang model weight from HF model
  • Delete HF model weight tensors to free space
  • Generation phase - run SGLang generate
  • Update HF model weight from SGLang model
  • Delete SGLang model weight tensor to free space
  • ... loop again

If the model weights are in bf16, then this is no problem; if they are in fp32, we may need some extra work.

There is some engineering work though: right now we only have the "hf to sglang" weight conversion logic, and we would have to implement the other direction as well.

Could be amazing and crazy. We would save GPU memory to an extreme degree.

In my mind, the loop would be:

  1. Rollout using the SGLang engine, then release its VRAM.
  2. Make experiences with the policy, reference, reward, and critic models, then release the VRAM of the reference and reward models, since they won't be updated.
  3. Update the policy and critic. Keep the critic, but send the policy to SGLang (so we only have one copy of the policy at any time).

I think this is crazy.

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 26, 2024

We can give it a try. Do you need H100 access right now? We can provide it to you immediately.

Thank you! I probably will not have full blocks of time for the next several days (so I can only do things like easy cleanup for SGLang, which can use fragmented time and does not need a full block of time), but I will ping you when I need an H100.

Btw, have you tried a quantized ref model and reward model (to save memory)? I am wondering whether that is possible, or whether the performance would be severely degraded...
Not sure. As far as I know, we rarely do this.

Got it. I have seen people (e.g. https://github.com/unslothai/unsloth) often use QLoRA, which quantizes the model even to 4-bit, and I know inference engines often work well in fp8, so I am wondering whether that is possible (or will be possible in the future).

Could be amazing and crazy. We would save GPU memory to an extreme degree.
In my mind, the loop would be:

Yes!

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 26, 2024

@zhaochenyang20 PR submitted: #2588. Currently I have only written some unit tests (test/srt/test_release_gpu_occupation.py) and made them pass; I will do more later.

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 29, 2024

Updates are shown in #2542

@ZSL98

ZSL98 commented Dec 30, 2024

Hi! The proposals you are discussing have a concrete precedent in https://github.com/volcengine/verl: training and rollout engines can be placed on the same GPU with proper weight offloading.
