
[Feature] Proposal: Releasing SGLang memory when idle #2583

Open · fzyzcjy opened this issue Dec 26, 2024 · 12 comments

@fzyzcjy
Contributor

fzyzcjy commented Dec 26, 2024

Proposal 1: Release KV cache when engine is idle

When using SGLang for generation in a training pipeline (such as PPO), SGLang currently holds a lot of GPU memory during the phase where the HuggingFace model runs forward/backward, even though it does not use that memory. It would be great to make SGLang use as little memory as possible when it is idle.

Example use cases:

  • Suppose we run OpenRLHF on 8xH100; currently we may allocate 4xH100 for vLLM/SGLang and the other 4xH100 for the HF models (thanks @zhaochenyang20 for providing this usage scenario).
    • If we make SGLang use little memory when idle, then we can run the same experiment on half the number of GPUs (4xH100) by putting the SGLang engines on the same GPUs as the HF models.
  • Suppose we run PPO on 1xH100 for a 7B model with Adam offloading (thanks @zhaochenyang20 for providing this usage scenario). Then policy (7B x 2 bytes) + critic (7B x 2 bytes) + ref (7B x 2 bytes) + reward (7B x 2 bytes) already takes 56 GB. SGLang currently needs another 7B x 2 bytes = 14 GB for the weights plus some memory for the KV cache, so it may not be easy to fit everything on the 80 GB card.
    • If we implement proposal 1 and proposal 2, we will have roughly 24 GB of room for HF model forward/backward, and 24 GB of room for SGLang to do generation. (We may have more if we quantize the ref & reward models, though it is not clear whether that will work.)
  • Suppose we run OpenRLHF on 1x4090 for a 0.5B model; then memory is also very limited, similar to the 1xH100 & 7B case.
    • If the proposals are successfully implemented, we may be able to run in such scenarios.

One potential memory optimization is to release the KV cache:

  • When the training pipeline does not need SGLang (e.g. while doing HF model forward/backward in PPO), put SGLang into a "paused" mode, and later "resume" it when we need SGLang to do generation.
  • When SGLang enters "paused" mode, release the KV cache (link to hacky experiment) by simply deleting the tensors.
  • When SGLang later "resumes", re-create the KV cache tensors (a minimal sketch is given after this proposal).

I will open a PR for this as soon as I have some time (hopefully soon).
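To make the idea concrete, here is a minimal sketch of pause/resume, assuming a simplified per-layer K/V buffer layout; the class and method names are illustrative and do not correspond to SGLang's actual memory pool:

```python
import torch


class PausableKVCache:
    """Illustrative sketch only (not SGLang's real KV pool): per-layer K/V
    buffers that are dropped on pause() and re-allocated empty on resume()."""

    def __init__(self, num_layers, max_tokens, num_heads, head_dim,
                 dtype=torch.bfloat16, device="cuda"):
        self.num_layers = num_layers
        self.buf_shape = (max_tokens, num_heads, head_dim)
        self.dtype, self.device = dtype, device
        self.k_buffers = self.v_buffers = None
        self.resume()

    def pause(self):
        # Drop the tensors and hand the freed blocks back from PyTorch's caching
        # allocator to the driver, so other work on the same GPU (e.g. the HF
        # forward/backward) can use the memory.
        self.k_buffers = self.v_buffers = None
        torch.cuda.empty_cache()

    def resume(self):
        # Re-create empty KV buffers; any previously cached prefixes are lost.
        self.k_buffers = [torch.empty(self.buf_shape, dtype=self.dtype, device=self.device)
                          for _ in range(self.num_layers)]
        self.v_buffers = [torch.empty(self.buf_shape, dtype=self.dtype, device=self.device)
                          for _ in range(self.num_layers)]
```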

Proposal 2: Release model weights when engine is paused

Another part of the memory occupied by SGLang is the model weights. Thus one potential approach is:

  • When SGLang is paused, delete the model weights (e.g. maybe via model.to('meta'), not tested) to release memory.
  • When SGLang is resumed, re-create empty model weights (e.g. via model.to_empty(device='cuda')); see the sketch after this list.
  • Then, users should call update_weight to provide new weights to SGLang.
    • This is not an extra overhead, because in some RLHF pipelines we already need to call update_weight before a generate so that it uses the latest weights instead of outdated ones.
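A rough sketch of this, assuming the untested to('meta') / to_empty route mentioned above; update_weight simply stands for whatever weight-update path the user already calls:

```python
import torch


def release_weights(model: torch.nn.Module) -> torch.nn.Module:
    # Move parameters/buffers to the 'meta' device: shapes and dtypes are kept,
    # but the CUDA storages are freed (untested idea, as noted above).
    model = model.to("meta")
    torch.cuda.empty_cache()
    return model


def restore_weights(model: torch.nn.Module) -> torch.nn.Module:
    # Re-materialize uninitialized parameters on the GPU. Their values are
    # garbage at this point, so update_weight must run before any generate.
    return model.to_empty(device="cuda")
```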

Proposal 3: Update SGLang model weights when on same GPU

Currently, when we do update_weight to copy the HF model weights into the SGLang model weights, it seems we use a torch broadcast operation. However, when users put the HuggingFace model and the SGLang model on the same GPU, it may be possible to use a more lightweight mechanism that avoids the overhead of the broadcast.

To be more specific:

  • Initialization
    • Users provide their HF model to the SGLang Engine.
    • SGLang shares the tensors of this model with the SGLang runtime process.
  • Weight update
    • Users trigger an "update weight from the previously provided HF model" operation.
    • The SGLang runtime process reads the aforementioned tensors to update the SGLang model weights.

This is just a rough draft and there are more details to work out. For example, if the tensor objects in the HF model can change, then we may need to send the new tensors across processes again.
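As a hedged illustration of the "share once, then read on demand" idea, the sketch below relies on the fact that CUDA tensors passed through torch.multiprocessing are shared via CUDA IPC handles rather than copied; the process layout and function names are placeholders, not SGLang's runtime code:

```python
import torch
import torch.multiprocessing as mp


def sglang_runtime(queue):
    # Placeholder for the SGLang runtime process. The tensors received here
    # share the same CUDA storage as the HF model in the parent process, so an
    # "update weight" is just a device-to-device copy on the same GPU.
    shared = queue.get()  # dict[str, cuda Tensor], zero-copy via CUDA IPC
    engine_weights = {name: torch.empty_like(t) for name, t in shared.items()}
    for name, src in shared.items():
        engine_weights[name].copy_(src)  # no broadcast, no host round-trip


if __name__ == "__main__":
    mp.set_start_method("spawn")
    # Stand-in for the user's HF model weights living on the same GPU.
    hf_weights = {"layers.0.weight": torch.randn(4096, 4096, device="cuda")}
    queue = mp.Queue()
    runtime = mp.Process(target=sglang_runtime, args=(queue,))
    runtime.start()
    queue.put(hf_weights)  # only IPC handles cross the process boundary
    runtime.join()         # keep hf_weights alive while the runtime reads them
```

If the training step later re-creates its parameter tensors (i.e. new storages), the handles would need to be re-shared, which is exactly the caveat in the paragraph above.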

Related: #2542
cc @zhaochenyang20

@fzyzcjy fzyzcjy changed the title [Feature] More detailed proposal of releasing SGLang memory when idle [Feature] Proposal: Releasing SGLang memory when idle Dec 26, 2024
@zhaochenyang20
Collaborator

If we make SGLang use little memory when idle, then we can run the same experiment on half the number of GPUs (4xH100) by putting those SGLang engines on the same GPUs as HF models.

I am not sure how we can save half of the GPUs. That is, can we allocate the inference engine and the training engine on the same GPU? This means that when the weight update ends, we should release the VRAM of training and give it to the inference engine. When sampling ends, we should release the VRAM of inference and give it to the training engine.

@zhaochenyang20
Collaborator

zhaochenyang20 commented Dec 26, 2024

If we implement proposal 1 and proposal 2, we will have roughly 24 GB of room for HF model forward/backward, and 24 GB of room for SGLang to do generation. (We may have more if we quantize the ref & reward models, though it is not clear whether that will work.)

I am curious: could we also release some of the VRAM of the training engine? For example, releasing the reward model and the reference model, since they are fixed and we can always reload them, even from disk. For the policy model and the critic model, I don't think we can offload them the same way, since their weights are updated and we would at least need a torch.save to put the weights on disk; or could we offload them to CPU?

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 26, 2024

I am not sure how we can save half of the GPUs. That is, can we allocate the inference engine and the training engine on the same GPU?

Yes, I think so.

This means that when the weight update ends, we should release the VRAM of training and give it to the inference engine.

Yes, we can release the VRAM of activations, gradients, etc. (We may not be able to release the model memory, but that does not look large.)

When sampling ends, we should release the VRAM of inference and give it to the training engine.

Yes

I am curious: could we also release some of the VRAM of the training engine? For example, releasing the reward model and the reference model, since they are fixed and we can always reload them, even from disk. For the policy model and the critic model, I don't think we can offload them the same way, since their weights are updated and we would at least need a torch.save to put the weights on disk; or could we offload them to CPU?

Surely yes, but I am worried that reloading from CPU may not be fast, since it is bound by CPU-GPU bandwidth (reloading from disk is much slower). I have not run experiments on H100 before, so to be honest I do not know the numbers. If such a reload happens infrequently, then we can surely do that :)
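For reference, a sketch of the CPU-offload path being discussed (pinned host memory plus the same to('meta') / to_empty trick as in proposal 2). As a rough, unmeasured estimate: ~14 GB of bf16 weights for a 7B model over ~25-50 GB/s of effective PCIe bandwidth is on the order of a few hundred milliseconds per direction, so the cost mostly depends on how often the reload happens.

```python
import torch


def offload_to_cpu(model: torch.nn.Module) -> dict:
    # Copy every parameter/buffer into pinned host memory, then free the GPU copy.
    cpu_state = {name: t.detach().to("cpu").pin_memory()
                 for name, t in model.state_dict().items()}
    model.to("meta")
    torch.cuda.empty_cache()
    return cpu_state


def reload_from_cpu(model: torch.nn.Module, cpu_state: dict) -> torch.nn.Module:
    # Re-materialize the parameters on the GPU and copy the saved values back;
    # non_blocking=True lets the H2D copies overlap because the source is pinned.
    model = model.to_empty(device="cuda")
    model.load_state_dict({k: v.to("cuda", non_blocking=True) for k, v in cpu_state.items()})
    torch.cuda.synchronize()
    return model
```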

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 26, 2024

Another idea: we may even be able to delete the HF policy model weights during the generation phase, because the weights are already inside the SGLang model. To be more specific:

  • Train phase - run forward/backward on HF model
  • Update SGLang model weight from HF model
  • Delete HF model weight tensors to free space
  • Generation phase - run SGLang generate
  • Update HF model weight from SGLang model
  • Delete SGLang model weight tensor to free space
  • ... loop again

If the model weights are in bf16, then this is no problem; if they are in fp32, we may need some extra work.

There is some engineering work though: right now we only have the "hf to sglang" weight conversion logic, and we would have to implement the other direction as well.
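A very rough sketch of one iteration of this loop. All engine methods used here (update_weights_from_hf, export_weights_to_hf, release_weights) are hypothetical placeholders, handling of optimizer state is omitted, and the export step is exactly the missing "sglang to hf" direction mentioned above:

```python
import torch


def train_then_generate(hf_model, optimizer, sglang_engine, batch, prompts):
    # 1. Train phase: the HF model holds the only full copy of the policy.
    loss = hf_model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # 2. Push weights into SGLang, then drop the HF copy to free VRAM.
    sglang_engine.update_weights_from_hf(hf_model)   # hypothetical API
    hf_model = hf_model.to("meta")
    torch.cuda.empty_cache()

    # 3. Generation phase: SGLang holds the only copy of the policy.
    rollouts = sglang_engine.generate(prompts)

    # 4. Copy weights back into a re-materialized HF model and drop the
    #    SGLang copy (this is the missing "sglang to hf" direction).
    hf_model = hf_model.to_empty(device="cuda")
    sglang_engine.export_weights_to_hf(hf_model)     # hypothetical API
    sglang_engine.release_weights()                  # hypothetical API
    return hf_model, rollouts
```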

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 26, 2024

Btw, have you tried a quantized ref model and reward model (to save memory)? I am wondering whether that is possible, or whether the performance would be severely degraded...

@zhaochenyang20
Collaborator

Surely yes, but I am worried that reloading from CPU may not be fast, since it is bound by CPU-GPU bandwidth (reloading from disk is much slower). I have not run experiments on H100 before, so to be honest I do not know the numbers. If such a reload happens infrequently, then we can surely do that :)

We can give it a try. Do you need H100 access right now? We can provide it to you immediately.

@zhaochenyang20
Collaborator

Btw, have you tried a quantized ref model and reward model (to save memory)? I am wondering whether that is possible, or whether the performance would be severely degraded...

Not sure. As far as I know, we rarely do this.

@zhaochenyang20
Collaborator

Another idea: we may even be able to delete the HF policy model weights during the generation phase, because the weights are already inside the SGLang model. To be more specific:

  • Train phase - run forward/backward on HF model
  • Update SGLang model weight from HF model
  • Delete HF model weight tensors to free space
  • Generation phase - run SGLang generate
  • Update HF model weight from SGLang model
  • Delete SGLang model weight tensor to free space
  • ... loop again

If the model weights are in bf16, then this is no problem; if they are in fp32, we may need some extra work.

There is some engineering work though: right now we only have the "hf to sglang" weight conversion logic, and we would have to implement the other direction as well.

Could be amazing and crazy. We would save GPU memory to an extreme degree.

In my mind, the loop would be:

  1. Rollout using the SGLang engine, then release its VRAM.
  2. Make experiences with the policy, reference, reward, and critic models, then release the VRAM of the reference and reward models, since they won't be updated.
  3. Update the policy and critic. Keep the critic, but send the policy to SGLang (so we only have one copy of the policy at any time).

I think this is crazy.

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 26, 2024

We can give it a try. Do you need H100 access right now? We can provide it to you immediately.

Thank you! I probably will not have full blocks of time for the next several days (so I can only do things like easy cleanup for SGLang, which can use fragmented time and does not need a full block of time), but I will ping you when I need an H100.

Btw, have you tried a quantized ref model and reward model (to save memory)? I am wondering whether that is possible, or whether the performance would be severely degraded...
Not sure. As far as I know, we rarely do this.

Got it. I have seen people (e.g. https://github.com/unslothai/unsloth) often use QLoRA, which quantizes the model even to 4-bit, and I know inference engines often work well in fp8, so I am wondering whether that is possible (or will be possible in the future).

Could be amazing and crazy. We would save GPU memory to an extreme degree.
In my mind, the loop would be:

Yes!

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 26, 2024

@zhaochenyang20 PR submitted: #2588. Currently I have only written some unit tests (test/srt/test_release_gpu_occupation.py) and made them pass; I will do more later.

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 29, 2024

Updates are shown in #2542

@ZSL98

ZSL98 commented Dec 30, 2024

Hi! The proposals you are discussing have a concrete precedent in https://github.com/volcengine/verl: training and rollout engines can be placed on the same GPU with proper weight offloading.
