[Feature] Proposal: Releasing SGLang memory when idle #2583
Comments
I am not sure how we can save half of the GPUs. That is, can we place the inference engine and the training engine on the same GPU? This would mean that when the weight update ends, we release the training VRAM and give it to the inference engine; when sampling ends, we release the inference VRAM and give it back to the training engine.
I am curious: could we also release some of the VRAM of the training engine? For example, releasing the reward model and the reference model, since they are fixed and we can always reload them, even from disk. For the policy model and the critic model, I don't think we can offload them, since their weights are updated and we need to keep the latest copies.
Yes I think so
Yes, we can release the VRAM of activations, gradients, etc. (though we may not be able to release the model memory itself, that part does not look large).
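As a minimal sketch of what releasing training-side memory between phases could look like in plain PyTorch (this is an illustration, not this project's actual API; the model and optimizer here are toy stand-ins):

```python
import torch

# Toy stand-ins for the policy model and its optimizer.
model = torch.nn.Linear(8, 8)
opt = torch.optim.AdamW(model.parameters())

# One training step so gradients and AdamW moment buffers actually exist.
loss = model(torch.randn(2, 8)).sum()
loss.backward()
opt.step()

# Between training phases: drop optimizer state and gradients; weights stay.
opt.state.clear()            # release AdamW exp_avg / exp_avg_sq buffers
for p in model.parameters():
    p.grad = None            # release gradient tensors

# On GPU one would follow with torch.cuda.empty_cache() to return the freed
# blocks to the driver; skipped here since this sketch runs on CPU.
print(all(p.grad is None for p in model.parameters()))
```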
Yes
Surely yes, but I am worried that reloading from CPU may not be fast, since it is bounded by CPU-GPU bandwidth (reloading from disk is much slower). I have not run experiments on an H100 before, so to be honest I do not know the numbers. If such a reload happens infrequently, then we can surely do that :)
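For intuition on the reload cost, here is a back-of-envelope estimate (all numbers are assumptions, not measurements: a 7B-parameter model in bf16 and roughly 25 GB/s effective host-to-device bandwidth over pinned PCIe; H100 systems may do considerably better):

```python
def reload_seconds(n_params: float, bytes_per_param: int, bandwidth_gbps: float) -> float:
    """Estimate host-to-device reload time for model weights."""
    total_gb = n_params * bytes_per_param / 1e9
    return total_gb / bandwidth_gbps

# Assumed: 7B params x 2 bytes (bf16) = 14 GB, moved at ~25 GB/s.
t = reload_seconds(7e9, 2, 25.0)
print(round(t, 2))  # → 0.56, i.e. roughly half a second per reload
```

So if the reload happens once per training iteration rather than once per request, the cost may well be acceptable.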
Another idea: we may even be able to delete the policy model weights during the generation phase, because the weights are already inside the SGLang model. To be more specific:
If the model weights are in bf16, then this is no problem; if in fp32, we may have some extra work. There is some engineering work though: right now we only have the "hf to sglang" weight conversion logic, and we would have to implement the reverse direction.
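The bf16/fp32 distinction above can be illustrated with a small dtype-cast sketch (a hypothetical helper, not the project's actual "hf to sglang" conversion code): when training runs in fp32, each parameter would need a cast before being handed to a bf16 inference engine, while integer buffers should pass through untouched.

```python
import torch

def cast_state_dict_bf16(state_dict):
    """Cast floating-point weights to bf16; leave non-float tensors
    (e.g. integer buffers) untouched. Hypothetical helper for illustration."""
    return {
        name: (t.to(torch.bfloat16) if t.is_floating_point() else t)
        for name, t in state_dict.items()
    }

# Toy state dict standing in for fp32 HF weights plus an integer buffer.
sd = {"w": torch.randn(4, 4), "step": torch.tensor(3)}
out = cast_state_dict_bf16(sd)
print(out["w"].dtype, out["step"].dtype)
```

In the bf16-training case this whole step disappears, which is why that case is "no problem".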
Btw, have you tried quantizing the ref model and reward model (to save memory)? I am wondering whether that is possible, or whether the performance would be severely degraded...
We can have a try. Do you need H100 access right now? We can provide it to you immediately.
Not sure. As far as I know, we rarely do this.
That could be amazing, even crazy. We would push GPU savings to an extreme. In my mind, we would loop like this:
I think this is crazy.
Thank you! I probably will not have full slots of time for the next several days (so I can only do things like easy cleanup for SGLang, which fits into fragmented time and does not need a full slot), but I will ping you when I need an H100.
Got it. I saw that people (e.g. https://github.com/unslothai/unsloth) often use QLoRA, which quantizes the model even to 4-bit, and I know inference engines often work well in fp8, so I am wondering whether that is possible (or will be possible in the future).
Yes!
@zhaochenyang20 PR submitted: #2588. Currently it only contains some unit tests.
Updates are shown in #2542.
Hi! The proposals you are discussing have some concrete evidence in https://github.com/volcengine/verl: training and rollout engines can be placed on the same GPU with proper weight offloading.
Proposal 1: Release KV cache when engine is idle
When using SGLang for generation in a training pipeline (such as PPO), SGLang currently holds a lot of memory during the HuggingFace model forward/backward phase, even though it is not using it. It would be great to make SGLang use as little memory as possible when it is idle.
Example use cases:
One potential optimization for memory is to release KV cache:
I will open a PR for this as soon as I have some time (hopefully soon).
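To get a sense of how much memory releasing the KV cache could free, here is a back-of-envelope sizing sketch (the Llama-2-7B-like shapes below are assumptions for illustration, not a measurement of SGLang's actual pool):

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, dtype_bytes: int = 2) -> int:
    """Bytes held by a K+V cache pool: two tensors (K and V) per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * num_tokens

# Assumed Llama-2-7B-like shapes: 32 layers, 32 KV heads, head_dim 128, bf16.
per_token = kv_cache_bytes(32, 32, 128, 1)
print(per_token)                                    # → 524288 bytes (0.5 MiB per cached token)
print(kv_cache_bytes(32, 32, 128, 65536) / 2**30)   # → 32.0 GiB for a 64Ki-token pool
```

Since the pool is typically pre-allocated to a fraction of GPU memory regardless of load, releasing it while the engine is idle could return tens of GiB to the training engine.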
Proposal 2: Release model weights when engine is paused
Another part of the memory occupied by SGLang is the model weights. Thus one potential solution is:

1. Call `model.to('meta')` (not tested) to release the weight memory while the engine is paused.
2. When resuming, call `model.to_empty(device='cuda')` to re-allocate the parameters (with uninitialized values).
3. Call `update_weight` to provide new weights to SGLang.
4. Always run `update_weight` before a `generate`, so generation uses the latest updated weights instead of outdated weights.

Proposal 3: Update SGLang model weights when on same GPU
Currently, when we do `update_weight` to copy HF model weights into SGLang model weights, it seems we use the torch `broadcast` operation. However, when users put the HuggingFace model and the SGLang model on the same GPU, it may be possible to use a more lightweight solution that avoids the overhead of `broadcast`.

To be more specific:
This is just a rough draft and there are more details to work out. For example, if it is possible for the tensor objects in the HF model to change, then we may need to send the new tensors across processes again.
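As an in-process illustration of the idea (a sketch, not SGLang's actual update path): when the source and destination tensors already live on the same device, a plain `copy_` is a device-local memcpy with no collective communication. For the cross-process same-GPU case, the analogous mechanism would be sharing the tensor storage once (e.g. via CUDA IPC handles) and then doing the same local copy.

```python
import torch

# CPU tensors here so the sketch runs anywhere; on GPU the same copy_
# stays device-local and avoids any broadcast/serialization round trip.
hf_param = torch.randn(4, 4)       # stands in for an updated HF weight
engine_param = torch.empty(4, 4)   # stands in for the SGLang-side weight

engine_param.copy_(hf_param)       # local copy, no collective operation
print(torch.equal(engine_param, hf_param))  # → True
```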
Related: #2542
cc @zhaochenyang20