Support for GPU checkpointing in nvproxy #11095

Closed
cweld510 opened this issue Oct 31, 2024 · 9 comments
Labels
area: gpu Issue related to sandboxed GPU access type: enhancement New feature or request

Comments

@cweld510
Contributor

cweld510 commented Oct 31, 2024

Description

We're interested in some form of GPU checkpointing - is this something that the gvisor team plans on supporting at any point?

Generally, existing GPU checkpointing implementations described in papers like Singularity or Cricket intercept CUDA calls via LD_PRELOAD. Prior to a checkpoint, they record stateful calls in a log, which is stored at checkpoint time along with the contents of GPU memory. At restore time, GPU memory is reloaded and the log is replayed. Both frameworks also have to do some virtualization of device pointers.

It seems (perhaps naively) that a similar scheme might be possible within nvproxy, which already intercepts calls to the GPU driver. In theory, nvproxy could record a subset of calls made to the GPU driver and replay them at checkpoint-restore time, virtualizing file descriptors and device pointers as needed; and separately, support copying contents of GPU memory off the device to a file and back.

This is clearly complex. I'm curious if you all believe it to be viable and plan on exploring the scheme described above, or a different one, at any point?

Is this feature related to a specific bug?

No response

Do you have a specific solution in mind?

No response

@cweld510 cweld510 added the type: enhancement New feature or request label Oct 31, 2024
@EtiennePerot
Contributor

Have you looked at #10478 (which I believe was filed by one of your colleagues :))?
I believe cuda-checkpoint should work well within gVisor now that NVIDIA has fixed the issue described in that bug, and should allow GPU checkpointing to work in gVisor without the complexity of recording and replaying CUDA calls.

@cweld510
Contributor Author

Interesting, I assumed that NVIDIA hadn't fixed the issue since NVIDIA/cuda-checkpoint#4 is still open, but honestly, I haven't tried running cuda-checkpoint again recently on pytorch within gvisor. I will do that.

@ayushr2 ayushr2 added the area: gpu Issue related to sandboxed GPU access label Oct 31, 2024
@ayushr2
Collaborator

ayushr2 commented Oct 31, 2024

I would recommend trying the latest driver (R565 I believe).

@cweld510
Contributor Author

cweld510 commented Nov 6, 2024

Thanks! I'll reply back when I've had a chance to try the latest driver. Really appreciate the help on this.

@tianyuzhou95
Contributor

tianyuzhou95 commented Jan 23, 2025

> I would recommend trying the latest driver (R565 I believe).

The latest driver that gVisor currently supports is 560.35.03, so I tried that version with cuda-checkpoint and gVisor C/R. Unfortunately, runsc checkpoint still fails with the same error that @ayushr2 mentioned here: encoding error: can't save with live nvproxy clients

> I believe cuda-checkpoint should work well within gVisor now that NVIDIA has fixed the issue described in that bug

cuda-checkpoint itself does not seem to have been updated. Does the latest NVIDIA driver fix this bug?

Could you please give more information on how to make this (PyTorch + cuda-checkpoint + gVisor C/R) work? Or is there a branch I could try with the latest driver (565)?

cc @amysaq2023 @btw616


PS: details of my setup:

runsc: master branch with commit id: c238e15234feef339823ad328f7c1208d0b276d7
host kernel: 5.15.0-130-generic (ubuntu 22.04)
host nvidia driver: 560.35.05

runtime config

"runsc-gpu": {
        "path": "/usr/local/bin/runsc",
        "runtimeArgs": [
                "--debug-log=/tmp/runsc/",
                "--platform=systrap",
                "--nvproxy=true",
                "--nvproxy-driver-version=560.35.03"
        ]
},

How I run the vLLM container:

sudo docker run --runtime=runsc-gpu --gpus all -d \
    -v ~/.cache/modelscope:/root/.cache/modelscope \
    -v /xxx/cuda-checkpoint/bin/x86_64_Linux:/cuda-cr --env "VLLM_USE_MODELSCOPE=True" \
    -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model qwen/Qwen2.5-0.5B-Instruct \
    --dtype=half

How I use cuda-checkpoint:

sudo docker exec <cid> /cuda-cr/cuda-checkpoint --toggle --pid <vllm engine pid>

How I use runsc checkpoint:

sudo runsc --root=/var/run/docker/runtime-runc/moby checkpoint -image-path /path/to/image/ <cid>

@ayushr2
Collaborator

ayushr2 commented Jan 23, 2025

@tianyuzhou95 It seems NVIDIA/cuda-checkpoint#4 is still not fixed in any of the releases I have tried. This (PyTorch apps not being checkpointable) is a cuda-checkpoint bug, not a gVisor one. Could you follow up with NVIDIA about the timeline?

@tianyuzhou95
Contributor

@ayushr2 Of course, I will follow up with NVIDIA on the fix for this. It looks like they plan to support it in early 2025. Thanks!

@ayushr2
Collaborator

ayushr2 commented Jan 27, 2025

As per NVIDIA/cuda-checkpoint#4 (comment), NVML workloads (like PyTorch apps) should be checkpointable starting with R570 drivers.
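
For anyone scripting this check, here is a minimal shell sketch of the version gate implied above. It assumes that "R570" means a driver version whose first dot-separated component is 570 or higher. The function name driver_supports_nvml_checkpoint is hypothetical and is not part of any NVIDIA or gVisor tool.

```shell
#!/bin/sh
# Hypothetical helper (not from any NVIDIA or gVisor tool).
# Succeeds (exit 0) if the given driver version string is on the R570
# branch or newer, fails (exit 1) if it is older, and returns 2 if the
# leading component is not a plain number.
driver_supports_nvml_checkpoint() {
    version="$1"
    major="${version%%.*}"    # "560.35.03" -> "560"
    case "$major" in
        ''|*[!0-9]*) return 2 ;;    # empty or non-numeric input
    esac
    [ "$major" -ge 570 ]
}
```

On a host with the driver installed, the version string could come from nvidia-smi --query-gpu=driver_version --format=csv,noheader. This check only covers the driver branch; the driver version also has to be one that nvproxy supports, which is a separate question.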

@ayushr2
Collaborator

ayushr2 commented Jan 27, 2025

The recommended way of checkpointing CUDA applications in gVisor is to run the cuda-checkpoint binary inside the gVisor sandbox on all CUDA processes, then use runsc checkpoint to generate the checkpoint image.

On restore, use runsc restore to restore the sandbox and then run cuda-checkpoint on all the CUDA processes again to unfreeze them.
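
The ordering above can be sketched as a small shell wrapper. This is only an illustration under stated assumptions: the function names, the variable names, and the DRY_RUN switch are hypothetical. The cuda-checkpoint path, the --toggle --pid flags, the runsc --root value, and the -image-path flag are copied from the commands posted earlier in this thread. Using docker exec after the restore is also an assumption, and a real runsc restore invocation may need further arguments (for example, the container bundle) that are omitted here. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the checkpoint/restore ordering. All helper and variable names
# here are hypothetical. Commands are printed, not executed, unless the
# script is started with DRY_RUN=0 (and then it needs root privileges).

DRY_RUN="${DRY_RUN:-1}"
# Defaults taken from the commands posted earlier in this thread.
RUNSC_ROOT="${RUNSC_ROOT:-/var/run/docker/runtime-runc/moby}"
CUDA_CHECKPOINT="${CUDA_CHECKPOINT:-/cuda-cr/cuda-checkpoint}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# toggle_cuda CID PID...
# Run cuda-checkpoint --toggle inside the sandbox for every CUDA process.
toggle_cuda() {
    cid="$1"; shift
    for pid in "$@"; do
        run docker exec "$cid" "$CUDA_CHECKPOINT" --toggle --pid "$pid" || return 1
    done
}

# checkpoint_sandbox CID IMAGE_DIR PID...
# Step 1: toggle every CUDA process. Step 2: write the checkpoint image.
checkpoint_sandbox() {
    cid="$1"; image_dir="$2"; shift 2
    toggle_cuda "$cid" "$@" || return 1
    run runsc --root="$RUNSC_ROOT" checkpoint -image-path "$image_dir" "$cid"
}

# restore_sandbox CID IMAGE_DIR PID...
# Step 1: restore the sandbox. Step 2: toggle every CUDA process again.
restore_sandbox() {
    cid="$1"; image_dir="$2"; shift 2
    run runsc --root="$RUNSC_ROOT" restore -image-path "$image_dir" "$cid" || return 1
    toggle_cuda "$cid" "$@"
}
```

For example, after sourcing the script, checkpoint_sandbox <cid> /path/to/image/ <pid1> <pid2> prints two cuda-checkpoint commands followed by one runsc checkpoint command. The PIDs are the ones seen inside the sandbox, since cuda-checkpoint runs there.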

@ayushr2 ayushr2 closed this as completed Jan 27, 2025