Support for GPU checkpointing in nvproxy

### Description

We're interested in some form of GPU checkpointing. Is this something that the gVisor team plans on supporting at any point?

Generally, existing GPU checkpointing implementations described in papers like [Singularity](https://arxiv.org/pdf/2202.07848) or [Cricket](https://onlinelibrary.wiley.com/doi/epdf/10.1002/cpe.6474) intercept CUDA calls via `LD_PRELOAD`. Prior to a checkpoint, they record stateful calls in a log, which is stored at checkpoint time along with the contents of GPU memory. At restore time, GPU memory is reloaded and the log is replayed. Both frameworks also have to do some virtualization of device pointers.

It seems (perhaps naively) that a similar scheme might be possible within nvproxy, which already intercepts calls to the GPU driver. In theory, nvproxy could record a subset of calls made to the GPU driver and replay them at restore time, virtualizing file descriptors and device pointers as needed. Separately, it could support copying the contents of GPU memory off the device to a file and back.
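To make the record-and-replay idea concrete, here is a rough sketch in Go. It is not nvproxy code and does not use any real driver interface: `Device`, `Recorder`, `Call`, `Handle`, `Restore`, and `fakeDevice` are all hypothetical names invented for illustration, and the "driver" only supports allocate and free. It shows stateful calls being logged, the application seeing only stable virtual handles, and a replay rebuilding the virtual-to-real mapping against a fresh device. Saving and reloading GPU memory contents is left out.

```go
package main

import "fmt"

// Handle is a stable virtual identifier shown to the application
// in place of a real device pointer. (Hypothetical type.)
type Handle uint64

// Call is one recorded stateful call. (Hypothetical log format.)
type Call struct {
	Op     string // "alloc" or "free"
	Size   uint64 // allocation size; unused for "free"
	Handle Handle // virtual handle created or released by the call
}

// Device stands in for the real driver. (Hypothetical interface.)
type Device interface {
	Alloc(size uint64) (uint64, error)
	Free(real uint64) error
}

// Recorder forwards calls to a Device, logs them, and maps
// virtual handles to whatever real addresses the device returned.
type Recorder struct {
	dev        Device
	log        []Call
	virtToReal map[Handle]uint64
	next       Handle
}

func NewRecorder(dev Device) *Recorder {
	return &Recorder{dev: dev, virtToReal: map[Handle]uint64{}}
}

func (r *Recorder) Alloc(size uint64) (Handle, error) {
	real, err := r.dev.Alloc(size)
	if err != nil {
		return 0, err
	}
	r.next++
	r.virtToReal[r.next] = real
	r.log = append(r.log, Call{Op: "alloc", Size: size, Handle: r.next})
	return r.next, nil
}

func (r *Recorder) Free(h Handle) error {
	real, ok := r.virtToReal[h]
	if !ok {
		return fmt.Errorf("unknown handle %d", h)
	}
	if err := r.dev.Free(real); err != nil {
		return err
	}
	delete(r.virtToReal, h)
	r.log = append(r.log, Call{Op: "free", Handle: h})
	return nil
}

// Real resolves a virtual handle to the current real address.
func (r *Recorder) Real(h Handle) (uint64, bool) {
	real, ok := r.virtToReal[h]
	return real, ok
}

// Checkpoint returns a copy of the call log.
func (r *Recorder) Checkpoint() []Call {
	return append([]Call(nil), r.log...)
}

// Restore replays a log against a fresh device. Real addresses may
// differ from the original run, but virtual handles stay the same.
func Restore(dev Device, log []Call) (*Recorder, error) {
	r := NewRecorder(dev)
	for _, c := range log {
		switch c.Op {
		case "alloc":
			real, err := dev.Alloc(c.Size)
			if err != nil {
				return nil, err
			}
			r.virtToReal[c.Handle] = real
			if c.Handle > r.next {
				r.next = c.Handle
			}
		case "free":
			if err := dev.Free(r.virtToReal[c.Handle]); err != nil {
				return nil, err
			}
			delete(r.virtToReal, c.Handle)
		default:
			return nil, fmt.Errorf("unknown op %q", c.Op)
		}
	}
	r.log = append([]Call(nil), log...)
	return r, nil
}

// fakeDevice is a toy bump allocator used in place of a GPU.
type fakeDevice struct {
	next uint64
	live map[uint64]uint64
}

func newFakeDevice(base uint64) *fakeDevice {
	return &fakeDevice{next: base, live: map[uint64]uint64{}}
}

func (d *fakeDevice) Alloc(size uint64) (uint64, error) {
	addr := d.next
	d.next += size
	d.live[addr] = size
	return addr, nil
}

func (d *fakeDevice) Free(real uint64) error {
	if _, ok := d.live[real]; !ok {
		return fmt.Errorf("free of unknown address %#x", real)
	}
	delete(d.live, real)
	return nil
}
```

In this sketch the replay re-executes every logged call, including allocations that were later freed, so the log grows without bound. A real implementation would presumably compact the log and would also need to handle file descriptors, which this example ignores.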

This is clearly complex. I'm curious if you all believe it to be viable and plan on exploring the scheme described above, or a different one, at any point?

### Is this feature related to a specific bug?

_No response_

### Do you have a specific solution in mind?

_No response_

Support for GPU checkpointing in nvproxy #11095
