hostinet: support checkpoint/restore #13404
Force-pushed from 8a82549 to 8076308.
I think S/R should properly save and restore listening sockets. For connected sockets, it would be better to return ECONNABORTED rather than EBADF.
Thanks for the quick feedback. Got it. I am def a bit new to this area, and I originally followed the unix socket close-on-save shape because no host socket state crosses the checkpoint boundary, which seemed like the simplest contract to reason about. I do see now that the guest fd is still open in the sentry's fd table after restore, so I'll update the PR. 2 quick questions:
Force-pushed from 8076308 to c00626b.
Host networking did not have checkpoint/restore support. A sandbox running with `--network=host` could not be checkpointed because `hostinet.Stack` was not savable, and restore did not define what should happen to open host sockets.

This change makes the hostinet stack savable and defines that restore contract. Host socket fds cannot cross a checkpoint boundary, so `Socket.fd` is tagged `state:"nosave"` and save does not touch host sockets. With checkpoint `--leave-running`, the original sandbox keeps serving traffic on its existing connections. If save fails, the sandbox is left unharmed.

Without `--leave-running`, the sandbox exits after save and the host kernel closes its sockets. That sends `FIN` or `RST` so remote peers do not hang on half-open connections.

Listening sockets carry no connection state, so they are restored. Save records the bound address, the backlog, and the socket options needed to re-bind, and restore re-creates the host socket with `socket`, `bind`, and `listen` before tasks resume. Connections that were pending in the backlog at checkpoint time are lost. If the listen address cannot be bound on the restoring host, the restore fails.

Sockets that were connected at checkpoint time cannot be restored. `afterLoad` sets their fd to `-1` and host socket operations return `ECONNABORTED`, which describes what was actually lost, since the guest file descriptor itself remains open in the sentry. `Readiness` reports hangup and error so `epoll` waiters wake immediately. Reading `SO_ERROR` after `EPOLLERR` succeeds and returns `ECONNABORTED`, and such sockets appear as closed in `/proc/net/tcp`. Applications are expected to reconnect after restore.

The stack's host-derived state is not saved. This includes `/proc/net/dev`, `/proc/net/snmp`, TCP buffer sizes, and allowed socket types. During restore, runsc configures a fresh hostinet stack before seccomp filters are installed. `ReplaceConfig` then copies the destination host configuration into the deserialized stack and transfers ownership of the fresh proc net handles. The restored sandbox therefore uses the destination host configuration and limits.

A hostinet checkpoint contains no netstack state and a netstack checkpoint contains no hostinet state, so the checkpoint now records the network type in its metadata. Restore fails with a clear error when the network stack kind differs between checkpoint and restore, instead of panicking inside `ReplaceConfig` during kernel load. Sandbox and none networking both use netstack and remain interchangeable. Checkpoints that predate the metadata key are treated as netstack checkpoints.

The hostinet `save_resume` syscall tests are enabled because save does not touch host sockets. The save and restore syscall tests remain excluded because they expect connected sockets to keep working after restore.

Container tests cover `ECONNABORTED` on restored TCP and UDP sockets, `epoll` and `SO_ERROR` behavior, a blocked accept that completes against the re-created listener, remote peer close after checkpoint, continued service after checkpoint `--leave-running`, and the error when restoring a host networking checkpoint with sandbox networking. The focused hostinet unit tests and the hostinet `save_resume` syscall tests, including the listener-heavy `accept_bind` and `socket_inet_loopback` suites, pass on `x86_64`. The checkpoint/restore user guide documents the restored socket behavior.
Force-pushed from c00626b to 3718b5c.
Host networking did not have checkpoint/restore support. A sandbox running with `--network=host` could not be checkpointed because `hostinet.Stack` was not savable, and restore did not define what should happen to open host sockets.

This change makes the hostinet stack savable and defines that restore contract. Host socket fds cannot cross a checkpoint boundary, so `Socket.fd` is tagged `state:"nosave"` and save does not touch host sockets. With checkpoint `--leave-running`, the original sandbox keeps serving traffic on its existing connections. If save fails, the sandbox is left unharmed.

Without `--leave-running`, the sandbox exits after save and the host kernel closes its sockets. That sends `FIN` or `RST` so remote peers do not hang on half-open connections.

In the restored sandbox, `afterLoad` sets each socket fd to `-1`. Host syscalls then fail with `EBADF`. `Readiness` reports hangup and error so `epoll` waiters wake immediately, and tasks that were blocked in I/O see `EBADF` when their syscall restarts. `State()` reports no TCP state for these sockets rather than logging a warning on every `/proc/net/tcp` read. Applications are expected to reconnect after restore.

The stack's host-derived state is not saved. This includes `/proc/net/dev`, `/proc/net/snmp`, TCP buffer sizes, and allowed socket types. During restore, runsc configures a fresh hostinet stack before seccomp filters are installed. `ReplaceConfig` then copies the destination host configuration into the deserialized stack and transfers ownership of the fresh proc net handles. The restored sandbox therefore uses the destination host configuration and limits.

A hostinet checkpoint contains no netstack state and a netstack checkpoint contains no hostinet state, so the checkpoint now records the network type in its metadata. Restore fails with a clear error when the network stack kind differs between checkpoint and restore, instead of panicking inside `ReplaceConfig` during kernel load. Sandbox and none networking both use netstack and remain interchangeable. Checkpoints taken before the network type was recorded skip this check.

The hostinet `save_resume` syscall tests are enabled because save does not touch host sockets. The save and restore syscall tests remain excluded because they expect sockets to keep working after restore.

Container tests cover restored socket behavior for TCP, UDP, `epoll`, and a blocked accept. They also cover remote peer close after checkpoint, continued service after checkpoint `--leave-running`, and the error when restoring a host networking checkpoint with sandbox networking. The focused hostinet unit tests and the hostinet `save_resume` syscall tests pass on `x86_64`. The checkpoint/restore user guide documents the restored socket behavior.

Fixes #6243