0009-lifetime-event_loop.md
Summary

Integrate the Crystal event loop directly with the system selectors, epoll (Linux, Android) and kqueue (BSDs, macOS), instead of going through libevent.

Motivation

Direct integration with the system selectors enables a more performant implementation thanks to a design change. Going from ad-hoc event subscriptions to lifetime subscriptions per file descriptor (fd) reduces overhead per operation.

Dropping libevent also removes an external runtime dependency from Crystal programs.

This also prepares the event loop foundation for implementing execution contexts from RFC #0002.

Guide-level explanation

This new event loop driver builds on top of the event loop refactor from RFC #0007. It plugs right into the runtime and does not require any changes in user code, even for direct consumers of the Crystal::EventLoop API.

The new event loop driver is enabled automatically on supported targets (see Availability). The libevent driver serves as a fallback for unsupported targets and can be opted into explicitly with the compile-time flag -Devloop=libevent.

Design

The logic of the event loop doesn't change much from the one based on libevent:

  • We try to execute an operation (e.g. read, write, connect, ...) on nonblocking fds
  • If the operation would block (EAGAIN) we create an event that references the operation along with the current fiber
  • Eventually the polling system (epoll or kqueue) reports that the fd is ready; the event loop then dequeues a pending event and resumes its associated fiber (one at a time). A sketch of this flow follows the list.
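A minimal C-level sketch of this per-operation flow, not the Crystal implementation: here poll(2) stands in for the fiber suspension that the scheduler performs, whereas the real driver parks the fiber on the fd's waiter list until epoll/kqueue reports readiness.

    #include <errno.h>
    #include <poll.h>
    #include <unistd.h>

    /* Stand-in for "suspend the current fiber until the fd is readable".
     * In the real driver the fiber is appended to the fd's readers list and
     * the scheduler resumes it when epoll/kqueue reports readiness. */
    static int wait_readable(int fd)
    {
        struct pollfd p = { .fd = fd, .events = POLLIN };
        return poll(&p, 1, -1) == 1 ? 0 : -1;
    }

    /* Try the operation first; only wait when the kernel says it would block. */
    ssize_t evented_read(int fd, void *buf, size_t len)
    {
        for (;;) {
            ssize_t n = read(fd, buf, len);
            if (n >= 0)
                return n;                      /* data or EOF */
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                if (wait_readable(fd) < 0)     /* would block: wait for readiness */
                    return -1;
                continue;                      /* readable again: retry the read */
            }
            if (errno == EINTR)
                continue;                      /* interrupted: retry */
            return -1;                         /* real error */
        }
    }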

Unlike libevent, which adds the fd to and removes it from the polling system on every blocking operation, the lifetime event loop driver adds each fd once and only removes it when the fd is closed.

The rationale is that being notified of readiness, which happens only once per readiness transition thanks to edge-triggering, is less expensive than modifying the polling system on every operation.

The mechanism only requires fine-grained locks (one for read, one for write), which are usually uncontended unless multiple fibers share an fd for the same read or write operation.
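To make the contrast concrete, here is a hedged C sketch of the two registration strategies. The flags and the idea of storing an arena index in epoll_data are assumptions about what a lifetime driver would do, not a transcription of the Crystal code.

    #include <sys/epoll.h>

    /* Per-operation style (roughly what the libevent-based driver amounts to):
     * arm interest right before waiting, disarm right after. That's two extra
     * syscalls around every blocking operation. */
    int arm_for_read(int epfd, int fd)
    {
        struct epoll_event ev = { .events = EPOLLIN, .data = { .fd = fd } };
        return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
    }

    int disarm(int epfd, int fd)
    {
        return epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
    }

    /* Lifetime style: register once, for both directions, edge-triggered.
     * Readiness is only reported on transitions, so the registration can stay
     * in place until the fd is closed. */
    int register_for_lifetime(int epfd, int fd, unsigned arena_index)
    {
        struct epoll_event ev = {
            .events = EPOLLIN | EPOLLOUT | EPOLLRDHUP | EPOLLET,
            .data   = { .u32 = arena_index },  /* an arena index, not a raw pointer */
        };
        return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
    }

    int unregister_on_close(int epfd, int fd)
    {
        return epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
    }

Edge-triggering only reports transitions, so the driver must keep retrying an operation until it sees EAGAIN before suspending again; the try-first loop sketched earlier already behaves that way.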

Note

On a purely IO-bound benchmark with long-running sockets, we noticed up to 20% performance improvement. Real applications would see smaller improvements, though.

Multi-Threading

Caution

Moving file descriptors with pending operations between event loop instances – i.e. between threads with -Dpreview_mt – is an error. If you need this behaviour, it's best to stay with libevent via -Devloop=libevent.

The lifetime driver supports the multi-threading preview (-Dpreview_mt) with one event loop instance per thread (a scheduler requirement).

Execution contexts (RFC #0002) will have one event loop instance per context.

With multiple event loop instances, it's necessary to define what happens when the same fd is used from multiple instances.

When an operation starts on an fd that is owned by a different event loop instance, we transfer the fd: the new event loop becomes its sole "owner". This transfer happens implicitly.

This leads to a caveat: we can't have multiple fibers waiting for the same fd in different event loop instances (i.e. different threads or contexts). Trying to transfer the fd raises if there is a waiting fiber in the old event loop. This is because an IO read/write can have a timeout, which is registered in the current event loop's timers, and timers aren't transferred. It also allows for future enhancements (e.g. enqueues are always local).

This can be an issue for preview_mt, for example when multiple fibers are waiting for connections on a server socket. It will be mitigated with execution contexts, where an event loop instance is shared per context: just don't share an fd between contexts.

Availability

The lifetime event loop driver is supported on:

  • Android
  • FreeBSD
  • Linux
  • macOS

On these operating systems, it's enabled by default. Unsupported systems keep using libevent.

Compile-time flags allow choosing a different event loop driver:

  • -Devloop=libevent: Use libevent
  • -Devloop=epoll: Use epoll (e.g. on Solaris)
  • -Devloop=kqueue: Use kqueue (on *BSD)

The event loop drivers on Windows and WebAssembly (WASI) are not affected by this change. The event loop driver for Windows already integrates directly with the system selector (IOCP).

Terminology

  • Event loop: an abstraction to wait on specific events, such as timers (e.g. wait until a certain amount of time has passed) or IO (e.g. wait for a socket to be readable or writable).

  • Crystal event loop: Component of the Crystal runtime that subscribes to and dispatches events from the OS and other sources to enqueue a waiting fiber when a desired event (such as IO availability) has occurred. It’s an interface between Crystal’s scheduler and the event loop driver.

  • Event loop driver: Connects the event loop backend (libevent, IOCP, epoll, kqueue, io_uring) with the Crystal runtime.

  • System selector: The system implementation of an event loop which the runtime component depends on to receive events from the operating system. Example implementations: epoll, kqueue, IOCP, io_uring. libevent is a cross-platform wrapper for system selectors.

  • Scheduler: manages the execution of fibers inside the program (controlled by Crystal), unlike threads, which are scheduled by the operating system (outside the program's control).

Reference-level explanation

The run loop first waits on epoll/kqueue, canceling IO timeouts as it resumes the ready fibers, then proceeds to process timers.

The epoll_wait and kevent syscalls can wait until a timeout is reached. The event loop implementations don't use this timeout and instead either wait forever (blocking) or don't wait at all (nonblocking). There are a couple of reasons:

  1. Epoll has a minimum timeout precision of 1ms, but what if the next expiring timer is due in 10us?

  2. In a single-threaded context we could set the timeout to the next expiring timer; that would work without MT as well as with preview_mt. But if we share the event loop instance between threads, which will be the case with execution contexts, then we have a race condition: another thread can register a sooner-expiring timer just before we start waiting, and a mutex wouldn't help (we can't keep it locked until the syscall returns).

Instead, we use an extra timerfd (Linux) or EVFILT_TIMER (macOS, *BSD), which have sub-millisecond precision (except on DragonFlyBSD) and which we can update in parallel when the next expiring timer changes, using a mutex to prevent parallel races.
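A sketch of the Linux side of this approach, registering the timerfd in the same epoll set; the kqueue drivers play the same trick with EVFILT_TIMER. The timer_init/timer_set_next names are illustrative, not the runtime's API.

    #include <stdint.h>
    #include <time.h>
    #include <sys/epoll.h>
    #include <sys/timerfd.h>

    /* Create a monotonic timerfd and add it to the epoll set, once. */
    int timer_init(int epfd)
    {
        int tfd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK | TFD_CLOEXEC);
        struct epoll_event ev = { .events = EPOLLIN, .data = { .u64 = UINT64_MAX } };
        epoll_ctl(epfd, EPOLL_CTL_ADD, tfd, &ev);   /* sentinel value identifies the timer */
        return tfd;
    }

    /* Called (under the timers mutex) whenever the next expiring timer changes.
     * timerfd has nanosecond granularity, so sub-millisecond deadlines are fine,
     * and another thread can rearm it while this thread sits in epoll_wait. */
    void timer_set_next(int tfd, int64_t ns_from_now)
    {
        if (ns_from_now < 1)
            ns_from_now = 1;                        /* a zeroed it_value would disarm */
        struct itimerspec its = {
            .it_value = {
                .tv_sec  = ns_from_now / 1000000000,
                .tv_nsec = ns_from_now % 1000000000,
            },
            /* .it_interval left zeroed: one-shot */
        };
        timerfd_settime(tfd, 0, &its, NULL);
    }

    /* The run loop can then block without a timeout argument,
     * e.g. epoll_wait(epfd, events, max_events, -1):
     * an expiring timer wakes it up through the timerfd's readability. */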

Components

Each syscall is abstracted in its own little struct: Crystal::System::Epoll, Crystal::System::TimerFD, etc.

  • Crystal::Evented namespace (src/crystal/system/unix/evented) contains the base implementation that the system-specific drivers Crystal::Epoll::EventLoop (src/crystal/system/unix/epoll) and Crystal::Kqueue::EventLoop (src/crystal/system/unix/kqueue) are built on.
  • Crystal::Evented::Timers is a basic data structure to keep a list of timers (one instance per event loop).
  • Crystal::Evented::Event holds the event, be it IO or sleep or select timeout or IO with timeout, while FiberEvent wraps an Event for sleeps and select timeouts.
  • Crystal::Evented::PollDescriptor is allocated in a Generational Arena and keeps the lists of readers and writers (events/fibers waiting on IO). It takes advantage of the fact that the OS kernel guarantees unique fd numbers, always reuses the numbers of closed fds, and only hands out higher numbers when needed.

Poll Descriptors

To avoid keeping pointers to the IO object, which could prevent the GC from collecting lost IO objects, this proposal introduces Poll Descriptor objects (the name comes from Go's netpoll) that keep the lists of readers and writers and don't point back to the IO object. The GC collecting an IO object is fine: the finalizer will close the fd and tell the event loop to clean up the associated Poll Descriptor (so we can safely reuse the fd).

To avoid pushing raw pointers into kernel data structures, to quickly retrieve the Poll Descriptor from a mere fd, and to avoid programming errors that would segfault the program, this proposal introduces a Generational Arena to store the Poll Descriptors (the name is inherited from Go's netpoll), so we only store an index into the polling system. Another benefit is that we can reuse the existing allocation when an fd is reused. If we try to retrieve an outdated index (the allocation was freed or reallocated), the arena raises an explicit exception.
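A condensed C sketch of the generational-arena idea, using the fd number itself as the index; names and layout are illustrative, not the Crystal::Evented implementation.

    #include <stdint.h>
    #include <stddef.h>

    /* One slot per fd number; the slot is reused when the kernel reuses the fd. */
    typedef struct {
        uint32_t generation;  /* bumped on every reuse, detects stale handles */
        int      in_use;
        /* ... the Poll Descriptor (readers/writers lists, owner) lives here ... */
    } Slot;

    typedef struct {
        Slot    *slots;
        uint32_t capacity;    /* grown as the process opens higher fd numbers */
    } Arena;

    /* Pack (index, generation) into the 64 bits handed to the polling system
     * instead of a raw pointer. */
    static uint64_t handle_make(uint32_t index, uint32_t gen)
    {
        return ((uint64_t)gen << 32) | index;
    }

    /* Allocate (or reuse) the slot for an fd; the fd number is the index. */
    uint64_t arena_get(Arena *a, int fd)
    {
        Slot *s = &a->slots[fd];          /* assumes fd < capacity */
        if (!s->in_use) {
            s->in_use = 1;
            s->generation++;              /* a new lifetime for this fd number */
        }
        return handle_make((uint32_t)fd, s->generation);
    }

    /* Resolve a handle coming back from the polling system. Returns NULL when
     * the handle is stale (the fd was closed, maybe even reused), where the
     * real arena raises an explicit exception instead of segfaulting. */
    Slot *arena_resolve(Arena *a, uint64_t handle)
    {
        uint32_t index = (uint32_t)handle;
        uint32_t gen   = (uint32_t)(handle >> 32);
        if (index >= a->capacity)
            return NULL;
        Slot *s = &a->slots[index];
        if (!s->in_use || s->generation != gen)
            return NULL;
        return s;
    }

    /* Mark the slot free when the fd is closed; any handle still floating
     * around in kernel data structures becomes invalid. */
    void arena_free(Arena *a, int fd)
    {
        a->slots[fd].in_use = 0;
    }

epoll's 64-bit data field can carry such a packed handle directly; kevent's udata pointer can on 64-bit targets.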

Note

The goals of the arena are:

  • avoid repeated allocations;
  • avoid polluting the IO object with the PollDescriptor (doesn't exist in other event loops);
  • avoid saving raw pointers into kernel data structures;
  • safely detect allocation issues instead of segfaults because of raw pointers.

The Poll Descriptors associate an fd to an event loop instance, so we can still have multiple event loops per process, yet make sure that an fd is only ever in one event loop. When an fd would block on another event loop instance, it is transferred automatically (i.e. removed from the old instance and added to the new one). The benefits are numerous: it avoids having multiple event loops notified at the same time, it avoids having to close/remove the fd from every event loop instance, and it avoids cross event loop enqueues, which are much slower than local enqueues in execution contexts.

A limitation is that trying to move an fd from one event loop to another while there are pending waiters raises an exception. We could move timeout events along with the fd from one event loop instance to another, but that would also break the "always local enqueues" benefit.
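A hedged sketch of this ownership rule, with simplified types; the field names and helper functions are assumptions, and the real driver keeps the owner and waiter lists inside the arena's Poll Descriptor.

    #include <stdbool.h>
    #include <stddef.h>
    #include <sys/epoll.h>

    typedef struct {
        int epfd;                        /* the instance's epoll (or kqueue) fd */
    } EventLoopInstance;

    typedef struct {
        EventLoopInstance *owner;        /* instance the fd is registered with */
        size_t             num_waiters;  /* fibers currently waiting on this fd */
    } PollDescriptor;

    static void selector_remove(EventLoopInstance *loop, int fd)
    {
        epoll_ctl(loop->epfd, EPOLL_CTL_DEL, fd, NULL);
    }

    static void selector_add(EventLoopInstance *loop, int fd)
    {
        struct epoll_event ev = {
            .events = EPOLLIN | EPOLLOUT | EPOLLRDHUP | EPOLLET,
            .data   = { .fd = fd },
        };
        epoll_ctl(loop->epfd, EPOLL_CTL_ADD, fd, &ev);
    }

    /* Called when an operation on `fd` starts on `current` while the fd is
     * owned by another instance. Returns false where the Crystal runtime
     * raises an exception. */
    bool try_transfer(PollDescriptor *pd, int fd, EventLoopInstance *current)
    {
        if (pd->owner == current)
            return true;                 /* already local: nothing to do */
        if (pd->num_waiters > 0)
            return false;                /* waiters (and their timers) live in the
                                            old instance; refuse to strand them */
        if (pd->owner != NULL)
            selector_remove(pd->owner, fd);
        selector_add(current, fd);
        pd->owner = current;             /* current instance is now the sole owner */
        return true;
    }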

Most applications shouldn't notice any impact from this design choice, since an fd is usually not shared across fibers, to avoid concurrency issues. An exception may be a server socket with multiple accepting fibers. In that case you'll need to make sure the fibers run on the same thread (preview_mt) or execution context.

If this ever proves to be an issue, we can think of solutions, for example migrating timeouts along with the fd or triaging external fibers from local ones, at the expense of some performance.

Drawbacks

  • There's more code to maintain: Instead of only the libevent driver we now have additional ones to care about.

Rationale and alternatives

Prior art

Golang's netpoll uses a similar design, with the difference that Go only has a single event loop instance per process.

Unresolved questions

Timers and timeouts

We're missing a proper data structure to store timers and timeouts. It must be thread-safe and efficient. Ideas are a minheap (4-heap) or a skiplist. Timers and timeouts may need to be handled separately.

This issue is blocking. An inefficient data structure wrecks performance for timers and timeouts.
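For illustration only, a minimal sketch of one of the candidate structures mentioned above, a 4-heap ordered by deadline. It is not a decision, and it is not thread-safe on its own; a real implementation would still need locking or a lock-free design.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        int64_t deadline_ns;   /* absolute monotonic deadline */
        void   *event;         /* the timer/timeout event to fire */
    } Timer;

    /* A 4-ary min-heap: children of node i are 4*i+1 .. 4*i+4. Wider nodes mean
     * a shallower tree and fewer cache misses than a binary heap. */
    typedef struct {
        Timer  *items;
        size_t  size;          /* caller guarantees capacity */
    } TimerHeap;

    static void swap(Timer *a, Timer *b) { Timer t = *a; *a = *b; *b = t; }

    void heap_push(TimerHeap *h, Timer t)
    {
        size_t i = h->size++;
        h->items[i] = t;
        while (i > 0) {                             /* sift up */
            size_t parent = (i - 1) / 4;
            if (h->items[parent].deadline_ns <= h->items[i].deadline_ns)
                break;
            swap(&h->items[parent], &h->items[i]);
            i = parent;
        }
    }

    /* Pop the earliest deadline (the root); caller checks size > 0. */
    Timer heap_pop(TimerHeap *h)
    {
        Timer min = h->items[0];
        h->items[0] = h->items[--h->size];
        size_t i = 0;
        for (;;) {                                  /* sift down */
            size_t best = i;
            for (size_t c = 4 * i + 1; c <= 4 * i + 4 && c < h->size; c++)
                if (h->items[c].deadline_ns < h->items[best].deadline_ns)
                    best = c;
            if (best == i)
                break;
            swap(&h->items[i], &h->items[best]);
            i = best;
        }
        return min;
    }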

Performance issues on BSDs

The kqueue driver is disabled on DragonFly BSD, OpenBSD and NetBSD due to performance regressions.

  • DragonFly BSD: Running std_spec is eons slower than libevent. It regularly hangs on event_loop.run until the stack pool collector timeout kicks in (i.e. loops on 5s pauses).

  • OpenBSD: Running std_spec is noticeably slower (4:15 minutes) compared to libevent (1:16 minutes). It appears that the main fiber keeps re-enqueueing itself from the event loop run (10us on each run).

  • NetBSD: The event loop doesn't work; kevent returns ENOENT for the signal loop fd and EINVAL when trying to set an EVFILT_TIMER.

The kqueue driver works fine on FreeBSD and Darwin.

These issues are non-blocking. We can keep using libevent on these operating systems until resolved.

Future possibilities

  • Use aio for async read/write over regular disk files with kqueue.
  • Integrate more evented operations, such as signalfd/EVFILT_SIGNAL and pidfd/EVFILT_PROC.