- Feature Name: lifetime-event_loop
- Start Date: 2024-11-11
- RFC PR: crystal-lang/rfcs#0009
- Issue: crystal-lang/crystal#14996
Integrate the Crystal event loop directly with the system selectors, `epoll`
(Linux, Android) and `kqueue` (BSDs, macOS), instead of going through
`libevent`.
Direct integration with the system selectors enables a more performant
implementation thanks to a design change. Going from ad-hoc event subscriptions
to lifetime subscriptions per file descriptor (`fd`) reduces the overhead per
operation.
Dropping `libevent` also removes an external runtime dependency from Crystal
programs.
This also prepares the event loop foundation for implementing execution contexts from RFC #0002.
This new event loop driver builds on top of the event loop refactor from RFC
#0007. It plugs right into the runtime and does not require any changes in user
code, even for direct consumers of the `Crystal::EventLoop` API.
The new event loop driver is enabled automatically on supported targets (see
Availability). The `libevent` driver serves as a fallback for unsupported
targets and can be opted into with the compile-time flag `-Devloop=libevent`.
The logic of the event loop doesn't change much from the one based on
`libevent`:

- We try to execute an operation (e.g. `read`, `write`, `connect`, ...) on
  nonblocking `fd`s.
- If the operation would block (`EAGAIN`), we create an event that references
  the operation along with the current fiber.
- We eventually rely on the polling system (`epoll` or `kqueue`) to report when
  an `fd` is ready, which will dequeue a pending event and resume its associated
  fiber (one at a time), as sketched below.
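A minimal sketch of that flow, under the assumption of a hypothetical
`wait_readable` hook (the real runtime internals create the event and suspend
the fiber; none of the names below are public API):

```crystal
# Sketch only: `wait_readable` stands in for the runtime internals that create
# an event referencing the current fiber and suspend it until the selector
# reports readiness.
def read_nonblocking(fd : Int32, slice : Bytes, &wait_readable : ->) : Int32
  loop do
    bytes = LibC.read(fd, slice.to_unsafe, LibC::SizeT.new(slice.size))
    if bytes == -1
      if Errno.value == Errno::EAGAIN
        wait_readable.call # would block: register an event, suspend the fiber
      else
        raise IO::Error.from_errno("read")
      end
    else
      return bytes.to_i
    end
  end
end
```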
Unlike `libevent`, which adds and removes the `fd` to and from the polling
system on every blocking operation, the lifetime event loop driver adds every
`fd` once and only removes it when closing the `fd`. The argument is that being
notified for readiness (which happens only once thanks to edge-triggering) is
less expensive than always modifying the polling system.
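For illustration, this is roughly what lifetime registration amounts to at the
`epoll` level. The `LibEpoll` binding below is ad hoc for this sketch and
assumes Linux on x86-64 (Crystal's real bindings live under `Crystal::System`
and may differ); the point is the single `EPOLL_CTL_ADD` with edge-triggered
read/write interest, after which no further `epoll_ctl` calls happen until the
`fd` is closed.

```crystal
lib LibEpoll
  # Ad-hoc binding for this sketch only (Linux, x86-64).
  EPOLL_CLOEXEC = 0o2000000
  EPOLL_CTL_ADD = 1
  EPOLLIN       = 0x001_u32
  EPOLLOUT      = 0x004_u32
  EPOLLRDHUP    = 0x2000_u32
  EPOLLET       = 0x80000000_u32

  @[Packed] # the kernel's epoll_event is packed on x86-64
  struct EpollEvent
    events : UInt32
    data : UInt64 # user data: the driver stores an arena index here, never a raw pointer
  end

  fun epoll_create1(flags : Int32) : Int32
  fun epoll_ctl(epfd : Int32, op : Int32, fd : Int32, event : EpollEvent*) : Int32
end

reader, writer = IO.pipe

epfd = LibEpoll.epoll_create1(LibEpoll::EPOLL_CLOEXEC)

ev = LibEpoll::EpollEvent.new
ev.events = LibEpoll::EPOLLIN | LibEpoll::EPOLLOUT | LibEpoll::EPOLLRDHUP | LibEpoll::EPOLLET
ev.data = reader.fd.to_u64 # the sketch just stores the fd itself

# Registered once, for the lifetime of the fd: readiness is reported
# edge-triggered, and nothing needs to be added or removed per operation.
LibEpoll.epoll_ctl(epfd, LibEpoll::EPOLL_CTL_ADD, reader.fd, pointerof(ev))
```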
The mechanism only requires fine-grained locks (read, write), which are usually
uncontended, unless you share an `fd` for the same read or write operation
across multiple fibers.
**Note:** On a purely IO-bound benchmark with long-running sockets, we noticed
up to 20% performance improvement. Real applications would see smaller
improvements, though.
**Caution:** Moving file descriptors with pending operations between event loop
instances – i.e. between threads with `-Dpreview_mt` – is an error. If you need
this behaviour, it's best to stay with `libevent` via `-Devloop=libevent`.
The new driver supports the multi-threading preview (`-Dpreview_mt`) with one
event loop instance per thread (a scheduler requirement).
Execution contexts (RFC #0002) will have one event loop instance per context.
With multiple event loop instances, it's necessary to define what happens when
the same `fd` is used from multiple instances.

When an operation starts on an `fd` that is owned by a different event loop
instance, we transfer it: the new event loop becomes the sole "owner" of the
`fd`. This transfer happens implicitly.

This leads to a caveat: we can't have multiple fibers waiting for the same `fd`
in different event loop instances (aka threads/contexts). Trying to transfer the
`fd` raises if there is a waiting fiber in the old event loop. This is because
an IO read/write can have a timeout which is registered in the current event
loop's timers, and timers aren't transferred. This also allows for future
enhancements (e.g. enqueues are always local).
This can be an issue for `preview_mt`, for example when multiple fibers are
waiting for connections on a server socket, as shown below. This shall be
mitigated with execution contexts, where an event loop instance is shared per
context: just don't share an `fd` between contexts.
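For instance, under `-Dpreview_mt` the following (otherwise reasonable) pattern
can break: the accepting fibers may be scheduled on different threads, i.e.
different event loop instances, and the implicit transfer of the server's `fd`
raises while another fiber is already waiting on it.

```crystal
require "socket"

def handle_client(client : TCPSocket)
  client.puts "hello"
ensure
  client.close
end

server = TCPServer.new("127.0.0.1", 8080)

# Problematic with -Dpreview_mt: the accepting fibers can land on different
# threads, i.e. on different event loop instances. A fiber that blocks in
# `accept?` while another fiber is already waiting on the server fd in a
# different event loop triggers a transfer, which raises. Keep the accepting
# fibers on the same thread (or, later, in the same execution context).
4.times do
  spawn do
    while client = server.accept?
      spawn handle_client(client)
    end
  end
end

sleep
```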
The lifetime event loop driver is supported on:
- Android
- FreeBSD
- Linux
- macOS
On these operating systems, it's enabled automatically by default. Unsupported
systems keep using `libevent`.
Compile-time flags allow choosing a different event loop driver:

- `-Devloop=libevent`: use `libevent`;
- `-Devloop=epoll`: use `epoll` (e.g. Solaris);
- `-Devloop=kqueue`: use `kqueue` (on *BSD).
The event loop drivers on Windows and WebAssembly (WASI) are not affected by
this change. The event loop driver for Windows already integrates directly with
the system selector (`IOCP`).
- Event loop: an abstraction to wait on specific events, such as timers (e.g.
  wait until a certain amount of time has passed) or IO (e.g. wait for a socket
  to be readable or writable).
- Crystal event loop: component of the Crystal runtime that subscribes to and
  dispatches events from the OS and other sources to enqueue a waiting fiber
  when a desired event (such as IO availability) has occurred. It's an interface
  between Crystal's scheduler and the event loop driver.
- Event loop driver: connects the event loop backend (`libevent`, `IOCP`,
  `epoll`, `kqueue`, `io_uring`) with the Crystal runtime.
- System selector: the system implementation of an event loop which the runtime
  component depends on to receive events from the operating system. Example
  implementations: `epoll`, `kqueue`, `IOCP`, `io_uring`. `libevent` is a
  cross-platform wrapper for system selectors.
- Scheduler: manages the execution of fibers inside the program (controlled by
  Crystal), unlike threads, which are scheduled by the operating system (outside
  of the program's control).
The run loop first waits on `epoll`/`kqueue`, canceling IO timeouts as it
resumes the ready fibers, then proceeds to process timers.
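In sketch form, one iteration of that run loop could look like this. The types
and helper methods (`ReadyEvent`, `wait_for_events`, `cancel_timeout`,
`process_timers`) are stand-ins, not the actual internals.

```crystal
# Illustrative model of one run-loop iteration.
record ReadyEvent, fiber : Fiber

def wait_for_events(blocking : Bool) : Array(ReadyEvent)
  [] of ReadyEvent # stub: the real driver calls epoll_wait/kevent here
end

def cancel_timeout(event : ReadyEvent) : Nil
end

def process_timers : Nil
end

def run(blocking : Bool) : Nil
  # 1. Wait on epoll/kqueue: forever when blocking, not at all otherwise.
  events = wait_for_events(blocking)

  # 2. Resume the fibers whose fd became ready, canceling their IO timeout
  #    (if any) so the timer can't fire later for an already resumed fiber.
  events.each do |event|
    cancel_timeout(event)
    event.fiber.enqueue
  end

  # 3. Only then process expired timers (sleeps, select timeouts, IO timeouts).
  process_timers
end
```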
The `epoll_wait` and `kevent` syscalls can wait until a timeout is reached. The
event loop implementations don't use that timeout and instead either wait
forever (blocking) or don't wait at all (nonblocking). There are a couple of
reasons:

- `epoll` also has a minimum timeout precision of 1ms, but what if the next
  expiring timer is in 10µs?
- In a single-threaded context we could set the timeout to the next expiring
  timer; that would work without MT as well as with `preview_mt`, but if we
  share the event loop instance between threads (which will be the case with
  execution contexts) then we have a race condition: another thread can set a
  sooner-expiring timer just before we'd wait, and a mutex wouldn't help solve
  it (we can't keep it locked until the syscall returns).
Instead, we use an extra `timerfd` (Linux) or `EVFILT_TIMER` (macOS, *BSD),
which have sub-millisecond precision (except on DragonFly BSD) and which we can
update in parallel when the next expiring timer has changed, using a mutex to
prevent parallel races.
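A sketch of that mechanism, assuming a hypothetical `rearm` callback standing in
for the `Crystal::System::TimerFD` / `EVFILT_TIMER` update:

```crystal
# Sketch: keep the OS timer (timerfd / EVFILT_TIMER) armed for the earliest
# deadline in the list; a mutex protects the "did the earliest timer change?"
# decision so parallel threads can't race each other.
class TimersSketch
  @mutex = Mutex.new
  @timers = [] of Time # absolute deadlines, kept sorted (stand-in structure)

  # `rearm` stands in for updating the timerfd / EVFILT_TIMER.
  def add(deadline : Time, &rearm : Time ->) : Nil
    @mutex.synchronize do
      earliest = @timers.first?
      @timers << deadline
      @timers.sort!
      # Only touch the OS timer when the next expiration actually changed.
      if earliest.nil? || deadline < earliest
        rearm.call(deadline)
      end
    end
  end
end
```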
Each syscall is abstracted in its own little struct: `Crystal::System::Epoll`,
`Crystal::System::TimerFD`, etc.
- The `Crystal::Evented` namespace (`src/crystal/system/unix/evented`) contains
  the base implementation that the system-specific drivers
  `Crystal::Epoll::EventLoop` (`src/crystal/system/unix/epoll`) and
  `Crystal::Kqueue::EventLoop` (`src/crystal/system/unix/kqueue`) are built on.
- `Crystal::Evented::Timers` is a basic data structure to keep a list of timers
  (one instance per event loop).
- `Crystal::Evented::Event` holds the event, be it IO, sleep, select timeout, or
  IO with timeout, while `FiberEvent` wraps an `Event` for sleeps and select
  timeouts (a rough shape is sketched below).
- `Crystal::Evented::PollDescriptor` is allocated in a Generational Arena and
  keeps the list of readers and writers (events/fibers waiting on IO). It takes
  advantage of the fact that the OS kernel guarantees a unique `fd` number,
  always reuses the numbers of closed `fd`s, and only adds more numbers when
  needed.
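A hypothetical simplification of the event shapes described above; the type and
field names below are illustrative, not the actual source.

```crystal
enum EventTypeSketch
  IORead
  IOWrite
  Sleep
  SelectTimeout
end

# One pending wait: which fiber to resume, what it is waiting for, and an
# optional deadline (IO timeout, sleep duration, select timeout).
record EventSketch,
  type : EventTypeSketch,
  fiber : Fiber,
  wake_at : Time::Span? = nil

# Wraps an event so sleeps and select timeouts can reuse it across suspensions
# of the same fiber.
record FiberEventSketch, event : EventSketch
```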
To avoid keeping pointers to the IO object that could prevent the GC from
collecting lost IO objects, this proposal introduces Poll Descriptor objects
(the name comes from Go's netpoll) that keep the list of readers and writers and
don't point back to the IO object. The GC collecting an IO object is fine: the
finalizer will close the `fd` and tell the event loop to clean up the associated
Poll Descriptor (so we can safely reuse the `fd`).
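A rough shape of a Poll Descriptor under these constraints (simplified; the
fields are illustrative):

```crystal
# Simplified sketch: a Poll Descriptor records who is waiting on the fd, but
# holds no reference back to the IO object, so the GC stays free to collect
# the IO; its finalizer closes the fd and releases this entry in the arena.
struct PollDescriptorSketch
  getter readers = Deque(Fiber).new # fibers/events waiting for readability
  getter writers = Deque(Fiber).new # fibers/events waiting for writability
  # Intentionally no `@io : IO` field here.
end
```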
To avoid pushing raw pointers into the kernel data structures, to quickly
retrieve the Poll Descriptor from a mere `fd`, but also to avoid programming
errors that would segfault the program, this proposal introduces a Generational
Arena to store the Poll Descriptors (the name is inherited from Go's netpoll),
so we only store an index into the polling system. Another benefit is that we
can reuse the existing allocation when an `fd` is reused. If we try to retrieve
an outdated index (the allocation was freed or reallocated), the arena raises an
explicit exception.
**Note:** The goals of the arena are:

- avoid repeated allocations;
- avoid polluting the IO object with the Poll Descriptor (which doesn't exist in
  other event loops);
- avoid saving raw pointers into kernel data structures;
- safely detect allocation issues instead of segfaulting because of raw
  pointers.
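To make the idea concrete, here is a minimal generational arena sketch. It is
not the actual implementation (which is lock-protected and preallocated); it
only shows the index-plus-generation scheme.

```crystal
class ArenaSketch(T)
  # Callers hold an {index, generation} pair instead of a raw pointer.
  record Index, index : Int32, generation : UInt32

  def initialize
    @values = [] of T?        # nil means the slot is free
    @generations = [] of UInt32
  end

  # Allocate (or reuse) the slot at `index` (for Poll Descriptors the index is
  # the fd itself), bumping the slot's generation.
  def allocate_at(index : Int32, value : T) : Index
    while @values.size <= index
      @values << nil
      @generations << 0_u32
    end
    raise "slot #{index} already allocated" unless @values[index].nil?
    @generations[index] = @generations[index] &+ 1
    @values[index] = value
    Index.new(index, @generations[index])
  end

  # Raises on an outdated index (the slot was freed or reallocated since),
  # instead of silently handing back the wrong object.
  def get(index : Index) : T
    value = @values[index.index]?
    if value && @generations[index.index] == index.generation
      value
    else
      raise "stale arena index"
    end
  end

  def free(index : Index) : Nil
    return if index.index >= @values.size
    return if @generations[index.index] != index.generation
    @values[index.index] = nil
  end
end
```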
The Poll Descriptors associate an `fd` to an event loop instance, so we can
still have multiple event loops per process, yet make sure that an `fd` is only
ever in one event loop. When an `fd` would block on another event loop instance,
the `fd` is transferred automatically (i.e. removed from the old one and added
to the new one). The benefits are numerous: this avoids having multiple event
loops notified at the same time; it avoids having to close/remove the `fd` from
each event loop instance; and it avoids cross event loop enqueues, which are
much slower than local enqueues in execution contexts.
A limitation is that trying to move an `fd` from one event loop to another while
there are pending waiters raises an exception, as sketched below. We could move
timeout events along with the `fd` from one event loop instance to another, but
that would also break the "always local enqueues" benefit.
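The ownership rule in sketch form, with illustrative class names
(`EventLoopSketch` only stubs the selector calls):

```crystal
class EventLoopSketch
  def add(fd : Int32) : Nil
    # stub: EPOLL_CTL_ADD / EV_ADD in this instance's selector
  end

  def remove(fd : Int32) : Nil
    # stub: EPOLL_CTL_DEL / EV_DELETE in this instance's selector
  end
end

# Sketch of the implicit transfer: the fd follows the event loop instance it
# blocks on, but never while another fiber is still waiting on it.
class TransferablePollDescriptor
  property owner : EventLoopSketch?
  getter readers = Deque(Fiber).new
  getter writers = Deque(Fiber).new

  def take_ownership(new_owner : EventLoopSketch, fd : Int32) : Nil
    current = @owner
    return if current.same?(new_owner)

    # Transferring while fibers are waiting would leave their timeouts behind
    # in the old instance's timers, so it is reported as an error instead.
    unless readers.empty? && writers.empty?
      raise RuntimeError.new("cannot transfer fd=#{fd}: already in use by another event loop")
    end

    current.remove(fd) if current
    new_owner.add(fd)
    @owner = new_owner
  end
end
```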
Most applications shouldn't notice any impact from this design choice, since an
`fd` is usually not shared across fibers, to avoid concurrency issues. An
exception may be a server socket with multiple accepting fibers. In that case
you'll need to make sure the fibers are on the same thread (`preview_mt`) or
execution context.
If this ever proves to be an issue, we can think of solutions, for example
migrating timeouts along with the `fd`, triaging external fibers from local
ones, and more, of course at the expense of some performance.
- There's more code to maintain: instead of only the `libevent` driver we now
  have additional ones to care about.
Golang's netpoll uses a similar design, with the difference that Go only has a single event loop instance per process.
We're missing a proper data structure to store timers and timeouts. It must be
thread-safe and efficient. Ideas are a minheap (4-heap) or a skiplist. Timers
and timeouts may need to be handled separately.

This issue is blocking: an inefficient data structure wrecks performance for
timers and timeouts.
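One possible shape, purely illustrative (the RFC leaves the actual choice open
between a 4-heap, a skiplist, or something else): a mutex-protected binary
min-heap keyed by deadline.

```crystal
class TimerHeapSketch(T)
  def initialize
    @mutex = Mutex.new
    @entries = [] of {Time, T} # {deadline, payload}
  end

  def push(deadline : Time, value : T) : Nil
    @mutex.synchronize do
      @entries << {deadline, value}
      sift_up(@entries.size - 1)
    end
  end

  # Pops the next expired timer, or returns nil if none has expired yet.
  def pop_expired(now : Time = Time.utc) : T?
    @mutex.synchronize do
      first = @entries.first?
      if first && first[0] <= now
        last = @entries.pop
        unless @entries.empty?
          @entries[0] = last
          sift_down(0)
        end
        first[1]
      else
        nil
      end
    end
  end

  private def sift_up(i : Int32) : Nil
    while i > 0
      parent = (i - 1) // 2
      break if @entries[parent][0] <= @entries[i][0]
      @entries.swap(i, parent)
      i = parent
    end
  end

  private def sift_down(i : Int32) : Nil
    loop do
      smallest = i
      left = 2 * i + 1
      right = 2 * i + 2
      smallest = left if left < @entries.size && @entries[left][0] < @entries[smallest][0]
      smallest = right if right < @entries.size && @entries[right][0] < @entries[smallest][0]
      break if smallest == i
      @entries.swap(i, smallest)
      i = smallest
    end
  end
end
```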
The `kqueue` driver is disabled on DragonFly BSD, OpenBSD and NetBSD due to
performance regressions.
- DragonFly BSD: running `std_spec` is eons slower than with libevent. It
  regularly hangs on `event_loop.run` until the stack pool collector timeout
  kicks in (i.e. loops on 5s pauses).
- OpenBSD: running `std_spec` is noticeably slower (4:15 minutes) compared to
  libevent (1:16 minutes). It appears that the main fiber keeps re-enqueueing
  itself from the event loop run (10µs on each run).
- NetBSD: the event loop doesn't work, with `kevent` returning `ENOENT` for the
  signal loop `fd` and `EINVAL` when trying to set an `EVFILT_TIMER`.
The `kqueue` driver works fine on FreeBSD and Darwin.
These issues are non-blocking. We can keep using `libevent` on these operating
systems until resolved.
- `aio` for async read/write over regular disk files with `kqueue`.
- Integrate more evented operations, such as `signalfd`/`EVFILT_SIGNAL` and
  `pidfd`/`EVFILT_PROC`.