README.esync

This is eventfd-based synchronization, or 'esync' for short. Turn it on with
WINEESYNC=1; debug it with +esync.

== BUGS AND LIMITATIONS ==

Please let me know if you find any bugs. If you can, also attach a log with
+seh,+pid,+esync,+server,+timestamp.

If you get something like "eventfd: Too many open files" and then things start
crashing, you've probably run out of file descriptors. esync creates one
eventfd descriptor for each synchronization object, and some games may use a
large number of these.  Linux by default limits a process to 4096 file
descriptors, which probably was reasonable back in the nineties but isn't
really anymore. (Fortunately Debian and derivatives [Ubuntu, Mint] already
have a reasonable limit.) To raise the limit you'll want to edit
/etc/security/limits.conf and add a line like

* hard nofile 1048576

then restart your session.

On distributions using systemd, the settings in `/etc/security/limits.conf`
will be overridden by systemd's own settings. If you run `ulimit -Hn` and it
returns a lower number than the one you've previously set, then you can set

DefaultLimitNOFILE=1048576

in both `/etc/systemd/system.conf` and `/etc/systemd/user.conf`. You can then
execute `sudo systemctl daemon-reexec` and restart your session. Check again
with `ulimit -Hn` that the limit is correct.

Also note that if the wineserver has esync active, all clients also must, and
vice versa. Otherwise things will probably crash quite badly.

== EXPLANATION ==

The aim is to execute all synchronization operations in "user-space", that is,
without going through wineserver. We do this using Linux's eventfd
facility. The main impetus to using eventfd is so that we can poll multiple
objects at once; in particular we can't do this with futexes, or pthread
semaphores, or the like. The only way I know of to wait on any of multiple
objects is to use select/poll/epoll to wait on multiple fds, and eventfd gives
us those fds in a quite usable way.

Whenever a semaphore, event, or mutex is created, we have the server, instead
of creating a traditional server-side event/semaphore/mutex, instead create an
'esync' primitive. These live in esync.c and are very slim objects; in fact,
they don't even know what type of primitive they are. The server is involved
at all because we still need a way of creating named objects, passing handles
to another process, etc.

The server creates an eventfd file descriptor with the requested parameters
and passes it back to ntdll. ntdll creates an object of the appropriate type,
then caches it in a table. This table is copied almost wholesale from the fd
cache code in server.c.

Specific operations follow quite straightforwardly from eventfd:

* To release an object, or set an event, we simply write() to it.
* An object is signalled if read() succeeds on it. Notably, we create all
  eventfd descriptors with O_NONBLOCK, so that we can atomically check if an
  object is signalled and grab it if it is. This also lets us reset events.
* For objects whose state should not be reset upon waiting—e.g. manual-reset
  events—we simply check for the POLLIN flag instead of reading.
* Semaphores are handled by the EFD_SEMAPHORE flag. This matches up quite well
  (although with some difficulties; see below).
* Mutexes store their owner thread locally. This isn't reliable information if
  a different process's thread owns the mutex, but this doesn't matter—a
  thread should only care whether it owns the mutex, so it knows whether to
  try waiting on it or simply to increase the recursion count.

The interesting part about esync is that (almost) all waits happen in ntdll,
including those on server-bound objects. The idea here is that on the server
side, for any waitable object, we create an eventfd file descriptor (not an
esync primitive), and then pass it to ntdll if the program tries to wait on
it. These are cached too, so only the first wait will require a round trip to
the server. Then the server signals the file descriptor as appropriate, and
thereby wakes up the client. So far this is implemented for processes,
threads, message queues (difficult; see below), and device managers (necessary
for drivers to work). All of these are necessarily server-bound, so we
wouldn't really gain anything by signalling on the client side instead. Of
course, except possibly for message queues, it's not likely that any program
(cutting-edge D3D game or not) is going to be causing a great wineserver load
by waiting on any of these objects; the motivation was rather to provide a way
to wait on ntdll-bound and server-bound objects at the same time.

Some cases are still passed to the server, and there's probably no reason not
to keep them that way. Those that I noticed while testing include: async
objects, which are internal to the file APIs and never exposed to userspace,
startup_info objects, which are internal to the loader and signalled when a
process starts, and keyed events, which are exposed through an ntdll API
(although not through kernel32) but can't be mixed with other objects (you
have to use NtWaitForKeyedEvent()). Other cases include: named pipes, debug
events, sockets, and timers. It's unlikely we'll want to optimize debug events
or sockets (or any of the other, rather rare, objects), but it is possible
we'll want to optimize named pipes or timers.

There were two sort of complications when working out the above. The first one
was events. The trouble is that (1) the server actually creates some events by
itself and (2) the server sometimes manipulates events passed by the
client. Resolving the first case was easy enough, and merely entailed creating
eventfd descriptors for the events the same way as for processes and threads
(note that we don't really lose anything this way; the events include
"LowMemoryCondition" and the event that signals system processes to shut
down). For the second case I basically had to hook the server-side event
functions to redirect to esync versions if the event was actually an esync
primitive.

The second complication was message queues. The difficulty here is that X11
signals events by writing into a pipe (at least I think it's a pipe?), and so
as a result wineserver has to poll on that descriptor. In theory we could just
let wineserver do so and then signal us as appropriate, except that wineserver
only polls on the pipe when the thread is waiting for events (otherwise we'd
get e.g. keyboard input while the thread is doing something else, and spin
forever trying to wake up a thread that doesn't care). The obvious solution is
just to poll on that fd ourselves, and that's what I did—it's just that
getting the fd from wineserver was kind of ugly, and the code for waiting was
also kind of ugly basically because we have to wait on both X11's fd and the
"normal" process/thread-style wineserver fd that we use to signal sent
messages. The upshot about the whole thing was that races are basically
impossible, since a thread can only wait on its own queue.

System APCs already work, since the server will forcibly suspend a thread if
it's not already waiting, and so we just need to check for EINTR from
poll(). User APCs and alertable waits are implemented in a similar style to
message queues (well, sort of): whenever someone executes an alertable wait,
we add an additional eventfd to the list, which the server signals when an APC
arrives. If that eventfd gets signaled, we hand it off to the server to take
care of, and return STATUS_USER_APC.

Originally I kept the volatile state of semaphores and mutexes inside a
variable local to the handle, with the knowledge that this would break if
someone tried to open the handle elsewhere or duplicate it. It did, and so now
this state is stored inside shared memory. This is of the POSIX variety, is
allocated by the server (but never mapped there) and lives under the path
"/wine-esync".

There are a couple things that this infrastructure can't handle, although
surprisingly there aren't that many. In particular:
* Implementing wait-all, i.e. WaitForMultipleObjects(..., TRUE, ...), is not
  exactly possible the way we'd like it to be possible. In theory that
  function should wait until it knows all objects are available, then grab
  them all at once atomically. The server (like the kernel) can do this
  because the server is single-threaded and can't race with itself. We can't
  do this in ntdll, though. The approach I've taken I've laid out in great
  detail in the relevant patch, but for a quick summary we poll on each object
  until it's signaled (but don't grab it), check them all again, and if
  they're all signaled we try to grab them all at once in a tight loop, and if
  we fail on any of them we reset the count on whatever we shouldn't have
  consumed. Such a blip would necessarily be very quick.
* The whole patchset only works on Linux, where eventfd is available. However,
  it should be possible to make it work on a Mac, since eventfd is just a
  quicker, easier way to use pipes (i.e. instead of writing 1 to the fd you'd
  write 1 byte; instead of reading a 64-bit value from the fd you'd read as
  many bytes as you can carry, which is admittedly less than 2**64 but
  can probably be something reasonable.) It's also possible, although I
  haven't yet looked, to use some different kind of synchronization
  primitives, but pipes would be easiest to tack onto this framework.
* PulseEvent() can't work the way it's supposed to work. Fortunately it's rare
  and deprecated. It's also explicitly mentioned on MSDN that a thread can
  miss the notification for a kernel APC, so in a sense we're not necessarily
  doing anything wrong.

There are some things that are perfectly implementable but that I just haven't
done yet:
* Other synchronizable server primitives. It's unlikely we'll need any of
  these, except perhaps named pipes (which would honestly be rather difficult)
  and (maybe) timers.
* Access masks. We'd need to store these inside ntdll, and validate them when
  someone tries to execute esync operations.

This patchset was inspired by Daniel Santos' "hybrid synchronization"
patchset. My idea was to create a framework whereby even contended waits could
be executed in userspace, eliminating a lot of the complexity that his
synchronization primitives used. I do however owe some significant gratitude
toward him for setting me on the right path.

I've tried to maximize code separation, both to make any potential rebases
easier and to ensure that esync is only active when configured. All code in
existing source files is guarded with "if (do_esync())", and generally that
condition is followed by "return esync_version_of_this_method(...);", where
the latter lives in esync.c and is declared in esync.h. I've also tried to
make the patchset very clear and readable—to write it as if I were going to
submit it upstream. (Some intermediate patches do break things, which Wine is
generally against, but I think it's for the better in this case.) I have cut
some corners, though; there is some error checking missing, or implicit
assumptions that the program is behaving correctly.

I've tried to be careful about races. There are a lot of comments whose
purpose are basically to assure me that races are impossible. In most cases we
don't have to worry about races since all of the low-level synchronization is
done by the kernel.

Anyway, yeah, this is esync. Use it if you like.

--Zebediah Figura