This document attempts to describe the structure of the code, still in part inherited from previous projects. It is surely incomplete, possibly not 100% accurate, and subject to change, hopefully to make the system better. It will from time to time lag behind the actual code (check this file's last modification date), so please report any problems you notice.
- Application makes EGL/GLES2 API calls, through custom libEGL and libGLES2 libraries, implemented by the `clientegl` and `clientgles` modules
- API calls are marshalled into messages, described in `gls_command.h`, sent using a user-selectable transport, and possibly wait for a reply before the API returns control to the application
- The server side can work in two ways, depending on the selected transport:
  - either we launch a one-shot server process, handling a single connection before exiting (e.g. `stdio` transport, or `tcp` in no-fork mode),
  - or a main server process just listens for connections, and forks a dedicated "child" process to handle each incoming client (e.g. `tcp` transport in default mode)
- The server "child" receives packets with `recvr` in a dedicated thread, managed from `glserver`, and puts them in a ring buffer
- A server thread from `glserver` executes commands from the ring, with implementations from the `serveregl` and `servergles` modules
- Server-side implementations for calls with output parameters marshall a message like on the client side and send it back; the client receives it from its own `recvr` thread, synchronously with the API call
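As an illustration, the message layout could look like the following sketch; the real definitions live in `gls_command.h`, and the names used here (beyond `SEND_DATA` and `HANDSHAKE` themselves) are hypothetical:

```c
/* Hypothetical sketch of the message layout; the real definitions
 * live in gls_command.h and may differ. */
#include <stdint.h>

typedef enum {
    GLSC_HANDSHAKE,       /* protocol check at connection time */
    GLSC_SEND_DATA,       /* large data block, sent before a command */
    GLSC_glClear          /* ... one identifier per forwarded API call */
} gls_command_id_t;

typedef struct {
    uint32_t cmd;         /* a gls_command_id_t value */
    uint32_t cmd_size;    /* total size of this message */
} gls_command_t;

/* per-API-call parameter struct, e.g. for glClear(GLbitfield mask) */
typedef struct {
    gls_command_t header;
    uint32_t mask;
} gls_glClear_t;
```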
Client-side API implementations essentially forward the call parameters to the server (and receive output data, if any).
Every command message contains a command identifier, defined in `gls_command.h`, which in most cases maps uniquely to an API call.
Small fixed-size input parameters are filled into a per-API-call data structure, located in the `out_buf` buffer provided to the client-side implementations as temporary space for message composition before sending.
Larger input data is sent with a `SEND_DATA` message prior to the command message itself.
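To illustrate both mechanisms together, a client-side wrapper for a call with a large input could follow this hedged sketch; `out_buf` and `send_packet` are names actually used by the code, while the struct layout, the command id, and `gls_send_data()` are assumptions:

```c
/* Hedged sketch of the client-side pattern; struct layout, command id
 * and gls_send_data() are assumptions, not the actual clientgles code. */
#include <stddef.h>
#include <stdint.h>
#include <GLES2/gl2.h>

enum { GLSC_glBufferData = 42 };           /* hypothetical command id */

typedef struct { uint32_t cmd, cmd_size; } gls_command_t;
typedef struct {
    gls_command_t header;
    uint32_t target, size, usage;          /* small fixed-size parameters */
} gls_glBufferData_t;

extern char out_buf[];                     /* client-side composition buffer */
extern void send_packet(size_t size);      /* assumed: sends out_buf content */
extern void gls_send_data(const void *data, size_t size); /* SEND_DATA msg */

GL_APICALL void GL_APIENTRY glBufferData(GLenum target, GLsizeiptr size,
                                         const void *data, GLenum usage)
{
    if (data != NULL)                      /* large input: SEND_DATA first */
        gls_send_data(data, (size_t)size);

    gls_glBufferData_t *c = (gls_glBufferData_t *)out_buf;
    c->header.cmd      = GLSC_glBufferData;
    c->header.cmd_size = sizeof *c;
    c->target = target;
    c->size   = (uint32_t)size;
    c->usage  = usage;
    send_packet(sizeof *c);                /* then the command itself */
}
```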
When a client starts, it sends a special `HANDSHAKE` message to check the server's protocol version, and to get the size of the server-created window.
To avoid dynamic allocation, a fixed-size ring of fixed-size packet buffers is pre-allocated. Data received from the network lands here as long as there is space in the ring. It may be necessary to adjust the compile-time `RING_SIZE_ORDER` if "ring full" is reported (causing throttling of network reception).
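A minimal sketch of such a ring, assuming power-of-two sizing derived from `RING_SIZE_ORDER` (only that name comes from the actual code; the field names and slot size are illustrative):

```c
/* Illustrative fixed-size packet ring; only RING_SIZE_ORDER is a name
 * from the actual code. */
#include <stdint.h>

#define RING_SIZE_ORDER 4                        /* 2^4 = 16 packets */
#define RING_PACKETS    (1u << RING_SIZE_ORDER)
#define PACKET_SIZE     4096                     /* illustrative slot size */

struct ring {
    char     packet[RING_PACKETS][PACKET_SIZE];  /* pre-allocated slots */
    uint32_t write_idx;                          /* advanced by the recvr thread */
    uint32_t read_idx;                           /* advanced by the exec thread */
};

/* "ring full": the receiver must throttle until the executor catches up;
 * increasing RING_SIZE_ORDER makes this less likely */
static int ring_full(const struct ring *r)
{
    return r->write_idx - r->read_idx == RING_PACKETS;
}

static char *ring_write_slot(struct ring *r)
{
    return r->packet[r->write_idx & (RING_PACKETS - 1)];
}
```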
Incoming `SEND_DATA` messages too large for a ring packet get a newly-allocated buffer just for them. They still use a ring packet, preserving the order of arrival, and it contains the `gls_cmd_send_data_t` header, including a pointer to the allocated data (a temporary hack which brings a nasty `zero` field to the protocol just to reserve space for that pointer).
When an API command message is dequeued from the ring, it gets passed by pointer to the API implementation.
`SEND_DATA` messages are handled according to where they were stored:
- those small enough to fit in a ring slot (`dataptr == NULL`) are copied into `tmp_buf` to free their slot
- the larger ones get their `dataptr` stored in `pool.mallocated` alongside the (unused) `tmp_buf`; that pointer refers to the `SEND_DATA` header and we need it for `free()` later, while a reference to the data encapsulated within it is stored in `pool.data_payload`
API commands expecting input from a `SEND_DATA` message just use it from there, according to whether `pool.mallocated` is `NULL` or holds a pointer (in which case it is finally freed after use).
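The bookkeeping described above could look like this sketch; `tmp_buf`, `pool.mallocated` and `pool.data_payload` are the names from the prose, while the struct layout and sizes are assumptions:

```c
/* Sketch of the server-side SEND_DATA bookkeeping; layout is assumed. */
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

typedef struct {
    uint32_t cmd, cmd_size;
    void    *dataptr;        /* NULL if the payload fits in the ring slot */
    char     data[];         /* inline payload for the small case */
} gls_cmd_send_data_t;

static struct {
    char  tmp_buf[65536];    /* illustrative size */
    void *mallocated;        /* kept so we can free() the big buffer later */
    void *data_payload;      /* where API commands read their input from */
} pool;

static void handle_send_data(gls_cmd_send_data_t *msg)
{
    if (msg->dataptr == NULL) {
        /* small: copy into tmp_buf so the ring slot is freed right away */
        memcpy(pool.tmp_buf, msg->data, msg->cmd_size - sizeof *msg);
        pool.mallocated   = NULL;
        pool.data_payload = pool.tmp_buf;
    } else {
        /* oversized: dataptr points at the allocated SEND_DATA header */
        gls_cmd_send_data_t *alloc = msg->dataptr;
        pool.mallocated   = alloc;
        pool.data_payload = alloc->data;
    }
}

/* called once the API command has consumed its input */
static void release_send_data(void)
{
    free(pool.mallocated);   /* no-op when the payload was in tmp_buf */
    pool.mallocated = NULL;
}
```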
An API call with output parameters has its client-side implementation use `wait_for_data()`, which expects to receive a `SEND_DATA` message in the client ring, and subsequently reads the results from `tmp_buf`, just like the server does with large inputs.
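For example, a client-side implementation with an output parameter could follow this hedged pattern; `wait_for_data()`, `out_buf` and `tmp_buf` are names used above, the rest is illustrative:

```c
/* Hedged sketch of a call with output parameters; only wait_for_data(),
 * out_buf and tmp_buf are names from the actual code. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <GLES2/gl2.h>

enum { GLSC_glGetIntegerv = 57 };        /* hypothetical command id */

typedef struct { uint32_t cmd, cmd_size; } gls_command_t;
typedef struct { gls_command_t header; uint32_t pname; } gls_glGetIntegerv_t;

extern char out_buf[];                   /* message composition buffer */
extern char tmp_buf[];                   /* where the SEND_DATA reply lands */
extern void send_packet(size_t size);
extern void wait_for_data(void);         /* assumed signature */

GL_APICALL void GL_APIENTRY glGetIntegerv(GLenum pname, GLint *data)
{
    gls_glGetIntegerv_t *c = (gls_glGetIntegerv_t *)out_buf;
    c->header.cmd      = GLSC_glGetIntegerv;
    c->header.cmd_size = sizeof *c;
    c->pname           = pname;
    send_packet(sizeof *c);

    wait_for_data();                     /* block until the reply arrived */
    memcpy(data, tmp_buf, sizeof(GLint)); /* single-valued pnames only */
}
```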
Upstream used a batching mechanism to send several API calls in a single UDP packet. This has been culled from the current codebase as premature optimisation and because of inherent limitations (see older INTERNALS).
On client side, we try to keep the code minimal, just piping data to the server. However, not all API calls are simple enough for this to be possible. Some exceptions are:
The strings a client app can retrieve with those calls are specified to be static strings. Forwarding each call to the server would not only be inefficient (which is not a problem in itself, as this ought not to be a frequently-repeated call), but also raises memory management issues.
For `eglQueryString` we have to wait until a valid `EGLDisplay` is passed to fetch any string other than client extensions. However, since the returned strings have to be static, we cannot even afford a `realloc()` call after a single API call. We thus have to store the client-extensions string in a separate storage buffer, where it is queried and cached on first call. All the other, display-dependent strings will all be queried and stored the first time any of them is requested. All further requests will hit the cache.
The `glGetString` case is simpler but has the same "static string" requirement; we can just fetch all valid strings to populate the cache on first call.
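A cache along these lines could look like the following sketch (purely illustrative; the `server_get_string` RPC helper is hypothetical):

```c
/* Illustrative glGetString cache: fetch everything once, then always
 * return the same static strings, as the GL spec requires. */
#include <string.h>
#include <GLES2/gl2.h>

extern const char *server_get_string(GLenum name);  /* hypothetical RPC */

static char *string_cache[4];  /* VENDOR, RENDERER, VERSION, EXTENSIONS */

static int cache_index(GLenum name)
{
    switch (name) {
    case GL_VENDOR:     return 0;
    case GL_RENDERER:   return 1;
    case GL_VERSION:    return 2;
    case GL_EXTENSIONS: return 3;
    default:            return -1;      /* GL error handling omitted */
    }
}

GL_APICALL const GLubyte *GL_APIENTRY glGetString(GLenum name)
{
    int idx = cache_index(name);
    if (idx < 0)
        return NULL;

    if (string_cache[0] == NULL) {      /* first call: populate everything */
        static const GLenum names[] =
            { GL_VENDOR, GL_RENDERER, GL_VERSION, GL_EXTENSIONS };
        for (int i = 0; i < 4; i++)
            string_cache[i] = strdup(server_get_string(names[i]));
    }
    return (const GLubyte *)string_cache[idx];
}
```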
Sometimes we can report an error at GLS client level before even sending a command to the server. This is currently implemented for EGL through the `client_egl_error` variable. It can be set by any `egl*` function, and gets reset by `send_packet` to avoid hiding later errors.
For lack of a generic error code, we often use EGL_BAD_ACCESS.
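The pattern, sketched here with an illustrative `egl*` function (`client_egl_error` and `send_packet` are real names, everything else is an assumption):

```c
/* Sketch of the client-side EGL error path; client_egl_error and
 * send_packet are real names, the rest is illustrative. */
#include <stddef.h>
#include <EGL/egl.h>

EGLint client_egl_error = EGL_SUCCESS;

extern void transport_send(size_t size);   /* hypothetical transport call */

static void send_packet(size_t size)
{
    /* reset so a stale client-side error cannot hide later ones */
    client_egl_error = EGL_SUCCESS;
    transport_send(size);
}

EGLAPI EGLSurface EGLAPIENTRY
eglCreatePbufferSurface(EGLDisplay dpy, EGLConfig config,
                        const EGLint *attrib_list)
{
    (void)config; (void)attrib_list;
    if (dpy == EGL_NO_DISPLAY) {
        /* fail locally, without a round-trip to the server */
        client_egl_error = EGL_BAD_DISPLAY;
        return EGL_NO_SURFACE;
    }
    /* ... compose the command in out_buf, then forward it ... */
    send_packet(0 /* real message size here */);
    return EGL_NO_SURFACE;  /* placeholder; the real call returns the
                               server's surface */
}
```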
What a client app wants is to get a callable function pointer, so we could just check the availability of an implementation in GLS using `dlsym`. However, there are reasons to make the call on the server first:
- we don't want to return a function pointer if the server does not implement it
- when the client makes that call later, we want to have the function pointer ready, so the server will also cache the returned value
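A sketch of this two-step logic (the server-side availability check `server_has_proc` is a hypothetical helper; the `dlsym` usage is as suggested above):

```c
/* Sketch of eglGetProcAddress: ask the server first, then resolve our
 * own forwarding implementation; server_has_proc() is hypothetical. */
#define _GNU_SOURCE                /* for RTLD_DEFAULT on glibc */
#include <dlfcn.h>
#include <EGL/egl.h>

extern int server_has_proc(const char *procname);  /* hypothetical RPC,
                                                      cached server-side */

EGLAPI __eglMustCastToProperFunctionPointerType EGLAPIENTRY
eglGetProcAddress(const char *procname)
{
    /* 1. never hand out a pointer the server cannot back */
    if (!server_has_proc(procname))
        return NULL;

    /* 2. return our own forwarding implementation, if built in */
    return (__eglMustCastToProperFunctionPointerType)
        dlsym(RTLD_DEFAULT, procname);
}
```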
GLS cannot know which window is going to be used for rendering before the client app creates a Window Surface for it. At this point we inspect the window, and send a `CREATE_WINDOW` message so the server can emulate what was already done on the client side.
`glVertexAttribPointer` interprets its `pointer` parameter differently, depending on whether a VBO is active (in which case it is an offset into the VBO) or not (in which case it is a pointer to a client array). The problem is that the client-array call does not specify the array size, so the client cannot easily transfer the data at this point.
The vertex attributes set this way are used later, e.g. by `glDrawArrays`, and those calls are the ones receiving the number of vertices for which attributes are given... so the data can only be transferred to the server at that time. We cannot know either whether the data will be reused and should be kept in server memory for a later call, so to avoid leaking the server's memory we basically have to retransfer the data every time it is used (we could also come up with other solutions, like implementing a fixed-size server cache for those arrays; it seems unreasonably difficult to find this out by ourselves and give hints to the server, without the app's cooperation).
Thus this implementation defers the `glVertexAttribPointer` calls pointing to client data, and additionally creates a VBO to hold the attribute data (this is what upstream calls `GLS_EMULATE_VBO`; it has been cleaned up and is unconditionally activated for lack of an alternative, though we could implement a no-VBO version).
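A hedged sketch of the deferral plus VBO emulation follows; the bookkeeping structure and helper names are illustrative, and in the client library the `gl*` buffer calls below would be GLS's own forwarding implementations:

```c
/* Illustrative deferral of client-array glVertexAttribPointer calls;
 * all names except the GL API itself are assumptions. */
#include <GLES2/gl2.h>

struct deferred_attrib {
    GLboolean   deferred;      /* client array not yet transferred */
    GLint       size;
    GLenum      type;
    GLboolean   normalized;
    GLsizei     stride;
    const void *client_ptr;
    GLuint      emul_vbo;      /* emulation VBO holding the data */
};

static struct deferred_attrib attribs[16];   /* illustrative attrib limit */

/* hypothetical helpers */
extern GLsizeiptr attrib_data_size(const struct deferred_attrib *a,
                                   GLsizei vertex_count);
extern void forward_vertex_attrib_pointer(GLuint index, GLint size,
                                          GLenum type, GLboolean normalized,
                                          GLsizei stride, const void *ptr);

void gls_vertex_attrib_pointer(GLuint index, GLint size, GLenum type,
                               GLboolean normalized, GLsizei stride,
                               const void *pointer, GLboolean vbo_bound)
{
    struct deferred_attrib *a = &attribs[index];
    if (vbo_bound) {           /* offset into a VBO: forward immediately */
        a->deferred = GL_FALSE;
        forward_vertex_attrib_pointer(index, size, type,
                                      normalized, stride, pointer);
        return;
    }
    /* client array: just record the parameters for the next draw call */
    a->deferred   = GL_TRUE;
    a->size       = size;
    a->type       = type;
    a->normalized = normalized;
    a->stride     = stride;
    a->client_ptr = pointer;
}

/* at glDrawArrays time the vertex count is finally known, so the data
 * is (re)transferred into the emulation VBO on every draw */
void gls_flush_deferred_attribs(GLsizei vertex_count)
{
    for (GLuint i = 0; i < 16; i++) {
        struct deferred_attrib *a = &attribs[i];
        if (!a->deferred)
            continue;
        if (a->emul_vbo == 0)
            glGenBuffers(1, &a->emul_vbo);
        glBindBuffer(GL_ARRAY_BUFFER, a->emul_vbo);
        glBufferData(GL_ARRAY_BUFFER, attrib_data_size(a, vertex_count),
                     a->client_ptr, GL_STREAM_DRAW);
        forward_vertex_attrib_pointer(i, a->size, a->type, a->normalized,
                                      a->stride, (const void *)0);
        /* restoring the app's previous GL_ARRAY_BUFFER binding omitted */
    }
}
```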
Some optimisations could be done, e.g. sending the data block only once in the case of interleaved attributes in the same client array. Upstream code did this to some extent (only if attribute 0 was active, and with dubious array-bounds handling), but this has been discarded as premature optimisation.
Note: the `glBindBuffer` doc says the buffer "name" is "local to the shared object space of the current GL rendering context". Since we're not tracking any context, we can assume this will break when more than one shared object space gets used.
Currently we use a single (fixed-size) window on the server side, created when launching the server. This usually does not match the size of the window created by client apps. When an app queries the width or height of the window surface to fit rendering to this size, querying the server for the real `EGLSurface` size gives unintended results (and in the case where a demo app like `es2tri` checks that the size matches its expectations, can result in an early abort).
This gets a client-side Pixmap, which needs to be transferred to the server first. Needs a custom data struct and a `WARN_INEFFICIENT`, or shared memory.
(some items here have been improved on, would need a reality check)
Large data buffers cause problems for the fixed-size `tmp_buf`. Currently the code drops data that does not fit in the buffer, and the commands suffering this loss will not be executed, as they will be missing input.
Also, usage of oversized arrays in non-`SEND_DATA` messages "just to be sure we don't miss something" is still too common today.
As well, a number of calls don't properly check that the size of data blocks will fit in the space reserved for them.
Some calls are complicated by the server lacking direct access to the client's memory. Examples include `glVertexAttribPointer` as mentioned above, or `eglCreatePixmapSurface`, which can be emulated at a cost. Explicit memory mapping like the `GL_OES_mapbuffer` extension brings this to another level, which can probably not be efficiently implemented by a network transport.
However, in the case of a virtual machine, we can study the implications of establishing real memory sharing.
The window creation by the client app is done outside the EGL scope, and is not intercepted. It is still drawn, and gets shown with a background that will stay black.
On the server side, the rendering window is created when we know the size we want it to have, that is when the client app makes a call such as `eglCreateWindowSurface`.
There are several options for improving window/input operation, including:
- intercept the creation of the local window and of further calls targeting it, and forward events back: this is the best-performance option on the GPU side, but requires one of:
  - X11-level proxying
  - a GLS support feature allowing GLS-awareness in window-managing layers (GLFW, SDL2, ...)
- change rendering to happen in an offscreen GPU buffer and stream it back to the local window: definitely less complex than window interception, and we can think of shared-memory optimization for the local case, but it will require making use of hardware video (de)compression to be of use through the network
- arrange for the GPU to display the buffer over the window, keeping both low GPU impact and avoiding window interception: naturally not possible for remote usage, and likely not transparent, e.g. when moving windows
- in the QubesOS case with sys-gui-gpu, the window manager already runs on the server side, so it should be possible to integrate at this level with minimal overhead