WALDEMAR KOZACZUK edited this page Apr 12, 2023 · 14 revisions

Introduction

By design, most applications running on OSv do not execute system calls when they call libc functions. For example, an invocation of mmap() is a direct local function call resolved by the OSv dynamic linker; it involves very few instructions and is therefore very fast. On Linux, the same call is considerably more expensive: it goes through a wrapper function in glibc, which then invokes the SYS_mmap system call, and that involves a CPU ring switch and a virtual address space switch, among other things. This OSv optimization may not be as relevant as one would hope, especially when applications make few mmap() calls, as is often the case, but that is a topic for another story.

Some applications, such as Golang programs or statically linked executables (see this for more details), bypass the libc layer and invoke system calls directly using the SYSCALL (x86_64) or SVC (aarch64) instruction. To support them, OSv implements the system call handler machinery in assembly for both x86_64 and aarch64.
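
To illustrate what "invoking a system call directly" means, here is a minimal sketch of a function that issues getpid through the SYSCALL or SVC instruction instead of calling the libc wrapper. The name raw_getpid() is hypothetical, and the sketch assumes a Linux-compatible system call ABI and a GCC or Clang compiler with inline assembly support.

```cpp
#include <sys/syscall.h>  // SYS_getpid
#include <unistd.h>       // getpid()

// Issues the getpid system call directly, without going through a libc wrapper.
// Hypothetical helper; assumes the Linux system call ABI on x86_64 or aarch64.
long raw_getpid()
{
#if defined(__x86_64__)
    long ret;
    // rax carries the system call number on entry and the result on return;
    // the SYSCALL instruction itself clobbers rcx and r11.
    asm volatile("syscall"
                 : "=a"(ret)
                 : "a"(static_cast<long>(SYS_getpid))
                 : "rcx", "r11", "memory");
    return ret;
#elif defined(__aarch64__)
    // x8 carries the system call number and x0 receives the result.
    register long x8 asm("x8") = SYS_getpid;
    register long x0 asm("x0");
    asm volatile("svc #0" : "=r"(x0) : "r"(x8) : "memory");
    return x0;
#else
    return getpid();  // fallback so the sketch still compiles elsewhere
#endif
}
```

Code like this never reaches the dynamic linker, which is why OSv needs a handler for the instruction itself.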

In Linux, libc functions like mmap() delegate to the corresponding system calls (SYS_mmap in the example above); in OSv the opposite happens. Just like Linux, OSv provides handlers for the SYSCALL and SVC instructions on x86_64 and aarch64 respectively (see syscall_entry in arch/x64/entry.S and handle_system_call in arch/aarch64/entry.S). This tricky low-level assembly code (~50 instructions) switches to a dedicated system call stack, saves all necessary registers, and delegates to syscall_wrapper() and eventually to the syscall() function, both implemented in C++ in linux.cc. Finally, syscall() has a switch statement that calls the relevant libc function implemented by OSv.

Often syscall() can invoke a libc function directly through a simple preprocessor statement like the one below, but sometimes no suitable libc function exists:

SYSCALL2(listen, int, int);

In those cases, such as futex() or getdents64(), we implement the relevant function ourselves and delegate to it like so:

int futex(int *uaddr, int op, int val, const struct timespec *timeout,
         int *uaddr2, uint32_t val3)
{
     switch (op & FUTEX_CMD_MASK) {
...
}

SYSCALL6(futex, int *, int, int, const struct timespec *, int *, uint32_t);
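
The SYSCALLn statements above can be read as one-line entries in the switch inside syscall(). The following is a simplified sketch of how such macros might expand, not the actual definitions from linux.cc: the names SKETCH_SYSCALL0, SKETCH_SYSCALL1 and sketch_syscall() are hypothetical, and the real macros may differ in detail.

```cpp
#include <cerrno>
#include <cstdarg>
#include <sys/syscall.h>  // __NR_* system call numbers
#include <unistd.h>       // getpid(), close(), dup()

// Hypothetical macros modeled on the SYSCALLn statements: each one expands
// to a `case` that pulls typed arguments off the va_list and forwards them
// to the function with the same name as the system call.
#define SKETCH_SYSCALL0(fn) \
    case __NR_##fn:         \
        return fn()

#define SKETCH_SYSCALL1(fn, t1)     \
    case __NR_##fn: {               \
        va_list args;               \
        va_start(args, number);     \
        t1 arg1 = va_arg(args, t1); \
        va_end(args);               \
        return fn(arg1);            \
    }

// Hypothetical stand-in for syscall(): dispatches on the system call number.
long sketch_syscall(long number, ...)
{
    switch (number) {
    SKETCH_SYSCALL0(getpid);
    SKETCH_SYSCALL1(close, int);
    SKETCH_SYSCALL1(dup, int);
    }
    // Not exposed in this sketch.
    errno = ENOSYS;
    return -1;
}
```

Because the macro derives the case label from the function name, a hand-written helper such as futex() plugs into the same mechanism as a regular libc function.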

As one can tell, system call invocations in OSv are slower and more expensive than regular direct local function calls, but they should still be faster than in Linux because OSv has to switch neither the virtual memory mapping nor the CPU ring.

Implemented

Please note that if OSv does not implement (that is, expose) a specific system call by delegating to the relevant libc function, syscall() logs that fact and returns ENOSYS instead of crashing:

syscall(): unimplemented system call <nnn>
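
The fallback described above could be sketched roughly as follows. Both function names are hypothetical, and the second function rests on an assumption not stated on this page: that the raw system call ABI reports errors as a negative errno value, as Linux does, so a libc-style result has to be converted before it is handed back to the caller of SYSCALL or SVC.

```cpp
#include <cerrno>
#include <cstdio>

// Hypothetical fallback mirroring the behavior described above: report the
// unknown system call number and fail with ENOSYS rather than aborting.
long sketch_unimplemented(long number)
{
    std::fprintf(stderr, "syscall(): unimplemented system call %ld\n", number);
    errno = ENOSYS;
    return -1;
}

// Assumption: errors are returned to the caller as a negative errno value,
// so a libc-style (-1, errno) pair is folded into a single number here.
long sketch_to_raw_result(long ret)
{
    return (ret == -1) ? -static_cast<long>(errno) : ret;
}
```

An application that probes for an optional system call can then handle the ENOSYS result and fall back to another code path.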

Here is a list of 73 system calls OSv exposes as of April 2023:

  • accept4
  • bind
  • clock_getres
  • clock_gettime
  • close
  • connect
  • dup3
  • epoll_create1
  • epoll_ctl
  • epoll_pwait
  • epoll_wait
  • eventfd2
  • exit
  • exit_group
  • fcntl
  • fdatasync
  • flock
  • fstat
  • fstatat
  • fsync
  • ftruncate
  • futex
  • getcwd
  • getdents64
  • getgid
  • get_mempolicy
  • getpeername
  • getpid
  • getrandom
  • getsockname
  • getsockopt
  • gettid
  • getuid
  • ioctl
  • listen
  • lseek
  • madvise
  • mincore
  • mkdir
  • mkdirat
  • mmap
  • munmap
  • nanosleep
  • open
  • openat
  • pipe2
  • pread64
  • pselect6
  • pwrite64
  • read
  • readlinkat
  • recvfrom
  • recvmsg
  • renameat
  • rt_sigaction
  • rt_sigprocmask
  • sched_getaffinity
  • sched_setaffinity
  • sched_yield
  • select
  • sendmsg
  • sendto
  • set_mempolicy
  • setsockopt
  • sigaltstack
  • socket
  • stat
  • statfs
  • symlinkat
  • tgkill
  • uname
  • unlinkat
  • write

Trivial to Implement

Most of the time, implementing (or exposing) a new system call is trivial as long as there is a corresponding libc function in OSv (see the Introduction section). Here is a list of 60 system calls that should be trivial to expose (apart from those that do not map exactly one-to-one to the libc functions):

  • accept
  • access
  • alarm
  • chdir
  • creat
  • dup
  • dup2
  • epoll_create
  • eventfd
  • fallocate
  • faccessat
  • fchdir
  • fstatfs
  • futimesat
  • getitimer
  • getpriority
  • getrlimit
  • getrusage
  • gettimeofday
  • kill
  • lstat
  • mprotect
  • msync
  • pause
  • pipe
  • poll
  • ppoll
  • prctl
  • readlink
  • readv
  • rename
  • rmdir
  • sched_get_priority_max
  • sched_get_priority_min
  • sendfile
  • sethostname
  • setitimer
  • setpriority
  • setrlimit
  • shmget
  • shmat
  • shmctl
  • shmdt
  • shutdown
  • socketpair
  • symlink
  • sync
  • sysinfo
  • time
  • timerfd_create
  • timerfd_gettime
  • timerfd_settime
  • times
  • truncate
  • umask
  • unlink
  • utime
  • utimensat
  • utimes
  • writev