Preparation for 0.6.0 (#517)
Co-authored-by: David Chisnall <[email protected]>
Co-authored-by: Robert Norton <[email protected]>
Co-authored-by: Nathaniel Wesley Filardo <[email protected]>
Co-authored-by: Istvan Haller <[email protected]>
5 people authored May 9, 2022
1 parent 5906b14 commit d5c732f
Showing 21 changed files with 3,062 additions and 46 deletions.
19 changes: 15 additions & 4 deletions README.md
@@ -29,13 +29,24 @@ scenarios that can be problematic for other allocators:
Both of these can cause massive reductions in the performance of other allocators, but
have no such effect on snmalloc.

The implementation of snmalloc has evolved significantly since the [initial paper](snmalloc.pdf).
The mechanism for returning memory to remote threads has remained, but most of the meta-data layout has changed.
We recommend you read [docs/security](./docs/security/README.md) to find out about the current design, and
if you want to dive into the code, [docs/AddressSpace.md](./docs/AddressSpace.md) provides a good overview of the allocation and deallocation paths.

[![snmalloc CI](https://github.com/microsoft/snmalloc/actions/workflows/main.yml/badge.svg?branch=master)](https://github.com/microsoft/snmalloc/actions/workflows/main.yml)

# Hardening

There is a hardened version of snmalloc, which provides:

* randomisation of the relative locations of allocations,
* storage of most meta-data separately from allocations, protected by guard pages,
* protection of all in-band meta-data with a novel encoding that can detect corruption, and
* a `memcpy` that automatically checks the bounds relative to the underlying malloc allocation.

A more comprehensive write up is in [docs/security](./docs/security/README.md).

# Further documentation

- [Instructions for building snmalloc](docs/BUILDING.md)
42 changes: 0 additions & 42 deletions difference.md

This file was deleted.

130 changes: 130 additions & 0 deletions docs/security/FreelistProtection.md
@@ -0,0 +1,130 @@
# Protecting meta-data

Corrupting an allocator's meta-data is a common pattern for increasing the power of a use-after-free or out-of-bounds write vulnerability.
If you can corrupt the allocator's meta-data, then you can take a control gadget in one part of a system, and use it to affect other parts of the system.
There are various approaches to protecting allocator meta-data; the most common are:

* make the allocator meta-data hard to find through randomisation
* use completely separate ranges of memory for meta-data and allocations
* surround meta-data with guard pages
* add some level of encryption/checksumming

With the refactoring of the page table ([described earlier](./VariableSizedChunks.md)), we can put all the slab meta-data in completely separate regions of memory to the allocations.
We maintain this separation over time, and never allow memory that has been used for allocations to become meta-data and vice versa.
Within the meta-data regions, we add randomisation to make the data hard to find, and add large guard regions around the meta-data.
By using completely separate regions of memory for allocations and meta-data we ensure that no dangling allocation can refer to current meta-data.
This is particularly important for CHERI, as it means a UAF cannot be used to corrupt allocator meta-data.

But there is one super important bit that still remains: free lists.

## What are free lists?

Many allocators chain together unused allocations into a linked list.
This is remarkably space efficient, as it doesn't require additional meta-data proportional to the number of allocations on a slab.
The disused objects can be chained into either a linked stack or a queue.
However, the key problem is that neither randomisation nor guard pages can be used to protect this _in-band_ meta-data.

In snmalloc, we have introduced a novel technique for protecting this data.

## Protecting a free queue.

The idea is remarkably simple: a doubly linked list is far harder to corrupt than a singly linked list, because you can check its invariant:
```
x.next.prev == x
```
In every kind of free list in snmalloc, we encode both the forward and backward pointers in our lists.
For the forward direction, we use an [involution](https://en.wikipedia.org/wiki/Involution_(mathematics)), `f`, such as XORing with a randomly chosen value:
```
f(a) = a XOR k0
```
For the backward direction, we use a more complex, two-argument function
```
g(a, b) = (a XOR k1) * (b XOR k2)
```
where `k1` and `k2` are two randomly chosen 64-bit values.
The encoded back pointer of the node after `x` in the list is `g(x, f(x.next))`, which gives a value that is hard to forge and still encodes the back edge relationship.

As we build the list, we add this value to the disused object, and when we consume the free list later, we check the value is correct.
Importantly, the order of construction and consumption has to be the same, which means we can only use queues, and not stacks.

The checks give us a way to detect that the list has not been corrupted.
In particular, use-after-free or out-of-bounds writes to either the `next` or `prev` value are highly likely to be detected later.
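
To make this concrete, here is a minimal sketch in C++ of the signing and checking (the keys, field names and helpers are illustrative rather than snmalloc's actual implementation, which lives in `freelist.h`):
```c++
#include <cstdint>
#include <cstdlib>

// Illustrative keys, randomly chosen at allocator start-up.
static uintptr_t k0, k1, k2;

// Involution used for the forward direction: f(f(a)) == a.
static uintptr_t f(uintptr_t a) { return a ^ k0; }

// Two-argument signing function used for the backward direction.
static uintptr_t g(uintptr_t a, uintptr_t b) { return (a ^ k1) * (b ^ k2); }

struct FreeObject
{
  uintptr_t encoded_next; // f(address of the next object in the queue)
  uintptr_t signed_prev;  // g(prev, f(prev.next)), written when enqueued
};

// Building the queue: append `next` after the current tail `curr`.
void enqueue(FreeObject* curr, FreeObject* next)
{
  curr->encoded_next = f(reinterpret_cast<uintptr_t>(next));
  next->signed_prev =
    g(reinterpret_cast<uintptr_t>(curr), curr->encoded_next);
}

// Consuming the queue: take `head`, verifying the signature written when it
// was enqueued, and compute the signature expected of the following node.
FreeObject* dequeue(
  FreeObject* head, uintptr_t expected_signed_prev, uintptr_t& next_expected)
{
  if (head->signed_prev != expected_signed_prev)
    abort(); // corruption detected (e.g. UAF write or double free)

  next_expected = g(reinterpret_cast<uintptr_t>(head), head->encoded_next);
  return reinterpret_cast<FreeObject*>(f(head->encoded_next));
}
```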

## Double free protection

This encoding also provides a great double free protection.
If you free twice, it will corrupt the `prev` pointer, and thus when we come to reallocate that object later, we will detect the double free.
The following animation shows the effect of a double free:

![Double free protection example](./data/doublefreeprotection.gif)

This is a relatively weak protection as it is lazy: only when the object is reused will snmalloc raise an error, so a `malloc` can fail due to an earlier double free. But we are only aiming to make exploits harder; this is not a bug-finding tool.


## Where do we use this?

Everywhere we link disused objects, so (1) per-slab free queues and (2) per-allocator message queues for returning freed allocations to other threads.
Originally, snmalloc already used queues for returning memory to other threads; we had to refactor the per-slab free lists to be queues rather than stacks, but that was fairly straightforward.
The code for the free lists can be found here:

[Code](https://github.com/microsoft/snmalloc/blob/main/src/snmalloc/mem/freelist.h)

The idea could easily be applied to other allocators, and we're happy to discuss this.

## Finished assembly

So let's look at what costs we incur from this.
Checks are added both when building the queues and when taking elements from them.
Here we show the assembly for taking from a per-slab free list, which is integrated into the fast path of allocation:
```x86asm
<malloc(unsigned long)>:
lea rax,[rdi-0x1] # Check for small size class
cmp rax,0xdfff # | zero is considered a large size
ja SLOW_SIZE # | to remove from fast path.
shr rax,0x4 # Lookup size class in table
lea rcx,[size_table] # |
movzx edx,BYTE PTR [rax+rcx*1] # |
mov rdi,rdx #+Calculate index into free lists
shl rdi,0x4 #+| (without checks this is a shift by
# | 0x3, and can be fused into an lea)
mov r8,QWORD PTR [rip+0xab9b] # Find thread local allocator state
mov rcx,QWORD PTR fs:0x0 # |
add rcx,r8 # |
add rcx,rdi # Load head of free list for size class
mov rax,QWORD PTR fs:[r8+rdi*1] # |
test rax,rax # Check if free list is empty
je SLOW_PATH_REFILL # |
mov rsi,QWORD PTR fs:0x0 # Calculate location of free list structure
add rsi,r8 # | rsi = fs:[r8]
mov rdx,QWORD PTR fs:[r8+0x2e8] #+Load next pointer key
xor rdx,QWORD PTR [rax] # Load next pointer
prefetcht0 BYTE PTR [rdx] # Prefetch next object
mov QWORD PTR [rcx],rdx # Update head of free list
mov rcx,QWORD PTR [rax+0x8] #+Check signed_prev value is correct
cmp rcx,QWORD PTR fs:[r8+rdi*1+0x8] #+|
jne CORRUPTION_ERROR #+|
lea rcx,[rdi+rsi*1] #+Calculate signed_prev location
add rcx,0x8 #+| rcx = fs:[r8+rdi*1+0x8]
mov rsi,QWORD PTR fs:[r8+0x2d8] #+Calculate next signed_prev value
add rsi,rax #+|
add rdx,QWORD PTR fs:[r8+0x2e0] #+|
imul rdx,rsi #+|
mov QWORD PTR [rcx],rdx #+Store signed_prev for next entry.
ret
```
The extra instructions specific to handling the checks are marked with `+`.
As you can see, the fast path is about twice the length of the unprotected fast path, but the checks add only a single branch, one multiplication, five additional loads, and one store.
The loads only involve one additional cache line for key material.
Overall, the cost is surprisingly low.

Note: the free list header now contains the value that `prev` should contain, which leads to slightly worse x86 codegen.
For instance, the checks introduce the `shl rdi,0x4`, where previously the shift (by 0x3) was fused into an `lea` instruction.

## Conclusion

This approach provides a strong defense against corruption of the free lists used in snmalloc.
This means all inline meta-data has corruption detection.
The check also provides remarkably simple double-free detection, with far lower memory overhead than an allocation bitmap.

[Next we show how to randomise the layout of memory in snmalloc, and thus make it harder to guess the relative address of a pair of allocations.](./Randomisation.md)
151 changes: 151 additions & 0 deletions docs/security/GuardedMemcpy.md
@@ -0,0 +1,151 @@
# Providing a guarded memcpy

Out-of-bounds errors are a serious problem for systems.
We did some analysis of the Microsoft Security Response Center data to look at out-of-bounds heap corruption, and found a common culprit: `memcpy`.
Of the OOB writes that were categorised as leading to remote code execution (RCE), 1/3 of them had a block copy operation like `memcpy` as the initial source of corruption.
This makes any mitigation to `memcpy` extremely high-value.

Now, if a `memcpy` crosses a boundary of a `malloc` allocation, then we have a well-defined error in the semantics of the program.
No sensible program should do this.
So let's see how we detect this with snmalloc.


## What is `memcpy`?

So `memcpy(dst, src, len)` copies `len` bytes from `src` to `dst`.
For this to be valid, we can check:
```
if (src is managed by snmalloc)
check(remaining_bytes(src) >= len)
if (dst is managed by snmalloc)
check(remaining_bytes(dst) >= len)
```
Now, the first `if` is checking for reading beyond the end of the object, and the second is checking for writing beyond the end of the destination object.
By default, in release builds we only check that `dst` is big enough.
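
As a rough sketch (with hypothetical helper names; `remaining_bytes` is described in the rest of this page), the wrapper looks something like this:
```c++
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Hypothetical helpers: a chunk-map lookup, and the distance from a pointer
// to the end of its enclosing object.
bool is_managed_by_snmalloc(const void* p);
size_t remaining_bytes(const void* p);

void* guarded_memcpy(void* dst, const void* src, size_t len)
{
  // Release builds: only check that we do not write past the end of `dst`.
  if (is_managed_by_snmalloc(dst) && remaining_bytes(dst) < len)
  {
    fprintf(stderr, "memcpy writes past the end of the destination\n");
    abort();
  }
#ifndef NDEBUG
  // Full checks: also catch reads past the end of `src`.
  if (is_managed_by_snmalloc(src) && remaining_bytes(src) < len)
  {
    fprintf(stderr, "memcpy reads past the end of the source\n");
    abort();
  }
#endif
  return memcpy(dst, src, len);
}
```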


## How can we implement `remaining_bytes`?

In the previous [page](./VariableSizedChunks.md), we discussed how we enable variable sized slabs.
Let's consider how that representation enables us to quickly find the start/end of any object.

All slab sizes are powers of two, and a given slab's lowest address will be naturally aligned for the slab's size.
(For brevity, slabs are sometimes said to be "naturally aligned (at) powers of two".)
That is, if `x` is the start of a slab of size `2^n`, then `x % (2^n) == 0`.
This means that a single mask can be used to find the offset into a slab.
As the objects are laid out contiguously, we can also get the offset within the object with a modulus operation; that is, `remaining_bytes(p)` is effectively:
```
object_size - ((p % slab_size) % object_size)
```
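
For example, for 48-byte objects, a pointer that is 100 bytes into its slab is `100 % 48 = 4` bytes into its object, so `remaining_bytes` returns `48 - 4 = 44`.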

Well, as anyone will tell you, division/modulus on a fast path is a non-starter.
The first modulus is easy to deal with: we can replace `% slab_size` with a bit-wise mask.
However, as `object_size` can be a non-power-of-two value, we need to work a little harder.

## Reciprocal division to the rescue

When you have a finite domain, you can lower divisions into a multiply and shift.
By pre-calculating `c = (((2^n) - 1)/size) + 1`, the division `x / size` can instead be computed by
```
(x * c) >> n
```
The choice of `n` has to be done carefully for the possible values of `x`, but with a large enough `n` we can make this work for all slab offsets and sizes.

Now, from the division we can calculate the modulus by multiplying the result of the division
by the size, and then subtracting that from the original value:
```
x - (((x * c) >> n) * size)
```
and thus `remaining_bytes(x)` is:
```
(((x * c) >> n) * size) + size - x
```

There is a great article that explains this in more detail by [Daniel Lemire](https://lemire.me/blog/2019/02/20/more-fun-with-fast-remainders-when-the-divisor-is-a-constant/).

Making sure you have everything correct is tricky, but thankfully computers are fast enough to check all possibilities.
In snmalloc, we have a test program that verifies, for all possible slab offsets and all object sizes, that our optimised result is equivalent to the original modulus.

We build the set of constants per sizeclass using `constexpr`, which enables us to determine the end of an object in a handful of instructions.
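
As a minimal sketch of both the trick and such an exhaustive check (the shift amount, slab size and minimum object size below are illustrative assumptions, not snmalloc's actual configuration):
```c++
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr size_t SHIFT = 54; // the 'n' above; large enough for these ranges

constexpr uint64_t reciprocal(size_t size)
{
  return ((uint64_t(1) << SHIFT) - 1) / size + 1; // c = (((2^n) - 1)/size) + 1
}

constexpr size_t remaining_bytes(size_t offset, size_t size, uint64_t c)
{
  // (((offset * c) >> n) * size) + size - offset
  return ((offset * c) >> SHIFT) * size + size - offset;
}

int main()
{
  constexpr size_t slab_size = 1 << 14;  // example slab size
  constexpr size_t min_object_size = 16; // assumed minimum object size
  for (size_t size = min_object_size; size <= slab_size; size++)
  {
    uint64_t c = reciprocal(size);
    for (size_t offset = 0; offset < slab_size; offset++)
      assert(remaining_bytes(offset, size, c) == size - (offset % size));
  }
}
```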

## Non-snmalloc memory.

The `memcpy` function is not just called on memory that is received from `malloc`.
This means we need our lookup to work on all memory and, in the case where the memory is not managed by snmalloc, to assume the access is correct.
We ensure that the `0` value in the chunk map is interpreted as an object covering the whole of the address space.
This gives us compatibility with memory that snmalloc does not manage.

To achieve this nicely, we map 0 to a slab that covers the whole of the address space, and consider there to be a single object in this space.
This works by setting the reciprocal constant to 0, so the division term is always zero.
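
For illustration, a sketch of such a pseudo size-class entry (the field names are hypothetical, not snmalloc's actual layout):
```c++
#include <cstddef>
#include <cstdint>

struct SizeClassData
{
  size_t object_size;  // size of objects in this size class
  size_t slab_mask;    // mask giving the offset within the slab
  uint64_t reciprocal; // constant c used for the reciprocal division
};

// Entry for chunk-map value 0: one object covering the whole address space.
// With reciprocal == 0, the division term ((offset * c) >> n) is always 0,
// so remaining_bytes(p) == object_size - offset, which is large enough for
// any valid copy, and the check always passes.
constexpr SizeClassData unmanaged_memory{
  /* object_size */ SIZE_MAX,
  /* slab_mask   */ SIZE_MAX,
  /* reciprocal  */ 0};
```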

There is a second complication: `memcpy` can be called before `snmalloc` has been initialised.
So we need a check for this case.

## Finished Assembly

The finished assembly for checking the destination length in `memcpy` is:

```x86asm
<memcpy_guarded>:
mov rax,QWORD PTR [rip+0xbfa] # Load Chunk map base
test rax,rax # Check if chunk map is initialised
je DONE # |
mov rcx,rdi # Get chunk map entry
shr rcx,0xa # |
and rcx,0xfffffffffffffff0 # |
mov rax,QWORD PTR [rax+rcx*1+0x8] # Load sizeclass
and eax,0x7f # |
shl rax,0x5 # |
lea r8,[sizeclass_meta_data] # |
mov rcx,QWORD PTR [rax+r8*1] # Load object size
mov r9,QWORD PTR [rax+r8*1+0x8] # Load slab mask
and r9,rdi # Offset within slab
mov rax,QWORD PTR [rax+r8*1+0x10] # Load modulus constant
imul rax,r9 # Perform reciprocal modulus
shr rax,0x36 # |
imul rax,rcx # |
sub rcx,r9 # Find distance to end of object.
add rcx,rax # |
cmp rcx,rdx # Compare to length of memcpy.
jb ERROR # |
DONE:
jmp <memcpy>
ERROR:
ud2 # Trap
```

## Performance

We measured the overhead of adding checks to various sizes of `memcpy`s.
We did a batch of 1000 `memcpy`s, and measured the time with and without checks.
The benchmark code can be found here: [Benchmark Code](../../src/test/perf/memcpy/)

![Performance graphs](./data/memcpy_perf.png)

As you can see, the overhead for small copies can be significant (60% on a single byte `memcpy`), but the overhead rapidly drops and is mostly in the noise once you hit 128 bytes.

When we actually apply this to more realistic examples, we can see a small overhead, which for many examples is not significant.
We compared snmalloc (`libsnmallocshim.so`) to snmalloc with just the checks enabled for bounds of the destination of the `memcpy` (`libsnmallocshim-checks-memcpy-only`) on the applications contained in mimalloc-bench.
The results of this comparison are in the following graph:

![Performance Graphs](./data/perfgraph-memcpy-only.png)

The worst regression is for `redis`, at 2-3% relative to snmalloc running without `memcpy` checks.
However, given that this benchmark still runs 20% faster than it does with jemalloc, we believe the feature can be switched on for production workloads.

## Conclusion

We have an efficient check we can add to any block memory operation to prevent corruption.
The cost on small copies is higher due to the fixed number of arithmetic instructions, but as the objects grow the overhead diminishes.
The memory overhead for adding checks is almost zero as all the dynamic meta-data was already required by snmalloc to understand the memory layout, and the small cost for lookup tables in the binary is negligible.

The idea can easily be applied to other block operations in libc; we have just done `memcpy` as a proof of concept.
If the feature were tightly coupled with libc, then the initialisation check could also be removed, further improving performance.

[Next, we look at how to defend the internal structures of snmalloc against corruption due to memory safety violations.](./FreelistProtection.md)


# Thanks

The research behind this has involved a lot of discussions with a lot of people.
We are particularly grateful to Andrew Paverd, Joe Bialek, Matt Miller, Mike Macelletti, Rohit Mothe, Saar Amar and Swamy Nagaraju for countless discussions on guarded memcpy, its possible implementations and applications.
