README: add SGX-Step root-cause info from recent AEX-Notify paper
jovanbulck committed Sep 2, 2023
1 parent 29aaea2 commit 784ab69
Showing 2 changed files with 39 additions and 8 deletions.
47 changes: 39 additions & 8 deletions README.md
@@ -58,8 +58,6 @@ timer one-shot/periodic interrupt source, (iii) trigger inter-processor
interrupts, and (iv) register custom interrupt handlers completely _within_
user space.

![sgx-step-framework](doc/framework.png)

The above figure summarizes the sequence of hardware and software steps when
interrupting and resuming an SGX enclave through our framework.

@@ -259,10 +257,12 @@ initiate single-stepping for selected functions only, for instance by revoking
access rights on specific code or data pages of interest.

**Note (timer interval).** The exact timer interval value depends on CPU
frequency, and hence remains inherently platform-specific. Configure a suitable
value in `/app/bench/main.c`. We established precise timer intervals for our
evaluation platforms (see table above) by tweaking and observing the NOP
microbenchmark enclave instruction pointer trace results.
frequency, and hence remains inherently platform-specific (see also
[app/apic](app/apic) for detailed microbenchmarks assessing the accuracy of
various APIC timer modes). Configure a suitable value in `/app/bench/main.c`.
We established precise timer intervals for our evaluation platforms (see table
below) by tweaking and observing the NOP microbenchmark enclave instruction
pointer trace results, as further outlined below.

**Note (stability).** In order to avoid the Linux kernel getting stuck or
panicking, SGX-Step should automatically restore the interrupt-descriptor table
@@ -303,10 +303,41 @@ of single-steps, zero-steps, and multi-steps for a NOP slide of 100
instructions (once you have a more or less stable interval you can
switch to longer slides). Too many zero-steps indicate that you have to
increase the timer interval, whereas multi-steps demand lowering the
timer interval. Btw, don't worry when there's some zero-steps left, as
timer interval. _Important: do not worry when there are some zero-steps left, as
long as you make progress, you can always deterministically filter out
zero-steps by looking at the enclave's code PTE accessed bit (which is
only set when the instruction actually retires and a single-step occurred).
only set when the instruction actually retires and a single-step occurred)._
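
The tuning loop described above can be sketched in a few lines of C. This is only an illustration, not code from the SGX-Step sources: `step_stats`, `record_step`, and `suggest_tuning` are hypothetical names, and the rule "raise the interval only when zero-steps outnumber single-steps" is an assumed threshold. How the number of retired instructions is obtained (for a slide of 1-byte NOPs, the instruction pointer delta between two interrupts) is left outside the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tally of one benchmark run over a NOP slide. */
struct step_stats {
    size_t zero;   /* interrupt arrived before any instruction retired */
    size_t single; /* exactly one instruction retired */
    size_t multi;  /* more than one instruction retired */
};

enum tune_hint { TUNE_KEEP, TUNE_INCREASE, TUNE_DECREASE };

/* Classify one interrupt by the number of instructions that retired
 * since the previous interrupt. */
static void record_step(struct step_stats *s, size_t retired)
{
    assert(s != NULL);
    if (retired == 0)
        s->zero++;
    else if (retired == 1)
        s->single++;
    else
        s->multi++;
}

/* Multi-steps skip instructions, so they take priority: lower the
 * interval. Zero-steps can be filtered out afterwards, so the interval
 * is only raised when they dominate (assumed threshold). */
static enum tune_hint suggest_tuning(const struct step_stats *s)
{
    assert(s != NULL);
    if (s->multi > 0)
        return TUNE_DECREASE;
    if (s->zero > s->single)
        return TUNE_INCREASE;
    return TUNE_KEEP;
}
```

A caller would feed one `record_step` call per interrupt and consult `suggest_tuning` after each run of the NOP slide before adjusting the timer interval by hand.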

**Note (root cause).** A detailed root-cause analysis of how exactly SGX-Step
succeeds in reliably interrupting the first (possibly very short!) enclave
instruction following the notoriously complex `ERESUME` instruction is described in the
[AEX-Notify](https://jovanbulck.github.io/files/usenix23-aexnotify.pdf) paper.
We found that the key to SGX-Step's success lies in its use of the "accessed"
(A) bit. Specifically, SGX-Step always clears the A-bit in the victim
enclave's page-middle directory (PMD) before arming the APIC to fire a one-shot
interrupt. The A-bit is only ever set by the processor when at least one
instruction is executed by the enclave and can, hence, be used to
deterministically distinguish between zero-steps versus single-steps.

![assist window root-cause analysis](doc/root-cause.png)

Crucially, as the processor's page-miss handler is optimized for the common
fast path and uses a much slower "microcode assist" to handle the less frequent
and more complex case where a PMD or PTE needs to be modified, this assist has
the effect of prolonging the execution of the first enclave instruction
following `ERESUME` by several hundreds of cycles. This "assist window" thus
effectively opens a spacious landing space for the coarse-grained, normally
distributed APIC timer interrupt to arrive with high accuracy.

The slower the page table walk to resolve the (code) address of the first
enclave instruction following `ERESUME`, the longer the interrupt "landing
space" and, hence, the more reliable SGX-Step's single-stepping rate will be.
For instance, we found that the landing space can be even further extended by
flushing one or more of the unprotected page-table entries from the CPU cache
before `ERESUME`, effectively forcing the CPU to wait for slow memory during
the page-table walk. Thus, when you can't find a reliable timer-interval
configuration, consider flushing (`CLFLUSH`) one or more enclave page-table
entries in the AEP handler.
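
The flushing suggestion above might look roughly as follows. This is a minimal sketch under stated assumptions: `flush_entries` is a hypothetical helper, the pointers are assumed to be user-space mappings of the relevant page-table entries obtained elsewhere, and the intrinsics `_mm_clflush` and `_mm_mfence` come from the compiler's `<emmintrin.h>`. On non-x86-64 targets the flush is compiled out so the sketch still builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__x86_64__)
#include <emmintrin.h> /* _mm_clflush, _mm_mfence */
#endif

/* Flush the cache lines holding the given page-table entries so that
 * the next page-table walk has to fetch them from memory. NULL slots
 * are skipped. Returns the number of entries processed. */
static size_t flush_entries(uint64_t *const entries[], size_t count)
{
    size_t flushed = 0;

    assert(entries != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (entries[i] == NULL)
            continue;
#if defined(__x86_64__)
        _mm_clflush(entries[i]); /* evict the line containing the entry */
#endif
        flushed++;
    }
#if defined(__x86_64__)
    _mm_mfence(); /* order the flushes before subsequent execution */
#endif
    return flushed;
}
```

Such a helper would be invoked from the AEP handler just before resuming the enclave; flushing does not alter the entries themselves, only their cache residency.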

**Note (microcode).**
Another word of caution relates to recent Foreshadow/ZombieLoad/RIDL/etc
Binary file added doc/root-cause.png