Unilateral unlinking of edge's TB causes infinite loop #52

cube0x8 · 2024-03-13T14:03:52Z

Context: my target allocates RWX pages, which are populated with jit code and executed.

I'm experiencing infinite loops when using the QemuCoverageEdgeHelper. The infinite loop occurs because the edges generated by libafl_gen_edge have the same pc as the last TB executed and because edges' TB is unlinked from its successor in the chain when the QEMU's signal handler is triggered.

I report below, step-by-step, a possible scenario that would trigger the infinite loop:

We are in cpu_exec_loop and about to execute a TB, named X from now on
During its execution, X encounters a jne to another TB, named Y
Code for Y is generated, and TBs X and Y are linked through an edge:
X -> edge -> Y
the important detail here is that the edge's TB has the same pc as X
Execution proceeds through several TBs until when another TB, named Z from now on, is executed
Z triggers the signal handler and, causality, one of the TB in the page that is going to be invalidated is Y:
tb_invalidate_phys_page_unwind


    PAGE_FOR_EACH_TB(addr, last, unused, tb, n) {
        if (current_tb == tb &&
            (tb_cflags(current_tb) & CF_COUNT_MASK) != 1) {
            /*
             * If we are modifying the current TB, we must stop its
             * execution. We could be more precise by checking that
             * the modification is after the current PC, but it would
             * require a specialized function to partially restore
             * the CPU state.
             */
            current_tb_modified = true;
            cpu_restore_state_from_tb(current_cpu, current_tb, pc);
        }
        tb_phys_invalidate__locked(tb); // TB IS Y
    }

By looking at point 3, we know that Y is referenced by edge so, when tb_jmp_unlink is executed, tb_reset_jmp is called on the edge's TB. Now edge's TB and Y are not linked anymore:
X -> edge -x-> Y
execution proceeds...
X is executed again, its basic blocks are executed until the edge's TB (remember, X -> edge), but edge is unlinked now, so after the edge is executed, it exits instead of jumping to Y
We are now back in cpu_exec_loop, edge is now last_tb, the pc returned by cpu_get_tb_cpu_state is the same as X, so tb_lookup will return X, X will be executed again, which will jump again to 9), resulting in an infinite loop between 9) and 10)

I see two possible problems here:

We are in a RWX context, where a page is accessed to be written, so potentially this means the JIT code in a page is going to be replaced from other code. In our example above, the page that is being re-written contains the TB named Y and, as we have to invalidate the edge X -> Y, we should also unlink X from edge to have the chain completely invalidated.
an edge shouldn't be initialized with the pc of the previous TB (so, with the X's pc)

I think a possible solution would be:
when the signal handler is triggered, and we are unlinking an edge -> TB2 relationship, we should also unlink the other side of the chain: TB1 -> edge.

I'm open to other suggestions.

The text was updated successfully, but these errors were encountered:

cube0x8 · 2024-03-13T15:13:46Z

I forgot an important detail between 9) and 10). I add it below:

X is executed again, its basic blocks are executed until the edge's TB (remember, X -> edge), but edge is unlinked now, so after the edge is executed, it exits instead of jumping to Y
We are now back in cpu_exec_loop, edge is now last_tb, the pc returned by cpu_get_tb_cpu_state is the same as X, so tb_lookup will return X. The check if (last_tb->jmp_reset_offset[1] != TB_JMP_OFFSET_INVALID) fails, since edge (now last_tb) has only 1 exit point, so last_tb and tb are chained together: tb_add_jump(last_tb, tb_exit, tb);. Current situation:
X -> edge -> X
X will be executed again, which will jump again to edge, which will jump to X again, and so on in an infinite loop...

rmalmain · 2024-03-14T15:45:31Z

Thanks a lot for the detailed walkthrough, it was very nice to follow!
There is a point in your explanation I am not sure to understand:

"We are now back in cpu_exec_loop, edge is now last_tb, the pc returned by cpu_get_tb_cpu_state is the same as X, so tb_lookup will return X." -> At this point last_tb should be X, not edge, no? (since X and edge are linked together). Also, cpu_get_tb_cpu_state should return the pc right after X, not the pc of X if I'm not wrong, since the call to ops->tb_stop(db, cpu); will make the TB update the pc at the end of its execution.

I'm not sure about my remark since I can't test easily with an example, so maybe I missed something. I also suspect there is something fishy around if (last_tb->jmp_reset_offset[1] != TB_JMP_OFFSET_INVALID).

andreafioraldi · 2024-03-14T15:47:30Z

that check is to avoid to instrument single exit blocks like direct jmp, they are not branches Il gio 14 mar 2024, 16:45 Romain Malmain ***@***.***> ha scritto:

…

Thanks a lot for the detailed walkthrough, it was very nice to follow! There is a point in your explanation I am not sure to understand: 1. "We are now back in cpu_exec_loop, edge is now last_tb, the pc returned by cpu_get_tb_cpu_state is the same as X, so tb_lookup will return X." -> At this point last_tb should be X, not edge, no? (since X and edge are linked together). Also, cpu_get_tb_cpu_state should return the pc right after X, not the pc of X if I'm not wrong, since the call to ops->tb_stop(db, cpu); will make the TB update the pc at the end of its execution. I'm not sure about my remark since I can't test easily with an example, so maybe I missed something. I also suspect there is something fishy around if (last_tb->jmp_reset_offset[1] != TB_JMP_OFFSET_INVALID). — Reply to this email directly, view it on GitHub <#52 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AD3LJ6VYB5XXC7P3O22X2IDYYHA3FAVCNFSM6AAAAABEUHK7T6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSOJXG43DGNRXGI> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

rmalmain · 2024-03-14T16:05:46Z

Yes, I meant this is indeed where we would create the infinite loop if we are in the conditions he describes (last_tb == edge and pc == X). I'm surprised there is an infinite loop there tough, if my scenario is the one being executed then last_tb == X and pc == Y. In that case we should just fall back to the original generation of edge and Y no?

andreafioraldi · 2024-03-14T16:07:36Z

Yes, I supspect like a cache invalidation in between, that would explain why he cannot reproduce on a small example program Il gio 14 mar 2024, 17:06 Romain Malmain ***@***.***> ha scritto:

…

Yes, I meant this is indeed where we would create the infinite loop if we are in the conditions he describes (last_tb == edge and pc == X). I'm surprised there is an infinite loop there tough, if my scenario is the one being executed then last_tb == X and pc == Y. In that case we should just fall back to the original generation of edge and Y no? — Reply to this email directly, view it on GitHub <#52 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AD3LJ6V3WE2HCD42ZKQ27QDYYHDHBAVCNFSM6AAAAABEUHK7T6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSOJXHAYDMNZSG4> . You are receiving this because you commented.Message ID: ***@***.***>

cube0x8 · 2024-03-18T15:35:55Z

At this point last_tb should be X, not edge, no?

No? From my debugging, within libafl_gen_code, tcg_out_exit_tb initiates a lea instruction, with its operand being the relative address to the edge's tb address. This address will be the return address of tcg_qemu_tb_exec, which subsequently will be last_tb. It seems logical to me, considering edge is a distinct TB. Thus, why should the exit code load the TB of X rather than that of the edge?

if my scenario is the one being executed then last_tb == X and pc == Y

This is true for regular TBs. However, for edge TBs, the code responsible for updating env->pc upon exit is not present. Below, I've reported the two lines of code that typically update the PC. These lines are present in a standard TB but missing in an edge's TB:

0x7fffe8000130 <code_gen_buffer+259>:    lea    rbx,[rip+0xe862cb9]  // load Y's pc
0x7fffe8000137 <code_gen_buffer+266>:    mov    QWORD PTR [rbp+0x80],rbx // store it into env->eip
0x7fffe800013e <code_gen_buffer+273>:    lea    rax,[rip+0xfffffffffffffefb] // load current TB's address
0x7fffe8000145 <code_gen_buffer+280>:    jmp    0x7fffe8000018 // jump to epilogue

and the lines below are taken from edge's generated code:

   0x7fffe8a5f480:	movabs rbx,0x555556e353b0
   0x7fffe8a5f48a:	movzx  r12d,BYTE PTR [rbx+0x2306]
   0x7fffe8a5f492:	inc    r12d
   0x7fffe8a5f495:	mov    BYTE PTR [rbx+0x2306],r12b
   0x7fffe8a5f49c:	data16 xchg ax,ax
   0x7fffe8a5f49f:	jmp    0x7fffe8a5f4a4 // jump to Y
   0x7fffe8a5f4a4:	lea    rax,[rip+0xffffffffffffff15]  // load edge's TB address
   0x7fffe8a5f4ab:	jmp    0x7fffe8000018 // jump to epilogue

As you can see, in the code above, the part that updates env->eip (rbp+0x80) is missing.

To be honest, I haven't went through the specifics of the ops generation part, so I can't say for sure why the PC is updated exclusively for regular TBs and not for edges'. However, I can somewhat understand why this occurs: what would be the next PC of an invalidated edge? Conceptually speaking, it doesn't seem reasonable to me for an edge to update the PC of the left-side TB (the TB I referred to as X in my example), as its sole purpose is to execute hooks and transition to Y.

Yes, I supspect like a cache invalidation in between, that would explain why he cannot reproduce on a small example program

@andreafioraldi could you explain what is it this cache you mentioned? I believe reproducing this issue on a small program (e.g., examples within the fuzzers directory) might be not possible because there might not be any QEMU user-mode example that replicates the scenario I'm facing (namely, edges between RWX regions being invalidated by the signal handler).

Btw, I think that with some effort I can create a small program that reproduces the issue but I would do that only if really necessary, as it would require a considerable amount of time.

cube0x8 · 2024-03-27T22:49:27Z

Btw, I think we cannot just simply unlink edge from its predecessor but we should call do_tb_phys_invalidate on it...

rmalmain · 2024-04-26T14:50:29Z

Fixed in #53.

cube0x8 changed the title ~~Dummy pc in the libafl_gen_edge generated TB generates an infinite loop~~ Dummy pc in the libafl_gen_edge generated TB, generates an infinite loop Mar 13, 2024

cube0x8 changed the title ~~Dummy pc in the libafl_gen_edge generated TB, generates an infinite loop~~ Unilateral unlinking of edge's TB causes infinite loop Mar 13, 2024

This was referenced Mar 13, 2024

add a flag for edge TB so we can unlink it completely #53

Merged

Incorrect node type casting in libafl_maps_next leads to out-of-bound dereference #46

Closed

rmalmain closed this as completed Apr 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unilateral unlinking of edge's TB causes infinite loop #52

Unilateral unlinking of edge's TB causes infinite loop #52

cube0x8 commented Mar 13, 2024 •

edited

Loading

cube0x8 commented Mar 13, 2024 •

edited by rmalmain

Loading

rmalmain commented Mar 14, 2024 •

edited

Loading

andreafioraldi commented Mar 14, 2024 via email

rmalmain commented Mar 14, 2024

andreafioraldi commented Mar 14, 2024 via email

cube0x8 commented Mar 18, 2024 •

edited

Loading

cube0x8 commented Mar 27, 2024

rmalmain commented Apr 26, 2024

Unilateral unlinking of edge's TB causes infinite loop #52

Unilateral unlinking of edge's TB causes infinite loop #52

Comments

cube0x8 commented Mar 13, 2024 • edited Loading

cube0x8 commented Mar 13, 2024 • edited by rmalmain Loading

rmalmain commented Mar 14, 2024 • edited Loading

andreafioraldi commented Mar 14, 2024 via email

rmalmain commented Mar 14, 2024

andreafioraldi commented Mar 14, 2024 via email

cube0x8 commented Mar 18, 2024 • edited Loading

cube0x8 commented Mar 27, 2024

rmalmain commented Apr 26, 2024

cube0x8 commented Mar 13, 2024 •

edited

Loading

cube0x8 commented Mar 13, 2024 •

edited by rmalmain

Loading

rmalmain commented Mar 14, 2024 •

edited

Loading

cube0x8 commented Mar 18, 2024 •

edited

Loading