Skip to content

Kernel thread stop reap blocked#39

Open
andreykarpenko-qc wants to merge 7 commits into
masterfrom
kernel_thread_stop_reap_blocked
Open

Kernel thread stop reap blocked#39
andreykarpenko-qc wants to merge 7 commits into
masterfrom
kernel_thread_stop_reap_blocked

Conversation

@andreykarpenko-qc

Copy link
Copy Markdown
Contributor

No description provided.

Comment thread kernel/event/intpool/test_h2/test.c Outdated
puts("Joining worker 0 thread");
pthread_join(intpool_child_1, &ret);
puts("Joining worker 1 thread");
// stop_threads = 1;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove code lines which are in comments

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Returning from main (or calling exit()) needs to tear down the entire VM,
not just the calling thread.  Previously sys_exit() funneled into
pthread_exit, which left peer threads — especially ones blocked in a
futex (H2K_STATUS_BLOCKED) or intpool wait (H2K_STATUS_INTBLOCKED) —
stuck forever, leaking the vmblock.

Introduce a new H2_TRAP_VM_STOP (#7) syscall vector dispatched to
H2K_vm_stop, and route sys_exit through h2_vm_stop_trap unconditionally
(any status, not just zero).  H2K_vm_stop scans the vmblock's context
array and reaps every non-DEAD peer regardless of state:

  * BLOCKED  — remove from futex hash, cancel timer
  * INTBLOCKED — remove from intpool ring
  * READY / VMWAIT — remove from runlist
  * RUNNING peers on other HW threads — flagged via vmblock->exiting
    and IPI'd; resched.ref.c picks up the flag and self-reaps the
    context before scheduling

Each reap path performs the common cleanup (ASID refcount,
context clear, return to free list, num_cpus--), then the new helper
H2K_vmblock_finalize_if_done_locked() signals the parent VM and frees
the vmblock once num_cpus reaches zero.

Track which context is "main" so future policy can hinge on it:
vmblock->main_context is set by setup.ref.c for the boot VM and by
create.ref.c for the first thread of every other VM.

Trap-table test (kernel/event/trap/test/test.c) and the intpool h2 test
(kernel/event/intpool/test_h2/test.c) are updated for the new vector
and the new teardown semantics — the latter no longer needs to manually
join workers since main's return now reaps them.

Signed-off-by: Andrey Karpenko <andreyk@qti.qualcomm.com>
Add 36 self-contained tests under libs/posix/pthread/test_h2/ exercising
the pthread/sem/rwlock/barrier/TLS surface and the H2K_vm_stop teardown
path introduced in the previous commit.  Each test is a directory with
test.c + Makefile + Makefile.inc following the existing test_h2
convention.

Register the new tests in scripts/testlist.v61 and scripts/testlist.v81
so they run as part of the unified test suite.  pthread_exit_main is
checked in but commented out in both testlists pending follow-up.

Coverage groups:

  * Basic API:  attr_roundtrip, barrier_basic, cond_signal_broadcast,
                detach_states, join_basic, join_invalid, mutex_recursive,
                mutex_trylock, rwlock_readers_writers, sem_corner,
                tls_keys

  * exit() teardown (exercises H2K_vm_stop):
                exit1_main_mutex, exit1_main_cond_wait,
                exit1_worker_cond_wait, exit1_detached_worker_with_stuck

  * pthread_exit-from-main (POSIX: terminate caller, keep process alive):
                pthread_exit_main

  * Blocked-thread reap on VM exit (workers parked in sync primitives
    when main returns):
                stuck_in_join, stuck_in_mutex, stuck_in_cond_wait,
                stuck_in_cond_timedwait, stuck_in_sem_wait,
                stuck_in_sem_timedwait, stuck_in_rwlock_rd,
                stuck_in_rwlock_wr, stuck_in_barrier,
                stuck_in_pthread_exit_joined

  * Negative / misuse:
                neg_attr_setstacksize_zero, neg_barrier_init_zero,
                neg_cond_wait_no_mutex, neg_create_invalid_routine,
                neg_join_self, neg_mutex_destroy_held,
                neg_mutex_unlock_unowned, neg_rwlock_unlock_unheld,
                neg_sem_overflow_post, neg_tls_use_after_delete

Signed-off-by: Andrey Karpenko <andreyk@qti.qualcomm.com>
Three independent fixes in the test build/run plumbing surfaced once
multiple ARCHV×TARGET variants started running in parallel and writing
into a shared in-source test directory.

makefile:
  * Per-variant test_results.json rule prefixes the inner $(MAKE) with
    '-' so unified-report aggregation runs even when one variant's
    tests fail.  The JSON file is still written by h2_test before
    check-fail runs.
  * h2_test reorders steps so test_report.html and test_results.json
    are generated *before* the warning-grep gate; previously a stray
    warning blocked report generation entirely.

scripts/Makefile.coverage:
  * Per-test results.txt rule's inner $(MAKE) gets the '-' prefix so
    .DELETE_ON_ERROR doesn't wipe FAIL details and abort tst — PASS/FAIL
    is captured in results.txt content for check-fail and
    gen_test_results.py.
  * Symlink whitelist replaces the old "everything except Makefile.inc"
    blacklist when populating the per-variant build dir.  Without the
    whitelist, every variant followed symlinks back to the source tree
    and clobbered each other's *.elf / *.o / results.txt / gmon-*.out,
    leaving only the last writer's outputs in the report.  Whitelist
    covers source extensions (Makefile, *.c/.h/.S/.s/.cpp/.cc/.py/.dat/
    .cfg, tested_functions); explicit excludes drop generator outputs
    whose extension matches a source extension (scenarios.h,
    generated_tests.dat, threadmap.py).

scripts/Makefile.inc.test:
  * Add 'test' as an alias for 'tst' and mark test/tst/all .PHONY so
    they don't collide with stray files of the same name.

Signed-off-by: Andrey Karpenko <andreyk@qti.qualcomm.com>
Per POSIX, pthread_exit from the main thread terminates only the main
thread; the process must remain alive so other threads run to
completion.  Two distinct bugs together broke that; both are fixed here,
and pthread_exit_main is enabled in v61 and v81 testlists to lock the
behavior in.

kernel/thread/stop: drop is_main shortcut

H2K_thread_stop (H2_TRAP_THREAD_STOP, used by pthread_safe_death)
treated main_context exiting as a VM teardown -- calling vm_stop_locked
to reap every sibling.  That collapsed pthread_exit and exit() into the
same fatal path.  Only exit() routes through H2_TRAP_VM_STOP, so the
kernel can distinguish: thread_stop now does the same per-me cleanup
for main as for any worker, then nulls main_context so the existing
all-blocked reaper at the bottom of the function can finalize the VM
cleanly when remaining threads settle.  exit()/sys_exit() still gets
full VM teardown via H2K_vm_stop, untouched.

libs/posix/pthread: don't queue main's TLS for deferred free

pthread_exit defers freeing the exiting thread's TCB+TLS via static
old_freeptr / old_stack_freeptr -- set so the next exiting thread frees
the previous one.  The math (char *)self - elftls_size only works for
worker threads (calloc'd by pthread_create with TCB at
tmpptr + elftls_size).  For main:

  * Small ELF TLS path: TCB lives in mainthread_static_storage (BSS).
    Freeing into BSS is undefined.
  * Large ELF TLS path: TCB sits at malloc'd + alignment correction --
    the math gives main_thread_tls, not the malloc return.

Use aligned_alloc(TLS_ALIGN, round_up(size, TLS_ALIGN)) in the
large-TLS path so the math is correct (no manual alignment fixup), and
track main's TCB via a file-scope main_tcb pointer.  pthread_exit on
main consumes the previously queued frees but skips queueing itself,
so neither static storage nor a non-malloc'd interior pointer ever
reaches free().

scripts/testlist.v{61,81}: enable pthread_exit_main

The test (added in 649bddf) exercises main calling pthread_exit with
a mix of joinable and detached worker threads.  With both fixes above,
it passes under archsim and hexagon-sim.

Signed-off-by: Andrey Karpenko <andreyk@qti.qualcomm.com>
Mirrors pthread_exit_main except no worker calls exit(): every worker
just `return NULL`s from its start routine, and the last one to bump
the shared counter prints TEST PASSED before returning.  Main still
leaves via pthread_exit(NULL).  This forces VM teardown through the
all-blocked / all-dead reaper path -- nobody triggers the
H2_TRAP_VM_STOP shortcut -- and locks in a clean exit status from that
path.  Registered in scripts/testlist.v61 and scripts/testlist.v81.


Signed-off-by: Andrey Karpenko <andreyk@qti.qualcomm.com>
Signed-off-by: Andrey Karpenko <andreyk@qti.qualcomm.com>
@andreykarpenko-qc andreykarpenko-qc force-pushed the kernel_thread_stop_reap_blocked branch from 3c19e13 to 6ac16fe Compare June 11, 2026 09:09
@andreykarpenko-qc andreykarpenko-qc marked this pull request as ready for review June 11, 2026 09:12

@eplondke Erich Plondke (eplondke) left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great feature, it's something we need pretty badly: the ability to kill another thread, and building on top of that being able to kill a vm/process/etc.

There's some partial support for killing another task and we test it at least somewhat. If we set the H2K_VMSTATUS_KILL bit in a thread it's supposed to wake up and call H2K_thread_stop(). But we currently don't ever set that bit outside of testing! :-) But I think that infrastructure might be the best way to reuse some of the other infrastructure to get a thread killed off regardless of its state.

Conceptually, it seems like then the "vm death" sequence is:

  • Mark the VM as "being killed"
  • Mark all threads in the VCPU for death.
  • Possibly, make sure we don't create new VCPUs in a VM "being killed"
  • If the number of VCPUs in the VM goes to zero, we either "signal the parent" (like we might already do?) or "just tear down the VM" as a configured option.
  • Hopefully this works if a guest is asking for itself to be torn down, or the parent is tearing down the child.

This seems pretty great, but I'm hopeful that we can leverage the existing infrastructure a little more if it's working. If it's not working... maybe we can fix it... or if it's too broken then we can create something new like this.

Let me know if you want to chat live about this!

Comment thread kernel/data/vm/vmblock.h Outdated
/* Main (first-created) thread of this vmblock. When this context calls
* H2K_thread_stop the entire vmblock is torn down, matching POSIX
* exit()/return-from-main semantics. NULL once main has exited. */
H2K_thread_context *main_context;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is the best mechanism to be able to kill an entire VM. There is some partial support for H2K_VMSTATUS_KILL but nothing ever sets the bit and gets started. I think having a VMOP that kills a specific CPU and/or whole guest is a great one, but let's try to use the VMSTATUS_KILL infra...

Comment thread kernel/sched/resched/resched.ref.c Outdated
}

static inline void resched(u32_t unused, H2K_thread_context *me, u32_t hwtnum) {
if (me != NULL && me->vmblock != NULL && me->vmblock->exiting) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try to avoid this special case

Comment thread kernel/sched/resched/resched.ref.c Outdated
* Used when a CLUSTER_RESCHED_INT (or RESCHED_INT) lands on a HW thread that
* was running a context belonging to a vmblock whose main thread has exited.
* Mirrors the per-context cleanup in H2K_thread_stop. */
static inline void self_reap_locked(H2K_thread_context *me)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this mostly the same thing that would happen with H2K_thread_stop? we should reuse that as much as possible... and if there are things missing in H2K_thread_stop we should make sure those go there too...

Comment thread kernel/thread/create/create.ref.c Outdated
vmblock->num_cpus++;
tmp->vmblock = vmblock;

/* Record the first thread as main: its exit triggers VM teardown. */

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to tear down the whole VM? Or just put it in all-threads-stopped where the spawning task frees it (current approach)? This is kind of like "zombie" and needing to join in posix? Although also that's a hassle sometimes :-)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't think that we should have a concept of the "main" thread (vcpu) of a vm. For a paravirtualized guest like linux, the vcpus are just a symmetric set of cpus -- there is no "main" cpu that stops all the others when it stops -- in most architectures anyway -- and I don't think that's something we should introduce in the HVM architecture.
So it's ok (and necessary) to have the concept of a main posix thread, but that shouldn't map to a particular vcpu even though in h2os that's how we implement it.
pthread_exit() of the main thread should kill all the other pthreads, but not (directly) all the vcpus. It's quite possible that there is a vcpu that is not running the code of a pthread.
Therefore I think the mechanisms for "stop all pthreads" and "stop the vm" should be separate, logically, though they can share some infrastructure.

Comment thread kernel/thread/stop/stop.h
#define H2K_STOP_H 1

#include <context.h>
#include <vmblock.h>

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thread stop and vm stop should probably be separated (although perhaps one should call the other?)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that. Should have kept reading...

@bryanb-h2

Copy link
Copy Markdown
Contributor

This is a great feature, it's something we need pretty badly: the ability to kill another thread, and building on top of that being able to kill a vm/process/etc.

There's some partial support for killing another task and we test it at least somewhat. If we set the H2K_VMSTATUS_KILL bit in a thread it's supposed to wake up and call H2K_thread_stop(). But we currently don't ever set that bit outside of testing! :-) But I think that infrastructure might be the best way to reuse some of the other infrastructure to get a thread killed off regardless of its state.

Conceptually, it seems like then the "vm death" sequence is:

* Mark the VM as "being killed"

* Mark all threads in the VCPU for death.

* Possibly, make sure we don't create new VCPUs in a VM "being killed"

* If the number of VCPUs in the VM goes to zero, we either "signal the parent" (like we might already do?) or "just tear down the VM" as a configured option.

* Hopefully this works if a guest is asking for itself to be torn down, or the parent is tearing down the child.

This seems pretty great, but I'm hopeful that we can leverage the existing infrastructure a little more if it's working. If it's not working... maybe we can fix it... or if it's too broken then we can create something new like this.

Let me know if you want to chat live about this!

Currently, when the last vcpu stops we send H2K_VM_CHILDINT to the parent vcpu if it's still alive (!H2K_STATUS_DEAD), then tear down the vm unconditionally. We also signal the parent if any child vcpu exits non-0.
Maybe this should be a shint instead so the parent doesn't have to hang around, but it seemed kind of good practice at the time I added this to expect the parent to wait (blocked) until the child vm finishes. This was intended mainly to wake up booter when stuff finishes.

We can certainly add a vmop to set H2K_VMSTATUS_KILL on a vm (child only) and then interrupt any running vcpus from that vm so that they go do vmwork and discover that they need to terminate.

@github-actions github-actions Bot added the untested Mark untested PRs label Jun 25, 2026
Three independent kernel cleanups:

- Remove the H2K_vmblock_t::main_context field and all its bookkeeping.
  Main and worker threads now share the H2K_thread_stop path uniformly;
  VM-wide teardown is driven solely by H2K_vm_stop (the exiting flag),
  so tracking which context was "main" is no longer needed. Drops the
  field from vmblock.h and its assignments in setup/create/stop.

- Factor the duplicated per-context teardown tail (asid_dec ->
  context_clear -> push free list -> num_cpus--) into a shared
  H2K_free_context_locked() exported from stop.h. Collapses the four
  open-coded copies in reap_one_locked, vm_stop_locked, H2K_thread_stop,
  and resched's self-reap onto it.

- Fold the exiting-VM self-reap in resched() into the normal
  if (me != NULL) path instead of a separate early-return branch.
  Removes the self_reap_locked helper along with its duplicated
  runlist_remove/dosched and the unreachable tail, leaving a single
  exit point. On a reap, me is set to NULL so the shared dosched(me)
  tail won't save the dying context.

No behavior change. Verified: clean build + full v81/opt test suite
(136 passed, 0 failed).

Signed-off-by: Andrey Karpenko <andreyk@qti.qualcomm.com>
@andreykarpenko-qc andreykarpenko-qc force-pushed the kernel_thread_stop_reap_blocked branch from 9f4f231 to 6653e29 Compare June 25, 2026 20:11
@eplondke

Copy link
Copy Markdown
Contributor

The tests here are really awesome. They found some great problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

untested Mark untested PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants