ExecutionEngineException fail-fast in GC thread-suspension / return-address hijack while root-scanning a Win32 message-pump P/Invoke frame (.NET 10.0.8, Windows, Avalonia desktop) #129071

@steveh-iz

Description

A long-running desktop app (Avalonia 12, Win32 windowing backend) reliably fail-fasts with a runtime-raised System.ExecutionEngineException (HRESULT 0x80131506, faulting module coreclr.dll) every ~8–80 minutes of idle/normal use. The exception is raised by the runtime itself from Debugger::HandleFatalErrorRaiseFailFastException, so it bypasses AppDomain.UnhandledException and cannot be caught.

Using DOTNET_StressLog=1 + a full minidump (DOTNET_DbgEnableMiniDump=1, DbgMiniDumpType=4) I captured the runtime's own fatal-event trail across three independent crashes. All three show the same mechanism: the GC's thread-suspension / return-address-hijack machinery raises the fatal error while suspending a thread stopped at a non-fully-interruptible JIT point (fullyInt=0), while the main UI thread is being GC-root-scanned inside the Win32 message pump's DispatchMessage(MSG) P/Invoke frame, during a routine (not pressure-driven) GC.
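For anyone trying to reproduce the capture setup, the environment described above could be armed with a small launcher along these lines. This is a minimal sketch, not the tooling actually used for this report: `build_env` and `launch` are hypothetical helper names, the app path is whatever you pass in, and `DbgMiniDumpType=4` is assumed to refer to the `DOTNET_DbgMiniDumpType` variable.

```python
import os
import subprocess

# Diagnostic settings named in this report.
# Assumption: "DbgMiniDumpType=4" means DOTNET_DbgMiniDumpType=4 (full dump).
DIAG_ENV = {
    "DOTNET_StressLog": "1",
    "DOTNET_DbgEnableMiniDump": "1",
    "DOTNET_DbgMiniDumpType": "4",
}


def build_env(base=None):
    """Return a copy of the base environment with the diagnostic knobs added.

    The base mapping (os.environ by default) is never modified in place.
    """
    env = dict(os.environ if base is None else base)
    env.update(DIAG_ENV)
    return env


def launch(app_exe, base=None):
    """Start the app (hypothetical path) with diagnostics armed.

    Returns the subprocess.Popen handle so the caller can wait on it.
    """
    return subprocess.Popen([app_exe], env=build_env(base))
```

Calling `launch(path_to_exe).wait()` would then run the app until it exits or fail-fasts, with the StressLog and dump settings applied only to that child process rather than machine-wide.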

Disabling CET (<CetCompat>false</CetCompat>) stops the crash completely (confirmed by a 2+ hour soak — see below), which strongly implicates CET hardware shadow-stack enforcement colliding with the GC's return-address hijack.

Frequency / Severity

  • Reproduces every ~8–80 minutes of runtime. Fatal (process dies); no managed handler can intercept.
  • Reproduced across two different builds of the app and two render modes (GPU/ANGLE and CPU/software) — 3 captured dumps total.

StressLog — the fatal sequence (representative; identical shape in all 3 dumps)

<tid> <ts> : SYNC   Stopped in Jitted code at pc=...  sp=...  fullyInt=0
<tid> <ts> : SYNC   Hijacking return address ...  for thread ...
<tid> <ts> : CORDB  D::HFE: About to call LogFatalError            <-- FATAL
<tid> <ts> : EH     SetLastThrownObject: obj=... (ExecutionEngineException)
<tid> <ts> : CORDB  D::RFFE: About to call RaiseFailFastException  <-- FAIL-FAST
<tid> <ts> : SYNC   Thread::SuspendAllThreads() - Success
  • Exactly one fatal event per crash. Thrown object is the System.ExecutionEngineException singleton (same MethodTable across all 3 crashes), _message=null, HResult=0x80131506, exception code 0xE0434352 → runtime-raised, not a managed throw.
  • At every SuspendAllThreads (including the fatal one), the main UI thread shows Scanning Frameless method IL_STUB_PInvoke(MSG ByRef) — i.e. it is parked in the Win32 GetMessage/DispatchMessage(MSG) P/Invoke and being root-scanned.
  • The thread being hijacked at the fatal moment varies by crash and is doing ordinary work:
    • Crash A: main UI thread mid-Measure (layout).
    • Crash B: a non-main thread.
    • Crash C: a thread inside System.DateTimeOffset.ValidateOffset / ..ctor, pure CoreLib, no app/render frames (a periodic timer's timestamp path).
  • The GC at the fatal point is routine: e.g. Crash C was BEGINGC ... requested generation = 1 — an ordinary periodic gen-1, ~1 GC / 4 s, ~35 MB heap.
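The pattern described in the bullets above (a hijack message shortly before the single fatal event) can be located in a text dump of the StressLog with a scanner like the following. This is a rough sketch under stated assumptions: the line shape `<tid> <ts> : <FACILITY> <message>` is inferred from the excerpt above and may need adjusting for real dump output, and `find_fatal_events` is a hypothetical helper, not part of any runtime tooling.

```python
import re

# Assumed line shape, inferred from the excerpt: "<tid> <ts> : <FACILITY> <message>".
LINE_RE = re.compile(r"^\s*(\S+)\s+(\S+)\s*:\s*(\S+)\s+(.*)$")
FULLY_INT_RE = re.compile(r"fullyInt=(\d+)")


def find_fatal_events(lines):
    """Return one dict per fatal event found in the log.

    Each fatal line is paired with the most recent preceding
    "Stopped in Jitted code" and "Hijacking return address" lines
    (1-based line numbers), regardless of which thread logged them.
    """
    results = []
    stop_line = None
    fully_int = None
    hijack_line = None
    for no, raw in enumerate(lines, start=1):
        m = LINE_RE.match(raw)
        if not m:
            continue  # skip anything that does not look like a log line
        _tid, _ts, facility, msg = m.groups()
        if facility == "SYNC" and msg.startswith("Stopped in Jitted code"):
            fi = FULLY_INT_RE.search(msg)
            stop_line = no
            fully_int = int(fi.group(1)) if fi else None
        elif facility == "SYNC" and msg.startswith("Hijacking return address"):
            hijack_line = no
        elif facility == "CORDB" and "About to call LogFatalError" in msg:
            results.append({
                "fatal_line": no,
                "hijack_line": hijack_line,
                "stop_line": stop_line,
                "fully_int": fully_int,
            })
    return results
```

Running it over each dump's log text should report exactly one entry per crash if the shape matches the excerpt; an entry with `hijack_line` set to `None` would mean no hijack message preceded the fatal event in that log.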

Minimal characterization of the invariant

During a routine GC, while the main UI thread is suspended inside the Win32 DispatchMessage(MSG) P/Invoke frame being root-scanned, the runtime hijacks the return address of another thread stopped at a fullyInt=0 JIT point and immediately enters Debugger::HandleFatalError → raises ExecutionEngineException → RaiseFailFastException.

The hijack-victim thread's work is incidental (layout in one crash, DateTimeOffset ctor in another) — the constant is the suspend/hijack step itself faulting, with the pump P/Invoke frame present in the root scan.

Disabling CET stops it — confirmed

  • Building the app with <CetCompat>false</CetCompat> (opts the exe out of CET shadow stacks; apphost PE CET_COMPAT bit verified flipped 1→0 via dumpbin /headers) eliminates the crash: the CET-off build ran 125+ minutes with zero EEE and was still healthy, versus an 8–80 minute crash baseline on every CET-on build (3 dumps, 2 binaries, GPU and software render).
  • This points at CET hardware shadow-stack enforcement as the trigger: a GC return-address hijack sets a thread's return address to a stub that is not on the shadow stack, which CET treats as a control-flow violation and terminates the process. The runtime has a CET-safe suspension path using special user-mode APCs (QueueUserAPC2); this looks like a case where the hijack path is taken while shadow stacks are active.
  • <CetCompat>false</CetCompat> is a usable mitigation, but it disables a security feature for the whole process; an in-runtime fix (always take the CET-safe suspension path when shadow stacks are enabled) would be preferable.

Ruled out (with evidence)

  1. GPU / Skia / ANGLE render path: exonerated. Forced CPU software rendering (no av_libGLESv2.dll / av_libEGL.dll / nvwgf2umx.dll / d3d11.dll / dxgi.dll / opengl32.dll / vulkan-1.dll loaded, verified by module-list diff vs the GPU-mode dumps) still crashed with the identical mechanism in ~8 min, on a thread doing DateTimeOffset construction with zero render frames.
  2. Allocation pressure: exonerated. Fatal GCs are routine periodic gen-0/gen-1 collections at roughly one GC every 4 s with a ~35 MB heap; a low-allocation run crashed faster than a high-allocation one. No GC heap corruption (verifyheap clean).
  3. Application exception storm: exonerated. An earlier high-rate InvalidDataException storm (an app-side deserialization bug, unrelated to the runtime) was fixed; with it gone (InvalidDataException = 0 across a 273 MB StressLog) the EEE still reproduced. The background exception rate at crash time is roughly one exception every 8 s, not a storm.

Environment

  • .NET: 10.0.8 (DAC 10.0.826.23019)
  • OS: Windows 11, build 26200 (x64)
  • UI framework: Avalonia 12, Win32 windowing backend, classic desktop lifetime, single-threaded WPF-style UI message pump
  • TieredCompilation already disabled; reproduces with and without tiering knobs.
  • App is mostly idle at crash time (periodic timers + UI message pump + routine GC).

Possibly related

Repro assets available on request

  • 3 full minidumps (Type=4) with armed StressLog (DOTNET_StressLog=1, 128 MB/thread ring); the two storm-free dumps are cleanest (~273 MB each, can be trimmed to the fatal window).
  • clrstack / clrthreads / threads / module-list captures per dump.

Questions for the runtime team

  1. Is this a known issue in the return-address hijack / EEPolicy::HandleFatalError path on Windows when a thread is suspended at a fullyInt=0 point while another thread's IL_STUB_PInvoke(MSG ByRef) frame is being root-scanned?
  2. CET: when shadow stacks are enabled, under what conditions can the GC take the hijack suspension path (return-address overwrite) instead of the CET-safe QueueUserAPC2 path? We've confirmed <CetCompat>false</CetCompat> stops the crash. Is that the recommended mitigation on 10.0.8, or is an in-runtime fix planned?
  3. Do any suspension-timing knobs change time-to-crash as a race corroborator: DOTNET_gcServer=1, DOTNET_GCgen0size=<larger>, DOTNET_TieredPGO=0, DOTNET_JITMinOpts=1? (DOTNET_ThreadSuspendInjection is Unix-only and does not apply on Windows.)
  4. Is there a recommended mitigation for desktop apps on 10.0.8 short of pinning an earlier runtime, other than opting out of CET?

Suggested areas: area-VM-coreclr, area-GC-coreclr.

Metadata

Labels: area-VM-coreclr, needs-author-action, untriaged
