WIP - PlatformAudio instability investigation #180

Draft
alan-george-lk wants to merge 11 commits into main from feature/platform-audio-stability

@alan-george-lk

Collaborator

No description provided.

alan-george-lk and others added 2 commits June 22, 2026 21:25
Drop nightly.yml (superseded by platform-audio-triage.yml). Make the
triage workflow the single focused, mac-only crash-hunting tool:

- Remove the temporary pull_request trigger (workflow_dispatch only) so it
  stops doing a ~20-minute build on every PR push.
- Cache the Rust submodule build (Swatinem/rust-cache) to skip the cold
  build on re-runs.
- Raise dispatch defaults to repeat=500 / pin_iterations=200 now that the
  test loop is confirmed cheap relative to the build.

Co-authored-by: Cursor <cursoragent@cursor.com>
@xianshijing-lk

Collaborator

Could you please describe the problems this investigation is targeting?

alan-george-lk and others added 9 commits June 23, 2026 09:19
Instrument both triage arms with a background sampler that records RSS,
thread count, fd count, and mach-port count of the integration-test
process over time. Mach-port growth is the tell for a CoreAudio HAL
client leak across ADM dispose/recreate cycles. CSVs are uploaded as
artifacts and first/last deltas are surfaced in the job summary.
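
The first/last delta summary described above can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual script: the column names (`rss_kb`, `threads`, `fds`, `mach_ports`) and the sample rows are assumptions made up for the example, and the real sampler's CSV layout may differ.

```python
import csv
import io

# Hypothetical column names; the real sampler's CSV header may differ.
METRICS = ("rss_kb", "threads", "fds", "mach_ports")


def first_last_deltas(csv_text):
    """Return {metric: last_sample - first_sample} for each metric column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if len(rows) < 2:
        return {}  # need at least two samples to form a delta
    first, last = rows[0], rows[-1]
    return {m: int(last[m]) - int(first[m]) for m in METRICS}


# Synthetic rows for illustration only; these are not measurements from this PR.
SAMPLE = """elapsed_s,rss_kb,threads,fds,mach_ports
0,150000,24,61,310
10,420000,24,61,310
20,910000,25,61,311
"""

deltas = first_last_deltas(SAMPLE)
for metric, delta in deltas.items():
    print(f"{metric}: {delta:+d}")
```

A large `rss_kb` delta next to near-zero deltas for the handle-style counters would point toward heap growth rather than a handle leak.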

Also lower the default repeat from 500 to 200 so Arm A finishes and
yields a full leak curve instead of timing out.

Co-authored-by: Cursor <cursoragent@cursor.com>
The resource sampler matched its own command line (argv contains the
process pattern) and picked it via head -1, so it measured an idle 1-thread
shell instead of the test binary. Select the matching PID with the largest
RSS instead, excluding the sampler's own PID, so it tracks the real
instrumented binary.
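
The corrected selection rule (largest RSS among matching processes, excluding the sampler itself) might look like the sketch below. It assumes process listings of the form `<pid> <rss_kb> <command>`, as `ps -axo pid=,rss=,command=` would print; the function name, the binary name `integration_tests`, and the sample listing are hypothetical.

```python
def pick_target_pid(ps_output, pattern, own_pid):
    """Pick the PID whose command contains `pattern` and has the largest RSS.

    `ps_output` holds one process per line: "<pid> <rss_kb> <command...>".
    The sampler's own PID is skipped, since its argv also contains `pattern`.
    """
    best = None  # (rss_kb, pid) of the best candidate seen so far
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        pid, rss_kb, command = int(parts[0]), int(parts[1]), parts[2]
        if pid == own_pid or pattern not in command:
            continue
        if best is None or rss_kb > best[0]:
            best = (rss_kb, pid)
    return best[1] if best else None


# Synthetic listing for illustration; PID 101 plays the role of the sampler.
LISTING = """
  101   1200 /bin/sh sampler.sh integration_tests
  102 350000 ./integration_tests --gtest_repeat=200
  103    900 grep integration_tests
"""

target = pick_target_pid(LISTING, "integration_tests", own_pid=101)
print(target)
```

Picking by largest RSS rather than by first match also sidesteps short-lived helper processes whose command lines happen to contain the pattern.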

Drop --gtest_break_on_failure from the triage arms: it converted ordinary
EXPECT failures (e.g. "no platform audio frames received") into SIGTRAP
core dumps (a misleading ~2GB artifact) and halted the repeat loop before
the sampler could capture the full curve.

Add Arm C, which runs only PlatformAudioFramesReachRemote with a small
repeat, to distinguish "frame flow dead on a fresh ADM" from "frame flow
only dies after prior teardown/recreate cycles churn the ADM".

Co-authored-by: Cursor <cursoragent@cursor.com>
The instability reproduces on Apple Silicon too (arm64 integration tests
have been seen to SIGSEGV, exit 139), not just Intel x64. Replace the
single-runner input with a matrix that fans "all" out across one Intel
(macos-15-large) and one arm64 (macos-15) runner by default, while still
allowing a single runner to be targeted. Artifact names are suffixed with
the runner so the parallel arms don't collide on upload.
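
The fan-out and artifact naming described above amount to the logic sketched below. The runner labels come from this commit message; the function names and the artifact base name are hypothetical, and the real workflow expresses this in its matrix configuration rather than in Python.

```python
# Runner labels named in the commit message above.
ALL_RUNNERS = ["macos-15-large", "macos-15"]


def expand_runners(runner_input):
    """Fan "all" out to every default runner; otherwise target a single runner."""
    return list(ALL_RUNNERS) if runner_input == "all" else [runner_input]


def artifact_name(base, runner):
    """Suffix the artifact name with the runner so parallel uploads don't collide."""
    return f"{base}-{runner}"


# "resource-samples" is a made-up base name for illustration.
names = [artifact_name("resource-samples", r) for r in expand_runners("all")]
print(names)
```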

Co-authored-by: Cursor <cursoragent@cursor.com>
The corrected resource sampler showed RSS growing unbounded (to ~4.9 GB on
Intel before it crashed) while threads, fds, and mach ports stay flat, and
the growth reproduces even with the ADM pinned -- i.e. a heap leak in the
per-room publish/subscribe cycle, not an ADM-teardown or handle leak.

Add Arm D, which runs the pinned-cycle reproducer under macOS `leaks
--atExit` with MallocStackLogging so each still-allocated block is reported
with its allocating backtrace (symbol + file:line). This names the leaking
call site (C++ SDK vs Rust FFI) directly. The report is uploaded as an
artifact and a small leak_iterations input keeps the stack-logging overhead
bounded.

Co-authored-by: Cursor <cursoragent@cursor.com>
The leak is reachable retention, not a lost-pointer leak: `leaks` reports 0
on both architectures because the growing memory is still referenced. `leaks
--atExit` also can't see it (cleanup reclaims the memory at shutdown). Code review
ruled out the obvious C++ suspects -- the FFI response buffer handle is
already dropped via an FfiHandle guard in sendRequest, and Room deregisters
its FfiClient listener on disconnect/destruction.

Add Arm E, which runs the dispose+recreate path (the worst leaker) under
MallocStackLogging and samples the LIVE heap mid-run via `heap` and
`malloc_history` (new heap_snapshots.sh). Diffing successive heap summaries
plus the malloc_history stacks names the growing allocation type and its
call site so we can localize the retention to C++ SDK vs Rust FFI vs WebRTC.
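
Diffing successive heap summaries could be sketched as below. This assumes each snapshot has already been parsed into a mapping from allocation class name to total bytes; parsing the `heap` tool's text output is not shown, and the class names and numbers are placeholders rather than results from this investigation.

```python
def diff_heap_summaries(before, after, top=5):
    """Return up to `top` (class_name, byte_growth) pairs, largest growth first.

    `before` and `after` map allocation class name -> total bytes, assumed to be
    pre-parsed from two heap snapshots taken at different times.
    """
    names = set(before) | set(after)
    growth = {name: after.get(name, 0) - before.get(name, 0) for name in names}
    grown = [(name, delta) for name, delta in growth.items() if delta > 0]
    # Sort by descending growth, then by name so ties are deterministic.
    grown.sort(key=lambda item: (-item[1], item[0]))
    return grown[:top]


# Placeholder data for illustration only.
snapshot_1 = {"ClassA": 1000, "ClassB": 5000, "ClassC": 300}
snapshot_2 = {"ClassA": 9000, "ClassB": 5000, "ClassD": 700}

growers = diff_heap_summaries(snapshot_1, snapshot_2)
for name, delta in growers:
    print(f"{name}: +{delta} bytes")
```

Classes that keep appearing at the top across consecutive diffs are the natural candidates to look up in the allocation backtraces.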

Co-authored-by: Cursor <cursoragent@cursor.com>
First run plateaued before the snapshot window opened, so all heap samples
were identical and malloc_history (gated to the last ticks) never fired. The
steady-state heap was already telling: dominated by webrtc::Codec copies and
StatsReport entries in liblivekit_ffi.dylib.

Snapshot every 10s (not 25s), allow more ticks, raise the dispose-path repeat
to 60 so the process keeps churning across the window, and capture
malloc_history (-allBySize | head) on every tick so we always get the
allocating backtraces even if the process exits or hangs early.

Co-authored-by: Cursor <cursoragent@cursor.com>
Resolve submodule conflict by keeping the latest bugfix/ffi_handle_cleanup
commit (6881168d) instead of main's dynacast bump.

Co-authored-by: Cursor <cursoragent@cursor.com>