FfiRoom::connect blocks ConnectCallback on audio-filter on_load() with no timeout — network stall turns into a multi-minute connect hang #1163

Description

@yepher

In FfiRoom::connect, after Room::connect() succeeds, registered audio-filter plugins are initialized via filter.on_load(&req.url, &req.token) in a spawn_blocking task, and the ConnectCallback is not sent to the FFI client until that completes:

// initialize audio filters
let result = server
    .async_runtime
    .spawn_blocking(move || {
        for filter in registered_audio_filter_plugins().into_iter() {
            filter.on_load(&req.url, &req.token).map_err(|e| e.to_string())?;
        }
        Ok::<(), String>(())
    })
    .await
    .map_err(|e| e.to_string());
match result {
    Err(e) | Ok(Err(e)) => {
        log::warn!("error while initializing audio filter: {}", e);
        log::error!(
            "audio filter cannot be enabled: ensure you are connecting to LiveKit Cloud and that the filter is properly configured"
        );
        // Skip returning an error here to keep the rtc session alive
        // But in this case, the filter isn't enabled in the session.
    }
    Ok(Ok(_)) => (),
};

on_load for the noise-cancellation plugin performs a blocking HTTPS request to the LiveKit Cloud endpoint. There is no timeout around the plugin call. If that request stalls (DNS resolver timeout, SYN blackhole, etc.), the ConnectCallback is delayed for however long the OS takes to give up — in practice ~130+ seconds (default Linux TCP connect give-up).
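
For reference, the ~130s figure is consistent with Linux's SYN retransmission schedule. Below is a small sketch of the arithmetic, assuming the kernel defaults (tcp_syn_retries = 6, 1s initial RTO, doubling after each attempt). syn_give_up_secs is a hypothetical helper for illustration, not part of the FFI.

```rust
/// Seconds until a blackholed connect() gives up, given the number of SYN
/// retransmissions and the initial retransmission timeout (RTO).
/// Assumption: the RTO doubles after every unanswered SYN.
pub fn syn_give_up_secs(syn_retries: u32, initial_rto_secs: u64) -> u64 {
    // The initial SYN plus `syn_retries` retransmissions, each waiting
    // twice as long as the previous one before the next step.
    (0..=syn_retries).map(|i| initial_rto_secs << i).sum()
}
```

With those defaults, syn_give_up_secs(6, 1) comes to 127s for a single blackholed connect. DNS retries or attempts against more than one address would add to that, which may account for the ~137s we observed.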

Observed behavior

In a production voice-agent deployment (Python agents framework over this FFI) we observed two calls where:

  • The room WebSocket connected normally — the agent participant was active server-side within ~0.4s of job dispatch.
  • The FFI did not deliver ConnectCallback for ~137s. The agents framework logged "The room connection was not established within 10 seconds after calling job_entry", and ctx.connect() appeared hung.
  • Because the agent session couldn't start, the agent never subscribed/published; the SIP caller heard ~128s of ringing and the carrier CANCELled the call.
  • When on_load finally failed, the FFI logged "audio filter cannot be enabled: LiveKit Cloud is required" (older message text; now "ensure you are connecting to LiveKit Cloud...") and only then reported connected, ~10s after the room had already been torn down.

Note the irony: filter init failure is deliberately non-fatal ("Skip returning an error here to keep the rtc session alive"), but a slow failure is effectively fatal to the session anyway, and misleadingly presents as a room-connection problem rather than a plugin problem.

Proposed fix

  • Wrap the on_load loop in a bounded timeout (a few seconds, e.g. 5s, or configurable via ConnectRequest). On timeout: log a warning, skip enabling the filter, and proceed — matching the existing non-fatal error handling.
  • Alternatively/additionally: send ConnectCallback first and initialize filters concurrently, marking the filter unavailable if init fails.
  • Consider logging the elapsed time on filter-init failure to make this failure mode diagnosable.
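
A minimal sketch of the first and third bullets, using only std so it runs on its own. It is an illustration under assumptions, not a patch: FilterInit and init_filters_bounded are hypothetical names, and `load` stands in for the existing on_load loop. In the FFI itself, wrapping the spawn_blocking JoinHandle in tokio::time::timeout would be the natural equivalent.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Outcome of a bounded filter-initialization attempt (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum FilterInit {
    Loaded,
    Failed(String),
    TimedOut,
}

/// Runs `load` on a helper thread and waits at most `limit` for it.
/// Returns the outcome plus the elapsed time, so the caller can log it.
/// On timeout the helper thread keeps running (a blocking call cannot be
/// cancelled from outside); its late result is dropped.
pub fn init_filters_bounded<F>(load: F, limit: Duration) -> (FilterInit, Duration)
where
    F: FnOnce() -> Result<(), String> + Send + 'static,
{
    let started = Instant::now();
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Ignore send errors: the receiver is gone if we already timed out.
        let _ = tx.send(load());
    });
    let outcome = match rx.recv_timeout(limit) {
        Ok(Ok(())) => FilterInit::Loaded,
        Ok(Err(e)) => FilterInit::Failed(e),
        Err(mpsc::RecvTimeoutError::Timeout) => FilterInit::TimedOut,
        // The sender was dropped without sending, i.e. `load` panicked.
        Err(mpsc::RecvTimeoutError::Disconnected) => {
            FilterInit::Failed("filter init thread panicked".to_string())
        }
    };
    (outcome, started.elapsed())
}
```

The caller would treat Failed and TimedOut the same way the current code treats an on_load error: warn (including the elapsed time), leave the filter disabled, and send ConnectCallback. One caveat applies to the tokio version as well: timing out does not stop the blocking task, so a stalled on_load still occupies a blocking-pool thread until the OS gives up.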

Environment

  • livekit-ffi via livekit-agents (Python), agent worker on Linux x86_64
  • Plugin: livekit-plugins-noise-cancellation (BVC)
  • Reproduces whenever the cloud HTTPS endpoint is blackholed during connect (reproducible by, e.g., dropping outbound 443 to the edge after the WS is established: the WS connects via a fallback path, then on_load stalls on the primary hostname)

See the thread showing the downstream effect, although it does not identify the root cause described here.
