Description
In FfiRoom::connect, after Room::connect() succeeds, registered audio-filter plugins are initialized via filter.on_load(&req.url, &req.token) in a spawn_blocking task, and the ConnectCallback is not sent to the FFI client until that completes:
    // initialize audio filters
    let result = server
        .async_runtime
        .spawn_blocking(move || {
            for filter in registered_audio_filter_plugins().into_iter() {
                filter.on_load(&req.url, &req.token).map_err(|e| e.to_string())?;
            }
            Ok::<(), String>(())
        })
        .await
        .map_err(|e| e.to_string());
    match result {
        Err(e) | Ok(Err(e)) => {
            log::warn!("error while initializing audio filter: {}", e);
            log::error!(
                "audio filter cannot be enabled: ensure you are connecting to LiveKit Cloud and that the filter is properly configured"
            );
            // Skip returning an error here to keep the rtc session alive
            // But in this case, the filter isn't enabled in the session.
        }
        Ok(Ok(_)) => (),
    };
on_load for the noise-cancellation plugin performs a blocking HTTPS request to the LiveKit Cloud endpoint. There is no timeout around the plugin call. If that request stalls (DNS resolver timeout, SYN blackhole, etc.), the ConnectCallback is delayed for however long the OS takes to give up: in practice ~130+ seconds (with the Linux default of net.ipv4.tcp_syn_retries = 6, a blackholed TCP connect gives up after about 127 seconds).
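To make the "~130+ seconds" figure concrete, here is a rough sketch of the SYN backoff arithmetic. It assumes the usual Linux defaults (initial retransmission timeout of 1 second, doubling after each unanswered SYN, 6 retries) and ignores the kernel's upper bound on the timeout, DNS resolution time, and anything the plugin's HTTP client does on top. The function name is made up for illustration.

```rust
/// Approximate number of seconds a blackholed TCP connect waits before
/// giving up, assuming the retransmission timeout starts at
/// `initial_rto_secs` and doubles after every unanswered SYN.
/// (Illustrative helper; not part of livekit-ffi.)
fn syn_give_up_secs(initial_rto_secs: u64, syn_retries: u32) -> u64 {
    // The initial SYN and each retry are each followed by one wait:
    // rto, 2*rto, 4*rto, ... which gives syn_retries + 1 waits in total.
    (0..=syn_retries).map(|i| initial_rto_secs << i).sum()
}
```

With the assumed defaults, `syn_give_up_secs(1, 6)` adds up 1 + 2 + 4 + 8 + 16 + 32 + 64 = 127 seconds, which is in the same range as the delay observed below.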
Observed behavior
In a production voice-agent deployment (Python agents framework over this FFI) we observed two calls where:
- The room WebSocket connected normally — the agent participant was active server-side within ~0.4s of job dispatch.
- The FFI did not deliver ConnectCallback for ~137s. The agents framework logged "The room connection was not established within 10 seconds after calling job_entry", and ctx.connect() appeared hung.
- Because the agent session couldn't start, the agent never subscribed/published; the SIP caller heard ~128s of ringing and the carrier CANCELled the call.
- When on_load finally failed, the FFI logged "audio filter cannot be enabled: LiveKit Cloud is required" (older message text; now "ensure you are connecting to LiveKit Cloud...") and only then reported connected, ~10s after the room had already been torn down.
Note the irony: filter init failure is deliberately non-fatal ("Skip returning an error here to keep the rtc session alive"), but a slow failure is effectively fatal to the session anyway, and misleadingly presents as a room-connection problem rather than a plugin problem.
Proposed fix
- Wrap the on_load loop in a bounded timeout (a few seconds, e.g. 5s, or configurable via ConnectRequest). On timeout: log a warning, skip enabling the filter, and proceed, matching the existing non-fatal error handling.
- Alternatively/additionally: send ConnectCallback first and initialize filters concurrently, marking the filter unavailable if init fails.
- Consider logging the elapsed time on filter-init failure to make this failure mode diagnosable.
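A minimal sketch of the first and third suggestions, using only the standard library so it stands alone: the blocking init runs on a worker thread, and the caller waits for at most a fixed limit while recording the elapsed time. `FilterInit` and `init_filters_with_timeout` are hypothetical names, not existing livekit-ffi APIs. In the actual server, wrapping the `spawn_blocking` handle in `tokio::time::timeout` would probably be the closer fit. In both cases the blocking work is not cancelled on timeout; it is only no longer waited for.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Outcome of a bounded filter initialization (hypothetical type).
#[derive(Debug, PartialEq)]
enum FilterInit {
    Enabled,
    Failed(String),
    TimedOut,
}

/// Runs `init` on a worker thread and waits at most `limit` for its result.
/// Returns the outcome together with the time spent waiting, so the caller
/// can log how long filter init took before it failed or was abandoned.
fn init_filters_with_timeout<F>(init: F, limit: Duration) -> (FilterInit, Duration)
where
    F: FnOnce() -> Result<(), String> + Send + 'static,
{
    let started = Instant::now();
    let (tx, rx) = mpsc::channel();

    thread::spawn(move || {
        // If the caller already timed out, the receiver is gone and the
        // send fails; that is expected, so the error is ignored.
        let _ = tx.send(init());
    });

    let outcome = match rx.recv_timeout(limit) {
        Ok(Ok(())) => FilterInit::Enabled,
        Ok(Err(e)) => FilterInit::Failed(e),
        Err(mpsc::RecvTimeoutError::Timeout) => FilterInit::TimedOut,
        // The sender was dropped without sending, i.e. `init` panicked.
        Err(mpsc::RecvTimeoutError::Disconnected) => {
            FilterInit::Failed("filter init thread panicked".to_string())
        }
    };
    (outcome, started.elapsed())
}
```

At the call site, anything other than `FilterInit::Enabled` would be logged with the elapsed time and treated like the existing non-fatal error path, after which the ConnectCallback is sent as usual.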
Environment
- livekit-ffi via livekit-agents (Python), agent worker on Linux x86_64
- Plugin: livekit-plugins-noise-cancellation (BVC)
- Reproduces whenever the cloud HTTPS endpoint is unreachable and blackholed during connect (this can be triggered by, e.g., dropping outbound 443 to the edge after the WS is established: the WS connects via a fallback path, and on_load then stalls on the primary hostname)
See the thread showing the downstream effect, although it is not the root cause here.
Code reference for the snippet above: rust-sdks/livekit-ffi/src/server/room.rs, lines 151 to 172 in f5e85ed