merging v1.83.14-stable #8
Open
Raman-12321 wants to merge 7570 commits into
…mary actually renders under xdist `_session_stats` is a module-level dict mutated inside `_vcr_outcome_gate` — which runs in each xdist worker process. The controller's `pytest_terminal_summary` then reads its own empty `_session_stats` and bails on `if not counts: return`, so the OVERFLOW / LIVE_CALL sections the rest of this PR adds never make it into CI logs in the dist mode CI actually uses. Ship a structured `vcr_outcome` payload via `user_properties` (which xdist round-trips) and add `aggregate_report_outcome` on the controller to fold worker outcomes into `_session_stats`. The recording process tags `vcr_recorded_by` with `PYTEST_XDIST_WORKER` so the controller can tell "single-process — already counted locally" apart from "produced by a worker — needs aggregation here", and not double-count when there's no xdist. Covered by 9 new unit tests in test_vcr_classification.py including the end-to-end summary render path.
…itellm_vcr-cache-observability-and-fixes-c5bc
…crypted_content (#27820)
…th when flag set (#27716) * feat(proxy): skip disable_background_health_check models on GET /health when flag set Co-authored-by: Cursor <cursoragent@cursor.com> * fix comment * fix greptile comments * Fix health check fallback kwargs * Format health endpoint * Harden direct health check kwargs compatibility for monkeypatched perform_health_check Replace substring-based TypeError detection with unexpected-keyword checks and a short retry chain (full kwargs, instrumentation only, filter only, minimal) so partial stubs work regardless of which optional kwarg fails first. Add proxy unit tests for legacy three-arg stubs and single-kwarg variants. Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com> * fix black --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>
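The retry chain can be sketched roughly as follows. The kwarg names `instrumentation` and `model_filter` are placeholders, and matching on CPython's "unexpected keyword argument" wording is only one possible detection strategy; the real helper may differ.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Tuple


def _is_unexpected_kwarg(exc: TypeError) -> bool:
    # CPython words this as "... got an unexpected keyword argument 'x'".
    return "unexpected keyword argument" in str(exc)


async def call_with_kwarg_fallback(
    fn: Callable[..., Awaitable[Any]],
    args: Tuple[Any, ...],
    instrumentation: Any,
    model_filter: Any,
) -> Any:
    """Try full kwargs, then each optional kwarg alone, then none."""
    attempts: List[Dict[str, Any]] = [
        {"instrumentation": instrumentation, "model_filter": model_filter},
        {"instrumentation": instrumentation},
        {"model_filter": model_filter},
        {},
    ]
    for kwargs in attempts:
        try:
            return await fn(*args, **kwargs)
        except TypeError as exc:
            # Any other TypeError is a genuine bug and must propagate.
            if not _is_unexpected_kwarg(exc):
                raise
    raise RuntimeError("unreachable: the empty-kwargs attempt cannot fail this way")
```

A legacy three-argument stub succeeds on the last attempt, while a stub that accepts only one of the optional kwargs succeeds on the attempt that passes exactly that kwarg.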
…ocks (#27850) * fix(bedrock-converse): drop blank-text fallback for empty thinking blocks Claude Code with extended thinking replays prior assistant turns that include an empty thinking block (`thinking=""`, `signature=""`) alongside tool_use blocks. The unsigned-reasoning fallback in `add_thinking_blocks_to_assistant_content` was emitting `BedrockContentBlock(text="")`, which Bedrock Converse rejects with: "The text field in the ContentBlock object at messages.X.content.0 is blank." Guard the fallback with a strip() check, matching the existing empty-text guards elsewhere in `_bedrock_converse_messages_pt`. * style: remove unneeded comments
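A simplified illustration of the guard, using plain dicts rather than the real `BedrockContentBlock` type. The function name and the exact block layout are assumptions made for the example.

```python
from typing import Any, Dict, List


def thinking_blocks_to_content(thinking_blocks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Convert replayed thinking blocks into Converse-style content blocks."""
    content: List[Dict[str, Any]] = []
    for block in thinking_blocks:
        text = block.get("thinking") or ""
        signature = block.get("signature") or ""
        if signature:
            # Signed reasoning is passed through as reasoning content.
            content.append(
                {"reasoningContent": {"reasoningText": {"text": text, "signature": signature}}}
            )
        elif text.strip():
            # Unsigned fallback: emit plain text only when it is not blank.
            content.append({"text": text})
        # Blank unsigned blocks are dropped, so no {"text": ""} is ever sent.
    return content
```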
…lidate LiteLLM_JWTAuth.__init__ calls get_instance_fn(custom_validate) without config_file_path, so an operator who configures custom_validate: s3://bucket/module.fn in their YAML JWT auth section would hit the runtime gate on startup and break their deployment. Accept config_file_path as a non-field kwarg (popped before the invalid-keys check), thread it into get_instance_fn, and pass it from the startup-load callsite via the existing user_config_file_path module-level path. Admin-API JWT config writes leave the kwarg at None and still hit the gate.
* fix(mcp): surface upstream 401 for token-forwarding MCP servers For MCP servers configured with extra_headers: [Authorization], the gateway forwards the client token directly to the upstream. When that token is rejected (expired or invalid) the upstream returns 401, but the MCP SDK starts the SSE stream with 200 OK before calling handlers, so the 401 can't be returned mid-stream. Fix: add a pre-flight httpx probe in handle_streamable_http_mcp — before the SDK opens the session — so the gateway can still return HTTP 401 with WWW-Authenticate: Bearer authorization_uri=<gateway-discovery-url> when the upstream rejects the token. The probe fails-open (returns 200) on network errors so a transient hiccup does not block valid requests. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(mcp): parallelize pre-flight auth probes and use HEAD to avoid side effects - Extract forwarded_auth outside the pass-through server loop (was called N times for the same scope value) - Gather all upstream auth probes concurrently with asyncio.gather instead of sequentially; eliminates N×5 s worst-case latency - Switch probe from POST+initialize JSON-RPC body to HEAD request; HEAD carries the Authorization header so the upstream rejects invalid tokens with 401 but never allocates a session or writes an audit entry Co-authored-by: Cursor <cursoragent@cursor.com> * fix(mcp): use get_async_httpx_client in _probe_upstream_auth Replaces bare httpx.AsyncClient with the project-standard get_async_httpx_client(httpxSpecialProvider.MCP) to satisfy the ensure_async_clients_test code coverage check and avoid the +500 ms per-request overhead of creating a new client on every probe call. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(mcp): extract pre-flight probe into _check_passthrough_upstream_auth Moves the parallel upstream auth probe logic out of handle_streamable_http_mcp into a dedicated helper to satisfy Ruff PLR0915 (Too many statements > 50). 
Co-authored-by: Cursor <cursoragent@cursor.com> * fix(mcp): gate pre-flight probes on authorized server set to prevent bypass _check_passthrough_upstream_auth was resolving user-supplied server names directly before authorization ran, letting any permitted LiteLLM key trigger an upstream HEAD probe to a server it was not allowed to use. Changes: - Call _get_allowed_mcp_servers inside the helper so only servers the caller's key is authorized for are probed. - Move the call site to after toolset scoping so the auth context is fully resolved before the probe list is built. - Thread user_api_key_auth into the helper signature (replaces the raw mcp_servers name list). Co-authored-by: Cursor <cursoragent@cursor.com> * Add async HTTP HEAD support Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(mcp): use Scope type annotation in _get_forwarded_auth_from_scope Co-authored-by: Cursor <cursoragent@cursor.com> * Fix MCP upstream auth probe method Co-authored-by: Yassin Kortam <yassin@berri.ai> * Remove unused AsyncHTTPHandler head method Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(mcp): exclude has_client_credentials servers from pre-flight auth probe _prepare_mcp_server_headers skips caller Authorization when the server uses OAuth client-credentials (M2M), but the pre-flight probe was still selecting those servers and forwarding the caller's raw token in the HEAD request. Exclude servers with has_client_credentials from the probe list to match the actual downstream header-preparation logic. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(mcp): propagate upstream 403 as 403, not 401 with WWW-Authenticate Per RFC 9110, 401 means "go get new credentials." Mapping an upstream 403 to a gateway 401 causes OAuth clients to restart the authorization flow, obtain a fresh token with identical scopes, hit 403 again, and loop indefinitely. 
401 from upstream → gateway 401 + WWW-Authenticate (re-authorize) 403 from upstream → gateway 403 (no WWW-Authenticate hint) Co-authored-by: Cursor <cursoragent@cursor.com> * fix(mcp): skip auth probe when Authorization may be the LiteLLM proxy key The pre-flight upstream probe must not forward the caller's Authorization header when it could itself be the LiteLLM proxy API key. Restrict the probe to requests that supply x-litellm-api-key explicitly — only then is the Authorization header unambiguously the upstream OAuth token the caller wants forwarded. * Fix MCP ASGI HTTPException propagation Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(mcp): use public AsyncHTTPHandler.post() in auth probe Use AsyncHTTPHandler.post() and catch httpx.HTTPStatusError explicitly so the 401/403 we want to surface is not silently swallowed by the broad fail-open except Exception block. Avoids reaching into the handler's private client attribute, which would silently regress to fail-open if AsyncHTTPHandler is ever refactored. * Fix MCP auth probe tests Co-authored-by: Yassin Kortam <yassin@berri.ai> * test(mcp): add coverage for httpx.HTTPStatusError path in auth probe AsyncHTTPHandler.post() calls raise_for_status() internally, so a real upstream 401/403 lands as httpx.HTTPStatusError. Add a test that exercises that specific exception path so a regression that swallows the error in the broad fail-open except Exception would be caught. --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Yassin Kortam <yassin@berri.ai> Co-authored-by: claude-bot <claude-bot@anthropic.com>
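The 401/403 mapping above could look like this in isolation. The function name and return shape are hypothetical; only the status semantics and the `WWW-Authenticate` format come from the commit messages.

```python
from typing import Dict, Optional, Tuple


def map_upstream_auth_status(
    upstream_status: int, discovery_url: str
) -> Optional[Tuple[int, Dict[str, str]]]:
    """Return (gateway_status, headers) to surface, or None to let the request proceed."""
    if upstream_status == 401:
        # 401 tells the client to obtain new credentials, so include the hint.
        return 401, {"WWW-Authenticate": f"Bearer authorization_uri={discovery_url}"}
    if upstream_status == 403:
        # 403 must not trigger re-authorization, so no WWW-Authenticate header.
        return 403, {}
    # Anything else (including probe failures treated as 200) is fail-open.
    return None
```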
…timodal pricing (#27848) * fix(cost): align vertex_ai/gemini-embedding-2-preview with Vertex multimodal pricing Co-authored-by: Cursor <cursoragent@cursor.com> * fix(cost): align vertex_ai/gemini-embedding-2 GA source URL with preview Per Greptile review on #27848: GA entry referenced ai.google.dev while the preview entry was updated to the canonical Vertex AI pricing page. Both share identical pricing values; sync the source URL for consistency. https://claude.ai/code/session_01W8jRwstnmduadGw8Z8egxe --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Claude <noreply@anthropic.com>
…27834) * feat(mcp): add delegate_auth_to_upstream flag for PKCE passthrough Adds an opt-in per-server flag that lets clients (e.g. VS Code) complete PKCE directly with an upstream OAuth2 MCP server, instead of LiteLLM double-gating with its own API-key/SSO check. Only honored when auth_type=oauth2 and the operator explicitly sets the flag; mixed-target or non-oauth2 requests fail closed. - Adds the field to Pydantic models, Prisma schema, and a migration - New MCPRequestHandler._target_servers_delegate_auth_to_upstream gate that runs only when no x-litellm-api-key is present, so authenticated users still get user_id resolution + stored-credential lookup - Anonymous callers now see delegate servers in get_allowed_mcp_servers (scoped to delegate servers only; the upstream still enforces auth) - mcp_management_endpoints: allow anonymous /authorize and /token for delegate servers so VS Code can complete PKCE without a LiteLLM session - UI toggle (shown only for oauth2) + payload/view wiring - Tests covering: oauth2 on/off, non-oauth2 with flag, mixed targets, no resolvable target, explicit key precedence, and 401 emission Co-authored-by: Cursor <cursoragent@cursor.com> * Enforce oauth2 for delegated MCP auth bypass Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(mcp): close secondary Authorization bypass for delegate servers The delegate-auth bypass gated only on the primary `x-litellm-api-key` header, so a LiteLLM key sent via `Authorization: Bearer sk-...` (the secondary header) was silently dropped — skipping spend tracking and rate limiting. Gate on the resolved litellm_api_key (which considers both headers) so the bypass fires only when neither is present. Also update the existing "Authorization header present" test to reflect that an upstream OAuth token now flows through the existing oauth2 fallback (LiteLLM auth attempt → fail → anonymous), not via the delegate branch. 
Co-authored-by: Cursor <cursoragent@cursor.com> * Avoid duplicate MCP OAuth credential lookup Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(mcp): block delegate bypass for M2M and internal-only servers Two security issues flagged in code review: 1. High – client_credentials (M2M) servers must not be delegatable: LiteLLM auto-fetches the upstream token using stored credentials, so allowing anonymous bypass would let any external caller invoke tools authenticated as LiteLLM's service account. Fix: check `server.has_client_credentials` in `_target_servers_delegate_auth_to_upstream`, the anonymous allow-list in `get_allowed_mcp_servers`, and `_mcp_oauth_user_api_key_auth`. 2. Medium – internal-only servers exposed to public internet: The anonymous delegate allow-list was not filtering by `available_on_public_internet`, so external callers with an upstream OAuth token could invoke tools on servers marked internal-only. Fix: add `available_on_public_internet` guard to the anonymous delegate server list in `get_allowed_mcp_servers`. Tests added for both cases. Co-authored-by: Cursor <cursoragent@cursor.com> * Require public MCP delegate auth servers Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(mcp): align delegate auth path parsing with downstream routing `_extract_target_server_names_from_path` used a naive segments-based split while `server.py::_get_mcp_servers_in_path` uses a regex that allows server names with one embedded slash and comma-separated lists. With the old parser, a request to `/mcp/<delegated>/<garbage>` was parsed as targeting `<delegated>` by the auth gate (bypassing LiteLLM auth) while the routing layer parsed it as `<delegated>/<garbage>` — when that name did not resolve, the request fell back to the anonymous allow-list, which can include `allow_all_keys` servers that normally require a LiteLLM key. 
Replace the parser with the same regex logic as `_get_mcp_servers_in_path` so auth gating sees the exact target name(s) downstream routing sees. Add regression tests covering parser parity and the specific extra-path-segment bypass attempt. https://claude.ai/code/session_01SjyPmwfmrq8fveFgw9iHW9 * fix(mcp): close header/path TOCTOU in MCP delegate auth gate `_target_servers_delegate_auth_to_upstream` and `_target_servers_use_oauth2` trusted the `x-mcp-servers` header when present, but `server.py::extract_mcp_auth_context` overrides that header with the path-derived list for `/mcp/...` routes. An attacker could set `x-mcp-servers: <delegated>` while pointing the URL path at a non-delegate server, flipping the auth gate without changing the target downstream routing actually uses. Extract a shared `_resolve_target_server_names` helper that mirrors the downstream override (path-derived names for `/mcp/...` routes, header value otherwise). Add regression tests covering the TOCTOU attempt and the helper's path-vs-header precedence. https://claude.ai/code/session_01SjyPmwfmrq8fveFgw9iHW9 * Fix delegated MCP OAuth test mock Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(mcp): drop unreachable /{server}/mcp branch in auth path parser `_extract_target_server_names_from_path` also matched the ``/{server_name}/mcp`` form, but the downstream parser ``_get_mcp_servers_in_path`` only handles ``/mcp/...`` — and ``dynamic_mcp_route`` in ``proxy_server`` rewrites ``/{name}/mcp`` to ``/mcp/{name}`` on the scope before the MCP handler runs. Parsing the un-rewritten form on the auth side was therefore unreachable in production, and contradicted the docstring's claim of mirroring the downstream parser — exactly the kind of mismatch that risks a future header/path TOCTOU if any new entry point skips the rewrite. Drop the branch; the canonical ``/mcp/...`` path matches both parsers. Update the regression test to assert the new behavior. 
https://claude.ai/code/session_01SjyPmwfmrq8fveFgw9iHW9 * Fix MCP path auth target resolution Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(mcp): require auth for refresh_token grants on delegate-auth servers `_mcp_oauth_user_api_key_auth` gates the unauthenticated PKCE flow for ``delegate_auth_to_upstream`` servers, but the bypass applied to BOTH ``/authorize`` and ``/token`` regardless of grant type. ``mcp_token`` accepts ``grant_type=refresh_token`` as well as ``authorization_code``, and ``exchange_token_with_server`` attaches the server's stored ``client_secret`` to whatever is forwarded upstream. An unauthenticated caller holding a refresh token issued to that OAuth client could mint fresh upstream access tokens through LiteLLM. Limit the anonymous bypass on ``/token`` to ``grant_type=authorization_code`` (the only grant PKCE actually protects via ``code_verifier``); fall through to normal LiteLLM auth for ``refresh_token`` and any other grant. ``/authorize`` continues to allow anonymous PKCE redirects. https://claude.ai/code/session_01SjyPmwfmrq8fveFgw9iHW9 * fix(ui): clear delegate_auth_to_upstream when switching off oauth2 The ``delegate_auth_to_upstream`` form field is rendered inside an ``isOAuth2 && (...)`` conditional, so the Form.Item unmounts when the user changes ``auth_type`` away from ``oauth2``. The follow-up ``form.setFieldValue("delegate_auth_to_upstream", false)`` runs after the field has already deregistered, so ``onFinish`` receives ``undefined`` and the fallback ``?? mcpServer.delegate_auth_to_upstream`` preserved the old ``true``. The flag then persisted in the database for a non-oauth2 server and silently re-activated if ``auth_type`` was later switched back to ``oauth2``. In the edit payload, force the flag to ``false`` whenever ``auth_type !== oauth2``; only trust the form value (and the existing DB fallback) when the server is actually oauth2. 
Backend defense-in-depth already ignores the flag for non-oauth2 servers, but the DB state should stay clean too. https://claude.ai/code/session_01SjyPmwfmrq8fveFgw9iHW9 * Fix MCP delegate auth reset on edit Co-authored-by: Yassin Kortam <yassin@berri.ai> --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Yassin Kortam <yassin@berri.ai> Co-authored-by: Claude <claude@anthropic.com>
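Taken together, the delegate-auth commits above imply a fail-closed decision along these lines. This is a sketch under assumptions: the `McpServer` dataclass and both function names are illustrative stand-ins, though the field names follow the commit messages.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class McpServer:
    auth_type: str
    delegate_auth_to_upstream: bool
    has_client_credentials: bool
    available_on_public_internet: bool


def server_may_delegate(server: McpServer) -> bool:
    """Delegation applies only to public, non-M2M oauth2 servers that opt in."""
    return (
        server.auth_type == "oauth2"
        and server.delegate_auth_to_upstream
        and not server.has_client_credentials
        and server.available_on_public_internet
    )


def allows_anonymous_oauth_call(
    server: McpServer, endpoint: str, grant_type: Optional[str] = None
) -> bool:
    """Decide whether /authorize or /token may skip LiteLLM auth."""
    if not server_may_delegate(server):
        return False
    if endpoint == "authorize":
        return True
    if endpoint == "token":
        # Only the PKCE-protected grant is allowed anonymously.
        return grant_type == "authorization_code"
    return False
```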
…etion transformation (#27727) * fix(responses): preserve cache_control in Responses API -> Chat Completion transformation cache_control injected by AnthropicCacheControlHook was silently dropped when _transform_responses_api_content_to_chat_completion_content rebuilt content blocks with only {type, text}. Now copies cache_control through so Anthropic prompt caching works correctly when using client.responses.create with cache_control_injection_points. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(responses): preserve cache_control for input_image and input_file blocks Extends the cache_control fix to image and file content blocks, which were also silently dropping cache_control during the Responses API -> Chat Completion transformation. Adds tests for all three content block types. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Claude Babysitter <claude@anthropic.com>
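As an illustration of the fix, a content-block converter that carries `cache_control` through might look like the following. The function name is made up and the block shapes are simplified; it is not the body of `_transform_responses_api_content_to_chat_completion_content`.

```python
from typing import Any, Dict


def to_chat_content_block(block: Dict[str, Any]) -> Dict[str, Any]:
    """Convert one Responses API content block to a Chat Completions block."""
    block_type = block.get("type")
    if block_type == "input_text":
        out: Dict[str, Any] = {"type": "text", "text": block.get("text", "")}
    elif block_type == "input_image":
        out = {"type": "image_url", "image_url": {"url": block.get("image_url", "")}}
    else:
        out = dict(block)
    # Copy cache_control through so prompt-caching markers survive the rebuild.
    if "cache_control" in block:
        out["cache_control"] = block["cache_control"]
    return out
```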
External readiness probes consumed the legacy detailed payload's `db`
field to drive alerting and pod-rotation decisions. Stripping the body
to `{"status": "healthy"}` broke those probes silently — the HTTP code
still flipped to 503, but probes checking `body.db == "connected"`
treated the response as healthy.
Add `db` back to the unauthenticated payload. Keep the rest of the
diagnostic fields (litellm_version, callbacks, cache, log_level) gated
behind /health/readiness/details so the recon-leak gate from #26912
holds. Values match the legacy contract: "connected", "disconnected",
"Not connected".
fix(proxy): expose db status on public /health/readiness
Co-authored-by: oss-agent-shin <279349115+oss-agent-shin@users.noreply.github.com>
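One way to picture the split between the public and detailed payloads is a field allow-list. This is a hypothetical sketch; `PUBLIC_READINESS_FIELDS` and `public_readiness_view` are not names from the codebase.

```python
from typing import Any, Dict

# Only these keys are exposed on the unauthenticated endpoint.
PUBLIC_READINESS_FIELDS = ("status", "db")


def public_readiness_view(details: Dict[str, Any]) -> Dict[str, Any]:
    """Project the detailed readiness payload down to its public fields."""
    return {key: details[key] for key in PUBLIC_READINESS_FIELDS if key in details}
```

With this shape, a probe that checks `body.db == "connected"` keeps working, while version, callbacks, cache and log level stay behind the details endpoint.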
Document the purpose of the daemon thread that backs the sync branch of the timeout decorator. Co-authored-by: oss-agent-shin <279349115+oss-agent-shin@users.noreply.github.com>
#26302) * fix: Fix Redis Sentinel client handling to solve authentication error with password protected sentinel (#25625) * fix Redis Sentinel authentication handling * test: cover Redis Sentinel auth routing * refactor: align Redis Sentinel kwargs threading * fix: avoid duplicate Redis Sentinel socket timeouts * Address review comments * refactor(_redis): return set from _get_redis_kwargs for O(1) lookup Align _get_redis_kwargs() with the cluster helper by returning a set instead of a list, so the sentinel connection-kwargs filter uses O(1) membership tests. Addresses Greptile review feedback on PR #26302. * fix(_redis): restore Azure-specific kwargs in cluster kwargs set The set-literal refactor of _get_redis_cluster_kwargs dropped four LiteLLM-custom Azure keys (azure_redis_ad_token, azure_client_id, azure_tenant_id, azure_client_secret) that the prior list form had explicitly appended. Because they are not in RedisCluster's argspec, they were silently stripped, breaking Azure IAM auth on cluster clients. Re-add them to the explicit include set. --------- Co-authored-by: Kristin Cowalcijk <kristincowalcijk@gmail.com> Co-authored-by: Sameer Kankute <sameer@berri.ai> Co-authored-by: krrish-berri-2 <krrish-berri-2@users.noreply.github.com> Co-authored-by: claude <claude@anthropic.com>
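The Azure regression is easier to see in a small sketch. `FakeRedisCluster` stands in for the real client class so the example has no dependencies, and the helper names are illustrative rather than the actual `_redis` functions.

```python
import inspect
from typing import Any, Dict, Set

# LiteLLM-specific keys that are not part of the client's own argspec.
_AZURE_KEYS = {
    "azure_redis_ad_token",
    "azure_client_id",
    "azure_tenant_id",
    "azure_client_secret",
}


class FakeRedisCluster:
    """Stand-in for a Redis cluster client constructor."""

    def __init__(self, host: str = "localhost", port: int = 6379, password: Any = None):
        self.host, self.port, self.password = host, port, password


def get_cluster_kwargs(client_cls: type) -> Set[str]:
    """Argspec-derived names plus the explicit Azure include set."""
    params = inspect.signature(client_cls.__init__).parameters
    skip = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
    names = {n for n, p in params.items() if n != "self" and p.kind not in skip}
    return names | _AZURE_KEYS


def filter_kwargs(kwargs: Dict[str, Any], allowed: Set[str]) -> Dict[str, Any]:
    # Set membership keeps each lookup O(1).
    return {k: v for k, v in kwargs.items() if k in allowed}
```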
* fix(ollama): Include provider in model list for ollama (#26135) * Include provider in model names for ollama * Fix unit tests * fix(ollama): process both thinking and content in same streaming chunk (#26098) * fix(health_check): skip max_tokens for image_generation mode (#26417) * fix(health_check): skip max_tokens for image_generation mode `_update_litellm_params_for_health_check` injected `max_tokens` for every deployment. OpenAI `/v1/images/generations` strictly rejects unknown fields, so health checks for dall-e-* and gpt-image-1 always failed with `400 "Unknown parameter: 'max_tokens'"` even though the actual image endpoint calls succeed. Skip the `max_tokens` injection when `model_info.mode == "image_generation"`. `messages` still gets injected (downstream `_filter_model_params` already strips it for non-chat handlers). * Switch to allow-list with per-deployment override Per @krrishdholakia review: deny-listing image_generation only re-introduces the same bug for every other non-chat mode (embedding, audio_*, rerank, video_generation, ocr, search, moderation, ...). Replace the single image_generation skip with `_MAX_TOKEN_SUPPORT_MODES = {chat, completion, responses}`. Missing `mode` is treated as chat for backward compatibility. New modes are safe by default. Add `model_info.health_check_supports_max_tokens` as an operator escape hatch — True forces injection on a non-listed deployment (operator wants to bound probe tokens), False suppresses it on a chat-style deployment behind a strict-schema provider. Tests: parametrize over 3 chat-style + 10 non-chat modes, plus override on/off and the no-mode legacy path. * fix(http_handler): handle RequestNotRead in MaskedHTTPStatusError for multipart uploads (#26718) Squash-merged by litellm-agent from dawidkulpa's PR. * fix(ollama): guard against double 'ollama/' prefix in live model listing Greptile flagged that Ollama servers can return names that already start with 'ollama/'. 
Check the prefix before prepending so we don't produce 'ollama/ollama/...'. Adds a regression test. * Fix Ollama empty reasoning stream chunks Co-authored-by: Yassin Kortam <yassin@berri.ai> --------- Co-authored-by: James Myatt <james@jamesmyatt.co.uk> Co-authored-by: VHash <225398745+vhash0@users.noreply.github.com> Co-authored-by: hayden <sewhan.kim+@a-bly.com> Co-authored-by: dawidkulpa <84176950+dawidkulpa@users.noreply.github.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Claude <claude@anthropic.com> Co-authored-by: Yassin Kortam <yassin@berri.ai>
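The allow-list with per-deployment override described for #26417 reduces to a small predicate. The function name here is hypothetical; `_MAX_TOKEN_SUPPORT_MODES` and `health_check_supports_max_tokens` are taken from the commit message.

```python
from typing import Any, Dict

_MAX_TOKEN_SUPPORT_MODES = {"chat", "completion", "responses"}


def should_inject_max_tokens(model_info: Dict[str, Any]) -> bool:
    """Decide whether the health check should add max_tokens to the probe."""
    override = model_info.get("health_check_supports_max_tokens")
    if override is not None:
        # Operator escape hatch wins in both directions.
        return bool(override)
    # A missing mode is treated as chat for backward compatibility.
    mode = model_info.get("mode") or "chat"
    return mode in _MAX_TOKEN_SUPPORT_MODES
```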
* fix: strip Gemini thought-signature from tool_use.id in non-streaming path; example websearch config (#27873) - adapters/transformation.py: mirror the streaming path and strip the `__thought__<b64>` suffix off `tool_call.id` before building the AnthropicResponseContentBlockToolUse. Base64's `+ / =` characters violate Anthropic's `^[a-zA-Z0-9_-]+$` tool_use.id pattern, so when a conversation that flowed through Gemini is later replayed to an Anthropic-native provider (Bedrock or Anthropic API) the request 400s. - example_config_yaml/websearch_interception_config.yaml: register the interceptor under `callbacks:` not `success_callback:`. `success_callback` does not run pre-request hooks, so the tool-conversion step never fires on `/v1/messages` and the raw `web_search_20250305` tool is forwarded to Bedrock, which 400s. - adds a unit test pinning the non-streaming strip behavior and the surviving `^[a-zA-Z0-9_-]+$` shape of the resulting id. Co-authored-by: oss-agent-shin <279349115+oss-agent-shin@users.noreply.github.com> * Fix/azure image edit auth header (#27863) * fix(azure/image_edit): use api-key header instead of Authorization Bearer Delegate `AzureImageEditConfig.validate_environment` to `BaseAzureLLM._base_validate_azure_environment` so the image-edit route follows the same auth resolution as every other Azure provider: - prefer the Azure-native `api-key` header when an API key is available - fall back to `Authorization: Bearer <azure_ad_token>` only for AAD auth The previous implementation unconditionally set `Authorization: Bearer <api_key>`, which is the OpenAI-direct convention and is rejected by Azure OpenAI / APIM-fronted deployments with `401 Access denied due to missing subscription key`. Adds regression tests covering api_key kwarg, litellm_params.api_key, and the AAD-token fallback path. 
Co-authored-by: Cursor <cursoragent@cursor.com> * docs(azure/image_edit): pin api-key precedence semantics + add regression test Address review feedback that the move to ``BaseAzureLLM._base_validate_azure_environment`` changed the relative priority of the positional ``api_key`` kwarg vs. ``litellm_params["api_key"]``. The new behavior — ``litellm_params["api_key"]`` wins, positional only fills in when ``litellm_params["api_key"]`` is empty — is intentional and matches every other Azure ``validate_environment``: ``AzureVideosConfig`` uses the exact same merge logic, while ``AzureVectorStoresConfig`` and ``AzureResponsesAPIConfig`` don't accept a positional ``api_key`` at all. The old ``or`` chain (positional wins) was the outlier and was part of the same OpenAI-vs-Azure convention drift that produced the original ``Authorization: Bearer`` bug. The only production caller (``llm_http_handler.image_edit``) sources both values from the same ``litellm_params.api_key``, so this change is behaviorally a no-op there. Document the precedence in the docstring and lock it in with an explicit test so future refactors can't quietly re-invert it. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: yuneng-jiang <yuneng@berri.ai> Co-authored-by: ryan-crabbe-berri <ryan@berri.ai> Co-authored-by: Adam Kirstein <adam.kirstein@disney.com> Co-authored-by: Cursor <cursoragent@cursor.com> * test(azure/image_edit): expect api-key header instead of Authorization Bearer PR #27863 fixed Azure image edit to use the Azure-native api-key header instead of OpenAI's Authorization: Bearer convention, but did not update test_azure_image_edit_litellm_sdk to match. The test still asserted 'Authorization' in headers, which now fails since the new code routes through BaseAzureLLM._base_validate_azure_environment and emits api-key when an api_key is provided. 
Update the assertion to pin the correct Azure behavior: api-key header present with the resolved key, and no Authorization header. --------- Co-authored-by: oss-agent-shin <ext-agent-shin@berri.ai> Co-authored-by: oss-agent-shin <279349115+oss-agent-shin@users.noreply.github.com> Co-authored-by: Adam Kirstein <107421694+justalittleadam@users.noreply.github.com> Co-authored-by: yuneng-jiang <yuneng@berri.ai> Co-authored-by: ryan-crabbe-berri <ryan@berri.ai> Co-authored-by: Adam Kirstein <adam.kirstein@disney.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Ishaan Jaffer <ishaanjaffer0324@gmail.com>
…Fireworks API call (#27881) * fix(fireworks_ai): strip thinking_blocks from chat messages before API call Fireworks OpenAI-compatible ChatMessage schema uses additionalProperties:false and rejects Anthropic-style messages[].thinking_blocks (e.g. Claude Code replays), returning invalid_request_error. Remove the field in _transform_messages_helper alongside provider_specific_fields. Adds unit test test_transform_messages_helper_strips_thinking_blocks. Co-authored-by: Cursor <cursoragent@cursor.com> * chore(fireworks_ai): drop inline comments from message sanitization Co-authored-by: Cursor <cursoragent@cursor.com> * docs(fireworks_ai): explain why provider_specific_fields and thinking_blocks are stripped Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>
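A minimal sketch of the sanitization step, assuming messages are plain dicts. The function name is illustrative and this is not the body of `_transform_messages_helper`.

```python
from typing import Any, Dict, List

# Fields a strict additionalProperties:false message schema would reject.
_STRIPPED_FIELDS = ("provider_specific_fields", "thinking_blocks")


def sanitize_messages(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the messages without provider-only fields."""
    return [
        {key: value for key, value in message.items() if key not in _STRIPPED_FIELDS}
        for message in messages
    ]
```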
Authenticated clients could supply CustomPricingLiteLLMParams fields (input_cost_per_token, output_cost_per_token, etc.) in the request body. These were forwarded to register_model() in main.py, permanently mutating the shared global litellm.model_cost dict for all users on the instance. Adds all CustomPricingLiteLLMParams fields to _BANNED_REQUEST_BODY_PARAMS so is_request_body_safe() rejects them before they reach completion(). New pricing fields added to CustomPricingLiteLLMParams are auto-covered. Admin opt-in via allow_client_side_credentials or configurable_clientside_auth_params still works as before. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
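The "auto-covered" property comes from deriving the banned set from the model's fields. The sketch below uses a dataclass as a stand-in for `CustomPricingLiteLLMParams` (only two of its fields are shown) and a simplified `is_request_body_safe`; the real signature and opt-in handling may differ.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class CustomPricingParams:
    """Stand-in for CustomPricingLiteLLMParams."""

    input_cost_per_token: Optional[float] = None
    output_cost_per_token: Optional[float] = None


# Derived from the model, so newly added pricing fields are banned automatically.
_BANNED_PRICING_PARAMS = {f.name for f in fields(CustomPricingParams)}


def is_request_body_safe(body: Dict[str, Any], allow_client_side_credentials: bool = False) -> bool:
    """Reject client-supplied pricing overrides unless the admin opted in."""
    if allow_client_side_credentials:
        return True
    banned = sorted(key for key in body if key in _BANNED_PRICING_PARAMS)
    if banned:
        raise ValueError(f"Rejected request body params: {banned}")
    return True
```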
When ``ProxyConfig`` merges DB-persisted ``litellm_settings`` / ``general_settings`` on top of the YAML config, the merged dict is later iterated by ``load_config`` which threads ``config_file_path`` (the YAML path) into ``get_instance_fn``. The runtime gate that refuses ``s3://`` / ``gcs://`` modules when ``config_file_path`` is ``None`` therefore can't distinguish a YAML-sourced value from a DB-sourced one: both look the same to ``get_instance_fn``.

Strip ``s3://`` / ``gcs://`` entries from the DB-overlay value for every field whose contents reach ``get_instance_fn`` during config load:
- litellm_settings: ``callbacks``, ``success_callback``, ``failure_callback``, ``audit_log_callbacks``, ``post_call_rules``, ``custom_provider_map[].custom_handler``
- general_settings: ``custom_auth``, ``custom_key_generate``, ``custom_key_update``, ``custom_sso``, ``custom_ui_sso_sign_in_handler``, ``litellm_jwtauth.custom_validate``

The YAML config-file load path is unchanged: the documented operator flow (``callbacks: ["s3://bucket/module.instance"]`` in ``config.yaml``) still works. Only DB-overlay writes (e.g. via ``/config/update``) are stripped. Adds 16 regression tests covering the scrub matrix.
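The per-field scrub could be approximated as follows. The helper name is hypothetical, and returning ``None`` for a scrubbed scalar is an assumption about how the overlay treats a removed value.

```python
from typing import Any

_REMOTE_MODULE_PREFIXES = ("s3://", "gcs://")


def _is_remote_module(value: Any) -> bool:
    return isinstance(value, str) and value.startswith(_REMOTE_MODULE_PREFIXES)


def scrub_remote_modules(value: Any) -> Any:
    """Remove remote-module references from one DB-overlay field value."""
    if isinstance(value, list):
        # List-valued fields such as callbacks keep their local entries.
        return [item for item in value if not _is_remote_module(item)]
    if _is_remote_module(value):
        # Scalar fields such as custom_auth are dropped entirely.
        return None
    return value
```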
A pass-through endpoint's ``target`` field is passed through ``create_pass_through_route`` into ``get_instance_fn`` during config load. A PROXY_ADMIN persisting ``target: "s3://attacker/m.i"`` via the DB-overlay ``pass_through_endpoints`` write path was not covered by the previous scrub matrix, so the remote module load would still reach the loader because the YAML-load chain has ``config_file_path`` set. Walk each entry in ``general_settings.pass_through_endpoints`` and null out any ``target`` that starts with ``s3://`` or ``gcs://``. The entry itself is preserved so the path-registration helper can choose how to handle a missing target (the existing code skips the route when ``target is None``). Adds two regression tests.
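A sketch of the walk over ``pass_through_endpoints``, keeping each entry but nulling a remote ``target``. The function name is illustrative.

```python
from typing import Any, Dict, List

_REMOTE_MODULE_PREFIXES = ("s3://", "gcs://")


def scrub_pass_through_targets(endpoints: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Null out remote-module targets while preserving the entries themselves."""
    scrubbed: List[Dict[str, Any]] = []
    for entry in endpoints:
        entry = dict(entry)  # copy so the caller's config is not mutated
        target = entry.get("target")
        if isinstance(target, str) and target.startswith(_REMOTE_MODULE_PREFIXES):
            entry["target"] = None
        scrubbed.append(entry)
    return scrubbed
```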
…nd Vertex (#27705) * fix(prometheus): emit remaining_tokens/requests gauges for bedrock + vertex (LIT-2719) Bedrock and Vertex AI never return x-ratelimit-remaining-* response headers, so litellm_remaining_tokens_metric / litellm_remaining_requests_metric only fired for OpenAI / Azure / Anthropic deployments even when tpm/rpm was configured on the router. Add a provider-agnostic fallback in PrometheusLogger.async_log_success_event that asks Router.get_remaining_model_group_usage() for the same model_group and emits the gauges with configured_limit - current_usage when the upstream provider didn't populate the headers itself. Existing OpenAI / Azure / Anthropic flows are unchanged because the fallback short-circuits when both header values are already present. Tests: 8 new tests covering bedrock + vertex emission, header short-circuit, partial-header fill, llm_router=None, missing model_group, empty router result, and router exception swallowing. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(prometheus): narrow except to ImportError, log router lookup failures via verbose_logger.exception Address greptile review: - The optional 'from litellm.proxy.proxy_server import llm_router' should guard against ImportError specifically, not all exceptions, so that unexpected errors (e.g. AttributeError from partially-initialized state) stay visible. - get_remaining_model_group_usage failures are now logged via verbose_logger.exception (with traceback) instead of debug, matching the PR description's intent and avoiding silent loss of router-cache errors in production. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(prometheus): subtract in-flight delta in router-remaining fallback The router's TPM/RPM counter is incremented by Router.deployment_callback_on_success, which fires alongside this prometheus callback in the success-log fan-out. 
Prometheus wins the race, so get_remaining_model_group_usage returns the pre-decrement counter for the current request — while vendor headers (OpenAI/Anthropic/Azure) are already post-decrement. That broke parity between providers on the same gauge: dashboards plotting litellm_remaining_requests_metric showed Bedrock/Vertex perpetually one request behind Anthropic for the same throughput. Replay the in-flight increment before emit: subtract total_tokens from remaining_tokens and 1 from remaining_requests. * Revert "fix(prometheus): subtract in-flight delta in router-remaining fallback" This reverts commit 001ce95. * fix(router): post-decrement router-derived ratelimit headers Router.set_response_headers injects x-ratelimit-remaining-{tokens, requests} for providers that don't return them natively (Bedrock, Vertex). The values come from get_remaining_model_group_usage, which reads the router's TPM/RPM counter — incremented post-response by deployment_callback_on_success. So the headers reflected the counter state before the current request was counted: pre-decrement. Vendor headers from OpenAI/Anthropic/Azure are post-decrement (the vendor counted the request before responding). Same metric name, two semantics — dashboards plotting litellm_remaining_requests_metric showed Bedrock/Vertex perpetually one request behind for the same throughput, and the HTTP response headers exposed the same skew to clients. Subtract the in-flight delta before writing: 1 from remaining-requests, response.usage.total_tokens from remaining-tokens. Fixes both the response headers and (transitively) the prometheus gauges that read from standard_logging_payload.additional_headers. --------- Co-authored-by: cursor <cursor@example.com> Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
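The post-decrement arithmetic from the last commit above can be illustrated in isolation. This is a sketch under assumptions: `post_decrement_remaining` is a hypothetical helper, not a Router method, and clamping at zero is an added guard that the commit message does not mention.

```python
from typing import Dict, Optional


def post_decrement_remaining(
    remaining_requests: Optional[int],
    remaining_tokens: Optional[int],
    total_tokens: int,
) -> Dict[str, int]:
    """Replay the in-flight request against pre-decrement router counters."""
    headers: Dict[str, int] = {}
    if remaining_requests is not None:
        # The current request has not been counted yet, so subtract 1.
        headers["x-ratelimit-remaining-requests"] = max(remaining_requests - 1, 0)
    if remaining_tokens is not None:
        # Subtract the tokens this response consumed (response.usage.total_tokens).
        headers["x-ratelimit-remaining-tokens"] = max(remaining_tokens - total_tokens, 0)
    return headers
```

With a pre-decrement state of 10 requests and 1000 tokens and a response that used 150 tokens, the emitted values are 9 and 850, which is what a vendor that counts the request before responding would report.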
* Update gpt-4o-transcribe price
* Update test for gpt-4o-transcribe pricing fix
* Update gpt-4o-mini-transcribe price
aws_sts_endpoint, aws_web_identity_token, and aws_bedrock_runtime_endpoint in ingest_options.vector_store were passed directly to the Bedrock ingestion class, which uses them when constructing the boto3 STS client. Any authenticated caller could redirect AssumeRole calls to an attacker-controlled server, leaking the proxy's instance profile credentials. Call is_request_body_safe() on ingest_options["vector_store"] before forwarding to litellm.aingest(), using the same banned-params list and admin opt-in escape hatch (allow_client_side_credentials) as the /chat/completions path. A ValueError from the safety check is caught and re-raised as HTTP 400. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…verlay
A guardrail entry's ``callbacks`` list (v1: ``{name: {callbacks:[...]}}``,
v2: ``{guardrail_name, litellm_params: {callbacks: [...], guardrail:
"module.path"}}``) is iterated during config load and threaded through
``get_instance_fn``. A PROXY_ADMIN persisting
``litellm_settings.guardrails[*].callbacks: ["s3://..."]`` or
``litellm_settings.guardrails[*].litellm_params.guardrail: "s3://..."``
via ``/config/update`` was not covered by the previous scrub matrix.
Walk both v1 and v2 entry shapes and null out remote-URL callbacks /
module-path values before the merge. Adds four regression tests.
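A rough sketch of the two-shape walk, for illustration only. All names here (`scrub_guardrails`, `REMOTE_PREFIXES`) are hypothetical, the prefix list is assumed to match the pass-through scrub, and dropping remote entries from the callbacks list (rather than replacing them with None) is an assumption about what "null out" means for list items.

```python
from typing import Any, Dict, List

# Assumed to be the same prefixes as the pass-through scrub.
REMOTE_PREFIXES = ("s3://", "gcs://")


def _is_remote(value: Any) -> bool:
    return isinstance(value, str) and value.startswith(REMOTE_PREFIXES)


def _scrub_callbacks(container: Dict[str, Any]) -> None:
    """Drop remote-URL strings from a ``callbacks`` list, in place."""
    callbacks = container.get("callbacks")
    if isinstance(callbacks, list):
        container["callbacks"] = [cb for cb in callbacks if not _is_remote(cb)]


def scrub_guardrails(guardrails: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    for entry in guardrails:
        if not isinstance(entry, dict):
            continue
        params = entry.get("litellm_params")
        if isinstance(params, dict):
            # v2 shape: {guardrail_name, litellm_params: {callbacks, guardrail}}
            _scrub_callbacks(params)
            if _is_remote(params.get("guardrail")):
                params["guardrail"] = None
        else:
            # v1 shape: {name: {callbacks: [...]}}
            for settings in entry.values():
                if isinstance(settings, dict):
                    _scrub_callbacks(settings)
    return guardrails
```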
…27726) * feat(mcp): support MCP access group names in URL-based namespacing Extends dynamic_mcp_route to resolve /{name}/mcp requests where {name} is an MCP access group tag or a comma-separated list of servers/groups, matching what the documentation promised but the handler did not implement. Resolution order: registered server alias → toolset → comma-separated list → single access group tag (404 if none match). Adds unit tests covering all four resolution paths plus 404 cases. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(mcp): address Greptile review comments on dynamic_mcp_route - Move comma-separated check before toolset DB lookup so comma names short-circuit without hitting the database - Cache access-group DB lookups via user_api_key_cache to avoid a raw find_many on every request (matches toolset caching pattern) - Remove unused response_started variable from _forward_as_mcp_path - Update tests to assert comma list skips toolset call and to mock cache Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(mcp): extract helpers to fix PLR0915 too-many-statements in dynamic_mcp_route Extract _mcp_forward_as_path and _is_mcp_access_group_cached as module-level helpers so dynamic_mcp_route stays under the 50-statement limit. Update tests to patch the new module-level symbols directly. Co-authored-by: Cursor <cursoragent@cursor.com> * Avoid caching missing MCP access groups * fix(mcp): stream MCP responses via _stream_mcp_asgi_response instead of buffering _mcp_forward_as_path previously accumulated the full response body in memory before sending it. Replace the buffering custom_send pattern with _stream_mcp_asgi_response, which uses an asyncio.Queue bridge so chunks are yielded to the client as they arrive, preventing unbounded memory growth on large or long-lived MCP responses. 
Co-authored-by: Cursor <cursoragent@cursor.com> * fix(mcp): short-TTL negative cache for access-group existence lookup An unauthenticated caller could repeatedly request /<unknown>/mcp and force a fresh DB lookup for the access-group existence check on every request (only positive results were cached). Cache negative results for a short DEFAULT_MCP_ACCESS_GROUP_NEGATIVE_CACHE_TTL window (10s by default) so the DB is shielded from flooding while a transient DB error (which surfaces as an empty list) cannot hide a real group for long. https://claude.ai/code/session_01SjyPmwfmrq8fveFgw9iHW9 * fix(mcp): use plain int for access-group negative cache TTL Drop the os.getenv wrapper around DEFAULT_MCP_ACCESS_GROUP_NEGATIVE_CACHE_TTL to avoid the documentation_test_env_keys check failing on the new variable. The negative-cache window is a small internal tuning constant, not a user-facing knob, so a plain integer is clearer than an env override. https://claude.ai/code/session_01SjyPmwfmrq8fveFgw9iHW9 * fix(mcp): validate, dedupe, and cap CSV tokens in dynamic MCP route For /{name1,name2,...}/mcp, validate every token resolves to a known server alias or access group, dedupe case-insensitively, and cap at DEFAULT_MCP_NAMESPACE_CSV_MAX_TOKENS=16 before forwarding. - Bounds the per-request DB / cache fan-out an authenticated caller can trigger by stuffing the path with tokens (raised by veria-ai). - Returns 404 instead of forwarding when no token resolves, so the downstream server filter cannot silently fall back to the full allowed_mcp_servers list (raised by Cursor agentic security review). - Forwards only the resolved subset, so unknown tokens cannot ride along into the downstream filter. 
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(mcp): exact-match CSV token dedupe to preserve case-sensitive distinct tokens Bugbot flagged that case-insensitive dedup on `MyGroup,mygroup` could collapse to whichever case appeared first and silently drop the matching casing if the downstream resolver is case-sensitive. Switch to exact-match dedup so distinct casings survive; whitespace-only differences still collapse via the .strip() before comparison. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Claude <claude@anthropic.com> Co-authored-by: mateo-berri <mateo@berri.ai> Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
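The final CSV-token handling (strip, exact-match dedupe, cap, resolve, 404 when nothing resolves) might look roughly like this. It is a sketch under assumptions: `resolve_csv_tokens` and the `is_known` callback are hypothetical, the cap is modeled as truncation to the first 16 unique tokens, and `LookupError` stands in for the HTTP 404 the route returns.

```python
from typing import Callable, List

# Value taken from the commit message (DEFAULT_MCP_NAMESPACE_CSV_MAX_TOKENS).
MAX_TOKENS = 16


def resolve_csv_tokens(
    name: str,
    is_known: Callable[[str], bool],
    max_tokens: int = MAX_TOKENS,
) -> List[str]:
    """Return the resolved subset of a comma-separated namespace, in order."""
    seen = set()
    unique: List[str] = []
    for raw in name.split(","):
        token = raw.strip()  # whitespace-only differences collapse here
        if not token or token in seen:  # exact-match dedupe, case preserved
            continue
        seen.add(token)
        unique.append(token)
    capped = unique[:max_tokens]  # bounds the per-request lookup fan-out
    resolved = [t for t in capped if is_known(t)]
    if not resolved:
        # Stand-in for returning 404 instead of forwarding an empty filter.
        raise LookupError("no server alias or access group matched")
    return resolved
```

Forwarding only `resolved` is what keeps unknown tokens from reaching the downstream filter.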
``extra_body`` is the OpenAI-SDK passthrough container. Provider modules read provider-auth fields out of it directly (Azure's ``extra_body.azure_ad_token``, Bedrock's ``extra_body.aws_web_identity_token``, etc.) without re-validating, so the boundary check has to walk it the same way it walks ``litellm_embedding_config``. Adding it to ``_NESTED_CONFIG_KEYS`` extends single-level banned-key descent into the container — top-level admin opt-ins (``allow_client_side_credentials`` / ``configurable_clientside_auth_params``) still apply. ``azure_ad_token`` was not in ``_BANNED_REQUEST_BODY_PARAMS`` despite being the bearer-token field the Azure transformer resolves through ``get_secret`` (same shape as ``aws_web_identity_token`` on the Bedrock STS path). Added so it can't be supplied per-request without an admin opt-in.
…27896) * fix(ui): fetch version + debug flag from /health/readiness/details The proxy moved `litellm_version`, `is_detailed_debug`, and other diagnostic fields off the public `/health/readiness` payload behind an auth-gated `/health/readiness/details` endpoint. The navbar version tag and the detailed-debug-mode banner stopped working because they were still reading those fields from the unauthed response, which no longer contains them. Replace `useHealthReadiness` with a `useHealthReadinessDetails` hook that takes an `accessToken` argument and sends a Bearer header to the auth-gated endpoint. The hook stays disabled while `accessToken` is falsy, so the navbar can keep rendering on the public model hub (where the token is null) without triggering an auth redirect or a 401-loop. * fix(ui): disable retries on readiness/details + cover token forwarding Two small follow-ups on the readiness/details migration: - Set `retry: false` on the query. The payload feeds a passive navbar tag and a debug banner; a 401 from an expired token shouldn't fan out into three retries against the proxy. - Add navbar specs that assert the `accessToken` prop is forwarded into the hook (matches the DebugWarningBanner spec). Without this, the navbar could silently regress to passing `undefined` and the existing tests wouldn't catch it.
``_NESTED_CONFIG_KEYS`` descent used ``isinstance(nested, dict)``, so a caller sending ``extra_body`` as a JSON-encoded string instead of an object (the same shape multipart/form-data clients use for ``litellm_metadata``) skipped the banned-key check entirely. Switched to ``_coerce_metadata_to_dict`` so the JSON-string path is parsed before descent — mirrors the existing handling on ``_NESTED_METADATA_KEYS``.
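To illustrate why the ``isinstance(nested, dict)`` check was bypassable, here is a standalone sketch of single-level descent with JSON-string coercion. The names `coerce_to_dict`, `find_banned_key`, `BANNED`, and `NESTED` are hypothetical stand-ins and do not reproduce the real helpers' signatures; admin opt-ins are ignored for brevity.

```python
import json
from typing import Any, Dict, Optional

# Illustrative subsets only; the real lists are longer.
BANNED = {"azure_ad_token", "aws_web_identity_token", "aws_sts_endpoint"}
NESTED = ("extra_body", "litellm_embedding_config")


def coerce_to_dict(value: Any) -> Optional[Dict[str, Any]]:
    """Return a dict for dicts and JSON-object strings, else None."""
    if isinstance(value, dict):
        return value
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            return None
        return parsed if isinstance(parsed, dict) else None
    return None


def find_banned_key(body: Dict[str, Any]) -> Optional[str]:
    """Return the first banned key at top level or one level down, if any."""
    for key in body:
        if key in BANNED:
            return key
    for container in NESTED:
        nested = coerce_to_dict(body.get(container))
        if nested is None:
            continue
        for key in nested:
            if key in BANNED:
                return f"{container}.{key}"
    return None
```

Without the coercion step, a request carrying ``extra_body`` as the string ``'{"azure_ad_token": "x"}'`` would skip the nested loop entirely.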
``test_azure_ad_token_is_in_banned_list`` only asserted tuple membership of a name the parametrized test already exercises end-to-end through ``is_request_body_safe``. Removed. Tightened the admin-opt-in test comment.
…28848) * fix(realtime): send TEXT frames and valid guardrail session.update Decode backend recv bytes before send_text so clients receive OP_TEXT JSON events. Include turn_detection.type server_vad in injected session.update. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(realtime): skip non-UTF-8 backend binary frames Avoid terminating the forwarding loop on UnicodeDecodeError when the backend sends unexpected binary payloads. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>
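The decode-then-skip behavior from the two realtime commits above can be sketched with a small helper. `frame_to_text` is a hypothetical name; the actual forwarding loop and websocket calls are not shown.

```python
from typing import Optional, Union


def frame_to_text(message: Union[str, bytes]) -> Optional[str]:
    """Return a text payload suitable for send_text, or None to skip the frame."""
    if isinstance(message, str):
        return message
    try:
        # Backend recv may yield bytes; clients expect JSON events as TEXT frames.
        return message.decode("utf-8")
    except UnicodeDecodeError:
        # Unexpected binary payload: skip it instead of ending the loop.
        return None
```

A forwarding loop would call this on each backend message and simply continue when it returns None.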
A key's unified access_group_ids now extend the team's MCP scope instead of being capped by it — mirrors the model-side union from LIT-2404. The group's assigned_team_ids / assigned_key_ids still gate the override, so team members can't pull in MCPs via a foreign team's group. Resolves LIT-3189
#28771) * fix(galileo): support hosted v2 spans API and string output extraction Use GALILEO_API_KEY with /v2/projects/{id}/spans for Galileo Cloud, keep legacy observe/ingest for username/password deployments, and extract assistant content as a string instead of a message dict. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(galileo): address review — async enterprise auth and message input Use async httpx for enterprise login to avoid blocking the event loop, preserve multi-turn messages in v2 span input, and clean up tests. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(galileo): handle negative TZ offsets, 2xx success, and Pydantic ImageObject serialization Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(galileo): treat any 2xx ingest response as success Use response.is_success so 201 Created clears in_memory_records and avoids duplicate span submissions on subsequent flushes. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(galileo): cast message dict for mypy in convert_content_list_to_str Co-authored-by: Cursor <cursoragent@cursor.com> * merge main (#28835) * fix(proxy): Bedrock Knowledge Base pass-through: preserve SigV4 headers and signed request body (#27526) * Fix Bedrock KB pass-through SigV4 headers and signed body Coerce botocore HeadersDict to a dict for pass-through routes. When forward_headers is true, drop request headers that collide case-insensitively with signed headers so client Bearer auth does not shadow AWS SigV4. Send prepped.body as raw content so the outbound payload matches the signature after logging hooks mutate the parsed dict. Co-authored-by: Cursor <cursoragent@cursor.com> * Simplify pass-through raw body handling Read the SigV4-signed bytes directly from request.state inside pass_through_request instead of threading a custom_raw_body argument through three functions. Helper methods are restored to their original signatures, and the new branch lives in one place at each httpx call site. 
Co-authored-by: Cursor <cursoragent@cursor.com> * Harden pass-through raw body read from request.state Guard missing request.state (test fixtures) and ignore non-bytes/str values so MagicMock does not trigger the SigV4 raw-body path. Co-authored-by: Cursor <cursoragent@cursor.com> * Test pass_through_request state_raw_body uses httpx content= Cover non-streaming (async_client.request) and streaming (build_request) paths so SigV4 bytes on request.state are not replaced by json= of a hook-mutated dict. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * chore(tests): migrate Bedrock CI to AWS account 941277531214 (#28728) * chore(tests): migrate Bedrock CI from AWS account 888602223428 to 941277531214 The original account (888602223428) was put under a security restriction by AWS after a root access key leaked in a PR comment. While that account works its way through the AWS Support unlock process, Bedrock-touching CI tests have been migrated to a fresh account (941277531214). Changes: - Replace 26 hardcoded references to 888602223428 with 941277531214 across 8 files (provisioned-model ARNs, imported-model ARNs, AgentCore runtime ARNs, batch execution role ARN, and example proxy config). - The provisioned-model and imported-model ARNs are referenced only from mocked unit tests — no AWS resources to recreate. - The batch execution IAM role has been recreated in the new account with the same name and equivalent permissions. - The two AgentCore runtimes (hosted_agent_r9jvp-3ySZuRHjLC, hosted_agent_13sf6-cALnp38iZD) are being recreated in the new account under the same names — see tools/agentcore-deploy/ in a follow-up. CircleCI env vars AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_REGION_NAME were updated separately via the CircleCI API to point at the new account. 
Smoke-tested locally against the new account: aws bedrock-runtime converse --region us-west-2 \ --model-id us.anthropic.claude-sonnet-4-5-20250929-v1:0 \ --messages '[{"role":"user","content":[{"text":"ping"}]}]' → 200, model returned 'pong' Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * chore(tests): refresh AgentCore ARN suffixes to match newly-deployed runtimes The first migration commit replaced just the account ID, but AgentCore auto-assigns a random 10-char suffix to every runtime on creation — we can't reuse the original suffixes (`3ySZuRHjLC`, `cALnp38iZD`) in the new account. Updated the AgentCore-runtime ARNs in the three files that reference real runtime IDs (not the mock-based unit-test ARNs). Deployed runtimes: arn:aws:bedrock-agentcore:us-west-2:941277531214:runtime/hosted_agent_r9jvp-Rq79QFC2fp arn:aws:bedrock-agentcore:us-west-2:941277531214:runtime/hosted_agent_13sf6-4046UzHSwy Both runtimes are status=READY and pass a smoke invoke: $ aws bedrock-agentcore invoke-agent-runtime --agent-runtime-arn ... --payload '{"prompt":"ping"}' → 200, {"result": "echo: ping"} The agent is a minimal echo (see /tmp/agentcore_deploy/agent.py for the deploy artifacts). Tests that only verify the SDK wiring will pass; if any test asserts on agent output content, swap the echo for the real agent. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * chore(tests): point Bedrock batch tests at new-account S3 bucket The account migration (888602223428 -> 941277531214) was a flat account-ID swap, which only rewrites ARNs that embed the account number. S3 bucket names carry no account ID, so the live Bedrock batch tests still uploaded to `litellm-proxy` — a bucket that lives in the old account. S3 names are globally unique, and the old account still holds that name, so it can't be recreated in the new account. Rename to `litellm-proxy-941277531214` (account-ID suffix guarantees global uniqueness). 
The bucket must be created in 941277531214 and the batch execution role granted s3:GetObject/PutObject/ListBucket on it before this job is run in CI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(tests): point live S3 logging test at new-account bucket Same account-ID-free blind spot as the batch bucket: `load-testing-oct` lives in the old account and its name can't be reused globally. The `logging_testing` CI job is wired into the workflow and runs test_basic_s3_logging, which uploads to this bucket with the CI env creds, then lists and deletes objects — a live dependency. Rename to `load-testing-oct-941277531214`. The bucket must exist in the new account with the CI IAM principal granted s3:PutObject/GetObject/ListBucket/DeleteObject before this job runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(tests): repoint Bedrock guardrail IDs to new-account guardrails The migration left guardrail IDs untouched (no account ID in them), so all live guardrail tests failed with "guardrail identifier or version does not exist" against 941277531214. Recreated both guardrails in the new account and updated the hardcoded IDs: - wf0hkdb5x07f -> zgkmukebruil (PII mask: PHONE + CREDIT_DEBIT_CARD, with explicit inputAction=ANONYMIZE so masking applies to INPUT, which is the source litellm's moderation hook sends) - ff6ujrregl1q -> 4w3d1di3snt5 (blocks "coffee"; blocked message set to the exact string the tests assert on) Updated test_bedrock_guardrails.py, otel_test_config.yaml, and the guardrailConfig in test_bedrock_completion.py. Verified locally: the 5 previously-failing guardrail tests now pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(bedrock): migrate legacy models to current inference profiles The new CI account (941277531214) cannot invoke legacy Bedrock models (AWS gates them: "marked by provider as Legacy... not actively using in the last 30 days"). 
Migrated the live-call tests: - anthropic.claude-3-sonnet-20240229 -> us.anthropic.claude-sonnet-4-5-20250929-v1:0 - anthropic.claude-3-haiku-20240307 -> us.anthropic.claude-haiku-4-5-20251001-v1:0 Current Claude models on Bedrock require the us. inference-profile prefix (bare on-demand ids are rejected). cohere.command-r-plus has no working replacement (all Cohere is legacy- gated in the new account): swapped to claude-haiku-4-5 in provider- agnostic param lists. amazon.titan-image-generator skipped (no working replacement). Mocked/transformation/cost tests that reference the legacy strings are intentionally left unchanged. Verified live against the new account. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(bedrock): repoint SageMaker + Knowledge Base to new-account resources These referenced account-scoped resources by hardcoded id that only existed in the old account, so the migration's account-ID swap missed them. Recreated in 941277531214 and repointed: - SageMaker endpoint jumpstart-dft-hf-textgeneration1-mp-20240815-185614 -> litellm-ci-textgen (gpt2 on a TGI container, ml.g5.xlarge) - Bedrock Knowledge Base T37J8R4WTM -> LCYXFBR2TU (OpenSearch Serverless vector store + titan-embed-text-v2, seeded with a LiteLLM doc) Verified live: test_sagemaker.py (12 passed) and test_bedrock_knowledgebase_hook.py (12 passed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(reasoning_effort_grid): skip bedrock claude-opus-4-7 cells (not entitled on 941277531214) claude-opus-4-7 is listed in the new Bedrock CI account's foundation models but invoke is denied (AccessDeniedException: "not available for this account"). Bedrock access to the flagship Opus requires an AWS Sales request, not the self-serve model-access toggle, so it can't be enabled inline with the rest of the account migration. 
Add an optional `skip_reason` to ModelEntry and set it on the bedrock-claude-opus-4-7 entry; the grid test honors it via pytest.skip. Cell count (231) and route coverage are unchanged, so the structural asserts still pass. Restore coverage by deleting the one skip_reason line once access is granted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(bedrock): swap/skip legacy-gated models unavailable on new CI account The migrated AWS account (941277531214) cannot access several models that the old account could, so the remaining red CI jobs were hitting real Bedrock "Access denied / Legacy" and "account not authorized" errors: - image_gen: skip both Nova Canvas test classes (amazon.nova-canvas-v1:0 is legacy-gated), matching the existing titan skip. - batches: skip test_async_file_and_batch (Bedrock batch inference is not authorized on the new account; requires an AWS support case). - litellm_overhead: swap legacy claude-3-5-haiku for the active us.anthropic.claude-haiku-4-5 inference profile. - test_completion_claude_3_function_call: swap legacy claude-3-sonnet for the active us.anthropic.claude-sonnet-4-5 inference profile. https://claude.ai/code/session_01Y7zgHYu9GX29YRwV4yiWAa * test(bedrock): fix remaining e2e legacy-model + batch failures on new CI account - e2e_openai_endpoints: skip test_bedrock_batches_api (Bedrock batch inference is not authorized on account 941277531214) and migrate the missed s3_bucket_name in oai_misc_config.yaml to litellm-proxy-941277531214. - build_and_test: swap legacy bedrock claude-3-sonnet for the active us.anthropic.claude-sonnet-4-5 inference profile in the proxy structured output e2e test. 
https://claude.ai/code/session_01Y7zgHYu9GX29YRwV4yiWAa * test(bedrock): make opus-4-7 + batch cells fail loudly and mock image-gen (#28791) Replace the silent skips added for the new CI account with noisier behavior: - reasoning-effort grid: opus-4-7 cells now fail (when AWS creds are present) instead of skipping, so the missing entitlement stays visible in CI; they still skip when AWS creds are absent (local dev) - Bedrock batch inference tests: drop the skip so they run and fail until batch access is granted - Titan + Nova Canvas image-gen tests: mock the Bedrock HTTP call so the transform + cost-tracking path stays under test without live model access https://claude.ai/code/session_01MT7SWDnXUjv6e6EPG7BDjT Co-authored-by: Claude <noreply@anthropic.com> * test(bedrock): use pytest.xfail for known-failing opus-4-7 cells Replace pytest.fail with pytest.xfail when a model has a fail_reason, so known-broken cells stay visible as XFAIL without keeping CI red. Co-authored-by: Yassin Kortam <yassin@berri.ai> --------- Co-authored-by: Mateo <mateo@Mateos-MacBook-Pro.local> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(otel): export SERVER span on management-endpoint success without http_request (#28794) Co-authored-by: Yassin Kortam <yassinkortam@Yassins-MacBook-Pro.local> * chore(ci): merge dev branch (#28801) * chore(proxy): route path-dependent call sites through get_request_route Replace direct ``request.url.path`` reads in auth, ACL, routing, and audit-log decisions with ``get_request_route(request)`` — the helper already added in ``auth/auth_utils.py`` that returns the ASGI ``scope["path"]`` with ``root_path`` stripped. 
Starlette reconstructs ``url.path`` from the Host header; ``scope["path"]`` is uvicorn's parse of the request line and matches what FastAPI dispatches on, so it's the authoritative route for any decision that should agree with the actual handler. Sites: - _experimental/mcp_server/auth/user_api_key_auth_mcp.py - management_endpoints/mcp_management_endpoints.py - vector_store_endpoints/utils.py - pass_through_endpoints/pass_through_endpoints.py - auth/route_checks.py - litellm_pre_call_utils.py - spend_tracking/spend_management_endpoints.py - common_utils/http_parsing_utils.py - management_helpers/utils.py - health_endpoints/_health_endpoints.py Adds regression tests in tests/proxy_unit_tests/test_proxy_routes.py that construct a Request with scope["path"] set to a benign route and the Host header crafted so url.path would resolve differently; each site's decision is asserted against scope["path"]. * chore(proxy): make get_request_route imports lazy at call sites Move the ``from litellm.proxy.auth.auth_utils import get_request_route`` imports added in the prior commit back to the function bodies that use them. The module-level form participates in a long-standing import cycle through ``auth_utils -> _types -> ...`` and was flagged by CodeQL on the PR; the lazy form matches the pattern the proxy already uses for ``user_api_key_auth`` and related helpers elsewhere in these files. Also drop the ``RouteChecks._is_assistants_api_request`` delegation in ``_get_metadata_variable_name`` introduced in the prior commit — the delegation pulled ``RouteChecks`` into the same cycle, and the call site reuses the resolved route for its other branches, so inlining the substring check is both cycle-free and avoids a redundant second ``get_request_route`` call. Comment in test_proxy_routes.py acknowledges that the two MCP table entries exercise ``get_request_route`` directly rather than the full production handler (which needs ASGI scope + MCP state to invoke). 
--------- Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: user <70670632+stuxf@users.noreply.github.com> * chore(ci): merge dev branch (#28657) * feat(dashboard): navbar hierarchy + Agent Platform notifications (#27543) * feat(dashboard): refine navbar zones and Agent Platform notice Restructure the admin navbar for production users: clear product vs community vs personal columns with vertical dividers, icon-only Slack/GitHub in a shared chip, and Docs/Blog typography aligned on an 8px rhythm. Add a notifications bell with popover linking to the LiteLLM Agent Platform repo and optional mark-as-read persistence. Promote the account control with initials avatar, single-line display name, and navDisplayName mapping for placeholder user ids (e.g. default_user_id). Co-authored-by: Cursor <cursoragent@cursor.com> * fix(dashboard): address PR review — AntD buttons, public page guard, dedupe regex - Replace raw <button> with AntD Button in BlogDropdown, NotificationsBell, UserDropdown, and test mock - Guard NotificationsBell + container behind !isPublicPage to avoid rendering on public pages - Remove redundant equality checks in navDisplayName (regex already covers them) - Remove unused `lower` variable after simplification Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: yuneng-jiang <yuneng@berri.ai> * fix(dashboard): drop dead useHealthReadiness import in navbar The module was removed in #27896 (replaced by useHealthReadinessDetails), but the import survived the rebase. The symbol is unused — only useHealthReadinessDetails is consumed in the file. Removing the dead import unblocks the UI TypeScript build. 
* fix(dashboard): align CommunityEngagementButtons test with icon-only aria-labels The component was refactored to an icon-only chip with aria-label='LiteLLM on GitHub' (squash #27543), but the test still asserted /star us on github/i. Update the query to match the rendered accessible name. * refactor(dashboard): drop unused props from NavbarProps The navbar refactor moved user identity + dark-mode state to internal hooks (useAuthorized, useWorker), but the NavbarProps interface still declared userID, userEmail, userRole, premiumUser, isDarkMode, and toggleDarkMode as required, forcing every caller to thread them through. Drop them from the interface and all four call sites (page.tsx, (dashboard)/layout.tsx, public_model_hub.tsx, navbar.test.tsx). Also shrinks the destructure in layout.tsx so the now-unused locals stop being pulled out of useAuthorized(). * refactor(dashboard): use useSyncExternalStore for NotificationsBell dismiss flag Reads/writes of the litellmHideAgentPlatformBanner key were done directly inside NotificationsBell via a useEffect + useState pair. Every other localStorage-backed flag in the dashboard (Disable ShowPrompts, DisableBouncingIcon, DisableShowNewBadge, DisableUsageIndicator, DisableBlogPosts) is wrapped in a useSyncExternalStore hook over localStorageUtils so all mounted components stay in sync. Extract useHideAgentPlatformBanner to follow the same shape, swap NotificationsBell to consume it, and add a regression test that two sibling bells stay in sync without a remount when one is dismissed. * refactor: mask credential fields in proxy settings GET responses (#28682) * refactor: mask credential fields in proxy settings GET responses Brings SSO settings, cache settings, and the email/Slack alerting view in /get/config/callbacks in line with the HashiCorp Vault config-override pattern, so persisted credentials are not transported back to the UI in plaintext. 
* refactor: harden short-value masking and hoist alerting var constant Closes two review observations: - mask_sensitive_keys now replaces short values (below the visible prefix+suffix length) with an all-mask string instead of returning them unchanged, so a 1-7 character credential is no longer round-tripped verbatim. - _ALERTING_SENSITIVE_VARS is moved out of get_config() to a module-level constant, matching the analogous _SSO_SENSITIVE_FIELDS and _CACHE_SENSITIVE_FIELDS in the SSO and cache endpoint files. --------- Co-authored-by: Krrish Dholakia <krrish+github@berri.ai> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * fix(ui): show 2-decimal precision for max_budget on key overview (#28809) The Key Info Overview tab's Spend card truncated sub-dollar budgets to "$0" because formatNumberWithCommas defaults to 0 decimals. The Settings tab passes 2; align the overview so a $0.10 budget renders as "$0.10". Resolves LIT-2845 * feat(proxy): allow `llm_api_routes` virtual keys to list MCP servers (#28442) * feat(proxy): allow llm_api_routes virtual keys to list MCP servers Add a new `mcp_discovery_routes` group (GET /v1/mcp/server and GET /v1/mcp/server/{server_id}) and include it in `llm_api_routes` so that virtual keys configured with `allowed_routes=["llm_api_routes"]` can discover the MCP servers they have access to. Previously these calls failed with 'Virtual key is not allowed to call this route. Only allowed to call routes: [llm_api_routes]'. The GET handlers already sanitize the response for restricted virtual keys via `_sanitize_mcp_server_list_for_virtual_key`, stripping credential-bearing fields (url, headers, env). Write methods (POST/PUT/DELETE) on the same paths remain gated by the existing handler-level admin role checks. 
The new discovery list is intentionally kept OUT of `mcp_inference_routes`, so `is_llm_api_route()` still returns False for these paths — this preserves the existing contract that DISABLE_LLM_API_ENDPOINTS must not block the Admin UI from listing MCP servers. Co-authored-by: ryan-crabbe-berri <ryan-crabbe-berri@users.noreply.github.com> * refactor(proxy): make MCP discovery carve-out method-aware Replace the `mcp_discovery_routes` group in `llm_api_routes` with a method-aware special case inside `is_virtual_key_allowed_to_call_route`. Virtual keys with allowed_routes=["llm_api_routes"] are now permitted to call only GET /v1/mcp/server and GET /v1/mcp/server/{server_id} — non-GET methods and multi-segment admin sub-paths fall through to the existing 403. This keeps the general llm_api_routes list free of management paths and avoids accidentally exposing POST/PUT/DELETE writes through the route-check layer. --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: ryan-crabbe-berri <ryan-crabbe-berri@users.noreply.github.com> * chore(ci): merge dev branch (#28807) * chore(proxy): route path-dependent call sites through get_request_route Replace direct ``request.url.path`` reads in auth, ACL, routing, and audit-log decisions with ``get_request_route(request)`` — the helper already added in ``auth/auth_utils.py`` that returns the ASGI ``scope["path"]`` with ``root_path`` stripped. Starlette reconstructs ``url.path`` from the Host header; ``scope["path"]`` is uvicorn's parse of the request line and matches what FastAPI dispatches on, so it's the authoritative route for any decision that should agree with the actual handler. 
Sites:
- _experimental/mcp_server/auth/user_api_key_auth_mcp.py
- management_endpoints/mcp_management_endpoints.py
- vector_store_endpoints/utils.py
- pass_through_endpoints/pass_through_endpoints.py
- auth/route_checks.py
- litellm_pre_call_utils.py
- spend_tracking/spend_management_endpoints.py
- common_utils/http_parsing_utils.py
- management_helpers/utils.py
- health_endpoints/_health_endpoints.py

Adds regression tests in tests/proxy_unit_tests/test_proxy_routes.py that construct a Request with scope["path"] set to a benign route and the Host header crafted so url.path would resolve differently; each site's decision is asserted against scope["path"].

* chore(proxy): make get_request_route imports lazy at call sites

Move the ``from litellm.proxy.auth.auth_utils import get_request_route`` imports added in the prior commit back to the function bodies that use them. The module-level form participates in a long-standing import cycle through ``auth_utils -> _types -> ...`` and was flagged by CodeQL on the PR; the lazy form matches the pattern the proxy already uses for ``user_api_key_auth`` and related helpers elsewhere in these files.

Also drop the ``RouteChecks._is_assistants_api_request`` delegation in ``_get_metadata_variable_name`` introduced in the prior commit: the delegation pulled ``RouteChecks`` into the same cycle, and the call site reuses the resolved route for its other branches, so inlining the substring check is both cycle-free and avoids a redundant second ``get_request_route`` call.

Comment in test_proxy_routes.py acknowledges that the two MCP table entries exercise ``get_request_route`` directly rather than the full production handler (which needs ASGI scope + MCP state to invoke).
--------- Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: user <70670632+stuxf@users.noreply.github.com> * fix(team): keep team_alias cache in sync on _cache_team_object writes (#28737) * fix(team): keep team_alias cache in sync on _cache_team_object writes _cache_team_object wrote only to the team_id:<id> cache key, but the JWT auth path that uses team_alias_jwt_field reads from a separate team_alias:<alias> key (get_team_object_by_alias caches under both keys on miss, but reads only the alias-keyed one). After any team-mutation endpoint (team_model_add, team_model_delete, update_team, the two access-group writes) the team_id cache was refreshed but the team_alias cache stayed stale until TTL — JWT callers using team_alias_jwt_field kept seeing the pre-mutation team for the full cache window. Mirror the write under the alias key inside _cache_team_object so every existing caller stays in sync without further changes. Skip the alias write when team_alias is None/empty so we don't collide across alias-less teams. Surfaced testing the LIT-3244 cherry-pick on patch/1.86.0: the LIT-3244 fix correctly invalidated the team_id cache but the customer's JWT used team_alias_jwt_field, so they kept hitting the stale alias-keyed entry. * fix(team): delete (not overwrite) team_alias cache on _cache_team_object The prior shape of this PR wrote both team_id:<id> AND team_alias:<alias> from _cache_team_object. team_alias is NOT unique in the schema (no @unique on LiteLLM_TeamTable.team_alias), and get_team_object_by_alias enforces uniqueness on its own DB-fetch path (len(teams) > 1 raises). Writing the alias-keyed cache from the generic refresh path bypassed that check: a team admin renaming their team to collide with another team's alias could silently overwrite the cached team for JWT-by-alias auth, swapping the resolved team under that alias for the cache window. 
Switch the alias-keyed operation from a write to a delete (mirroring the dual-cache delete pattern in _delete_cache_key_object). After every team write, the next JWT-by-alias reader cache-misses and falls through to get_team_object_by_alias, which (a) re-fetches the fresh team from DB, closing the LIT-3244 staleness gap that motivated this PR, and (b) enforces alias uniqueness before populating either cache key. team_id:<id> writes are unchanged — team_id is the table PK and is guaranteed unique. Surfaced in veria-ai review on #28739. * fix(managed-files): anchor model_id regex so it doesn't match llm_output_file_model_id extract_model_id_from_unified_id used `re.search(r"model_id,([^;]+)", ...)` which substring-matches the `model_id,` inside the file-ID encoding's `llm_output_file_model_id,<deployment_uuid>` field. parse_unified_id then fed that deployment UUID back into the auth path as a model candidate via _extract_models_from_managed_resource_id, and every team-BYOK file attach 403'd with: team not allowed to access model. This team can only access models=['openai/*']. Tried to access <deployment-uuid> The team's models list correctly contains the public name (`openai/*`) that target_model_names matches, but the bogus UUID candidate fails the wildcard check first. Anchor the regex to a field boundary (`(?:^|;)model_id,`) so it matches the legitimate top-level `model_id,<value>` field on vector_store unified IDs and skips substring matches inside other fields. File-IDs (which have no top-level `model_id` field) now return None and contribute no spurious UUID candidate. Surfaced reproducing LIT-3244 on patch/1.86.0 with the customer's exact flow: team with openai/* BYOK deployment, JWT-scoped user, POST /v1/vector_stores/{id}/files attaching a file uploaded with target_model_names=openai/gpt-4o. 
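The difference between the two patterns quoted above can be reproduced with a short sketch. The regexes are the ones named in the commit message; the sample ID strings and the `extract_model_id` helper are made up for illustration and only mimic the `key,value;key,value` layout, so they should not be read as the real unified-ID encoding.

```python
import re
from typing import Optional, Pattern

# Pattern before the fix: substring-matches "model_id," anywhere in the string.
UNANCHORED = re.compile(r"model_id,([^;]+)")
# Pattern after the fix: "model_id," must start the string or follow a ";".
ANCHORED = re.compile(r"(?:^|;)model_id,([^;]+)")


def extract_model_id(decoded_id: str, pattern: Pattern[str] = ANCHORED) -> Optional[str]:
    """Hypothetical helper: return the top-level model_id value, if any."""
    match = pattern.search(decoded_id)
    return match.group(1) if match else None


# Illustrative (made-up) decoded IDs.
FILE_ID = "unified_id,abc;target_model_names,openai/gpt-4o;llm_output_file_model_id,dep-uuid-1"
VECTOR_STORE_ID = "unified_id,def;model_id,my-deployment;resource_id,vs_1"

if __name__ == "__main__":
    print(extract_model_id(FILE_ID, UNANCHORED))   # spurious deployment UUID
    print(extract_model_id(FILE_ID, ANCHORED))     # no top-level model_id field
    print(extract_model_id(VECTOR_STORE_ID))       # legitimate model_id value
```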
* fix(proxy): hydrate wildcard discovery credentials (#28284) (#28822) * fix(proxy): hydrate wildcard discovery credentials * fix(proxy): constrain wildcard credential hydration Co-authored-by: Dibyo Mukherjee <dibyo@adobe.com> * ci: add daily oss-agent-shin branch creation workflow (#28829) Creates litellm_oss_agent_shin_MM_DD_YYYY from main every day at 00:00 UTC. Lets us retarget oss-agent-shin fork PRs onto a canonical branch so CircleCI runs with secrets, without granting the agent write access. Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> Co-authored-by: Ishaan Jaffer <ishaanjaffer0324@gmail.com> * test(proxy): add harness for proxy_server.py behavior-pinning (#28827) * test(proxy): add harness for proxy_server.py behavior-pinning Creates tests/test_litellm/proxy/proxy_server/ with: - conftest.py: 11 shared fixtures (app, client, mock_prisma, auth_as, mock_router with parametrized response builders, normalize, etc.) - _coverage_check.py: per-PR coverage gate (line + branch) against a baseline, self-selects target by inspecting which placeholder files have been filled - _pin_check.py: AST-based gate that verifies every pin-list item has >=1 happy + >=1 error test with a real assertion (no status-only) - test_harness_smoke.py: 19 smoke tests covering every fixture + both scripts end-to-end - 26 placeholder test files (one docstring each) reserved for follow-up PRs per the directory ownership in the Notion plan - .coverage_baseline pinned at 0% so future PRs measure deltas against new-tests-only and aren't entangled with the broader scattered test suite Adds a dedicated proxy-server job to test-unit-proxy-endpoints.yml so this directory's runtime + coverage are tracked independently. 
Plan: https://www.notion.so/36c43b8acdab81ee845fd5365128a2fc * ci(proxy-endpoints): allow workflow_dispatch Lets the workflow be triggered manually on a branch via `gh workflow run`, which is needed for the verify-first flow on workflow changes before opening a PR. * test(proxy): address review feedback on proxy_server harness - conftest.py: anchor sys.path insert to __file__ (Path(__file__).resolve().parents[4]) instead of CWD-relative os.path.abspath("../../../../") which resolved to the wrong directory when pytest is launched from the repo root. - _coverage_check.py: actually read .coverage_baseline and use it as the floor (line_min = max(target, baseline)). Closes the gap between the PR description's "delta semantics" and what the script was doing. With baseline=0.0 today this is a no-op; future PRs that update the baseline cause regressions (test deletions etc.) to trip the gate even if the static PR target is still met. - _pin_check.py: drop unreachable startswith("_") guard (test_*.py glob never yields underscore-prefixed names) and read each test file once instead of twice. * feat(openai): apply regional-processing cost uplift for EU/US data residency (#28626) * feat(openai): apply regional-processing cost uplift for EU/US data residency OpenAI charges a 10% uplift on the latest GPT models when requests are served from a regionalized hostname (eu./us.api.openai.com). Infer the region from `api_base`, expose it on `kwargs["litellm_params"]["data_residency"]`, and multiply the computed cost by a per-model `regional_processing_uplift_multiplier_<region>` field. 
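A minimal sketch of the uplift described above, under stated assumptions: the function names `infer_data_residency` and `apply_regional_uplift` are hypothetical, the hostname check is an exact match on `eu.api.openai.com` / `us.api.openai.com`, and the `1.1` multiplier in the usage example simply encodes the 10% figure from the commit message. The real helper and cost path in the PR may differ.

```python
from typing import Any, Dict, Optional
from urllib.parse import urlparse

# Assumed hostname-to-region table, based on the eu./us. prefixes named above.
_REGIONAL_HOSTS = {"eu.api.openai.com": "eu", "us.api.openai.com": "us"}


def infer_data_residency(api_base: Optional[str], custom_llm_provider: Optional[str]) -> Optional[str]:
    """Hypothetical: map a regional OpenAI api_base to "eu"/"us", else None."""
    if custom_llm_provider != "openai" or not api_base:
        return None
    # urlparse only yields a hostname when api_base carries a scheme.
    host = (urlparse(api_base).hostname or "").lower()
    return _REGIONAL_HOSTS.get(host)


def apply_regional_uplift(base_cost: float, model_info: Dict[str, Any], data_residency: Optional[str]) -> float:
    """Hypothetical: multiply cost by the per-model regional multiplier, if present."""
    if data_residency is None:
        return base_cost
    multiplier = model_info.get(f"regional_processing_uplift_multiplier_{data_residency}")
    return base_cost if multiplier is None else base_cost * multiplier


if __name__ == "__main__":
    info = {"regional_processing_uplift_multiplier_eu": 1.1}
    region = infer_data_residency("https://eu.api.openai.com/v1", "openai")
    print(region, apply_regional_uplift(2.0, info, region))
```

Models without a multiplier field fall back to the unmodified cost in this sketch, which keeps the uplift opt-in per model.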
https://claude.ai/code/session_012ebH44s7ohYxjoix5CXzTW

* test: allow regional_processing_uplift_multiplier_{eu,us} in model_prices schema

* fix(cost): tighten data_residency inference and restore model_cost in tests

- Only infer OpenAI data_residency when custom_llm_provider == "openai"; drop the implicit None fallback so non-OpenAI callers can't accidentally pick up a regional tag from a stray OpenAI hostname.
- _local_model_cost_map fixture now snapshots and restores litellm.model_cost and LITELLM_LOCAL_MODEL_COST_MAP so tests don't leak state across the session.

* refactor(openai): move data_residency helper under llms/openai

* fix: thread data_residency through realtime stream cost calculation

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(cost): thread data_residency through batch_cost_calculator

Apply the OpenAI regional-processing uplift multiplier to retrieve_batch cost paths so Batch API requests served via eu./us.api.openai.com are priced at the same uplifted token rates as completions/transcriptions.

* refactor(openai): encapsulate provider check inside infer_openai_data_residency

Move the custom_llm_provider == "openai" guard from get_litellm_params into the helper itself so the core utility no longer carries provider-specific dispatch logic. Callers pass through the provider unconditionally; the helper returns None for any non-OpenAI provider.

* fix(responses): thread data_residency through Responses logging params

The Responses API paths build their logging litellm_params dict after provider resolution but did not include data_residency, so cost calc saw None even when the effective api_base was a regional OpenAI host.
--------- Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Yassin Kortam <yassin@berri.ai> --------- Co-authored-by: milan-berri <milan@berri.ai> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Mateo Wang <277851410+mateo-berri@users.noreply.github.com> Co-authored-by: Mateo <mateo@Mateos-MacBook-Pro.local> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Yassin Kortam <yassin@berri.ai> Co-authored-by: Yassin Kortam <yassinkortam@Yassins-MacBook-Pro.local> Co-authored-by: yuneng-jiang <yuneng@berri.ai> Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: user <70670632+stuxf@users.noreply.github.com> Co-authored-by: Krrish Dholakia <krrish+github@berri.ai> Co-authored-by: ryan-crabbe-berri <ryan@berri.ai> Co-authored-by: ryan-crabbe-berri <ryan-crabbe-berri@users.noreply.github.com> Co-authored-by: Dibyo Mukherjee <dibyo@adobe.com> Co-authored-by: ishaan-berri <155045088+ishaan-berri@users.noreply.github.com> Co-authored-by: Ishaan Jaffer <ishaanjaffer0324@gmail.com> * fix: preserve OTEL response payload and remove duplicate constant - Remove duplicate _CREDENTIAL_LITELLM_PARAM_FIELDS assignment in model_checks - Restore response=dict(result) in _emit_management_endpoint_otel_span so OTEL spans for successful management endpoint calls include response data Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix: harden OTEL failure path and cap Galileo in-memory buffer - Wrap _emit_management_endpoint_otel_span in try/except on the failure path of management_endpoint_wrapper so OTEL errors cannot swallow the original management-endpoint exception. - Bound GalileoObserve.in_memory_records at GALILEO_MAX_IN_MEMORY_RECORDS to prevent unbounded memory growth when flushes persistently fail. 
Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(galileo): reset stale bearer token on auth error; preserve records under concurrency - Snapshot record count before await so concurrent appends during the network round-trip aren't silently dropped when clearing the buffer. - Build payload from a snapshot list so the legacy path no longer shares a live reference with self.in_memory_records. - On legacy enterprise auth (username/password), drop cached bearer-token headers when the upstream rejects the request (401/403) so the next flush re-authenticates instead of failing forever on a stale token. Co-authored-by: Yassin Kortam <yassin@berri.ai> * test(galileo): expand v2 coverage for config, ingest, headers, and flush paths --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Yassin Kortam <yassin@berri.ai> Co-authored-by: milan-berri <milan@berri.ai> Co-authored-by: Mateo Wang <277851410+mateo-berri@users.noreply.github.com> Co-authored-by: Mateo <mateo@Mateos-MacBook-Pro.local> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Yassin Kortam <yassinkortam@Yassins-MacBook-Pro.local> Co-authored-by: yuneng-jiang <yuneng@berri.ai> Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: user <70670632+stuxf@users.noreply.github.com> Co-authored-by: Krrish Dholakia <krrish+github@berri.ai> Co-authored-by: ryan-crabbe-berri <ryan@berri.ai> Co-authored-by: ryan-crabbe-berri <ryan-crabbe-berri@users.noreply.github.com> Co-authored-by: Dibyo Mukherjee <dibyo@adobe.com> Co-authored-by: ishaan-berri <155045088+ishaan-berri@users.noreply.github.com> Co-authored-by: Ishaan Jaffer <ishaanjaffer0324@gmail.com>
…6590) * Add tool calling support for gemini and vertex ai live api * Fix greptile reviews * Add new functionality behind flag * fix greptile issues * Fix greptile review * Fix greptile review * Fix greptile review * Fix greptile review * Fix greptile review * fix lint * fix(realtime): address P1 issues - guardrail timing and inputAudioTranscription default - Remove early guardrail turn-detection update that consumed first setup slot - Add inputAudioTranscription default in Gemini deferred-mode setup - Add tests for both fixes Made-with: Cursor * fix(realtime): inject turn_detection into first session.update for deferred mode - Instead of sending turn_detection as separate message (which gets dropped), inject it into the first client session.update - This ensures guardrails work correctly in deferred mode - Add test for turn_detection injection in deferred mode Made-with: Cursor * fix(realtime): emit response.created preamble before tool-call events - Emit response.created, output_item.added, and conversation.item.created for function calls - Ensures OpenAI Realtime API spec compliance - Add test for preamble emission Made-with: Cursor * fix(realtime): add response.output_item.done to complete tool-call sequence - Emit response.output_item.done between function_call_arguments.done and conversation.item.created - Required by OpenAI Realtime spec to finalize function-call items - Update test to verify complete event sequence Made-with: Cursor * fix(realtime): emit response.done after tool-call sequence (P0 CRITICAL) - Add response.done event after tool-call loop to signal response completion - Required by OpenAI SDK clients to submit tool results - Without this, clients stall indefinitely waiting for response completion - Update test to verify complete 6-event sequence including response.done Made-with: Cursor * fix(realtime): include function name in toolResponse (P1) - Store call_id → name mapping when receiving toolCall from Gemini - Look up and include name in 
functionResponses when sending tool results - Required by Gemini Live API spec for proper tool call routing - Add test to verify name field is included in round-trip Made-with: Cursor * fix: resolve merge conflict markers in UI build chunk Take litellm_internal_staging version of e1a670efcb966aaa.js after incomplete merge left conflict markers in the committed artifact. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(vertex_ai/realtime): call super().__init__() to initialize tool call state Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(realtime): correct guardrail flag and event-mapping fallback - realtime_streaming: only mark _guardrail_turn_detection_update_sent when the message was actually delivered to the backend. The provider transformation (e.g. Gemini after initial setup) may silently drop session.update; previously we set the flag anyway, falsely claiming the disable was sent and preventing any retry on subsequent session.created events. _send_to_backend now returns whether at least one transformed message was sent. - gemini realtime transformation: avoid shadowing the outer openai_event variable in map_openai_event's fallback loop. With the new toolCall entry now last in MAP_GEMINI_FIELD_TO_OPENAI_EVENT, an unmatched key would otherwise leak FUNCTION_CALL_ARGUMENTS_DONE and skip the ValueError raise. Use a distinct loop variable so the is-None check correctly raises for unknown Gemini messages. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(gemini/realtime): reset response IDs after tool-call response.done After closing a tool-call response, clear current_output_item_id and current_response_id so post-tool model turns emit a fresh response.created preamble. Add regression tests and align guardrail turn_detection test with GA session shape; apply Black formatting. 
Co-authored-by: Cursor <cursoragent@cursor.com> * fix lint * fix(realtime): log injected message and forward guardrail VAD-disable on Gemini - Move store_input() after the guardrail turn_detection injection in client_ack_messages so audit logs reflect what is actually forwarded to the backend (previously the unmodified pre-injection message was logged). - In Gemini's _handle_session_update, allow a session.update that only carries a turn_detection change to be forwarded as a follow-up Gemini setup with realtimeInputConfig.automaticActivityDetection set, even after the initial setup. This restores the guardrail layer's ability to disable VAD auto-response in non-deferred mode (the default Gemini flow), which was a regression after _handle_session_update started silently dropping subsequent session.update messages. Both flat beta-style and nested GA-style turn_detection payloads are accepted. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(gemini/realtime): resolve mypy TypedDict errors in transformation Align realtime event payloads and setup types with OpenAI/Gemini TypedDicts so mypy passes and tool-call events type-check correctly. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(realtime): forward turn_detection updates for Vertex; respect partial VAD config; cache setup after send Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(realtime): consolidate send-and-cache, guard session.update lookup, preserve client turn_detection in GA remap - Replace duplicated transform/send/cache logic in client_ack_messages with a call to _send_to_backend so future changes stay in one place. - VertexAIRealtimeConfig.transform_realtime_request now uses .get('session') or {} for the first session.update so a malformed client payload no longer crashes the connection. - Move the audio-transcription guardrail turn_detection injection to run BEFORE the beta->GA session remap. 
This lets the injected create_response ride along with any client-provided turn_detection fields (e.g. silence_duration_ms) into the nested audio.input.turn_detection path produced by the remap instead of being stranded as a separate root-level dict. - Update the deferred-mode injection test to assert the GA-shaped location. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(gemini realtime): pop tool_call_id mapping after use to bound memory Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(realtime): correct deferred-setup session.created modalities and reset IDs after response.done - Convert provider's real session.created to session.updated when a synthetic one was already forwarded so clients receive the authoritative modalities derived from their session.update instead of the synthetic placeholder. - Reset current_response_id / current_output_item_id after Gemini RESPONSE_DONE so a toolCall arriving in a later frame starts a fresh response instead of reusing the completed response's ID and emitting a duplicate response.done. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(gemini-realtime): preserve nested turn_detection through map_openai_params After the GA remap moves session.turn_detection into session.audio.input.turn_detection, Gemini's map_openai_params only looks at top-level keys and silently drops it. Normalize the extracted turn_detection back to the top level on first session.update so the guardrail create_response:False (and any client-provided VAD settings) reach the Gemini setup. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(realtime): normalize Vertex AI nested turn_detection and unify session.created guardrail ordering - Vertex AI _build_vertex_ai_setup_config now lifts nested audio.input.turn_detection to the top level before calling map_openai_params, mirroring the parent GeminiRealtimeConfig behavior. Without this, guardrail-injected create_response: False was silently dropped for GA-protocol Vertex AI clients. 
- realtime_streaming session.created handling now sends the (possibly re-typed) event first and then triggers the guardrail turn-detection update for both first and duplicate cases, removing the inconsistent guardrail-then-event ordering for duplicates. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(realtime): tolerate non-dict turn_detection in guardrail injection When a client sends a session.update whose turn_detection field is None or a non-dict value (e.g. "auto"), the guardrail injection used setdefault followed by item assignment on the returned value, raising TypeError. The inner except only caught JSONDecodeError/AttributeError, so the TypeError escaped to the outer Exception handler that wraps the entire client_ack loop, killing the connection. Replace non-dict turn_detection with a fresh dict carrying create_response=False so the guardrail still applies without crashing the loop. * fix(gemini realtime): default synthetic session.created modalities to AUDIO The synthetic session.created event emitted in deferred setup mode used TEXT as the default for responseModalities, while _handle_session_update defaults to AUDIO. Align the default so clients reading modalities from the initial session.created see the correct value for live sessions. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(vertex_ai/realtime): drop follow-up session.update to avoid 1007 close Vertex AI Live treats setup as a first-and-only client message; emitting a second setup with realtimeInputConfig only closes the websocket with a 1007 policy error. Reverting the follow-up-setup branch restores the pre-existing no-op behavior for subsequent session.update messages. * fix(gemini realtime): default responseModalities to AUDIO in delta events Align return_new_content_delta_events with the AUDIO defaults used in _handle_session_update and transform_session_created_event so deferred session config does not produce TEXT-typed delta events for audio data. 
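The non-dict `turn_detection` handling described above (replace `None` or a string such as `"auto"` with a fresh dict carrying `create_response=False`) can be sketched as follows. `force_no_auto_response` is a hypothetical name, and the sketch only covers the flat top-level `turn_detection` key, not the nested GA location.

```python
from typing import Any, Dict


def force_no_auto_response(session: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical: return a copy of `session` with create_response forced to False.

    A non-dict turn_detection (None, "auto", ...) is replaced by a fresh dict
    instead of being indexed, which is what raised TypeError before the fix.
    """
    turn_detection = session.get("turn_detection")
    if not isinstance(turn_detection, dict):
        turn_detection = {}
    # Keep any client-provided VAD settings and override only create_response.
    updated = {**turn_detection, "create_response": False}
    return {**session, "turn_detection": updated}


if __name__ == "__main__":
    print(force_no_auto_response({"turn_detection": "auto"}))
    print(force_no_auto_response({"turn_detection": {"silence_duration_ms": 300}}))
```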
Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(gemini realtime): default response.done modalities to AUDIO and correct audio-done test * fix(realtime): set guardrail turn_detection flag only after successful send Previously the _guardrail_turn_detection_update_sent flag was set inline during message rewriting in client_ack_messages, before the modified session.update was forwarded to the backend. If _send_to_backend raised (e.g. backend WebSocket disconnect), the exception was caught and the loop continued, but the flag remained True — permanently disabling the guardrail create_response=False injection for the rest of the session. Neither the client_ack_messages path nor the _maybe_send_guardrail_turn_detection_update backup path would retry. Track the injection locally and only set the flag after _send_to_backend returns a truthy sent result, matching the pattern used by _maybe_send_guardrail_turn_detection_update. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(vertex_ai realtime): keep VAD enabled when guardrails inject create_response: False map_automatic_turn_detection sets disabled=True whenever create_response is absent OR False. Transcription guardrails inject create_response: False to suppress auto-responses while expecting VAD to stay active, but the previous override in _build_vertex_ai_setup_config only fired when create_response was absent, leaving disabled=True and silently breaking speech detection and transcription events. Vertex Live has no 'VAD on, no auto-response' mode, so always keep VAD active in the setup config. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(gemini realtime): normalize GA-remapped session fields before mapping map_openai_params only recognises the flat OpenAI-beta keys (modalities, input_audio_transcription, turn_detection). 
For GA clients the upstream shim renames these into the nested GA schema (output_modalities, audio.input.transcription, audio.input.turn_detection), causing them to be silently dropped in _handle_session_update. Add a normalization helper that surfaces the GA-remapped values back at the top level so the existing mapping logic picks them up. Without this, a GA client explicitly requesting modalities=['text'] would still default to audio output. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(vertex_ai/realtime): normalize all GA-remapped session fields before mapping Previously _build_vertex_ai_setup_config only lifted nested turn_detection back to the top level. GA clients' output_modalities and audio.input.transcription were silently dropped because map_openai_params only recognises the flat OpenAI-beta keys. Use the parent's _normalize_session_payload_for_mapping so modalities, transcription, and turn_detection are all surfaced before mapping. * fix(realtime): force create_response=False in all client session.update turn_detection when audio guardrails active Prevents a client from re-enabling Gemini/GA VAD auto-response (and thereby bypassing the audio transcription guardrail) by sending a later session.update with turn_detection.create_response: true. * fix(lint): silence PLR0915 on client_ack_messages The function exceeded the 50-statement limit (64 > 50) after recent realtime guardrail additions. Matches the existing project pattern for inherently complex event/message-mapping methods (see _process_event, translate_messages_to_responses_input, transform_realtime_response, _arealtime, etc.). * fix(gemini realtime): preserve original setup config on follow-up session.update Gemini Live treats a second BidiGenerateContentSetup as a full session replacement, not a partial merge. 
The guardrail-driven turn_detection-only session.update was emitting a setup containing only model + realtimeInputConfig, which would silently drop tools, generationConfig, inputAudioTranscription, and systemInstruction from the original setup. Carry forward the cached original setup and only override realtimeInputConfig. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(realtime): avoid double-serialization and normalize non-dict turn_detection in guardrail override - Skip the force-override block when the injection block already ran for the same session.update to avoid redundant JSON re-serialization. - Normalize non-dict client-provided turn_detection values (flat and nested audio.input.turn_detection) to a dict before enforcing create_response=False, matching the injection block's behavior and preventing potential bypass on backends that accept non-dict values. Co-authored-by: Yassin Kortam <yassin@berri.ai> * test(gemini realtime): exercise toolCall → function_call_output name round-trip Update test_gemini_realtime_function_call_output_transformation to pre-load the call_id → name mapping by transforming a Gemini toolCall first, then assert that the resulting Gemini toolResponse functionResponses entry carries the function name. This pins the production round-trip rather than the degenerate 'name missing' branch. * fix(realtime): correct conversation_id, VAD disable, modality state, empty toolCall - Gemini tool-call response.done now includes conversation_id so clients can match it against the preceding response.created. - Vertex AI setup no longer overrides an explicit guardrail-injected create_response: False back to disabled: False; the guardrail's intent to disable VAD auto-response is now respected. - Modality handler is now passed the locally-updated response/item IDs rather than the original input snapshot, preventing stale IDs after a prior tool-call/response.done in the same JSON message resets them. 
- Skip emitting orphaned response.created/response.done events when Gemini sends an empty functionCalls array. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(realtime): preserve client session.update fields on follow-up Gemini setup In non-deferred mode the auto-setup pre-populates session_configuration_request, so a later client session.update carrying tools or instructions used to fall into the subsequent path and only forward turn_detection. Rebuild a merged follow-up setup that overlays the new client fields on top of the original setup so tools/instructions/etc. are no longer silently dropped. * fix(gemini realtime): include usage on tool-call response.done; coerce non-dict tool output to struct - Tool-call response.done now includes an empty usage object, matching the non-tool-call path so OpenAI-compatible clients always see usage. - _handle_function_call_output wraps non-dict JSON parses under a 'result' key so Gemini's functionResponses[].response (a Struct) always receives a mapping. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(gemini realtime): deep-merge nested config in follow-up session update Previously, the follow-up setup performed a shallow merge between the original setup and new overrides. If a session.update touched any field inside generationConfig (e.g. modalities), the entire generationConfig would be replaced, silently dropping unrelated sub-keys like temperature or maxOutputTokens. Apply the same deep-merge to realtimeInputConfig so partial automatic-activity-detection updates don't drop other realtime input config fields either. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(gemini realtime): default conversation_id before tool-call response.done mypy flagged that response.done's conversation_id (str on the TypedDict) could be None when current_response_id was already set on entry. Ensure the fallback runs unconditionally before the response is constructed. 
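The shallow-versus-deep merge problem described in the follow-up setup commits can be shown with a generic recursive merge. This is a sketch under assumptions: `deep_merge` is a hypothetical helper, and the PR itself may merge specific keys (`generationConfig`, `realtimeInputConfig`) explicitly rather than recursing over every nested dict.

```python
import copy
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical: overlay `overrides` on `base`, recursing into nested dicts.

    Neither input is mutated; keys absent from `overrides` keep their base value.
    """
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


if __name__ == "__main__":
    original_setup = {
        "model": "example-model",  # placeholder value
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "temperature": 0.7,
            "maxOutputTokens": 1024,
        },
    }
    follow_up = {"generationConfig": {"temperature": 0.2}}
    # A shallow {**original_setup, **follow_up} would drop responseModalities
    # and maxOutputTokens; the deep merge keeps them.
    print(deep_merge(original_setup, follow_up))
```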
* fix(realtime): deep-merge generationConfig and refresh cache on follow-up setup A subsequent Gemini session.update that touches any generationConfig sub-field (e.g. just temperature) was clobbering the original generationConfig — silently dropping responseModalities and switching the session to text-only. Deep-merge generationConfig so existing keys (responseModalities, maxOutputTokens, ...) are preserved when the client updates only a subset. Also drop the early-return in _cache_session_configuration_request so the cached payload tracks the latest setup sent to the backend. Without this, downstream readers (transform_session_created_event, modality lookup in return_new_content_delta_events) keep reading stale modalities/system instruction after a follow-up setup. * fix(gemini realtime): mirror modalities/temperature/max_output_tokens on tool-call response.created The audio/text response.created preamble includes modalities, temperature, and max_output_tokens on the response object so spec-compliant clients can initialise per-response state. The tool-call response.created was missing these fields, leaving clients without consistent response metadata when a response starts with a tool call instead of content. Read them from the cached session_configuration_request the same way the audio/text path does. * fix(gemini realtime): keep call_id→name mapping across function_call_output retries A client SDK that retries function_call_output (or sends the same result twice) would previously hit a missing-name lookup on the second send because _handle_function_call_output popped the call_id → name entry. Without name, Gemini may silently reject the response. Use dict.get so the mapping persists for the lifetime of the session. 
* fix(gemini realtime): empty toolCall must not terminate the WebSocket If Gemini sends a toolCall whose functionCalls list is empty (or absent), the previous `continue` left returned_message empty and the "Unknown message type" guard fired, killing the WebSocket session. Return a normal (empty) result instead so the session keeps going. * fix(vertex realtime): warn when dropping guardrail turn-detection update In non-deferred mode the auto-setup is sent on connect, so the audio-transcription guardrail's subsequent session.update carrying turn_detection.create_response=False cannot be forwarded as a second setup (Vertex Live closes the WebSocket with 1007). Surface a warning when this specific drop happens so operators know the model will auto-respond before the guardrail can gate it, instead of failing silently at debug level. * fix(gemini realtime): deep-merge automaticActivityDetection on follow-up session.update The follow-up setup merge already deep-merged generationConfig and realtimeInputConfig, but realtimeInputConfig.automaticActivityDetection itself is a nested dict. A partial VAD update (e.g. the guardrail-injected disabled=True from create_response=False) silently dropped unrelated knobs such as silenceDurationMs and prefixPaddingMs from the original setup. Deep-merge that block too so partial overrides only touch the fields they specify. * fix(realtime): record synthetic session.created in deferred-setup mode The deferred-setup path emits a synthetic session.created directly to the client websocket but did not run it through RealTimeStreaming's store_message, so the event was missing from the session log used by success_handler / async_success_handler. Call store_message before forwarding so the synthetic event lands in the same log stream as provider-driven events. 
* fix(gemini realtime): bound _tool_call_id_to_name with an LRU; exercise modality forwarding test Two minor follow-ups from review: * Switch _tool_call_id_to_name to a 256-entry LRU OrderedDict so a long session with many tool calls doesn't grow the dict without bound, while retried function_call_output lookups still hit for recently-seen call_ids. * Fix test_gemini_realtime_transformation_session_created to wrap the cached session config in {"setup": ...} so the modality lookup in transform_session_created_event actually exercises responseModalities forwarding (the prior payload was silently treated as empty). * test(gemini realtime): wrap remaining cached session configs in setup envelope The session_configuration_request the proxy caches is always serialized as {"setup": ...}; three modality-related tests dumped a bare config dict instead, so transform_session_created_event's `.get('setup', {})` quietly returned an empty dict and the responseModalities lookup ran against the default rather than the fixture. Wrap the remaining tests in the same shape the production cache uses so any regression in modality forwarding actually trips. * fix(gemini realtime): cast merged realtimeInputConfig for typeddict assignment mypy flagged the assignment of the merged dict into BidiGenerateContentSetup.realtimeInputConfig with [typeddict-item]: the intermediate variable widens to dict[Any, Any], losing the TypedDict narrowing the previous dict-literal form had. * test(gemini realtime): wrap test_gemini_tool_call_resets_ids fixture in setup envelope The cached session_configuration_request the proxy stores is always serialized as {"setup": ...}; this test passed a bare config dict, so transform_session_created_event's .get('setup', {}) returned an empty dict and the responseModalities lookup ran against the default rather than the fixture. Wrap the fixture in the same shape the production cache uses. 
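The bounded call_id → name mapping mentioned above (a 256-entry LRU built on `OrderedDict`) might look roughly like the sketch below. `BoundedCallIdMap` and its method names are hypothetical and chosen for illustration; the real code stores the mapping on `_tool_call_id_to_name` and may differ in detail.

```python
from collections import OrderedDict
from typing import Optional


class BoundedCallIdMap:
    """Hypothetical LRU map from tool call_id to function name."""

    def __init__(self, max_entries: int = 256) -> None:
        self._max_entries = max_entries
        self._entries: "OrderedDict[str, str]" = OrderedDict()

    def remember(self, call_id: str, name: str) -> None:
        # Insert or refresh the entry, then evict the least recently used ones.
        self._entries[call_id] = name
        self._entries.move_to_end(call_id)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)

    def lookup(self, call_id: str) -> Optional[str]:
        # Non-destructive read (get, not pop) so a retried function_call_output still resolves.
        name = self._entries.get(call_id)
        if name is not None:
            self._entries.move_to_end(call_id)
        return name

    def __len__(self) -> int:
        return len(self._entries)


# Tiny capacity so the eviction is easy to see.
call_names = BoundedCallIdMap(max_entries=2)
call_names.remember("call_a", "get_weather")
call_names.remember("call_b", "get_time")
first_lookup = call_names.lookup("call_a")
retry_lookup = call_names.lookup("call_a")  # a retry hits the same entry
call_names.remember("call_c", "get_news")   # evicts call_b, the least recently used
```

Reads refresh recency, so a call_id that a client keeps retrying is the last to be evicted, while memory stays bounded over a long session.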
* fix(gemini realtime): skip unknown sibling keys in transform loop Gemini realtime messages can include sibling metadata keys like usageMetadata alongside primary payload keys (toolCall, serverContent). Previously, the transform loop called map_openai_event for every top-level key, raising ValueError for unknown ones and terminating the WebSocket session. Skip top-level keys not present in MAP_GEMINI_FIELD_TO_OPENAI_EVENT to keep the session alive when Gemini emits usage metadata with a toolCall response. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(gemini realtime): scope dotted-key event lookup and propagate session metadata to tool-call response.done - map_openai_event: only check the current key/value pair when resolving dotted map entries (e.g. serverContent.turnComplete) so a sibling key in the same frame can't misclassify the event being processed (e.g. toolCall returning RESPONSE_DONE). - tool-call path: extract generationConfig once and include modalities, temperature, and max_output_tokens on response.done so its shape matches response.created and the non-tool-call response.done. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(gemini realtime): cast maxOutputTokens to int for typeddict assignment * fix(gemini realtime): use camelCase maxOutputTokens in response.done Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(gemini realtime): cast maxOutputTokens to int for typeddict assignment * fix(realtime): inject guardrail turn_detection on subsequent session.update without one Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(gemini realtime): tolerate sibling-only frames (e.g. standalone usageMetadata) A Gemini Live frame that contains only metadata keys outside _KNOWN_GEMINI_TOP_LEVEL_KEYS (e.g. a bare {"usageMetadata": {...}} emitted between turns) leaves returned_message empty after the transform loop and was tripping the 'Unknown message type' guard, which raised ValueError and terminated the WebSocket session. 
Treat such frames as no-ops and return the unchanged state instead. * fix(gemini realtime): preserve sibling toolCall when serverContent has only transcription Previously, when a Gemini frame contained both a transcription-only serverContent and a sibling toolCall, the transcription handler would early-return and silently drop the toolCall. Instead, mark serverContent as handled and fall through so the main loop still processes siblings like toolCall, while preserving the prior no-op behavior for empty/ transcription-only frames. Co-authored-by: Yassin Kortam <yassin@berri.ai> * refactor(gemini realtime): drop unused json_message arg from map_openai_event Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(gemini realtime): promote nested turn_detection when flat value is not a dict When the session payload had `turn_detection: None` (or any non-dict value), the normalizer skipped promoting the GA nested `audio.input.turn_detection` because it only checked key presence. The stale None then flowed into `map_automatic_turn_detection` and raised TypeError on `'create_response' in value`. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(realtime): run guardrails on function_call_output content Tool result outputs are client-controlled and fed to the model, so they must pass the same content checks as user text messages. Otherwise an attacker can smuggle blocked content into a function_call_output and have the model process it. * fix(gemini realtime): emit function_call_arguments.delta before .done Gemini delivers the full function-call arguments in a single toolCall frame. The OpenAI Realtime spec orders the streaming events as output_item.added -> function_call_arguments.delta(+) -> function_call_arguments.done -> output_item.done. Emit a single delta carrying the complete arguments string before the matching .done so spec-compliant SDK clients that accumulate deltas and gate finalisation on at least one delta arriving do not stall on Gemini tool calls. 
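The event ordering described in the last commit above (output_item.added, then a single function_call_arguments.delta, then .done, then output_item.done) can be sketched as follows. `build_tool_call_events` is a hypothetical helper and the payloads are deliberately reduced: real events also carry fields such as event_id, response_id, and item_id, which are omitted here.

```python
import json
from typing import Any, Dict, List


def build_tool_call_events(call_id: str, name: str, args: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Hypothetical helper: expand one complete tool call into the ordered event sequence."""
    arguments = json.dumps(args)
    item = {"type": "function_call", "call_id": call_id, "name": name, "arguments": arguments}
    return [
        # The item is announced first, before any arguments have streamed.
        {"type": "response.output_item.added", "item": {**item, "arguments": ""}},
        # The provider sent the arguments in one frame, so one delta carries the whole string.
        {"type": "response.function_call_arguments.delta", "call_id": call_id, "delta": arguments},
        {"type": "response.function_call_arguments.done", "call_id": call_id, "arguments": arguments},
        # A copy of the item, so a consumer mutating one event cannot affect another.
        {"type": "response.output_item.done", "item": dict(item)},
    ]


events = build_tool_call_events("call_1", "get_weather", {"city": "Paris"})
event_types = [event["type"] for event in events]
```

A client that accumulates deltas and finalises only after at least one delta has arrived would end up with the same arguments string that the .done event carries.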
* fix(realtime): avoid stale session.created flag triggering guardrail re-injection Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(ci): restore guardrail injection on duplicate session.created and cast realtime delta event - Re-enable the one-time guardrail turn_detection update on duplicate session.created. `_maybe_send_guardrail_turn_detection_update` is already idempotent via `_guardrail_turn_detection_update_sent`, so the previous guard was unnecessary and broke the deferred-setup path where the synthetic session.created is emitted by llm_http_handler outside this loop (no prior chance to inject). - Cast the response.function_call_arguments.delta dict appended to `returned_message: List[OpenAIRealtimeEvents]` so mypy is satisfied. * fix(realtime): forward sanitized function_call_output on guardrail block Providers that pair every toolCall with a toolResponse (e.g. Gemini and Vertex Live) stay in the awaiting-tool-call state until a toolResponse arrives. Dropping a blocked function_call_output outright left those providers stalled — the subsequent guardrail clientContent and response.create were ignored because the prior toolCall had no matching toolResponse. When the client-supplied tool output fails the realtime guardrail check, forward a sanitized placeholder function_call_output (same call_id, generic policy marker as output) instead of dropping the message entirely. The placeholder carries no blocked content, so the model never sees it, while still completing the provider's tool-call cycle so the session can recover and the violation message reaches the user. * fix(gemini realtime): preserve sibling keys on empty toolCall no-op Replace the early return on `functionCalls` empty/absent with a `continue` plus a `tool_call_handled` flag that mirrors the existing `server_content_handled` pattern. 
The post-loop guard already distinguishes intentionally-consumed known keys from genuinely-unknown messages, so adding `toolCall` to that exclusion list lets the loop continue iterating over any sibling top-level keys in the same Gemini frame instead of short-circuiting on the first empty toolCall. In practice Gemini's protobuf places `toolCall`/`serverContent`/ `setupComplete` in a `oneof` so the only realistic sibling is `usageMetadata` (already filtered as unknown-top-level), but the uniform handling avoids silently discarding any future sibling key should the wire format grow. * fix(gemini realtime): redact realtime payloads from debug logs The transform_realtime_response debug logs were dumping the raw inbound Gemini frame and each outbound OpenAI event payload (up to 500 chars). Realtime frames carry transcripts, model output, and tool-call arguments, so those strings ended up in application logs whenever DEBUG was enabled. Replace the inbound dump with just the top-level frame keys and the outbound dump with just the event type. * fix(realtime): check function_call_output before user role to prevent guardrail bypass Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(gemini realtime): propagate usageMetadata on tool-call response.done Gemini Live emits usageMetadata as a sibling top-level key alongside the toolCall frame; the tool-call branch was unconditionally building response.done from get_empty_usage(), so tokens consumed by tool-call turns were recorded as zero spend and bypassed LiteLLM budget accounting. Mirror the non-tool-call RESPONSE_DONE path: when the same frame carries usageMetadata, run VertexGeminiConfig._calculate_usage and forward the real token counts. * fix(realtime): send sanitized toolResponse before guardrail clientContent Two related fixes for the function_call_output blocked-by-guardrail path: 1. Ordering: Gemini Live requires a matching toolResponse immediately after a toolCall before any other client message. 
Previously we ran the guardrail first (which sends clientContent/cancel) and only then forwarded the sanitized function_call_output. Add an optional pre_block_backend_message arg to run_realtime_guardrails so the sanitized toolResponse is emitted before the guardrail's own backend messages. 2. Stale pending flag: stop setting _pending_guardrail_message in the tool-output block. That flag exists to swallow the reflexive response.create an OpenAI client sends right after a user text message. In tool-calling flows the client may never send a response.create (e.g. Gemini SDKs auto-respond), so leaving the flag set would consume an unrelated response.create from a later turn. Co-authored-by: Yassin Kortam <yassin@berri.ai> * test(model_prices): allow audio_transcription_config in schema * fix(gemini realtime): event_id, item copy, and dict guard for tool-call events - Emit event_id on response.output_item.added for tool calls so spec-compliant OpenAI Realtime SDK clients can index/deduplicate the event like every other server-sent event in the sequence. - Pass a shallow copy of function_call_item to response.output_item.done and conversation.item.created so downstream handlers (e.g. the beta-protocol translator) that mutate the item dict don't corrupt sibling events sharing the same reference. - Guard map_openai_event against non-dict values (e.g. Gemini's 'setupComplete: true' boolean payload) so the WebSocket session doesn't die with an AttributeError on the unguarded .get() call. Add NotRequired event_id field on OpenAIRealtimeStreamResponseOutputItemAdded to keep existing call-sites that don't set event_id compatible. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(gemini realtime): buffer standalone usageMetadata for next response.done Gemini Live can emit usageMetadata as a standalone WebSocket frame between turns. 
The previous transformer treated those frames as no-ops, so token counts arriving outside the closing turnComplete/toolCall frame were dropped from spend and budget accounting. An authenticated client could drive turns whose usage was recorded as zero, bypassing budgets. Buffer any standalone usageMetadata on the config instance and attribute the deferred counts to the next emitted response.done (tool-call or normal). In-frame usageMetadata remains authoritative and clears the buffer. * merge main (#28839) * fix(helm): drop main- prefix from default image tag (#28710) * fix(helm): drop main- prefix from default image tag The default image tag in the deployment + migrations-job templates was `main-{{ .Chart.AppVersion }}`. The current release pipeline publishes content tags without the `main-` prefix (e.g. `v1.85.1` / `1.85.1`, `v1.86.0-rc.1` / `1.86.0-rc.1`), so the rendered ref points at a tag that does not exist on GHCR or DockerHub and installs fail with ImagePullBackOff. - templates/deployment.yaml, templates/migrations-job.yaml: render `.Chart.AppVersion` directly instead of `main-<AppVersion>`. - Chart.yaml: bump stale `appVersion: v1.80.12` (not on either registry) to `v1.85.1` so local-checkout installs also resolve. - values.yaml: update the commented tag-override hint to match. * fix(helm): use :latest in tag override example, not pinned version Per review: ghcr.io/berriai/litellm-database:latest is a floating alias for the most recent stable (same digest as :main-stable), maintained by the release pipeline's UPDATE_LATEST advance step. Better example than a pinned version that goes stale. * test(model_prices): allow audio_transcription_config in schema (#28708) The schema in test_aaamodel_prices_and_context_window_json_is_valid uses additionalProperties: false. The azure/speech/azure-stt entry added in #27482 introduced an audio_transcription_config field that the schema did not whitelist, so the test fails on every branch built on top of staging. 
Add the field as a string property. * fix(team): refresh team cache on team_model_add/delete (LIT-3244) (#28683) * fix(team): refresh team cache on team_model_add/delete (LIT-3244) team_model_add and team_model_delete wrote to the DB but did not invalidate the in-memory LiteLLM_TeamTableCachedObj used by common_checks. After the v1.83.14 common_checks centralization made team.models authoritative on /v1/files and /v1/vector_stores/*, adding a Team-BYOK model silently failed to grant the new public model name to team members until the cache TTL expired (and a removed model kept working until then on the symmetric path). Extract the cache-refresh snippet from update_team into a small helper and apply it consistently at all three team-write sites. * test: also assert updated models in team-cache-refresh pin Strengthens the LIT-3244 regression test to also assert `call_kwargs["team_table"].models` matches the updated row, not just `team_id`. Both `existing_team` and `updated_team` share `team_id` in the test setup, so the previous assertion would have passed even if the implementation accidentally cached the pre-mutation row. Greptile review feedback. * fix(team): hydrate object_permission on cache-refreshing team updates The Prisma update calls in update_team, team_model_add, and team_model_delete returned a team row with object_permission_id set but object_permission=None (the relation was not requested via include=). _refresh_cached_team then wrote that to the in-memory LiteLLM_TeamTableCachedObj, and the cache-hit path in get_team_object returns the cached object without re-hydrating. Downstream consumers (validate_key_search_tools_against_team, the MCP/agent authz paths) treat a missing object_permission as no team-level restriction, so a team-write op silently dropped object-permission enforcement until the cache TTL expired or a DB-fetch path re-hydrated it. 
Add include={"object_permission": True} to all three updates so the refresh writes a complete cached team. Extend the LIT-3244 regression test to pin both the cached object_permission and the include shape on the Prisma call. Surfaced in PR review of LIT-3244. * fix(ui/add-model): stop vertex_ai-anthropic_models from leaking under Anthropic (#28723) `getProviderModels()` matched a model into a provider's dropdown when the model's `litellm_provider` string *contained* the provider key as a substring. The intent was to admit suffix variants (e.g. `anthropic_text`, `bedrock_converse`), but the substring check is too loose: it also pulls in unrelated providers whose name happens to contain the key, most visibly `vertex_ai-anthropic_models` matching `anthropic` and `vertex_ai-openai_models` matching `openai`. Replace `.includes()` with separator-anchored prefix matching (`startsWith(provider + "_")` / `startsWith(provider + "-")`). All legitimate variants in `model_prices_and_context_window.json` still match (`anthropic_text`, `azure_text`, `azure_ai`, `bedrock_converse`, `bedrock_mantle`, `cohere_chat`, `fireworks_ai-embedding-models`, `vertex_ai-*`, `vertex_ai_beta`), and the cross-provider leak is closed. Tests: update one assertion that pinned the buggy substring behavior (`custom_openai_endpoint` matching `openai` — not a real provider value); add 6 new tests covering the leak regressions and the variant-preservation contract for vertex_ai/bedrock/fireworks. * Fix spend logs v2 route permissions (#28705) Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: ryan-crabbe-berri <ryan-crabbe-berri@users.noreply.github.com> * fix(proxy): Bedrock Knowledge Base pass-through: preserve SigV4 headers and signed request body (#27526) * Fix Bedrock KB pass-through SigV4 headers and signed body Coerce botocore HeadersDict to a dict for pass-through routes. 
When forward_headers is true, drop request headers that collide case-insensitively with signed headers so client Bearer auth does not shadow AWS SigV4. Send prepped.body as raw content so the outbound payload matches the signature after logging hooks mutate the parsed dict. Co-authored-by: Cursor <cursoragent@cursor.com> * Simplify pass-through raw body handling Read the SigV4-signed bytes directly from request.state inside pass_through_request instead of threading a custom_raw_body argument through three functions. Helper methods are restored to their original signatures, and the new branch lives in one place at each httpx call site. Co-authored-by: Cursor <cursoragent@cursor.com> * Harden pass-through raw body read from request.state Guard missing request.state (test fixtures) and ignore non-bytes/str values so MagicMock does not trigger the SigV4 raw-body path. Co-authored-by: Cursor <cursoragent@cursor.com> * Test pass_through_request state_raw_body uses httpx content= Cover non-streaming (async_client.request) and streaming (build_request) paths so SigV4 bytes on request.state are not replaced by json= of a hook-mutated dict. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * chore(tests): migrate Bedrock CI to AWS account 941277531214 (#28728) * chore(tests): migrate Bedrock CI from AWS account 888602223428 to 941277531214 The original account (888602223428) was put under a security restriction by AWS after a root access key leaked in a PR comment. While that account works its way through the AWS Support unlock process, Bedrock-touching CI tests have been migrated to a fresh account (941277531214). Changes: - Replace 26 hardcoded references to 888602223428 with 941277531214 across 8 files (provisioned-model ARNs, imported-model ARNs, AgentCore runtime ARNs, batch execution role ARN, and example proxy config). 
- The provisioned-model and imported-model ARNs are referenced only from mocked unit tests — no AWS resources to recreate. - The batch execution IAM role has been recreated in the new account with the same name and equivalent permissions. - The two AgentCore runtimes (hosted_agent_r9jvp-3ySZuRHjLC, hosted_agent_13sf6-cALnp38iZD) are being recreated in the new account under the same names — see tools/agentcore-deploy/ in a follow-up. CircleCI env vars AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_REGION_NAME were updated separately via the CircleCI API to point at the new account. Smoke-tested locally against the new account: aws bedrock-runtime converse --region us-west-2 \ --model-id us.anthropic.claude-sonnet-4-5-20250929-v1:0 \ --messages '[{"role":"user","content":[{"text":"ping"}]}]' → 200, model returned 'pong' Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * chore(tests): refresh AgentCore ARN suffixes to match newly-deployed runtimes The first migration commit replaced just the account ID, but AgentCore auto-assigns a random 10-char suffix to every runtime on creation — we can't reuse the original suffixes (`3ySZuRHjLC`, `cALnp38iZD`) in the new account. Updated the AgentCore-runtime ARNs in the three files that reference real runtime IDs (not the mock-based unit-test ARNs). Deployed runtimes: arn:aws:bedrock-agentcore:us-west-2:941277531214:runtime/hosted_agent_r9jvp-Rq79QFC2fp arn:aws:bedrock-agentcore:us-west-2:941277531214:runtime/hosted_agent_13sf6-4046UzHSwy Both runtimes are status=READY and pass a smoke invoke: $ aws bedrock-agentcore invoke-agent-runtime --agent-runtime-arn ... --payload '{"prompt":"ping"}' → 200, {"result": "echo: ping"} The agent is a minimal echo (see /tmp/agentcore_deploy/agent.py for the deploy artifacts). Tests that only verify the SDK wiring will pass; if any test asserts on agent output content, swap the echo for the real agent. 
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * chore(tests): point Bedrock batch tests at new-account S3 bucket The account migration (888602223428 -> 941277531214) was a flat account-ID swap, which only rewrites ARNs that embed the account number. S3 bucket names carry no account ID, so the live Bedrock batch tests still uploaded to `litellm-proxy` — a bucket that lives in the old account. S3 names are globally unique, and the old account still holds that name, so it can't be recreated in the new account. Rename to `litellm-proxy-941277531214` (account-ID suffix guarantees global uniqueness). The bucket must be created in 941277531214 and the batch execution role granted s3:GetObject/PutObject/ListBucket on it before this job is run in CI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(tests): point live S3 logging test at new-account bucket Same account-ID-free blind spot as the batch bucket: `load-testing-oct` lives in the old account and its name can't be reused globally. The `logging_testing` CI job is wired into the workflow and runs test_basic_s3_logging, which uploads to this bucket with the CI env creds, then lists and deletes objects — a live dependency. Rename to `load-testing-oct-941277531214`. The bucket must exist in the new account with the CI IAM principal granted s3:PutObject/GetObject/ListBucket/DeleteObject before this job runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(tests): repoint Bedrock guardrail IDs to new-account guardrails The migration left guardrail IDs untouched (no account ID in them), so all live guardrail tests failed with "guardrail identifier or version does not exist" against 941277531214. 
Recreated both guardrails in the new account and updated the hardcoded IDs: - wf0hkdb5x07f -> zgkmukebruil (PII mask: PHONE + CREDIT_DEBIT_CARD, with explicit inputAction=ANONYMIZE so masking applies to INPUT, which is the source litellm's moderation hook sends) - ff6ujrregl1q -> 4w3d1di3snt5 (blocks "coffee"; blocked message set to the exact string the tests assert on) Updated test_bedrock_guardrails.py, otel_test_config.yaml, and the guardrailConfig in test_bedrock_completion.py. Verified locally: the 5 previously-failing guardrail tests now pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(bedrock): migrate legacy models to current inference profiles The new CI account (941277531214) cannot invoke legacy Bedrock models (AWS gates them: "marked by provider as Legacy... not actively using in the last 30 days"). Migrated the live-call tests: - anthropic.claude-3-sonnet-20240229 -> us.anthropic.claude-sonnet-4-5-20250929-v1:0 - anthropic.claude-3-haiku-20240307 -> us.anthropic.claude-haiku-4-5-20251001-v1:0 Current Claude models on Bedrock require the us. inference-profile prefix (bare on-demand ids are rejected). cohere.command-r-plus has no working replacement (all Cohere is legacy- gated in the new account): swapped to claude-haiku-4-5 in provider- agnostic param lists. amazon.titan-image-generator skipped (no working replacement). Mocked/transformation/cost tests that reference the legacy strings are intentionally left unchanged. Verified live against the new account. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(bedrock): repoint SageMaker + Knowledge Base to new-account resources These referenced account-scoped resources by hardcoded id that only existed in the old account, so the migration's account-ID swap missed them. 
Recreated in 941277531214 and repointed: - SageMaker endpoint jumpstart-dft-hf-textgeneration1-mp-20240815-185614 -> litellm-ci-textgen (gpt2 on a TGI container, ml.g5.xlarge) - Bedrock Knowledge Base T37J8R4WTM -> LCYXFBR2TU (OpenSearch Serverless vector store + titan-embed-text-v2, seeded with a LiteLLM doc) Verified live: test_sagemaker.py (12 passed) and test_bedrock_knowledgebase_hook.py (12 passed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(reasoning_effort_grid): skip bedrock claude-opus-4-7 cells (not entitled on 941277531214) claude-opus-4-7 is listed in the new Bedrock CI account's foundation models but invoke is denied (AccessDeniedException: "not available for this account"). Bedrock access to the flagship Opus requires an AWS Sales request, not the self-serve model-access toggle, so it can't be enabled inline with the rest of the account migration. Add an optional `skip_reason` to ModelEntry and set it on the bedrock-claude-opus-4-7 entry; the grid test honors it via pytest.skip. Cell count (231) and route coverage are unchanged, so the structural asserts still pass. Restore coverage by deleting the one skip_reason line once access is granted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(bedrock): swap/skip legacy-gated models unavailable on new CI account The migrated AWS account (941277531214) cannot access several models that the old account could, so the remaining red CI jobs were hitting real Bedrock "Access denied / Legacy" and "account not authorized" errors: - image_gen: skip both Nova Canvas test classes (amazon.nova-canvas-v1:0 is legacy-gated), matching the existing titan skip. - batches: skip test_async_file_and_batch (Bedrock batch inference is not authorized on the new account; requires an AWS support case). - litellm_overhead: swap legacy claude-3-5-haiku for the active us.anthropic.claude-haiku-4-5 inference profile. 
- test_completion_claude_3_function_call: swap legacy claude-3-sonnet for the active us.anthropic.claude-sonnet-4-5 inference profile. https://claude.ai/code/session_01Y7zgHYu9GX29YRwV4yiWAa * test(bedrock): fix remaining e2e legacy-model + batch failures on new CI account - e2e_openai_endpoints: skip test_bedrock_batches_api (Bedrock batch inference is not authorized on account 941277531214) and migrate the missed s3_bucket_name in oai_misc_config.yaml to litellm-proxy-941277531214. - build_and_test: swap legacy bedrock claude-3-sonnet for the active us.anthropic.claude-sonnet-4-5 inference profile in the proxy structured output e2e test. https://claude.ai/code/session_01Y7zgHYu9GX29YRwV4yiWAa * test(bedrock): make opus-4-7 + batch cells fail loudly and mock image-gen (#28791) Replace the silent skips added for the new CI account with noisier behavior: - reasoning-effort grid: opus-4-7 cells now fail (when AWS creds are present) instead of skipping, so the missing entitlement stays visible in CI; they still skip when AWS creds are absent (local dev) - Bedrock batch inference tests: drop the skip so they run and fail until batch access is granted - Titan + Nova Canvas image-gen tests: mock the Bedrock HTTP call so the transform + cost-tracking path stays under test without live model access https://claude.ai/code/session_01MT7SWDnXUjv6e6EPG7BDjT Co-authored-by: Claude <noreply@anthropic.com> * test(bedrock): use pytest.xfail for known-failing opus-4-7 cells Replace pytest.fail with pytest.xfail when a model has a fail_reason, so known-broken cells stay visible as XFAIL without keeping CI red. 
Co-authored-by: Yassin Kortam <yassin@berri.ai> --------- Co-authored-by: Mateo <mateo@Mateos-MacBook-Pro.local> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(otel): export SERVER span on management-endpoint success without http_request (#28794) Co-authored-by: Yassin Kortam <yassinkortam@Yassins-MacBook-Pro.local> * chore(ci): merge dev branch (#28801) * chore(proxy): route path-dependent call sites through get_request_route Replace direct ``request.url.path`` reads in auth, ACL, routing, and audit-log decisions with ``get_request_route(request)`` — the helper already added in ``auth/auth_utils.py`` that returns the ASGI ``scope["path"]`` with ``root_path`` stripped. Starlette reconstructs ``url.path`` from the Host header; ``scope["path"]`` is uvicorn's parse of the request line and matches what FastAPI dispatches on, so it's the authoritative route for any decision that should agree with the actual handler. Sites: - _experimental/mcp_server/auth/user_api_key_auth_mcp.py - management_endpoints/mcp_management_endpoints.py - vector_store_endpoints/utils.py - pass_through_endpoints/pass_through_endpoints.py - auth/route_checks.py - litellm_pre_call_utils.py - spend_tracking/spend_management_endpoints.py - common_utils/http_parsing_utils.py - management_helpers/utils.py - health_endpoints/_health_endpoints.py Adds regression tests in tests/proxy_unit_tests/test_proxy_routes.py that construct a Request with scope["path"] set to a benign route and the Host header crafted so url.path would resolve differently; each site's decision is asserted against scope["path"]. * chore(proxy): make get_request_route imports lazy at call sites Move the ``from litellm.proxy.auth.auth_utils import get_request_route`` imports added in the prior commit back to the function bodies that use them. 
The module-level form participates in a long-standing import cycle through ``auth_utils -> _types -> ...`` and was flagged by CodeQL on the PR; the lazy form matches the pattern the proxy already uses for ``user_api_key_auth`` and related helpers elsewhere in these files. Also drop the ``RouteChecks._is_assistants_api_request`` delegation in ``_get_metadata_variable_name`` introduced in the prior commit — the delegation pulled ``RouteChecks`` into the same cycle, and the call site reuses the resolved route for its other branches, so inlining the substring check is both cycle-free and avoids a redundant second ``get_request_route`` call. Comment in test_proxy_routes.py acknowledges that the two MCP table entries exercise ``get_request_route`` directly rather than the full production handler (which needs ASGI scope + MCP state to invoke). --------- Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: user <70670632+stuxf@users.noreply.github.com> * chore(ci): merge dev branch (#28657) * feat(dashboard): navbar hierarchy + Agent Platform notifications (#27543) * feat(dashboard): refine navbar zones and Agent Platform notice Restructure the admin navbar for production users: clear product vs community vs personal columns with vertical dividers, icon-only Slack/GitHub in a shared chip, and Docs/Blog typography aligned on an 8px rhythm. Add a notifications bell with popover linking to the LiteLLM Agent Platform repo and optional mark-as-read persistence. Promote the account control with initials avatar, single-line display name, and navDisplayName mapping for placeholder user ids (e.g. default_user_id). 
Co-authored-by: Cursor <cursoragent@cursor.com> * fix(dashboard): address PR review — AntD buttons, public page guard, dedupe regex - Replace raw <button> with AntD Button in BlogDropdown, NotificationsBell, UserDropdown, and test mock - Guard NotificationsBell + container behind !isPublicPage to avoid rendering on public pages - Remove redundant equality checks in navDisplayName (regex already covers them) - Remove unused `lower` variable after simplification Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: yuneng-jiang <yuneng@berri.ai> * fix(dashboard): drop dead useHealthReadiness import in navbar The module was removed in #27896 (replaced by useHealthReadinessDetails), but the import survived the rebase. The symbol is unused — only useHealthReadinessDetails is consumed in the file. Removing the dead import unblocks the UI TypeScript build. * fix(dashboard): align CommunityEngagementButtons test with icon-only aria-labels The component was refactored to an icon-only chip with aria-label='LiteLLM on GitHub' (squash #27543), but the test still asserted /star us on github/i. Update the query to match the rendered accessible name. * refactor(dashboard): drop unused props from NavbarProps The navbar refactor moved user identity + dark-mode state to internal hooks (useAuthorized, useWorker), but the NavbarProps interface still declared userID, userEmail, userRole, premiumUser, isDarkMode, and toggleDarkMode as required, forcing every caller to thread them through. Drop them from the interface and all four call sites (page.tsx, (dashboard)/layout.tsx, public_model_hub.tsx, navbar.test.tsx). Also shrinks the destructure in layout.tsx so the now-unused locals stop being pulled out of useAuthorized(). 
* refactor(dashboard): use useSyncExternalStore for NotificationsBell dismiss flag Reads/writes of the litellmHideAgentPlatformBanner key were done directly inside NotificationsBell via a useEffect + useState pair. Every other localStorage-backed flag in the dashboard (Disable ShowPrompts, DisableBouncingIcon, DisableShowNewBadge, DisableUsageIndicator, DisableBlogPosts) is wrapped in a useSyncExternalStore hook over localStorageUtils so all mounted components stay in sync. Extract useHideAgentPlatformBanner to follow the same shape, swap NotificationsBell to consume it, and add a regression test that two sibling bells stay in sync without a remount when one is dismissed. * refactor: mask credential fields in proxy settings GET responses (#28682) * refactor: mask credential fields in proxy settings GET responses Brings SSO settings, cache settings, and the email/Slack alerting view in /get/config/callbacks in line with the HashiCorp Vault config-override pattern, so persisted credentials are not transported back to the UI in plaintext. * refactor: harden short-value masking and hoist alerting var constant Closes two review observations: - mask_sensitive_keys now replaces short values (below the visible prefix+suffix length) with an all-mask string instead of returning them unchanged, so a 1-7 character credential is no longer round-tripped verbatim. - _ALERTING_SENSITIVE_VARS is moved out of get_config() to a module-level constant, matching the analogous _SSO_SENSITIVE_FIELDS and _CACHE_SENSITIVE_FIELDS in the SSO and cache endpoint files. --------- Co-authored-by: Krrish Dholakia <krrish+github@berri.ai> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * fix(ui): show 2-decimal precision for max_budget on key overview (#28809) The Key Info Overview tab's Spend card truncated sub-dollar budgets to "$0" because formatNumberWithCommas defaults to 0 decimals. 
The Settings tab passes 2; align the overview so a $0.10 budget renders as "$0.10". Resolves LIT-2845 * feat(proxy): allow `llm_api_routes` virtual keys to list MCP servers (#28442) * feat(proxy): allow llm_api_routes virtual keys to list MCP servers Add a new `mcp_discovery_routes` group (GET /v1/mcp/server and GET /v1/mcp/server/{server_id}) and include it in `llm_api_routes` so that virtual keys configured with `allowed_routes=["llm_api_routes"]` can discover the MCP servers they have access to. Previously these calls failed with 'Virtual key is not allowed to call this route. Only allowed to call routes: [llm_api_routes]'. The GET handlers already sanitize the response for restricted virtual keys via `_sanitize_mcp_server_list_for_virtual_key`, stripping credential-bearing fields (url, headers, env). Write methods (POST/PUT/DELETE) on the same paths remain gated by the existing handler-level admin role checks. The new discovery list is intentionally kept OUT of `mcp_inference_routes`, so `is_llm_api_route()` still returns False for these paths — this preserves the existing contract that DISABLE_LLM_API_ENDPOINTS must not block the Admin UI from listing MCP servers. Co-authored-by: ryan-crabbe-berri <ryan-crabbe-berri@users.noreply.github.com> * refactor(proxy): make MCP discovery carve-out method-aware Replace the `mcp_discovery_routes` group in `llm_api_routes` with a method-aware special case inside `is_virtual_key_allowed_to_call_route`. Virtual keys with allowed_routes=["llm_api_routes"] are now permitted to call only GET /v1/mcp/server and GET /v1/mcp/server/{server_id} — non-GET methods and multi-segment admin sub-paths fall through to the existing 403. This keeps the general llm_api_routes list free of management paths and avoids accidentally exposing POST/PUT/DELETE writes through the route-check layer. 
--------- Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: ryan-crabbe-berri <ryan-crabbe-berri@users.noreply.github.com> * chore(ci): merge dev branch (#28807) * chore(proxy): route path-dependent call sites through get_request_route Replace direct ``request.url.path`` reads in auth, ACL, routing, and audit-log decisions with ``get_request_route(request)`` — the helper already added in ``auth/auth_utils.py`` that returns the ASGI ``scope["path"]`` with ``root_path`` stripped. Starlette reconstructs ``url.path`` from the Host header; ``scope["path"]`` is uvicorn's parse of the request line and matches what FastAPI dispatches on, so it's the authoritative route for any decision that should agree with the actual handler. Sites: - _experimental/mcp_server/auth/user_api_key_auth_mcp.py - management_endpoints/mcp_management_endpoints.py - vector_store_endpoints/utils.py - pass_through_endpoints/pass_through_endpoints.py - auth/route_checks.py - litellm_pre_call_utils.py - spend_tracking/spend_management_endpoints.py - common_utils/http_parsing_utils.py - management_helpers/utils.py - health_endpoints/_health_endpoints.py Adds regression tests in tests/proxy_unit_tests/test_proxy_routes.py that construct a Request with scope["path"] set to a benign route and the Host header crafted so url.path would resolve differently; each site's decision is asserted against scope["path"]. * chore(proxy): make get_request_route imports lazy at call sites Move the ``from litellm.proxy.auth.auth_utils import get_request_route`` imports added in the prior commit back to the function bodies that use them. The module-level form participates in a long-standing import cycle through ``auth_utils -> _types -> ...`` and was flagged by CodeQL on the PR; the lazy form matches the pattern the proxy already uses for ``user_api_key_auth`` and related helpers elsewhere in these files. 
Also drop the ``RouteChecks._is_assistants_api_request`` delegation in ``_get_metadata_variable_name`` introduced in the prior commit — the delegation pulled ``RouteChecks`` into the same cycle, and the call site reuses the resolved route for its other branches, so inlining the substring check is both cycle-free and avoids a redundant second ``get_request_route`` call. Comment in test_proxy_routes.py acknowledges that the two MCP table entries exercise ``get_request_route`` directly rather than the full production handler (which needs ASGI scope + MCP state to invoke). --------- Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: user <70670632+stuxf@users.noreply.github.com> * fix(team): keep team_alias cache in sync on _cache_team_object writes (#28737) * fix(team): keep team_alias cache in sync on _cache_team_object writes _cache_team_object wrote only to the team_id:<id> cache key, but the JWT auth path that uses team_alias_jwt_field reads from a separate team_alias:<alias> key (get_team_object_by_alias caches under both keys on miss, but reads only the alias-keyed one). After any team-mutation endpoint (team_model_add, team_model_delete, update_team, the two access-group writes) the team_id cache was refreshed but the team_alias cache stayed stale until TTL — JWT callers using team_alias_jwt_field kept seeing the pre-mutation team for the full cache window. Mirror the write under the alias key inside _cache_team_object so every existing caller stays in sync without further changes. Skip the alias write when team_alias is None/empty so we don't collide across alias-less teams. Surfaced testing the LIT-3244 cherry-pick on patch/1.86.0: the LIT-3244 fix correctly invalidated the team_id cache but the customer's JWT used team_alias_jwt_field, so they kept hitting the stale alias-keyed entry. 
* fix(team): delete (not overwrite) team_alias cache on _cache_team_object The prior shape of this PR wrote both team_id:<id> AND team_alias:<alias> from _cache_team_object. team_alias is NOT unique in the schema (no @unique on LiteLLM_TeamTable.team_alias), and get_team_object_by_alias enforces uniqueness on its own DB-fetch path (len(teams) > 1 raises). Writing the alias-keyed cache from the generic refresh path bypassed that check: a team admin renaming their team to collide with another team's alias could silently overwrite the cached team for JWT-by-alias auth, swapping the resolved team under that alias for the cache window. Switch the alias-keyed operation from a write to a delete (mirroring the dual-cache delete pattern in _delete_cache_key_object). After every team write, the next JWT-by-alias reader cache-misses and falls through to get_team_object_by_alias, which (a) re-fetches the fresh team from DB, closing the LIT-3244 staleness gap that motivated this PR, and (b) enforces alias uniqueness before populating either cache key. team_id:<id> writes are unchanged — team_id is the table PK and is guaranteed unique. Surfaced in veria-ai review on #28739. * fix(managed-files): anchor model_id regex so it doesn't match llm_output_file_model_id extract_model_id_from_unified_id used `re.search(r"model_id,([^;]+)", ...)` which substring-matches the `model_id,` inside the file-ID encoding's `llm_output_file_model_id,<deployment_uuid>` field. parse_unified_id then fed that deployment UUID back into the auth path as a model candidate via _extract_models_from_managed_resource_id, and every team-BYOK file attach 403'd with: team not allowed to access model. This team can only access models=['openai/*']. Tried to access <deployment-uuid> The team's models list correctly contains the public name (`openai/*`) that target_model_names matches, but the bogus UUID candidate fails the wildcard check first. 
Anchor the regex to a field boundary (`(?:^|;)model_id,`) so it matches the legitimate top-level `model_id,<value>` field on vector_store unified IDs and skips substring matches inside other fields. File-IDs (which have no top-level `model_id` field) now return None and contribute no spurious UUID candidate. Surfaced reproducing LIT-3244 on patch/1.86.0 with the customer's exact flow: team with openai/* BYOK deployment, JWT-scoped user, POST /v1/vector_stores/{id}/files attaching a file uploaded with target_model_names=openai/gpt-4o. * fix(proxy): hydrate wildcard discovery credentials (#28284) (#28822) * fix(proxy): hydrate wildcard discovery credentials * fix(proxy): constrain wildcard credential hydration Co-authored-by: Dibyo Mukherjee <dibyo@adobe.com> * ci: add daily oss-agent-shin bra…
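The field-boundary anchoring described above for extract_model_id_from_unified_id can be illustrated with a small, self-contained sketch. The two regex patterns are the ones quoted in the commit message; the decoded ID strings and the helper name `extract_model_id` are hypothetical stand-ins chosen for illustration, not the real encoding or the real function.

```python
import re
from typing import Optional, Pattern

# Patterns quoted in the commit message.
UNANCHORED = re.compile(r"model_id,([^;]+)")          # old: substring match
ANCHORED = re.compile(r"(?:^|;)model_id,([^;]+)")     # new: field-boundary match


def extract_model_id(pattern: Pattern[str], decoded_id: str) -> Optional[str]:
    """Return the captured value of the first match, or None (hypothetical helper)."""
    match = pattern.search(decoded_id)
    return match.group(1) if match else None


# Hypothetical decoded IDs: only the ';'-separated 'key,value' shape is taken
# from the commit message; the other field names and values are made up.
file_id = "unified_id,abc;target_model_names,openai/gpt-4o;llm_output_file_model_id,1234-uuid"
vector_store_id = "unified_id,xyz;model_id,my-deployment"

# The old pattern matches inside 'llm_output_file_model_id,' and leaks the UUID.
print(extract_model_id(UNANCHORED, file_id))
# The anchored pattern ignores that field and still finds the top-level one.
print(extract_model_id(ANCHORED, file_id))
print(extract_model_id(ANCHORED, vector_store_id))
```

The non-capturing group `(?:^|;)` keeps the capture group index unchanged, so callers that read `group(1)` need no changes.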
…8891) Deletes 39 files across 12 unused subdirs under src/app/(dashboard)/. None of these routes are reachable: LEGACY_REDIRECTS in src/app/page.tsx is empty, the live UI renders everything via the legacy ?page=X switch, and no other code statically imports from these dirs. Kept api-reference, models-and-endpoints, organizations, playground, virtual-keys because they ARE imported by the legacy switch in src/app/page.tsx — they will be migrated properly in the App Router migration. Kept shared infra: layout.tsx, hooks/, components/, networking.ts, README.md — these are imported by live code. Clean baseline before the App Router migration (LIT-3128).
#28888) * fix(docker): use system Node in componentized builders + retry apk add Two failure modes in the componentized image builds (backend, migrations, gateway) on project-releaser, with the same root cause: 1. The builder-stage `apk add` was missing `libatomic`. `prisma generate` triggers prisma-client-py's `nodeenv`, which downloads the latest stable Node.js at build time. Node 26.1.0 (last passing build on 2026-05-20) did not dynamically link `libatomic.so.1`. Node 26.2.0 (current latest) does, and the Wolfi builder doesn't ship libatomic — so `npm install prisma@…` fails with `node: error while loading shared libraries: libatomic.so.1` and exit 127. Retrying or pinning the Node version is a treadmill; the root issue is that nodeenv decides the Node version at build time. Fix: add `nodejs npm` to the builder-stage `apk add` so prisma-client-py uses Wolfi's own Node via its default `PRISMA_USE_GLOBAL_NODE=true`. The legacy `docker/Dockerfile.non_root` already does this; the componentized Dockerfiles regressed it. Setting `PRISMA_USE_GLOBAL_NODE=true` in ENV redundantly nails the intent so a future env override can't silently re-enable nodeenv's download. 2. Transient `apk.cgr.dev` mirror flakes during the arm64 leg of multi-arch builds cause individual package fetches to fail mid-install (we saw `nss-db-2.43-r7: remote server returned error (try 'apk update')` and similar for libzstd1, libogg, binutils in this run). None of the componentized Dockerfiles wrap `apk add` in a retry loop. Fix: wrap every `apk add` (builder + runtime, all three files) in the same `for i in 1 2 3; do … && break || sleep 5; done` loop that the legacy `docker/Dockerfile.non_root` already uses. Affected files all have the same shape — backend, migrations, gateway — because they're three near-identical componentizations of the original monolithic proxy Dockerfile. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(docker): trim verbose comments on builder Node setup Same fix, leaner comments. The apk-add note is 3 lines now (was 8), and the PRISMA_USE_GLOBAL_NODE bullet matches the existing UV_* comment style. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(docker): make apk-add retry loop fail loudly on exhaustion Greptile flagged that the retry pattern `apk add ... && break || sleep 5` exits 0 when all three attempts fail, because `sleep 5` is the last executed command. A persistent apk.cgr.dev outage would produce a silently "successful" RUN layer with no packages installed, followed by cryptic "command not found" errors in downstream RUN steps. Fix: explicitly fail on the third miss before sleeping. Same pattern in all six retry loops (3 files × builder + runtime). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Yassin Kortam <yassinkortam@Yassins-MBP.localdomain> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
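The fail-loudly contract of the fixed retry loop can be sketched outside of shell. The Dockerfiles use a `for i in 1 2 3` shell loop; the Python below only mirrors its control flow (try, stop on success, raise on the final miss instead of sleeping) and is not the code that ships. `run_with_retries` and its parameters are hypothetical names.

```python
import time
from typing import Callable


def run_with_retries(
    step: Callable[[], bool],
    attempts: int = 3,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `step` until it returns True; return the attempt number that succeeded.

    Raises RuntimeError when every attempt fails, so exhaustion can never be
    mistaken for success (the bug in the `... && break || sleep 5` form, where
    the trailing sleep's exit status of 0 became the loop's result).
    """
    for attempt in range(1, attempts + 1):
        if step():
            return attempt
        if attempt == attempts:
            # Fail explicitly on the last miss, before any further sleep.
            raise RuntimeError(f"step failed after {attempts} attempts")
        sleep(delay_seconds)
    raise ValueError("attempts must be at least 1")


# Usage: a step that fails twice, then succeeds (sleep stubbed out for speed).
outcomes = iter([False, False, True])
print(run_with_retries(lambda: next(outcomes), sleep=lambda _: None))
```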
…28908) Adds a rule to CLAUDE.md and AGENTS.md instructing AI coding agents to pause and ask the user before introducing any third-party organization name that does not already appear in the repository. Names already established in the codebase (existing LLM providers, etc.) remain fine to use without prompting.
* refactor(ui): extract auth state into AuthContext
Move auth state (token, userID, userRole, accessToken, premiumUser, userEmail,
disabledPersonalKeyCreation, showSSOBanner) out of src/app/page.tsx into a
new AuthProvider at src/contexts/AuthContext.tsx. Wrapped at the root layout
so login/onboarding/dashboard routes all have access via useAuth().
Day 1 foundation for the App Router migration: migrated (dashboard)/X/page.tsx
route entry points won't have a parent passing props, so shared auth state
must live in a context they can read from.
Sub-components are unchanged — they still receive accessToken/userID/userRole
as props from page.tsx (which now reads them from useAuth()). Only the
page.tsx → top-level-page-component handoff is de-drilled; deeper prop
drilling is left for the per-page migration to address.
Net change: -86 lines from page.tsx (state + two effects moved), +5 in
layout.tsx (provider wrap), new AuthContext.tsx (~140 lines), test update
to wrap CreateKeyPage in AuthProvider.
Fixes LIT-3366
Part of LIT-3128
* fix(ui): await getUiConfig before clearing authLoading
The AuthContext refactor flipped authLoading to false synchronously on mount
while letting getUiConfig() run fire-and-forget. On SERVER_ROOT_PATH deployments
this races the unauthenticated login-redirect effect: the redirect fires with
proxyBaseUrl still at its module-init value, sending users to /ui/login instead
of {SERVER_ROOT_PATH}/ui/login.
Restores the original sequencing inside AuthProvider's mount effect and adds a
Playwright spec wired into the existing SERVER_ROOT_PATH workflow matrix. The
spec delays the config endpoint via page.route() to make the race deterministic
across CI runners.
* fix(mcp): resolve team.access_group_ids → MCP servers
A virtual key whose team has an MCP-granting access group attached via
/v1/access_group now sees that server through /v1/mcp/server (and can call
tools on it) instead of getting an empty list. The runtime already resolves
the key's unified access_group_ids; this adds the symmetric resolution on the
team side, mirroring the model-side pattern in can_team_access_model: the
group being on the team is itself the gate, so no assigned_team_ids re-check
is needed.
Resolves #27657
* chore(mcp): address greptile review on team access-group resolver
Forward already-imported prisma_client / user_api_key_cache / proxy_logging_obj
to _get_mcp_server_ids_from_access_groups so it skips its lazy re-import path.
Update test docstring + assertions to reflect that the resolver is invoked
with [] (and short-circuits without DB access) rather than skipped entirely.
* test(ui): e2e cover team model edit + admin identity in navbar
Adds two Playwright tests as part of the manual-QA → e2e migration:
"Edit team model selection" exercises the Settings tab Models multi-select
+ Save Changes flow on a seeded team, and the existing login test now
opens the User dropdown and asserts the role and User ID render — guarding
against regressions where login succeeds but the auth context is empty.
Resolves LIT-3093
* test(ui): restore seeded models in team-edit test so retries don't fail
The 'Edit team model selection' test removed fake-anthropic-claude from
E2E_TEAM_CRUD_ID without restoring it. CI runs with retries: 2 and the seed
script runs once before the suite, so a flake on this test would fail the
retry at the "tag is visible" assertion. Wrap the test in try/finally and
restore the seeded models via /team/update before and after.
* test(e2e): fail loudly if team/update restore call fails
Surfaces the real cause when the master key is wrong or the proxy is
unreachable, instead of silently leaving the team in a stale state and
failing later on the visibility assertion.
* fix(e2e): match navbar account button by aria-label, not non-existent "User" text
The previous trigger filter (hasText: /^User$/) didn't match the rendered
UserDropdown button — its text is the displayName ("Account" for the
master-key admin, an email for SSO users), never "User". The evaluate
call then timed out after 15s in CI. Use the stable aria-label prefix
the component always emits, and click directly since the dropdown is
configured trigger=["click"] (the synthetic hover was unnecessary).
* test(e2e): cover add-fallback flow in Router Settings as proxy admin
The Router Settings → Fallbacks → Add Fallbacks flow was an uncovered
manual-QA path. This adds a test that opens the modal, picks a primary +
fallback from the seeded mock models, saves, and verifies both render in the
fallback table.
* fix(e2e): make router-fallback test idempotent and pick antd options by text
- Match `.ant-select-item-option` by text instead of `getByTitle(...)`:
  FallbackGroupConfig uses `options=` (not <Select.Option> children), so no
  `title` attribute is emitted and the title-based selector hangs.
- Add before/after hooks that wipe any fallback for fake-openai-gpt-4 via
  /config/update so retries and local reruns don't trip on leftover state.
- Tighten the success assertion to a single tbody row containing BOTH the
  primary and the fallback names, so pre-existing rows can no longer
  vacuously satisfy the check.
- Fix the stale "Three tabs" comment to "Four tabs".
Addresses Greptile P2s on PR #29069.
* fix(e2e): keyboard-select fallback models + correct cleanup endpoint
- Replace mouse-based option clicks with click-to-focus + type + Enter.
  FallbackGroupConfig's Selects use `options=` and a custom
  getPopupContainer, so locating options via `.ant-select-dropdown` hit
  several races: DOM-clicks left antd's popup state stale (the primary popup
  then intercepted the fallback click), `getByRole` matched always-mounted
  hidden options, and pointer stability fought the open animation. Typing
  into the showSearch input narrows the listbox to one option and Enter
  selects it cleanly.
- Assert on dialog-side state changes (the active tab adopts the primary
  model name; the chain helper shows "1/10 used") instead of popup contents,
  since these reflect the actual selection landing.
- Cleanup helper now hits /get/config/callbacks (the real endpoint;
  /get/callbacks returns 404), so the before/after reset actually clears
  prior router_settings.fallbacks state.
* test(e2e): cover Team-BYOK add-model flow as proxy admin
The team-only model + team assignment was an uncovered manual-QA path. This
adds a premium-gated test that toggles Team-BYOK, picks the seeded E2E Team
CRUD, submits, and verifies the model lands in All Models with the team alias
attached.
* test(e2e): apply greptile fixes to Team-BYOK test
- Add the 2s networkidle settle that the sibling addModel tests use.
  networkidle fires before the All Models table finishes re-rendering, so the
  search input was racing with the render.
- Assert on `models-results-count` before inspecting the table body so an
  empty search result fails with a clear "expected results count" message
  instead of timing out on a missing row.
Addresses Greptile P2s on PR #29068.
* test(e2e): harden Team-BYOK test against flake and stale state
- Add before/after cleanup that deletes any Cohere model already scoped to
  e2e-team-crud via /v2/model/info + /model/delete, so Playwright retries and
  local reruns don't accumulate rows.
- Pick the team from the dropdown by role/option name instead of a global
  getByText match, which avoids matching a previously-rendered tag elsewhere
  in the form.
- Scope the "created successfully" assertion to .ant-notification so a stale
  toast from an earlier test in the same browser context can't vacuously
  satisfy it.
- Tighten the All Models assertion: require a single row that contains BOTH
  the cohere model name AND the e2e-team-crud alias, so the team-less
  wildcard from the sibling "Add wildcard route" test can't satisfy the
  check.
…ma Json serialization (#28990)
* fix(containers): record ownership for service-account keys + fix Prisma Json field serialization
- Track containers created implicitly via /v1/responses by extracting
  container IDs from the response output and calling record_container_owner
  for each one, so subsequent file-API calls from the same service account
  pass ownership checks.
- Fix DataError: Prisma Python requires Json fields to be JSON strings;
  serialize file_object with json.dumps() before insert/update in
  LiteLLM_ManagedObjectTable.
- Add collect_container_ids_from_responses_response utility to
  responses/utils.py that walks all output item shapes
  (code_interpreter_call, message annotations).
- Tests: two new cases covering the responses-tracking path and the
  end-to-end record-then-assert flow for service accounts with team scope.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(containers): swallow all exceptions in ownership hook; tighten file_object_json type to str
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(containers): parse file_object JSON string in existing ownership test
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix: container ownership recording bugs
- Remove unreachable _aresponses_websocket from route_type set in
  base_process_llm_request; the WebSocket endpoint never flows through
  base_process_llm_request, so this branch was dead code that gave a false
  impression of coverage.
- Drop the HTTPException re-raise in
  record_container_owners_from_responses_response so per-container failures
  (including HTTP 403/500 from conflicting ownership rows) no longer abort
  the batch and skip recording for the remaining container IDs in the same
  response.
Co-authored-by: Yassin Kortam <yassin@berri.ai>
* fix(containers): record ownership for streaming /v1/responses too
Streaming /v1/responses returns through the select_data_generator branch in
base_process_llm_request and bypasses the non-streaming ownership tail, so
code-interpreter containers created mid-stream were never written to
LiteLLM_ManagedObjectTable. Follow-up file API calls would then 403.
Wrap the SSE generator so container ownership is recorded once the upstream
iterator finishes assembling completed_response. Also covers the
background-polling path, which loops body_iterator end-to-end.
Co-authored-by: Yassin Kortam <yassin@berri.ai>
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
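The two container fixes above (walking the response output for container IDs, and serializing the Json field to a string) can be sketched roughly as follows. This is not the real collect_container_ids_from_responses_response: the function name `collect_container_ids_sketch`, the assumed item shapes (a `container_id` on `code_interpreter_call` items and on message annotations), and the sample payload are all assumptions made for illustration.

```python
import json
from typing import Any, Dict, List


def collect_container_ids_sketch(response: Dict[str, Any]) -> List[str]:
    """Collect unique container IDs from a Responses-style payload (hypothetical shapes)."""
    found: List[str] = []
    for item in response.get("output") or []:
        if not isinstance(item, dict):
            continue
        candidates: List[Any] = []
        if item.get("type") == "code_interpreter_call":
            candidates.append(item.get("container_id"))
        elif item.get("type") == "message":
            for part in item.get("content") or []:
                if not isinstance(part, dict):
                    continue
                for annotation in part.get("annotations") or []:
                    if isinstance(annotation, dict):
                        candidates.append(annotation.get("container_id"))
        for container_id in candidates:
            # Keep first-seen order and drop duplicates and missing values.
            if isinstance(container_id, str) and container_id and container_id not in found:
                found.append(container_id)
    return found


# Made-up payload for illustration only.
sample_response = {
    "output": [
        {"type": "code_interpreter_call", "container_id": "cntr_1"},
        {
            "type": "message",
            "content": [
                {"annotations": [{"container_id": "cntr_1"}, {"container_id": "cntr_2"}]}
            ],
        },
    ]
}
container_ids = collect_container_ids_sketch(sample_response)

# A Json column is written as a JSON string, per the commit message.
file_object_json = json.dumps({"id": container_ids[0], "object": "container"})
print(container_ids, file_object_json)
```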
)
* test(e2e): cover add-MCP-server flow via discovery → custom form
The "Add MCP server" manual-QA step was uncovered. This adds a test that
opens the discovery modal, jumps into the custom-server form, fills name +
Streamable HTTP transport + a placeholder URL + None auth, submits, and
verifies both the success toast and the new row.
* test(e2e): apply greptile fixes to MCP add-server test
- Anchor the auth-type Select via its enclosing Collapse panel
  ("Authentication") instead of the placeholder text. The Form.Item has no
  label prop, so the previous `hasText: /auth type/i` filter was matching via
  the "Select auth type" placeholder copy, which is fragile.
- Document the intentional lack of teardown, matching the pattern used in
  addModel.spec.ts: the e2e runner discards the DB per invocation.
Addresses Greptile P2s on PR #29070.
* test(e2e): scope MCP row assertion to the servers table
Scope the post-create row lookup to `table tbody` so the form modal's
`server_name` input, which still holds the timestamped value during its
close animation, can't satisfy the assertion before the server actually
lands in the list.
* docs(e2e): note MCP coverage scope and link to tracker
This spec only smoke-tests the happy-path Streamable HTTP + None auth flow.
Add a top-of-file comment pointing at E2E_COVERAGE.md so future contributors
can see what's still uncovered (other transports, all auth types,
edit/delete, BYOK, tool list/call, access groups).
…29071)
* test(e2e): cover AI Hub make-public flow and public model_hub_table
Three previously-uncovered manual-QA paths land in one spec:
- Admin opens "Select Models to Make Public", advances through the
  multi-step modal, and verifies the success toast.
- AI Hub tab strip exposes Model Hub / Agent Hub / MCP Hub / Skill Hub. Note
  that the manual-QA "Claude Code Plugin Marketplace" label was renamed to
  Skill Hub; the test pins the current name.
- Anonymous /ui/model_hub_table loads with the master key as `?key=` and
  renders the Model Hub tab. Agent Hub / MCP Hub tabs are conditional on
  public data and are not asserted here.
* test(e2e): harden AI Hub make-public + public hub assertions
Address Greptile review:
- Make-public test now asserts "Select All (N)" with N>=1 before clicking, so
  a missing-seed-data run surfaces immediately instead of timing out on the
  disabled Next button or the success toast.
- Public model_hub_table test dismisses the feedback popup before the tab
  visibility assertion, matching the ordering used by navigateToPage so a
  popup race can't mask the tab mid-evaluation.
* docs(e2e): explain admin vs public AI Hub tab asymmetry
Greptile flagged the all-4-tabs assertion as a potential CI flake, inferring
from the public-page comment that Agent Hub / MCP Hub might be
data-conditional in the admin view too. They aren't: ModelHubTable renders
all four tabs unconditionally for admins. Document the asymmetry inline so
future readers (and future review passes) don't re-derive it.
…onfig (#28898) * feat: support goal mode for claude on bedrock * fix failing lint test * addressing greptile comments * fixing failed test * address greptile: copy output_config and warn on dropped converse format * fix(bedrock): skip redundant output_config normalization on Converse reasoning_effort path When reasoning_effort is mapped via _handle_reasoning_effort_parameter, the resulting output_config is already normalized via normalize_bedrock_opus_output_config_effort. Mark it as normalized so _prepare_request_params can skip the redundant call (and the associated get_model_info lookup) on every request. Co-authored-by: Yassin Kortam <yassin@berri.ai> * test(reasoning-effort-grid): reflect Bedrock opus-4-6 xhigh→max clamping * fix(bedrock): stop leaking output_config marker and message-content mutation * fix(bedrock): guard effort key access in normalize_bedrock_opus_output_config_effort Defensively check that 'effort' is a valid key in _BEDROCK_OUTPUT_CONFIG_EFFORT_ORDER before indexing, to prevent a KeyError if the hardcoded guard tuple ever drifts from the order dict's keys. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(bedrock): drop dead second clause in effort normalization guard The 'effort not in _BEDROCK_OUTPUT_CONFIG_EFFORT_ORDER' check is unreachable once 'effort not in ("xhigh", "max")' has been ruled out, since both literals are present in the order dict. Keep the literal membership check and let the dict lookups below speak for themselves. * fix(bedrock): clamp output_config.effort against ceiling for any known value The early return when effort was not 'xhigh'/'max' meant a ceiling of 'low' or 'medium' would silently forward an out-of-range value. Gate on the known effort ordering instead so the ceiling comparison runs for every recognized effort. 
* test(grid_spec): use _CAPS_OPUS_4_7 for non-Bedrock opus-4-6 entries claude-opus-4-6 now declares supports_xhigh_reasoning_effort in the model map, so production accepts xhigh on Azure AI and Vertex AI routes. Update those grid_spec entries to match production capabilities so expected() predicts 200 for xhigh instead of 400. Co-authored-by: Yassin Kortam <yassin@berri.ai> * test(grid_spec): revert xhigh caps for non-Bedrock opus-4-6 azure_ai/claude-opus-4-6 and vertex_ai/claude-opus-4-6 do not declare supports_xhigh_reasoning_effort in model_prices_and_context_window.json. Azure AI upstream rejects xhigh with HTTP 400 ("Supported levels: high, low, max, medium"). Restore _CAPS_4_6 so the grid predicts 400 for xhigh, matching production capabilities. * fix: stop advertising xhigh effort on Opus 4.5/4.6 Only Opus 4.7 supports the xhigh reasoning effort level. Remove the supports_xhigh_reasoning_effort flag from every Opus 4.5 and Opus 4.6 entry (direct Anthropic, Bedrock, and regional variants) in both model catalog files. On the direct Anthropic path there is no effort clamp, so flagging 4.5/4.6 as xhigh-capable caused litellm to forward xhigh to a model that rejects it (and made get_model_info misreport the capability). xhigh now correctly degrades to high / raises on those models. Bedrock graceful degradation for Claude Code goal mode is unaffected: it relies solely on the bedrock_output_config_effort_ceiling clamp (4.5->high, 4.6->max, 4.7->xhigh), which runs before validation, so xhigh requests to older Bedrock Opus models are still silently lowered rather than rejected. Update effort-gating tests to reflect that 4.5/4.6 no longer accept xhigh. * fix: clamp xhigh effort on Bedrock Invoke /v1/messages instead of rejecting Claude Code "goal mode" sends output_config.effort=xhigh over the Anthropic /v1/messages API, which routes Bedrock models through AmazonAnthropicClaudeMessagesConfig. 
That path validated effort against the model's native capability and raised 400 for xhigh on Opus 4.6, while the chat-completions paths (Converse + Invoke) already clamp xhigh to the model's bedrock_output_config_effort_ceiling. That asymmetry broke goal mode on the exact API surface Claude Code uses. Apply the same ceiling clamp on the messages path before the shared effort gate runs, so xhigh degrades to max on Opus 4.6 (and stays xhigh on 4.7). Scoped to adaptive-thinking models and to models that declare a ceiling, so Sonnet 4.6 (no ceiling) and Opus 4.5 (budget mode) are unaffected and still reject xhigh. * fix(bedrock): preserve user output_config when applying reasoning_effort - Converse path: merge mapped effort into existing output_config via setdefault instead of overwriting it, matching the Anthropic Messages path. Prevents user-supplied output_config.format from being silently dropped when reasoning_effort is also provided. - tests: clear _get_local_model_cost_map lru_cache in the autouse fixture alongside get_bedrock_response_stream_shape to avoid stale cache leakage between tests. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(bedrock): pre-clamp reasoning_effort for chat invoke; correct test caps - Add _clamp_adaptive_reasoning_effort_for_bedrock to AmazonAnthropicClaudeConfig so raw reasoning_effort=xhigh degrades to the model's bedrock effort ceiling before AnthropicConfig.map_openai_params converts it to output_config. Mirrors converse path (_handle_reasoning_effort_parameter) and messages path (_clamp_adaptive_reasoning_effort_for_bedrock) so the three Bedrock paths are consistent. - grid_spec: restore caps=_CAPS_4_6 for Bedrock converse/invoke Opus 4.6 entries so the test reflects the model's actual JSON capabilities. Teach expected() to bypass the xhigh/max cap check when bedrock_effort_ceiling will clamp the wire effort, so the test still passes for Bedrock's graceful degradation contract without lying about native model caps. 
Co-authored-by: Yassin Kortam <yassin@berri.ai> --------- Co-authored-by: Dennis Henry <dennis.henry@okta.com> Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Yassin Kortam <yassin@berri.ai>
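The ceiling clamp described in the commits above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `EFFORT_ORDER`, `clamp_effort`, and `apply_ceiling` are hypothetical names, and the ordering (with `xhigh` ranked above `max`) is an assumption inferred from the commit text, which says a ceiling of `max` lowers `xhigh` to `max` on Opus 4.6.

```python
from typing import Any, Dict, Optional

# Assumed ordering, weakest to strongest. Inferred from the commit text
# ("xhigh degrades to max on Opus 4.6"); not copied from the real order dict.
EFFORT_ORDER: Dict[str, int] = {"low": 0, "medium": 1, "high": 2, "max": 3, "xhigh": 4}


def clamp_effort(effort: str, ceiling: Optional[str]) -> str:
    """Lower `effort` to `ceiling` when it exceeds it; otherwise return it unchanged."""
    # Unknown values and missing ceilings pass through untouched, so a later
    # validation step can still reject them.
    if ceiling is None or effort not in EFFORT_ORDER or ceiling not in EFFORT_ORDER:
        return effort
    # The comparison runs for every recognized effort, not only xhigh/max.
    if EFFORT_ORDER[effort] > EFFORT_ORDER[ceiling]:
        return ceiling
    return effort


def apply_ceiling(output_config: Dict[str, Any], ceiling: Optional[str]) -> Dict[str, Any]:
    """Return a copy of `output_config` with `effort` clamped; other keys are preserved."""
    updated = dict(output_config)  # shallow copy so the caller's dict is not mutated
    if "effort" in updated:
        updated["effort"] = clamp_effort(updated["effort"], ceiling)
    return updated
```

Working on a copy mirrors the "copy output_config" and "preserve user output_config" fixes mentioned above: a user-supplied `format` key survives the clamp.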
…28970) * feat(guardrails): wire apply_guardrail into proxy logging callbacks Route /apply_guardrail through pre/post proxy hooks and LiteLLM success/failure handlers so Langfuse and OTEL integrations receive input/output on guardrail-only requests. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(guardrails): fix Greptile review comments on apply_guardrail logging Co-authored-by: Cursor <cursoragent@cursor.com> * fix(apply_guardrail): preserve original exception and capture modified response - Capture return value from post_call_success_hook so callback-modified responses propagate to the caller. - Wrap success/failure logging calls in defensive try/except so logging infrastructure failures don't replace the user-visible response or mask the original guardrail exception. Co-authored-by: Yassin Kortam <yassin@berri.ai> * Fix mypy * fix(apply_guardrail): isolate failure logging and use post-hook response for logging - Split async_failure_handler and post_call_failure_hook into independent try/except blocks so a callback bug in one does not silently skip the other. - Build response_for_logging inside _emit_guardrail_success_logs after post_call_success_hook runs, so logged data matches the response the caller actually receives when the hook modifies the response. 
Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(apply_guardrail): fix black formatting and update tests for fastapi_request param - Run black on guardrail_endpoints.py to fix CI formatting check - Add _mock_proxy_logging() helper to enterprise guardrail tests to patch proxy-server globals imported at call time - Pass fastapi_request=Mock() in all direct apply_guardrail test calls to match updated function signature Co-authored-by: Cursor <cursoragent@cursor.com> * fix(guardrails): use transformed exception from post_call_failure_hook in apply_guardrail Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(guardrails): isolate sync/async logging handlers in apply_guardrail Separate each logging handler call into its own try/except so a failure in the async handler does not silently skip the sync handler submission (and vice versa). Matches the docstring's defensive intent. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(apply_guardrail): guard transformed_exception with isinstance check Co-authored-by: Cursor <cursoragent@cursor.com> * test(guardrails): mock proxy globals in not_found test and share apply_guardrail logging fixture - Add proxy-server global mocks to test_apply_guardrail_not_found so the failure-path post_call_failure_hook call doesn't touch the real proxy logging singleton. - Extract the duplicated _mock_proxy_logging context manager out of the two enterprise apply_guardrail test files into a shared conftest fixture so the helper stays in one place. * fix(guardrails): use update_messages to keep logging obj in sync Co-authored-by: Yassin Kortam <yassin@berri.ai> --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Yassin Kortam <yassin@berri.ai> Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com>
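The handler isolation these commits describe (each logging call in its own try/except, so one failure does not skip the next) might look like the sketch below. It is synchronous for brevity, whereas the PR deals with both sync and async handlers; `run_isolated` is a hypothetical helper, not a function from the proxy.

```python
import logging
from typing import Callable, Iterable, List

logger = logging.getLogger("guardrail_logging_sketch")


def run_isolated(handlers: Iterable[Callable[[], None]]) -> List[Exception]:
    """Call every handler; a failure in one never prevents the others from running."""
    errors: List[Exception] = []
    for handler in handlers:
        try:
            handler()
        except Exception as exc:  # logging must not replace the caller-visible result
            logger.warning("logging handler %r failed: %s", handler, exc)
            errors.append(exc)
    return errors
```

Collecting the exceptions instead of re-raising keeps the original guardrail response (or the original guardrail exception) as the thing the caller sees.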
* build(deps): bump next from 16.2.4 to 16.2.6 in /ui/litellm-dashboard (#27665) Bumps [next](https://github.com/vercel/next.js) from 16.2.4 to 16.2.6. - [Release notes](https://github.com/vercel/next.js/releases) - [Changelog](https://github.com/vercel/next.js/blob/canary/release.js) - [Commits](vercel/next.js@v16.2.4...v16.2.6) --- updated-dependencies: - dependency-name: next dependency-version: 16.2.6 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * build(deps): bump protobufjs in /tests/pass_through_tests (#28296) Bumps [protobufjs](https://github.com/protobufjs/protobuf.js) from 7.5.6 to 7.6.0. - [Release notes](https://github.com/protobufjs/protobuf.js/releases) - [Changelog](https://github.com/protobufjs/protobuf.js/blob/protobufjs-v7.6.0/CHANGELOG.md) - [Commits](protobufjs/protobuf.js@protobufjs-v7.5.6...protobufjs-v7.6.0) --- updated-dependencies: - dependency-name: protobufjs dependency-version: 7.6.0 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * build(deps): bump ws from 8.20.0 to 8.20.1 in /tests/pass_through_tests (#28303) Bumps [ws](https://github.com/websockets/ws) from 8.20.0 to 8.20.1. - [Release notes](https://github.com/websockets/ws/releases) - [Commits](websockets/ws@8.20.0...8.20.1) --- updated-dependencies: - dependency-name: ws dependency-version: 8.20.1 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* fix(proxy): enforce tag budgets for key-level tags Merge API key metadata.tags into request_data before _tag_max_budget_check so per-tag budgets apply when tags are set on the key at creation time. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(auth): avoid false reject for key-inherited tags Run reject_clientside_metadata_tags before key-tag injection, then inject key metadata tags immediately before tag budget checks so key tags still enforce budgets without being treated as client-supplied tags. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>
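The ordering fix in the second commit (reject client-supplied tags first, inject key tags afterwards) can be illustrated with a small sketch. The function name, the dict shapes, and the `ValueError` are assumptions made for illustration; only the ordering itself comes from the commit message.

```python
from typing import Any, Dict, List


def prepare_tags_for_budget_check(
    request_data: Dict[str, Any],
    key_metadata: Dict[str, Any],
    reject_clientside_tags: bool,
) -> Dict[str, Any]:
    """Return a copy of request_data whose metadata.tags also includes the key's tags."""
    metadata = dict(request_data.get("metadata") or {})
    client_tags: List[str] = list(metadata.get("tags") or [])

    # Step 1: the client-side check sees only what the client actually sent.
    if reject_clientside_tags and client_tags:
        raise ValueError("client-supplied metadata.tags are not allowed for this key")

    # Step 2: inject key-level tags just before the budget check, so they are
    # budgeted without being mistaken for client-supplied tags.
    key_tags: List[str] = list(key_metadata.get("tags") or [])
    metadata["tags"] = client_tags + [t for t in key_tags if t not in client_tags]

    updated = dict(request_data)
    updated["metadata"] = metadata
    return updated
```

Running the two steps in the opposite order is what the commit calls a "false reject": the injected key tags would look like tags sent by the client.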
…video edit (#29098) * fix(vertex-ai): pass litellm_params to validate_environment in video handlers and implement video edit for Veo - Pass litellm_params to validate_environment in 11 video handler call sites (remix, create_character, get_character, edit, extension, delete) so DB-stored Vertex AI credentials are used instead of falling back to ADC - Implement transform_video_edit_request/response for VertexAI: fetches source video via fetchPredictOperation then submits a new predictLongRunning request with the video bytes/gcsUri + edit prompt Co-authored-by: Cursor <cursoragent@cursor.com> * fix(vertex-ai): hoist fetchPredictOperation into handlers to avoid blocking event loop - Add get_video_edit_prefetch_params() to BaseVideoConfig (returns None) - VertexAI overrides it to return the fetchPredictOperation URL/body - Both sync and async video_edit handlers call this and use their shared httpx client for the fetch, passing the result as prefetched_source_data - transform_video_edit_request is now a pure transform with no HTTP calls - Fix extra_body.pop() mutation by working on a shallow copy Co-authored-by: Cursor <cursoragent@cursor.com> * fix(vertex-ai): include prefetch call inside _handle_error try/except block Co-authored-by: Cursor <cursoragent@cursor.com> * fix(videos): add prefetched_source_data param to all transform_video_edit_request overrides Co-authored-by: Cursor <cursoragent@cursor.com> * fix(video_edit): keep transform/pre_call outside try so validation errors propagate Move transform_video_edit_request and logging_obj.pre_call outside the try/except that wraps HTTP calls in (async_)video_edit_handler so that ValueError validation errors (e.g. 'source video not complete yet') are not silently wrapped as 500s by _handle_error. The prefetch HTTP call keeps its own try/except so its errors are still mapped through the provider's error handler. Matches the pattern used by video_extension_handler and video_remix_handler. 
Co-authored-by: Yassin Kortam <yassin@berri.ai> * refactor(vertex_ai): delegate get_video_edit_prefetch_params to status retrieve Co-authored-by: Yassin Kortam <yassin@berri.ai> * Fix varia review * fix(video_edit): route transform errors through _handle_error Wrap transform_video_edit_request and pre_call in the same try/except as the HTTP call in sync and async handlers so validation failures (e.g. source video not complete) return typed LiteLLM exceptions. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Yassin Kortam <yassin@berri.ai>
…st (#28487) * fix(datadog): drain cost-management queue + opt-in FinOps tag allowlist * fix(datadog): guard non-dict callback_specific_params + log empty aggregation * fix(datadog): block user-controlled tags from overwriting reserved cost-attribution dimensions * fix(datadog): cast metadata to dict[str, Any] to satisfy mypy
… and UI (#28712) * feat(helm): split per-component ServiceAccounts for gateway, backend, and UI Replace the single shared serviceAccount with three separate serviceAccounts (gateway, backend, ui) so operators can attach different IRSA / Workload Identity annotations per component without granting data-plane credentials to the UI pod. Key changes: - values.yaml: rename serviceAccount → serviceAccounts with gateway/backend/ui sub-keys; UI defaults to automount: false - _helpers.tpl: replace litellm.serviceAccountName with three component-scoped helpers (litellm.gateway/backend/ui.serviceAccountName) - serviceaccount.yaml: create up to three separate ServiceAccount objects with component labels and per-SA automountServiceAccountToken - gateway/backend deployments: use their respective SA helpers - ui deployment: use litellm.ui.serviceAccountName + explicit automountServiceAccountToken: false on the pod spec so the projected token is absent even when the SA itself allows it - migrations-job: share the backend SA (both need DB write access) Resolves LIT-3171 https://claude.ai/code/session_01QPy362WnjmEpeNuJaPUqmF * fix(helm): enforce automountServiceAccountToken on all pod specs; fix leading --- in serviceaccount.yaml - gateway/backend deployments: add explicit automountServiceAccountToken on the pod spec so serviceAccounts.*.automount is honoured regardless of whether the SA is chart-created or operator-supplied (previously the flag only took effect on the SA object when create: true, creating an asymmetry with the UI which already enforced it at pod-spec level) - serviceaccount.yaml: use a $prev sentinel to emit --- only between documents, preventing a leading --- when gateway SA is skipped but backend or ui SA is created (avoids lint/GitOps warnings from strict YAML parsers and tools like ArgoCD) https://claude.ai/code/session_01QPy362WnjmEpeNuJaPUqmF --------- Co-authored-by: Claude <noreply@anthropic.com>
* fix(deps): bump vulnerable proxy dependencies (starlette/fastapi, granian, pyarrow, semantic-router) Resolve known CVEs flagged by osv-scanner/grype against uv.lock. All bumped versions verified to resolve, install, and pass the proxy auth/route/middleware unit suites (717 tests) plus an import smoke on the new stack. - starlette 0.50.0 -> 1.1.0 (CVE-2026-48710 "BadHost", GHSA-86qp-5c8j-p5mr): versions <1.0.1 reconstruct request.url from the unvalidated Host header, poisoning request.url.path. Required raising fastapi 0.124.4 -> 0.136.3, which dropped fastapi's starlette<0.51.0 cap; an explicit starlette>=1.0.1 floor blocks regression to a vulnerable transitive resolution. The proxy's own auth already reads scope["path"] via get_request_route, but the locked starlette still flagged in container scanners and left other request.url consumers exposed. - granian 2.5.7 -> 2.7.4 (CVE-2026-42544, unauthenticated DoS via WebSocket subprotocol header panic; CVE-2026-42545, WSGI response-header-panic DoS). granian is a selectable proxy server (proxy_cli). - pyarrow 22.0.0 -> 23.0.1 (CVE-2026-25087 / PYSEC-2026-113). - semantic-router 0.1.12 -> 0.1.15: 0.1.12 was yanked (CVE-2026-42208 — its unbounded litellm pin could resolve a credential-exfiltrating litellm==1.82.8 wheel). Not fixable by bump: diskcache 5.6.3 (CVE-2025-69872, unsafe pickle deserialization) has no upstream fix and is left pinned; exploiting it requires write access to the local cache directory. Relock side effect: sse-starlette 3.4.2 -> 3.4.4. * deps: relax exact pins in optional extras to compatible ranges The proxy/optional extras exact-pinned every dependency, which (1) forces downstream `pip install litellm[proxy]` consumers into version lockstep and (2) blocks them from pulling transitive security patches without forking — the structural cause behind needing a litellm release to clear the starlette CVE in the previous commit. 
Convert the ordinary extras deps to `>=current,<next_major` ranges, mirroring the core [project].dependencies style. Reproducibility for litellm's own Docker/CI is unaffected: images install via `uv sync --frozen`, and the lock re-resolves to the identical versions (no locked version changed). Kept exact-pinned: - litellm-proxy-extras, litellm-enterprise — litellm's own sub-packages, versioned in lockstep with the release. - opentelemetry-api/sdk/exporter-otlp — must resolve to matching versions. - grpcio — supply-chain-pinned to a vetted, aged release. Also corrects the stale comment claiming the extras are exact-pinned for Docker reproducibility (the images use the lock, not these pins). * fix(ci): resolve license-check lookup version from the floor for ranged deps check_licenses.py derived the PyPI lookup version with `next(iter(req.specifier))`, which returns an arbitrary specifier clause. For a range like `>=0.12.1,<1.0` it picked the upper bound (`1.0`) — a version that doesn't exist on PyPI — so the license lookup 404'd and the package was flagged as having an unknown license. The previous commit's switch from exact pins to ranges exposed this for soundfile, pyroscope-io, redisvl, diskcache, and mlflow (the ranged deps not already in liccheck.ini's allowlist). Prefer a lower-bound/exact version (a real released version) for the lookup. * fix(proxy): set strict_content_type=False on the FastAPI app Starlette 1.0 / FastAPI 0.13x flipped the default to strict_content_type=True, which refuses to parse a JSON request body when the client omits the Content-Type header. The proxy previously accepted those requests, so the fastapi/starlette bump in this PR would silently break clients that don't send a Content-Type. Restore the prior lenient behavior explicitly. Co-authored-by: stuxf <70670632+stuxf@users.noreply.github.com>
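The license-check fix above (pick a lower-bound or exact version for the PyPI lookup rather than an arbitrary specifier clause) can be sketched without third-party packages. The real script works on `req.specifier`; the string parsing and the names `parse_specifier` and `lookup_version` below are stand-ins chosen so the example is self-contained.

```python
import re
from typing import List, Optional, Tuple

# One specifier clause: an operator followed by a version string.
_CLAUSE = re.compile(r"^\s*(===|==|~=|!=|<=|>=|<|>)\s*([^\s,]+)\s*$")


def parse_specifier(spec: str) -> List[Tuple[str, str]]:
    """Split a string such as '>=0.12.1,<1.0' into (operator, version) pairs."""
    clauses: List[Tuple[str, str]] = []
    for part in spec.split(","):
        if not part.strip():
            continue
        match = _CLAUSE.match(part)
        if match is None:
            raise ValueError(f"unrecognized specifier clause: {part!r}")
        clauses.append((match.group(1), match.group(2)))
    return clauses


def lookup_version(spec: str) -> Optional[str]:
    """Prefer an exact pin, then a lower bound; never return an upper bound."""
    clauses = parse_specifier(spec)
    for preferred in ("==", "===", ">=", "~="):
        for operator, version in clauses:
            # Wildcard pins such as '==1.*' do not name a released version.
            if operator == preferred and "*" not in version:
                return version
    return None  # only upper bounds or exclusions: no safe version to look up
```

For the range quoted in the commit, `>=0.12.1,<1.0`, this selects `0.12.1` regardless of clause order, which is a version that was actually released.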
…replay (#29229) The Redis-backed VCR layer was recording and replaying the Google OAuth2/STS token-mint call. The replayed ya29.* access token is long-expired, but its recorded expires_in keeps credentials.expired False, so litellm never refreshes it and sends the stale token to a live Vertex/Gemini endpoint, which returns 401 ACCESS_TOKEN_EXPIRED. This broke live partner-model tests whose completion call is not itself cassette-backed (e.g. test_vertex_ai_llama_tool_calling). Force credential-exchange hosts to pass through live (never recorded, never replayed) by returning None from before_record_request, mirroring the existing telemetry passthrough, so a fresh token is minted each run. Regression from #28826, which added OAuth-token matcher tolerance plus TTL-refresh-on-read so a stale token episode matched and never expired.
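A rough sketch of the passthrough described above, assuming a vcrpy-style `before_record_request` hook where returning `None` means the request is neither recorded nor replayed. The host set and the `FakeRequest` stand-in are illustrative assumptions; the PR's actual host list is not shown in the commit message.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit

# Assumed credential-exchange hosts, listed here only for illustration.
CREDENTIAL_EXCHANGE_HOSTS = frozenset({"oauth2.googleapis.com", "sts.googleapis.com"})


@dataclass
class FakeRequest:
    """Stand-in for the recorder's request object; only `uri` is used here."""

    uri: str


def before_record_request(request: FakeRequest) -> Optional[FakeRequest]:
    """Return None for token-mint calls so they always go to the live endpoint."""
    host = (urlsplit(request.uri).hostname or "").lower()
    if host in CREDENTIAL_EXCHANGE_HOSTS:
        return None  # pass through live: a fresh token is minted on every run
    return request  # everything else stays eligible for record and replay
```

Skipping the token exchange avoids the failure mode in the commit message, where a replayed token with a recorded `expires_in` looks unexpired and is sent to a live endpoint.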
Updates the gollem_go_agent_framework example to the current Go release. Clears stale Go stdlib advisories reported by osv-scanner against the older 1.25.1 directive. No source changes; the single pinned dependency (gollem v0.1.0) is backward compatible.
* bump: version 1.87.0 → 1.88.0 * uv lock
…#29238) * feat(anthropic): add Claude Opus 4.8 and prune reasoning-effort flags Register claude-opus-4-8 across the anthropic/bedrock/vertex/azure cost-map entries, BEDROCK_CONVERSE_MODELS, and the setup-wizard provider list. Prune two reasoning-effort fields from the cost map: - Drop supports_minimal_reasoning_effort from the Claude fleet (58 entries). "minimal" is not a real Anthropic effort level (the API accepts only low/medium/high/xhigh/max), so LiteLLM degrades it to "low" regardless; the flag was inert and misleading on Anthropic. - Remove tool_use_system_prompt_tokens everywhere (103 entries). It is not in the ModelInfo type and is read by no production code. Update the affected config/schema tests; the reasoning-effort registry tests now assert the Claude fleet omits supports_minimal. * fix(anthropic): recognize output_config effort after minimal-flag prune Pruning supports_minimal_reasoning_effort from the Claude fleet removed the only "supports effort param" marker from 11 Opus 4.5 / mythos-preview map entries that lack supports_output_config. _model_supports_effort_param then returned False for them, so output_config was wrongly dropped under drop_params=True -- regressing test_anthropic_model_supports_effort_param_recognizes_supporting_models for claude-opus-4-5-20251101 and the mythos preview. - _model_supports_effort_param now treats supports_output_config as a sufficient signal, matching the bedrock-invoke call sites that already check supports_output_config OR a reasoning-effort flag. Shared map lookup extracted into _supports_model_capability. - Add supports_output_config: true to the 11 Opus 4.5 / mythos entries that lost their only marker, restoring prior effort-forwarding behavior without re-adding the inert minimal flag.
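The capability check described in the second commit can be sketched as below. The function names echo `_model_supports_effort_param` and `_supports_model_capability` from the commit message, but the bodies, the toy model map, and the contents of `REASONING_EFFORT_FLAGS` are assumptions; the full set of reasoning-effort flags is not listed in the source.

```python
from typing import Any, Dict, Mapping, Tuple

# Only the flag named in this PR's text; the real list may be longer (assumption).
REASONING_EFFORT_FLAGS: Tuple[str, ...] = ("supports_xhigh_reasoning_effort",)


def supports_model_capability(
    model_map: Mapping[str, Dict[str, Any]], model: str, key: str
) -> bool:
    """Shared lookup: True only when the model's map entry sets `key` to a truthy value."""
    return bool(model_map.get(model, {}).get(key))


def model_supports_effort_param(model_map: Mapping[str, Dict[str, Any]], model: str) -> bool:
    """supports_output_config alone is sufficient; otherwise any reasoning-effort flag."""
    if supports_model_capability(model_map, model, "supports_output_config"):
        return True
    return any(
        supports_model_capability(model_map, model, flag) for flag in REASONING_EFFORT_FLAGS
    )
```

With this shape, an entry that carries only `supports_output_config: true` still forwards `output_config` under `drop_params=True`, which is the regression the commit restores.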
chore(ci): promote internal staging to main
Relevant issues
Pre-Submission checklist
Please complete all items before asking a LiteLLM maintainer to review your PR
tests/litellm/ directory: adding at least 1 test is a hard requirement - see details
make test-unit
CI (LiteLLM team)
Branch creation CI run
Link:
CI run for the last commit
Link:
Merge / cherry-pick CI run
Links:
Type
🆕 New Feature
🐛 Bug Fix
🧹 Refactoring
📖 Documentation
🚄 Infrastructure
✅ Test
Changes