Pedro/fix vertex cache leak #4135

Conversation
## Summary

**Scope Vertex cache entries per prompt variant**

- Extract-actions now tags cached static prompts with `"std"` vs `"vc"` and includes that variant in the Vertex cache key (e.g. `extract-action-static-vc-<llm_key>`).
- `SkyvernContext` keeps per-prompt cache entries (`PromptCacheEntry`) that track the static prompt, the active variant, and a model→cache-name map, so multiple prompts/models can safely opt into caching simultaneously.
- `ForgeAgent` clears the extract-actions cache entry after every call (including OTP retries) to avoid leaking the static prompt into later GPT requests.
- Router and single-model handlers look up cached content via the prompt entry and attach it only when the current prompt/model/variant has a cache. Router calls wrap cache attachment/removal inside a lock so the cache flag can't stick around for other prompts.

**Prompt-caching experiments/docs/tests**

- Updated scripts, docs, and tests to use the new prompt cache entry APIs instead of the old `cached_static_prompt`/`vertex_cache_name` flags.
- Router fallback tests now snapshot the model list during the call to confirm only the primary carries `cached_content`.

**LLM handler refactor**

- Added helper functions for logging artifacts and applying the OpenAI context-cache system message so the top-level handlers are easier to follow.

## Testing

- `uv run pytest tests/unit/test_api_handler_factory.py tests/unit/test_llm_router_fallback.py tests/unit/test_api_handler_cached_content_fix.py`

----

> [!IMPORTANT]
> Fix Vertex cache leak by scoping cache entries per prompt variant and ensuring caching is only attached to Gemini models.
>
> - **Behavior**:
>   - Vertex cache entries are now scoped per prompt variant, preventing cache leaks.
>   - `ForgeAgent` clears the extract-actions cache after each call to avoid static prompt leaks.
>   - Caching is only attached to Gemini models, not fallback models.
> - **Context and Caching**:
>   - `SkyvernContext` now includes `vertex_cache_key`, `vertex_cache_variant`, and `prompt_caching_settings`.
>   - `LLMAPIHandlerFactory` updated to handle prompt caching settings and log cache hits.
> - **Testing and Documentation**:
>   - Tests updated to verify caching behavior for Gemini and non-Gemini models.
>   - Documentation updated to reflect the new caching API and behavior.
>
> <sup>This description was created by Ellipsis for c5d152cff5845be06ce0bde2c415f375809608a1.</sup>

----

## Summary by CodeRabbit

- **New Features**
  - Per-prompt cache-entry API with variant-aware prompt caching and safe attach/detach around LLM/router calls.
- **Improvements**
  - Context now carries richer cache metadata and an explicit prompt-caching flag; centralized logging, cache-hit reporting, and per-variant locking to reduce races.
- **Behavior**
  - Removed automatic runtime enablement of prompt caching; caching decisions are resolved via centralized settings/experiments.
- **Tests & Docs**
  - Expanded tests and documentation for cold/warm cache flows, cached-token reporting, routing behavior, and variant-aware caching.
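The per-prompt cache entry described in the summary can be pictured as a small record keyed by prompt name. A minimal sketch, with field names and the usage below treated as illustrative rather than the actual `PromptCacheEntry` definition:

```python
from dataclasses import dataclass, field


@dataclass
class PromptCacheEntry:
    """Illustrative per-prompt cache record: static prompt, active variant, model -> cache name."""

    static_prompt: str
    active_variant: str = "std"  # e.g. "std" or "vc"
    cache_name_by_model: dict[str, str] = field(default_factory=dict)

    def cache_name_for(self, model: str) -> str | None:
        # A handler only attaches cached content when the current model has a cache for this entry.
        return self.cache_name_by_model.get(model)


# Hypothetical usage: the context would hold one entry per prompt name.
entries: dict[str, PromptCacheEntry] = {}
entries["extract-actions"] = PromptCacheEntry(static_prompt="<static extract-actions prompt>", active_variant="vc")
entries["extract-actions"].cache_name_by_model["gemini-2.5-pro"] = "projects/example/cachedContents/abc123"
```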
**Walkthrough**

Adds a prompt cache-variant system and prompt-caching settings resolution, propagates cache metadata through `SkyvernContext`, and refactors LLM API handler logging and conditional Vertex AI cache attachment for extract-action prompts.
**Sequence Diagram**

```mermaid
sequenceDiagram
    participant Agent as Agent (ForgeAgent)
    participant Context as SkyvernContext
    participant Cache as Vertex AI Cache
    participant Handler as LLM API Handler
    participant LLM as LLM (e.g., Gemini)

    Agent->>Agent: compute cache_variant (verification_code, magic_link, criterion)
    Agent->>Agent: resolve prompt_caching_settings (async, experiment)
    Agent->>Cache: create_vertex_cache_for_task(prompt, prompt_variant)
    Cache-->>Context: store cache metadata (vertex_cache_key, name, vertex_cache_variant)
    Agent->>Context: propagate variant & settings
    Handler->>Context: read cache metadata & prompt_caching_settings
    alt attach cache (prompt eligible && settings allow && model group)
        Handler->>LLM: send request with cached_content reference (cache_name/key/variant)
        LLM-->>Handler: response + cache_hit info
        Handler->>Handler: log_vertex_cache_hit_if_needed(cached_tokens)
    else no cache attachment / fallback
        Handler->>LLM: send request without cache reference
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20–30 minutes
> [!IMPORTANT]
> Looks good to me! 👍

Reviewed d0b8db0 in 1 minute and 6 seconds.

- Reviewed 598 lines of code in 4 files
- Skipped 0 files when reviewing
- Skipped posting 4 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. `skyvern/forge/sdk/trace/experiment_utils.py:89`
   - Draft comment: Consider logging the exception details when fetching experiment flag values instead of silently skipping errors. This would aid debugging if a flag retrieval fails.
   - Reason this comment was not posted: Comment was not on a location in the diff, so it can't be submitted as a review comment.
2. `skyvern/forge/sdk/core/skyvern_context.py:82`
   - Draft comment: Simplify the `pop_totp_code` method by using `pop` with a default value (e.g. `self.totp_codes.pop(task_id, None)`) to avoid an explicit if-check.
   - Reason this comment was not posted: Comment was not on a location in the diff, so it can't be submitted as a review comment.
3. `skyvern/forge/sdk/api/llm/api_handler_factory.py:591`
   - Draft comment: Consider caching `response.usage` in a local variable to simplify repeated calls to `getattr` for token extraction. This would improve readability and possibly performance.
   - Reason this comment was not posted: Comment was not on a location in the diff, so it can't be submitted as a review comment.
4. `skyvern/forge/sdk/api/llm/api_handler_factory.py:319`
   - Draft comment: The asynchronous function `llm_api_handler_with_router_and_fallback` is quite long and deeply nested. Consider refactoring it into smaller helper functions for improved readability and maintainability.
   - Reason this comment was not posted: Comment was not on a location in the diff, so it can't be submitted as a review comment.

Workflow ID: wflow_iCe4PNswaqdlDyXI
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
> [!IMPORTANT]
> Looks good to me! 👍

Reviewed everything up to d0b8db0 in 1 minute and 55 seconds.

- Reviewed 598 lines of code in 4 files
- Skipped 0 files when reviewing
- Skipped posting 6 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. `skyvern/forge/sdk/api/llm/api_handler_factory.py:90`
   - Draft comment: The API handler functions (e.g., `_dispatch_llm_call` and `get_call_stats`) are quite long and contain repetitive logic (e.g., multiple re-initializations of token counts). Consider refactoring common token extraction and cost-calculation logic into helper functions for better readability and maintainability.
   - Reason this comment was not posted: Comment was on unchanged code.
2. `skyvern/forge/sdk/api/llm/api_handler_factory.py:150`
   - Draft comment: Several broad `except Exception:` blocks are used throughout the code. Consider catching more specific exceptions where possible to improve error clarity and avoid masking unexpected issues.
   - Reason this comment was not posted: Comment was not on a location in the diff, so it can't be submitted as a review comment.
3. `skyvern/forge/sdk/api/llm/api_handler_factory.py:415`
   - Draft comment: When injecting the context-caching system message for OpenAI models, verify that no sensitive content is logged or stored in artifacts. A security review of logged prompt data may be beneficial.
   - Reason this comment was not posted: Comment was not on a location in the diff, so it can't be submitted as a review comment.
4. `skyvern/forge/sdk/api/llm/api_handler_factory.py:870`
   - Draft comment: The LLM request and response logging (LLM_REQUEST, LLM_RESPONSE, etc.) is comprehensive. Ensure that data included in these artifacts does not contain sensitive or PII data before sharing externally.
   - Reason this comment was not posted: Comment was not on a location in the diff, so it can't be submitted as a review comment.
5. `skyvern/forge/sdk/core/skyvern_context.py:82`
   - Draft comment: The `SkyvernContext` using `ContextVar` is implemented clearly and covers all necessary fields, including caching and flag settings. No issues were found here.
   - Reason this comment was not posted: Confidence changes required: 0% (threshold: 50%).
6. `skyvern/forge/sdk/trace/experiment_utils.py:63`
   - Draft comment: The experiment metadata collection functions are robust, with nested try/except handling per flag. Consider logging at a debug or info level when experiment data is successfully collected, to aid in monitoring.
   - Reason this comment was not posted: Comment was not on a location in the diff, so it can't be submitted as a review comment.

Workflow ID: wflow_N9UULiW0ZUyVIh40
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 0
🧹 Nitpick comments (3)
skyvern/forge/agent.py (2)
**3-3: SHA1-based cache variant is non-cryptographic; consider clarifying or silencing the lint**

Using `hashlib.sha1` here is only to derive a short, stable identifier for `complete_criterion`, not for any security boundary, so the S324 warning isn't a practical risk. If you want to keep linters quiet, you could either:

- switch to a modern fast hash like `hashlib.blake2s()` with the same 6-char prefix, or
- keep `sha1` and add an inline comment / `# noqa: S324` explaining the non-security use.

Functionally this looks fine and gives compact, composable variants like `vc-ml-ccxxxxxx`.

Also applies to: 2471-2491
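A minimal sketch of the two options above, assuming the variant suffix is derived from a `complete_criterion` string (the helper names and 6-character prefix follow the review comment, not necessarily the actual code):

```python
import hashlib


def criterion_suffix_sha1(complete_criterion: str) -> str:
    # Current approach per the review: non-cryptographic use of SHA1, silenced for the linter.
    digest = hashlib.sha1(complete_criterion.encode("utf-8")).hexdigest()  # noqa: S324
    return f"cc{digest[:6]}"


def criterion_suffix_blake2s(complete_criterion: str) -> str:
    # Alternative: blake2s avoids the S324 warning and is fast, with the same 6-char prefix.
    digest = hashlib.blake2s(complete_criterion.encode("utf-8")).hexdigest()
    return f"cc{digest[:6]}"


# Both produce compact, composable variant suffixes like "cc3f9a1b".
print(criterion_suffix_sha1("order is confirmed"), criterion_suffix_blake2s("order is confirmed"))
```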
**2683-2707: Prompt caching settings + cache_variant wiring are consistent; minor duplication around the magic-link flag**

- Pulling `prompt_caching_settings` from `_get_prompt_caching_settings(context)` and checking both `"extract-actions"` and `"extract-action"` makes the experiment/override behavior explicit while keeping the feature easily extensible.
- The new `cache_variant` derived from `verification_code_check`, `has_magic_link_page`, and the normalized `complete_criterion` feeds correctly into both Vertex cache creation and logging, giving clear separation between OTP/magic-link/criterion variants.
- Caching remains tightly scoped: you only call `_create_vertex_cache_for_task` when `effective_llm_key` is a Gemini key and caching is enabled, which matches the LLM handler side.

Small optional cleanup: you call `context.has_magic_link_page(task.task_id)` twice (once in `prompt_kwargs`, once for `_build_extract_action_cache_variant`), and `has_magic_link_page` can mutate internal state when a page is closed. Computing it once into a local `has_magic_link` and reusing it would avoid any subtle divergence and a redundant context lookup.

Also applies to: 2736-2742, 2759-2767, 2770-2776
skyvern/forge/sdk/api/llm/api_handler_factory.py (1)
**1-1: Helper functions centralize artifact and Vertex cache-hit logging**

- `_log_hashed_href_map_artifacts_if_needed` cleanly de-duplicates the inline blocks, and the guards (`context`, `step`, non-speculative) look right to avoid noise from speculative steps.
- `_log_vertex_cache_hit_if_needed` correctly scopes cache-hit logging to extract-actions prompts with a populated `vertex_cache_name`, and carries through `cache_key` and `cache_variant` for debugging.

One improvement worth considering: the router-based handler also computes `cached_tokens`, but doesn't currently call `_log_vertex_cache_hit_if_needed`, so Vertex cache hits via router primaries won't show up in this log. Calling it there too (using `model_used` as the identifier) would make cache observability consistent across router and direct paths.

Also applies to: 33-34, 42-44, 67-103
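A minimal sketch of the kind of guard this helper applies, using illustrative names (a stdlib logger and a `context` carrying `vertex_cache_name`/`vertex_cache_key`/`vertex_cache_variant`); the real signature may differ:

```python
import logging
from typing import Any, Optional

LOG = logging.getLogger(__name__)
EXTRACT_ACTION_PROMPT_NAME = "extract-actions"


def log_vertex_cache_hit_if_needed(
    context: Optional[Any],
    prompt_name: Optional[str],
    model_name: str,
    cached_tokens: int,
) -> None:
    """Log a Vertex cache hit only when it is meaningful for the extract-actions prompt."""
    if cached_tokens <= 0:
        return  # no cache hit to report
    if prompt_name != EXTRACT_ACTION_PROMPT_NAME:
        return  # cache hits are only expected for extract-actions
    if context is None or not getattr(context, "vertex_cache_name", None):
        return  # no Vertex cache was created for this context
    LOG.info(
        "Vertex cache hit: model=%s cached_tokens=%d cache_name=%s cache_key=%s cache_variant=%s",
        model_name,
        cached_tokens,
        context.vertex_cache_name,
        getattr(context, "vertex_cache_key", None),
        getattr(context, "vertex_cache_variant", None),
    )
```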
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- `skyvern/forge/agent.py` (9 hunks)
- `skyvern/forge/sdk/api/llm/api_handler_factory.py` (14 hunks)
- `skyvern/forge/sdk/core/skyvern_context.py` (1 hunks)
- `skyvern/forge/sdk/trace/experiment_utils.py` (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
`**/*.py`: Use `ruff check` and `ruff format` for linting and formatting Python code
Use `mypy` for type checking Python code
Use type hints in Python code
Maintain line length of 120 characters for Python code

`**/*.py`: Use Python 3.11+ features and type hints in Python code
Follow PEP 8 with a line length of 100 characters in Python code
Use absolute imports for all modules in Python code
Document all public functions and classes with Google-style docstrings in Python code
Use `snake_case` for variables and functions, `PascalCase` for classes in Python code
Prefer async/await over callbacks for asynchronous programming in Python
Use `asyncio` for concurrency in Python asynchronous code
Always handle exceptions in async code in Python
Use context managers for resource cleanup in Python
Use specific exception classes rather than generic exceptions in error handling
Include meaningful error messages in exceptions
Log errors with appropriate severity levels in Python
Never expose sensitive information in error messages
Optimize database queries in Python for performance
Use appropriate data structures in Python for performance optimization
Implement caching where beneficial in Python code
Validate all inputs in Python code
Use environment variables for configuration instead of hardcoding values

Files:

- `skyvern/forge/sdk/trace/experiment_utils.py`
- `skyvern/forge/sdk/core/skyvern_context.py`
- `skyvern/forge/sdk/api/llm/api_handler_factory.py`
- `skyvern/forge/agent.py`
skyvern/forge/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
FastAPI-based REST API and WebSocket support should be located in the `skyvern/forge/` directory

Files:

- `skyvern/forge/sdk/trace/experiment_utils.py`
- `skyvern/forge/sdk/core/skyvern_context.py`
- `skyvern/forge/sdk/api/llm/api_handler_factory.py`
- `skyvern/forge/agent.py`
**/*.{py,ts,tsx,js,jsx}
📄 CodeRabbit inference engine (CLAUDE.md)
Use async/await patterns for asynchronous operations
Files:
- `skyvern/forge/sdk/trace/experiment_utils.py`
- `skyvern/forge/sdk/core/skyvern_context.py`
- `skyvern/forge/sdk/api/llm/api_handler_factory.py`
- `skyvern/forge/agent.py`
🧠 Learnings (2)
📚 Learning: 2025-11-24T17:46:29.802Z
Learnt from: CR
Repo: Skyvern-AI/skyvern PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-24T17:46:29.802Z
Learning: Configure LLM models via environment variables or `skyvern init llm`, supporting OpenAI, Anthropic, Azure OpenAI, AWS Bedrock, Gemini, and Ollama
Applied to files:
skyvern/forge/sdk/api/llm/api_handler_factory.py
📚 Learning: 2025-11-24T17:46:29.802Z
Learnt from: CR
Repo: Skyvern-AI/skyvern PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-24T17:46:29.802Z
Learning: Use `LLM_KEY` environment variable to specify which LLM model to use
Applied to files:
skyvern/forge/sdk/api/llm/api_handler_factory.py
🪛 Ruff (0.14.6)
skyvern/forge/sdk/api/llm/api_handler_factory.py
461-461: Avoid specifying long messages outside the exception class
(TRY003)
464-464: Avoid specifying long messages outside the exception class
(TRY003)
509-509: Do not catch blind exception: Exception
(BLE001)
skyvern/forge/agent.py
2489-2489: Probable use of insecure hash functions in hashlib: sha1
(S324)
2831-2831: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Optimize new Python code in this PR
- GitHub Check: Run tests and pre-commit hooks
🔇 Additional comments (8)
skyvern/forge/sdk/trace/experiment_utils.py (1)
**63-71: Experiment flag rename aligns with the new prompt-caching helper**

Switching to `"PROMPT_CACHING_OPTIMIZATION"` here matches the flag used in `_get_prompt_caching_settings`, so tracing will correctly capture the experiment state without changing behavior otherwise.

skyvern/forge/sdk/core/skyvern_context.py (1)
**35-43: New cache-related context fields are correctly typed and scoped**

`vertex_cache_key`, `vertex_cache_variant`, and `prompt_caching_settings` are optional, context-local, and line up with how ForgeAgent and the LLM handlers read/write them; no correctness issues here.
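A minimal sketch of how such optional, context-local fields might look, using the field names from the review (the actual `SkyvernContext` has many more fields and is managed through a `ContextVar`):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SkyvernContextSketch:
    """Illustrative subset of the cache-related fields; not the real SkyvernContext definition."""

    # Vertex AI cached-content resource name created for this context, if any.
    vertex_cache_name: Optional[str] = None
    # Logical cache key, e.g. "extract-action-static-vc-<llm_key>".
    vertex_cache_key: Optional[str] = None
    # Prompt variant the cache was built for, e.g. "std" or "vc".
    vertex_cache_variant: Optional[str] = None
    # Per-prompt caching decisions resolved from overrides or experiments (None until resolved).
    prompt_caching_settings: Optional[dict[str, bool]] = None
```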
skyvern/forge/agent.py (2)

**2494-2500: Vertex cache key/variant handling and context metadata wiring look correct**

- `cache_variant = prompt_variant or "std"` and `variant_suffix = f"-{cache_variant}"` give keys like `extract-action-static-vc-<llm_key>`, preserving old behavior for the default and specializing when a variant is provided.
- Cache creation is still safely guarded on `resolved_llm_key` and uses `asyncio.to_thread` to keep the event loop responsive.
- Storing `vertex_cache_name`, `vertex_cache_key`, and `vertex_cache_variant` on the context matches how the LLM handlers log and decide when to attach Vertex cache references.

No functional issues spotted here.

Also applies to: 2510-2512, 2522-2523, 2533-2536, 2599-2603, 2604-2611
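A small sketch of the key construction described above; the function name is hypothetical, but the key shape (`extract-action-static-<variant>-<llm_key>`) follows the comment:

```python
def build_vertex_cache_key(llm_key: str, prompt_variant: str | None = None) -> str:
    # With no variant supplied, "std" is used; whether the default keeps the exact legacy
    # key shape is an assumption of this sketch.
    cache_variant = prompt_variant or "std"
    variant_suffix = f"-{cache_variant}"
    return f"extract-action-static{variant_suffix}-{llm_key}"


# Examples: the default key and the "vc" variant get distinct cache entries.
print(build_vertex_cache_key("GEMINI_2_5_PRO"))        # extract-action-static-std-GEMINI_2_5_PRO
print(build_vertex_cache_key("GEMINI_2_5_PRO", "vc"))  # extract-action-static-vc-GEMINI_2_5_PRO
```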
**2804-2851: `_get_prompt_caching_settings` correctly prioritizes overrides and caches the experiment result per context**

- Global overrides via `LLMAPIHandlerFactory._prompt_caching_settings` win, which is what you want for scripts/tests.
- Otherwise, the method computes a `distinct_id` and `organization_id`, initializes `context.prompt_caching_settings` to `{}`, and calls the `"PROMPT_CACHING_OPTIMIZATION"` experiment at most once per context, memoizing the result.
- When enabled, you set both `EXTRACT_ACTION_PROMPT_NAME` and `EXTRACT_ACTION_TEMPLATE` to `True`, which aligns with how `_build_extract_action_prompt` checks `cache_enabled`.

The broad `except Exception as exc` around the experiment call is defensible here, since a feature-flag outage shouldn't disrupt core prompting; the conservative fallback to `{}` is appropriate.
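A rough sketch of the resolution order described above (override first, then a memoized experiment lookup). The override store, experiment client, and `is_feature_enabled` call are hypothetical stand-ins, not the actual API:

```python
from typing import Any, Optional

EXTRACT_ACTION_PROMPT_NAME = "extract-actions"
EXTRACT_ACTION_TEMPLATE = "extract-action"


async def get_prompt_caching_settings(
    context: Any,
    override_settings: Optional[dict[str, bool]],
    experiment_client: Any,
) -> dict[str, bool]:
    # 1. Explicit overrides (e.g. set by scripts/tests) always win.
    if override_settings is not None:
        return override_settings

    # 2. Memoize per context so the experiment is evaluated at most once per run.
    if context.prompt_caching_settings is not None:
        return context.prompt_caching_settings

    settings: dict[str, bool] = {}
    context.prompt_caching_settings = settings
    try:
        # distinct_id derivation is illustrative; the real code also resolves organization_id.
        enabled = await experiment_client.is_feature_enabled(
            "PROMPT_CACHING_OPTIMIZATION", distinct_id=getattr(context, "task_id", None)
        )
        if enabled:
            settings[EXTRACT_ACTION_PROMPT_NAME] = True
            settings[EXTRACT_ACTION_TEMPLATE] = True
    except Exception:
        # A feature-flag outage should not disrupt prompting; fall back to "no caching".
        pass
    return settings
```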
skyvern/forge/sdk/api/llm/api_handler_factory.py (4)

**277-289: Override handler now correctly bypasses experimentation for explicit model choices**

Letting `get_override_llm_api_handler` delegate to `get_llm_api_handler(override_llm_key)` ensures explicit overrides use the exact configured model/router, and the fallback to `default` on failure is well-logged and preserves behavior.

**319-362: Router handler's Vertex-cache path and speculative metadata wiring look robust**

- The new `_log_hashed_href_map_artifacts_if_needed` call runs once per LLM invocation and is gated on `step`/`is_speculative_step`, matching the direct-handler and LLMCaller behavior.
- `_log_llm_request_artifact` centralizes LLM_REQUEST artifact creation (with a `vertex_cache_attached` flag) and is only invoked for non-speculative steps, while still returning `llm_request_json` for speculative metadata.
- `should_attach_vertex_cache` is nicely constrained: it requires a `vertex_cache_name`, `prompt_name == "extract-actions"`, `context.use_prompt_caching`, a Gemini `main_model_group`, and a valid primary model dict.
- `_call_primary_with_vertex_cache` correctly:
  - deep-copies `litellm_params`,
  - overlays the generic `parameters`,
  - injects `cached_content`,
  - and logs cache metadata including `vertex_cache_key` and `cache_variant` from the context.
- If the primary-with-cache call fails (other than `CancelledError`), you log and cleanly fall back to the router path.
- `model_used` and `llm_request_json` are set in both primary and fallback paths and then fed into metrics + `SpeculativeLLMMetadata`, so speculative runs capture the full picture.

Token accounting, cost calculation, and final metrics logging are unchanged structurally; the only addition is the extra `model_used` detail, which is useful for debugging primary vs fallback behavior.

Also applies to: 363-373, 388-405, 441-452, 453-487, 488-497, 498-531, 564-608, 661-676, 678-695
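A minimal sketch of the gating predicate described above, using illustrative parameter types (the real function operates on the router's model-group structures):

```python
from typing import Any, Optional

EXTRACT_ACTION_PROMPT_NAME = "extract-actions"


def should_attach_vertex_cache(
    context: Optional[Any],
    prompt_name: Optional[str],
    main_model_group: str,
    primary_model: Optional[dict[str, Any]],
) -> bool:
    """Attach the Vertex cache only when every condition from the review holds."""
    return bool(
        context is not None
        and getattr(context, "vertex_cache_name", None)     # a cache was actually created
        and prompt_name == EXTRACT_ACTION_PROMPT_NAME       # only the extract-actions prompt
        and getattr(context, "use_prompt_caching", False)   # caching enabled for this context
        and "gemini" in main_model_group.lower()            # Vertex caching is Gemini-only
        and primary_model is not None                       # a resolvable primary model dict
    )
```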
**700-705: Direct LLM handler's caching and logging now align with the new prompt caching model**

- The hashed-href artifact logging helper keeps behavior consistent with the router/LLMCaller paths while avoiding duplication.
- Vertex cache attachment is tightly scoped to:
  - `cache_resource_name` present on the context,
  - `prompt_name == "extract-actions"`,
  - `context.use_prompt_caching` set,
  - and a Gemini `model_name`.
- When attaching a cache, logging now includes `cache_name`, `vertex_cache_key`, and `vertex_cache_variant`, which matches the metadata ForgeAgent stores on the context.
- When a stale `cached_content` leaks into `active_parameters` for other prompts/models, you explicitly strip it and log the removal with the same metadata, preventing cache references from persisting across unrelated calls.
- Request artifacts include `vertex_cache_attached` but not the raw `cached_content`, which reduces the risk of leaking provider-specific cache identifiers into artifacts.
- `_log_vertex_cache_hit_if_needed(context, prompt_name, model_name, cached_tokens)` gives you clear, low-noise cache-hit logs tied to both the LLM identifier and the context's cache metadata.

Overall this keeps cache usage and observability well-contained around extract-actions + Gemini.

Also applies to: 756-778, 820-831, 835-849, 851-868, 942-959, 960-961
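A rough sketch of the attach-or-strip behavior described above; `active_parameters` stands in for the request kwargs dict, and the logger calls are illustrative:

```python
import logging
from typing import Any

LOG = logging.getLogger(__name__)


def attach_or_strip_cached_content(
    active_parameters: dict[str, Any],
    cache_name: str | None,
    attach: bool,
) -> None:
    """Attach the Vertex cached-content reference when eligible, otherwise strip any stale one."""
    if attach and cache_name:
        active_parameters["cached_content"] = cache_name
        LOG.info("Attached Vertex cached content", extra={"cache_name": cache_name})
    elif "cached_content" in active_parameters:
        # A previous extract-actions call left a cache reference behind; remove it so
        # unrelated prompts/models are not sent with someone else's cached prompt.
        stale = active_parameters.pop("cached_content")
        LOG.info("Removed stale cached_content from request parameters", extra={"cache_name": stale})
```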
**756-765: Hashed href map artifact logging is now uniformly handled across all LLM entry points**

Using `_log_hashed_href_map_artifacts_if_needed` in both factory-based handlers and `LLMCaller.call` ensures any populated `context.hashed_href_map` is captured once per non-speculative step, regardless of whether the call goes through a router, direct LLM, or LLMCaller. The guards prevent speculative runs from generating extra artifacts.

Also applies to: 1138-1147
Actionable comments posted: 0
🧹 Nitpick comments (6)
skyvern/forge/sdk/api/llm/api_handler_factory.py (6)
**43-43: Unused `CHECK_USER_GOAL_PROMPT_NAMES` constant**

This set isn't referenced anywhere in this file. If it's not used elsewhere yet, consider either wiring it into the relevant logic (e.g., gating behaviors by these prompt names) or dropping it to avoid dead code.
**87-102: Vertex cache hit logging helper is sound but only used for the direct handler**

`_log_vertex_cache_hit_if_needed` correctly restricts logging to:

- `cached_tokens > 0`
- `prompt_name == EXTRACT_ACTION_PROMPT_NAME`
- a `context` with `vertex_cache_name`.

You currently invoke it only in the single-model handler. If you also want cache-hit telemetry for the router + Vertex primary path, consider calling this helper there after computing `cached_tokens` to keep logging behavior consistent across both code paths.
**388-405: LLM request artifact helper for the router path**

The `_log_llm_request_artifact` inner function is a nice consolidation of request artifact logging and speculative metadata capture. Two minor points to consider:

- You intentionally log `parameters` instead of the full `active_params` for privacy, but this means router-specific `litellm_params` (and `cached_content`) aren't visible in artifacts. If you ever need to debug cache behavior solely from artifacts, adding a redacted view of those fields (or just including `cached_content` when non-sensitive) could help.
- `json.dumps(llm_request_payload)` will raise if any value is not JSON-serializable. If you foresee custom/unusual objects in `parameters` (e.g., rich tool schemas), a small try/except around `json.dumps` with a fallback (e.g., stringifying non-serializable entries) would make artifact creation more robust.
**441-531: Router primary-with-cache path and fallbacks: behavior and exception handling**

The new router + Vertex cache flow is generally solid:

- `should_attach_vertex_cache` correctly gates on:
  - `vertex_cache_name` present,
  - `prompt_name == EXTRACT_ACTION_PROMPT_NAME`,
  - `context.use_prompt_caching`,
  - the primary group containing `"gemini"`,
  - a resolvable primary model dict.
- `_call_primary_with_vertex_cache`:
  - deep-copies `litellm_params` and combines them with the user `parameters`,
  - forces `cached_content` for the primary call only,
  - logs cache metadata (name/key/variant) clearly.
- On any non-`CancelledError` from the primary call, you log a warning and transparently fall back to `router.acompletion` without cache, preserving existing behavior.
- `model_used` and `llm_request_json` are always set before use in metrics/speculative metadata.

A few nits:

- Broad `Exception` in the cache primary call: the nested `except Exception as cache_error:` (line 509) is intentionally used to treat any failure as a signal to retry via the router. That's reasonable, but it does trigger BLE001 and slightly weakens observability. If feasible, narrow this to the expected error classes from `litellm.acompletion` (e.g., `litellm.exceptions.APIError`, `ContextWindowExceededError`, etc.), and let truly unexpected exceptions bubble to the outer handler.
- Generic `ValueError` for configuration problems: the two `ValueError` raises for missing primary configuration/`litellm_params` (lines ~461, ~464) are clear, but static analysis (TRY003) prefers dedicated exception types. If these represent programmer/configuration mistakes rather than user input issues, a small custom exception (e.g., `RouterPrimaryConfigError`) would make error handling and log filtering crisper.
- Router request artifact model label: `_call_router_without_cache` logs the `model` field as `llm_key`, while the router is actually invoked with `model=main_model_group`. For debugging, it might be slightly clearer to log `main_model_group` as the model label so artifacts reflect the effective router target.

These are all refinements; the core cache-attach and fallback behavior looks correct.
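A minimal sketch of the dedicated exception type suggested above; the class name comes from the review comment and the message text is illustrative:

```python
class RouterPrimaryConfigError(Exception):
    """Raised when the router's primary model is misconfigured (missing deployment or litellm_params)."""

    def __init__(self, main_model_group: str, detail: str) -> None:
        super().__init__(f"Invalid primary configuration for model group '{main_model_group}': {detail}")
        self.main_model_group = main_model_group


# Hypothetical usage at the spots that currently raise ValueError:
# raise RouterPrimaryConfigError(main_model_group, "no litellm_params found for primary deployment")
```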
**823-853: Vertex cache attach/removal logging in the direct handler**

The updated logs that include `cache_name`, `cache_key`, and `cache_variant` give useful observability into which logical cache entry was attached or removed:

- The attach path is guarded by:
  - `vertex_cache_name` present,
  - `prompt_name == EXTRACT_ACTION_PROMPT_NAME`,
  - `context.use_prompt_caching`,
  - `"gemini" in model_name.lower()`.
- The else path strips any stray `cached_content` and logs that it was removed.

Two small considerations:

- If `vertex_cache_name` or `vertex_cache_key` contains project IDs or other semi-sensitive identifiers, you may want to confirm with your logging policy that emitting them at `info` level is acceptable.
- Given you also have `LLMAPIHandlerFactory._prompt_caching_settings` / `SkyvernContext.prompt_caching_settings`, you might eventually want to gate this attach/remove behavior using those per-prompt settings to avoid relying solely on `prompt_name == EXTRACT_ACTION_PROMPT_NAME`.
**104-109: Prompt caching settings field and setter are currently unused**

`LLMAPIHandlerFactory._prompt_caching_settings` and `set_prompt_caching_settings` are defined but not read anywhere in this module. If you intend to drive cache-attachment behavior off these settings (as suggested by the docstring), consider:

- wiring them into the Vertex cache gating logic (e.g., checking whether the current `prompt_name` is enabled), or
- removing them until you're ready to consume them.

This will avoid confusion about which switch (`context.use_prompt_caching` vs. `_prompt_caching_settings`/`prompt_caching_settings`) is actually authoritative.

Also applies to: 1065-1069
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- `skyvern/forge/sdk/api/llm/api_handler_factory.py` (14 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
skyvern/forge/sdk/api/llm/api_handler_factory.py (4)
- `skyvern/forge/sdk/core/skyvern_context.py` (1)
  - `SkyvernContext` (10-97)
- `skyvern/forge/sdk/artifact/manager.py` (1)
  - `create_llm_artifact` (278-335)
- `skyvern/forge/sdk/artifact/models.py` (1)
  - `ArtifactType` (10-53)
- `skyvern/forge/sdk/api/llm/models.py` (1)
  - `LLMAPIHandler` (90-107)
🪛 Ruff (0.14.6)
skyvern/forge/sdk/api/llm/api_handler_factory.py
461-461: Avoid specifying long messages outside the exception class
(TRY003)
464-464: Avoid specifying long messages outside the exception class
(TRY003)
509-509: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Optimize new Python code in this PR
- GitHub Check: Run tests and pre-commit hooks
🔇 Additional comments (7)
skyvern/forge/sdk/api/llm/api_handler_factory.py (7)
**67-85: Centralized hashed-href artifact logging looks correct**

The `_log_hashed_href_map_artifacts_if_needed` helper correctly:

- guards on `context`, `context.hashed_href_map`, `step`, and `not is_speculative_step`,
- reuses `create_llm_artifact` so routing via `step`/`task_v2`/`thought`/`ai_suggestion` stays consistent.

This removes duplication and avoids logging during speculative calls; no issues from a correctness standpoint.

**277-283: Override handler behavior comment is aligned with the implementation**

The new comment documenting that explicit overrides bypass experimentation and use `get_llm_api_handler(override_llm_key)` matches the implementation. No functional concerns here.

**365-372: Reusing hashed-href logging in the router handler**

Using `_log_hashed_href_map_artifacts_if_needed` here aligns router-based calls with the direct handler and `LLMCaller`, and preserves the "no speculative step" guard. This looks correct and avoids repeated inline conditions.

**700-704: `get_llm_api_handler` signature extension**

Adding `base_parameters` to `get_llm_api_handler` and threading it into `LLMCaller` construction is a clean extension of the API without breaking existing call sites. No issues here.

**761-768: Consistent hashed-href artifact logging in the direct handler**

Calling `_log_hashed_href_map_artifacts_if_needed` here brings the single-model handler in line with the router and `LLMCaller` flows and respects the speculative-step guard. This refactor looks correct and non-breaking.

**961-962: Vertex cache hit logging for the direct handler**

Using `_log_vertex_cache_hit_if_needed(context, prompt_name, model_name, cached_tokens)` right after computing `cached_tokens` ensures you only log meaningful cache hits for the extract-actions prompt. This is a lightweight addition with clear value for cache diagnostics.

**1141-1148: Hashed-href artifact logging in `LLMCaller.call`**

Extending `_log_hashed_href_map_artifacts_if_needed` to `LLMCaller.call` keeps artifact behavior consistent with the other LLM paths and continues to avoid speculative steps. This is a straightforward, correct reuse of the helper.
> [!IMPORTANT]
> Fixes Vertex cache leak by introducing variant-aware caching and refactoring LLM handlers for improved prompt management.
>
> - `agent.py`: `SkyvernContext` tracks cache metadata including `vertex_cache_name`, `vertex_cache_key`, and `vertex_cache_variant`; `ForgeAgent` clears cache entries post-call to prevent prompt leakage.
> - `api_handler_factory.py`: `SkyvernContext` updated to include `prompt_caching_settings`.
>
> <sup>This description was created by Ellipsis for d0b8db0. You can customize this summary. It will automatically update as commits are pushed.</sup>
🔧 This PR fixes a critical Vertex AI cache leak by implementing variant-aware prompt caching that prevents cached static prompts from bleeding into different LLM requests. The changes introduce proper cache scoping per prompt variant (standard, verification code, magic link) and ensure caching is only applied to appropriate Gemini models.
🔍 Detailed Analysis
Key Changes
- `_build_extract_action_cache_variant()`: method that creates unique cache identifiers based on verification code checks, magic link pages, and completion criteria using SHA1 hashing
- `SkyvernContext`: extended with `vertex_cache_key`, `vertex_cache_variant`, and `prompt_caching_settings` fields to maintain richer cache metadata
- `_get_prompt_caching_settings()`: method that resolves caching configuration through explicit overrides or PostHog experiments, with per-context caching
- `_create_vertex_cache_for_task()`: updated to accept prompt variants and generate variant-specific cache keys like `extract-action-static-vc-<llm_key>`

Technical Implementation

```mermaid
sequenceDiagram
    participant FA as ForgeAgent
    participant CC as CacheManager
    participant VC as VertexCache
    participant LLM as LLMHandler

    FA->>FA: _build_extract_action_cache_variant()
    Note over FA: Generate variant ID (std/vc/ml/cc<hash>)
    FA->>CC: _create_vertex_cache_for_task()
    Note over CC: Include variant in cache key
    CC->>VC: Create cache with variant-specific key
    VC-->>CC: Return cache metadata
    CC->>FA: Store cache_key, cache_variant in context
    FA->>LLM: Send request with cached content
    Note over LLM: Only attach cache for matching variant
    FA->>FA: Clear cache entry after call
    Note over FA: Prevent leaks to subsequent requests
```

Impact
Created with Palmier