Pedro/fix vertex cache leak #4135

Conversation
## Summary

**Scope Vertex cache entries per prompt variant**

- Extract-actions now tags cached static prompts with `"std"` vs `"vc"` and includes that variant in the Vertex cache key (e.g. `extract-action-static-vc-<llm_key>`).
- `SkyvernContext` keeps per-prompt cache entries (`PromptCacheEntry`) that track the static prompt, the active variant, and a model→cache-name map, so multiple prompts/models can safely opt into caching simultaneously.
- `ForgeAgent` clears the extract-actions cache entry after every call (including OTP retries) to avoid leaking the static prompt into later GPT requests.
- Router and single-model handlers look up cached content via the prompt entry and attach it only when the current prompt/model/variant has a cache. Router calls wrap cache attachment/removal inside a lock so the cache flag can't stick around for other prompts.

**Prompt-caching experiments/docs/tests**

- Updated scripts, docs, and tests to use the new prompt cache entry APIs instead of the old `cached_static_prompt`/`vertex_cache_name` flags.
- Router fallback tests now snapshot the model list during the call to confirm only the primary carries `cached_content`.

**LLM handler refactor**

- Added helper functions for logging artifacts and applying the OpenAI context-cache system message so the top-level handlers are easier to follow.

## Testing

- `uv run pytest tests/unit/test_api_handler_factory.py tests/unit/test_llm_router_fallback.py tests/unit/test_api_handler_cached_content_fix.py`

----

> [!IMPORTANT]
> Fix Vertex cache leak by scoping cache entries per prompt variant and ensuring caching is only attached to Gemini models.
>
> - **Behavior**:
>   - Vertex cache entries are now scoped per prompt variant, preventing cache leaks.
>   - `ForgeAgent` clears the extract-actions cache after each call to avoid static prompt leaks.
>   - Caching is only attached to Gemini models, not fallback models.
> - **Context and Caching**:
>   - `SkyvernContext` now includes `vertex_cache_key`, `vertex_cache_variant`, and `prompt_caching_settings`.
>   - `LLMAPIHandlerFactory` updated to handle prompt caching settings and log cache hits.
> - **Testing and Documentation**:
>   - Tests updated to verify caching behavior for Gemini and non-Gemini models.
>   - Documentation updated to reflect the new caching API and behavior.
>
> <sup>This description was created by Ellipsis for c5d152cff5845be06ce0bde2c415f375809608a1.</sup>

----

## Summary by CodeRabbit

- **New Features**
  - Per-prompt cache-entry API with variant-aware prompt caching and safe attach/detach around LLM/router calls.
- **Improvements**
  - Context now carries richer cache metadata and an explicit prompt-caching flag; centralized logging, cache-hit reporting, and per-variant locking to reduce races.
- **Behavior**
  - Removed automatic runtime enablement of prompt caching; caching decisions are resolved via centralized settings/experiments.
- **Tests & Docs**
  - Expanded tests and documentation for cold/warm cache flows, cached-token reporting, routing behavior, and variant-aware caching.
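The per-prompt cache entry described in the summary can be pictured as a small record keyed by prompt name. A minimal sketch, with field names and the usage below treated as illustrative rather than the actual `PromptCacheEntry` definition:

```python
from dataclasses import dataclass, field


@dataclass
class PromptCacheEntry:
    """Illustrative per-prompt cache record: static prompt, active variant, model -> cache name."""

    static_prompt: str
    active_variant: str = "std"  # e.g. "std" or "vc"
    cache_name_by_model: dict[str, str] = field(default_factory=dict)

    def cache_name_for(self, model: str) -> str | None:
        # A handler only attaches cached content when the current model has a cache for this entry.
        return self.cache_name_by_model.get(model)


# Hypothetical usage: the context would hold one entry per prompt name.
entries: dict[str, PromptCacheEntry] = {}
entries["extract-actions"] = PromptCacheEntry(static_prompt="<static extract-actions prompt>", active_variant="vc")
entries["extract-actions"].cache_name_by_model["gemini-2.5-pro"] = "projects/example/cachedContents/abc123"
```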
**Walkthrough**

Adds a prompt cache-variant system and prompt-caching settings resolution, propagates cache metadata through `SkyvernContext`, and refactors LLM API handler logging and conditional Vertex AI cache attachment for extract-action prompts.
**Sequence Diagram**

```mermaid
sequenceDiagram
    participant Agent as Agent (ForgeAgent)
    participant Context as SkyvernContext
    participant Cache as Vertex AI Cache
    participant Handler as LLM API Handler
    participant LLM as LLM (e.g., Gemini)

    Agent->>Agent: compute cache_variant (verification_code, magic_link, criterion)
    Agent->>Agent: resolve prompt_caching_settings (async, experiment)
    Agent->>Cache: create_vertex_cache_for_task(prompt, prompt_variant)
    Cache-->>Context: store cache metadata (vertex_cache_key, name, vertex_cache_variant)
    Agent->>Context: propagate variant & settings
    Handler->>Context: read cache metadata & prompt_caching_settings
    alt attach cache (prompt eligible && settings allow && model group)
        Handler->>LLM: send request with cached_content reference (cache_name/key/variant)
        LLM-->>Handler: response + cache_hit info
        Handler->>Handler: log_vertex_cache_hit_if_needed(cached_tokens)
    else no cache attachment / fallback
        Handler->>LLM: send request without cache reference
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20–30 minutes
> [!IMPORTANT]
> Looks good to me! 👍

Reviewed d0b8db0 in 1 minute and 6 seconds.

- Reviewed 598 lines of code in 4 files
- Skipped 0 files when reviewing
- Skipped posting 4 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. `skyvern/forge/sdk/trace/experiment_utils.py:89`
   - Draft comment: Consider logging the exception details when fetching experiment flag values instead of silently skipping errors. This would aid debugging if a flag retrieval fails.
   - Reason this comment was not posted: Comment was not on a location in the diff, so it can't be submitted as a review comment.
2. `skyvern/forge/sdk/core/skyvern_context.py:82`
   - Draft comment: Simplify the `pop_totp_code` method by using `pop` with a default value (e.g. `self.totp_codes.pop(task_id, None)`) to avoid an explicit if-check.
   - Reason this comment was not posted: Comment was not on a location in the diff, so it can't be submitted as a review comment.
3. `skyvern/forge/sdk/api/llm/api_handler_factory.py:591`
   - Draft comment: Consider caching `response.usage` in a local variable to simplify repeated calls to `getattr` for token extraction. This would improve readability and possibly performance.
   - Reason this comment was not posted: Comment was not on a location in the diff, so it can't be submitted as a review comment.
4. `skyvern/forge/sdk/api/llm/api_handler_factory.py:319`
   - Draft comment: The asynchronous function `llm_api_handler_with_router_and_fallback` is quite long and deeply nested. Consider refactoring it into smaller helper functions for improved readability and maintainability.
   - Reason this comment was not posted: Comment was not on a location in the diff, so it can't be submitted as a review comment.

Workflow ID: wflow_iCe4PNswaqdlDyXI
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
> [!IMPORTANT]
> Looks good to me! 👍

Reviewed everything up to d0b8db0 in 1 minute and 55 seconds.

- Reviewed 598 lines of code in 4 files
- Skipped 0 files when reviewing
- Skipped posting 6 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. `skyvern/forge/sdk/api/llm/api_handler_factory.py:90`
   - Draft comment: The API handler functions (e.g., `_dispatch_llm_call` and `get_call_stats`) are quite long and contain repetitive logic (e.g., multiple re-initializations of token counts). Consider refactoring common token extraction and cost-calculation logic into helper functions for better readability and maintainability.
   - Reason this comment was not posted: Comment was on unchanged code.
2. `skyvern/forge/sdk/api/llm/api_handler_factory.py:150`
   - Draft comment: Several broad `except Exception:` blocks are used throughout the code. Consider catching more specific exceptions where possible to improve error clarity and avoid masking unexpected issues.
   - Reason this comment was not posted: Comment was not on a location in the diff, so it can't be submitted as a review comment.
3. `skyvern/forge/sdk/api/llm/api_handler_factory.py:415`
   - Draft comment: When injecting the context-caching system message for OpenAI models, verify that no sensitive content is logged or stored in artifacts. A security review of logged prompt data may be beneficial.
   - Reason this comment was not posted: Comment was not on a location in the diff, so it can't be submitted as a review comment.
4. `skyvern/forge/sdk/api/llm/api_handler_factory.py:870`
   - Draft comment: The LLM request and response logging (LLM_REQUEST, LLM_RESPONSE, etc.) is comprehensive. Ensure that data included in these artifacts does not contain sensitive or PII data before sharing externally.
   - Reason this comment was not posted: Comment was not on a location in the diff, so it can't be submitted as a review comment.
5. `skyvern/forge/sdk/core/skyvern_context.py:82`
   - Draft comment: The `SkyvernContext` using `ContextVar` is implemented clearly and covers all necessary fields, including caching and flag settings. No issues were found here.
   - Reason this comment was not posted: Confidence changes required: 0% (threshold: 50%).
6. `skyvern/forge/sdk/trace/experiment_utils.py:63`
   - Draft comment: The experiment metadata collection functions are robust, with nested try/except handling per flag. Consider logging at a debug or info level when experiment data is successfully collected, to aid in monitoring.
   - Reason this comment was not posted: Comment was not on a location in the diff, so it can't be submitted as a review comment.

Workflow ID: wflow_N9UULiW0ZUyVIh40
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 0
🧹 Nitpick comments (3)
skyvern/forge/agent.py (2)
**3-3: SHA1-based cache variant is non-cryptographic; consider clarifying or silencing the lint**

Using `hashlib.sha1` here is only to derive a short, stable identifier for `complete_criterion`, not for any security boundary, so the S324 warning isn't a practical risk. If you want to keep linters quiet, you could either:

- switch to a modern fast hash like `hashlib.blake2s()` with the same 6-char prefix, or
- keep `sha1` and add an inline comment / `# noqa: S324` explaining the non-security use.

Functionally this looks fine and gives compact, composable variants like `vc-ml-ccxxxxxx`.

Also applies to: 2471-2491
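A minimal sketch of the two options above, assuming the variant suffix is derived from a `complete_criterion` string (the helper names and 6-character prefix follow the review comment, not necessarily the actual code):

```python
import hashlib


def criterion_suffix_sha1(complete_criterion: str) -> str:
    # Current approach per the review: non-cryptographic use of SHA1, silenced for the linter.
    digest = hashlib.sha1(complete_criterion.encode("utf-8")).hexdigest()  # noqa: S324
    return f"cc{digest[:6]}"


def criterion_suffix_blake2s(complete_criterion: str) -> str:
    # Alternative: blake2s avoids the S324 warning and is fast, with the same 6-char prefix.
    digest = hashlib.blake2s(complete_criterion.encode("utf-8")).hexdigest()
    return f"cc{digest[:6]}"


# Both produce compact, composable variant suffixes like "cc3f9a1b".
print(criterion_suffix_sha1("order is confirmed"), criterion_suffix_blake2s("order is confirmed"))
```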
**2683-2707: Prompt caching settings + cache_variant wiring are consistent; minor duplication around the magic-link flag**

- Pulling `prompt_caching_settings` from `_get_prompt_caching_settings(context)` and checking both `"extract-actions"` and `"extract-action"` makes the experiment/override behavior explicit while keeping the feature easily extensible.
- The new `cache_variant` derived from `verification_code_check`, `has_magic_link_page`, and the normalized `complete_criterion` feeds correctly into both Vertex cache creation and logging, giving clear separation between OTP/magic-link/criterion variants.
- Caching remains tightly scoped: you only call `_create_vertex_cache_for_task` when `effective_llm_key` is a Gemini key and caching is enabled, which matches the LLM handler side.

Small optional cleanup: you call `context.has_magic_link_page(task.task_id)` twice (once in `prompt_kwargs`, once for `_build_extract_action_cache_variant`), and `has_magic_link_page` can mutate internal state when a page is closed. Computing it once into a local `has_magic_link` and reusing it would avoid any subtle divergence and a redundant context lookup.

Also applies to: 2736-2742, 2759-2767, 2770-2776
skyvern/forge/sdk/api/llm/api_handler_factory.py (1)
**1-1: Helper functions centralize artifact and Vertex cache-hit logging**

- `_log_hashed_href_map_artifacts_if_needed` cleanly de-duplicates the inline blocks, and the guards (`context`, `step`, non-speculative) look right to avoid noise from speculative steps.
- `_log_vertex_cache_hit_if_needed` correctly scopes cache-hit logging to extract-actions prompts with a populated `vertex_cache_name`, and carries through `cache_key` and `cache_variant` for debugging.

One improvement worth considering: the router-based handler also computes `cached_tokens`, but doesn't currently call `_log_vertex_cache_hit_if_needed`, so Vertex cache hits via router primaries won't show up in this log. Calling it there too (using `model_used` as the identifier) would make cache observability consistent across router and direct paths.

Also applies to: 33-34, 42-44, 67-103
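A minimal sketch of the kind of guard this helper applies, using illustrative names (a stdlib logger and a `context` carrying `vertex_cache_name`/`vertex_cache_key`/`vertex_cache_variant`); the real signature may differ:

```python
import logging
from typing import Any, Optional

LOG = logging.getLogger(__name__)
EXTRACT_ACTION_PROMPT_NAME = "extract-actions"


def log_vertex_cache_hit_if_needed(
    context: Optional[Any],
    prompt_name: Optional[str],
    model_name: str,
    cached_tokens: int,
) -> None:
    """Log a Vertex cache hit only when it is meaningful for the extract-actions prompt."""
    if cached_tokens <= 0:
        return  # no cache hit to report
    if prompt_name != EXTRACT_ACTION_PROMPT_NAME:
        return  # cache hits are only expected for extract-actions
    if context is None or not getattr(context, "vertex_cache_name", None):
        return  # no Vertex cache was created for this context
    LOG.info(
        "Vertex cache hit: model=%s cached_tokens=%d cache_name=%s cache_key=%s cache_variant=%s",
        model_name,
        cached_tokens,
        context.vertex_cache_name,
        getattr(context, "vertex_cache_key", None),
        getattr(context, "vertex_cache_variant", None),
    )
```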
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- `skyvern/forge/agent.py` (9 hunks)
- `skyvern/forge/sdk/api/llm/api_handler_factory.py` (14 hunks)
- `skyvern/forge/sdk/core/skyvern_context.py` (1 hunks)
- `skyvern/forge/sdk/trace/experiment_utils.py` (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
`**/*.py`: Use `ruff check` and `ruff format` for linting and formatting Python code
Use `mypy` for type checking Python code
Use type hints in Python code
Maintain line length of 120 characters for Python code

`**/*.py`: Use Python 3.11+ features and type hints in Python code
Follow PEP 8 with a line length of 100 characters in Python code
Use absolute imports for all modules in Python code
Document all public functions and classes with Google-style docstrings in Python code
Use `snake_case` for variables and functions, `PascalCase` for classes in Python code
Prefer async/await over callbacks for asynchronous programming in Python
Use `asyncio` for concurrency in Python asynchronous code
Always handle exceptions in async code in Python
Use context managers for resource cleanup in Python
Use specific exception classes rather than generic exceptions in error handling
Include meaningful error messages in exceptions
Log errors with appropriate severity levels in Python
Never expose sensitive information in error messages
Optimize database queries in Python for performance
Use appropriate data structures in Python for performance optimization
Implement caching where beneficial in Python code
Validate all inputs in Python code
Use environment variables for configuration instead of hardcoding values

Files:

- `skyvern/forge/sdk/trace/experiment_utils.py`
- `skyvern/forge/sdk/core/skyvern_context.py`
- `skyvern/forge/sdk/api/llm/api_handler_factory.py`
- `skyvern/forge/agent.py`
skyvern/forge/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
FastAPI-based REST API and WebSocket support should be located in the `skyvern/forge/` directory

Files:

- `skyvern/forge/sdk/trace/experiment_utils.py`
- `skyvern/forge/sdk/core/skyvern_context.py`
- `skyvern/forge/sdk/api/llm/api_handler_factory.py`
- `skyvern/forge/agent.py`
**/*.{py,ts,tsx,js,jsx}
📄 CodeRabbit inference engine (CLAUDE.md)
Use async/await patterns for asynchronous operations
Files:
- `skyvern/forge/sdk/trace/experiment_utils.py`
- `skyvern/forge/sdk/core/skyvern_context.py`
- `skyvern/forge/sdk/api/llm/api_handler_factory.py`
- `skyvern/forge/agent.py`
🧠 Learnings (2)
📚 Learning: 2025-11-24T17:46:29.802Z
Learnt from: CR
Repo: Skyvern-AI/skyvern PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-24T17:46:29.802Z
Learning: Configure LLM models via environment variables or `skyvern init llm`, supporting OpenAI, Anthropic, Azure OpenAI, AWS Bedrock, Gemini, and Ollama
Applied to files:
skyvern/forge/sdk/api/llm/api_handler_factory.py
📚 Learning: 2025-11-24T17:46:29.802Z
Learnt from: CR
Repo: Skyvern-AI/skyvern PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-24T17:46:29.802Z
Learning: Use `LLM_KEY` environment variable to specify which LLM model to use
Applied to files:
skyvern/forge/sdk/api/llm/api_handler_factory.py
🪛 Ruff (0.14.6)
skyvern/forge/sdk/api/llm/api_handler_factory.py
461-461: Avoid specifying long messages outside the exception class
(TRY003)
464-464: Avoid specifying long messages outside the exception class
(TRY003)
509-509: Do not catch blind exception: Exception
(BLE001)
skyvern/forge/agent.py
2489-2489: Probable use of insecure hash functions in hashlib: sha1
(S324)
2831-2831: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Optimize new Python code in this PR
- GitHub Check: Run tests and pre-commit hooks
🔇 Additional comments (8)
skyvern/forge/sdk/trace/experiment_utils.py (1)
**63-71: Experiment flag rename aligns with the new prompt-caching helper**

Switching to `"PROMPT_CACHING_OPTIMIZATION"` here matches the flag used in `_get_prompt_caching_settings`, so tracing will correctly capture the experiment state without changing behavior otherwise.

skyvern/forge/sdk/core/skyvern_context.py (1)
**35-43: New cache-related context fields are correctly typed and scoped**

`vertex_cache_key`, `vertex_cache_variant`, and `prompt_caching_settings` are optional, context-local, and line up with how ForgeAgent and the LLM handlers read/write them; no correctness issues here.
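A minimal sketch of how such optional, context-local fields might look, using the field names from the review (the actual `SkyvernContext` has many more fields and is managed through a `ContextVar`):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SkyvernContextSketch:
    """Illustrative subset of the cache-related fields; not the real SkyvernContext definition."""

    # Vertex AI cached-content resource name created for this context, if any.
    vertex_cache_name: Optional[str] = None
    # Logical cache key, e.g. "extract-action-static-vc-<llm_key>".
    vertex_cache_key: Optional[str] = None
    # Prompt variant the cache was built for, e.g. "std" or "vc".
    vertex_cache_variant: Optional[str] = None
    # Per-prompt caching decisions resolved from overrides or experiments (None until resolved).
    prompt_caching_settings: Optional[dict[str, bool]] = None
```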
skyvern/forge/agent.py (2)

**2494-2500: Vertex cache key/variant handling and context metadata wiring look correct**

- `cache_variant = prompt_variant or "std"` and `variant_suffix = f"-{cache_variant}"` give keys like `extract-action-static-vc-<llm_key>`, preserving old behavior for the default and specializing when a variant is provided.
- Cache creation is still safely guarded on `resolved_llm_key` and uses `asyncio.to_thread` to keep the event loop responsive.
- Storing `vertex_cache_name`, `vertex_cache_key`, and `vertex_cache_variant` on the context matches how the LLM handlers log and decide when to attach Vertex cache references.

No functional issues spotted here.

Also applies to: 2510-2512, 2522-2523, 2533-2536, 2599-2603, 2604-2611
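A small sketch of the key construction described above; the function name is hypothetical, but the key shape (`extract-action-static-<variant>-<llm_key>`) follows the comment:

```python
def build_vertex_cache_key(llm_key: str, prompt_variant: str | None = None) -> str:
    # With no variant supplied, "std" is used; whether the default keeps the exact legacy
    # key shape is an assumption of this sketch.
    cache_variant = prompt_variant or "std"
    variant_suffix = f"-{cache_variant}"
    return f"extract-action-static{variant_suffix}-{llm_key}"


# Examples: the default key and the "vc" variant get distinct cache entries.
print(build_vertex_cache_key("GEMINI_2_5_PRO"))        # extract-action-static-std-GEMINI_2_5_PRO
print(build_vertex_cache_key("GEMINI_2_5_PRO", "vc"))  # extract-action-static-vc-GEMINI_2_5_PRO
```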
**2804-2851: `_get_prompt_caching_settings` correctly prioritizes overrides and caches the experiment result per context**

- Global overrides via `LLMAPIHandlerFactory._prompt_caching_settings` win, which is what you want for scripts/tests.
- Otherwise, the method computes a `distinct_id` and `organization_id`, initializes `context.prompt_caching_settings` to `{}`, and calls the `"PROMPT_CACHING_OPTIMIZATION"` experiment at most once per context, memoizing the result.
- When enabled, you set both `EXTRACT_ACTION_PROMPT_NAME` and `EXTRACT_ACTION_TEMPLATE` to `True`, which aligns with how `_build_extract_action_prompt` checks `cache_enabled`.

The broad `except Exception as exc` around the experiment call is defensible here, since a feature-flag outage shouldn't disrupt core prompting; the conservative fallback to `{}` is appropriate.
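A rough sketch of the resolution order described above (override first, then a memoized experiment lookup). The override store, experiment client, and `is_feature_enabled` call are hypothetical stand-ins, not the actual API:

```python
from typing import Any, Optional

EXTRACT_ACTION_PROMPT_NAME = "extract-actions"
EXTRACT_ACTION_TEMPLATE = "extract-action"


async def get_prompt_caching_settings(
    context: Any,
    override_settings: Optional[dict[str, bool]],
    experiment_client: Any,
) -> dict[str, bool]:
    # 1. Explicit overrides (e.g. set by scripts/tests) always win.
    if override_settings is not None:
        return override_settings

    # 2. Memoize per context so the experiment is evaluated at most once per run.
    if context.prompt_caching_settings is not None:
        return context.prompt_caching_settings

    settings: dict[str, bool] = {}
    context.prompt_caching_settings = settings
    try:
        # distinct_id derivation is illustrative; the real code also resolves organization_id.
        enabled = await experiment_client.is_feature_enabled(
            "PROMPT_CACHING_OPTIMIZATION", distinct_id=getattr(context, "task_id", None)
        )
        if enabled:
            settings[EXTRACT_ACTION_PROMPT_NAME] = True
            settings[EXTRACT_ACTION_TEMPLATE] = True
    except Exception:
        # A feature-flag outage should not disrupt prompting; fall back to "no caching".
        pass
    return settings
```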
skyvern/forge/sdk/api/llm/api_handler_factory.py (4)

**277-289: Override handler now correctly bypasses experimentation for explicit model choices**

Letting `get_override_llm_api_handler` delegate to `get_llm_api_handler(override_llm_key)` ensures explicit overrides use the exact configured model/router, and the fallback to `default` on failure is well-logged and preserves behavior.

**319-362: Router handler's Vertex-cache path and speculative metadata wiring look robust**

- The new `_log_hashed_href_map_artifacts_if_needed` call runs once per LLM invocation and is gated on `step`/`is_speculative_step`, matching the direct-handler and LLMCaller behavior.
- `_log_llm_request_artifact` centralizes LLM_REQUEST artifact creation (with a `vertex_cache_attached` flag) and is only invoked for non-speculative steps, while still returning `llm_request_json` for speculative metadata.
- `should_attach_vertex_cache` is nicely constrained: it requires a `vertex_cache_name`, `prompt_name == "extract-actions"`, `context.use_prompt_caching`, a Gemini `main_model_group`, and a valid primary model dict.
- `_call_primary_with_vertex_cache` correctly:
  - deep-copies `litellm_params`,
  - overlays the generic `parameters`,
  - injects `cached_content`,
  - and logs cache metadata including `vertex_cache_key` and `cache_variant` from the context.
- If the primary-with-cache call fails (other than `CancelledError`), you log and cleanly fall back to the router path.
- `model_used` and `llm_request_json` are set in both primary and fallback paths and then fed into metrics + `SpeculativeLLMMetadata`, so speculative runs capture the full picture.

Token accounting, cost calculation, and final metrics logging are unchanged structurally; the only addition is the extra `model_used` detail, which is useful for debugging primary vs fallback behavior.

Also applies to: 363-373, 388-405, 441-452, 453-487, 488-497, 498-531, 564-608, 661-676, 678-695
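A minimal sketch of the gating predicate described above, using illustrative parameter types (the real function operates on the router's model-group structures):

```python
from typing import Any, Optional

EXTRACT_ACTION_PROMPT_NAME = "extract-actions"


def should_attach_vertex_cache(
    context: Optional[Any],
    prompt_name: Optional[str],
    main_model_group: str,
    primary_model: Optional[dict[str, Any]],
) -> bool:
    """Attach the Vertex cache only when every condition from the review holds."""
    return bool(
        context is not None
        and getattr(context, "vertex_cache_name", None)     # a cache was actually created
        and prompt_name == EXTRACT_ACTION_PROMPT_NAME       # only the extract-actions prompt
        and getattr(context, "use_prompt_caching", False)   # caching enabled for this context
        and "gemini" in main_model_group.lower()            # Vertex caching is Gemini-only
        and primary_model is not None                       # a resolvable primary model dict
    )
```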
**700-705: Direct LLM handler's caching and logging now align with the new prompt caching model**

- The hashed-href artifact logging helper keeps behavior consistent with the router/LLMCaller paths while avoiding duplication.
- Vertex cache attachment is tightly scoped to:
  - `cache_resource_name` present on the context,
  - `prompt_name == "extract-actions"`,
  - `context.use_prompt_caching` set,
  - and a Gemini `model_name`.
- When attaching a cache, logging now includes `cache_name`, `vertex_cache_key`, and `vertex_cache_variant`, which matches the metadata ForgeAgent stores on the context.
- When a stale `cached_content` leaks into `active_parameters` for other prompts/models, you explicitly strip it and log the removal with the same metadata, preventing cache references from persisting across unrelated calls.
- Request artifacts include `vertex_cache_attached` but not the raw `cached_content`, which reduces the risk of leaking provider-specific cache identifiers into artifacts.
- `_log_vertex_cache_hit_if_needed(context, prompt_name, model_name, cached_tokens)` gives you clear, low-noise cache-hit logs tied to both the LLM identifier and the context's cache metadata.

Overall this keeps cache usage and observability well-contained around extract-actions + Gemini.

Also applies to: 756-778, 820-831, 835-849, 851-868, 942-959, 960-961
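A rough sketch of the attach-or-strip behavior described above; `active_parameters` stands in for the request kwargs dict, and the logger calls are illustrative:

```python
import logging
from typing import Any

LOG = logging.getLogger(__name__)


def attach_or_strip_cached_content(
    active_parameters: dict[str, Any],
    cache_name: str | None,
    attach: bool,
) -> None:
    """Attach the Vertex cached-content reference when eligible, otherwise strip any stale one."""
    if attach and cache_name:
        active_parameters["cached_content"] = cache_name
        LOG.info("Attached Vertex cached content", extra={"cache_name": cache_name})
    elif "cached_content" in active_parameters:
        # A previous extract-actions call left a cache reference behind; remove it so
        # unrelated prompts/models are not sent with someone else's cached prompt.
        stale = active_parameters.pop("cached_content")
        LOG.info("Removed stale cached_content from request parameters", extra={"cache_name": stale})
```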
**756-765: Hashed href map artifact logging is now uniformly handled across all LLM entry points**

Using `_log_hashed_href_map_artifacts_if_needed` in both factory-based handlers and `LLMCaller.call` ensures any populated `context.hashed_href_map` is captured once per non-speculative step, regardless of whether the call goes through a router, direct LLM, or LLMCaller. The guards prevent speculative runs from generating extra artifacts.

Also applies to: 1138-1147
Actionable comments posted: 0
🧹 Nitpick comments (6)
skyvern/forge/sdk/api/llm/api_handler_factory.py (6)
**43-43: Unused `CHECK_USER_GOAL_PROMPT_NAMES` constant**

This set isn't referenced anywhere in this file. If it's not used elsewhere yet, consider either wiring it into the relevant logic (e.g., gating behaviors by these prompt names) or dropping it to avoid dead code.
**87-102: Vertex cache hit logging helper is sound but only used for the direct handler**

`_log_vertex_cache_hit_if_needed` correctly restricts logging to:

- `cached_tokens > 0`
- `prompt_name == EXTRACT_ACTION_PROMPT_NAME`
- a `context` with `vertex_cache_name`.

You currently invoke it only in the single-model handler. If you also want cache-hit telemetry for the router + Vertex primary path, consider calling this helper there after computing `cached_tokens` to keep logging behavior consistent across both code paths.
**388-405: LLM request artifact helper for the router path**

The `_log_llm_request_artifact` inner function is a nice consolidation of request artifact logging and speculative metadata capture. Two minor points to consider:

- You intentionally log `parameters` instead of the full `active_params` for privacy, but this means router-specific `litellm_params` (and `cached_content`) aren't visible in artifacts. If you ever need to debug cache behavior solely from artifacts, adding a redacted view of those fields (or just including `cached_content` when non-sensitive) could help.
- `json.dumps(llm_request_payload)` will raise if any value is not JSON-serializable. If you foresee custom/unusual objects in `parameters` (e.g., rich tool schemas), a small try/except around `json.dumps` with a fallback (e.g., stringifying non-serializable entries) would make artifact creation more robust.
**441-531: Router primary-with-cache path and fallbacks: behavior and exception handling**

The new router + Vertex cache flow is generally solid:

- `should_attach_vertex_cache` correctly gates on:
  - `vertex_cache_name` present,
  - `prompt_name == EXTRACT_ACTION_PROMPT_NAME`,
  - `context.use_prompt_caching`,
  - the primary group containing `"gemini"`,
  - a resolvable primary model dict.
- `_call_primary_with_vertex_cache`:
  - deep-copies `litellm_params` and combines them with the user `parameters`,
  - forces `cached_content` for the primary call only,
  - logs cache metadata (name/key/variant) clearly.
- On any non-`CancelledError` from the primary call, you log a warning and transparently fall back to `router.acompletion` without cache, preserving existing behavior.
- `model_used` and `llm_request_json` are always set before use in metrics/speculative metadata.

A few nits:

- Broad `Exception` in the cache primary call: the nested `except Exception as cache_error:` (line 509) is intentionally used to treat any failure as a signal to retry via the router. That's reasonable, but it does trigger BLE001 and slightly weakens observability. If feasible, narrow this to the expected error classes from `litellm.acompletion` (e.g., `litellm.exceptions.APIError`, `ContextWindowExceededError`, etc.), and let truly unexpected exceptions bubble to the outer handler.
- Generic `ValueError` for configuration problems: the two `ValueError` raises for missing primary configuration/`litellm_params` (lines ~461, ~464) are clear, but static analysis (TRY003) prefers dedicated exception types. If these represent programmer/configuration mistakes rather than user input issues, a small custom exception (e.g., `RouterPrimaryConfigError`) would make error handling and log filtering crisper.
- Router request artifact model label: `_call_router_without_cache` logs the `model` field as `llm_key`, while the router is actually invoked with `model=main_model_group`. For debugging, it might be slightly clearer to log `main_model_group` as the model label so artifacts reflect the effective router target.

These are all refinements; the core cache-attach and fallback behavior looks correct.
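A minimal sketch of the dedicated exception type suggested above; the class name comes from the review comment and the message text is illustrative:

```python
class RouterPrimaryConfigError(Exception):
    """Raised when the router's primary model is misconfigured (missing deployment or litellm_params)."""

    def __init__(self, main_model_group: str, detail: str) -> None:
        super().__init__(f"Invalid primary configuration for model group '{main_model_group}': {detail}")
        self.main_model_group = main_model_group


# Hypothetical usage at the spots that currently raise ValueError:
# raise RouterPrimaryConfigError(main_model_group, "no litellm_params found for primary deployment")
```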
**823-853: Vertex cache attach/removal logging in the direct handler**

The updated logs that include `cache_name`, `cache_key`, and `cache_variant` give useful observability into which logical cache entry was attached or removed:

- The attach path is guarded by:
  - `vertex_cache_name` present,
  - `prompt_name == EXTRACT_ACTION_PROMPT_NAME`,
  - `context.use_prompt_caching`,
  - `"gemini" in model_name.lower()`.
- The else path strips any stray `cached_content` and logs that it was removed.

Two small considerations:

- If `vertex_cache_name` or `vertex_cache_key` contains project IDs or other semi-sensitive identifiers, you may want to confirm with your logging policy that emitting them at `info` level is acceptable.
- Given you also have `LLMAPIHandlerFactory._prompt_caching_settings` / `SkyvernContext.prompt_caching_settings`, you might eventually want to gate this attach/remove behavior using those per-prompt settings to avoid relying solely on `prompt_name == EXTRACT_ACTION_PROMPT_NAME`.
**104-109: Prompt caching settings field and setter are currently unused**

`LLMAPIHandlerFactory._prompt_caching_settings` and `set_prompt_caching_settings` are defined but not read anywhere in this module. If you intend to drive cache-attachment behavior off these settings (as suggested by the docstring), consider:

- wiring them into the Vertex cache gating logic (e.g., checking whether the current `prompt_name` is enabled), or
- removing them until you're ready to consume them.

This will avoid confusion about which switch (`context.use_prompt_caching` vs. `_prompt_caching_settings`/`prompt_caching_settings`) is actually authoritative.

Also applies to: 1065-1069
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- `skyvern/forge/sdk/api/llm/api_handler_factory.py` (14 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
skyvern/forge/sdk/api/llm/api_handler_factory.py (4)
- `skyvern/forge/sdk/core/skyvern_context.py` (1)
  - `SkyvernContext` (10-97)
- `skyvern/forge/sdk/artifact/manager.py` (1)
  - `create_llm_artifact` (278-335)
- `skyvern/forge/sdk/artifact/models.py` (1)
  - `ArtifactType` (10-53)
- `skyvern/forge/sdk/api/llm/models.py` (1)
  - `LLMAPIHandler` (90-107)
🪛 Ruff (0.14.6)
skyvern/forge/sdk/api/llm/api_handler_factory.py
461-461: Avoid specifying long messages outside the exception class
(TRY003)
464-464: Avoid specifying long messages outside the exception class
(TRY003)
509-509: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Optimize new Python code in this PR
- GitHub Check: Run tests and pre-commit hooks
🔇 Additional comments (7)
skyvern/forge/sdk/api/llm/api_handler_factory.py (7)
**67-85: Centralized hashed-href artifact logging looks correct**

The `_log_hashed_href_map_artifacts_if_needed` helper correctly:

- guards on `context`, `context.hashed_href_map`, `step`, and `not is_speculative_step`,
- reuses `create_llm_artifact` so routing via `step`/`task_v2`/`thought`/`ai_suggestion` stays consistent.

This removes duplication and avoids logging during speculative calls; no issues from a correctness standpoint.

**277-283: Override handler behavior comment is aligned with the implementation**

The new comment documenting that explicit overrides bypass experimentation and use `get_llm_api_handler(override_llm_key)` matches the implementation. No functional concerns here.

**365-372: Reusing hashed-href logging in the router handler**

Using `_log_hashed_href_map_artifacts_if_needed` here aligns router-based calls with the direct handler and `LLMCaller`, and preserves the "no speculative step" guard. This looks correct and avoids repeated inline conditions.

**700-704: `get_llm_api_handler` signature extension**

Adding `base_parameters` to `get_llm_api_handler` and threading it into `LLMCaller` construction is a clean extension of the API without breaking existing call sites. No issues here.

**761-768: Consistent hashed-href artifact logging in the direct handler**

Calling `_log_hashed_href_map_artifacts_if_needed` here brings the single-model handler in line with the router and `LLMCaller` flows and respects the speculative-step guard. This refactor looks correct and non-breaking.

**961-962: Vertex cache hit logging for the direct handler**

Using `_log_vertex_cache_hit_if_needed(context, prompt_name, model_name, cached_tokens)` right after computing `cached_tokens` ensures you only log meaningful cache hits for the extract-actions prompt. This is a lightweight addition with clear value for cache diagnostics.

**1141-1148: Hashed-href artifact logging in `LLMCaller.call`**

Extending `_log_hashed_href_map_artifacts_if_needed` to `LLMCaller.call` keeps artifact behavior consistent with the other LLM paths and continues to avoid speculative steps. This is a straightforward, correct reuse of the helper.
> [!IMPORTANT]
> Fixes Vertex cache leak by introducing variant-aware caching and refactoring LLM handlers for improved prompt management.
>
> - `agent.py`: `SkyvernContext` tracks cache metadata including `vertex_cache_name`, `vertex_cache_key`, and `vertex_cache_variant`; `ForgeAgent` clears cache entries post-call to prevent prompt leakage.
> - `api_handler_factory.py`: `SkyvernContext` updated to include `prompt_caching_settings`.
>
> <sup>This description was created by Ellipsis for d0b8db0. You can customize this summary. It will automatically update as commits are pushed.</sup>
🔧 This PR fixes a critical Vertex AI cache leak by implementing variant-aware prompt caching that prevents cached static prompts from bleeding into different LLM requests. The changes introduce proper cache scoping per prompt variant (standard, verification code, magic link) and ensure caching is only applied to appropriate Gemini models.
🔍 Detailed Analysis
Key Changes
- `_build_extract_action_cache_variant()`: method that creates unique cache identifiers based on verification code checks, magic link pages, and completion criteria using SHA1 hashing
- `SkyvernContext`: extended with `vertex_cache_key`, `vertex_cache_variant`, and `prompt_caching_settings` fields to maintain richer cache metadata
- `_get_prompt_caching_settings()`: method that resolves caching configuration through explicit overrides or PostHog experiments, with per-context caching
- `_create_vertex_cache_for_task()`: updated to accept prompt variants and generate variant-specific cache keys like `extract-action-static-vc-<llm_key>`

Technical Implementation

```mermaid
sequenceDiagram
    participant FA as ForgeAgent
    participant CC as CacheManager
    participant VC as VertexCache
    participant LLM as LLMHandler

    FA->>FA: _build_extract_action_cache_variant()
    Note over FA: Generate variant ID (std/vc/ml/cc<hash>)
    FA->>CC: _create_vertex_cache_for_task()
    Note over CC: Include variant in cache key
    CC->>VC: Create cache with variant-specific key
    VC-->>CC: Return cache metadata
    CC->>FA: Store cache_key, cache_variant in context
    FA->>LLM: Send request with cached content
    Note over LLM: Only attach cache for matching variant
    FA->>FA: Clear cache entry after call
    Note over FA: Prevent leaks to subsequent requests
```

Impact
Created with Palmier