This change introduces a new prompt evolution system inspired by the … #3644
…concepts in the AlphaEvolve research paper. The system iteratively improves the prompts used by the Skyvern agent, enhancing its performance and reliability. The core of this feature is the new `skyvern/evolution` package, which includes:

- `PromptManager`: a class for managing a population of prompts and their performance scores.
- `Evolve`: a class that uses an LLM to generate new variations of prompts based on their performance.
- `evolve-prompt.j2`: a new prompt template to guide the LLM in the evolution process.

A new script, `scripts/run_evolution.py`, orchestrates the evolution loop, allowing continuous improvement of the prompts. The `ForgeAgent` has been integrated with the `PromptManager` to dynamically use the best-performing prompt for its tasks. This creates a feedback loop in which the agent's performance can improve over time as its prompts evolve.
Walkthrough

Adds a prompt-evolution subsystem: a `PromptManager` to load/manage prompts, an `Evolve` class to generate and evaluate evolved prompts via an LLM and a scoring heuristic, a script to run multi-generation evolution, a new evolve-prompt template, and `ForgeAgent` changes to prefer evolved prompts at runtime, with app-level globals initialized.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User as CLI User
    participant Script as run_evolution.py
    participant PM as PromptManager
    participant Evo as Evolve
    participant LLM as LLM_API_HANDLER
    participant PE as prompt_engine
    User->>Script: python scripts/run_evolution.py
    Script->>PM: init() and load baseline
    Script->>Evo: init(PromptManager)
    loop generations (x5)
        Script->>Evo: evolve_prompts()
        Evo->>PM: get_best_prompt()
        alt best prompt exists
            Evo->>PE: load_prompt("evolve-prompt", target=best.template)
            Evo->>LLM: generate evolved prompt (input: evolution prompt)
            LLM-->>Evo: evolved prompt text
            Evo->>PM: add_prompt(name=evolved_vN, template=..., score=0)
        else
            Evo->>Evo: log warning and return
        end
        Script->>Evo: evaluate_and_score_prompts()
        Evo->>PM: update_score(...) per prompt
        Script->>PM: get_best_prompt() and log
        Script->>Script: asyncio.sleep()
    end
```

```mermaid
sequenceDiagram
    autonumber
    participant Forge as ForgeAgent
    participant PM as PromptManager
    participant PE as prompt_engine
    participant Render as Renderer
    Forge->>PM: get_best_prompt()
    alt evolved prompt available
        Forge->>PE: load_prompt_from_string(best.template)
    else
        Forge->>PE: load_prompt(".../extract-action.j2" or mapped template)
    end
    PE-->>Render: compiled template
    Render-->>Forge: rendered prompt
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches: ❌ 2 failed checks (warnings), ✅ 1 passed.
Caution
Changes requested ❌
Reviewed everything up to 6bd63e6 in 2 minutes and 26 seconds.
- Reviewed 361 lines of code in 7 files
- Skipped 0 files when reviewing
- Skipped posting 2 draft comments; view those below
1. scripts/run_evolution.py:52
- Draft comment: End the file with a newline for POSIX compliance.
- Reason this comment was not posted: usefulness confidence = 20% vs. threshold = 50%. The issue is real but very minor: missing trailing newlines can trip some UNIX tools, but this is exactly the kind of issue that linters, formatters, and IDEs handle automatically, so it is better left to automated tooling than a manual review comment.
2. skyvern/forge/agent.py:1325
- Draft comment: Consider refactoring the prompt selection fallback logic into a helper to improve clarity and avoid repetition.
- Reason this comment was not posted: comment looked like it was already resolved.
Workflow ID: wflow_NTbPeRaXsFZ9OvJq
```diff
@@ -0,0 +1,74 @@
+import structlog
+import random
```
Remove unused import 'random' if it's not used.
```diff
-import random
```
```python
# In a real implementation, a 'step' object would be passed here.
# This is a placeholder for demonstration purposes.
response = await LLM_API_HANDLER(prompt=evolution_prompt, step=None)
```
Consider adding error handling around the LLM_API_HANDLER call to catch unexpected failures.
Actionable comments posted: 5
🧹 Nitpick comments (8)
skyvern/evolution/prompt_manager.py (2)
30-31: Use structured logging instead of f-strings.

These log statements use f-strings, but structlog supports structured logging with keyword arguments that provide better machine-readability and context.

Based on learnings, apply these changes:

```diff
-        self.add_prompt("baseline", baseline_template, score=1.0)  # Assuming baseline is good.
-        LOG.info("Loaded baseline prompt 'extract-action.j2'.")
+        self.add_prompt("baseline", baseline_template, score=1.0)
+        LOG.info("Loaded baseline prompt", template_name="extract-action.j2")

-        LOG.error(f"Failed to load baseline prompt: {e}", exc_info=True)
+        LOG.error("Failed to load baseline prompt", error=str(e), exc_info=True)

-            LOG.warning(f"Prompt with name '{name}' already exists. Overwriting.")
+            LOG.warning("Prompt already exists, overwriting", name=name)

-        LOG.info(f"Added prompt '{name}' with score {score}.")
+        LOG.info("Added prompt", name=name, score=score)

-            LOG.info(f"Updated score for prompt '{name}' to {score}.")
+            LOG.info("Updated prompt score", name=name, score=score)

-            LOG.warning(f"Prompt '{name}' not found for score update.")
+            LOG.warning("Prompt not found for score update", name=name)
```

Also applies to: 40-40, 43-43, 66-66, 68-68
32-33: Narrow the exception handling.

Catching all exceptions with a bare `except Exception` is too broad and may hide unexpected errors. Consider catching specific exceptions like `jinja2.TemplateNotFound` or `OSError`.

```diff
+from jinja2 import TemplateNotFound
+
     try:
         # Access the Jinja2 environment from the prompt_engine
         env = prompt_engine.env
         # Construct the path to the template within the Jinja2 environment
         template_path = "skyvern/extract-action.j2"
         # Get the template source from the loader
         baseline_template = env.loader.get_source(env, template_path)[0]
-        self.add_prompt("baseline", baseline_template, score=1.0)  # Assuming baseline is good.
+        self.add_prompt("baseline", baseline_template, score=1.0)
         LOG.info("Loaded baseline prompt 'extract-action.j2'.")
-    except Exception as e:
-        LOG.error(f"Failed to load baseline prompt: {e}", exc_info=True)
+    except (TemplateNotFound, OSError) as e:
+        LOG.error("Failed to load baseline prompt", error=str(e), exc_info=True)
```

scripts/run_evolution.py (3)
26-26: Consider making the number of generations configurable.

The hard-coded value `num_generations = 5` limits flexibility for experimentation or production use. Consider adding a command-line argument or environment variable:

```diff
+import os
+
 async def main():
     """
     Main function to run the prompt evolution loop.
     """
     LOG.info("Initializing prompt evolution process...")
     prompt_manager = PromptManager()
     evolver = Evolve(prompt_manager)
     # Check if the baseline prompt was loaded correctly
     if not prompt_manager.get_prompt("baseline"):
         LOG.error("Failed to load baseline prompt. Aborting evolution process.")
         return
     LOG.info("Starting evolution loop...")
     # Run the evolution loop for a few generations as a demonstration
-    num_generations = 5
+    num_generations = int(os.getenv("EVOLUTION_GENERATIONS", "5"))
     for i in range(num_generations):
```
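A small helper makes the env-var parsing explicit and tolerant of bad input. The variable name `EVOLUTION_GENERATIONS` follows the suggestion above; the fallback behavior on invalid or non-positive values is an assumption:

```python
import os


def generations_from_env(default: int = 5) -> int:
    """Read EVOLUTION_GENERATIONS, falling back to the default on missing or bad values."""
    raw = os.getenv("EVOLUTION_GENERATIONS", str(default))
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default
```

The bare `int(os.getenv(...))` in the diff would raise `ValueError` on a malformed value; the helper degrades to the default instead.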
30-34: Add error handling for evolution steps.

The evolution loop lacks error handling for potential failures in `evolve_prompts()` or `evaluate_and_score_prompts()`. If either fails, the entire loop stops without useful diagnostics.

```diff
     for i in range(num_generations):
         LOG.info(f"--- Generation {i+1}/{num_generations} ---")
-        # Evolve the prompts to create new variations
-        await evolver.evolve_prompts()
-
-        # Evaluate the performance of the new prompts
-        evolver.evaluate_and_score_prompts()
+        try:
+            # Evolve the prompts to create new variations
+            await evolver.evolve_prompts()
+
+            # Evaluate the performance of the new prompts
+            evolver.evaluate_and_score_prompts()
+        except Exception:
+            LOG.exception("Evolution step failed", generation=i+1)
+            continue
```
44-44: Document the purpose of the sleep delay.

The 5-second sleep between generations is not explained. Consider documenting why this delay is necessary or making it configurable.

```diff
-    # In a real application, you might add a delay or run this as a continuous background process
-    await asyncio.sleep(5)
+    # Brief pause between generations to avoid overwhelming the LLM API
+    # In production, this could be removed or adjusted based on rate limits
+    await asyncio.sleep(int(os.getenv("EVOLUTION_DELAY_SECONDS", "5")))
```

skyvern/forge/agent.py (2)
1320-1341: Clarify the fallback chain for general tasks.

The fallback logic for general tasks (evolved prompt → baseline → template name) is reasonable, but the "critical error" log message on Line 1339 doesn't immediately raise an exception. This could lead to confusion about whether execution continues.

Consider making the error handling more explicit:

```diff
 if task_type == TaskType.general:
     # For general tasks, try to use the best prompt from our evolution manager.
     best_prompt = app.PROMPT_MANAGER.get_best_prompt()
     if best_prompt:
-        LOG.info(f"Using evolved prompt: {best_prompt.name} with score {best_prompt.score}")
+        LOG.info("Using evolved prompt", prompt_name=best_prompt.name, score=best_prompt.score)
         template_str = best_prompt.template
     else:
         # If no evolved prompts, fall back to the baseline prompt.
         LOG.warning("PromptManager has no prompts. Falling back to baseline 'extract-action'.")
         baseline_prompt = app.PROMPT_MANAGER.get_prompt("baseline")
         if baseline_prompt:
             template_str = baseline_prompt.template
         else:
             # If even the baseline is missing, this is a critical error.
-            LOG.error("Baseline prompt could not be loaded from PromptManager.")
-            # As a last resort, use the template name.
+            LOG.critical("Baseline prompt could not be loaded from PromptManager, using template name fallback")
+            # As a last resort, use the template name (may fail if template file is missing)
             template_name = "extract-action"
```
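The helper refactor suggested earlier in the review could be sketched like this. The manager interface is inferred from the diff; `StubManager` and the `Prompt` tuple are hypothetical stand-ins for testing the chain in isolation:

```python
from collections import namedtuple

Prompt = namedtuple("Prompt", "name template score")


def choose_template(manager, default_template_name: str = "extract-action"):
    """Return (template_str, template_name); exactly one is non-None.

    Order assumed from the diff: best evolved prompt, then the stored
    baseline, then the on-disk template name as a last resort.
    """
    best = manager.get_best_prompt()
    if best is not None:
        return best.template, None
    baseline = manager.get_prompt("baseline")
    if baseline is not None:
        return baseline.template, None
    return None, default_template_name


class StubManager:
    """Hypothetical stand-in exposing the two PromptManager methods the diff uses."""

    def __init__(self, best=None, baseline=None):
        self._best, self._baseline = best, baseline

    def get_best_prompt(self):
        return self._best

    def get_prompt(self, name):
        return self._baseline if name == "baseline" else None
```

Centralizing the chain in one function makes the "exactly one of `template_str`/`template_name` is set" invariant testable, which is what the next comment asks the rendering code to check.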
1382-1396: Add defensive check before rendering.

The code assumes at least one of `template_str` or `template_name` is set, but this isn't guaranteed if the logic above changes. Adding an explicit check after Line 1380 would make the code more robust.

```diff
         complete_criterion=task.complete_criterion,
         terminate_criterion=task.terminate_criterion,
     )
+    # Ensure at least one rendering path is available
+    if template_str is None and template_name is None:
+        raise UnsupportedTaskType(task_type=task_type)
+
     if template_str is not None:
         # Render the prompt from a raw string (used for evolved prompts)
         return prompt_engine.load_prompt_from_string(
             template=template_str,
             **render_kwargs,
         )
     if template_name is not None:
         # Render the prompt from a template file by name (standard behavior)
         return prompt_engine.load_prompt(
             template=template_name,
             **render_kwargs,
         )
-    raise UnsupportedTaskType(task_type=task_type)
```

skyvern/evolution/evolve.py (1)
23-23: Use structured logging instead of f-strings.

These log statements use f-strings for interpolation. Structlog supports structured logging with keyword arguments for better machine-readability.

Based on learnings:

```diff
-        LOG.info(f"Evolving prompt '{best_prompt.name}' with score {best_prompt.score}")
+        LOG.info("Evolving prompt", prompt_name=best_prompt.name, score=best_prompt.score)

-        LOG.info(f"Evolved new prompt '{new_prompt_name}': {evolved_prompt_str[:100]}...")
+        LOG.info("Evolved new prompt", prompt_name=new_prompt_name, preview=evolved_prompt_str[:100])

-            LOG.info(f"Evaluated '{name}', assigned score: {normalized_score}")
+            LOG.info("Evaluated prompt", prompt_name=name, score=normalized_score)
```

Also applies to: 43-43, 74-74
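The difference the review is after, an event message plus key-value context rather than string interpolation, can be illustrated without structlog itself. The JSON-line shape below mirrors what structlog's JSONRenderer produces, but is a deliberate simplification:

```python
import json


def log_event(event: str, **context) -> str:
    """Render an event plus key-value context as one JSON line
    (roughly the shape structlog's JSONRenderer emits; simplified)."""
    return json.dumps({"event": event, **context}, sort_keys=True)
```

Because the message stays constant and the variables travel as keys, log aggregators can group by `event` and filter on `prompt_name` or `score`, which is impossible once the values are baked into an f-string.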
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
scripts/run_evolution.py(1 hunks)skyvern/evolution/__init__.py(1 hunks)skyvern/evolution/evolve.py(1 hunks)skyvern/evolution/prompt_manager.py(1 hunks)skyvern/forge/agent.py(3 hunks)skyvern/forge/app.py(2 hunks)skyvern/forge/prompts/skyvern/evolve-prompt.j2(1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
{skyvern,integrations,alembic,scripts}/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
{skyvern,integrations,alembic,scripts}/**/*.py: Use Python 3.11+ features and add type hints throughout the codebase
Follow PEP 8 with a maximum line length of 100 characters
Use absolute imports for all Python modules
Document all public functions and classes with Google-style docstrings
Use snake_case for variables and functions, and PascalCase for classes
Prefer async/await over callbacks in asynchronous code
Use asyncio for concurrency
Always handle exceptions in async code
Use context managers for resource cleanup
Use specific exception classes
Include meaningful error messages when raising or logging exceptions
Log errors with appropriate severity levels
Never expose sensitive information in error messages
Files:
- `skyvern/evolution/__init__.py`
- `skyvern/evolution/evolve.py`
- `skyvern/forge/agent.py`
- `scripts/run_evolution.py`
- `skyvern/evolution/prompt_manager.py`
- `skyvern/forge/app.py`
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Python code must be linted and formatted with Ruff
Use type hints throughout Python code
Prefer async/await for asynchronous Python code
Enforce a maximum line length of 120 characters in Python files
Files:
- `skyvern/evolution/__init__.py`
- `skyvern/evolution/evolve.py`
- `skyvern/forge/agent.py`
- `scripts/run_evolution.py`
- `skyvern/evolution/prompt_manager.py`
- `skyvern/forge/app.py`
skyvern/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Type-check Python code in the skyvern/ package with mypy
Files:
- `skyvern/evolution/__init__.py`
- `skyvern/evolution/evolve.py`
- `skyvern/forge/agent.py`
- `skyvern/evolution/prompt_manager.py`
- `skyvern/forge/app.py`
🧬 Code graph analysis (4)
skyvern/evolution/evolve.py (2)
skyvern/evolution/prompt_manager.py (3)
- `get_best_prompt` (51-58)
- `add_prompt` (35-43)
- `update_score` (60-68)

skyvern/forge/sdk/prompting.py (1)
- `load_prompt` (89-111)
skyvern/forge/agent.py (6)
skyvern/forge/sdk/db/enums.py (1)
- `TaskType` (8-11)

skyvern/evolution/prompt_manager.py (2)
- `get_best_prompt` (51-58)
- `get_prompt` (45-49)

skyvern/webeye/actions/actions.py (1)
- `ActionType` (13-37)

skyvern/exceptions.py (2)
- `UnsupportedActionType` (322-324)
- `UnsupportedTaskType` (545-547)

skyvern/forge/sdk/core/skyvern_context.py (1)
- `ensure_context` (43-56)

skyvern/forge/sdk/prompting.py (2)
- `load_prompt_from_string` (113-134)
- `load_prompt` (89-111)
scripts/run_evolution.py (2)
skyvern/evolution/evolve.py (3)
- `Evolve` (9-74)
- `evolve_prompts` (14-43)
- `evaluate_and_score_prompts` (45-74)

skyvern/evolution/prompt_manager.py (3)
- `PromptManager` (13-68)
- `get_prompt` (45-49)
- `get_best_prompt` (51-58)
skyvern/forge/app.py (1)
skyvern/evolution/prompt_manager.py (1)
- `PromptManager` (13-68)
🔇 Additional comments (6)
skyvern/evolution/__init__.py (1)
1-1: LGTM!

Standard Python package initialization file. The comment clearly indicates the intentional design.
skyvern/forge/app.py (1)
47-48: Module-level initialization is appropriate here.

The `PROMPT_MANAGER` and `agent` globals are initialized at module level, consistent with the pattern used for other application globals in this file (e.g., `DATABASE`, `BROWSER_MANAGER`). This ensures they're available throughout the application lifecycle.

Note: Ensure that any initialization errors in `PromptManager` (like baseline loading failures) are logged appropriately, as they will occur during module import.

skyvern/forge/prompts/skyvern/evolve-prompt.j2 (1)
1-17: Well-structured evolution template.

The template provides clear guidance for prompt evolution with:
- Role definition for the LLM
- Explicit principles (clarity, role-setting, context, action-oriented, robustness)
- Clear output format instruction (no extra text)
This aligns well with the evolution workflow described in the PR objectives.
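As a rough illustration of the template's job, here is a plain-Python builder for the same kind of meta-prompt. The wording is invented; it is not the contents of `evolve-prompt.j2`, only the principles listed above applied as text:

```python
def build_evolution_prompt(prompt_to_evolve: str) -> str:
    """Compose a meta-prompt asking an LLM to improve a target prompt.

    The principles follow the review's summary; the exact text of the real
    Jinja2 template is not reproduced here.
    """
    principles = [
        "clarity",
        "role-setting",
        "context",
        "action-oriented phrasing",
        "robustness",
    ]
    return (
        "You are an expert prompt engineer. Improve the prompt below, "
        f"keeping these principles in mind: {', '.join(principles)}.\n"
        "Return only the improved prompt, with no extra text.\n\n"
        "--- PROMPT TO EVOLVE ---\n"
        f"{prompt_to_evolve}"
    )
```

The "no extra text" instruction matters because `evolve_prompts` stores the raw LLM response as the new template; any preamble the model emits would be baked into the evolved prompt.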
skyvern/forge/agent.py (1)
1352-1353: Good variable renaming for clarity.

Renaming `action_type` to `action_type_str` before converting it to the enum improves code readability by making the type transformation explicit.
45-74: Scoring logic is simplistic but acceptable for demonstration.

The `evaluate_and_score_prompts` method uses a deterministic heuristic (length and keyword matching) rather than actual benchmark results. The docstring acknowledges this is a simulation, which is appropriate for a proof-of-concept.

For production use, consider replacing this with actual performance metrics from agent runs.
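The heuristic reduces to a pure function, which also surfaces a subtle point worth double-checking in the original: uppercase keywords such as `COMPLETE` are compared against `template.lower()` and so can never match. The sketch below lowercases the keyword list instead:

```python
def score_prompt(template: str) -> float:
    """Length-plus-keyword heuristic mirroring evaluate_and_score_prompts."""
    # Ideal length between 500 and 1500 characters.
    score = 0.5 if 500 <= len(template) <= 1500 else -0.2
    # Lowercased keywords, unlike the original's "COMPLETE"/"TERMINATE",
    # which cannot match a lowercased template.
    keywords = ["action", "reasoning", "complete", "terminate", "element", "goal"]
    lowered = template.lower()
    score += 0.2 * sum(1 for kw in keywords if kw in lowered)
    # Clamp to the simulation's [0, 2] range.
    return max(0.0, min(2.0, score))
```

Making the scorer a pure function of the template string also makes it trivial to unit-test, unlike the in-place loop over `prompt_manager.prompts`.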
33-33: `LLM_API_HANDLER` called with `step=None`.

At evolve.py:33, `step=None` is a placeholder. Verify the handler accepts a null step, or pass a valid step object.
```python
async def main():
    """
    Main function to run the prompt evolution loop.
    """
    LOG.info("Initializing prompt evolution process...")

    prompt_manager = PromptManager()
    evolver = Evolve(prompt_manager)

    # Check if the baseline prompt was loaded correctly
    if not prompt_manager.get_prompt("baseline"):
        LOG.error("Failed to load baseline prompt. Aborting evolution process.")
        return

    LOG.info("Starting evolution loop...")

    # Run the evolution loop for a few generations as a demonstration
    num_generations = 5
    for i in range(num_generations):
        LOG.info(f"--- Generation {i+1}/{num_generations} ---")

        # Evolve the prompts to create new variations
        await evolver.evolve_prompts()

        # Evaluate the performance of the new prompts
        evolver.evaluate_and_score_prompts()

        # Log the best prompt of the current generation
        best_prompt = prompt_manager.get_best_prompt()
        if best_prompt:
            LOG.info(f"Best prompt of generation {i+1}: '{best_prompt.name}' with score {best_prompt.score}")
        else:
            LOG.warning("No prompts in manager after evolution and evaluation.")

        # In a real application, you might add a delay or run this as a continuous background process
        await asyncio.sleep(5)

    LOG.info("Evolution loop finished.")
```
🛠️ Refactor suggestion | 🟠 Major
Add type hints to the main function.
The main() function is missing type hints for its return value, which is required by the coding guidelines.
As per coding guidelines:
```diff
-async def main():
+async def main() -> None:
     """
     Main function to run the prompt evolution loop.
     """
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
async def main() -> None:
    """
    Main function to run the prompt evolution loop.
    """
    LOG.info("Initializing prompt evolution process...")
    prompt_manager = PromptManager()
    evolver = Evolve(prompt_manager)
    # Check if the baseline prompt was loaded correctly
    if not prompt_manager.get_prompt("baseline"):
        LOG.error("Failed to load baseline prompt. Aborting evolution process.")
        return
    LOG.info("Starting evolution loop...")
    # Run the evolution loop for a few generations as a demonstration
    num_generations = 5
    for i in range(num_generations):
        LOG.info(f"--- Generation {i+1}/{num_generations} ---")
        # Evolve the prompts to create new variations
        await evolver.evolve_prompts()
        # Evaluate the performance of the new prompts
        evolver.evaluate_and_score_prompts()
        # Log the best prompt of the current generation
        best_prompt = prompt_manager.get_best_prompt()
        if best_prompt:
            LOG.info(f"Best prompt of generation {i+1}: '{best_prompt.name}' with score {best_prompt.score}")
        else:
            LOG.warning("No prompts in manager after evolution and evaluation.")
        # In a real application, you might add a delay or run this as a continuous background process
        await asyncio.sleep(5)
    LOG.info("Evolution loop finished.")
```
🤖 Prompt for AI Agents
In scripts/run_evolution.py around lines 9 to 46, the async main() function is
missing a return type annotation; update its signature to include an explicit
return type (async def main() -> None:) to satisfy the coding guidelines and
ensure the coroutine is annotated as returning None; no other behavioral changes
are required.
```diff
@@ -0,0 +1,74 @@
+import structlog
+import random
```
🛠️ Refactor suggestion | 🟠 Major
Remove unused import.
The `random` module is imported but never used in this file.

```diff
 import structlog
-import random
 from skyvern.forge.prompts import prompt_engine
```

📝 Committable suggestion
```python
import structlog
from skyvern.forge.prompts import prompt_engine
```
🤖 Prompt for AI Agents
In skyvern/evolution/evolve.py around line 2, the file imports the random module
which is unused; remove the unused import statement (delete or comment out the
"import random" line) to clean up imports and avoid linter warnings.
```python
class Evolve:
    def __init__(self, prompt_manager):
        self.prompt_manager = prompt_manager
        self.evolution_count = 0

    async def evolve_prompts(self):
        """
        Takes the top-performing prompts and uses an LLM to generate new variations.
        """
        best_prompt = self.prompt_manager.get_best_prompt()
        if not best_prompt:
            LOG.warning("No prompts found to evolve.")
            return

        LOG.info(f"Evolving prompt '{best_prompt.name}' with score {best_prompt.score}")

        # Use an LLM to generate a new variation of the prompt.
        evolution_prompt = prompt_engine.load_prompt(
            "evolve-prompt",
            prompt_to_evolve=best_prompt.template,
        )

        # In a real implementation, a 'step' object would be passed here.
        # This is a placeholder for demonstration purposes.
        response = await LLM_API_HANDLER(prompt=evolution_prompt, step=None)

        # Assuming the response is the raw string of the new prompt
        evolved_prompt_str = response if isinstance(response, str) else str(response)

        # Add the new prompt to the population
        self.evolution_count += 1
        new_prompt_name = f"evolved_v{self.evolution_count}"
        self.prompt_manager.add_prompt(new_prompt_name, evolved_prompt_str, score=0)

        LOG.info(f"Evolved new prompt '{new_prompt_name}': {evolved_prompt_str[:100]}...")

    def evaluate_and_score_prompts(self):
        """
        Simulates the evaluation of prompts and updates their scores based on deterministic criteria.
        In a real-world scenario, this would involve running benchmarks.
        """
        LOG.info("Evaluating and scoring prompts...")
        for name, prompt in self.prompt_manager.prompts.items():
            # Skip the baseline prompt as its score is fixed.
            if name == "baseline":
                continue

            score = 0
            # Score based on length (ideal length between 500 and 1500 characters)
            length = len(prompt.template)
            if 500 <= length <= 1500:
                score += 0.5
            else:
                score -= 0.2

            # Score based on presence of keywords
            keywords = ["action", "reasoning", "COMPLETE", "TERMINATE", "element", "goal"]
            for keyword in keywords:
                if keyword in prompt.template.lower():
                    score += 0.2

            # Normalize score to be between 0 and 2 for this simulation
            normalized_score = max(0, min(2, score))

            self.prompt_manager.update_score(name, normalized_score)
            LOG.info(f"Evaluated '{name}', assigned score: {normalized_score}")
```
🛠️ Refactor suggestion | 🟠 Major
Add type hints to the Evolve class.
The entire Evolve class is missing type hints for method parameters and return values, which violates the coding guidelines for Python 3.11+.
As per coding guidelines:
+from typing import TYPE_CHECKING
+
+if TYPE_CHECKING:
+ from skyvern.evolution.prompt_manager import PromptManager
+
class Evolve:
- def __init__(self, prompt_manager):
+ def __init__(self, prompt_manager: "PromptManager") -> None:
self.prompt_manager = prompt_manager
self.evolution_count = 0
- async def evolve_prompts(self):
+ async def evolve_prompts(self) -> None:
"""
Takes the top-performing prompts and uses an LLM to generate new variations.
"""
# ... rest of method
- def evaluate_and_score_prompts(self):
+ def evaluate_and_score_prompts(self) -> None:
"""
Simulates the evaluation of prompts and updates their scores based on deterministic criteria.
In a real-world scenario, this would involve running benchmarks.
"""
# ... rest of method📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| class Evolve: | |
| def __init__(self, prompt_manager): | |
| self.prompt_manager = prompt_manager | |
| self.evolution_count = 0 | |
| async def evolve_prompts(self): | |
| """ | |
| Takes the top-performing prompts and uses an LLM to generate new variations. | |
| """ | |
| best_prompt = self.prompt_manager.get_best_prompt() | |
| if not best_prompt: | |
| LOG.warning("No prompts found to evolve.") | |
| return | |
| LOG.info(f"Evolving prompt '{best_prompt.name}' with score {best_prompt.score}") | |
| # Use an LLM to generate a new variation of the prompt. | |
| evolution_prompt = prompt_engine.load_prompt( | |
| "evolve-prompt", | |
| prompt_to_evolve=best_prompt.template, | |
| ) | |
| # In a real implementation, a 'step' object would be passed here. | |
| # This is a placeholder for demonstration purposes. | |
| response = await LLM_API_HANDLER(prompt=evolution_prompt, step=None) | |
| # Assuming the response is the raw string of the new prompt | |
| evolved_prompt_str = response if isinstance(response, str) else str(response) | |
| # Add the new prompt to the population | |
| self.evolution_count += 1 | |
| new_prompt_name = f"evolved_v{self.evolution_count}" | |
| self.prompt_manager.add_prompt(new_prompt_name, evolved_prompt_str, score=0) | |
| LOG.info(f"Evolved new prompt '{new_prompt_name}': {evolved_prompt_str[:100]}...") | |
| def evaluate_and_score_prompts(self): | |
| """ | |
| Simulates the evaluation of prompts and updates their scores based on deterministic criteria. | |
| In a real-world scenario, this would involve running benchmarks. | |
| """ | |
| LOG.info("Evaluating and scoring prompts...") | |
| for name, prompt in self.prompt_manager.prompts.items(): | |
| # Skip the baseline prompt as its score is fixed. | |
| if name == "baseline": | |
| continue | |
| score = 0 | |
| # Score based on length (ideal length between 500 and 1500 characters) | |
| length = len(prompt.template) | |
| if 500 <= length <= 1500: | |
| score += 0.5 | |
| else: | |
| score -= 0.2 | |
| # Score based on presence of keywords | |
| keywords = ["action", "reasoning", "COMPLETE", "TERMINATE", "element", "goal"] | |
| for keyword in keywords: | |
| if keyword in prompt.template.lower(): | |
| score += 0.2 | |
| # Normalize score to be between 0 and 2 for this simulation | |
| normalized_score = max(0, min(2, score)) | |
| self.prompt_manager.update_score(name, normalized_score) | |
| LOG.info(f"Evaluated '{name}', assigned score: {normalized_score}") | |
```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from skyvern.evolution.prompt_manager import PromptManager


class Evolve:
    def __init__(self, prompt_manager: "PromptManager") -> None:
        self.prompt_manager = prompt_manager
        self.evolution_count = 0

    async def evolve_prompts(self) -> None:
        """
        Takes the top-performing prompts and uses an LLM to generate new variations.
        """
        best_prompt = self.prompt_manager.get_best_prompt()
        if not best_prompt:
            LOG.warning("No prompts found to evolve.")
            return

        LOG.info(f"Evolving prompt '{best_prompt.name}' with score {best_prompt.score}")

        # Use an LLM to generate a new variation of the prompt.
        evolution_prompt = prompt_engine.load_prompt(
            "evolve-prompt",
            prompt_to_evolve=best_prompt.template,
        )
        # In a real implementation, a 'step' object would be passed here.
        # This is a placeholder for demonstration purposes.
        response = await LLM_API_HANDLER(prompt=evolution_prompt, step=None)

        # Assuming the response is the raw string of the new prompt
        evolved_prompt_str = response if isinstance(response, str) else str(response)

        # Add the new prompt to the population
        self.evolution_count += 1
        new_prompt_name = f"evolved_v{self.evolution_count}"
        self.prompt_manager.add_prompt(new_prompt_name, evolved_prompt_str, score=0)
        LOG.info(f"Evolved new prompt '{new_prompt_name}': {evolved_prompt_str[:100]}...")

    def evaluate_and_score_prompts(self) -> None:
        """
        Simulates the evaluation of prompts and updates their scores based on deterministic criteria.
        In a real-world scenario, this would involve running benchmarks.
        """
        LOG.info("Evaluating and scoring prompts...")
        for name, prompt in self.prompt_manager.prompts.items():
            # Skip the baseline prompt as its score is fixed.
            if name == "baseline":
                continue
            score = 0
            # Score based on length (ideal length between 500 and 1500 characters)
            length = len(prompt.template)
            if 500 <= length <= 1500:
                score += 0.5
            else:
                score -= 0.2
            # Score based on presence of keywords
            keywords = ["action", "reasoning", "COMPLETE", "TERMINATE", "element", "goal"]
            for keyword in keywords:
                if keyword in prompt.template.lower():
                    score += 0.2
            # Normalize score to be between 0 and 2 for this simulation
            normalized_score = max(0, min(2, score))
            self.prompt_manager.update_score(name, normalized_score)
            LOG.info(f"Evaluated '{name}', assigned score: {normalized_score}")
```
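The heuristic in `evaluate_and_score_prompts` is easy to exercise in isolation. The sketch below replicates it as a free function (an assumption for demonstration; the PR keeps it as a method on `Evolve`). It also surfaces a quirk worth flagging in review: the uppercase keywords `"COMPLETE"` and `"TERMINATE"` are checked against the lowercased template, so they can never match.

```python
def score_template(template: str) -> float:
    """Replicate the review's heuristic: a length band plus keyword bonuses,
    clamped to the [0, 2] range."""
    score = 0.0
    # Ideal length between 500 and 1500 characters.
    if 500 <= len(template) <= 1500:
        score += 0.5
    else:
        score -= 0.2
    # +0.2 per keyword found in the lowercased template. The uppercase
    # entries ("COMPLETE", "TERMINATE") can never match a lowercased string,
    # so at most four of the six keywords can actually contribute.
    for keyword in ["action", "reasoning", "COMPLETE", "TERMINATE", "element", "goal"]:
        if keyword in template.lower():
            score += 0.2
    return max(0, min(2, score))

short = "Pick an action toward the goal."  # below the length band, two keyword hits
long_enough = (
    "Explain your reasoning, choose an element, and emit an action toward the goal. " * 10
)  # inside the length band, four keyword hits
print(round(score_template(short), 2))
print(round(score_template(long_enough), 2))
```

Lowercasing the keyword list (or comparing case-insensitively on both sides) would make all six contribute as presumably intended.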
🤖 Prompt for AI Agents
In skyvern/evolution/evolve.py around lines 9-74, the Evolve class and its
methods lack Python 3.11+ type hints; add explicit type annotations for the
class attributes and method signatures: annotate __init__ to accept
prompt_manager: "PromptManager" (use a forward reference or import the
PromptManager type), self.evolution_count: int, and self.prompt_manager:
"PromptManager"; annotate async def evolve_prompts(self) -> None and def
evaluate_and_score_prompts(self) -> None; annotate local variables where helpful
(e.g., best_prompt: Optional[Prompt], response: Any, evolved_prompt_str: str,
new_prompt_name: str, score: float, normalized_score: float) and ensure you
import necessary typing items (Optional, Any, Optional["Prompt"] or a Prompt
type, and if needed Coroutine) or reference existing project types; update
function and variable annotations accordingly without changing logic.
```python
class Prompt:
    def __init__(self, name, template, score=0):
        self.name = name
        self.template = template
        self.score = score
```
🛠️ Refactor suggestion | 🟠 Major
Add type hints and class docstring.
The Prompt class is missing type hints for its __init__ parameters and lacks a class-level docstring describing its purpose.
As per coding guidelines, apply this diff:

```diff
 class Prompt:
+    """
+    Represents a prompt template with its associated metadata.
+
+    Attributes:
+        name: Unique identifier for the prompt.
+        template: The Jinja2 template string.
+        score: Performance score for ranking (default: 0).
+    """
-    def __init__(self, name, template, score=0):
+    def __init__(self, name: str, template: str, score: float = 0) -> None:
         self.name = name
         self.template = template
         self.score = score
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
class Prompt:
    """
    Represents a prompt template with its associated metadata.

    Attributes:
        name: Unique identifier for the prompt.
        template: The Jinja2 template string.
        score: Performance score for ranking (default: 0).
    """

    def __init__(self, name: str, template: str, score: float = 0) -> None:
        self.name = name
        self.template = template
        self.score = score
```
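Since the class is a plain record of three fields, the same structure could also be expressed as a dataclass, which generates `__init__`, `__repr__`, and `__eq__` automatically; this is an alternative sketch, not part of the PR:

```python
from dataclasses import dataclass

@dataclass
class Prompt:
    """A named prompt template with a performance score used for ranking."""
    name: str
    template: str
    score: float = 0.0

# The generated __init__ accepts the same positional arguments.
p = Prompt("baseline", "Decide the next {{ action }}.")
print(p)
```

The dataclass form keeps the type annotations the review asks for while removing the boilerplate constructor.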
🤖 Prompt for AI Agents
In skyvern/evolution/prompt_manager.py around lines 7 to 11, the Prompt class
lacks a class-level docstring and type hints; add a concise docstring explaining
that Prompt represents a named prompt template with an associated score,
annotate the class attributes (name: str, template: str, score: int = 0) and
update the __init__ signature to use type hints (def __init__(self, name: str,
template: str, score: int = 0) -> None:) so static type checkers and IDEs can
validate usage.
```python
class PromptManager:
    def __init__(self):
        self.prompts = {}
        self._load_baseline_prompt()

    def _load_baseline_prompt(self):
        """
        Loads the original 'extract-action.j2' prompt as the baseline.
        """
        try:
            # Access the Jinja2 environment from the prompt_engine
            env = prompt_engine.env
            # Construct the path to the template within the Jinja2 environment
            template_path = "skyvern/extract-action.j2"
            # Get the template source from the loader
            baseline_template = env.loader.get_source(env, template_path)[0]

            self.add_prompt("baseline", baseline_template, score=1.0)  # Assuming baseline is good.
            LOG.info("Loaded baseline prompt 'extract-action.j2'.")
        except Exception as e:
            LOG.error(f"Failed to load baseline prompt: {e}", exc_info=True)

    def add_prompt(self, name, template, score=0):
        """
        Adds a new prompt to the population.
        """
        if name in self.prompts:
            LOG.warning(f"Prompt with name '{name}' already exists. Overwriting.")

        self.prompts[name] = Prompt(name, template, score)
        LOG.info(f"Added prompt '{name}' with score {score}.")

    def get_prompt(self, name):
        """
        Retrieves a prompt object by its name.
        """
        return self.prompts.get(name)

    def get_best_prompt(self):
        """
        Returns the prompt with the highest score.
        """
        if not self.prompts:
            return None

        return max(self.prompts.values(), key=lambda p: p.score)

    def update_score(self, name, score):
        """
        Updates the score of a prompt after evaluation.
        """
        if name in self.prompts:
            self.prompts[name].score = score
            LOG.info(f"Updated score for prompt '{name}' to {score}.")
        else:
            LOG.warning(f"Prompt '{name}' not found for score update.")
```
🛠️ Refactor suggestion | 🟠 Major
Add type hints to all methods.
The PromptManager class methods are missing type hints for parameters and return values, which is required by the coding guidelines for Python 3.11+.
As per coding guidelines, apply these changes:

```diff
+from typing import Optional
+
 class PromptManager:
-    def __init__(self):
+    def __init__(self) -> None:
         self.prompts = {}
         self._load_baseline_prompt()

-    def _load_baseline_prompt(self):
+    def _load_baseline_prompt(self) -> None:
         """
         Loads the original 'extract-action.j2' prompt as the baseline.
         """
         # ... rest of method

-    def add_prompt(self, name, template, score=0):
+    def add_prompt(self, name: str, template: str, score: float = 0) -> None:
         """
         Adds a new prompt to the population.
         """
         # ... rest of method

-    def get_prompt(self, name):
+    def get_prompt(self, name: str) -> Optional[Prompt]:
         """
         Retrieves a prompt object by its name.
         """
         return self.prompts.get(name)

-    def get_best_prompt(self):
+    def get_best_prompt(self) -> Optional[Prompt]:
         """
         Returns the prompt with the highest score.
         """
         # ... rest of method

-    def update_score(self, name, score):
+    def update_score(self, name: str, score: float) -> None:
         """
         Updates the score of a prompt after evaluation.
         """
         # ... rest of method
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
from typing import Optional


class PromptManager:
    def __init__(self) -> None:
        self.prompts: dict[str, Prompt] = {}
        self._load_baseline_prompt()

    def _load_baseline_prompt(self) -> None:
        """
        Loads the original 'extract-action.j2' prompt as the baseline.
        """
        try:
            # Access the Jinja2 environment from the prompt_engine
            env = prompt_engine.env
            # Construct the path to the template within the Jinja2 environment
            template_path = "skyvern/extract-action.j2"
            # Get the template source from the loader
            baseline_template = env.loader.get_source(env, template_path)[0]

            self.add_prompt("baseline", baseline_template, score=1.0)  # Assuming baseline is good.
            LOG.info("Loaded baseline prompt 'extract-action.j2'.")
        except Exception as e:
            LOG.error(f"Failed to load baseline prompt: {e}", exc_info=True)

    def add_prompt(self, name: str, template: str, score: float = 0) -> None:
        """
        Adds a new prompt to the population.
        """
        if name in self.prompts:
            LOG.warning(f"Prompt with name '{name}' already exists. Overwriting.")

        self.prompts[name] = Prompt(name, template, score)
        LOG.info(f"Added prompt '{name}' with score {score}.")

    def get_prompt(self, name: str) -> Optional[Prompt]:
        """
        Retrieves a prompt object by its name.
        """
        return self.prompts.get(name)

    def get_best_prompt(self) -> Optional[Prompt]:
        """
        Returns the prompt with the highest score.
        """
        if not self.prompts:
            return None

        return max(self.prompts.values(), key=lambda p: p.score)

    def update_score(self, name: str, score: float) -> None:
        """
        Updates the score of a prompt after evaluation.
        """
        if name in self.prompts:
            self.prompts[name].score = score
            LOG.info(f"Updated score for prompt '{name}' to {score}.")
        else:
            LOG.warning(f"Prompt '{name}' not found for score update.")
```
…concepts in the AlphaEvolve research paper. The system is designed to iteratively improve the prompts used by the Skyvern agent, enhancing its performance and reliability.

The core of this feature is the new `skyvern/evolution` package, which includes:
- `PromptManager`: A class for managing a population of prompts and their performance scores.
- `Evolve`: A class that uses an LLM to generate new variations of prompts based on their performance.
- `evolve-prompt.j2`: A new prompt template to guide the LLM in the evolution process.

A new script, `scripts/run_evolution.py`, has been added to orchestrate the evolution loop, allowing for continuous improvement of the prompts.

The `ForgeAgent` has been integrated with the `PromptManager` to dynamically use the best-performing prompt for its tasks. This creates a feedback loop where the agent's performance can be improved over time by evolving the prompts it uses.

🧬 This PR introduces a prompt evolution system inspired by AlphaEvolve research that automatically improves Skyvern agent prompts through iterative LLM-based evolution and performance scoring. The system creates a feedback loop where the best-performing prompts are selected and evolved to generate better variations over time.
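The select-mutate-evaluate cycle that `scripts/run_evolution.py` orchestrates can be approximated end to end. In the sketch below, `StubManager` and `StubEvolver` are hypothetical stand-ins for the PR's `PromptManager` and `Evolve` (no LLM call, a toy length-based score), so only the shape of the loop matches the real script:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Prompt:
    name: str
    template: str
    score: float = 0.0

class StubManager:
    """Stand-in for PromptManager: just enough state for the loop."""
    def __init__(self) -> None:
        self.prompts = {"baseline": Prompt("baseline", "original", 1.0)}

    def get_best_prompt(self) -> Prompt:
        return max(self.prompts.values(), key=lambda p: p.score)

class StubEvolver:
    """Stand-in for Evolve: 'mutates' by appending to the best template."""
    def __init__(self, manager: StubManager) -> None:
        self.manager = manager
        self.count = 0

    async def evolve_prompts(self) -> None:
        self.count += 1
        parent = self.manager.get_best_prompt()
        name = f"evolved_v{self.count}"
        self.manager.prompts[name] = Prompt(name, parent.template + " + tweak")

    def evaluate_and_score_prompts(self) -> None:
        for p in self.manager.prompts.values():
            if p.name != "baseline":
                p.score = 0.1 * len(p.template)  # toy heuristic, not the PR's

async def run_evolution(generations: int = 3) -> str:
    manager = StubManager()
    evolver = StubEvolver(manager)
    for _ in range(generations):
        await evolver.evolve_prompts()        # LLM would propose a variant here
        evolver.evaluate_and_score_prompts()  # scoring promotes good variants
    return manager.get_best_prompt().name

print(asyncio.run(run_evolution()))
```

Each generation evolves from the current best, so improvements compound: every variant is seeded by the highest-scoring prompt found so far.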
🔍 Detailed Analysis
Key Changes
- New `skyvern/evolution/` package with a `PromptManager` for managing prompt populations and an `Evolve` class for LLM-based prompt generation
- Updated `ForgeAgent` to dynamically use the best-performing prompt from the evolution system instead of static templates
- New script `scripts/run_evolution.py` to orchestrate the evolution loop with configurable generations
- New `evolve-prompt.j2` template to guide the LLM in generating improved prompt variations
- Integrated the `PromptManager` into the main application state via `app.py`

Technical Implementation
```mermaid
flowchart TD
    A[PromptManager] --> B[Load Baseline Prompt]
    B --> C[Evolve Class]
    C --> D[Generate Variations via LLM]
    D --> E[Evaluate & Score Prompts]
    E --> F[Select Best Prompt]
    F --> G[ForgeAgent Uses Best Prompt]
    G --> H[Performance Feedback]
    H --> C
    I[run_evolution.py] --> J[Evolution Loop]
    J --> C
```

Impact
Created with Palmier
Important
Introduces a prompt evolution system to improve Skyvern agent performance by managing and evolving prompts using LLM.
- Adds `skyvern/evolution` package with `PromptManager` and `Evolve` classes for managing and evolving prompts.
- Adds `evolve-prompt.j2` template for guiding LLM in prompt evolution.
- Adds `run_evolution.py` to run the prompt evolution loop.
- Integrates `PromptManager` with `ForgeAgent` to use best-performing prompts.
- `PromptManager`: Manages prompt population and scores.
- `Evolve`: Generates new prompt variations using LLM.
- `agent.py`: Modifies `_build_extract_action_prompt()` to use evolved prompts.
- `app.py`: Initializes `PROMPT_MANAGER` for prompt management.

This description was created by
for 6bd63e6. You can customize this summary. It will automatically update as commits are pushed.
Summary by CodeRabbit