@computer2s computer2s commented Oct 8, 2025

…concepts in the AlphaEvolve research paper. The system is designed to iteratively improve the prompts used by the Skyvern agent, enhancing its performance and reliability.

The core of this feature is the new `skyvern/evolution` package, which includes:

  • `PromptManager`: A class for managing a population of prompts and their performance scores.
  • `Evolve`: A class that uses an LLM to generate new variations of prompts based on their performance.
  • `evolve-prompt.j2`: A new prompt template to guide the LLM in the evolution process.

A new script, `scripts/run_evolution.py`, has been added to orchestrate the evolution loop, allowing for continuous improvement of the prompts.

The `ForgeAgent` has been integrated with the `PromptManager` to dynamically use the best-performing prompt for its tasks. This creates a feedback loop where the agent's performance can be improved over time by evolving the prompts it uses.


🧬 This PR introduces a prompt evolution system inspired by AlphaEvolve research that automatically improves Skyvern agent prompts through iterative LLM-based evolution and performance scoring. The system creates a feedback loop where the best-performing prompts are selected and evolved to generate better variations over time.

🔍 Detailed Analysis

Key Changes

  • New Evolution Package: Added skyvern/evolution/ with PromptManager for managing prompt populations and Evolve class for LLM-based prompt generation
  • Agent Integration: Modified ForgeAgent to dynamically use the best-performing prompt from the evolution system instead of static templates
  • Evolution Script: Created scripts/run_evolution.py to orchestrate the evolution loop with configurable generations
  • Evolution Template: Added evolve-prompt.j2 template to guide LLM in generating improved prompt variations
  • Global State: Integrated PromptManager into the main application state via app.py

Technical Implementation

```mermaid
flowchart TD
    A[PromptManager] --> B[Load Baseline Prompt]
    B --> C[Evolve Class]
    C --> D[Generate Variations via LLM]
    D --> E[Evaluate & Score Prompts]
    E --> F[Select Best Prompt]
    F --> G[ForgeAgent Uses Best Prompt]
    G --> H[Performance Feedback]
    H --> C

    I[run_evolution.py] --> J[Evolution Loop]
    J --> C
```

Impact

  • Performance Improvement: Continuous optimization of agent prompts based on performance metrics and LLM-generated improvements
  • Adaptive System: Agent behavior evolves over time, potentially handling edge cases and scenarios better than static prompts
  • Research Integration: Implements cutting-edge prompt engineering techniques from academic research in a production system
  • Backward Compatibility: Maintains fallback to original templates if evolution system fails, ensuring system reliability
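The fallback chain behind the Backward Compatibility point (evolved prompt, then baseline, then static template name) can be sketched roughly as below. The helper name `select_template` is hypothetical; only the `PromptManager` method names and the `extract-action` template name come from the PR.

```python
def select_template(prompt_manager, default_template_name: str = "extract-action"):
    """Illustrative fallback chain: evolved prompt -> baseline -> named template.

    Returns a (kind, value) pair: ("string", raw_template) to render from a raw
    string, or ("name", template_name) to render a template file by name.
    """
    best = prompt_manager.get_best_prompt()
    if best is not None:
        # Preferred path: render the best evolved prompt from its raw string.
        return ("string", best.template)
    baseline = prompt_manager.get_prompt("baseline")
    if baseline is not None:
        # No evolved prompts yet: fall back to the baseline template string.
        return ("string", baseline.template)
    # Last resort: let the prompt engine load the static template file by name.
    return ("name", default_template_name)
```

Keeping the decision in one helper also addresses the reviewer note about refactoring the repeated fallback logic in `agent.py`.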

Created with Palmier


Important

Introduces a prompt evolution system to improve Skyvern agent performance by managing and evolving prompts using LLM.

  • Behavior:
    • Introduces skyvern/evolution package with PromptManager and Evolve classes for managing and evolving prompts.
    • Adds evolve-prompt.j2 template for guiding LLM in prompt evolution.
    • New script run_evolution.py to run the prompt evolution loop.
    • Integrates PromptManager with ForgeAgent to use best-performing prompts.
  • Classes:
    • PromptManager: Manages prompt population and scores.
    • Evolve: Generates new prompt variations using LLM.
  • Files:
    • agent.py: Modifies _build_extract_action_prompt() to use evolved prompts.
    • app.py: Initializes PROMPT_MANAGER for prompt management.

This description was created by Ellipsis for 6bd63e6. You can customize this summary. It will automatically update as commits are pushed.

Summary by CodeRabbit

  • New Features
    • Adaptive prompt evolution that iteratively generates and scores improved prompts, automatically used by the agent when available to enhance task accuracy.
    • Added a command to run a prompt-evolution loop for generating better prompts.
    • New prompt template for guiding robust prompt evolution.
  • Refactor
    • Agent now renders prompts from either evolved raw templates or named templates with safer fallbacks, improving reliability.


@coderabbitai coderabbitai bot commented Oct 8, 2025

Walkthrough

Adds a prompt-evolution subsystem: a PromptManager to load/manage prompts, an Evolve class to generate/evaluate evolved prompts via an LLM and a scoring heuristic, a script to run multi-generation evolution, a new evolve-prompt template, and ForgeAgent changes to prefer evolved prompts at runtime with app-level globals initialized.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Evolution package**<br>`skyvern/evolution/__init__.py`, `skyvern/evolution/evolve.py`, `skyvern/evolution/prompt_manager.py` | New package for prompt evolution: `PromptManager` loads the baseline and manages prompts; `Evolve` asynchronously creates evolved prompts via LLM and synchronously scores them; package init added. |
| **Forge integration**<br>`skyvern/forge/agent.py`, `skyvern/forge/app.py` | Agent now selects evolved prompt strings when available, otherwise falls back to named templates; adds global `PROMPT_MANAGER` and agent initialization in the app module. |
| **Templates**<br>`skyvern/forge/prompts/skyvern/evolve-prompt.j2` | New Jinja2 template guiding the LLM to produce an improved prompt given a target prompt. |
| **Evolution runner script**<br>`scripts/run_evolution.py` | New async script orchestrating 5 evolution generations: evolve, evaluate/score, log the best prompt, and sleep between iterations. |
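Condensed, the runner script's control flow looks like the sketch below. The `run_evolution` helper name, its parameters, and the try/except (which a review comment later asks for) are illustrative assumptions, not the script's exact code.

```python
import asyncio


async def run_evolution(evolver, prompt_manager, num_generations: int = 5, delay_s: float = 0.0) -> None:
    """Sketch of a multi-generation evolution loop (illustrative, not the PR's script)."""
    for gen in range(1, num_generations + 1):
        try:
            await evolver.evolve_prompts()        # generate new variations via the LLM
            evolver.evaluate_and_score_prompts()  # score every prompt in the population
        except Exception as exc:  # sketch only; real code should narrow this
            # Per the review feedback: one failed generation should not kill the loop.
            print(f"generation {gen} failed: {exc}")
            continue
        best = prompt_manager.get_best_prompt()
        if best:
            print(f"generation {gen}: best={best.name} score={best.score}")
        # Pacing between generations (the PR hard-codes 5 seconds).
        await asyncio.sleep(delay_s)
```

Exposing `num_generations` and `delay_s` as parameters is the configurability the nitpick comments below ask for.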

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  actor User as CLI User
  participant Script as run_evolution.py
  participant PM as PromptManager
  participant Evo as Evolve
  participant LLM as LLM_API_HANDLER
  participant PE as prompt_engine

  User->>Script: python scripts/run_evolution.py
  Script->>PM: init() and load baseline
  Script->>Evo: init(PromptManager)
  loop generations (x5)
    Script->>Evo: evolve_prompts()
    Evo->>PM: get_best_prompt()
    alt best prompt exists
      Evo->>PE: load_prompt("evolve-prompt", target=best.template)
      Evo->>LLM: generate evolved prompt (input: evolution prompt)
      LLM-->>Evo: evolved prompt text
      Evo->>PM: add_prompt(name=evolved_vN, template=..., score=0)
    else
      Evo->>Evo: log warning and return
    end
    Script->>Evo: evaluate_and_score_prompts()
    Evo->>PM: update_score(...) per prompt
    Script->>PM: get_best_prompt() and log
    Script->>Script: asyncio.sleep()
  end
```
```mermaid
sequenceDiagram
  autonumber
  participant Forge as ForgeAgent
  participant PM as PromptManager
  participant PE as prompt_engine
  participant Render as Renderer

  Forge->>PM: get_best_prompt()
  alt evolved prompt available
    Forge->>PE: load_prompt_from_string(best.template)
  else
    Forge->>PE: load_prompt(".../extract-action.j2" or mapped template)
  end
  PE-->>Render: compiled template
  Render-->>Forge: rendered prompt
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A nibble of bytes, a hop through the night,
I evolve little prompts by moon’s silver light.
Score them, store them—carrot marks out of two,
Forge finds the best for the work it must do.
Thump-thump! New lines sprout—so crisp, so bright.

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Title Check ⚠️ Warning: The title "This change introduces a new prompt evolution system inspired by the …" is overly verbose, uses filler language and an ellipsis, and does not succinctly identify the core change. Resolution: rewrite the title to be concise and specific, for example "Add prompt evolution system with PromptManager and Evolve modules".
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 61.54%, below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.

@ellipsis-dev ellipsis-dev bot left a comment

Caution

Changes requested ❌

Reviewed everything up to 6bd63e6 in 2 minutes and 26 seconds. Click for details.
  • Reviewed 361 lines of code in 7 files
  • Skipped 0 files when reviewing.
  • Skipped posting 2 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. scripts/run_evolution.py:52
  • Draft comment:
    End the file with a newline for POSIX compliance.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%. While this is a real issue, it's very minor and would likely be caught by linters or formatters. Most modern IDEs automatically add trailing newlines. The comment doesn't suggest a critical code change that would affect functionality. The issue is real and technically correct: missing newlines can cause problems with some UNIX tools and are considered bad form. However, this is exactly the kind of minor, obvious issue that should be handled by automated tools rather than manual review comments. This comment should be removed as it's too minor and would be better handled by automated tooling.
2. skyvern/forge/agent.py:1325
  • Draft comment:
    Consider refactoring the prompt selection fallback logic into a helper to improve clarity and avoid repetition.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.

Workflow ID: wflow_NTbPeRaXsFZ9OvJq

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

@@ -0,0 +1,74 @@
import structlog
import random

Remove unused import 'random' if it's not used.

Suggested change
import random


# In a real implementation, a 'step' object would be passed here.
# This is a placeholder for demonstration purposes.
response = await LLM_API_HANDLER(prompt=evolution_prompt, step=None)

Consider adding error handling around the LLM_API_HANDLER call to catch unexpected failures.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (8)
skyvern/evolution/prompt_manager.py (2)

30-31: Use structured logging instead of f-strings.

These log statements use f-strings, but structlog supports structured logging with keyword arguments that provide better machine-readability and context.

Based on learnings, apply these changes:

-            self.add_prompt("baseline", baseline_template, score=1.0) # Assuming baseline is good.
-            LOG.info("Loaded baseline prompt 'extract-action.j2'.")
+            self.add_prompt("baseline", baseline_template, score=1.0)
+            LOG.info("Loaded baseline prompt", template_name="extract-action.j2")
-            LOG.error(f"Failed to load baseline prompt: {e}", exc_info=True)
+            LOG.error("Failed to load baseline prompt", error=str(e), exc_info=True)
-            LOG.warning(f"Prompt with name '{name}' already exists. Overwriting.")
+            LOG.warning("Prompt already exists, overwriting", name=name)
-        LOG.info(f"Added prompt '{name}' with score {score}.")
+        LOG.info("Added prompt", name=name, score=score)
-            LOG.info(f"Updated score for prompt '{name}' to {score}.")
+            LOG.info("Updated prompt score", name=name, score=score)
-            LOG.warning(f"Prompt '{name}' not found for score update.")
+            LOG.warning("Prompt not found for score update", name=name)

Also applies to: 40-40, 43-43, 66-66, 68-68


32-33: Narrow the exception handling.

Catching all exceptions with a bare except Exception is too broad and may hide unexpected errors. Consider catching specific exceptions like jinja2.TemplateNotFound or OSError.

+        from jinja2 import TemplateNotFound
+        
         try:
             # Access the Jinja2 environment from the prompt_engine
             env = prompt_engine.env
             # Construct the path to the template within the Jinja2 environment
             template_path = "skyvern/extract-action.j2"
             # Get the template source from the loader
             baseline_template = env.loader.get_source(env, template_path)[0]
 
-            self.add_prompt("baseline", baseline_template, score=1.0) # Assuming baseline is good.
+            self.add_prompt("baseline", baseline_template, score=1.0)
             LOG.info("Loaded baseline prompt 'extract-action.j2'.")
-        except Exception as e:
-            LOG.error(f"Failed to load baseline prompt: {e}", exc_info=True)
+        except (TemplateNotFound, OSError) as e:
+            LOG.error("Failed to load baseline prompt", error=str(e), exc_info=True)
scripts/run_evolution.py (3)

26-26: Consider making the number of generations configurable.

The hard-coded value num_generations = 5 limits flexibility for experimentation or production use.

Consider adding a command-line argument or environment variable:

+import os
+
 async def main():
     """
     Main function to run the prompt evolution loop.
     """
     LOG.info("Initializing prompt evolution process...")
 
     prompt_manager = PromptManager()
     evolver = Evolve(prompt_manager)
 
     # Check if the baseline prompt was loaded correctly
     if not prompt_manager.get_prompt("baseline"):
         LOG.error("Failed to load baseline prompt. Aborting evolution process.")
         return
 
     LOG.info("Starting evolution loop...")
 
     # Run the evolution loop for a few generations as a demonstration
-    num_generations = 5
+    num_generations = int(os.getenv("EVOLUTION_GENERATIONS", "5"))
     for i in range(num_generations):

30-34: Add error handling for evolution steps.

The evolution loop lacks error handling for potential failures in evolve_prompts() or evaluate_and_score_prompts(). If either fails, the entire loop stops without useful diagnostics.

     for i in range(num_generations):
         LOG.info(f"--- Generation {i+1}/{num_generations} ---")
 
-        # Evolve the prompts to create new variations
-        await evolver.evolve_prompts()
-
-        # Evaluate the performance of the new prompts
-        evolver.evaluate_and_score_prompts()
+        try:
+            # Evolve the prompts to create new variations
+            await evolver.evolve_prompts()
+
+            # Evaluate the performance of the new prompts
+            evolver.evaluate_and_score_prompts()
+        except Exception:
+            LOG.exception("Evolution step failed", generation=i+1)
+            continue

44-44: Document the purpose of the sleep delay.

The 5-second sleep between generations is not explained. Consider documenting why this delay is necessary or making it configurable.

-        # In a real application, you might add a delay or run this as a continuous background process
-        await asyncio.sleep(5)
+        # Brief pause between generations to avoid overwhelming the LLM API
+        # In production, this could be removed or adjusted based on rate limits
+        await asyncio.sleep(int(os.getenv("EVOLUTION_DELAY_SECONDS", "5")))
skyvern/forge/agent.py (2)

1320-1341: Clarify the fallback chain for general tasks.

The fallback logic for general tasks (evolved prompt → baseline → template name) is reasonable, but the "critical error" log message on Line 1339 doesn't immediately raise an exception. This could lead to confusion about whether execution continues.

Consider making the error handling more explicit:

         if task_type == TaskType.general:
             # For general tasks, try to use the best prompt from our evolution manager.
             best_prompt = app.PROMPT_MANAGER.get_best_prompt()
             if best_prompt:
-                LOG.info(f"Using evolved prompt: {best_prompt.name} with score {best_prompt.score}")
+                LOG.info("Using evolved prompt", prompt_name=best_prompt.name, score=best_prompt.score)
                 template_str = best_prompt.template
             else:
                 # If no evolved prompts, fall back to the baseline prompt.
                 LOG.warning("PromptManager has no prompts. Falling back to baseline 'extract-action'.")
                 baseline_prompt = app.PROMPT_MANAGER.get_prompt("baseline")
                 if baseline_prompt:
                     template_str = baseline_prompt.template
                 else:
                     # If even the baseline is missing, this is a critical error.
-                    LOG.error("Baseline prompt could not be loaded from PromptManager.")
-                    # As a last resort, use the template name.
+                    LOG.critical("Baseline prompt could not be loaded from PromptManager, using template name fallback")
+                    # As a last resort, use the template name (may fail if template file is missing)
                     template_name = "extract-action"

1382-1396: Add defensive check before rendering.

The code assumes at least one of template_str or template_name is set, but this isn't guaranteed if the logic above changes. Adding an explicit check after Line 1380 would make the code more robust.

             complete_criterion=task.complete_criterion,
             terminate_criterion=task.terminate_criterion,
         )
 
+        # Ensure at least one rendering path is available
+        if template_str is None and template_name is None:
+            raise UnsupportedTaskType(task_type=task_type)
+
         if template_str is not None:
             # Render the prompt from a raw string (used for evolved prompts)
             return prompt_engine.load_prompt_from_string(
                 template=template_str,
                 **render_kwargs,
             )
 
         if template_name is not None:
             # Render the prompt from a template file by name (standard behavior)
             return prompt_engine.load_prompt(
                 template=template_name,
                 **render_kwargs,
             )
 
-        raise UnsupportedTaskType(task_type=task_type)
skyvern/evolution/evolve.py (1)

23-23: Use structured logging instead of f-strings.

These log statements use f-strings for interpolation. Structlog supports structured logging with keyword arguments for better machine-readability.

Based on learnings:

-        LOG.info(f"Evolving prompt '{best_prompt.name}' with score {best_prompt.score}")
+        LOG.info("Evolving prompt", prompt_name=best_prompt.name, score=best_prompt.score)
-        LOG.info(f"Evolved new prompt '{new_prompt_name}': {evolved_prompt_str[:100]}...")
+        LOG.info("Evolved new prompt", prompt_name=new_prompt_name, preview=evolved_prompt_str[:100])
-            LOG.info(f"Evaluated '{name}', assigned score: {normalized_score}")
+            LOG.info("Evaluated prompt", prompt_name=name, score=normalized_score)

Also applies to: 43-43, 74-74

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0c3b548 and 6bd63e6.

📒 Files selected for processing (7)
  • scripts/run_evolution.py (1 hunks)
  • skyvern/evolution/__init__.py (1 hunks)
  • skyvern/evolution/evolve.py (1 hunks)
  • skyvern/evolution/prompt_manager.py (1 hunks)
  • skyvern/forge/agent.py (3 hunks)
  • skyvern/forge/app.py (2 hunks)
  • skyvern/forge/prompts/skyvern/evolve-prompt.j2 (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
{skyvern,integrations,alembic,scripts}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

{skyvern,integrations,alembic,scripts}/**/*.py: Use Python 3.11+ features and add type hints throughout the codebase
Follow PEP 8 with a maximum line length of 100 characters
Use absolute imports for all Python modules
Document all public functions and classes with Google-style docstrings
Use snake_case for variables and functions, and PascalCase for classes
Prefer async/await over callbacks in asynchronous code
Use asyncio for concurrency
Always handle exceptions in async code
Use context managers for resource cleanup
Use specific exception classes
Include meaningful error messages when raising or logging exceptions
Log errors with appropriate severity levels
Never expose sensitive information in error messages

Files:

  • skyvern/evolution/__init__.py
  • skyvern/evolution/evolve.py
  • skyvern/forge/agent.py
  • scripts/run_evolution.py
  • skyvern/evolution/prompt_manager.py
  • skyvern/forge/app.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Python code must be linted and formatted with Ruff
Use type hints throughout Python code
Prefer async/await for asynchronous Python code
Enforce a maximum line length of 120 characters in Python files

Files:

  • skyvern/evolution/__init__.py
  • skyvern/evolution/evolve.py
  • skyvern/forge/agent.py
  • scripts/run_evolution.py
  • skyvern/evolution/prompt_manager.py
  • skyvern/forge/app.py
skyvern/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Type-check Python code in the skyvern/ package with mypy

Files:

  • skyvern/evolution/__init__.py
  • skyvern/evolution/evolve.py
  • skyvern/forge/agent.py
  • skyvern/evolution/prompt_manager.py
  • skyvern/forge/app.py
🧬 Code graph analysis (4)
skyvern/evolution/evolve.py (2)
skyvern/evolution/prompt_manager.py (3)
  • get_best_prompt (51-58)
  • add_prompt (35-43)
  • update_score (60-68)
skyvern/forge/sdk/prompting.py (1)
  • load_prompt (89-111)
skyvern/forge/agent.py (6)
skyvern/forge/sdk/db/enums.py (1)
  • TaskType (8-11)
skyvern/evolution/prompt_manager.py (2)
  • get_best_prompt (51-58)
  • get_prompt (45-49)
skyvern/webeye/actions/actions.py (1)
  • ActionType (13-37)
skyvern/exceptions.py (2)
  • UnsupportedActionType (322-324)
  • UnsupportedTaskType (545-547)
skyvern/forge/sdk/core/skyvern_context.py (1)
  • ensure_context (43-56)
skyvern/forge/sdk/prompting.py (2)
  • load_prompt_from_string (113-134)
  • load_prompt (89-111)
scripts/run_evolution.py (2)
skyvern/evolution/evolve.py (3)
  • Evolve (9-74)
  • evolve_prompts (14-43)
  • evaluate_and_score_prompts (45-74)
skyvern/evolution/prompt_manager.py (3)
  • PromptManager (13-68)
  • get_prompt (45-49)
  • get_best_prompt (51-58)
skyvern/forge/app.py (1)
skyvern/evolution/prompt_manager.py (1)
  • PromptManager (13-68)
🔇 Additional comments (6)
skyvern/evolution/__init__.py (1)

1-1: LGTM!

Standard Python package initialization file. The comment clearly indicates the intentional design.

skyvern/forge/app.py (1)

47-48: Module-level initialization is appropriate here.

The PROMPT_MANAGER and agent are initialized at module level, consistent with the pattern used for other application globals in this file (e.g., DATABASE, BROWSER_MANAGER). This ensures they're available throughout the application lifecycle.

Note: Ensure that any initialization errors in PromptManager (like baseline loading failures) are logged appropriately, as they will occur during module import.

skyvern/forge/prompts/skyvern/evolve-prompt.j2 (1)

1-17: Well-structured evolution template.

The template provides clear guidance for prompt evolution with:

  • Role definition for the LLM
  • Explicit principles (clarity, role-setting, context, action-oriented, robustness)
  • Clear output format instruction (no extra text)

This aligns well with the evolution workflow described in the PR objectives.
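As an illustration, a template along these lines would satisfy the description above. The wording below is a hypothetical sketch, not the shipped `evolve-prompt.j2`; only the `prompt_to_evolve` variable name is taken from the PR's `Evolve` code.

```jinja
You are an expert prompt engineer for a web-automation agent.

Rewrite the prompt below to be clearer and more effective. Follow these principles:
- Clarity: remove ambiguity and redundant instructions.
- Role-setting: state the agent's role explicitly.
- Context: preserve all required context and template variables.
- Action-oriented: keep instructions focused on concrete actions.
- Robustness: handle edge cases gracefully.

Return ONLY the improved prompt text, with no extra commentary.

--- PROMPT TO EVOLVE ---
{{ prompt_to_evolve }}
```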

skyvern/forge/agent.py (1)

1352-1353: Good variable renaming for clarity.

Renaming action_type to action_type_str before converting it to the enum improves code readability by making the type transformation explicit.

skyvern/evolution/evolve.py (2)

45-74: Scoring logic is simplistic but acceptable for demonstration.

The evaluate_and_score_prompts method uses a deterministic heuristic (length and keyword matching) rather than actual benchmark results. The docstring acknowledges this is a simulation, which is appropriate for a proof-of-concept.

For production use, consider replacing this with actual performance metrics from agent runs.


33-33: LLM_API_HANDLER called with step=None. At evolve.py:33, step=None is a placeholder; verify that the handler accepts a null step, or pass a valid step object.

Comment on lines +9 to +46
```python
async def main():
    """
    Main function to run the prompt evolution loop.
    """
    LOG.info("Initializing prompt evolution process...")

    prompt_manager = PromptManager()
    evolver = Evolve(prompt_manager)

    # Check if the baseline prompt was loaded correctly
    if not prompt_manager.get_prompt("baseline"):
        LOG.error("Failed to load baseline prompt. Aborting evolution process.")
        return

    LOG.info("Starting evolution loop...")

    # Run the evolution loop for a few generations as a demonstration
    num_generations = 5
    for i in range(num_generations):
        LOG.info(f"--- Generation {i+1}/{num_generations} ---")

        # Evolve the prompts to create new variations
        await evolver.evolve_prompts()

        # Evaluate the performance of the new prompts
        evolver.evaluate_and_score_prompts()

        # Log the best prompt of the current generation
        best_prompt = prompt_manager.get_best_prompt()
        if best_prompt:
            LOG.info(f"Best prompt of generation {i+1}: '{best_prompt.name}' with score {best_prompt.score}")
        else:
            LOG.warning("No prompts in manager after evolution and evaluation.")

        # In a real application, you might add a delay or run this as a continuous background process
        await asyncio.sleep(5)

    LOG.info("Evolution loop finished.")
```

🛠️ Refactor suggestion | 🟠 Major

Add type hints to the main function.

The main() function is missing type hints for its return value, which is required by the coding guidelines.

As per coding guidelines:

-async def main():
+async def main() -> None:
     """
     Main function to run the prompt evolution loop.
     """

@@ -0,0 +1,74 @@
import structlog
import random

🛠️ Refactor suggestion | 🟠 Major

Remove unused import.

The random module is imported but never used in this file.

 import structlog
-import random
 
 from skyvern.forge.prompts import prompt_engine

Comment on lines +9 to +74

```python
class Evolve:
    def __init__(self, prompt_manager):
        self.prompt_manager = prompt_manager
        self.evolution_count = 0

    async def evolve_prompts(self):
        """
        Takes the top-performing prompts and uses an LLM to generate new variations.
        """
        best_prompt = self.prompt_manager.get_best_prompt()
        if not best_prompt:
            LOG.warning("No prompts found to evolve.")
            return

        LOG.info(f"Evolving prompt '{best_prompt.name}' with score {best_prompt.score}")

        # Use an LLM to generate a new variation of the prompt.
        evolution_prompt = prompt_engine.load_prompt(
            "evolve-prompt",
            prompt_to_evolve=best_prompt.template,
        )

        # In a real implementation, a 'step' object would be passed here.
        # This is a placeholder for demonstration purposes.
        response = await LLM_API_HANDLER(prompt=evolution_prompt, step=None)

        # Assuming the response is the raw string of the new prompt
        evolved_prompt_str = response if isinstance(response, str) else str(response)

        # Add the new prompt to the population
        self.evolution_count += 1
        new_prompt_name = f"evolved_v{self.evolution_count}"
        self.prompt_manager.add_prompt(new_prompt_name, evolved_prompt_str, score=0)

        LOG.info(f"Evolved new prompt '{new_prompt_name}': {evolved_prompt_str[:100]}...")

    def evaluate_and_score_prompts(self):
        """
        Simulates the evaluation of prompts and updates their scores based on deterministic criteria.
        In a real-world scenario, this would involve running benchmarks.
        """
        LOG.info("Evaluating and scoring prompts...")
        for name, prompt in self.prompt_manager.prompts.items():
            # Skip the baseline prompt as its score is fixed.
            if name == "baseline":
                continue

            score = 0
            # Score based on length (ideal length between 500 and 1500 characters)
            length = len(prompt.template)
            if 500 <= length <= 1500:
                score += 0.5
            else:
                score -= 0.2

            # Score based on presence of keywords
            keywords = ["action", "reasoning", "COMPLETE", "TERMINATE", "element", "goal"]
            for keyword in keywords:
                if keyword in prompt.template.lower():
                    score += 0.2

            # Normalize score to be between 0 and 2 for this simulation
            normalized_score = max(0, min(2, score))

            self.prompt_manager.update_score(name, normalized_score)
            LOG.info(f"Evaluated '{name}', assigned score: {normalized_score}")
```
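The length-and-keyword heuristic can be exercised as a standalone function. One thing worth flagging: as written, the uppercase keywords "COMPLETE" and "TERMINATE" can never match `prompt.template.lower()`. The sketch below lowercases both sides (an assumed fix, not part of the PR diff):

```python
def score_prompt(template: str) -> float:
    """Deterministic stand-in for benchmark-based prompt evaluation."""
    score = 0.0
    # Ideal length between 500 and 1500 characters
    if 500 <= len(template) <= 1500:
        score += 0.5
    else:
        score -= 0.2
    keywords = ["action", "reasoning", "COMPLETE", "TERMINATE", "element", "goal"]
    for keyword in keywords:
        # Lowercase both sides; the PR lowercases only the template, so the
        # uppercase keywords would otherwise never match.
        if keyword.lower() in template.lower():
            score += 0.2
    # Clamp to [0, 2] for this simulation
    return max(0.0, min(2.0, score))


sample = "Decide the next action with reasoning toward the goal. " * 12
print(round(score_prompt(sample), 2))  # 1.1
```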

🛠️ Refactor suggestion | 🟠 Major

Add type hints to the Evolve class.

The entire Evolve class is missing type hints for method parameters and return values, which violates the coding guidelines for Python 3.11+.

As per coding guidelines:

```diff
+from typing import TYPE_CHECKING
+
+if TYPE_CHECKING:
+    from skyvern.evolution.prompt_manager import PromptManager
+
 class Evolve:
-    def __init__(self, prompt_manager):
+    def __init__(self, prompt_manager: "PromptManager") -> None:
         self.prompt_manager = prompt_manager
         self.evolution_count = 0
 
-    async def evolve_prompts(self):
+    async def evolve_prompts(self) -> None:
         """
         Takes the top-performing prompts and uses an LLM to generate new variations.
         """
         # ... rest of method
 
-    def evaluate_and_score_prompts(self):
+    def evaluate_and_score_prompts(self) -> None:
         """
         Simulates the evaluation of prompts and updates their scores based on deterministic criteria.
         In a real-world scenario, this would involve running benchmarks.
         """
         # ... rest of method
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from skyvern.evolution.prompt_manager import PromptManager


class Evolve:
    def __init__(self, prompt_manager: "PromptManager") -> None:
        self.prompt_manager = prompt_manager
        self.evolution_count = 0

    async def evolve_prompts(self) -> None:
        """
        Takes the top-performing prompts and uses an LLM to generate new variations.
        """
        best_prompt = self.prompt_manager.get_best_prompt()
        if not best_prompt:
            LOG.warning("No prompts found to evolve.")
            return

        LOG.info(f"Evolving prompt '{best_prompt.name}' with score {best_prompt.score}")

        # Use an LLM to generate a new variation of the prompt.
        evolution_prompt = prompt_engine.load_prompt(
            "evolve-prompt",
            prompt_to_evolve=best_prompt.template,
        )

        # In a real implementation, a 'step' object would be passed here.
        # This is a placeholder for demonstration purposes.
        response = await LLM_API_HANDLER(prompt=evolution_prompt, step=None)

        # Assuming the response is the raw string of the new prompt
        evolved_prompt_str = response if isinstance(response, str) else str(response)

        # Add the new prompt to the population
        self.evolution_count += 1
        new_prompt_name = f"evolved_v{self.evolution_count}"
        self.prompt_manager.add_prompt(new_prompt_name, evolved_prompt_str, score=0)

        LOG.info(f"Evolved new prompt '{new_prompt_name}': {evolved_prompt_str[:100]}...")

    def evaluate_and_score_prompts(self) -> None:
        """
        Simulates the evaluation of prompts and updates their scores based on deterministic criteria.
        In a real-world scenario, this would involve running benchmarks.
        """
        LOG.info("Evaluating and scoring prompts...")
        for name, prompt in self.prompt_manager.prompts.items():
            # Skip the baseline prompt as its score is fixed.
            if name == "baseline":
                continue

            score = 0
            # Score based on length (ideal length between 500 and 1500 characters)
            length = len(prompt.template)
            if 500 <= length <= 1500:
                score += 0.5
            else:
                score -= 0.2

            # Score based on presence of keywords
            keywords = ["action", "reasoning", "COMPLETE", "TERMINATE", "element", "goal"]
            for keyword in keywords:
                if keyword in prompt.template.lower():
                    score += 0.2

            # Normalize score to be between 0 and 2 for this simulation
            normalized_score = max(0, min(2, score))

            self.prompt_manager.update_score(name, normalized_score)
            LOG.info(f"Evaluated '{name}', assigned score: {normalized_score}")
```
🤖 Prompt for AI Agents
In skyvern/evolution/evolve.py around lines 9-74, the Evolve class and its
methods lack Python 3.11+ type hints; add explicit type annotations for the
class attributes and method signatures: annotate __init__ to accept
prompt_manager: "PromptManager" (use a forward reference or import the
PromptManager type), self.evolution_count: int, and self.prompt_manager:
"PromptManager"; annotate async def evolve_prompts(self) -> None and def
evaluate_and_score_prompts(self) -> None; annotate local variables where helpful
(e.g., best_prompt: Optional[Prompt], response: Any, evolved_prompt_str: str,
new_prompt_name: str, score: float, normalized_score: float) and ensure you
import necessary typing items (Optional, Any, Optional["Prompt"] or a Prompt
type, and if needed Coroutine) or reference existing project types; update
function and variable annotations accordingly without changing logic.
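With the annotations applied, the evolve step is easy to exercise against a stubbed handler. Everything below is a self-contained sketch: `fake_llm` stands in for `LLM_API_HANDLER`, and a plain dict of `(template, score)` pairs stands in for `PromptManager`:

```python
import asyncio
from typing import Any


async def fake_llm(prompt: str, step: Any = None) -> str:
    # Stand-in for LLM_API_HANDLER: "evolves" by appending an instruction.
    return prompt + "\nAlways explain your reasoning."


async def evolve_once(population: dict[str, tuple[str, float]]) -> str:
    # Select the highest-scoring template, as get_best_prompt() does.
    best_name = max(population, key=lambda name: population[name][1])
    best_template, _ = population[best_name]
    evolved = await fake_llm(best_template)
    new_name = f"evolved_v{len(population)}"
    population[new_name] = (evolved, 0.0)  # new variants start at score 0
    return new_name


population = {"baseline": ("Pick the next action toward the goal.", 1.0)}
print(asyncio.run(evolve_once(population)))  # evolved_v1
```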

Comment on lines +7 to +11

```python
class Prompt:
    def __init__(self, name, template, score=0):
        self.name = name
        self.template = template
        self.score = score
```

🛠️ Refactor suggestion | 🟠 Major

Add type hints and class docstring.

The Prompt class is missing type hints for its __init__ parameters and lacks a class-level docstring describing its purpose.

As per coding guidelines, apply this diff:

```diff
 class Prompt:
+    """
+    Represents a prompt template with its associated metadata.
+
+    Attributes:
+        name: Unique identifier for the prompt.
+        template: The Jinja2 template string.
+        score: Performance score for ranking (default: 0).
+    """
-    def __init__(self, name, template, score=0):
+    def __init__(self, name: str, template: str, score: float = 0) -> None:
         self.name = name
         self.template = template
         self.score = score
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```python
class Prompt:
    """
    Represents a prompt template with its associated metadata.

    Attributes:
        name: Unique identifier for the prompt.
        template: The Jinja2 template string.
        score: Performance score for ranking (default: 0).
    """

    def __init__(self, name: str, template: str, score: float = 0) -> None:
        self.name = name
        self.template = template
        self.score = score
```
🤖 Prompt for AI Agents
In skyvern/evolution/prompt_manager.py around lines 7 to 11, the Prompt class
lacks a class-level docstring and type hints; add a concise docstring explaining
that Prompt represents a named prompt template with an associated score,
annotate the class attributes (name: str, template: str, score: float = 0) and
update the __init__ signature to use type hints (def __init__(self, name: str,
template: str, score: float = 0) -> None:) so static type checkers and IDEs can
validate usage.
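A quick sanity check of the annotated class (self-contained; mirrors the suggestion above):

```python
class Prompt:
    """Represents a named prompt template with a performance score."""

    def __init__(self, name: str, template: str, score: float = 0) -> None:
        self.name = name
        self.template = template
        self.score = score


p = Prompt("baseline", "Goal: {{ goal }}", score=1.0)
print(p.name, p.score)  # baseline 1.0
```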

Comment on lines +13 to +68

```python
class PromptManager:
    def __init__(self):
        self.prompts = {}
        self._load_baseline_prompt()

    def _load_baseline_prompt(self):
        """
        Loads the original 'extract-action.j2' prompt as the baseline.
        """
        try:
            # Access the Jinja2 environment from the prompt_engine
            env = prompt_engine.env
            # Construct the path to the template within the Jinja2 environment
            template_path = "skyvern/extract-action.j2"
            # Get the template source from the loader
            baseline_template = env.loader.get_source(env, template_path)[0]

            self.add_prompt("baseline", baseline_template, score=1.0)  # Assuming baseline is good.
            LOG.info("Loaded baseline prompt 'extract-action.j2'.")
        except Exception as e:
            LOG.error(f"Failed to load baseline prompt: {e}", exc_info=True)

    def add_prompt(self, name, template, score=0):
        """
        Adds a new prompt to the population.
        """
        if name in self.prompts:
            LOG.warning(f"Prompt with name '{name}' already exists. Overwriting.")

        self.prompts[name] = Prompt(name, template, score)
        LOG.info(f"Added prompt '{name}' with score {score}.")

    def get_prompt(self, name):
        """
        Retrieves a prompt object by its name.
        """
        return self.prompts.get(name)

    def get_best_prompt(self):
        """
        Returns the prompt with the highest score.
        """
        if not self.prompts:
            return None

        return max(self.prompts.values(), key=lambda p: p.score)

    def update_score(self, name, score):
        """
        Updates the score of a prompt after evaluation.
        """
        if name in self.prompts:
            self.prompts[name].score = score
            LOG.info(f"Updated score for prompt '{name}' to {score}.")
        else:
            LOG.warning(f"Prompt '{name}' not found for score update.")
```
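For context on the `env.loader.get_source(...)[0]` indexing above: Jinja2 loaders return a `(source, filename, uptodate)` triple, so `[0]` extracts just the raw template text. A minimal illustration with a `DictLoader`:

```python
from jinja2 import DictLoader, Environment

env = Environment(loader=DictLoader({"skyvern/extract-action.j2": "Goal: {{ goal }}"}))
# get_source returns (source, filename, uptodate); index 0 is the template source
source = env.loader.get_source(env, "skyvern/extract-action.j2")[0]
print(source)  # Goal: {{ goal }}
```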

🛠️ Refactor suggestion | 🟠 Major

Add type hints to all methods.

The PromptManager class methods are missing type hints for parameters and return values, which is required by the coding guidelines for Python 3.11+.

As per coding guidelines, apply these changes:

```diff
+from typing import Optional
+
 class PromptManager:
-    def __init__(self):
+    def __init__(self) -> None:
         self.prompts = {}
         self._load_baseline_prompt()
 
-    def _load_baseline_prompt(self):
+    def _load_baseline_prompt(self) -> None:
         """
         Loads the original 'extract-action.j2' prompt as the baseline.
         """
         # ... rest of method
 
-    def add_prompt(self, name, template, score=0):
+    def add_prompt(self, name: str, template: str, score: float = 0) -> None:
         """
         Adds a new prompt to the population.
         """
         # ... rest of method
 
-    def get_prompt(self, name):
+    def get_prompt(self, name: str) -> Optional[Prompt]:
         """
         Retrieves a prompt object by its name.
         """
         return self.prompts.get(name)
 
-    def get_best_prompt(self):
+    def get_best_prompt(self) -> Optional[Prompt]:
         """
         Returns the prompt with the highest score.
         """
         # ... rest of method
 
-    def update_score(self, name, score):
+    def update_score(self, name: str, score: float) -> None:
         """
         Updates the score of a prompt after evaluation.
         """
         # ... rest of method
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```python
from typing import Optional


class PromptManager:
    def __init__(self) -> None:
        self.prompts: dict[str, Prompt] = {}
        self._load_baseline_prompt()

    def _load_baseline_prompt(self) -> None:
        """
        Loads the original 'extract-action.j2' prompt as the baseline.
        """
        try:
            # Access the Jinja2 environment from the prompt_engine
            env = prompt_engine.env
            # Construct the path to the template within the Jinja2 environment
            template_path = "skyvern/extract-action.j2"
            # Get the template source from the loader
            baseline_template = env.loader.get_source(env, template_path)[0]

            self.add_prompt("baseline", baseline_template, score=1.0)  # Assuming baseline is good.
            LOG.info("Loaded baseline prompt 'extract-action.j2'.")
        except Exception as e:
            LOG.error(f"Failed to load baseline prompt: {e}", exc_info=True)

    def add_prompt(self, name: str, template: str, score: float = 0) -> None:
        """
        Adds a new prompt to the population.
        """
        if name in self.prompts:
            LOG.warning(f"Prompt with name '{name}' already exists. Overwriting.")

        self.prompts[name] = Prompt(name, template, score)
        LOG.info(f"Added prompt '{name}' with score {score}.")

    def get_prompt(self, name: str) -> Optional[Prompt]:
        """
        Retrieves a prompt object by its name.
        """
        return self.prompts.get(name)

    def get_best_prompt(self) -> Optional[Prompt]:
        """
        Returns the prompt with the highest score.
        """
        if not self.prompts:
            return None

        return max(self.prompts.values(), key=lambda p: p.score)

    def update_score(self, name: str, score: float) -> None:
        """
        Updates the score of a prompt after evaluation.
        """
        if name in self.prompts:
            self.prompts[name].score = score
            LOG.info(f"Updated score for prompt '{name}' to {score}.")
        else:
            LOG.warning(f"Prompt '{name}' not found for score update.")
```
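The selection and scoring plumbing above is easy to verify in isolation. The sketch below stubs out baseline loading and logging (both need the real `prompt_engine` and `structlog` setup) and keeps only the population logic:

```python
from typing import Optional


class Prompt:
    def __init__(self, name: str, template: str, score: float = 0) -> None:
        self.name = name
        self.template = template
        self.score = score


class PromptManager:
    def __init__(self) -> None:
        # Baseline loading omitted: it requires the real prompt_engine.
        self.prompts: dict[str, Prompt] = {}

    def add_prompt(self, name: str, template: str, score: float = 0) -> None:
        self.prompts[name] = Prompt(name, template, score)

    def get_best_prompt(self) -> Optional[Prompt]:
        if not self.prompts:
            return None
        return max(self.prompts.values(), key=lambda p: p.score)

    def update_score(self, name: str, score: float) -> None:
        if name in self.prompts:
            self.prompts[name].score = score


mgr = PromptManager()
mgr.add_prompt("baseline", "...", score=1.0)
mgr.add_prompt("evolved_v1", "...", score=0)
mgr.update_score("evolved_v1", 1.6)  # pretend evaluation ranked it higher
best = mgr.get_best_prompt()
print(best.name if best else None)  # evolved_v1
```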
