This change introduces a new prompt evolution system inspired by the … #3644
…concepts in the AlphaEvolve research paper. The system iteratively improves the prompts used by the Skyvern agent, enhancing its performance and reliability. The core of this feature is the new `skyvern/evolution` package, which includes:

- `PromptManager`: a class for managing a population of prompts and their performance scores.
- `Evolve`: a class that uses an LLM to generate new variations of prompts based on their performance.
- `evolve-prompt.j2`: a new prompt template to guide the LLM in the evolution process.

A new script, `scripts/run_evolution.py`, orchestrates the evolution loop, allowing continuous improvement of the prompts. The `ForgeAgent` has been integrated with the `PromptManager` to dynamically use the best-performing prompt for its tasks. This creates a feedback loop in which the agent's performance can improve over time as its prompts evolve.
Walkthrough

Adds a prompt-evolution subsystem: a `PromptManager` to load/manage prompts, an `Evolve` class to generate and evaluate evolved prompts via an LLM and a scoring heuristic, a script to run multi-generation evolution, a new evolve-prompt template, and `ForgeAgent` changes to prefer evolved prompts at runtime, with app-level globals initialized.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User as CLI User
    participant Script as run_evolution.py
    participant PM as PromptManager
    participant Evo as Evolve
    participant LLM as LLM_API_HANDLER
    participant PE as prompt_engine
    User->>Script: python scripts/run_evolution.py
    Script->>PM: init() and load baseline
    Script->>Evo: init(PromptManager)
    loop generations (x5)
        Script->>Evo: evolve_prompts()
        Evo->>PM: get_best_prompt()
        alt best prompt exists
            Evo->>PE: load_prompt("evolve-prompt", target=best.template)
            Evo->>LLM: generate evolved prompt (input: evolution prompt)
            LLM-->>Evo: evolved prompt text
            Evo->>PM: add_prompt(name=evolved_vN, template=..., score=0)
        else
            Evo->>Evo: log warning and return
        end
        Script->>Evo: evaluate_and_score_prompts()
        Evo->>PM: update_score(...) per prompt
        Script->>PM: get_best_prompt() and log
        Script->>Script: asyncio.sleep()
    end
```

```mermaid
sequenceDiagram
    autonumber
    participant Forge as ForgeAgent
    participant PM as PromptManager
    participant PE as prompt_engine
    participant Render as Renderer
    Forge->>PM: get_best_prompt()
    alt evolved prompt available
        Forge->>PE: load_prompt_from_string(best.template)
    else
        Forge->>PE: load_prompt(".../extract-action.j2" or mapped template)
    end
    PE-->>Render: compiled template
    Render-->>Forge: rendered prompt
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches: ❌ 2 failed checks (warnings), ✅ 1 passed.
Caution
Changes requested ❌
Reviewed everything up to 6bd63e6 in 2 minutes and 26 seconds.
- Reviewed 361 lines of code in 7 files
- Skipped 0 files when reviewing
- Skipped posting 2 draft comments; view those below
1. scripts/run_evolution.py:52
- Draft comment: End the file with a newline for POSIX compliance.
- Reason this comment was not posted: usefulness confidence = 20% vs. threshold = 50%. The issue is real but very minor: missing trailing newlines can trip some UNIX tools, but this is exactly the kind of issue that linters, formatters, and IDEs handle automatically, so it is better left to automated tooling than a manual review comment.
2. skyvern/forge/agent.py:1325
- Draft comment: Consider refactoring the prompt selection fallback logic into a helper to improve clarity and avoid repetition.
- Reason this comment was not posted: comment looked like it was already resolved.
Workflow ID: wflow_NTbPeRaXsFZ9OvJq
```diff
@@ -0,0 +1,74 @@
+import structlog
+import random
```
Remove unused import 'random' if it's not used.
```diff
-import random
```
```python
# In a real implementation, a 'step' object would be passed here.
# This is a placeholder for demonstration purposes.
response = await LLM_API_HANDLER(prompt=evolution_prompt, step=None)
```
Consider adding error handling around the LLM_API_HANDLER call to catch unexpected failures.
Actionable comments posted: 5
🧹 Nitpick comments (8)
skyvern/evolution/prompt_manager.py (2)
30-31: Use structured logging instead of f-strings.

These log statements use f-strings, but structlog supports structured logging with keyword arguments that provide better machine-readability and context.

Based on learnings, apply these changes:

```diff
-        self.add_prompt("baseline", baseline_template, score=1.0)  # Assuming baseline is good.
-        LOG.info("Loaded baseline prompt 'extract-action.j2'.")
+        self.add_prompt("baseline", baseline_template, score=1.0)
+        LOG.info("Loaded baseline prompt", template_name="extract-action.j2")

-        LOG.error(f"Failed to load baseline prompt: {e}", exc_info=True)
+        LOG.error("Failed to load baseline prompt", error=str(e), exc_info=True)

-            LOG.warning(f"Prompt with name '{name}' already exists. Overwriting.")
+            LOG.warning("Prompt already exists, overwriting", name=name)

-        LOG.info(f"Added prompt '{name}' with score {score}.")
+        LOG.info("Added prompt", name=name, score=score)

-            LOG.info(f"Updated score for prompt '{name}' to {score}.")
+            LOG.info("Updated prompt score", name=name, score=score)

-            LOG.warning(f"Prompt '{name}' not found for score update.")
+            LOG.warning("Prompt not found for score update", name=name)
```

Also applies to: 40-40, 43-43, 66-66, 68-68
32-33: Narrow the exception handling.

Catching all exceptions with a bare `except Exception` is too broad and may hide unexpected errors. Consider catching specific exceptions like `jinja2.TemplateNotFound` or `OSError`.

```diff
+from jinja2 import TemplateNotFound
+
     try:
         # Access the Jinja2 environment from the prompt_engine
         env = prompt_engine.env
         # Construct the path to the template within the Jinja2 environment
         template_path = "skyvern/extract-action.j2"
         # Get the template source from the loader
         baseline_template = env.loader.get_source(env, template_path)[0]
-        self.add_prompt("baseline", baseline_template, score=1.0)  # Assuming baseline is good.
+        self.add_prompt("baseline", baseline_template, score=1.0)
         LOG.info("Loaded baseline prompt 'extract-action.j2'.")
-    except Exception as e:
-        LOG.error(f"Failed to load baseline prompt: {e}", exc_info=True)
+    except (TemplateNotFound, OSError) as e:
+        LOG.error("Failed to load baseline prompt", error=str(e), exc_info=True)
```

scripts/run_evolution.py (3)
26-26: Consider making the number of generations configurable.

The hard-coded value `num_generations = 5` limits flexibility for experimentation or production use. Consider adding a command-line argument or environment variable:

```diff
+import os
+
 async def main():
     """
     Main function to run the prompt evolution loop.
     """
     LOG.info("Initializing prompt evolution process...")
     prompt_manager = PromptManager()
     evolver = Evolve(prompt_manager)
     # Check if the baseline prompt was loaded correctly
     if not prompt_manager.get_prompt("baseline"):
         LOG.error("Failed to load baseline prompt. Aborting evolution process.")
         return
     LOG.info("Starting evolution loop...")
     # Run the evolution loop for a few generations as a demonstration
-    num_generations = 5
+    num_generations = int(os.getenv("EVOLUTION_GENERATIONS", "5"))
     for i in range(num_generations):
```
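A small helper makes the env-var parsing explicit and tolerant of bad input. The variable name `EVOLUTION_GENERATIONS` follows the suggestion above; the fallback behavior on invalid or non-positive values is an assumption:

```python
import os


def generations_from_env(default: int = 5) -> int:
    """Read EVOLUTION_GENERATIONS, falling back to the default on missing or bad values."""
    raw = os.getenv("EVOLUTION_GENERATIONS", str(default))
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default
```

The bare `int(os.getenv(...))` in the diff would raise `ValueError` on a malformed value; the helper degrades to the default instead.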
30-34: Add error handling for evolution steps.

The evolution loop lacks error handling for potential failures in `evolve_prompts()` or `evaluate_and_score_prompts()`. If either fails, the entire loop stops without useful diagnostics.

```diff
     for i in range(num_generations):
         LOG.info(f"--- Generation {i+1}/{num_generations} ---")
-        # Evolve the prompts to create new variations
-        await evolver.evolve_prompts()
-
-        # Evaluate the performance of the new prompts
-        evolver.evaluate_and_score_prompts()
+        try:
+            # Evolve the prompts to create new variations
+            await evolver.evolve_prompts()
+
+            # Evaluate the performance of the new prompts
+            evolver.evaluate_and_score_prompts()
+        except Exception:
+            LOG.exception("Evolution step failed", generation=i+1)
+            continue
```
44-44: Document the purpose of the sleep delay.

The 5-second sleep between generations is not explained. Consider documenting why this delay is necessary or making it configurable.

```diff
-    # In a real application, you might add a delay or run this as a continuous background process
-    await asyncio.sleep(5)
+    # Brief pause between generations to avoid overwhelming the LLM API
+    # In production, this could be removed or adjusted based on rate limits
+    await asyncio.sleep(int(os.getenv("EVOLUTION_DELAY_SECONDS", "5")))
```

skyvern/forge/agent.py (2)
1320-1341: Clarify the fallback chain for general tasks.

The fallback logic for general tasks (evolved prompt → baseline → template name) is reasonable, but the "critical error" log message on Line 1339 doesn't immediately raise an exception. This could lead to confusion about whether execution continues.

Consider making the error handling more explicit:

```diff
 if task_type == TaskType.general:
     # For general tasks, try to use the best prompt from our evolution manager.
     best_prompt = app.PROMPT_MANAGER.get_best_prompt()
     if best_prompt:
-        LOG.info(f"Using evolved prompt: {best_prompt.name} with score {best_prompt.score}")
+        LOG.info("Using evolved prompt", prompt_name=best_prompt.name, score=best_prompt.score)
         template_str = best_prompt.template
     else:
         # If no evolved prompts, fall back to the baseline prompt.
         LOG.warning("PromptManager has no prompts. Falling back to baseline 'extract-action'.")
         baseline_prompt = app.PROMPT_MANAGER.get_prompt("baseline")
         if baseline_prompt:
             template_str = baseline_prompt.template
         else:
             # If even the baseline is missing, this is a critical error.
-            LOG.error("Baseline prompt could not be loaded from PromptManager.")
-            # As a last resort, use the template name.
+            LOG.critical("Baseline prompt could not be loaded from PromptManager, using template name fallback")
+            # As a last resort, use the template name (may fail if template file is missing)
             template_name = "extract-action"
```
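The helper refactor suggested earlier in the review could be sketched like this. The manager interface is inferred from the diff; `StubManager` and the `Prompt` tuple are hypothetical stand-ins for testing the chain in isolation:

```python
from collections import namedtuple

Prompt = namedtuple("Prompt", "name template score")


def choose_template(manager, default_template_name: str = "extract-action"):
    """Return (template_str, template_name); exactly one is non-None.

    Order assumed from the diff: best evolved prompt, then the stored
    baseline, then the on-disk template name as a last resort.
    """
    best = manager.get_best_prompt()
    if best is not None:
        return best.template, None
    baseline = manager.get_prompt("baseline")
    if baseline is not None:
        return baseline.template, None
    return None, default_template_name


class StubManager:
    """Hypothetical stand-in exposing the two PromptManager methods the diff uses."""

    def __init__(self, best=None, baseline=None):
        self._best, self._baseline = best, baseline

    def get_best_prompt(self):
        return self._best

    def get_prompt(self, name):
        return self._baseline if name == "baseline" else None
```

Centralizing the chain in one function makes the "exactly one of `template_str`/`template_name` is set" invariant testable, which is what the next comment asks the rendering code to check.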
1382-1396: Add defensive check before rendering.

The code assumes at least one of `template_str` or `template_name` is set, but this isn't guaranteed if the logic above changes. Adding an explicit check after Line 1380 would make the code more robust.

```diff
         complete_criterion=task.complete_criterion,
         terminate_criterion=task.terminate_criterion,
     )
+    # Ensure at least one rendering path is available
+    if template_str is None and template_name is None:
+        raise UnsupportedTaskType(task_type=task_type)
+
     if template_str is not None:
         # Render the prompt from a raw string (used for evolved prompts)
         return prompt_engine.load_prompt_from_string(
             template=template_str,
             **render_kwargs,
         )
     if template_name is not None:
         # Render the prompt from a template file by name (standard behavior)
         return prompt_engine.load_prompt(
             template=template_name,
             **render_kwargs,
         )
-    raise UnsupportedTaskType(task_type=task_type)
```

skyvern/evolution/evolve.py (1)
23-23: Use structured logging instead of f-strings.

These log statements use f-strings for interpolation. Structlog supports structured logging with keyword arguments for better machine-readability.

Based on learnings:

```diff
-        LOG.info(f"Evolving prompt '{best_prompt.name}' with score {best_prompt.score}")
+        LOG.info("Evolving prompt", prompt_name=best_prompt.name, score=best_prompt.score)

-        LOG.info(f"Evolved new prompt '{new_prompt_name}': {evolved_prompt_str[:100]}...")
+        LOG.info("Evolved new prompt", prompt_name=new_prompt_name, preview=evolved_prompt_str[:100])

-            LOG.info(f"Evaluated '{name}', assigned score: {normalized_score}")
+            LOG.info("Evaluated prompt", prompt_name=name, score=normalized_score)
```

Also applies to: 43-43, 74-74
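The difference the review is after, an event message plus key-value context rather than string interpolation, can be illustrated without structlog itself. The JSON-line shape below mirrors what structlog's JSONRenderer produces, but is a deliberate simplification:

```python
import json


def log_event(event: str, **context) -> str:
    """Render an event plus key-value context as one JSON line
    (roughly the shape structlog's JSONRenderer emits; simplified)."""
    return json.dumps({"event": event, **context}, sort_keys=True)
```

Because the message stays constant and the variables travel as keys, log aggregators can group by `event` and filter on `prompt_name` or `score`, which is impossible once the values are baked into an f-string.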
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
scripts/run_evolution.py(1 hunks)skyvern/evolution/__init__.py(1 hunks)skyvern/evolution/evolve.py(1 hunks)skyvern/evolution/prompt_manager.py(1 hunks)skyvern/forge/agent.py(3 hunks)skyvern/forge/app.py(2 hunks)skyvern/forge/prompts/skyvern/evolve-prompt.j2(1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
{skyvern,integrations,alembic,scripts}/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
{skyvern,integrations,alembic,scripts}/**/*.py: Use Python 3.11+ features and add type hints throughout the codebase
Follow PEP 8 with a maximum line length of 100 characters
Use absolute imports for all Python modules
Document all public functions and classes with Google-style docstrings
Use snake_case for variables and functions, and PascalCase for classes
Prefer async/await over callbacks in asynchronous code
Use asyncio for concurrency
Always handle exceptions in async code
Use context managers for resource cleanup
Use specific exception classes
Include meaningful error messages when raising or logging exceptions
Log errors with appropriate severity levels
Never expose sensitive information in error messages
Files:
- `skyvern/evolution/__init__.py`
- `skyvern/evolution/evolve.py`
- `skyvern/forge/agent.py`
- `scripts/run_evolution.py`
- `skyvern/evolution/prompt_manager.py`
- `skyvern/forge/app.py`
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Python code must be linted and formatted with Ruff
Use type hints throughout Python code
Prefer async/await for asynchronous Python code
Enforce a maximum line length of 120 characters in Python files
Files:
- `skyvern/evolution/__init__.py`
- `skyvern/evolution/evolve.py`
- `skyvern/forge/agent.py`
- `scripts/run_evolution.py`
- `skyvern/evolution/prompt_manager.py`
- `skyvern/forge/app.py`
skyvern/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Type-check Python code in the skyvern/ package with mypy
Files:
- `skyvern/evolution/__init__.py`
- `skyvern/evolution/evolve.py`
- `skyvern/forge/agent.py`
- `skyvern/evolution/prompt_manager.py`
- `skyvern/forge/app.py`
🧬 Code graph analysis (4)
skyvern/evolution/evolve.py (2)
skyvern/evolution/prompt_manager.py (3)
- `get_best_prompt` (51-58)
- `add_prompt` (35-43)
- `update_score` (60-68)

skyvern/forge/sdk/prompting.py (1)
- `load_prompt` (89-111)
skyvern/forge/agent.py (6)
skyvern/forge/sdk/db/enums.py (1)
- `TaskType` (8-11)

skyvern/evolution/prompt_manager.py (2)
- `get_best_prompt` (51-58)
- `get_prompt` (45-49)

skyvern/webeye/actions/actions.py (1)
- `ActionType` (13-37)

skyvern/exceptions.py (2)
- `UnsupportedActionType` (322-324)
- `UnsupportedTaskType` (545-547)

skyvern/forge/sdk/core/skyvern_context.py (1)
- `ensure_context` (43-56)

skyvern/forge/sdk/prompting.py (2)
- `load_prompt_from_string` (113-134)
- `load_prompt` (89-111)
scripts/run_evolution.py (2)
skyvern/evolution/evolve.py (3)
- `Evolve` (9-74)
- `evolve_prompts` (14-43)
- `evaluate_and_score_prompts` (45-74)

skyvern/evolution/prompt_manager.py (3)
- `PromptManager` (13-68)
- `get_prompt` (45-49)
- `get_best_prompt` (51-58)
skyvern/forge/app.py (1)
skyvern/evolution/prompt_manager.py (1)
- `PromptManager` (13-68)
🔇 Additional comments (6)
skyvern/evolution/__init__.py (1)
1-1: LGTM!

Standard Python package initialization file. The comment clearly indicates the intentional design.
skyvern/forge/app.py (1)
47-48: Module-level initialization is appropriate here.

The `PROMPT_MANAGER` and `agent` globals are initialized at module level, consistent with the pattern used for other application globals in this file (e.g., `DATABASE`, `BROWSER_MANAGER`). This ensures they're available throughout the application lifecycle.

Note: Ensure that any initialization errors in `PromptManager` (like baseline loading failures) are logged appropriately, as they will occur during module import.

skyvern/forge/prompts/skyvern/evolve-prompt.j2 (1)
1-17: Well-structured evolution template.

The template provides clear guidance for prompt evolution with:
- Role definition for the LLM
- Explicit principles (clarity, role-setting, context, action-oriented, robustness)
- Clear output format instruction (no extra text)
This aligns well with the evolution workflow described in the PR objectives.
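As a rough illustration of the template's job, here is a plain-Python builder for the same kind of meta-prompt. The wording is invented; it is not the contents of `evolve-prompt.j2`, only the principles listed above applied as text:

```python
def build_evolution_prompt(prompt_to_evolve: str) -> str:
    """Compose a meta-prompt asking an LLM to improve a target prompt.

    The principles follow the review's summary; the exact text of the real
    Jinja2 template is not reproduced here.
    """
    principles = [
        "clarity",
        "role-setting",
        "context",
        "action-oriented phrasing",
        "robustness",
    ]
    return (
        "You are an expert prompt engineer. Improve the prompt below, "
        f"keeping these principles in mind: {', '.join(principles)}.\n"
        "Return only the improved prompt, with no extra text.\n\n"
        "--- PROMPT TO EVOLVE ---\n"
        f"{prompt_to_evolve}"
    )
```

The "no extra text" instruction matters because `evolve_prompts` stores the raw LLM response as the new template; any preamble the model emits would be baked into the evolved prompt.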
skyvern/forge/agent.py (1)
1352-1353: Good variable renaming for clarity.

Renaming `action_type` to `action_type_str` before converting it to the enum improves code readability by making the type transformation explicit.
45-74: Scoring logic is simplistic but acceptable for demonstration.

The `evaluate_and_score_prompts` method uses a deterministic heuristic (length and keyword matching) rather than actual benchmark results. The docstring acknowledges this is a simulation, which is appropriate for a proof-of-concept.

For production use, consider replacing this with actual performance metrics from agent runs.
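The heuristic reduces to a pure function, which also surfaces a subtle point worth double-checking in the original: uppercase keywords such as `COMPLETE` are compared against `template.lower()` and so can never match. The sketch below lowercases the keyword list instead:

```python
def score_prompt(template: str) -> float:
    """Length-plus-keyword heuristic mirroring evaluate_and_score_prompts."""
    # Ideal length between 500 and 1500 characters.
    score = 0.5 if 500 <= len(template) <= 1500 else -0.2
    # Lowercased keywords, unlike the original's "COMPLETE"/"TERMINATE",
    # which cannot match a lowercased template.
    keywords = ["action", "reasoning", "complete", "terminate", "element", "goal"]
    lowered = template.lower()
    score += 0.2 * sum(1 for kw in keywords if kw in lowered)
    # Clamp to the simulation's [0, 2] range.
    return max(0.0, min(2.0, score))
```

Making the scorer a pure function of the template string also makes it trivial to unit-test, unlike the in-place loop over `prompt_manager.prompts`.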
33-33: `LLM_API_HANDLER` called with `step=None`.

At evolve.py:33, `step=None` is a placeholder. Verify the handler accepts a null step, or pass a valid step object.
```python
async def main():
    """
    Main function to run the prompt evolution loop.
    """
    LOG.info("Initializing prompt evolution process...")

    prompt_manager = PromptManager()
    evolver = Evolve(prompt_manager)

    # Check if the baseline prompt was loaded correctly
    if not prompt_manager.get_prompt("baseline"):
        LOG.error("Failed to load baseline prompt. Aborting evolution process.")
        return

    LOG.info("Starting evolution loop...")

    # Run the evolution loop for a few generations as a demonstration
    num_generations = 5
    for i in range(num_generations):
        LOG.info(f"--- Generation {i+1}/{num_generations} ---")

        # Evolve the prompts to create new variations
        await evolver.evolve_prompts()

        # Evaluate the performance of the new prompts
        evolver.evaluate_and_score_prompts()

        # Log the best prompt of the current generation
        best_prompt = prompt_manager.get_best_prompt()
        if best_prompt:
            LOG.info(f"Best prompt of generation {i+1}: '{best_prompt.name}' with score {best_prompt.score}")
        else:
            LOG.warning("No prompts in manager after evolution and evaluation.")

        # In a real application, you might add a delay or run this as a continuous background process
        await asyncio.sleep(5)

    LOG.info("Evolution loop finished.")
```
🛠️ Refactor suggestion | 🟠 Major
Add type hints to the main function.
The main() function is missing type hints for its return value, which is required by the coding guidelines.
As per coding guidelines:
```diff
-async def main():
+async def main() -> None:
     """
     Main function to run the prompt evolution loop.
     """
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
async def main() -> None:
    """
    Main function to run the prompt evolution loop.
    """
    LOG.info("Initializing prompt evolution process...")
    prompt_manager = PromptManager()
    evolver = Evolve(prompt_manager)
    # Check if the baseline prompt was loaded correctly
    if not prompt_manager.get_prompt("baseline"):
        LOG.error("Failed to load baseline prompt. Aborting evolution process.")
        return
    LOG.info("Starting evolution loop...")
    # Run the evolution loop for a few generations as a demonstration
    num_generations = 5
    for i in range(num_generations):
        LOG.info(f"--- Generation {i+1}/{num_generations} ---")
        # Evolve the prompts to create new variations
        await evolver.evolve_prompts()
        # Evaluate the performance of the new prompts
        evolver.evaluate_and_score_prompts()
        # Log the best prompt of the current generation
        best_prompt = prompt_manager.get_best_prompt()
        if best_prompt:
            LOG.info(f"Best prompt of generation {i+1}: '{best_prompt.name}' with score {best_prompt.score}")
        else:
            LOG.warning("No prompts in manager after evolution and evaluation.")
        # In a real application, you might add a delay or run this as a continuous background process
        await asyncio.sleep(5)
    LOG.info("Evolution loop finished.")
```
🤖 Prompt for AI Agents
In scripts/run_evolution.py around lines 9 to 46, the async main() function is
missing a return type annotation; update its signature to include an explicit
return type (async def main() -> None:) to satisfy the coding guidelines and
ensure the coroutine is annotated as returning None; no other behavioral changes
are required.
```diff
@@ -0,0 +1,74 @@
+import structlog
+import random
```
🛠️ Refactor suggestion | 🟠 Major
Remove unused import.
The `random` module is imported but never used in this file.

```diff
 import structlog
-import random
 from skyvern.forge.prompts import prompt_engine
```

📝 Committable suggestion
```python
import structlog
from skyvern.forge.prompts import prompt_engine
```
🤖 Prompt for AI Agents
In skyvern/evolution/evolve.py around line 2, the file imports the random module
which is unused; remove the unused import statement (delete or comment out the
"import random" line) to clean up imports and avoid linter warnings.
```python
class Evolve:
    def __init__(self, prompt_manager):
        self.prompt_manager = prompt_manager
        self.evolution_count = 0

    async def evolve_prompts(self):
        """
        Takes the top-performing prompts and uses an LLM to generate new variations.
        """
        best_prompt = self.prompt_manager.get_best_prompt()
        if not best_prompt:
            LOG.warning("No prompts found to evolve.")
            return

        LOG.info(f"Evolving prompt '{best_prompt.name}' with score {best_prompt.score}")

        # Use an LLM to generate a new variation of the prompt.
        evolution_prompt = prompt_engine.load_prompt(
            "evolve-prompt",
            prompt_to_evolve=best_prompt.template,
        )

        # In a real implementation, a 'step' object would be passed here.
        # This is a placeholder for demonstration purposes.
        response = await LLM_API_HANDLER(prompt=evolution_prompt, step=None)

        # Assuming the response is the raw string of the new prompt
        evolved_prompt_str = response if isinstance(response, str) else str(response)

        # Add the new prompt to the population
        self.evolution_count += 1
        new_prompt_name = f"evolved_v{self.evolution_count}"
        self.prompt_manager.add_prompt(new_prompt_name, evolved_prompt_str, score=0)

        LOG.info(f"Evolved new prompt '{new_prompt_name}': {evolved_prompt_str[:100]}...")

    def evaluate_and_score_prompts(self):
        """
        Simulates the evaluation of prompts and updates their scores based on deterministic criteria.
        In a real-world scenario, this would involve running benchmarks.
        """
        LOG.info("Evaluating and scoring prompts...")
        for name, prompt in self.prompt_manager.prompts.items():
            # Skip the baseline prompt as its score is fixed.
            if name == "baseline":
                continue

            score = 0
            # Score based on length (ideal length between 500 and 1500 characters)
            length = len(prompt.template)
            if 500 <= length <= 1500:
                score += 0.5
            else:
                score -= 0.2

            # Score based on presence of keywords
            keywords = ["action", "reasoning", "COMPLETE", "TERMINATE", "element", "goal"]
            for keyword in keywords:
                if keyword in prompt.template.lower():
                    score += 0.2

            # Normalize score to be between 0 and 2 for this simulation
            normalized_score = max(0, min(2, score))

            self.prompt_manager.update_score(name, normalized_score)
            LOG.info(f"Evaluated '{name}', assigned score: {normalized_score}")
```
🛠️ Refactor suggestion | 🟠 Major
Add type hints to the Evolve class.
The entire Evolve class is missing type hints for method parameters and return values, which violates the coding guidelines for Python 3.11+.
As per coding guidelines:
+from typing import TYPE_CHECKING
+
+if TYPE_CHECKING:
+ from skyvern.evolution.prompt_manager import PromptManager
+
class Evolve:
- def __init__(self, prompt_manager):
+ def __init__(self, prompt_manager: "PromptManager") -> None:
self.prompt_manager = prompt_manager
self.evolution_count = 0
- async def evolve_prompts(self):
+ async def evolve_prompts(self) -> None:
"""
Takes the top-performing prompts and uses an LLM to generate new variations.
"""
# ... rest of method
- def evaluate_and_score_prompts(self):
+ def evaluate_and_score_prompts(self) -> None:
"""
Simulates the evaluation of prompts and updates their scores based on deterministic criteria.
In a real-world scenario, this would involve running benchmarks.
"""
# ... rest of method📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| class Evolve: | |
| def __init__(self, prompt_manager): | |
| self.prompt_manager = prompt_manager | |
| self.evolution_count = 0 | |
| async def evolve_prompts(self): | |
| """ | |
| Takes the top-performing prompts and uses an LLM to generate new variations. | |
| """ | |
| best_prompt = self.prompt_manager.get_best_prompt() | |
| if not best_prompt: | |
| LOG.warning("No prompts found to evolve.") | |
| return | |
| LOG.info(f"Evolving prompt '{best_prompt.name}' with score {best_prompt.score}") | |
| # Use an LLM to generate a new variation of the prompt. | |
| evolution_prompt = prompt_engine.load_prompt( | |
| "evolve-prompt", | |
| prompt_to_evolve=best_prompt.template, | |
| ) | |
| # In a real implementation, a 'step' object would be passed here. | |
| # This is a placeholder for demonstration purposes. | |
| response = await LLM_API_HANDLER(prompt=evolution_prompt, step=None) | |
| # Assuming the response is the raw string of the new prompt | |
| evolved_prompt_str = response if isinstance(response, str) else str(response) | |
| # Add the new prompt to the population | |
| self.evolution_count += 1 | |
| new_prompt_name = f"evolved_v{self.evolution_count}" | |
| self.prompt_manager.add_prompt(new_prompt_name, evolved_prompt_str, score=0) | |
| LOG.info(f"Evolved new prompt '{new_prompt_name}': {evolved_prompt_str[:100]}...") | |
| def evaluate_and_score_prompts(self): | |
| """ | |
| Simulates the evaluation of prompts and updates their scores based on deterministic criteria. | |
| In a real-world scenario, this would involve running benchmarks. | |
| """ | |
| LOG.info("Evaluating and scoring prompts...") | |
| for name, prompt in self.prompt_manager.prompts.items(): | |
| # Skip the baseline prompt as its score is fixed. | |
| if name == "baseline": | |
| continue | |
| score = 0 | |
| # Score based on length (ideal length between 500 and 1500 characters) | |
| length = len(prompt.template) | |
| if 500 <= length <= 1500: | |
| score += 0.5 | |
| else: | |
| score -= 0.2 | |
| # Score based on presence of keywords | |
| keywords = ["action", "reasoning", "COMPLETE", "TERMINATE", "element", "goal"] | |
| for keyword in keywords: | |
| if keyword in prompt.template.lower(): | |
| score += 0.2 | |
| # Normalize score to be between 0 and 2 for this simulation | |
| normalized_score = max(0, min(2, score)) | |
| self.prompt_manager.update_score(name, normalized_score) | |
| LOG.info(f"Evaluated '{name}', assigned score: {normalized_score}") | |
```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from skyvern.evolution.prompt_manager import PromptManager


class Evolve:
    def __init__(self, prompt_manager: "PromptManager") -> None:
        self.prompt_manager = prompt_manager
        self.evolution_count = 0

    async def evolve_prompts(self) -> None:
        """
        Takes the top-performing prompts and uses an LLM to generate new variations.
        """
        best_prompt = self.prompt_manager.get_best_prompt()
        if not best_prompt:
            LOG.warning("No prompts found to evolve.")
            return

        LOG.info(f"Evolving prompt '{best_prompt.name}' with score {best_prompt.score}")

        # Use an LLM to generate a new variation of the prompt.
        evolution_prompt = prompt_engine.load_prompt(
            "evolve-prompt",
            prompt_to_evolve=best_prompt.template,
        )
        # In a real implementation, a 'step' object would be passed here.
        # This is a placeholder for demonstration purposes.
        response = await LLM_API_HANDLER(prompt=evolution_prompt, step=None)

        # Assuming the response is the raw string of the new prompt
        evolved_prompt_str = response if isinstance(response, str) else str(response)

        # Add the new prompt to the population
        self.evolution_count += 1
        new_prompt_name = f"evolved_v{self.evolution_count}"
        self.prompt_manager.add_prompt(new_prompt_name, evolved_prompt_str, score=0)
        LOG.info(f"Evolved new prompt '{new_prompt_name}': {evolved_prompt_str[:100]}...")

    def evaluate_and_score_prompts(self) -> None:
        """
        Simulates the evaluation of prompts and updates their scores based on deterministic criteria.
        In a real-world scenario, this would involve running benchmarks.
        """
        LOG.info("Evaluating and scoring prompts...")
        for name, prompt in self.prompt_manager.prompts.items():
            # Skip the baseline prompt as its score is fixed.
            if name == "baseline":
                continue
            score = 0
            # Score based on length (ideal length between 500 and 1500 characters)
            length = len(prompt.template)
            if 500 <= length <= 1500:
                score += 0.5
            else:
                score -= 0.2
            # Score based on presence of keywords
            keywords = ["action", "reasoning", "COMPLETE", "TERMINATE", "element", "goal"]
            for keyword in keywords:
                if keyword in prompt.template.lower():
                    score += 0.2
            # Normalize score to be between 0 and 2 for this simulation
            normalized_score = max(0, min(2, score))
            self.prompt_manager.update_score(name, normalized_score)
            LOG.info(f"Evaluated '{name}', assigned score: {normalized_score}")
```
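The heuristic in `evaluate_and_score_prompts` is easy to exercise in isolation. The sketch below replicates it as a free function (an assumption for demonstration; the PR keeps it as a method on `Evolve`). It also surfaces a quirk worth flagging in review: the uppercase keywords `"COMPLETE"` and `"TERMINATE"` are checked against the lowercased template, so they can never match.

```python
def score_template(template: str) -> float:
    """Replicate the review's heuristic: a length band plus keyword bonuses,
    clamped to the [0, 2] range."""
    score = 0.0
    # Ideal length between 500 and 1500 characters.
    if 500 <= len(template) <= 1500:
        score += 0.5
    else:
        score -= 0.2
    # +0.2 per keyword found in the lowercased template. The uppercase
    # entries ("COMPLETE", "TERMINATE") can never match a lowercased string,
    # so at most four of the six keywords can actually contribute.
    for keyword in ["action", "reasoning", "COMPLETE", "TERMINATE", "element", "goal"]:
        if keyword in template.lower():
            score += 0.2
    return max(0, min(2, score))

short = "Pick an action toward the goal."  # below the length band, two keyword hits
long_enough = (
    "Explain your reasoning, choose an element, and emit an action toward the goal. " * 10
)  # inside the length band, four keyword hits
print(round(score_template(short), 2))
print(round(score_template(long_enough), 2))
```

Lowercasing the keyword list (or comparing case-insensitively on both sides) would make all six contribute as presumably intended.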
🤖 Prompt for AI Agents
In skyvern/evolution/evolve.py around lines 9-74, the Evolve class and its
methods lack Python 3.11+ type hints; add explicit type annotations for the
class attributes and method signatures: annotate __init__ to accept
prompt_manager: "PromptManager" (use a forward reference or import the
PromptManager type), self.evolution_count: int, and self.prompt_manager:
"PromptManager"; annotate async def evolve_prompts(self) -> None and def
evaluate_and_score_prompts(self) -> None; annotate local variables where helpful
(e.g., best_prompt: Optional[Prompt], response: Any, evolved_prompt_str: str,
new_prompt_name: str, score: float, normalized_score: float) and ensure you
import necessary typing items (Optional, Any, Optional["Prompt"] or a Prompt
type, and if needed Coroutine) or reference existing project types; update
function and variable annotations accordingly without changing logic.
```python
class Prompt:
    def __init__(self, name, template, score=0):
        self.name = name
        self.template = template
        self.score = score
```
🛠️ Refactor suggestion | 🟠 Major
Add type hints and class docstring.
The Prompt class is missing type hints for its __init__ parameters and lacks a class-level docstring describing its purpose.
As per coding guidelines, apply this diff:

```diff
 class Prompt:
+    """
+    Represents a prompt template with its associated metadata.
+
+    Attributes:
+        name: Unique identifier for the prompt.
+        template: The Jinja2 template string.
+        score: Performance score for ranking (default: 0).
+    """
-    def __init__(self, name, template, score=0):
+    def __init__(self, name: str, template: str, score: float = 0) -> None:
         self.name = name
         self.template = template
         self.score = score
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
class Prompt:
    """
    Represents a prompt template with its associated metadata.

    Attributes:
        name: Unique identifier for the prompt.
        template: The Jinja2 template string.
        score: Performance score for ranking (default: 0).
    """

    def __init__(self, name: str, template: str, score: float = 0) -> None:
        self.name = name
        self.template = template
        self.score = score
```
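Since the class is a plain record of three fields, the same structure could also be expressed as a dataclass, which generates `__init__`, `__repr__`, and `__eq__` automatically; this is an alternative sketch, not part of the PR:

```python
from dataclasses import dataclass

@dataclass
class Prompt:
    """A named prompt template with a performance score used for ranking."""
    name: str
    template: str
    score: float = 0.0

# The generated __init__ accepts the same positional arguments.
p = Prompt("baseline", "Decide the next {{ action }}.")
print(p)
```

The dataclass form keeps the type annotations the review asks for while removing the boilerplate constructor.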
🤖 Prompt for AI Agents
In skyvern/evolution/prompt_manager.py around lines 7 to 11, the Prompt class
lacks a class-level docstring and type hints; add a concise docstring explaining
that Prompt represents a named prompt template with an associated score,
annotate the class attributes (name: str, template: str, score: int = 0) and
update the __init__ signature to use type hints (def __init__(self, name: str,
template: str, score: int = 0) -> None:) so static type checkers and IDEs can
validate usage.
```python
class PromptManager:
    def __init__(self):
        self.prompts = {}
        self._load_baseline_prompt()

    def _load_baseline_prompt(self):
        """
        Loads the original 'extract-action.j2' prompt as the baseline.
        """
        try:
            # Access the Jinja2 environment from the prompt_engine
            env = prompt_engine.env
            # Construct the path to the template within the Jinja2 environment
            template_path = "skyvern/extract-action.j2"
            # Get the template source from the loader
            baseline_template = env.loader.get_source(env, template_path)[0]

            self.add_prompt("baseline", baseline_template, score=1.0)  # Assuming baseline is good.
            LOG.info("Loaded baseline prompt 'extract-action.j2'.")
        except Exception as e:
            LOG.error(f"Failed to load baseline prompt: {e}", exc_info=True)

    def add_prompt(self, name, template, score=0):
        """
        Adds a new prompt to the population.
        """
        if name in self.prompts:
            LOG.warning(f"Prompt with name '{name}' already exists. Overwriting.")

        self.prompts[name] = Prompt(name, template, score)
        LOG.info(f"Added prompt '{name}' with score {score}.")

    def get_prompt(self, name):
        """
        Retrieves a prompt object by its name.
        """
        return self.prompts.get(name)

    def get_best_prompt(self):
        """
        Returns the prompt with the highest score.
        """
        if not self.prompts:
            return None

        return max(self.prompts.values(), key=lambda p: p.score)

    def update_score(self, name, score):
        """
        Updates the score of a prompt after evaluation.
        """
        if name in self.prompts:
            self.prompts[name].score = score
            LOG.info(f"Updated score for prompt '{name}' to {score}.")
        else:
            LOG.warning(f"Prompt '{name}' not found for score update.")
```
🛠️ Refactor suggestion | 🟠 Major
Add type hints to all methods.
The PromptManager class methods are missing type hints for parameters and return values, which is required by the coding guidelines for Python 3.11+.
As per coding guidelines, apply these changes:

```diff
+from typing import Optional
+
 class PromptManager:
-    def __init__(self):
+    def __init__(self) -> None:
         self.prompts = {}
         self._load_baseline_prompt()

-    def _load_baseline_prompt(self):
+    def _load_baseline_prompt(self) -> None:
         """
         Loads the original 'extract-action.j2' prompt as the baseline.
         """
         # ... rest of method

-    def add_prompt(self, name, template, score=0):
+    def add_prompt(self, name: str, template: str, score: float = 0) -> None:
         """
         Adds a new prompt to the population.
         """
         # ... rest of method

-    def get_prompt(self, name):
+    def get_prompt(self, name: str) -> Optional[Prompt]:
         """
         Retrieves a prompt object by its name.
         """
         return self.prompts.get(name)

-    def get_best_prompt(self):
+    def get_best_prompt(self) -> Optional[Prompt]:
         """
         Returns the prompt with the highest score.
         """
         # ... rest of method

-    def update_score(self, name, score):
+    def update_score(self, name: str, score: float) -> None:
         """
         Updates the score of a prompt after evaluation.
         """
         # ... rest of method
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
from typing import Optional


class PromptManager:
    def __init__(self) -> None:
        self.prompts: dict[str, Prompt] = {}
        self._load_baseline_prompt()

    def _load_baseline_prompt(self) -> None:
        """
        Loads the original 'extract-action.j2' prompt as the baseline.
        """
        try:
            # Access the Jinja2 environment from the prompt_engine
            env = prompt_engine.env
            # Construct the path to the template within the Jinja2 environment
            template_path = "skyvern/extract-action.j2"
            # Get the template source from the loader
            baseline_template = env.loader.get_source(env, template_path)[0]

            self.add_prompt("baseline", baseline_template, score=1.0)  # Assuming baseline is good.
            LOG.info("Loaded baseline prompt 'extract-action.j2'.")
        except Exception as e:
            LOG.error(f"Failed to load baseline prompt: {e}", exc_info=True)

    def add_prompt(self, name: str, template: str, score: float = 0) -> None:
        """
        Adds a new prompt to the population.
        """
        if name in self.prompts:
            LOG.warning(f"Prompt with name '{name}' already exists. Overwriting.")

        self.prompts[name] = Prompt(name, template, score)
        LOG.info(f"Added prompt '{name}' with score {score}.")

    def get_prompt(self, name: str) -> Optional[Prompt]:
        """
        Retrieves a prompt object by its name.
        """
        return self.prompts.get(name)

    def get_best_prompt(self) -> Optional[Prompt]:
        """
        Returns the prompt with the highest score.
        """
        if not self.prompts:
            return None

        return max(self.prompts.values(), key=lambda p: p.score)

    def update_score(self, name: str, score: float) -> None:
        """
        Updates the score of a prompt after evaluation.
        """
        if name in self.prompts:
            self.prompts[name].score = score
            LOG.info(f"Updated score for prompt '{name}' to {score}.")
        else:
            LOG.warning(f"Prompt '{name}' not found for score update.")
```
…concepts in the AlphaEvolve research paper. The system is designed to iteratively improve the prompts used by the Skyvern agent, enhancing its performance and reliability.

The core of this feature is the new `skyvern/evolution` package, which includes:
- `PromptManager`: A class for managing a population of prompts and their performance scores.
- `Evolve`: A class that uses an LLM to generate new variations of prompts based on their performance.
- `evolve-prompt.j2`: A new prompt template to guide the LLM in the evolution process.

A new script, `scripts/run_evolution.py`, has been added to orchestrate the evolution loop, allowing for continuous improvement of the prompts.

The `ForgeAgent` has been integrated with the `PromptManager` to dynamically use the best-performing prompt for its tasks. This creates a feedback loop where the agent's performance can be improved over time by evolving the prompts it uses.

🧬 This PR introduces a prompt evolution system inspired by AlphaEvolve research that automatically improves Skyvern agent prompts through iterative LLM-based evolution and performance scoring. The system creates a feedback loop where the best-performing prompts are selected and evolved to generate better variations over time.
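The select-mutate-evaluate cycle that `scripts/run_evolution.py` orchestrates can be approximated end to end. In the sketch below, `StubManager` and `StubEvolver` are hypothetical stand-ins for the PR's `PromptManager` and `Evolve` (no LLM call, a toy length-based score), so only the shape of the loop matches the real script:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Prompt:
    name: str
    template: str
    score: float = 0.0

class StubManager:
    """Stand-in for PromptManager: just enough state for the loop."""
    def __init__(self) -> None:
        self.prompts = {"baseline": Prompt("baseline", "original", 1.0)}

    def get_best_prompt(self) -> Prompt:
        return max(self.prompts.values(), key=lambda p: p.score)

class StubEvolver:
    """Stand-in for Evolve: 'mutates' by appending to the best template."""
    def __init__(self, manager: StubManager) -> None:
        self.manager = manager
        self.count = 0

    async def evolve_prompts(self) -> None:
        self.count += 1
        parent = self.manager.get_best_prompt()
        name = f"evolved_v{self.count}"
        self.manager.prompts[name] = Prompt(name, parent.template + " + tweak")

    def evaluate_and_score_prompts(self) -> None:
        for p in self.manager.prompts.values():
            if p.name != "baseline":
                p.score = 0.1 * len(p.template)  # toy heuristic, not the PR's

async def run_evolution(generations: int = 3) -> str:
    manager = StubManager()
    evolver = StubEvolver(manager)
    for _ in range(generations):
        await evolver.evolve_prompts()        # LLM would propose a variant here
        evolver.evaluate_and_score_prompts()  # scoring promotes good variants
    return manager.get_best_prompt().name

print(asyncio.run(run_evolution()))
```

Each generation evolves from the current best, so improvements compound: every variant is seeded by the highest-scoring prompt found so far.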
🔍 Detailed Analysis
Key Changes
- New `skyvern/evolution/` package with a `PromptManager` for managing prompt populations and an `Evolve` class for LLM-based prompt generation
- Updated `ForgeAgent` to dynamically use the best-performing prompt from the evolution system instead of static templates
- New script `scripts/run_evolution.py` to orchestrate the evolution loop with configurable generations
- New `evolve-prompt.j2` template to guide the LLM in generating improved prompt variations
- Integrated the `PromptManager` into the main application state via `app.py`

Technical Implementation
```mermaid
flowchart TD
    A[PromptManager] --> B[Load Baseline Prompt]
    B --> C[Evolve Class]
    C --> D[Generate Variations via LLM]
    D --> E[Evaluate & Score Prompts]
    E --> F[Select Best Prompt]
    F --> G[ForgeAgent Uses Best Prompt]
    G --> H[Performance Feedback]
    H --> C
    I[run_evolution.py] --> J[Evolution Loop]
    J --> C
```

Impact
Created with Palmier
Important
Introduces a prompt evolution system to improve Skyvern agent performance by managing and evolving prompts using LLM.
- Adds `skyvern/evolution` package with `PromptManager` and `Evolve` classes for managing and evolving prompts.
- Adds `evolve-prompt.j2` template for guiding LLM in prompt evolution.
- Adds `run_evolution.py` to run the prompt evolution loop.
- Integrates `PromptManager` with `ForgeAgent` to use best-performing prompts.
- `PromptManager`: Manages prompt population and scores.
- `Evolve`: Generates new prompt variations using LLM.
- `agent.py`: Modifies `_build_extract_action_prompt()` to use evolved prompts.
- `app.py`: Initializes `PROMPT_MANAGER` for prompt management.

This description was created by
for 6bd63e6. You can customize this summary. It will automatically update as commits are pushed.
Summary by CodeRabbit