Add tool-agnostic A/B ground truths for ai-ml-engineering skills (issue #485) #538
Open
jralfonsog wants to merge 1 commit into
…atabricks-solutions#485)

Curated grounds for the five skills assigned in issue databricks-solutions#485, following the A/B-fairness pattern established by jacksandom on eval/analyst-a-b-ground-truths (issue databricks-solutions#486, PR databricks-solutions#519): outputs.response rewritten as tool-agnostic natural-language descriptions of what each response should confirm, with empty expected_facts / expected_patterns / guidelines blocks.

Changes per skill (vs origin/main):

- databricks-agent-bricks: 5 cases rewritten (all 5 originals were MCP-prescriptive with manage_ka/manage_mas tool calls). Concept of "external MCP server via UC connection" preserved in case prompts since it's a Databricks Agent Bricks feature separate from the Skillforge-side MCP tools dichotomy.
- databricks-model-serving: 8 cases rewritten (3 of 8 were MCP-prescriptive).
- databricks-vector-search: 7 cases rewritten + 2 dropped. The dropped cases (vs_mcp_create_endpoint_008 and vs_mcp_manage_data_009) had "Use MCP tools to..." literal in the prompt; no tool-agnostic rewrite preserves intent.
- databricks-ai-functions: 7 cases hand-authored from scratch. Origin/main has no .test/skills/databricks-ai-functions/ directory at all; manifest.yaml + ground_truth.yaml created to allow this skill to participate in the A/B. This is noted explicitly in the manifest source_note for transparency.
- databricks-mlflow-evaluation: unchanged. Original ground_truth had 0 MCP-prescriptive mentions; already tool-agnostic.

Co-authored-by: Isaac
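The tool-agnostic criteria described in the commit message (no tool-prescriptive wording, empty expected_facts / expected_patterns / guidelines blocks) could be spot-checked with something like the sketch below. This is only an illustration under assumptions: the case layout (inputs.prompt, outputs.response, and the three blocks as top-level keys) and the function name are hypothetical, not the actual ground_truth.yaml schema, and the marker list contains only the tool names and the prompt literal mentioned in this PR.

```python
# Hypothetical case layout; the real ground_truth.yaml schema may differ.
# Markers are limited to strings named in this PR. A bare "mcp" substring is
# deliberately NOT used, because prompts may legitimately mention an external
# MCP server via a UC connection (an Agent Bricks feature).
PRESCRIPTIVE_MARKERS = (
    "use mcp tools",
    "manage_ka",
    "manage_mas",
    "manage_serving_endpoint",
)
EMPTY_BLOCKS = ("expected_facts", "expected_patterns", "guidelines")


def tool_agnostic_issues(case: dict) -> list[str]:
    """Return human-readable problems; an empty list means the case passes."""
    issues = []
    prompt = case.get("inputs", {}).get("prompt", "")
    response = case.get("outputs", {}).get("response", "")

    # Flag tool-prescriptive wording in the prompt or the expected response.
    for field, text in (("prompt", prompt), ("outputs.response", response)):
        lowered = text.lower()
        for marker in PRESCRIPTIVE_MARKERS:
            if marker in lowered:
                issues.append(f"{field} mentions '{marker}'")

    # The A/B-fairness pattern leaves these blocks empty (or absent).
    for block in EMPTY_BLOCKS:
        if case.get(block):
            issues.append(f"{block} should be empty")
    return issues
```

Matching on specific markers rather than on "MCP" in general keeps prompts about the external-MCP-server feature from being flagged, at the cost of missing tool names that are not in the list.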
Summary
Curated tool-agnostic A/B ground truths for the five skills assigned in issue #485, following the A/B-fairness pattern established by @jacksandom on eval/analyst-a-b-ground-truths (issue #486, PR #519). Targeted at experimental. Single commit.
outputs.response rewritten as tool-agnostic natural-language descriptions of what each response should confirm; expected_facts / expected_patterns / guidelines blocks emptied; per-case metadata: blocks preserved.

Per-skill changes (vs origin/experimental)

- databricks-agent-bricks: … manage_ka/manage_mas tool calls). The "external MCP server via UC connection" concept is preserved in case prompts; it's an Agent Bricks feature distinct from the Skillforge-side MCP tool dichotomy.
- databricks-model-serving: … manage_serving_endpoint).
- databricks-vector-search: …
- databricks-ai-functions: … .test/skills/databricks-ai-functions/ directory; manifest and grounds were authored from scratch. source_note in the manifest documents this.
- databricks-mlflow-evaluation: … experimental. No rewrites needed.

Eval results
A/B comparison run with stf compare (per-side .mcp.json enforcing A = MCP-only with 44 tools, B = CLI-only with 0 tools). Full verdict comments on the issue, one per skill:

- databricks-mlflow-evaluation
- databricks-agent-bricks
- databricks-vector-search
- databricks-ai-functions
- databricks-model-serving

Net: experimental (CLI-first) doesn't hurt and helps modestly. The one clear win (agent-bricks) is a real skill-content win: experimental's CLI-first guidance led the agent to a concrete artifact (a created Knowledge Assistant) while main went down a UC-plumbing detour. The other two B-wins are smaller and partly driven by truncation patterns.

Scope and non-goals
- evaluation_results.json / report.html for each skill exist locally but aren't pinned to this branch (reproducibility is provided by the compare IDs + MLflow run links in the issue comments). Happy to add them in a second commit if useful as artifacts.

Test plan
- stf compare runs executed successfully against voodoo-lab workspace.
- … servers=['databricks'] count=44, B: servers=[] count=0).

This pull request was AI-assisted by Isaac.
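The per-side isolation noted in the test plan (A: servers=['databricks'], B: servers=[]) could be checked before a run with a sketch along these lines. It assumes each side's .mcp.json keeps its servers under a top-level mcpServers object, which is an assumption about the file layout rather than something this PR states, and the function name is hypothetical. Tool counts (44 vs 0) cannot be read from the config file, so the sketch compares server names only.

```python
import json


def configured_servers(config_text: str) -> list[str]:
    """Return the sorted MCP server names declared in a .mcp.json document.

    Assumes a top-level "mcpServers" object keyed by server name; a missing
    or empty object is treated as "no servers configured".
    """
    config = json.loads(config_text)
    return sorted(config.get("mcpServers", {}))


# Hypothetical per-side configs mirroring the A/B setup described above.
side_a = '{"mcpServers": {"databricks": {"command": "example-command"}}}'
side_b = '{"mcpServers": {}}'

assert configured_servers(side_a) == ["databricks"]  # A: MCP-only
assert configured_servers(side_b) == []              # B: CLI-only
```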