Add tool-agnostic A/B ground truths for ai-ml-engineering skills (issue #485) #538

Open
jralfonsog wants to merge 1 commit into databricks-solutions:experimental from jralfonsog:eval/ai-ml-engineering-a-b-ground-truths

@jralfonsog
Collaborator

Summary

Curated tool-agnostic A/B ground truths for the five skills assigned in issue #485, following the A/B-fairness pattern established by @jacksandom on eval/analyst-a-b-ground-truths (issue #486, PR #519). Targeted at experimental.

Single commit. outputs.response rewritten as tool-agnostic natural-language descriptions of what each response should confirm; expected_facts / expected_patterns / guidelines blocks emptied; per-case metadata: blocks preserved.
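
For reviewers, the per-case change described above can be sketched roughly as follows. This is a minimal illustration, not the script used for this PR: the dict layout, the `to_tool_agnostic` helper, and the sample case are assumptions made for the example. Only the field names (`outputs.response`, `expected_facts`, `expected_patterns`, `guidelines`, `metadata`) come from the description.

```python
from copy import deepcopy

# Blocks emptied in every case (field names taken from the PR description).
EMPTIED_BLOCKS = ("expected_facts", "expected_patterns", "guidelines")


def to_tool_agnostic(case: dict, new_response: str) -> dict:
    """Return a copy of `case` with a tool-agnostic response and emptied blocks.

    Placing the emptied blocks at the top level of the case is an assumption
    made for illustration; the real ground_truth.yaml layout may differ.
    """
    rewritten = deepcopy(case)
    rewritten.setdefault("outputs", {})["response"] = new_response
    for block in EMPTIED_BLOCKS:
        rewritten[block] = []
    # The per-case `metadata` block is deliberately left untouched.
    return rewritten


# Hypothetical case, invented for this example.
original = {
    "id": "example_case_001",
    "outputs": {"response": "Call manage_ka to create the assistant."},
    "expected_facts": ["uses manage_ka"],
    "expected_patterns": ["manage_ka\\("],
    "guidelines": ["Must call the MCP tool"],
    "metadata": {"category": "happy_path"},
}

updated = to_tool_agnostic(
    original,
    "The response should confirm that a Knowledge Assistant was created.",
)
```

The helper works on a deep copy, so the original case stays available for a side-by-side diff during review.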

Per-skill changes (vs origin/experimental)

| Skill | Cases | Notes |
| --- | --- | --- |
| databricks-agent-bricks | 5 rewritten | All 5 originals were MCP-prescriptive (manage_ka / manage_mas tool calls). The "external MCP server via UC connection" concept is preserved in case prompts; it is an Agent Bricks feature distinct from the Skillforge-side MCP tool dichotomy. |
| databricks-model-serving | 8 rewritten | 3 of 8 were MCP-prescriptive (manage_serving_endpoint). |
| databricks-vector-search | 7 rewritten, 2 dropped | Two cases were MCP-only (no CLI counterpart possible) and were dropped rather than rewritten. |
| databricks-ai-functions | 7 new (hand-authored) | The experimental branch had no .test/skills/databricks-ai-functions/ directory, so the manifest and ground truths were authored from scratch. source_note in the manifest documents this. |
| databricks-mlflow-evaluation | unchanged | Already had 0 MCP mentions on experimental. No rewrites needed. |
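
A rough way to triage cases for MCP-prescriptive wording, like the "0 MCP mentions" count above, is a simple text scan. The sketch below is an assumption about how such a count could be done, not the method used for this PR. The `MCP_MARKERS` pattern and `count_mcp_mentions` helper are hypothetical, built only from the tool-name prefix (`manage_...`) and the literal "MCP" that appear in this description.

```python
import re

# Hypothetical markers: the literal "MCP" and any `manage_*` tool name.
MCP_MARKERS = re.compile(r"\bMCP\b|\bmanage_[a-z_]+\b", re.IGNORECASE)


def count_mcp_mentions(text: str) -> int:
    """Count occurrences of MCP-related markers in a prompt or response."""
    return len(MCP_MARKERS.findall(text))


# Sample strings invented for illustration.
samples = {
    "prescriptive_prompt": "Use MCP tools to create an endpoint.",
    "prescriptive_response": "Call manage_serving_endpoint to deploy the model.",
    "tool_agnostic": "Deploy the model and confirm the endpoint is ready.",
}

counts = {name: count_mcp_mentions(text) for name, text in samples.items()}
```

A scan like this only flags candidates. It would also match the "external MCP server via UC connection" prompts in databricks-agent-bricks, which are intentionally kept, so each hit still needs manual review.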

Eval results

A/B comparison run with stf compare (per-side .mcp.json enforcing A = MCP-only with 44 tools, B = CLI-only with 0 tools). Full verdict comments on the issue, one per skill:

| Skill | Source diff (main vs experimental) | Judge verdict | L3 (A / B) | L5 (A / B) | Comment |
| --- | --- | --- | --- | --- | --- |
| databricks-mlflow-evaluation | identical | TIE 0.20 | 0.87 / 0.89 | 0.36 / 0.54 | link |
| databricks-agent-bricks | real (854+/703-) | B wins, 0.60 | 0.72 / 0.83 | 0.42 / 0.40 | link |
| databricks-vector-search | real (28+/140-) | B wins, 0.55 | 0.80 / 0.88 | 0.50 / 0.50 | link |
| databricks-ai-functions | identical | B wins, 0.55 | 0.90 / 0.89 | 0.49 / 0.56 | link |
| databricks-model-serving | real (179+/272-) | TIE 0.30 (lean A) | 0.77 / 0.84 | 0.56 / 0.55 | link |

Net: experimental (CLI-first) doesn't hurt and helps modestly. The one clear win (agent-bricks) is a real skill-content win — experimental's CLI-first guidance led the agent to a concrete artifact (a created Knowledge Assistant) while main went down a UC-plumbing detour. The other two B-wins are smaller and partly driven by truncation patterns.

Scope and non-goals

  • Only ground truth changes. No skill-content polish in this PR — that would be a follow-up after maintainers review the A/B results.
  • No eval outputs committed. evaluation_results.json / report.html for each skill exist locally but aren't pinned to this branch (reproducibility is provided by the compare IDs + MLflow run links in the issue comments). Happy to add them in a second commit if useful as artifacts.

Test plan

  • All 5 A/B stf compare runs executed successfully against voodoo-lab workspace.
  • Per-side MCP isolation verified at runtime (A: servers=['databricks'] count=44, B: servers=[] count=0).
  • All 5 verdict comments posted to issue #485 ("Test experimental branch: ai-ml-engineering") with compare IDs, dimension scores, and rationale excerpts.
  • Maintainer review of ground truth quality (looking at you, @jacksandom / @calreynolds / @QuentinAmbard).
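
The per-side isolation check in the test plan can be expressed as a small assertion helper. This is a sketch under assumptions: the `SideObservation` type and `check_isolation` function are hypothetical and not part of stf. Only the observed values (A: servers=['databricks'] count=44, B: servers=[] count=0) come from the runs reported above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SideObservation:
    """What one side of the A/B run reported at runtime (hypothetical type)."""
    servers: list[str]
    tool_count: int


def check_isolation(a: SideObservation, b: SideObservation) -> list[str]:
    """Return a list of isolation problems; an empty list means the split held."""
    problems = []
    if not a.servers or a.tool_count == 0:
        problems.append("side A should expose at least one MCP server and tool")
    if b.servers or b.tool_count != 0:
        problems.append("side B should expose no MCP servers or tools")
    return problems


# Values as reported in the test plan.
side_a = SideObservation(servers=["databricks"], tool_count=44)
side_b = SideObservation(servers=[], tool_count=0)

problems = check_isolation(side_a, side_b)
```

An empty `problems` list corresponds to the verified state: A is MCP-only and B is CLI-only.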

This pull request was AI-assisted by Isaac.

…atabricks-solutions#485)

Curated grounds for the five skills assigned in issue databricks-solutions#485, following the
A/B-fairness pattern established by jacksandom on eval/analyst-a-b-ground-truths
(issue databricks-solutions#486, PR databricks-solutions#519): outputs.response rewritten as tool-agnostic natural-
language descriptions of what each response should confirm, with empty
expected_facts / expected_patterns / guidelines blocks.

Changes per skill (vs origin/main):

- databricks-agent-bricks: 5 cases rewritten (all 5 originals were MCP-prescriptive
  with manage_ka/manage_mas tool calls). Concept of "external MCP server via UC
  connection" preserved in case prompts since it's a Databricks Agent Bricks feature
  separate from the Skillforge-side MCP tools dichotomy.
- databricks-model-serving: 8 cases rewritten (3 of 8 were MCP-prescriptive).
- databricks-vector-search: 7 cases rewritten + 2 dropped. The dropped cases
  (vs_mcp_create_endpoint_008 and vs_mcp_manage_data_009) had "Use MCP tools to..."
  literal in the prompt; no tool-agnostic rewrite preserves intent.
- databricks-ai-functions: 7 cases hand-authored from scratch. Origin/main has
  no .test/skills/databricks-ai-functions/ directory at all; manifest.yaml +
  ground_truth.yaml created to allow this skill to participate in the A/B.
  This is noted explicitly in the manifest source_note for transparency.
- databricks-mlflow-evaluation: unchanged. Original ground_truth had 0
  MCP-prescriptive mentions; already tool-agnostic.

Co-authored-by: Isaac