Add tool-agnostic A/B ground truths for ai-ml-engineering skills (issue #485) by jralfonsog · Pull Request #538 · databricks-solutions/ai-dev-kit

jralfonsog · 2026-05-19T10:44:39Z

Summary

Curated tool-agnostic A/B ground truths for the five skills assigned in issue #485, following the A/B-fairness pattern established by @jacksandom on eval/analyst-a-b-ground-truths (issue #486, PR #519). Targeted at experimental.

Single commit. outputs.response rewritten as tool-agnostic natural-language descriptions of what each response should confirm; expected_facts / expected_patterns / guidelines blocks emptied; per-case metadata: blocks preserved.
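The rewrite rule above can be sketched as a small lint over one ground truth case. This is only an illustration, not the checker used in this PR: the nesting of `expected_facts` / `expected_patterns` / `guidelines` under an `expectations` key, the function name `tool_agnostic_issues`, and the dict layout are assumptions. The tool names in the pattern are the ones called out in the per-skill notes.

```python
import re

# Tool names mentioned in this PR description as MCP-prescriptive.
MCP_TOOL_NAMES = re.compile(r"\b(manage_ka|manage_mas|manage_serving_endpoint)\b")

# Blocks that the PR says should be emptied in every case.
EMPTIED_BLOCKS = ("expected_facts", "expected_patterns", "guidelines")


def tool_agnostic_issues(case: dict) -> list[str]:
    """Return a list of problems; an empty list means the case looks tool-agnostic.

    Assumed (hypothetical) layout: outputs.response is a string, the emptied
    blocks live under an 'expectations' key, and 'metadata' is a top-level key.
    """
    issues = []

    response = case.get("outputs", {}).get("response", "")
    if not response.strip():
        issues.append("outputs.response is missing or empty")

    expectations = case.get("expectations", {})
    for block in EMPTIED_BLOCKS:
        if expectations.get(block):
            issues.append(f"{block} should be empty")

    # Only the response is checked: case prompts may legitimately mention
    # MCP as an Agent Bricks feature.
    match = MCP_TOOL_NAMES.search(response)
    if match:
        issues.append(f"response names an MCP tool: {match.group(0)}")

    if "metadata" not in case:
        issues.append("metadata block was not preserved")

    return issues
```

Running this over every case in a skill's ground truth file and failing on any non-empty result would give a quick pre-review check that a rewrite did not leave a prescriptive tool call behind.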

Per-skill changes (vs `origin/experimental`)

| Skill | Cases | Notes |
| --- | --- | --- |
| `databricks-agent-bricks` | 5 rewritten | All 5 originals were MCP-prescriptive (`manage_ka` / `manage_mas` tool calls). The "external MCP server via UC connection" concept is preserved in case prompts; it's an Agent Bricks feature distinct from the Skillforge-side MCP tool dichotomy. |
| `databricks-model-serving` | 8 rewritten | 3 of 8 were MCP-prescriptive (`manage_serving_endpoint`). |
| `databricks-vector-search` | 7 rewritten, 2 dropped | Two cases were MCP-only (no CLI counterpart possible) and were dropped rather than rewritten. |
| `databricks-ai-functions` | 7 new (hand-authored) | The experimental branch had no `.test/skills/databricks-ai-functions/` directory, so the manifest and ground truths were authored from scratch. `source_note` in the manifest documents this. |
| `databricks-mlflow-evaluation` | unchanged | Already had 0 MCP mentions on `experimental`. No rewrites needed. |

Eval results

A/B comparison run with stf compare (per-side .mcp.json enforcing A = MCP-only with 44 tools, B = CLI-only with 0 tools). Full verdict comments on the issue, one per skill:

| Skill | Source diff (main vs experimental) | Judge verdict | L3 (A / B) | L5 (A / B) | Comment |
| --- | --- | --- | --- | --- | --- |
| `databricks-mlflow-evaluation` | identical | TIE 0.20 | 0.87 / 0.89 | 0.36 / 0.54 | link |
| `databricks-agent-bricks` | real (854+/703-) | B wins, 0.60 | 0.72 / 0.83 | 0.42 / 0.40 | link |
| `databricks-vector-search` | real (28+/140-) | B wins, 0.55 | 0.80 / 0.88 | 0.50 / 0.50 | link |
| `databricks-ai-functions` | identical | B wins, 0.55 | 0.90 / 0.89 | 0.49 / 0.56 | link |
| `databricks-model-serving` | real (179+/272-) | TIE 0.30 (lean A) | 0.77 / 0.84 | 0.56 / 0.55 | link |

Net: experimental (CLI-first) doesn't hurt and helps modestly. The one clear win (agent-bricks) is a real skill-content win — experimental's CLI-first guidance led the agent to a concrete artifact (a created Knowledge Assistant) while main went down a UC-plumbing detour. The other two B-wins are smaller and partly driven by truncation patterns.
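As a quick sanity check on the "doesn't hurt and helps modestly" reading, the L3 and L5 columns from the table can be reduced to mean B minus A deltas. This is a minimal sketch using only the numbers reported above; the `SCORES` dict and the `mean_delta` helper are illustrative names, not part of `stf`.

```python
# (A, B) scores per skill, copied from the eval results table.
SCORES = {
    "databricks-mlflow-evaluation": {"L3": (0.87, 0.89), "L5": (0.36, 0.54)},
    "databricks-agent-bricks": {"L3": (0.72, 0.83), "L5": (0.42, 0.40)},
    "databricks-vector-search": {"L3": (0.80, 0.88), "L5": (0.50, 0.50)},
    "databricks-ai-functions": {"L3": (0.90, 0.89), "L5": (0.49, 0.56)},
    "databricks-model-serving": {"L3": (0.77, 0.84), "L5": (0.56, 0.55)},
}


def mean_delta(dimension: str) -> float:
    """Average of (B - A) across all skills for one score column."""
    deltas = [b - a for a, b in (skill[dimension] for skill in SCORES.values())]
    return sum(deltas) / len(deltas)


if __name__ == "__main__":
    for dimension in ("L3", "L5"):
        print(f"{dimension}: mean B - A = {mean_delta(dimension):+.3f}")
```

With five skills and a single run per side, these averages are a rough summary rather than a significance test; the per-skill rationale in the issue comments remains the primary evidence.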

Scope and non-goals

Only ground truth changes. No skill-content polish in this PR — that would be a follow-up after maintainers review the A/B results.
No eval outputs committed. evaluation_results.json / report.html for each skill exist locally but aren't pinned to this branch (reproducibility is provided by the compare IDs + MLflow run links in the issue comments). Happy to add them in a second commit if useful as artifacts.

Test plan

All 5 A/B stf compare runs executed successfully against voodoo-lab workspace.
Per-side MCP isolation verified at runtime (A: servers=['databricks'] count=44, B: servers=[] count=0).
All 5 verdict comments posted to issue #485 ("Test experimental branch: ai-ml-engineering") with compare IDs, dimension scores, and rationale excerpts.
Maintainer review of ground truth quality (looking at you, @jacksandom / @calreynolds / @QuentinAmbard).

This pull request was AI-assisted by Isaac.

…atabricks-solutions#485)

Curated grounds for the five skills assigned in issue databricks-solutions#485, following the A/B-fairness pattern established by jacksandom on eval/analyst-a-b-ground-truths (issue databricks-solutions#486, PR databricks-solutions#519): outputs.response rewritten as tool-agnostic natural-language descriptions of what each response should confirm, with empty expected_facts / expected_patterns / guidelines blocks.

Changes per skill (vs origin/main):

- databricks-agent-bricks: 5 cases rewritten (all 5 originals were MCP-prescriptive with manage_ka/manage_mas tool calls). Concept of "external MCP server via UC connection" preserved in case prompts since it's a Databricks Agent Bricks feature separate from the Skillforge-side MCP tools dichotomy.
- databricks-model-serving: 8 cases rewritten (3 of 8 were MCP-prescriptive).
- databricks-vector-search: 7 cases rewritten + 2 dropped. The dropped cases (vs_mcp_create_endpoint_008 and vs_mcp_manage_data_009) had "Use MCP tools to..." literal in the prompt; no tool-agnostic rewrite preserves intent.
- databricks-ai-functions: 7 cases hand-authored from scratch. Origin/main has no .test/skills/databricks-ai-functions/ directory at all; manifest.yaml + ground_truth.yaml created to allow this skill to participate in the A/B. This is noted explicitly in the manifest source_note for transparency.
- databricks-mlflow-evaluation: unchanged. Original ground_truth had 0 MCP-prescriptive mentions; already tool-agnostic.

Co-authored-by: Isaac

jralfonsog wants to merge 1 commit into databricks-solutions:experimental from jralfonsog:eval/ai-ml-engineering-a-b-ground-truths
