diff --git a/.test/skills/databricks-agent-bricks/ground_truth.yaml b/.test/skills/databricks-agent-bricks/ground_truth.yaml index 5d4f804f..fcfc5f73 100644 --- a/.test/skills/databricks-agent-bricks/ground_truth.yaml +++ b/.test/skills/databricks-agent-bricks/ground_truth.yaml @@ -7,57 +7,16 @@ test_cases: inputs: prompt: Create a Knowledge Assistant for HR policy documents stored in /Volumes/my_catalog/my_schema/raw_data/hr_docs outputs: - response: "I'll create a Knowledge Assistant for your HR policy documents using\ - \ the `manage_ka` tool.\n\n```python\nmanage_ka(\n action=\"create_or_update\"\ - ,\n name=\"HR_Policy_Assistant\",\n volume_path=\"/Volumes/my_catalog/my_schema/raw_data/hr_docs\"\ - ,\n description=\"Answers questions about HR policies and procedures from\ - \ indexed documents\",\n instructions=\"Be helpful and professional. When\ - \ answering:\\n1. Always cite the specific document and section\\n2. If multiple\ - \ documents are relevant, mention all of them\\n3. If the information isn't\ - \ in the documents, clearly say so\\n4. Use bullet points for multi-part answers\"\ - \n)\n```\n\nThis will:\n1. Create a Knowledge Assistant that indexes all PDF\ - \ and text files in the volume\n2. Automatically scan for JSON files with example\ - \ question/guideline pairs and add them as examples\n3. Begin provisioning the\ - \ endpoint (takes 2-5 minutes to reach `ONLINE` status)\n\nYou can check the\ - \ status using `manage_ka(action=\"get\", tile_id=...)` with the returned `tile_id`.\ - \ The endpoint status will progress from `PROVISIONING` to `ONLINE`." + response: | + Confirms a Knowledge Assistant was created over the specified volume (/Volumes/my_catalog/my_schema/raw_data/hr_docs), with a clear description and answer-quality instructions covering source citation, multi-document handling, and acknowledging gaps when information is missing. Returns the new KA identifier and notes that provisioning runs asynchronously for a few minutes before the endpoint reaches an ONLINE state. execution_success: true expectations: - expected_facts: - - Uses manage_ka tool with action="create_or_update" - - Includes volume_path parameter - - Includes description for the KA - - Includes instructions for answer quality - expected_patterns: - - pattern: manage_ka - min_count: 1 - description: Must use the manage_ka MCP tool - - pattern: create_or_update - min_count: 1 - description: Must specify action="create_or_update" - - pattern: volume_path - min_count: 1 - description: Must specify volume_path parameter - - pattern: /Volumes/my_catalog/my_schema/raw_data/hr_docs - min_count: 1 - description: Must reference the specified volume path - - pattern: name - min_count: 1 - description: Must provide a name for the KA - guidelines: - - Must use manage_ka with action="create_or_update" and volume_path pointing to - the specified volume - - Should include a meaningful description for the KA - - Should include instructions to guide the KA's answer behavior - - Should explain provisioning timeline (PROVISIONING to ONLINE) + expected_facts: [] + expected_patterns: [] + guidelines: [] metadata: category: happy_path - difficulty: easy source: interactive - tags: - - ka - - create - - knowledge-assistant created_at: '2026-02-13T16:13:55.785471' execution_verified: mode: local @@ -68,82 +27,16 @@ test_cases: a Genie space for sales data (space_id 'def456'), and an external MCP server via UC connection 'erp_mcp_connection' for invoice operations outputs: - response: "I'll create a Multi-Agent Supervisor that combines a Knowledge Assistant,\ - \ a Genie Space, and an external MCP server using the `manage_mas` tool.\n\n\ - ```python\nmanage_mas(\n action=\"create_or_update\",\n name=\"Enterprise_Support_Supervisor\"\ - ,\n agents=[\n {\n \"name\": \"policy_agent\",\n \ - \ \"ka_tile_id\": \"abc123\",\n \"description\": \"Answers questions\ - \ about company policies, procedures, and compliance requirements from indexed\ - \ policy documents\"\n },\n {\n \"name\": \"sales_analytics\"\ - ,\n \"genie_space_id\": \"def456\",\n \"description\"\ - : \"SQL analytics on sales data: revenue trends, customer metrics, order history,\ - \ and pipeline analysis\"\n },\n {\n \"name\": \"erp_operations\"\ - ,\n \"connection_name\": \"erp_mcp_connection\",\n \ - \ \"description\": \"Execute ERP invoice operations: create, approve,\ - \ reject, and search invoices, retrieve vendor details, and trigger batch processing\ - \ workflows\"\n }\n ],\n description=\"Unified enterprise assistant\ - \ combining policy guidance, sales analytics, and ERP invoice operations\",\n\ - \ instructions=\"Route queries as follows:\\n- Policy and compliance questions\ - \ -> policy_agent\\n- Sales data, metrics, and analytics questions -> sales_analytics\\\ - n- Invoice actions (create, approve, reject, search) and ERP operations -> erp_operations\\\ - n\\nIf a query spans multiple domains, address the primary intent first.\"\n\ - )\n```\n\nThis creates a Supervisor Agent with three specialized agents:\n\n\ - 1. **policy_agent** - Uses the Knowledge Assistant (tile_id `abc123`) for document-based\ - \ Q&A on policies\n2. **sales_analytics** - Uses the Genie Space (space_id `def456`)\ - \ for natural language SQL queries on sales data\n3. **erp_operations** - Connects\ - \ to an external MCP server via the UC HTTP connection `erp_mcp_connection`\ - \ for invoice operations\n\nThe endpoint will take 2-5 minutes to provision.\ - \ Check status with `manage_mas(action=\"get\", tile_id=...)` using the returned\ - \ `tile_id`. Each agent type uses a different identifier field:\n- `ka_tile_id`\ - \ for Knowledge Assistants\n- `genie_space_id` for Genie Spaces\n- `connection_name`\ - \ for external MCP servers (UC HTTP connection name)" + response: | + Confirms the Multi-Agent Supervisor was created with three specialized child agents — one bound to the Knowledge Assistant (tile_id 'abc123') for policy Q&A, one bound to the Genie Space (space_id 'def456') for sales analytics, and one connecting to the external MCP server via the UC HTTP connection 'erp_mcp_connection' for invoice operations. Each agent has a routing-quality description, and the supervisor's instructions map query categories to the appropriate child agent. Returns the new MAS identifier and notes the provisioning timeline. execution_success: true expectations: - expected_facts: - - Uses manage_mas tool with action="create_or_update" - - Includes ka_tile_id for the Knowledge Assistant agent - - Includes genie_space_id for the Genie Space agent - - Uses connection_name for external MCP server agent - - Each agent has a name and description - expected_patterns: - - pattern: manage_mas - min_count: 1 - description: Must use the manage_mas MCP tool - - pattern: create_or_update - min_count: 1 - description: Must specify action="create_or_update" - - pattern: ka_tile_id - min_count: 1 - description: Must use ka_tile_id for the KA agent - - pattern: genie_space_id - min_count: 1 - description: Must use genie_space_id for the Genie agent - - pattern: connection_name - min_count: 1 - description: Must use connection_name field for external MCP server agent - - pattern: erp_mcp_connection - min_count: 1 - description: Must reference the specified UC connection name - guidelines: - - Must use manage_mas with action="create_or_update" and an agents list containing - all three agent types - - KA agent must use ka_tile_id (not endpoint_name) with the provided tile_id - - Genie agent must use genie_space_id with the provided space_id - - MCP server agent must use connection_name field with the UC connection name - - Each agent must have a specific, routing-quality description - - Should include routing instructions that map query types to agents + expected_facts: [] + expected_patterns: [] + guidelines: [] metadata: category: happy_path - difficulty: medium source: interactive - tags: - - mas - - create - - multi-agent - - mcp-server - - genie - - ka - - connection-name created_at: '2026-02-13T16:19:54.625799' execution_verified: mode: local @@ -153,76 +46,16 @@ test_cases: prompt: Create a Supervisor Agent that includes a UC function sales_analytics.utils.enrich_customer_data for data enrichment alongside a KA for product documentation outputs: - response: "I'll create a Supervisor Agent that combines a Unity Catalog function\ - \ for data enrichment with a Knowledge Assistant for product documentation using\ - \ the `manage_mas` tool.\n\n```python\nmanage_mas(\n action=\"create_or_update\"\ - ,\n name=\"Product_Support_Supervisor\",\n agents=[\n {\n \ - \ \"name\": \"product_docs\",\n \"ka_tile_id\": \"\"\ - ,\n \"description\": \"Answers questions about product features,\ - \ specifications, release notes, and usage guides from indexed product documentation\"\ - \n },\n {\n \"name\": \"data_enrichment\",\n \ - \ \"uc_function_name\": \"sales_analytics.utils.enrich_customer_data\"\ - ,\n \"description\": \"Enriches customer records with demographic\ - \ and purchase history data. Use when the user asks to enrich, augment, or enhance\ - \ customer data.\"\n }\n ],\n description=\"Product support assistant\ - \ with documentation Q&A and customer data enrichment capabilities\",\n instructions=\"\ - Route queries as follows:\\n- Product questions, feature inquiries, and documentation\ - \ lookups -> product_docs\\n- Requests to enrich, augment, or look up customer\ - \ data -> data_enrichment\\n\\nIf the user asks about a customer in the context\ - \ of a product, use product_docs first.\"\n)\n```\n\nThis creates a Supervisor\ - \ Agent with two specialized agents:\n\n1. **product_docs** - Uses a Knowledge\ - \ Assistant (`ka_tile_id`) for document-based Q&A on product documentation\n\ - 2. **data_enrichment** - Calls the UC function `sales_analytics.utils.enrich_customer_data`\ - \ directly for data operations\n\n**Important**: The UC function must already\ - \ exist and the agent service principal needs `EXECUTE` privilege on `sales_analytics.utils.enrich_customer_data`.\n\ - \nReplace `` with the actual tile ID of your product documentation\ - \ KA. You can find it using `manage_ka(action=\"find_by_name\", name=...)` if\ - \ you know the KA name.\n\nThe endpoint will take 2-5 minutes to provision.\ - \ Check status with `manage_mas(action=\"get\", tile_id=...)` using the returned\ - \ `tile_id`." + response: | + Confirms the Supervisor Agent was created with two specialized child agents: one bound to a Knowledge Assistant for product documentation and one bound to the Unity Catalog function 'sales_analytics.utils.enrich_customer_data' for customer data enrichment. Both agents carry routing-quality descriptions, and the supervisor instructions cover product-information and data-enrichment intents. Highlights that the UC function must already exist and the agent service principal needs EXECUTE privilege on the fully-qualified function name. Returns the new MAS identifier. execution_success: true expectations: - expected_facts: - - Uses manage_mas tool with action="create_or_update" - - Includes uc_function_name for the UC function agent - - Includes ka_tile_id for the Knowledge Assistant agent - - References sales_analytics.utils.enrich_customer_data - - Mentions EXECUTE privilege requirement - expected_patterns: - - pattern: manage_mas - min_count: 1 - description: Must use the manage_mas MCP tool - - pattern: create_or_update - min_count: 1 - description: Must specify action="create_or_update" - - pattern: uc_function_name - min_count: 1 - description: Must use uc_function_name field for UC function agent - - pattern: sales_analytics\.utils\.enrich_customer_data - min_count: 1 - description: Must reference the fully-qualified UC function name - - pattern: ka_tile_id - min_count: 1 - description: Must use ka_tile_id for the KA agent - guidelines: - - Must use manage_mas with action="create_or_update" and an agents list containing - both agent types - - UC function agent must use uc_function_name field (not endpoint_name) - - UC function name must be fully qualified as catalog.schema.function_name - - KA agent must use ka_tile_id field - - Should mention that the UC function must already exist and EXECUTE privilege - is needed - - Each agent must have a specific, routing-quality description + expected_facts: [] + expected_patterns: [] + guidelines: [] metadata: category: happy_path - difficulty: medium source: interactive - tags: - - mas - - create - - uc-function - - ka - - multi-agent created_at: '2026-02-13T16:21:37.882361' execution_verified: mode: local @@ -232,76 +65,16 @@ test_cases: prompt: Find the Knowledge Assistant named 'HR_Policy_Assistant' and use it as an agent in a new Supervisor Agent called 'Support MAS' alongside a Genie space outputs: - response: "I'll first look up the Knowledge Assistant by name, then create the\ - \ Supervisor Agent using the returned tile_id.\n\n**Step 1: Find the KA by name**\n\ - \n```python\nmanage_ka(\n action=\"find_by_name\",\n name=\"HR_Policy_Assistant\"\ - \n)\n```\n\nThis returns the KA details including `tile_id`, `endpoint_name`,\ - \ and `endpoint_status` (e.g., `{\"found\": true, \"tile_id\": \"f32c5f73-...\"\ - , \"name\": \"HR_Policy_Assistant\"}`).\n\n**Step 2: Create the Supervisor Agent**\n\ - \nUsing the `tile_id` from the lookup, create the MAS with `ka_tile_id` for\ - \ the KA agent and `genie_space_id` for the Genie space:\n\n```python\nmanage_mas(\n\ - \ action=\"create_or_update\",\n name=\"Support_MAS\",\n agents=[\n\ - \ {\n \"name\": \"hr_policy_agent\",\n \"ka_tile_id\"\ - : \"\",\n \"description\": \"\ - Answers questions about HR policies, procedures, benefits, and employee handbook\ - \ from indexed policy documents\"\n },\n {\n \"name\"\ - : \"data_analyst\",\n \"genie_space_id\": \"\"\ - ,\n \"description\": \"SQL analytics on operational data: headcount\ - \ metrics, department statistics, and workforce trends\"\n }\n ],\n\ - \ description=\"Unified support assistant combining HR policy guidance and\ - \ workforce analytics\",\n instructions=\"Route queries as follows:\\n- HR\ - \ policy, benefits, PTO, and handbook questions -> hr_policy_agent\\n- Data\ - \ questions, metrics, headcount, and workforce statistics -> data_analyst\\n\\\ - nIf unclear, ask the user to clarify whether they need a policy answer or a\ - \ data lookup.\"\n)\n```\n\nThe two-step workflow ensures you reference the\ - \ correct KA by its `tile_id` rather than guessing the ID. The `manage_ka(action=\"\ - find_by_name\")` tool is the recommended way to look up a KA when you know its\ - \ name but not its tile ID." + response: | + Performs a clear two-step workflow: first looks up the Knowledge Assistant 'HR_Policy_Assistant' by name to obtain its identifier (rather than hardcoding one), then creates the 'Support MAS' Supervisor Agent referencing that KA alongside a Genie Space agent for workforce analytics. Each child agent has a routing-quality description, and the supervisor's instructions cover policy and data-analytics intents. Returns the new MAS identifier and the resolved KA identifier. execution_success: true expectations: - expected_facts: - - Uses manage_ka with action="find_by_name" to look up the KA - - Uses manage_mas with action="create_or_update" to create the Supervisor Agent - - Uses ka_tile_id from the manage_ka find_by_name result - - Includes genie_space_id for the Genie space agent - - 'Two-step workflow: lookup first, then create' - expected_patterns: - - pattern: manage_ka - min_count: 1 - description: Must use manage_ka with action="find_by_name" to look up the KA - - pattern: find_by_name - min_count: 1 - description: Must use the find_by_name action - - pattern: HR_Policy_Assistant - min_count: 1 - description: Must reference the specified KA name - - pattern: manage_mas - min_count: 1 - description: Must use manage_mas with action="create_or_update" to create the MAS - - pattern: ka_tile_id - min_count: 1 - description: Must use ka_tile_id for the KA agent in the MAS - - pattern: genie_space_id - min_count: 1 - description: Must use genie_space_id for the Genie agent - guidelines: - - Must use manage_ka with action="find_by_name" first to look up the KA tile_id, - not hardcode an ID - - Must use ka_tile_id (not endpoint_name) when adding a KA to the MAS - - Must use genie_space_id for the Genie space agent - - 'Should show a clear two-step workflow: lookup then create' - - Each agent must have a specific, routing-quality description + expected_facts: [] + expected_patterns: [] + guidelines: [] metadata: category: happy_path - difficulty: medium source: interactive - tags: - - mas - - ka - - genie - - find-by-name - - multi-step - - lookup created_at: '2026-02-13T16:23:38.755002' execution_verified: mode: local @@ -311,71 +84,16 @@ test_cases: prompt: Update an existing MAS named 'Customer_Support' to add a new agent that connects to an external CRM system via UC HTTP connection 'crm_mcp_server' outputs: - response: "I'll look up the existing MAS by name, retrieve its current configuration,\ - \ then update it with the new CRM agent.\n\n**Step 1: Find the existing MAS**\n\ - \n```python\nmanage_mas(\n action=\"find_by_name\",\n name=\"Customer_Support\"\ - \n)\n```\n\nThis returns the MAS details including `tile_id` and `agents_count`\ - \ (e.g., `{\"found\": true, \"tile_id\": \"01xyz...\", \"name\": \"Customer_Support\"\ - , \"agents_count\": 2}`).\n\n**Step 2: Get current configuration**\n\n```python\n\ - manage_mas(\n action=\"get\",\n tile_id=\"\"\ - \n)\n```\n\nThis returns the full MAS configuration including the existing agents\ - \ list, description, and instructions.\n\n**Step 3: Update the MAS with the\ - \ new CRM agent**\n\nInclude all existing agents plus the new one. The `manage_mas`\ - \ tool with `action=\"create_or_update\"` finds the existing MAS by name and\ - \ updates it:\n\n```python\nmanage_mas(\n action=\"create_or_update\",\n\ - \ name=\"Customer_Support\",\n agents=[\n # ... include existing\ - \ agents from manage_mas get response ...\n {\n \"name\": \"\ - crm_operations\",\n \"connection_name\": \"crm_mcp_server\"\ - ,\n \"description\": \"Interact with the CRM system: look up customer\ - \ profiles, create and update support tickets, view interaction history, and\ - \ manage account details\"\n }\n ],\n description=\"Customer support\ - \ assistant with CRM integration\",\n instructions=\"Route queries as follows:\\\ - n- CRM lookups, ticket creation, customer profiles, and account management ->\ - \ crm_operations\\n- Route other queries to existing agents based on their descriptions.\"\ - \n)\n```\n\n**Important**: When updating a MAS, you must include all agents\ - \ in the `agents` list -- both the existing ones from `manage_mas(action=\"get\"\ - )` and the new CRM agent. Omitting existing agents will remove them from the\ - \ configuration.\n\nThe `manage_mas` tool with `action=\"create_or_update\"`\ - \ matches by name and performs an update rather than creating a duplicate." + response: | + Follows a multi-step workflow: locates the existing 'Customer_Support' MAS by name, retrieves its current configuration (including the existing agents list), then updates it by adding a new agent connected to the external MCP server via the UC HTTP connection 'crm_mcp_server' for CRM operations. Preserves the original MAS name and re-includes all prior agents in the update payload (omitting them would drop them). Returns confirmation of the update. execution_success: true expectations: - expected_facts: - - "Uses manage_mas with action=\"find_by_name\" to look up the existing MAS" - - "Uses manage_mas with action=\"get\" to retrieve the current configuration" - - "Uses manage_mas with action=\"create_or_update\" to update the MAS" - - "Uses connection_name for the external MCP server agent" - - "Warns that all existing agents must be included in the update" - expected_patterns: - - pattern: "manage_mas" - min_count: 3 - description: "Must use manage_mas for find_by_name, get, and create_or_update" - - pattern: "find_by_name" - min_count: 1 - description: "Must use the find_by_name action to look up the MAS" - - pattern: "Customer_Support" - min_count: 1 - description: "Must reference the specified MAS name" - - pattern: "action" - min_count: 3 - description: "Must specify action parameter for each manage_mas call" - - pattern: "connection_name" - min_count: 1 - description: "Must use connection_name field for external MCP server agent" - - pattern: "crm_mcp_server" - min_count: 1 - description: "Must reference the specified UC connection name" - guidelines: - - "Must follow a multi-step workflow: find by name, get current config, then update" - - "Must use manage_mas with action=\"find_by_name\" (not hardcode a tile_id) to locate the existing MAS" - - "Must use manage_mas with action=\"get\" to read the current agents before updating" - - "New MCP agent must use connection_name field with the UC connection name" - - "Must warn that omitting existing agents from the update will remove them" - - "Should preserve the original MAS name when updating" + expected_facts: [] + expected_patterns: [] + guidelines: [] metadata: category: happy_path - difficulty: hard source: interactive - tags: ["mas", "update", "mcp-server", "connection-name", "find-by-name", "multi-step"] created_at: '2026-02-13T16:25:01.420945' execution_verified: mode: local diff --git a/.test/skills/databricks-ai-functions/ground_truth.yaml b/.test/skills/databricks-ai-functions/ground_truth.yaml new file mode 100644 index 00000000..c761ec51 --- /dev/null +++ b/.test/skills/databricks-ai-functions/ground_truth.yaml @@ -0,0 +1,121 @@ +metadata: + skill_name: databricks-ai-functions + version: 0.1.0 + created_at: '2026-05-07T00:00:00.000000' + source_note: | + Hand-authored for A/B evaluation (issue #485). Original `.test/skills/databricks-ai-functions/` + did not exist on origin/experimental — created from scratch following the SKILL.md surface + so the skill can participate in the A/B compare with the other ai-ml-engineering skills. + +test_cases: +- id: aif_classify_001 + inputs: + prompt: "Classify each row of a support_tickets table into urgent / not urgent / spam using a built-in Databricks AI function." + outputs: + response: | + Adds a priority column to support_tickets by classifying ticket_text against the three labels ('urgent', 'not urgent', 'spam') using the task-specific ai_classify function (preferred over ai_query for fixed-label routing). Shows the SQL form and optionally the PySpark equivalent (`spark.sql` or `expr("ai_classify(...)")`). Notes that label lists should be 2–5 mutually exclusive labels for best accuracy and that NULL inputs return NULL. + execution_success: true + expectations: + expected_facts: [] + expected_patterns: [] + guidelines: [] + metadata: + category: happy_path + source: hand_authored + created_at: '2026-05-07T00:00:00.000000' + +- id: aif_extract_001 + inputs: + prompt: "Extract product, error_code, and date entities from a ticket_text column." + outputs: + response: | + Uses ai_extract on ticket_text with the entity list ['product', 'error_code', 'date'] (task-specific function preferred over ai_query for entity extraction). Confirms the result is a STRUCT with one field per entity, demonstrates how to access nested fields (e.g. entities.product), and shows either the SQL or PySpark variant. Notes that missing entities come back as NULL. + execution_success: true + expectations: + expected_facts: [] + expected_patterns: [] + guidelines: [] + metadata: + category: happy_path + source: hand_authored + created_at: '2026-05-07T00:00:00.000000' + +- id: aif_mask_pii_001 + inputs: + prompt: "Redact person names, emails, phone numbers, and addresses from a raw_messages table before storing to a safe table." + outputs: + response: | + Uses ai_mask on the message column with the entity list ['person', 'email', 'phone', 'address'] and writes the redacted output to catalog.schema.messages_safe in Delta append mode. Notes that ai_mask is the right task-specific function for PII redaction (do not use ai_query). Confirms idempotent appends and that the original raw table is untouched. + execution_success: true + expectations: + expected_facts: [] + expected_patterns: [] + guidelines: [] + metadata: + category: happy_path + source: hand_authored + created_at: '2026-05-07T00:00:00.000000' + +- id: aif_summarize_001 + inputs: + prompt: "Summarize a content column of customer reviews down to about 30 words each." + outputs: + response: | + Uses ai_summarize(content, 30) to produce per-row summaries capped at ~30 words. Prefers the task-specific function over ai_query for summarization. Notes that the second argument is the target word count and that passing 0 disables the cap. + execution_success: true + expectations: + expected_facts: [] + expected_patterns: [] + guidelines: [] + metadata: + category: happy_path + source: hand_authored + created_at: '2026-05-07T00:00:00.000000' + +- id: aif_parse_document_001 + inputs: + prompt: "Parse PDFs in /Volumes/catalog/schema/landing/documents/ and extract dates, amounts, and vendors from each." + outputs: + response: | + Reads files as binaryFile from the UC Volume, calls ai_parse_document on the binary content, and filters out rows where parse_error is non-null. Then chains ai_extract on the parsed text blocks with the entity list ['date', 'amount', 'vendor'] to produce a structured downstream column. Notes that ai_parse_document requires DBR 17.1+ and that filtering on parse_error IS NULL avoids polluting downstream tables with malformed inputs. + execution_success: true + expectations: + expected_facts: [] + expected_patterns: [] + guidelines: [] + metadata: + category: happy_path + source: hand_authored + created_at: '2026-05-07T00:00:00.000000' + +- id: aif_query_structured_001 + inputs: + prompt: "Extract invoice records with nested line items as structured JSON from a parsed_documents.text_blocks column. The shape needs numero, total, and an itens array with codigo/descricao/qtde/vlrUnit per line." + outputs: + response: | + Uses ai_query as a last resort (justified because the output schema has a nested array and no task-specific function covers nested invoice extraction). Passes responseFormat='{"type":"json_object"}' to constrain the model output to JSON, sets failOnError=false so individual row failures surface as STRUCT.error instead of raising, and parses the returned JSON with from_json against an explicit STRUCT schema covering numero, total, and itens. Explains why ai_query is appropriate here (where task-specific functions are not). + execution_success: true + expectations: + expected_facts: [] + expected_patterns: [] + guidelines: [] + metadata: + category: happy_path + source: hand_authored + created_at: '2026-05-07T00:00:00.000000' + +- id: aif_forecast_001 + inputs: + prompt: "Forecast daily_sales.sales out to 2026-12-31 using a built-in Databricks function." + outputs: + response: | + Uses the table-valued ai_forecast function with observed=TABLE(SELECT date, sales FROM daily_sales), horizon='2026-12-31', time_col='date', value_col='sales'. Confirms the result columns: date, sales_forecast, sales_upper, sales_lower (point forecast plus a confidence interval). Notes that ai_forecast requires a Pro or Serverless SQL warehouse — not available on Classic or Starter. + execution_success: true + expectations: + expected_facts: [] + expected_patterns: [] + guidelines: [] + metadata: + category: happy_path + source: hand_authored + created_at: '2026-05-07T00:00:00.000000' diff --git a/.test/skills/databricks-ai-functions/manifest.yaml b/.test/skills/databricks-ai-functions/manifest.yaml new file mode 100644 index 00000000..17f2e78d --- /dev/null +++ b/.test/skills/databricks-ai-functions/manifest.yaml @@ -0,0 +1,23 @@ +skill_name: databricks-ai-functions +tool_modules: [] +description: | + Manually authored ground truth for A/B evaluation of databricks-ai-functions + (main MCP vs experimental CLI). Cases cover the task-specific functions + (ai_classify, ai_extract, ai_summarize, ai_mask), structured JSON via ai_query, + ai_parse_document for PDFs, and ai_forecast for time series. +scorers: + enabled: + - pattern_adherence + - no_hallucinated_apis + - expected_facts_present + llm_scorers: + - Safety + - guidelines_from_expectations + default_guidelines: + - Response must address the user's request completely + - SQL and PySpark examples must follow documented best practices + - Response must prefer task-specific functions over ai_query when applicable +quality_gates: + syntax_valid: 1.0 + pattern_adherence: 0.9 + execution_success: 0.8 diff --git a/.test/skills/databricks-model-serving/ground_truth.yaml b/.test/skills/databricks-model-serving/ground_truth.yaml index b0a318dd..a74402a1 100644 --- a/.test/skills/databricks-model-serving/ground_truth.yaml +++ b/.test/skills/databricks-model-serving/ground_truth.yaml @@ -1,659 +1,133 @@ -test_cases: - # ============================================================================ - # GenAI Agents - ResponsesAgent - # ============================================================================ - - id: "ms_responses_agent_001" - inputs: - prompt: "Create a simple ResponsesAgent that echoes user messages" - outputs: - response: | - Here's a simple ResponsesAgent that echoes user messages: - - ```python - # echo_agent.py - import mlflow - from mlflow.pyfunc import ResponsesAgent - from mlflow.types.responses import ( - ResponsesAgentRequest, - ResponsesAgentResponse, - ResponsesAgentStreamEvent, - ) - from typing import Generator - - - class EchoAgent(ResponsesAgent): - """Agent that echoes back user messages.""" - - def predict(self, request: ResponsesAgentRequest) -> ResponsesAgentResponse: - """Echo the user's message back.""" - last_message = request.input[-1] - echo_text = f"Echo: {last_message.content}" - - # Use helper method - NOT raw dicts - return ResponsesAgentResponse( - output=[self.create_text_output_item(text=echo_text, id="echo_1")] - ) - - def predict_stream( - self, request: ResponsesAgentRequest - ) -> Generator[ResponsesAgentStreamEvent, None, None]: - """Stream the echo response.""" - result = self.predict(request) - for item in result.output: - yield ResponsesAgentStreamEvent( - type="response.output_item.done", - item=item - ) - - - # Export for MLflow - AGENT = EchoAgent() - mlflow.models.set_model(AGENT) - ``` - - Key points: - - Uses `self.create_text_output_item()` for output (NOT raw dicts) - - Implements both `predict` and `predict_stream` - - Exports with `mlflow.models.set_model()` - execution_success: true - expectations: - expected_facts: - - "ResponsesAgent" - - "create_text_output_item" - - "mlflow.models.set_model" - expected_patterns: - - pattern: "class.*\\(ResponsesAgent\\)" - min_count: 1 - description: "ResponsesAgent subclass" - - pattern: "self\\.create_text_output_item\\(" - min_count: 1 - description: "helper method for output" - - pattern: "mlflow\\.models\\.set_model\\(" - min_count: 1 - description: "model export" - guidelines: - - "Must use ResponsesAgent pattern, not ChatAgent" - - "Must use self.create_text_output_item() helper method, NOT raw dicts" - - "Must export agent with mlflow.models.set_model()" - metadata: - category: "happy_path" - difficulty: "easy" - source: "cursor_generated" - tags: ["genai", "responses-agent", "basic"] - execution_verified: - mode: databricks - verified_date: "2026-02-09" - cluster_id: "0709-132523-cnhxf2p6" - - # ============================================================================ - # Classical ML - Autolog - # ============================================================================ - - id: "ms_classical_ml_001" - inputs: - prompt: "Train and deploy an sklearn classifier with autolog to Unity Catalog" - outputs: - response: | - Here's how to train and deploy an sklearn classifier with autolog: - - ```python - import mlflow - import mlflow.sklearn - from sklearn.linear_model import LogisticRegression - from sklearn.model_selection import train_test_split - from sklearn.datasets import load_iris - - # Configuration - catalog = "main" - schema = "models" - model_name = "iris_classifier" - - # Enable autolog with auto-registration to Unity Catalog - mlflow.sklearn.autolog( - log_input_examples=True, - registered_model_name=f"{catalog}.{schema}.{model_name}" - ) - - # Load and prepare data - X, y = load_iris(return_X_y=True) - X_train, X_test, y_train, y_test = train_test_split( - X, y, test_size=0.25, random_state=42 - ) - - # Train model - automatically logged and registered - model = LogisticRegression(max_iter=200) - model.fit(X_train, y_train) - - # Verify accuracy - accuracy = model.score(X_test, y_test) - print(f"Model accuracy: {accuracy:.2%}") - print(f"Registered to: {catalog}.{schema}.{model_name}") - ``` - - Key points: - - `mlflow.sklearn.autolog()` handles logging automatically - - `registered_model_name` with 3-level path registers to Unity Catalog - - `log_input_examples=True` helps with debugging and signature inference - execution_success: true - expectations: - expected_facts: - - "mlflow.sklearn.autolog" - - "registered_model_name" - - "Unity Catalog" - expected_patterns: - - pattern: "mlflow\\.sklearn\\.autolog\\(" - min_count: 1 - description: "sklearn autolog" - - pattern: "registered_model_name.*=.*f?[\"'].*\\..*\\." - min_count: 1 - description: "3-level UC path" - guidelines: - - "Must use mlflow.sklearn.autolog() for sklearn models" - - "Must register to Unity Catalog with 3-level path (catalog.schema.model)" - - "Should use log_input_examples=True for better debugging" - metadata: - category: "happy_path" - difficulty: "easy" - source: "cursor_generated" - tags: ["classical-ml", "sklearn", "autolog"] - execution_verified: - mode: databricks - verified_date: "2026-02-09" - cluster_id: "0709-132523-cnhxf2p6" - - # ============================================================================ - # Custom PyFunc - # ============================================================================ - - id: "ms_custom_pyfunc_001" - inputs: - prompt: "Create a custom pyfunc model with preprocessing" - outputs: - response: | - Here's a custom pyfunc model with preprocessing: - - ```python - import mlflow - import pandas as pd - from mlflow.models import infer_signature - - - class CustomModelWithPreprocessing(mlflow.pyfunc.PythonModel): - """Custom model that includes preprocessing logic.""" - - def load_context(self, context): - """Load artifacts when model is loaded.""" - import pickle - - # Load preprocessor - with open(context.artifacts["preprocessor"], "rb") as f: - self.preprocessor = pickle.load(f) - - # Load trained model - with open(context.artifacts["model"], "rb") as f: - self.model = pickle.load(f) - - def predict(self, context, model_input: pd.DataFrame) -> pd.DataFrame: - """Preprocess input and run prediction.""" - # Apply preprocessing - processed = self.preprocessor.transform(model_input) - - # Run prediction - predictions = self.model.predict(processed) - - return pd.DataFrame({"prediction": predictions}) - - - # Log the model with artifacts - with mlflow.start_run(): - mlflow.pyfunc.log_model( - artifact_path="model", - python_model=CustomModelWithPreprocessing(), - artifacts={ - "preprocessor": "artifacts/preprocessor.pkl", - "model": "artifacts/model.pkl" - }, - pip_requirements=["scikit-learn==1.3.0", "pandas"], - registered_model_name="main.models.custom_preprocessor" - ) - ``` - - Key points: - - Subclass `mlflow.pyfunc.PythonModel` - - Implement `load_context()` to load artifacts - - Use `mlflow.pyfunc.log_model()` with `artifacts` dict - - Always specify `pip_requirements` for dependencies - execution_success: true - expectations: - expected_facts: - - "mlflow.pyfunc.PythonModel" - - "load_context" - - "artifacts" - expected_patterns: - - pattern: "class.*\\(mlflow\\.pyfunc\\.PythonModel\\)" - min_count: 1 - description: "PythonModel subclass" - - pattern: "def load_context\\(self, context\\)" - min_count: 1 - description: "load_context method" - - pattern: "mlflow\\.pyfunc\\.log_model\\(" - min_count: 1 - description: "log_model call" - guidelines: - - "Must subclass mlflow.pyfunc.PythonModel" - - "Must implement load_context for loading artifacts" - - "Must specify pip_requirements for dependencies" - metadata: - category: "happy_path" - difficulty: "medium" - source: "cursor_generated" - tags: ["custom-pyfunc", "preprocessing", "artifacts"] - execution_verified: - mode: databricks - verified_date: "2026-02-09" - cluster_id: "0709-132523-cnhxf2p6" - - # ============================================================================ - # Deployment - Job-Based - # ============================================================================ - - id: "ms_deployment_job_001" - inputs: - prompt: "Deploy a GenAI agent using job-based async deployment" - outputs: - response: | - For GenAI agents, use job-based deployment to avoid MCP timeouts (~15 min deployment time). - - **Step 1: Create deployment script (`deploy_agent.py`)** - - ```python - # deploy_agent.py - import sys - from databricks import agents - - model_name = sys.argv[1] if len(sys.argv) > 1 else "main.agents.my_agent" - version = sys.argv[2] if len(sys.argv) > 2 else "1" - - print(f"Deploying {model_name} version {version}...") - - deployment = agents.deploy( - model_name, - version, - tags={"source": "mcp", "environment": "dev"} - ) - - print(f"Deployment complete!") - print(f"Endpoint: {deployment.endpoint_name}") - ``` - - **Step 2: Create the deployment job** - - ```python - # Use MCP tool - create_job( - name="deploy-agent-job", - tasks=[ - { - "task_key": "deploy", - "spark_python_task": { - "python_file": "/Workspace/Users/you@company.com/deploy_agent.py", - "parameters": ["{{job.parameters.model_name}}", "{{job.parameters.version}}"] - } - } - ], - parameters=[ - {"name": "model_name", "default": "main.agents.my_agent"}, - {"name": "version", "default": "1"} - ] - ) - ``` - - **Step 3: Trigger deployment (returns immediately)** - - ```python - run_job_now( - job_id=12345, - job_parameters={"model_name": "main.agents.my_agent", "version": "2"} - ) - ``` - - **Step 4: Check endpoint status** +metadata: + skill_name: databricks-model-serving + version: 0.1.0 + created_at: '2026-02-23T08:00:00.000000' - ```python - get_serving_endpoint_status(name="agents_main-agents-my_agent") - ``` - - Key points: - - `databricks.agents.deploy()` handles endpoint creation/update - - Job-based approach avoids MCP tool timeouts - - Deployment takes ~15 minutes to complete - execution_success: true - expectations: - expected_facts: - - "agents.deploy" - - "job-based" - - "create_job" - - "run_job_now" - expected_patterns: - - pattern: "agents\\.deploy\\(" - min_count: 1 - description: "agents.deploy call" - - pattern: "create_job\\(" - min_count: 1 - description: "job creation" - - pattern: "run_job_now\\(" - min_count: 1 - description: "async job run" - guidelines: - - "Must use job-based deployment for GenAI agents (avoids timeout)" - - "Must use databricks.agents.deploy() not synchronous SDK calls" - - "Should show complete workflow: create job, run job, check status" - metadata: - category: "happy_path" - difficulty: "medium" - source: "cursor_generated" - tags: ["deployment", "job-based", "async", "genai"] - execution_verified: - mode: databricks - verified_date: "2026-02-09" - cluster_id: "0709-132523-cnhxf2p6" - - # ============================================================================ - # Querying Endpoints - # ============================================================================ - - id: "ms_query_endpoint_001" - inputs: - prompt: "Query a deployed ML endpoint with dataframe records" - outputs: - response: | - Here's how to query a deployed ML endpoint: - - **Via MCP Tool:** - - ```python - # Using MCP tool (pseudo-code) - # query_serving_endpoint( - # name="sklearn-classifier", - # dataframe_records=[ - # {"age": 25, "income": 50000, "credit_score": 720}, - # {"age": 35, "income": 75000, "credit_score": 680} - # ] - # ) - ``` - - **Via Python SDK:** - - ```python - from databricks.sdk import WorkspaceClient - - w = WorkspaceClient() - - # Query the endpoint - response = w.serving_endpoints.query( - name="sklearn-classifier", - dataframe_records=[ - {"age": 25, "income": 50000, "credit_score": 720}, - {"age": 35, "income": 75000, "credit_score": 680} - ] - ) - - # Access predictions - print(response.predictions) # [0.85, 0.72] - ``` - - Key points: - - Use `dataframe_records` for ML models (not `messages`) - - Response contains `predictions` array - - Check endpoint status with `get_serving_endpoint_status` first - execution_success: true - expectations: - expected_facts: - - "dataframe_records" - - "query_serving_endpoint" - - "predictions" - expected_patterns: - - pattern: "dataframe_records\\s*=\\s*\\[" - min_count: 1 - description: "dataframe_records format" - - pattern: "(query_serving_endpoint|serving_endpoints\\.query)\\(" - min_count: 1 - description: "query method" - guidelines: - - "Must use dataframe_records for ML models, not messages" - - "Should show MCP tool and/or SDK usage" - - "Should mention checking endpoint status first" - metadata: - category: "happy_path" - difficulty: "easy" - source: "cursor_generated" - tags: ["querying", "ml-endpoint", "dataframe-records"] - execution_verified: - mode: databricks - verified_date: "2026-02-09" - cluster_id: "0709-132523-cnhxf2p6" - - - id: "ms_query_chat_endpoint_001" - inputs: - prompt: "Query a chat/agent endpoint with messages" - outputs: - response: | - Here's how to query a chat/agent endpoint: - - **Via MCP Tool:** - - ```python - # MCP tool call - query_serving_endpoint( - name="my-agent-endpoint", - messages=[ - {"role": "user", "content": "What is Databricks?"} - ], - max_tokens=500, - temperature=0.7 - ) - ``` - - **Via Python SDK:** - - ```python - from databricks.sdk import WorkspaceClient - - w = WorkspaceClient() - - response = w.serving_endpoints.query( - name="my-agent-endpoint", - messages=[ - {"role": "user", "content": "What is Databricks?"} - ], - max_tokens=500 - ) - - # Access the response - print(response.choices[0].message.content) - ``` - - **Streaming Response:** - - ```python - for chunk in w.serving_endpoints.query( - name="my-agent-endpoint", - messages=[{"role": "user", "content": "Tell me a story"}], - stream=True - ): - if chunk.choices: - print(chunk.choices[0].delta.content, end="") - ``` - - Key points: - - Use `messages` for chat/agent endpoints (not `dataframe_records`) - - Response follows OpenAI chat completion structure - - Use `stream=True` for streaming responses - execution_success: true - expectations: - expected_facts: - - "messages" - - "query_serving_endpoint" - - "choices" - expected_patterns: - - pattern: "messages.*\\[.*\\{.*role.*content.*\\}.*\\]" - min_count: 1 - description: "messages format" - - pattern: "(query_serving_endpoint|serving_endpoints\\.query)\\(" - min_count: 1 - description: "query method" - guidelines: - - "Must use messages format for chat/agent endpoints" - - "Should show response structure with choices" - - "May show streaming option" - metadata: - category: "happy_path" - difficulty: "easy" - source: "cursor_generated" - tags: ["querying", "chat-endpoint", "agent-endpoint"] - execution_verified: - mode: databricks - verified_date: "2026-02-09" - cluster_id: "0709-132523-cnhxf2p6" - - # ============================================================================ - # Logging and Registration - # ============================================================================ - - id: "ms_log_agent_001" - inputs: - prompt: "Log a GenAI agent to Unity Catalog with proper resources" - outputs: - response: | - Here's how to log a GenAI agent to Unity Catalog: - - ```python - import mlflow - from mlflow.models.resources import DatabricksServingEndpoint - - # Configuration - LLM_ENDPOINT = "databricks-meta-llama-3-3-70b-instruct" - - # Set registry to Unity Catalog - mlflow.set_registry_uri("databricks-uc") - - # Define resources for automatic passthrough authentication - resources = [ - DatabricksServingEndpoint(endpoint_name=LLM_ENDPOINT) - ] - - with mlflow.start_run(): - model_info = mlflow.pyfunc.log_model( - artifact_path="agent", - python_model="agent.py", # File path, not class instance - resources=resources, - pip_requirements=[ - "mlflow==3.6.0", - "databricks-langchain", - "langgraph==0.3.4", - ], - input_example={ - "input": [{"role": "user", "content": "Hello!"}] - }, - registered_model_name="main.agents.my_agent" - ) - - print(f"Model URI: {model_info.model_uri}") - ``` - - Key points: - - Use `mlflow.set_registry_uri("databricks-uc")` for Unity Catalog - - Specify `resources` for automatic passthrough authentication - - Use file path (`python_model="agent.py"`) for file-based logging - - Include exact `pip_requirements` versions - - Use 3-level namespace: `catalog.schema.model_name` - execution_success: true - expectations: - expected_facts: - - "mlflow.pyfunc.log_model" - - "resources" - - "DatabricksServingEndpoint" - - "registered_model_name" - expected_patterns: - - pattern: "mlflow\\.pyfunc\\.log_model\\(" - min_count: 1 - description: "log_model call" - - pattern: "DatabricksServingEndpoint\\(" - min_count: 1 - description: "resource specification" - - pattern: "registered_model_name.*=.*[\"'].*\\..*\\." - min_count: 1 - description: "3-level UC path" - guidelines: - - "Must specify resources for LLM endpoints (auto passthrough)" - - "Must register to Unity Catalog with 3-level path" - - "Must specify pip_requirements with exact versions" - metadata: - category: "happy_path" - difficulty: "medium" - source: "cursor_generated" - tags: ["logging", "registration", "unity-catalog", "resources"] - execution_verified: - mode: databricks - verified_date: "2026-02-09" - cluster_id: "0709-132523-cnhxf2p6" - - # ============================================================================ - # Error Cases / Common Issues - # ============================================================================ - - id: "ms_error_raw_dict_001" - inputs: - prompt: "What's wrong with this ResponsesAgent that returns raw dicts?" - outputs: - response: | - The common mistake is using raw dicts in ResponsesAgentResponse output instead of helper methods. - - **WRONG - raw dicts cause errors:** - - ```python - class BadAgent(ResponsesAgent): - def predict(self, request: ResponsesAgentRequest) -> ResponsesAgentResponse: - # ERROR: Raw dicts cause "Invalid output format" errors! - return ResponsesAgentResponse( - output=[{"role": "assistant", "content": "Hello!"}] - ) - ``` - - **CORRECT - use helper methods:** - - ```python - class GoodAgent(ResponsesAgent): - def predict(self, request: ResponsesAgentRequest) -> ResponsesAgentResponse: - # MUST use helper methods for output items - return ResponsesAgentResponse( - output=[self.create_text_output_item(text="Hello!", id="msg_1")] - ) - ``` - - **Available helper methods:** - - `self.create_text_output_item(text, id)` - for text responses - - `self.create_function_call_item(id, call_id, name, arguments)` - for tool calls - - `self.create_function_call_output_item(call_id, output)` - for tool results - - This is a critical pattern - raw dicts will cause deployment failures or runtime errors! - execution_success: true - expectations: - expected_facts: - - "create_text_output_item" - - "helper methods" - - "raw dicts" - expected_patterns: - - pattern: "self\\.create_text_output_item\\(" - min_count: 1 - description: "correct helper method" - - pattern: "(WRONG|ERROR|don't work)" - min_count: 1 - description: "identifies the error" - guidelines: - - "Must clearly identify that raw dicts are wrong" - - "Must show correct pattern with helper methods" - - "Should list all available helper methods" - metadata: - category: "error_handling" - difficulty: "medium" - source: "cursor_generated" - tags: ["error", "responses-agent", "common-issue"] - execution_verified: - mode: databricks - verified_date: "2026-02-09" - cluster_id: "0709-132523-cnhxf2p6" +test_cases: +- id: ms_responses_agent_001 + inputs: + prompt: Create a simple ResponsesAgent that echoes user messages + outputs: + response: | + Provides a complete ResponsesAgent class with the correct imports and predict method signature. The predict method extracts the user's last message and returns it as an echo wrapped in the agent's response type (NOT a raw dict). Confirms that the returned object conforms to the ResponsesAgent contract so downstream serving and logging accept it. May briefly note when ResponsesAgent is the right choice vs ChatAgent. + execution_success: true + expectations: + expected_facts: [] + expected_patterns: [] + guidelines: [] + metadata: + category: happy_path + source: auto_generated + created_at: '2026-02-23T08:00:00.000000' + +- id: ms_classical_ml_001 + inputs: + prompt: Train and deploy an sklearn classifier with autolog to Unity Catalog + outputs: + response: | + Walks through training an sklearn classifier with mlflow.sklearn.autolog enabled, then registers the trained model to Unity Catalog using the 3-level namespace (catalog.schema.model_name). Confirms model creation with a signature and input example, then creates a Model Serving endpoint backed by the registered model version. Mentions the standard autolog behaviors (params/metrics/artifacts captured automatically) and how to verify the endpoint is ready before sending traffic. + execution_success: true + expectations: + expected_facts: [] + expected_patterns: [] + guidelines: [] + metadata: + category: happy_path + source: auto_generated + created_at: '2026-02-23T08:00:00.000000' + +- id: ms_custom_pyfunc_001 + inputs: + prompt: Create a custom pyfunc model with preprocessing + outputs: + response: | + Defines a subclass of mlflow.pyfunc.PythonModel that wraps a preprocessing transformation plus an inner predictor. Implements load_context (to load artifacts) and predict (to apply preprocessing before scoring). Logs the model with mlflow.pyfunc.log_model including a signature, input example, and explicit pip dependencies; registers it to a 3-level Unity Catalog name; and confirms the model loads correctly when loaded back via mlflow.pyfunc.load_model. + execution_success: true + expectations: + expected_facts: [] + expected_patterns: [] + guidelines: [] + metadata: + category: happy_path + source: auto_generated + created_at: '2026-02-23T08:00:00.000000' + +- id: ms_deployment_job_001 + inputs: + prompt: Deploy a GenAI agent using job-based async deployment + outputs: + response: | + Outlines the asynchronous deployment workflow for a GenAI agent: package the agent code and any required resources, kick off a Databricks Job that handles the long-running deployment, monitor the job to completion (with reasonable polling and a timeout), and then verify the resulting Model Serving endpoint is in a ready state before sending traffic. Notes where to find deployment logs and how to detect a failed deployment (job task failure, endpoint state not transitioning to READY). + execution_success: true + expectations: + expected_facts: [] + expected_patterns: [] + guidelines: [] + metadata: + category: happy_path + source: auto_generated + created_at: '2026-02-23T08:00:00.000000' + +- id: ms_query_endpoint_001 + inputs: + prompt: Query a deployed ML endpoint with dataframe records + outputs: + response: | + Sends a request to the deployed Model Serving endpoint using the dataframe_records (or dataframe_split) payload shape, includes the workspace authentication token, parses the JSON response body, and surfaces the predictions array. Notes common failure modes (401 missing/expired token, 404 endpoint name typo, 503 endpoint still scaling) and how to retry safely. + execution_success: true + expectations: + expected_facts: [] + expected_patterns: [] + guidelines: [] + metadata: + category: happy_path + source: auto_generated + created_at: '2026-02-23T08:00:00.000000' + +- id: ms_query_chat_endpoint_001 + inputs: + prompt: Query a chat/agent endpoint with messages + outputs: + response: | + Sends a chat-shaped request to a chat or agent endpoint with a messages array containing role/content pairs, parses the chat-shaped response (e.g. choices[0].message.content), and confirms the assistant reply. Mentions streaming via Server-Sent Events when stream=true is set, and notes that agent endpoints may return tool-call events that need to be handled separately from the final assistant message. + execution_success: true + expectations: + expected_facts: [] + expected_patterns: [] + guidelines: [] + metadata: + category: happy_path + source: auto_generated + created_at: '2026-02-23T08:00:00.000000' + +- id: ms_log_agent_001 + inputs: + prompt: Log a GenAI agent to Unity Catalog with proper resources + outputs: + response: | + Logs the GenAI agent (ChatAgent or ResponsesAgent) to MLflow under a 3-level Unity Catalog name. Crucially includes a `resources` list naming every dependency the agent needs at serve time — Unity Catalog functions, Genie Spaces, Vector Search indexes, downstream served model endpoints — so they are materialized as resource dependencies. Includes pip_requirements (or a requirements file) and code_paths for additional modules. Confirms a successful registration and a runnable mlflow.pyfunc.load_model round-trip. + execution_success: true + expectations: + expected_facts: [] + expected_patterns: [] + guidelines: [] + metadata: + category: happy_path + source: auto_generated + created_at: '2026-02-23T08:00:00.000000' + +- id: ms_error_raw_dict_001 + inputs: + prompt: What's wrong with this ResponsesAgent that returns raw dicts? + outputs: + response: | + Identifies the root cause: returning a raw dict from a ResponsesAgent's predict method violates the agent response contract. The predict method must return an instance of the proper agent response type, with items shaped according to its schema. Explains the fix (wrap the response in the response type and populate the items list correctly) and points out that raw dicts may appear to work in local prototypes but fail at the serving boundary or in downstream consumers that validate the response shape. + execution_success: true + expectations: + expected_facts: [] + expected_patterns: [] + guidelines: [] + metadata: + category: edge_case + source: auto_generated + created_at: '2026-02-23T08:00:00.000000' diff --git a/.test/skills/databricks-vector-search/ground_truth.yaml b/.test/skills/databricks-vector-search/ground_truth.yaml index ebdcacac..e79bcb74 100644 --- a/.test/skills/databricks-vector-search/ground_truth.yaml +++ b/.test/skills/databricks-vector-search/ground_truth.yaml @@ -9,405 +9,109 @@ test_cases: prompt: "How do I create a vector search endpoint in Databricks?" outputs: response: | - ```python - from databricks.sdk import WorkspaceClient - - w = WorkspaceClient() - - # Create a standard endpoint - endpoint = w.vector_search_endpoints.create_endpoint( - name="my-vs-endpoint", - endpoint_type="STANDARD" # or "STORAGE_OPTIMIZED" - ) - # Endpoint creation is asynchronous; check status with get_endpoint() - ``` - - Two endpoint types are available: - - **STANDARD**: Best for low-latency queries - - **STORAGE_OPTIMIZED**: Best for large-scale, cost-effective deployments + Confirms creation of a Vector Search endpoint with a chosen name and endpoint type (STANDARD for low-latency queries, STORAGE_OPTIMIZED for large-scale cost-effective deployments). Notes that endpoint creation is asynchronous and that subsequent operations should wait until the endpoint reaches its ready state. execution_success: true expectations: - expected_facts: - - "Endpoints are created with w.vector_search_endpoints.create_endpoint()" - - "Two endpoint types: STANDARD and STORAGE_OPTIMIZED" - - "Endpoint creation is asynchronous" - expected_patterns: - - pattern: "create_endpoint" - min_count: 1 - description: "Uses create_endpoint() method" - - pattern: "STANDARD" - min_count: 1 - description: "Shows STANDARD endpoint type" - - pattern: "STORAGE_OPTIMIZED" - min_count: 1 - description: "Shows STORAGE_OPTIMIZED endpoint type" - guidelines: - - "Must show both STANDARD and STORAGE_OPTIMIZED endpoint types" - - "Must mention that endpoint creation is asynchronous" + expected_facts: [] + expected_patterns: [] + guidelines: [] metadata: category: happy_path - difficulty: easy source: auto_generated + created_at: '2026-02-23T08:00:00.000000' - id: vs_delta_sync_managed_002 inputs: prompt: "How do I create a Delta Sync index with managed embeddings?" outputs: response: | - ```python - # Source table must have: primary key column + text column - index = w.vector_search_indexes.create_index( - name="catalog.schema.my_index", - endpoint_name="my-vs-endpoint", - primary_key="id", - index_type="DELTA_SYNC", - delta_sync_index_spec={ - "source_table": "catalog.schema.documents", - "embedding_source_columns": [ - { - "name": "content", - "embedding_model_endpoint_name": "databricks-gte-large-en" - } - ], - "pipeline_type": "TRIGGERED" # or "CONTINUOUS" - } - ) - ``` - - Databricks automatically embeds the text column using the specified model. Pipeline types: TRIGGERED (manual sync) or CONTINUOUS (auto-sync on source changes). + Confirms creation of a Delta Sync index over a source Delta table with managed embeddings: the source must declare a primary key and a text column, and the index uses a Databricks-hosted embedding model endpoint to embed that column automatically. Notes the choice between TRIGGERED (manual sync) and CONTINUOUS (auto-sync on source changes) pipeline types, and confirms the index is created against the named Vector Search endpoint. execution_success: true expectations: - expected_facts: - - "Delta Sync index automatically syncs from a source Delta table" - - "Managed embeddings use embedding_source_columns with a model endpoint" - - "TRIGGERED pipeline syncs on demand; CONTINUOUS syncs automatically" - - "databricks-gte-large-en is a built-in embedding model" - expected_patterns: - - pattern: "create_index" - min_count: 1 - description: "Uses create_index() method" - - pattern: "DELTA_SYNC" - min_count: 1 - description: "Specifies Delta Sync index type" - - pattern: "embedding_source_columns" - min_count: 1 - description: "Configures managed embedding columns" - - pattern: "databricks-gte-large-en" - min_count: 1 - description: "Uses built-in embedding model" - guidelines: - - "Must show the full create_index call with delta_sync_index_spec" - - "Must explain TRIGGERED vs CONTINUOUS pipeline types" - - "Must specify a Databricks embedding model endpoint" + expected_facts: [] + expected_patterns: [] + guidelines: [] metadata: category: happy_path - difficulty: medium source: auto_generated + created_at: '2026-02-23T08:00:00.000000' - id: vs_delta_sync_self_managed_003 inputs: prompt: "How do I create a Delta Sync index with pre-computed embeddings?" outputs: response: | - ```python - # Source table must have: primary key + embedding vector column - index = w.vector_search_indexes.create_index( - name="catalog.schema.my_index", - endpoint_name="my-vs-endpoint", - primary_key="id", - index_type="DELTA_SYNC", - delta_sync_index_spec={ - "source_table": "catalog.schema.documents", - "embedding_vector_columns": [ - { - "name": "embedding", - "embedding_dimension": 768 - } - ], - "pipeline_type": "TRIGGERED" - } - ) - ``` - - Use self-managed embeddings when you need a specific embedding model not available as a Databricks endpoint, or when you pre-compute embeddings in a pipeline. + Confirms creation of a Delta Sync index using pre-computed (self-managed) embeddings: the source table must already contain an embedding vector column whose dimension matches what's declared in the index spec. Distinguishes this from the managed-embeddings flow (no embedding_source_columns) and notes when to choose self-managed (custom model not served on Databricks, or embeddings already produced by an upstream pipeline). execution_success: true expectations: - expected_facts: - - "Self-managed embeddings use embedding_vector_columns instead of embedding_source_columns" - - "embedding_dimension must match the dimension of your pre-computed vectors" - - "The source table must already contain an embedding column" - expected_patterns: - - pattern: "embedding_vector_columns" - min_count: 1 - description: "Uses self-managed embedding columns" - - pattern: "embedding_dimension" - min_count: 1 - description: "Specifies embedding dimension" - - pattern: "DELTA_SYNC" - min_count: 1 - description: "Uses Delta Sync index type" - guidelines: - - "Must use embedding_vector_columns (not embedding_source_columns)" - - "Must include embedding_dimension parameter" - - "Must explain when to use self-managed vs managed embeddings" + expected_facts: [] + expected_patterns: [] + guidelines: [] metadata: category: happy_path - difficulty: medium source: auto_generated + created_at: '2026-02-23T08:00:00.000000' - id: vs_query_index_004 inputs: prompt: "How do I query a vector search index and get similarity results?" outputs: response: | - ```python - results = w.vector_search_indexes.query_index( - index_name="catalog.schema.my_index", - columns=["id", "content", "metadata"], - query_text="What is machine learning?", - num_results=5 - ) - - for doc in results.result.data_array: - score = doc[-1] # Similarity score is last column - print(f"Score: {score}, Content: {doc[1][:100]}...") - ``` - - For managed embedding indexes, use `query_text`. For self-managed, use `query_vector` with a pre-computed embedding. + Returns the top-N most similar documents from the index given a query, requesting specific result columns and a result count. For a managed-embedding index, the query is supplied as text and the index does the embedding internally; for self-managed, the caller provides the query vector. Confirms how to read the result rows and the per-row similarity score from the response. execution_success: true expectations: - expected_facts: - - "query_index() searches for similar documents by text or vector" - - "query_text is used for managed embedding indexes" - - "query_vector is used for self-managed embedding indexes" - - "Similarity score is the last column in the result array" - expected_patterns: - - pattern: "query_index" - min_count: 1 - description: "Uses query_index() method" - - pattern: "query_text" - min_count: 1 - description: "Uses text-based query" - - pattern: "num_results" - min_count: 1 - description: "Specifies number of results" - - pattern: "data_array" - min_count: 1 - description: "Accesses result data array" - guidelines: - - "Must show query_text for managed embeddings" - - "Must mention query_vector as alternative for self-managed" - - "Must show how to access similarity scores" + expected_facts: [] + expected_patterns: [] + guidelines: [] metadata: category: happy_path - difficulty: easy source: auto_generated + created_at: '2026-02-23T08:00:00.000000' - id: vs_query_with_filters_005 inputs: prompt: "How do I apply metadata filters when querying a vector search index?" outputs: response: | - ```python - results = w.vector_search_indexes.query_index( - index_name="catalog.schema.my_index", - columns=["id", "content", "category"], - query_text="machine learning best practices", - num_results=10, - filters_json='{"category": "technical", "status": "published"}' - ) - ``` - - The `filters_json` parameter accepts a JSON string of key-value pairs that filter results by metadata columns before similarity ranking. Only documents matching all filters are returned. + Demonstrates applying metadata filters to a vector search query so that only documents matching the filter conditions are considered before similarity ranking. Notes that filters use AND semantics by default across all conditions, and that filters are applied pre-ranking (so num_results applies to the post-filter set). Confirms the filtered subset and explains that filtering against a missing/null metadata column simply excludes those rows. execution_success: true expectations: - expected_facts: - - "filters_json accepts a JSON string for metadata filtering" - - "Filters are applied before similarity ranking" - - "All filter conditions must be satisfied (AND logic)" - expected_patterns: - - pattern: "filters_json" - min_count: 1 - description: "Uses filters_json parameter" - - pattern: "query_index" - min_count: 1 - description: "Uses query_index() method" - - pattern: "query_text" - min_count: 1 - description: "Combines text query with filters" - guidelines: - - "Must show filters_json parameter with a JSON string" - - "Must explain that filters are applied before similarity ranking" + expected_facts: [] + expected_patterns: [] + guidelines: [] metadata: category: happy_path - difficulty: medium source: auto_generated + created_at: '2026-02-23T08:00:00.000000' - id: vs_cli_reference_006 inputs: prompt: "What CLI commands are available for managing Databricks vector search?" outputs: response: | - ```bash - # List endpoints - databricks vector-search endpoints list - - # Create endpoint - databricks vector-search endpoints create \ - --name my-endpoint \ - --endpoint-type STANDARD - - # List indexes on endpoint - databricks vector-search indexes list-indexes \ - --endpoint-name my-endpoint - - # Get index status - databricks vector-search indexes get-index \ - --index-name catalog.schema.my_index - - # Sync index (for TRIGGERED pipeline) - databricks vector-search indexes sync-index \ - --index-name catalog.schema.my_index - - # Delete index - databricks vector-search indexes delete-index \ - --index-name catalog.schema.my_index - ``` + Enumerates the Databricks CLI subcommands for managing Vector Search endpoints and indexes: listing, creating, getting, syncing (for TRIGGERED pipelines), and deleting both endpoints and indexes. Distinguishes endpoint-management commands from index-management commands and notes that sync-index is what manually triggers a TRIGGERED pipeline. execution_success: true expectations: - expected_facts: - - "Vector search CLI commands are under databricks vector-search" - - "sync-index triggers a manual sync for TRIGGERED pipeline indexes" - - "Endpoints and indexes are managed as separate subcommands" - expected_patterns: - - pattern: "databricks vector-search endpoints" - min_count: 1 - description: "Shows endpoint management commands" - - pattern: "databricks vector-search indexes" - min_count: 1 - description: "Shows index management commands" - - pattern: "sync-index" - min_count: 1 - description: "Shows manual sync command" - guidelines: - - "Must show both endpoint and index management commands" - - "Must include sync-index for triggered pipelines" + expected_facts: [] + expected_patterns: [] + guidelines: [] metadata: category: happy_path - difficulty: easy source: auto_generated - -- id: vs_mcp_create_endpoint_008 - inputs: - prompt: "Use MCP tools to create a vector search endpoint called my-rag-endpoint" - outputs: - response: | - ```python - result = create_or_update_vs_endpoint( - name="my-rag-endpoint", - endpoint_type="STANDARD" - ) - ``` - - The `create_or_update_vs_endpoint` tool is idempotent — it returns the existing endpoint if one with that name already exists. - execution_success: true - expectations: - expected_facts: - - "create_or_update_vs_endpoint is the correct MCP tool name" - - "The tool is idempotent" - expected_patterns: - - pattern: "create_or_update_vs_endpoint" - min_count: 1 - description: "Uses correct MCP tool name (not create_vs_endpoint)" - guidelines: - - "Must use create_or_update_vs_endpoint, NOT create_vs_endpoint" - - "Must mention that the tool is idempotent" - metadata: - category: happy_path - difficulty: easy - source: manual - -- id: vs_mcp_manage_data_009 - inputs: - prompt: "How do I upsert documents into a Direct Access vector search index using MCP tools?" - outputs: - response: | - ```python - result = manage_vs_data( - index_name="catalog.schema.my_index", - operation="upsert", - inputs_json=[ - {"id": "doc1", "content": "Sample document", "embedding": [0.1, 0.2, ...]}, - {"id": "doc2", "content": "Another document", "embedding": [0.3, 0.4, ...]} - ] - ) - ``` - - Use `manage_vs_data` with `operation="upsert"` to insert or update vectors. Other supported operations: `"delete"`, `"scan"`, `"sync"`. - execution_success: true - expectations: - expected_facts: - - "manage_vs_data is the correct MCP tool for data operations" - - "operation parameter accepts upsert, delete, scan, sync" - - "inputs_json contains the vector data to upsert" - expected_patterns: - - pattern: "manage_vs_data" - min_count: 1 - description: "Uses manage_vs_data (not upsert_vs_data)" - - pattern: "upsert" - min_count: 1 - description: "Specifies upsert operation" - guidelines: - - "Must use manage_vs_data with operation='upsert', NOT upsert_vs_data" - - "Must mention other available operations (delete, scan, sync)" - metadata: - category: happy_path - difficulty: medium - source: manual + created_at: '2026-02-23T08:00:00.000000' - id: vs_embedding_models_007 inputs: prompt: "What embedding models are available for vector search indexes?" outputs: response: | - Databricks provides built-in embedding models for managed indexes: - - | Model | Dimensions | - |-------|-----------| - | `databricks-gte-large-en` | 1024 | - | `databricks-bge-large-en` | 1024 | - - ```python - # Use with managed embeddings in index creation - embedding_source_columns=[ - { - "name": "content", - "embedding_model_endpoint_name": "databricks-gte-large-en" - } - ] - ``` - - You can also use custom embedding models deployed as serving endpoints. + Lists Databricks-hosted built-in embedding model endpoints suitable for managed Vector Search indexes (e.g. databricks-gte-large-en and databricks-bge-large-en, each producing 1024-dimensional embeddings). Notes that any custom model deployed as a Model Serving endpoint can also be used, and that the chosen model's embedding dimension must match the index configuration. execution_success: true expectations: - expected_facts: - - "databricks-gte-large-en produces 1024-dimensional embeddings" - - "databricks-bge-large-en produces 1024-dimensional embeddings" - - "Custom embedding models can also be used via serving endpoints" - expected_patterns: - - pattern: "databricks-gte-large-en" - min_count: 1 - description: "Lists GTE embedding model" - - pattern: "databricks-bge-large-en" - min_count: 1 - description: "Lists BGE embedding model" - - pattern: "1024" - min_count: 1 - description: "Specifies embedding dimensions" - guidelines: - - "Must list at least two built-in embedding models with dimensions" - - "Must mention that custom models can also be used" + expected_facts: [] + expected_patterns: [] + guidelines: [] metadata: category: happy_path - difficulty: easy source: auto_generated + created_at: '2026-02-23T08:00:00.000000'