
feat(copilot): ground Ask Ontos in concept docs corpus (#280) #472

Open
mvkonchits-db wants to merge 7 commits into main from feature/ask-ontos-uplift-pr1

Summary

This is PR 1 of the Ask Ontos uplift (#280): a grounding & system-prompt foundation for the in-product copilot. PR 2 (Smart Copilot Insights, the original #280 scope) will land on top of this.

  • New docs/concepts/ corpus (14 files, ~3900 lines) — reference docs the LLM grounds in for "what is X?" / "how does Y work?" / "what's the difference between A and B?" questions. Covers roles & RBAC, ODPS/ODCS lifecycles, agreements, ontology + KG, data quality (incl. DQX flow), delivery modes vs methods, MCP, asset model, install + troubleshooting. Anonymized; aligned with the pitch deck + CUJ doc. Every major section has an explicit {#kebab-anchor} for stable citation.
  • New search_ontos_concepts tool (src/backend/src/tools/concepts.py) — grep-based search returning file.md#anchor citation URIs. Layout-agnostic path resolution (env-var override + walk-up handles both local-dev and deployed app layouts; degrades gracefully if corpus is absent).
  • System prompt rewrite + extraction (src/backend/src/controller/system_prompts.py) — tool-first policy for conceptual questions, refusal template, three-tier confidence labels (internal, stripped server-side), hidden citation discipline, vocabulary primer matched to the pitch deck + CUJ. Wires the LLM_SYSTEM_PROMPT env override that was previously dead code. get_system_prompt(settings, *, role, page_name, selected_entity, adoption_mode) accepts personalization slots for Phase 2/3 to fill.
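
The override behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in system_prompts.py: the `Settings` stand-in, the `DEFAULT_SYSTEM_PROMPT` placeholder text, and the assumption that the override is read from a `LLM_SYSTEM_PROMPT` attribute on settings are all hypothetical. Only the function name and keyword-only slots come from the description.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder text; the real default prompt lives in system_prompts.py.
DEFAULT_SYSTEM_PROMPT = "You are Ask Ontos. Ground conceptual answers in the docs."


@dataclass
class Settings:
    """Hypothetical stand-in for the app's Settings object."""
    LLM_SYSTEM_PROMPT: Optional[str] = None


def get_system_prompt(
    settings: Settings,
    *,
    role: Optional[str] = None,
    page_name: Optional[str] = None,
    selected_entity: Optional[str] = None,
    adoption_mode: Optional[str] = None,
) -> str:
    """Return the override if one is set, otherwise the default prompt."""
    override = (settings.LLM_SYSTEM_PROMPT or "").strip()
    if override:
        return override
    # The personalization slots are accepted but unused, as in v1.
    return DEFAULT_SYSTEM_PROMPT
```

Treating a blank override as "not set" is a design choice of this sketch; the PR does not say how an empty value is handled.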

What changed in llm_search_manager.py

  • Removed hardcoded ~140-line SYSTEM_PROMPT constant; now calls get_system_prompt(settings=...).
  • Strips internal grounding markers from user-visible response: <!-- ref: file.md#anchor --> citation comments AND [Confirmed]/[Documented]/[Inferred] confidence labels. The model still emits both (so it stratifies its grounding), but they're filtered before reaching the chat UI. Captured in debug_info["internal_citations"] and debug_info["confidence_labels"] for audit.
  • Added concepts category in QueryClassifier (in DEFAULT_CATEGORIES and ALWAYS_INCLUDED_CATEGORIES) so the new tool is offered on every conceptual question.

Test plan

  • 13 unit tests for SearchOntosConceptsTool (empty query, known concepts, multi-doc concepts, no-match, anchor extraction)
  • 6 integration tests for /api/llm-search/chat with the new tool + system prompt
  • Full unit suite passes (1011/1011); no regressions in pre-existing tests
  • Deployed end-to-end on a test workspace; conceptual / refusal / out-of-scope / live-data / install-troubleshooting queries verified — tool fires, citations and confidence labels stripped, refusal template fires when no doc or tool supports the answer, no question echo in responses
  • Reviewer to verify their local docs/concepts/ is picked up at runtime (set ONTOS_CONCEPTS_DIR if non-standard layout)
  • Reviewer to confirm LLM_SYSTEM_PROMPT override still works (was previously defined-but-not-consumed)

Deployment note

docs/concepts/ lives at the repo root, not under src/. A standard databricks sync from src/ will not package it. Either (a) upload the corpus separately (e.g. databricks workspace import-dir docs/concepts <target>/docs/concepts), or (b) set ONTOS_CONCEPTS_DIR to point at the deployed corpus location. Documented inline in installation-and-troubleshooting.md#corpus-not-found. A follow-up PR could fold the corpus into the bundle artifacts so this manual step goes away.

Out of scope (follow-on PRs)

  • Phase 2: app-state awareness (get_app_state tool, blank/active mode preamble)
  • Phase 3: role/page injection (extend ChatMessageCreate, wire copilot-store to backend, inject into system prompt)
  • Phase 4: Smart Copilot Insights (the original #280 scope; depends on this PR's grounding)
  • Surfacing citations to end users, pending a docs-as-code review of docs/concepts/

This pull request and its description were written by Isaac.

Add 13 markdown files under docs/concepts/ that serve as the grounding
corpus for the Ask Ontos copilot. Covers:

- roles & RBAC + permission model
- data product / data contract lifecycles
- agreement workflow (workflow vs execution vs agreement)
- ontology and knowledge-graph model, semantic linking (three-tier)
- data quality + DQX integration end-to-end
- delivery modes vs delivery methods (disambiguated)
- MCP and Ask Ontos surfaces
- asset model
- personas quick-reference
- end-to-end flows (bottom-up UC -> catalog, top-down ontology -> assets)

Every major section carries an explicit {#kebab-anchor} so the copilot
can cite via search_ontos_concepts in a follow-up commit. Citations are
hidden from end-users in v1; the corpus is LLM grounding, not a
user-facing docs site.

Vocabulary aligned with the pitch deck + CUJ doc (ODPS v1.0.0, ODCS
v3.1.0). Forward-compatibility softening applied for several in-flight
PRs (versioning, Ontos admin decoupling, approver-role filter, etc.)
without naming them.

Co-authored-by: Isaac
Make the in-product copilot anchor its conceptual answers, with
citations, to the new docs/concepts/ corpus.

- Add SearchOntosConceptsTool that walks docs/concepts/, parses sections
  by heading and {#anchor}, returns top-K excerpts ranked by title >
  anchor > body keyword frequency. Each match returns file, anchor,
  title, excerpt, source_uri (file.md#anchor).
- Add 'concepts' query-classifier category in DEFAULT_CATEGORIES and
  ALWAYS_INCLUDED_CATEGORIES so the tool is offered on every conceptual
  question.
- Extract hardcoded SYSTEM_PROMPT into a new
  controller/system_prompts.py module exposing get_system_prompt() with
  personalization slots (role, page_name, selected_entity,
  adoption_mode) for Phase 2/3 to fill. v1 ignores the slots.
- Honor LLM_SYSTEM_PROMPT env override (previously defined in Settings
  but never consumed).
- New default system prompt: vocabulary primer aligned with the pitch
  deck + CUJ doc, tool-first policy for conceptual questions, three-tier
  confidence labels ([Confirmed]/[Documented]/[Inferred]), hidden
  citation discipline, strict refusal template, out-of-scope deflection.
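
A rough sketch of the section parsing and ranking described for SearchOntosConceptsTool, for illustration only. It searches an in-memory dict instead of walking docs/concepts/, and the weights (10/5/1), the 200-character excerpt, and the function names are assumptions; the commit only states the ordering title > anchor > body keyword frequency and the returned fields.

```python
import re
from typing import Dict, List

# Matches "## Title {#kebab-anchor}"; the anchor part is optional.
_HEADING_RE = re.compile(
    r"^#{1,6}\s+(?P<title>.*?)\s*(?:\{#(?P<anchor>[a-z0-9-]+)\})?\s*$"
)


def parse_sections(text: str) -> List[dict]:
    """Split markdown into sections keyed by heading and {#anchor}."""
    sections: List[dict] = []
    current = None
    for line in text.splitlines():
        match = _HEADING_RE.match(line)
        if match:
            current = {
                "title": match.group("title"),
                "anchor": match.group("anchor") or "",
                "body": [],
            }
            sections.append(current)
        elif current is not None:
            current["body"].append(line)
    return sections


def search_concepts(corpus: Dict[str, str], query: str, top_k: int = 3) -> List[dict]:
    """Rank sections by weighted keyword hits: title > anchor > body."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return []
    scored = []
    for filename, text in corpus.items():
        for sec in parse_sections(text):
            body = "\n".join(sec["body"]).strip()
            title_l, anchor_l, body_l = sec["title"].lower(), sec["anchor"], body.lower()
            # Hypothetical weights; only the ordering comes from the commit.
            score = sum(
                10 * title_l.count(t) + 5 * anchor_l.count(t) + body_l.count(t)
                for t in terms
            )
            if score > 0:
                uri = f"{filename}#{sec['anchor']}" if sec["anchor"] else filename
                scored.append((score, {
                    "file": filename,
                    "anchor": sec["anchor"],
                    "title": sec["title"],
                    "excerpt": body[:200],
                    "source_uri": uri,
                }))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [match for _, match in scored[:top_k]]
```

This sketch treats any line starting with `#` as a heading, so a comment line inside a fenced code block in a corpus file would be misread as a section boundary.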

Tests:
- 13 unit tests for SearchOntosConceptsTool (empty query, known concept,
  multi-doc concept, no-match, anchor extraction)
- 6 integration tests for /api/llm-search/chat with the new tool +
  system prompt
- Full unit suite passes (1011/1011); no regressions

Co-authored-by: Isaac
The previous resolution walked exactly 5 parents above concepts.py to
find docs/concepts/. That assumed the local-dev layout (with src/ as a
wrapper) and silently broke in deployed Databricks Apps where src/ is
stripped (so the corpus lives 4 parents up, not 5).

Replace with:
- ONTOS_CONCEPTS_DIR env var override (explicit, takes precedence)
- Walk-up search across parents 2..6 looking for docs/concepts/
- Graceful None on miss (tool still returns success=True, empty matches)

Verified for both layouts:
- Local: <ontos>/src/backend/src/tools/concepts.py -> finds at parents[4]
- Deployed: <approot>/backend/src/tools/concepts.py -> finds at parents[3]
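
The resolution order above could look roughly like this. It is a sketch under assumptions, not the code in concepts.py: the function name is hypothetical, "parents 2..6" is read as `Path.parents` indices 2 through 6 inclusive, and an override pointing at a missing directory is assumed to yield None. `demo_layouts` rebuilds both layouts in a temporary directory to check them.

```python
import os
import tempfile
from pathlib import Path
from typing import Mapping, Optional


def resolve_concepts_dir(
    start: Path, env: Mapping[str, str] = os.environ
) -> Optional[Path]:
    """Return the docs/concepts directory for `start`, or None if absent."""
    override = env.get("ONTOS_CONCEPTS_DIR")
    if override:
        # Explicit override takes precedence. Assumption: a bad override
        # yields None instead of falling back to the walk-up search.
        candidate = Path(override)
        return candidate if candidate.is_dir() else None

    parents = start.resolve().parents
    # Assumption: "parents 2..6" means indices 2 through 6 inclusive.
    for depth in range(2, 7):
        if depth >= len(parents):
            break
        candidate = parents[depth] / "docs" / "concepts"
        if candidate.is_dir():
            return candidate
    return None


def demo_layouts() -> dict:
    """Build both layouts in a temp dir and report whether each resolves."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp).resolve()
        for app in ("ontos", "approot"):
            (root / app / "docs" / "concepts").mkdir(parents=True)
        local = root / "ontos" / "src" / "backend" / "src" / "tools" / "concepts.py"
        deployed = root / "approot" / "backend" / "src" / "tools" / "concepts.py"
        missing = root / "empty" / "a" / "b" / "c" / "d" / "e" / "f" / "concepts.py"
        return {
            "local": resolve_concepts_dir(local, env={})
            == root / "ontos" / "docs" / "concepts",
            "deployed": resolve_concepts_dir(deployed, env={})
            == root / "approot" / "docs" / "concepts",
            "missing": resolve_concepts_dir(missing, env={}) is None,
        }
```

`Path.parents[0]` is the file's own directory, so the local layout resolves at index 4 and the deployed layout at index 3, matching the two cases listed above.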

Co-authored-by: Isaac
…onse

The system prompt asks the model to anchor conceptual answers with
hidden HTML-comment citations (e.g. `<!-- ref: roles-and-rbac.md#... -->`)
so reviewers can audit grounding without exposing them to end users.
Most markdown renderers drop HTML comments on render, but the chat UI
surfaces them as visible text — which is what live E2E confirmed.

Add a server-side strip in LlmSearchManager:
- `_CITATION_COMMENT_RE` matches `<!-- ref: ... -->` (non-greedy)
- `_strip_internal_citations` returns (cleaned_text, [refs]) so debug_info
  retains the citations for audit while the user-facing response is clean
- Applied at the inner-loop final return; collapses any 3+ newlines
  created by the strip back to double

Citations remain accessible via `debug_info["internal_citations"]` when
the client sets `debug=True`.
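
For illustration, the strip described above might be sketched like this. The exact pattern is an assumption (the commit only says it matches `<!-- ref: ... -->` non-greedily), the function is shown standalone instead of as a method on LlmSearchManager, and `example.md#some-anchor` in the usage below is a made-up citation.

```python
import re
from typing import List, Tuple

# Non-greedy match so adjacent comments are captured separately.
_CITATION_COMMENT_RE = re.compile(r"<!--\s*ref:\s*(.*?)\s*-->", re.DOTALL)
_EXTRA_NEWLINES_RE = re.compile(r"\n{3,}")


def strip_internal_citations(text: str) -> Tuple[str, List[str]]:
    """Return (cleaned_text, refs); refs are kept for debug_info."""
    refs = _CITATION_COMMENT_RE.findall(text)
    cleaned = _CITATION_COMMENT_RE.sub("", text)
    # Removing a comment on its own line leaves 3+ newlines; collapse to 2.
    cleaned = _EXTRA_NEWLINES_RE.sub("\n\n", cleaned)
    return cleaned.strip(), refs


raw = "A Team groups users.\n\n<!-- ref: example.md#some-anchor -->\n\nTeams inherit roles."
cleaned_text, internal_citations = strip_internal_citations(raw)
```

Here `cleaned_text` is what the chat UI would receive, while `internal_citations` is what a caller could store under `debug_info["internal_citations"]`.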

Co-authored-by: Isaac
Add a 14th file covering install (Marketplace vs Git), update procedures,
maintenance, and common UI errors. 37 anchors so any specific error can
be cited. Topics:

- Distribution channels: Marketplace vs GitHub repo, when to choose which
- First install: prerequisites, first-admin bootstrap, demo presets
- Updates: Marketplace path, Git path, migration discipline (append-only,
  ≤32-char revision IDs), DB state vs code state
- Maintenance: alembic at startup, role re-seeding (first-start-only),
  workspace sync direction (from src/), OAuth scope-change cookie gotcha,
  customer fork hygiene
- UI errors users actually see:
  * Identity — Request role prompt, unexpected 403s, UC scope missing
  * Workflows — Cannot approve, grant_permissions failed (MANAGE required)
  * Database — Alembic version too long, Lakebase autoscale stuck, stale
    data after git revert
  * Deploy — Process did not start in 10 min, corpus not found

6 customer-voice "Common questions". Cross-references to roles-and-rbac,
agreement-workflow, delivery-and-propagation, mcp-and-ask-ontos. No
customer names, no internal ticket IDs.

README.md updated to 14 files; verification footer bumped to 2026-05-29.

Co-authored-by: Isaac
…ponse

The v1 system prompt left the [Confirmed]/[Documented]/[Inferred] labels
visible to users, but they expose grounding mechanics that don't
belong in the surfaced answer. Treat them the same way as citation
comments — emit them so the model still stratifies confidence and so
reviewers can audit grounding, but strip server-side before returning.

- Add `_CONFIDENCE_LABEL_RE` and extend `_strip_internal_citations` to a
  3-tuple return (cleaned_text, citations, confidence_labels)
- Surface both into debug_info (`internal_citations`,
  `confidence_labels`) so audit consumers can still see them via
  `debug=true`
- Update system prompt to declare the labels internal/stripped (so the
  model knows the act of stratifying matters even though they're hidden)
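
A sketch of the extended 3-tuple return, under the same caveats as any illustration: both patterns, the whitespace tidying, and the standalone function name are assumptions, and `example.md#some-anchor` is a made-up citation. Only the regex constant names, the three label strings, and the return shape come from the commit.

```python
import re
from typing import List, Tuple

_CITATION_COMMENT_RE = re.compile(r"<!--\s*ref:\s*(.*?)\s*-->", re.DOTALL)
# Also consumes one trailing space or tab so the sentence starts cleanly.
_CONFIDENCE_LABEL_RE = re.compile(r"\[(Confirmed|Documented|Inferred)\][ \t]?")


def strip_grounding_markers(text: str) -> Tuple[str, List[str], List[str]]:
    """Return (cleaned_text, citations, confidence_labels)."""
    citations = _CITATION_COMMENT_RE.findall(text)
    labels = _CONFIDENCE_LABEL_RE.findall(text)
    cleaned = _CITATION_COMMENT_RE.sub("", text)
    cleaned = _CONFIDENCE_LABEL_RE.sub("", cleaned)
    # Tidy whitespace left behind by the removed markers.
    cleaned = re.sub(r"[ \t]+\n", "\n", cleaned)
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)
    return cleaned.strip(), citations, labels


raw = (
    "[Documented] A Team groups users. <!-- ref: example.md#some-anchor -->\n"
    "[Inferred] Teams may nest."
)
cleaned_text, citations, confidence_labels = strip_grounding_markers(raw)
debug_info = {
    "internal_citations": citations,
    "confidence_labels": confidence_labels,
}
```

Collecting the labels before substitution keeps their order of appearance, which lets an audit consumer line each label up with the claim it preceded.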

Co-authored-by: Isaac
The model was opening conceptual answers with a bolded restatement of
the user's question (e.g. **What is a Team?** followed by the answer).
That's redundant in the chat thread where the user already sees their
own question above, and reads as noise.

Update the Response format section to explicitly forbid:
- restating, echoing, or rephrasing the question
- bolded-question headers as openers
- "Great question!" / "Let me explain..." fillers

Begin with the answer directly.

Co-authored-by: Isaac
@mvkonchits-db mvkonchits-db requested a review from a team as a code owner May 29, 2026 14:40