fix(chat): cross-model compatibility + model-sweep regression suite by DoyleDev · Pull Request #17 · databricks-solutions/mason

DoyleDev · 2026-06-09T20:55:09Z

Summary

Fix three live model-family bugs surfaced when switching between Anthropic / Gemini / Qwen / Llama / gpt-oss:
- Qwen, Llama, gpt-oss 400 with `unknown field "stream_options"` → gate `stream_options.include_usage` behind a Claude/GPT/Codex/o-series allowlist (and explicitly exclude `gpt-oss` since the Databricks open-weights proxy rejects it).
- Gemini 400 with `Gemini models only support one system prompt` → collapse multiple `role:"system"` messages into one (\n\n separator) before send; cache_control still latches onto "the last" system message since it's now the only one.
- Llama / Qwen 3 next 400 with `max_tokens cannot exceed N` → per-family cap helper (Claude 16384, Qwen3-next 10000, default 8192). The 16K cap was for Opus extended-thinking; other families don't allow it.
Add `npm run test:models`, a standalone cross-model regression sweep. Mints OAuth via the local Databricks CLI, discovers models via `/api/2.0/serving-endpoints`, and runs three scenarios per model (`hello-no-tools`, `hello-with-tools`, `multi-system`) through the same Gateway endpoints Mason uses. `--filter <substr>` scopes to a model subset, `--profile <name>` picks a non-DEFAULT profile.
Extract shared chat-handler helpers to `src/chat-shared.ts` so `main.ts` and `scripts/test-models.js` use the exact same logic (no drift). Helpers exported: `flattenContent`, `applyAnthropicCaching`, `supportsStreamOptions`, `maxTokensFor`, `consolidateSystemMessages`.
Test script mirrors Mason's chatLoop rule that promotes tool-bearing GPT-5.5 turns to Responses, so it doesn't report a false failure on a scenario Mason's renderer would never send via chat completions.
1.4.4 release commit included so the in-app updater picks it up on the next tag.
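
The gating and cap rules described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code in `src/chat-shared.ts`: the helper names `supportsStreamOptions` and `maxTokensFor` come from this PR, but the patterns used to detect each model family, the `ChatBody` shape, and `buildStreamingBody` are assumptions made for the example.

```typescript
// Hypothetical sketch: the family-matching patterns are assumptions,
// not the actual rules in src/chat-shared.ts.

/** True when the model is assumed to accept `stream_options.include_usage`. */
function supportsStreamOptions(model: string): boolean {
  const m = model.toLowerCase();
  // Check gpt-oss first: the name contains "gpt" but the proxy rejects the field.
  if (/gpt-oss/.test(m)) return false;
  // Allowlist: Claude, GPT, Codex, and o-series names such as "o3-mini".
  return /claude|gpt|codex/.test(m) || /(^|[^a-z0-9])o\d/.test(m);
}

/** Per-family max_tokens cap, using the values quoted in the PR description. */
function maxTokensFor(model: string): number {
  const m = model.toLowerCase();
  if (/claude/.test(m)) return 16384;
  if (/qwen-?3-?next/.test(m)) return 10000;
  return 8192;
}

// Assumed request shape, reduced to the fields this sketch touches.
interface ChatBody {
  model: string;
  stream: boolean;
  max_tokens: number;
  stream_options?: { include_usage: boolean };
}

/** Builds a streaming body, adding stream_options only for allowlisted models. */
function buildStreamingBody(model: string): ChatBody {
  const body: ChatBody = { model, stream: true, max_tokens: maxTokensFor(model) };
  if (supportsStreamOptions(model)) {
    body.stream_options = { include_usage: true };
  }
  return body;
}
```

Using an allowlist rather than a denylist means an unrecognized model family gets the conservative request (no `stream_options`, 8192 tokens) instead of a 400.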

Final sweep result against the user's workspace: 81 passed, 0 failed (Claude, GPT, GPT-OSS, Gemini, Llama 3/4, Qwen 3 next + 3.5, Gemma).

Test plan

`npm run test:models` — full sweep green (already green on author's workspace).
`npm run test:models -- --filter gemini` — targeted run, multi-system scenario passes (was the original Gemini break).
`npm run test:models -- --filter qwen` — `hello-no-tools` passes (was the original Qwen `stream_options` break).
`npm start` — boot Electron, send "hello" to: Claude Opus 4.7, GPT-5.5 (no tools), Gemini Flash 2.5, Llama 4 Maverick, Qwen 3.5 122B. All succeed.
With one MCP server connected (so tools are attached), send "hello" to Claude Opus 4.7 (verifies cache_control still applies after consolidation) and to GPT-5.5 (verifies that the tool-bearing turn auto-promotes to Responses).
Verify `[CHAT] Usage (streamed)` log line still appears on Claude turns and is absent on Qwen/Llama/gpt-oss turns.
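
For orientation, the way the sweep commands above select what to run might look like the sketch below. It is not taken from `scripts/test-models.js`: the scenario names and the skip rule for tool turns that Mason promotes to Responses come from this PR, while `planSweep`, `PlannedRun`, the case-insensitive substring match, and the `promotesToolsToResponses` predicate are assumptions for illustration.

```typescript
// Hypothetical sketch of sweep planning; not the actual test-models.js logic.
const SCENARIOS = ["hello-no-tools", "hello-with-tools", "multi-system"] as const;
type Scenario = (typeof SCENARIOS)[number];

interface PlannedRun {
  model: string;
  scenario: Scenario;
}

/**
 * Expands models x scenarios, applying an optional --filter substring and
 * skipping tool scenarios for models whose tool turns go to Responses.
 */
function planSweep(
  models: string[],
  filter: string | undefined,
  promotesToolsToResponses: (model: string) => boolean,
): PlannedRun[] {
  const needle = filter ? filter.toLowerCase() : "";
  // An empty needle matches every model (assumed filter semantics).
  const selected = models.filter((m) => m.toLowerCase().includes(needle));

  const runs: PlannedRun[] = [];
  for (const model of selected) {
    for (const scenario of SCENARIOS) {
      // Mason never sends these turns via chat completions, so skip them.
      if (scenario === "hello-with-tools" && promotesToolsToResponses(model)) continue;
      runs.push({ model, scenario });
    }
  }
  return runs;
}
```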

This pull request and its description were written by Isaac.

Three model-family bugs were live before this:
• Qwen/Llama/gpt-oss 400 with "unknown field stream_options". Mason added stream_options.include_usage on every streaming request to surface Anthropic cache stats, but those providers reject the field.
• Gemini 400 with "Gemini models only support one system prompt". Mason builds up to three system messages (skills manifest + user prompt + tool-aware nudge); Gemini permits exactly one. Other models tolerated multiple so we never noticed.
• Llama / Qwen 3 next 400 with "max_tokens cannot exceed N". The 16384 cap was for Opus extended thinking; Llama caps at 8192 and Qwen 3 next at 10000.

Fixes:
• src/chat-shared.ts (new): extracted flattenContent, applyAnthropicCaching, plus three new helpers: supportsStreamOptions (allowlist Claude / GPT / Codex / o-series; explicitly exclude gpt-oss), maxTokensFor (16384 for Claude, 10000 for Qwen 3 next, 8192 default), consolidateSystemMessages (collapse multiple role:"system" into one with \n\n separator; universally compatible and preserves cache_control behavior).
• src/main.ts: import from chat-shared, gate body.stream_options on supportsStreamOptions, run consolidateSystemMessages before applyAnthropicCaching, and use maxTokensFor for both chat-completions branches.

Regression coverage:
• scripts/test-models.js (new): standalone Node sweep that mints OAuth via the local databricks CLI, discovers every chat model via /api/2.0/serving-endpoints, and runs three scenarios (hello-no-tools, hello-with-tools, multi-system) against each. Mirrors Mason's chat-handler logic by reusing the same helpers from build/ts/chat-shared.js. Mirrors the chatLoop "promote to responses when tools + responses-supported" rule so gpt-5-5 tool tests are skipped (Mason never sends them via chat completions).
• npm run test:models: runs the sweep. --filter <substr> scopes to a model subset; --profile <name> picks a non-DEFAULT databrickscfg profile.

End-to-end result: 81/81 model+scenario combinations green on the user's workspace (Claude, GPT, GPT-OSS, Gemini, Llama 3/4, Qwen 3 next + 3.5, Gemma).

Co-authored-by: Isaac
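
The system-message consolidation described in this commit could look something like the sketch below. The function name `consolidateSystemMessages` and the `\n\n` separator are from the PR; the `ChatMessage` type, the assumption that content has already been flattened to a string, and the choice to keep the merged message at the position of the first system message are assumptions for illustration only.

```typescript
// Hypothetical sketch; the real helper in src/chat-shared.ts may differ.
interface ChatMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string; // assumed to be already flattened to plain text
}

/** Collapses all role:"system" messages into one, joined with a blank line. */
function consolidateSystemMessages(messages: ChatMessage[]): ChatMessage[] {
  const systems = messages.filter((m) => m.role === "system");
  if (systems.length <= 1) return messages; // nothing to merge

  const merged: ChatMessage = {
    role: "system",
    content: systems.map((m) => m.content).join("\n\n"),
  };

  const out: ChatMessage[] = [];
  let mergedEmitted = false;
  for (const m of messages) {
    if (m.role !== "system") {
      out.push(m);
    } else if (!mergedEmitted) {
      // Assumption: the merged message takes the slot of the first system message.
      out.push(merged);
      mergedEmitted = true;
    }
  }
  return out;
}
```

Because only one system message remains afterwards, a caching step that targets "the last" system message would land on the merged one, which matches the ordering the commit describes (consolidate first, then apply caching).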


Qwen 3.5 122B, Gemini 2.5, and gpt-oss stream delta.content as an array of content parts; appending it to a string yielded literal "[object Object]" in the chat window. Coerce through flattenContent before accumulating/emitting. Sweep now parses SSE deltas the same way and fails on "[object Object]" in assembled output. Co-authored-by: Isaac
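
The "[object Object]" symptom and its fix can be illustrated with a small sketch. The name `flattenContent` is from the PR, but its real signature is not shown here; the `ContentPart` shape (parts carrying a `text` field), the handling of `null`, and the `assemble` helper are assumptions.

```typescript
// Hypothetical sketch; the real flattenContent in src/chat-shared.ts may differ.
type ContentPart = string | { type?: string; text?: string };
type DeltaContent = string | ContentPart[] | null | undefined;

/** Coerces a streamed delta.content value to plain text. */
function flattenContent(content: DeltaContent): string {
  if (typeof content === "string") return content;
  if (!Array.isArray(content)) return ""; // null or undefined delta
  return content
    .map((part) => (typeof part === "string" ? part : part.text ?? ""))
    .join("");
}

/** Accumulates streamed deltas the safe way: flatten before appending. */
function assemble(deltas: DeltaContent[]): string {
  let text = "";
  for (const delta of deltas) {
    // Without flattenContent, `text += delta` on an array of objects
    // would append the literal string "[object Object]".
    text += flattenContent(delta);
  }
  return text;
}
```

A sweep check in the same spirit would assemble the output for each scenario and fail if the result contains "[object Object]".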

…r.ts

resolveModelRouting (gateway/format resolution incl. tools->Responses promotion), executeToolCore (headless load_skill/builtin/MCP dispatch), and capToolResult move to src/agent-runner.ts so the upcoming workflow engine can reuse them. getAllToolDefs gains an optional allowlist param (narrowing only) for per-cell tool selection; renderQuestionCard gains an optional container. Chat behavior unchanged.

Co-authored-by: Isaac

Visual node-based workflow designer (sidebar button above Profile, Cmd+D). Cells select a model + per-cell tool subset + prompt; flow edges pipe outputs downstream with labeled multi-input joins; dashed feedback edges create bounded revision loops driven by a forced route_output verdict tool on gate cells. Sequential execution through the existing chat IPC; budgets (40 inner / 5 feedback / 25 global) guard against runaway loops. Workflows persist to ~/.mason/workflows. Includes the Spec → Implement → Test → Review template. Co-authored-by: Isaac

…e pressure

Loops were already bounded (5 revisions per cell, 25 cell runs global, gatekeeper forces a terminal route on exhaustion) but invisibly so. Now: feedback-target cells show an editable revision cap (1-20) on the card; status pills show loop progress (running 3/5); gates see 'revision N of M' annotations on revised inputs plus per-route budget state and an explicit don't-chase-perfection instruction. Verified against a critic prompted to never be satisfied: forced to route end after the cap, 4 cell runs.

Co-authored-by: Isaac
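
The bounded-loop behavior described in this commit can be sketched as a small budget check. Only the numbers (default 5 revisions per cell, 25 cell runs global, editable cap of 1-20) and the forced terminal route `end` come from the commit message; every type and function name below, and the `revise` route used in the usage note, are hypothetical and do not reflect the designer's actual code.

```typescript
// Hypothetical sketch of the loop budgets; names and shapes are assumptions.
interface LoopBudget {
  revisionCap: number; // per feedback-target cell (default 5, editable 1-20)
  globalCap: number;   // total cell runs per workflow (25)
}

interface LoopState {
  revisionsByCell: Record<string, number>;
  totalRuns: number;
}

/** Clamps a user-entered revision cap to the 1-20 range shown on the card. */
function clampRevisionCap(n: number): number {
  return Math.min(20, Math.max(1, Math.round(n)));
}

/**
 * Resolves a gate's requested route against the budgets. A feedback route is
 * honored only while its target cell still has revisions left; otherwise the
 * terminal route "end" is forced.
 */
function resolveRoute(
  requested: { route: string; feedbackTarget?: string },
  state: LoopState,
  budget: LoopBudget,
): string {
  if (state.totalRuns >= budget.globalCap) return "end";
  const target = requested.feedbackTarget;
  if (target === undefined) return requested.route; // forward edge, no loop
  const used = state.revisionsByCell[target] ?? 0;
  if (used >= budget.revisionCap) return "end";
  state.revisionsByCell[target] = used + 1;
  return requested.route;
}

/** Status-pill text in the "running 3/5" style mentioned above. */
function loopProgress(used: number, cap: number): string {
  return `running ${used}/${cap}`;
}
```

With a cap of 2, a gate that always asks to `revise` the same cell would be honored twice and then forced to `end`, which is the shape of the never-satisfied critic check described in the commit.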

Co-authored-by: Isaac

feat(designer): Agentic Workflow Designer

DoyleDev added 5 commits June 9, 2026 15:53

chore(release): 1.4.4

b7bbf4e

Co-authored-by: Isaac

DoyleDev mentioned this pull request Jun 11, 2026

feat(designer): Agentic Workflow Designer #18

Merged

5 tasks

DoyleDev and others added 3 commits June 11, 2026 12:40

docs: add Workflow Designer section + screenshot to README

c68449d

Co-authored-by: Isaac

Merge pull request #18 from databricks-solutions/feat/workflow-designer

f071835

feat(designer): Agentic Workflow Designer

DoyleDev closed this Jun 11, 2026

DoyleDev mentioned this pull request Jun 11, 2026

Release v1.5.0 — Agentic Workflow Designer + cross-model compatibility #19

Merged


DoyleDev wants to merge 8 commits into main from fix/cross-model-compat
