fix(chat): cross-model compatibility + model-sweep regression suite #17
Closed
DoyleDev wants to merge 8 commits into
Three model-family bugs were live before this:
• Qwen/Llama/gpt-oss 400 with "unknown field stream_options".
Mason added stream_options.include_usage on every streaming
request to surface Anthropic cache stats, but those providers
reject the field.
• Gemini 400 with "Gemini models only support one system prompt".
Mason builds up to three system messages (skills manifest +
user prompt + tool-aware nudge); Gemini permits exactly one.
Other models tolerated multiple so we never noticed.
• Llama / Qwen 3 next 400 with "max_tokens cannot exceed N".
The 16384 cap was for Opus extended thinking; Llama caps at
8192 and Qwen 3 next at 10000.
Fixes:
• src/chat-shared.ts (new): extracted flattenContent,
applyAnthropicCaching, plus three new helpers — supportsStreamOptions
(allowlist Claude / GPT / Codex / o-series; explicitly exclude
gpt-oss), maxTokensFor (16384 for Claude, 10000 for Qwen 3 next,
8192 default), consolidateSystemMessages (collapse multiple
role:"system" into one with \n\n separator; universally
compatible and preserves cache_control behavior).
• src/main.ts: import from chat-shared, gate body.stream_options
on supportsStreamOptions, run consolidateSystemMessages before
applyAnthropicCaching, and use maxTokensFor for both chat-
completions branches.
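A minimal sketch of how the three new helpers could fit together when building a streaming request body. The model-name matching (substring and regex checks), the string-only ChatMessage shape, the merged-system-message-first ordering, and the buildStreamingBody wrapper are assumptions for illustration; the actual code in src/chat-shared.ts and src/main.ts may differ, and applyAnthropicCaching is left out here.

```typescript
// Hypothetical message shape; the real one may carry content parts and cache_control.
type ChatMessage = { role: "system" | "user" | "assistant" | "tool"; content: string };

// Allowlist Claude / GPT / Codex / o-series; gpt-oss is excluded before the
// generic "gpt" match can accept it. The matching rules are assumptions.
function supportsStreamOptions(model: string): boolean {
  const m = model.toLowerCase();
  if (m.includes("gpt-oss")) return false;
  return /claude|gpt|codex|(^|[^a-z0-9])o\d/.test(m);
}

// 16384 for Claude, 10000 for Qwen 3 next, 8192 for everything else.
function maxTokensFor(model: string): number {
  const m = model.toLowerCase();
  if (m.includes("claude")) return 16384;
  if (/qwen-?3-?next/.test(m)) return 10000;
  return 8192;
}

// Collapse every role:"system" message into one, joined with a blank line.
// Placing the merged message first is an assumption of this sketch.
function consolidateSystemMessages(messages: ChatMessage[]): ChatMessage[] {
  const systemTexts = messages.filter((m) => m.role === "system").map((m) => m.content);
  if (systemTexts.length <= 1) return messages;
  const merged: ChatMessage = { role: "system", content: systemTexts.join("\n\n") };
  return [merged, ...messages.filter((m) => m.role !== "system")];
}

// Hypothetical wrapper showing where each helper is applied.
function buildStreamingBody(model: string, messages: ChatMessage[]): Record<string, unknown> {
  const body: Record<string, unknown> = {
    model,
    messages: consolidateSystemMessages(messages),
    max_tokens: maxTokensFor(model),
    stream: true,
  };
  // Only attach stream_options for providers that accept the field.
  if (supportsStreamOptions(model)) body.stream_options = { include_usage: true };
  return body;
}
```

Using an allowlist rather than a denylist means an unrecognized model family defaults to the conservative request shape, so a new provider fails closed instead of returning a 400.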
Regression coverage:
• scripts/test-models.js (new): standalone Node sweep that mints
OAuth via the local databricks CLI, discovers every chat model
via /api/2.0/serving-endpoints, and runs three scenarios
(hello-no-tools, hello-with-tools, multi-system) against each.
Mirrors Mason's chat-handler logic by reusing the same helpers
from build/ts/chat-shared.js. Mirrors the chatLoop "promote to
responses when tools + responses-supported" rule so gpt-5-5
tool tests are skipped (Mason never sends them via chat
completions).
• npm run test:models — runs the sweep. --filter <substr> scopes
to a model subset; --profile <name> picks a non-DEFAULT
databrickscfg profile.
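The sweep's planning step can be sketched roughly as below: apply the --filter substring, cross each model with the three scenarios, and skip tool scenarios for models that would be promoted to the Responses API. planSweep, PlannedRun, and the usesResponsesApi predicate are hypothetical names, and treating only hello-with-tools as the tool scenario is an assumption; scripts/test-models.js itself is plain JavaScript and may be organized differently.

```typescript
type Scenario = "hello-no-tools" | "hello-with-tools" | "multi-system";

const SCENARIOS: Scenario[] = ["hello-no-tools", "hello-with-tools", "multi-system"];

interface PlannedRun {
  model: string;
  scenario: Scenario;
  skip: boolean;
  reason?: string;
}

interface SweepOptions {
  filter?: string; // value of --filter <substr>, if given
  usesResponsesApi: (model: string) => boolean; // hypothetical predicate
}

function planSweep(models: string[], opts: SweepOptions): PlannedRun[] {
  const filter = opts.filter;
  const selected = filter ? models.filter((m) => m.includes(filter)) : models;
  const runs: PlannedRun[] = [];
  for (const model of selected) {
    for (const scenario of SCENARIOS) {
      // Mirror the "promote to responses when tools + responses-supported"
      // rule: such requests never go through chat completions, so skip them.
      const promoted = scenario === "hello-with-tools" && opts.usesResponsesApi(model);
      runs.push(
        promoted
          ? { model, scenario, skip: true, reason: "tools are routed via the Responses API" }
          : { model, scenario, skip: false },
      );
    }
  }
  return runs;
}
```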
End-to-end result: 81/81 model+scenario combinations green on the
user's workspace (Claude, GPT, GPT-OSS, Gemini, Llama 3/4, Qwen 3
next + 3.5, Gemma).
Co-authored-by: Isaac
Qwen 3.5 122B, Gemini 2.5, and gpt-oss stream delta.content as an array of content parts; appending it to a string yielded literal "[object Object]" in the chat window. Coerce through flattenContent before accumulating/emitting. Sweep now parses SSE deltas the same way and fails on "[object Object]" in assembled output.
Co-authored-by: Isaac
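A minimal sketch of the coercion described above, assuming each content part is either a plain string or an object with a text field. The ContentPart type and the assembleDeltas helper are hypothetical; the real flattenContent in src/chat-shared.ts may handle more part types.

```typescript
// Assumed part shape; providers may send additional fields or part types.
type ContentPart = string | { type?: string; text?: string };
type DeltaContent = string | ContentPart[] | null | undefined;

// Coerce delta.content to a string whether it arrives as a string or as parts.
function flattenContent(content: DeltaContent): string {
  if (typeof content === "string") return content;
  if (!Array.isArray(content)) return "";
  return content.map((part) => (typeof part === "string" ? part : part.text ?? "")).join("");
}

// Accumulate streamed deltas; every chunk goes through flattenContent first.
function assembleDeltas(deltas: { content?: DeltaContent }[]): string {
  let assembled = "";
  for (const delta of deltas) assembled += flattenContent(delta.content);
  return assembled;
}

// What the bug looked like: stringifying a one-part array directly.
const naive = String([{ type: "text", text: "hi" }]); // "[object Object]"
```

The sweep's check then reduces to asserting that the assembled output does not contain the substring "[object Object]".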
…r.ts
resolveModelRouting (gateway/format resolution incl. tools->Responses promotion), executeToolCore (headless load_skill/builtin/MCP dispatch), and capToolResult move to src/agent-runner.ts so the upcoming workflow engine can reuse them. getAllToolDefs gains an optional allowlist param (narrowing only) for per-cell tool selection; renderQuestionCard gains an optional container. Chat behavior unchanged.
Co-authored-by: Isaac
Visual node-based workflow designer (sidebar button above Profile, Cmd+D). Cells select a model + per-cell tool subset + prompt; flow edges pipe outputs downstream with labeled multi-input joins; dashed feedback edges create bounded revision loops driven by a forced route_output verdict tool on gate cells. Sequential execution through the existing chat IPC; budgets (40 inner / 5 feedback / 25 global) guard against runaway loops. Workflows persist to ~/.mason/workflows. Includes the Spec → Implement → Test → Review template.
Co-authored-by: Isaac
…e pressure
Loops were already bounded (5 revisions per cell, 25 cell runs global, gatekeeper forces a terminal route on exhaustion) but invisibly so. Now: feedback-target cells show an editable revision cap (1-20) on the card; status pills show loop progress (running 3/5); gates see 'revision N of M' annotations on revised inputs plus per-route budget state and an explicit don't-chase-perfection instruction. Verified against a critic prompted to never be satisfied: forced to route end after the cap, 4 cell runs.
Co-authored-by: Isaac
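The loop bounding described above can be sketched as a gatekeeper that overrides a gate's requested feedback route once a budget is exhausted. Everything here (gateRoute, LoopState, the "revise"/"end" route names, and the two-runs-per-iteration simulation) is hypothetical and models only the per-cell revision cap and the global cell-run cap; the 40-step inner budget and the designer's actual engine are not represented.

```typescript
type Route = "revise" | "end";

interface LoopCaps {
  revisionsPerCell: number; // editable per feedback-target cell (1-20)
  globalCellRuns: number;
}

interface LoopState {
  revisions: Map<string, number>; // revisions used, keyed by target cell id
  cellRuns: number; // total cell executions so far
}

const DEFAULT_CAPS: LoopCaps = { revisionsPerCell: 5, globalCellRuns: 25 };

// Honor the gate's verdict unless a budget is exhausted, in which case
// force the terminal route instead of looping again.
function gateRoute(requested: Route, target: string, state: LoopState, caps: LoopCaps = DEFAULT_CAPS): Route {
  if (requested !== "revise") return requested;
  const used = state.revisions.get(target) ?? 0;
  if (used >= caps.revisionsPerCell || state.cellRuns >= caps.globalCellRuns) return "end";
  state.revisions.set(target, used + 1);
  return "revise";
}

// Simulate a critic that always asks for another revision.
function runNeverSatisfied(cap: number): { revisions: number; final: Route } {
  const state: LoopState = { revisions: new Map(), cellRuns: 0 };
  let route: Route = "revise";
  let revisions = 0;
  while (route === "revise") {
    state.cellRuns += 2; // assumed: one worker run plus one critic run per pass
    route = gateRoute("revise", "draft", state, { revisionsPerCell: cap, globalCellRuns: 25 });
    if (route === "revise") revisions++;
  }
  return { revisions, final: route };
}
```

Because the override lives in the gatekeeper rather than in the critic's prompt, termination does not depend on the model choosing to stop.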
Co-authored-by: Isaac
feat(designer): Agentic Workflow Designer
Summary
Final sweep result against the user's workspace: 81 passed, 0 failed (Claude, GPT, GPT-OSS, Gemini, Llama 3/4, Qwen 3 next + 3.5, Gemma).
Test plan
This pull request and its description were written by Isaac.