fix(chat): cross-model compatibility + model-sweep regression suite #17

Closed
DoyleDev wants to merge 8 commits into main from fix/cross-model-compat

DoyleDev commented Jun 9, 2026

Collaborator

Summary

  • Fix three live model-family bugs surfaced when switching between Anthropic / Gemini / Qwen / Llama / gpt-oss:
    • Qwen, Llama, gpt-oss 400 with `unknown field "stream_options"` → gate `stream_options.include_usage` behind a Claude/GPT/Codex/o-series allowlist (and explicitly exclude `gpt-oss` since the Databricks open-weights proxy rejects it).
    • Gemini 400 with `Gemini models only support one system prompt` → collapse multiple `role:"system"` messages into one (joined with a `\n\n` separator) before send; `cache_control` still latches onto "the last" system message, since it is now the only one.
    • Llama / Qwen 3 next 400 with `max_tokens cannot exceed N` → per-family cap helper (Claude 16384, Qwen3-next 10000, default 8192). The 16K cap was for Opus extended-thinking; other families don't allow it.
  • Add `npm run test:models`, a standalone cross-model regression sweep. It mints OAuth via the local Databricks CLI, discovers models via `/api/2.0/serving-endpoints`, and runs three scenarios per model (`hello-no-tools`, `hello-with-tools`, `multi-system`) through the same Gateway endpoints Mason uses. `--filter <substr>` scopes to a model subset; `--profile <name>` picks a non-DEFAULT profile.
  • Extract shared chat-handler helpers to `src/chat-shared.ts` so `main.ts` and `scripts/test-models.js` use the exact same logic (no drift). Helpers exported: `flattenContent`, `applyAnthropicCaching`, `supportsStreamOptions`, `maxTokensFor`, `consolidateSystemMessages`.
  • Test script mirrors Mason's chatLoop rule that promotes tools-bearing GPT-5.5 turns to Responses, so it doesn't false-positive on a scenario Mason's renderer would never send via chat completions.
  • 1.4.4 release commit included so the in-app updater picks it up on the next tag.
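
The gating described in the first and third bullets can be sketched roughly as follows. This is a minimal illustration, not the code in `src/chat-shared.ts`: the substring and regex matching, the `buildBody` helper, and the model names in the usage note are all assumptions. Only the allowlist families, the `gpt-oss` exclusion, and the three cap values come from the description above.

```typescript
// Hypothetical sketch; the real helpers may match model names differently.

/** True when the model family is assumed to accept stream_options.include_usage. */
function supportsStreamOptions(model: string): boolean {
  const name = model.toLowerCase();
  // gpt-oss must be rejected before the broader "gpt" match below.
  if (name.includes("gpt-oss")) return false;
  return (
    name.includes("claude") ||
    name.includes("gpt") ||
    name.includes("codex") ||
    /(^|[^a-z0-9])o\d/.test(name) // o-series, e.g. "o3" or "...-o4-mini"
  );
}

/** Per-family max_tokens cap: Claude 16384, Qwen 3 next 10000, default 8192. */
function maxTokensFor(model: string): number {
  const name = model.toLowerCase();
  if (name.includes("claude")) return 16384;
  if (/qwen-?3-next/.test(name)) return 10000;
  return 8192;
}

/** Assumed request builder showing where the two helpers would plug in. */
function buildBody(model: string, messages: unknown[]): Record<string, unknown> {
  const body: Record<string, unknown> = {
    model,
    messages,
    stream: true,
    max_tokens: maxTokensFor(model),
  };
  if (supportsStreamOptions(model)) {
    body["stream_options"] = { include_usage: true };
  }
  return body;
}
```

With this sketch, a Llama-style name such as `databricks-llama-4-maverick` (an illustrative name) would get `max_tokens: 8192` and no `stream_options` field, while a Claude name would get 16384 plus `stream_options.include_usage`.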

Final sweep result against the user's workspace: 81 passed, 0 failed (Claude, GPT, GPT-OSS, Gemini, Llama 3/4, Qwen 3 next + 3.5, Gemma).

Test plan

  • `npm run test:models` — full sweep green (already green on author's workspace).
  • `npm run test:models -- --filter gemini` — targeted run, multi-system scenario passes (was the original Gemini break).
  • `npm run test:models -- --filter qwen` — `hello-no-tools` passes (was the original Qwen `stream_options` break).
  • `npm start` — boot Electron, send "hello" to: Claude Opus 4.7, GPT-5.5 (no tools), Gemini Flash 2.5, Llama 4 Maverick, Qwen 3.5 122B. All succeed.
  • With one MCP server connected (so tools are attached), send "hello" to Claude Opus 4.7 (verifies `cache_control` still applies after consolidation) and to GPT-5.5 (verifies the tools-bearing turn auto-promotes to Responses).
  • Verify `[CHAT] Usage (streamed)` log line still appears on Claude turns and is absent on Qwen/Llama/gpt-oss turns.

This pull request and its description were written by Isaac.

DoyleDev added 5 commits June 9, 2026 15:53
Three model-family bugs were live before this:

  • Qwen/Llama/gpt-oss 400 with "unknown field stream_options".
    Mason added stream_options.include_usage on every streaming
    request to surface Anthropic cache stats, but those providers
    reject the field.

  • Gemini 400 with "Gemini models only support one system prompt".
    Mason builds up to three system messages (skills manifest +
    user prompt + tool-aware nudge); Gemini permits exactly one.
    Other models tolerated multiple so we never noticed.

  • Llama / Qwen 3 next 400 with "max_tokens cannot exceed N".
    The 16384 cap was for Opus extended thinking; Llama caps at
    8192 and Qwen 3 next at 10000.

Fixes:

  • src/chat-shared.ts (new): extracted flattenContent,
    applyAnthropicCaching, plus three new helpers — supportsStreamOptions
    (allowlist Claude / GPT / Codex / o-series; explicitly exclude
    gpt-oss), maxTokensFor (16384 for Claude, 10000 for Qwen 3 next,
    8192 default), consolidateSystemMessages (collapse multiple
    role:"system" into one with \n\n separator; universally
    compatible and preserves cache_control behavior).
  • src/main.ts: import from chat-shared, gate body.stream_options
    on supportsStreamOptions, run consolidateSystemMessages before
    applyAnthropicCaching, and use maxTokensFor for both chat-
    completions branches.
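
A rough sketch of the consolidation step, for illustration only. It assumes string-only message content and places the merged message where the first system message was; the actual `consolidateSystemMessages` in `src/chat-shared.ts` may handle array content and ordering differently. The `ChatMessage` type is a hypothetical stand-in.

```typescript
// Hypothetical message shape; real messages may carry array content or extra fields.
type ChatMessage = { role: string; content: string };

/** Collapse every role:"system" message into one, joined by a blank line. */
function consolidateSystemMessages(messages: ChatMessage[]): ChatMessage[] {
  const systemParts = messages
    .filter((m) => m.role === "system")
    .map((m) => m.content);
  if (systemParts.length <= 1) return messages; // nothing to merge

  const merged: ChatMessage = { role: "system", content: systemParts.join("\n\n") };
  const out: ChatMessage[] = [];
  let placed = false;
  for (const m of messages) {
    if (m.role !== "system") {
      out.push(m);
    } else if (!placed) {
      out.push(merged); // assumption: keep the slot of the first system message
      placed = true;
    }
  }
  return out;
}
```

Running this before the caching step means "the last system message" and "the only system message" are the same message, which is why the `cache_control` behavior is preserved.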

Regression coverage:

  • scripts/test-models.js (new): standalone Node sweep that mints
    OAuth via the local databricks CLI, discovers every chat model
    via /api/2.0/serving-endpoints, and runs three scenarios
    (hello-no-tools, hello-with-tools, multi-system) against each.
    Mirrors Mason's chat-handler logic by reusing the same helpers
    from build/ts/chat-shared.js. Mirrors the chatLoop "promote to
    responses when tools + responses-supported" rule so gpt-5-5
    tool tests are skipped (Mason never sends them via chat
    completions).
  • npm run test:models — runs the sweep. --filter <substr> scopes
    to a model subset; --profile <name> picks a non-DEFAULT
    databrickscfg profile.
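
The way the sweep expands models into test cases could look something like the sketch below. It is written in TypeScript for illustration (the real script is `scripts/test-models.js`), and `planSweep`, `SweepCase`, and the `promotesToResponses` callback are hypothetical names; only the three scenario names, the `--filter` substring behavior, and the skip rule for tools-bearing turns that Mason promotes to Responses come from the text above.

```typescript
type Scenario = "hello-no-tools" | "hello-with-tools" | "multi-system";
const SCENARIOS: Scenario[] = ["hello-no-tools", "hello-with-tools", "multi-system"];

interface SweepCase {
  model: string;
  scenario: Scenario;
}

/**
 * Expand discovered models into model+scenario cases.
 * `filter` mimics --filter <substr>; `promotesToResponses` stands in for the
 * chatLoop rule that routes tools-bearing turns to Responses.
 */
function planSweep(
  models: string[],
  filter: string | undefined,
  promotesToResponses: (model: string) => boolean,
): SweepCase[] {
  const needle = filter?.toLowerCase();
  const selected =
    needle === undefined
      ? models
      : models.filter((m) => m.toLowerCase().includes(needle));

  const cases: SweepCase[] = [];
  for (const model of selected) {
    for (const scenario of SCENARIOS) {
      // Skip what Mason would never send through chat completions.
      if (scenario === "hello-with-tools" && promotesToResponses(model)) continue;
      cases.push({ model, scenario });
    }
  }
  return cases;
}
```

Keeping the skip rule inside the planner, rather than treating a 400 as expected, avoids reporting a failure for a request shape the app itself never produces.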

End-to-end result: 81/81 model+scenario combinations green on the
user's workspace (Claude, GPT, GPT-OSS, Gemini, Llama 3/4, Qwen 3
next + 3.5, Gemma).

Co-authored-by: Isaac
Qwen 3.5 122B, Gemini 2.5, and gpt-oss stream delta.content as an array
of content parts; appending it to a string yielded literal
"[object Object]" in the chat window. Coerce through flattenContent
before accumulating/emitting. Sweep now parses SSE deltas the same way
and fails on "[object Object]" in assembled output.
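
To make the failure mode concrete, here is a minimal sketch of coercing a streamed delta before accumulating it. The `ContentPart` shape (`{ type, text }`) is an assumption based on the common content-part format, and `assembleDeltas` is a hypothetical stand-in for the sweep's SSE assembly; the real `flattenContent` may accept more shapes.

```typescript
// Assumed content-part shape; providers may include other fields.
type ContentPart = { type?: string; text?: string };
type DeltaContent = string | ContentPart[] | null | undefined;

/** Coerce string-or-array delta content into plain text. */
function flattenContent(content: DeltaContent): string {
  if (typeof content === "string") return content;
  if (!Array.isArray(content)) return "";
  return content
    .map((part) => (typeof part.text === "string" ? part.text : ""))
    .join("");
}

/** Accumulate deltas and fail loudly if an object was stringified. */
function assembleDeltas(deltas: DeltaContent[]): string {
  let text = "";
  for (const delta of deltas) {
    text += flattenContent(delta);
  }
  if (text.includes("[object Object]")) {
    throw new Error('assembled output contains "[object Object]"');
  }
  return text;
}
```

Appending the array directly (`text += delta`) is what produced the literal `[object Object]`, because JavaScript stringifies each object element when the array is concatenated onto a string.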

Co-authored-by: Isaac
…r.ts

resolveModelRouting (gateway/format resolution incl. tools->Responses
promotion), executeToolCore (headless load_skill/builtin/MCP dispatch),
and capToolResult move to src/agent-runner.ts so the upcoming workflow
engine can reuse them. getAllToolDefs gains an optional allowlist param
(narrowing only) for per-cell tool selection; renderQuestionCard gains
an optional container. Chat behavior unchanged.
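
The "narrowing only" allowlist could be sketched as below. `ToolDef` and `narrowToolDefs` are hypothetical names used for illustration; the actual `getAllToolDefs` signature is not shown in this PR.

```typescript
// Hypothetical tool definition shape.
interface ToolDef {
  name: string;
  description: string;
}

/**
 * Return all tools when no allowlist is given; otherwise keep only tools
 * whose names appear in it. Unknown names are ignored, so the allowlist
 * can shrink the set but never add to it.
 */
function narrowToolDefs(all: ToolDef[], allowlist?: string[]): ToolDef[] {
  if (allowlist === undefined) return all;
  const allowed = new Set(allowlist);
  return all.filter((tool) => allowed.has(tool.name));
}
```

Because omitting the parameter returns the full set unchanged, existing chat call sites would keep their current behavior.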

Co-authored-by: Isaac
Visual node-based workflow designer (sidebar button above Profile,
Cmd+D). Cells select a model + per-cell tool subset + prompt; flow
edges pipe outputs downstream with labeled multi-input joins; dashed
feedback edges create bounded revision loops driven by a forced
route_output verdict tool on gate cells. Sequential execution through
the existing chat IPC; budgets (40 inner / 5 feedback / 25 global)
guard against runaway loops. Workflows persist to ~/.mason/workflows.
Includes the Spec → Implement → Test → Review template.

Co-authored-by: Isaac
DoyleDev and others added 3 commits June 11, 2026 12:40
…e pressure

Loops were already bounded (5 revisions per cell, 25 cell runs global,
gatekeeper forces a terminal route on exhaustion) but invisibly so.
Now: feedback-target cells show an editable revision cap (1-20) on the
card; status pills show loop progress (running 3/5); gates see
'revision N of M' annotations on revised inputs plus per-route budget
state and an explicit don't-chase-perfection instruction. Verified
against a critic prompted to never be satisfied: forced to route end
after the cap, 4 cell runs.
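
A simplified sketch of the bounded feedback loop, assuming a per-cell revision counter and a global run counter. `LoopState`, `newLoopState`, and `resolveRoute` are hypothetical names, and the engine's real bookkeeping (including the inner budget) is not reproduced here; the defaults of 5 and 25 and the 1-20 editable range are taken from the description above.

```typescript
type Route = "revise" | "end";

interface LoopState {
  revisionCap: number; // per-cell cap, editable in the 1-20 range
  globalCap: number; // total cell runs allowed across the workflow
  revisions: Map<string, number>;
  cellRuns: number; // the caller increments this once per executed cell
}

function newLoopState(revisionCap = 5, globalCap = 25): LoopState {
  const cap = Math.min(20, Math.max(1, Math.floor(revisionCap)));
  return { revisionCap: cap, globalCap, revisions: new Map<string, number>(), cellRuns: 0 };
}

/** Honor a gate's requested route unless a budget is exhausted. */
function resolveRoute(
  state: LoopState,
  requested: Route,
  targetCell: string,
): { route: Route; pill: string } {
  const used = state.revisions.get(targetCell) ?? 0;
  const exhausted = used >= state.revisionCap || state.cellRuns >= state.globalCap;
  if (requested === "end" || exhausted) {
    // Budget exhaustion forces the terminal route regardless of the verdict.
    return { route: "end", pill: `${used}/${state.revisionCap}` };
  }
  state.revisions.set(targetCell, used + 1);
  return { route: "revise", pill: `running ${used + 1}/${state.revisionCap}` };
}
```

With a cap of 2, a gate that always asks for another revision gets two `revise` routes and is then forced to `end`, which mirrors the never-satisfied critic check described above.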

Co-authored-by: Isaac