173 changes: 173 additions & 0 deletions .github/skills/chat-perf/SKILL.md
@@ -0,0 +1,173 @@
---
name: chat-perf
description: Run chat perf benchmarks and memory leak checks against the local dev build or any published VS Code version. Use when investigating chat rendering regressions, validating perf-sensitive changes to chat UI, or checking for memory leaks in the chat response pipeline.
---

# Chat Performance Testing

## When to use

- Before/after modifying chat rendering code (`chatListRenderer.ts`, `chatInputPart.ts`, markdown rendering)
- When changing the streaming response pipeline or SSE processing
- When modifying disposable/lifecycle patterns in chat components
- To compare performance between two VS Code releases
- In CI to gate PRs that touch chat UI code

## Quick start

```bash
# Run perf regression test (compares local dev build vs VS Code 1.115.0):
npm run perf:chat -- --scenario text-only --runs 3

# Run all scenarios with no baseline (just measure):
npm run perf:chat -- --no-baseline --runs 3

# Run memory leak check (10 messages in one session):
npm run perf:chat-leak

# Run leak check with more messages for accuracy:
npm run perf:chat-leak -- --messages 20 --verbose
```

## Perf regression test

**Script:** `scripts/chat-simulation/test-chat-perf-regression.js`
**npm:** `npm run perf:chat`

Launches VS Code via Playwright Electron, opens the chat panel, sends a message with a mock LLM response, and measures timing, layout, and rendering metrics. By default, it downloads VS Code 1.115.0 as a baseline, benchmarks it, then benchmarks the local dev build and compares the two.
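
Under the hood this is a standard Playwright Electron launch. A minimal sketch of that step, assuming an already-resolved executable path (the `--disable-extension` arg matches the Architecture notes below; everything else is illustrative):

```js
// Minimal sketch: drive a VS Code build with Playwright Electron.
// `executablePath` resolution is up to the caller; args are illustrative.
const { _electron: electron } = require('playwright');

async function launchVSCode(executablePath) {
  const app = await electron.launch({
    executablePath,
    args: ['--disable-extension=vscode.vscode-api-tests'],
  });
  const window = await app.firstWindow(); // the workbench window
  await window.waitForLoadState('domcontentloaded');
  return { app, window };
}
```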

### Key flags

| Flag | Default | Description |
|---|---|---|
| `--runs <n>` | `5` | Runs per scenario. More = more stable. Use 5+ for CI. |
| `--scenario <id>` / `-s` | all | Scenario to test (repeatable). See `common/perf-scenarios.js`. |
| `--build <path\|ver>` / `-b` | local dev | Build to test. Accepts path or version (`1.110.0`, `insiders`, commit hash). |
| `--baseline <path>` || Compare against a previously saved baseline JSON file. |
| `--baseline-build <ver>` | `1.115.0` | Version to download and benchmark as baseline. |
| `--no-baseline` || Skip baseline comparison entirely. |
| `--save-baseline` || Save results as the new baseline (requires `--baseline <path>`). |
| `--resume <path>` || Resume a previous run, adding more iterations to increase confidence. |
| `--threshold <frac>` | `0.2` | Regression threshold (0.2 = flag if 20% slower). |
| `--no-cache` || Ignore cached baseline data, always run fresh. |
| `--ci` || CI mode: write Markdown summary to `ci-summary.md` (implies `--no-cache`). |
| `--verbose` || Print per-run details including response content. |

### Comparing two remote builds

```bash
# Compare 1.110.0 against 1.115.0 (no local build needed):
npm run perf:chat -- --build 1.110.0 --baseline-build 1.115.0 --runs 5
```

### Resuming a run for more confidence

When results exceed the threshold but aren't statistically significant, the tool prints a `--resume` hint. Use it to add more iterations to an existing run:

```bash
# Initial run with 3 iterations — may be inconclusive:
npm run perf:chat -- --scenario text-only --runs 3

# Add 3 more runs to the same results file (both test + baseline):
npm run perf:chat -- --resume .chat-simulation-data/2026-04-14T02-15-14/results.json --runs 3

# Keep adding until confidence is reached:
npm run perf:chat -- --resume .chat-simulation-data/2026-04-14T02-15-14/results.json --runs 5
```

`--resume` loads the previous `results.json` and its associated `baseline-*.json`, runs N more iterations for both builds, merges `rawRuns`, recomputes stats, and re-runs the comparison. The updated files are written back in-place. You can resume multiple times — samples accumulate.
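
Conceptually the merge is plain sample accumulation. A sketch, assuming a per-metric `rawRuns` array in the results file (the exact schema isn't documented here):

```js
// Sketch: fold N new iterations into an existing results.json.
// The `metrics[...].rawRuns` shape is an assumption for illustration.
const fs = require('fs');

function mergeRuns(resultsPath, newRunsByMetric, recomputeStats) {
  const results = JSON.parse(fs.readFileSync(resultsPath, 'utf8'));
  for (const [metric, runs] of Object.entries(newRunsByMetric)) {
    const entry = results.metrics[metric];
    entry.rawRuns.push(...runs);                 // samples accumulate
    entry.stats = recomputeStats(entry.rawRuns); // median, cv, etc.
  }
  fs.writeFileSync(resultsPath, JSON.stringify(results, null, 2)); // in-place
}
```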

### Statistical significance

Regression detection uses **Welch's t-test** to avoid false positives from noisy measurements. A metric is only flagged as `REGRESSION` when it both exceeds the threshold AND is statistically significant (p < 0.05). Otherwise it's reported as `(likely noise — p=X, not significant)`.

With typical variance (cv ≈ 20%), you need:
- **n ≥ 5** per build to detect a 35% regression at 95% confidence
- **n ≥ 10** per build to detect a 20% regression reliably

Confidence levels reported: `high` (p < 0.01), `medium` (p < 0.05), `low` (p < 0.1), `none`.
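
For reference, the core of Welch's t-test on two sets of raw run values is only a few lines; this is the textbook formulation, not the script's actual code:

```js
// Welch's t-test core: t statistic plus Welch–Satterthwaite degrees of freedom.
// The p-value is then read from the t-distribution CDF at (t, df).
function welch(a, b) {
  const mean = xs => xs.reduce((s, x) => s + x, 0) / xs.length;
  const sampleVar = xs => {
    const m = mean(xs);
    return xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1);
  };
  const sa = sampleVar(a) / a.length;
  const sb = sampleVar(b) / b.length;
  const t = (mean(a) - mean(b)) / Math.sqrt(sa + sb);
  const df = (sa + sb) ** 2 /
    (sa ** 2 / (a.length - 1) + sb ** 2 / (b.length - 1));
  return { t, df };
}
```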

### Exit codes

- `0` — all metrics within threshold, or exceeding threshold but not statistically significant
- `1` — statistically significant regression detected, or all runs failed

### Scenarios

Scenarios are defined in `scripts/chat-simulation/common/perf-scenarios.js` and registered via `registerPerfScenarios()`. There are three categories:

- **Content-only** — plain streaming responses (e.g. `text-only`, `large-codeblock`, `rapid-stream`)
- **Tool-call** — multi-turn scenarios with tool invocations (e.g. `tool-read-file`, `tool-edit-file`)
- **Multi-turn user** — multi-turn conversations with user follow-ups and thinking blocks (e.g. `thinking-response`, `multi-turn-user`, `long-conversation`)

Run `npm run perf:chat -- --help` to see the full list of registered scenario IDs.

### Metrics collected

- **Timing:** time to first token, time to complete (prefers internal `code/chat/*` perf marks, falls back to client-side measurement; see the sketch after this list)
- **Rendering:** layout count, style recalculation count, forced reflows, long tasks (>50ms)
- **Memory:** heap before/after (informational, noisy for single requests)
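
A sketch of that timing fallback as it might run in the renderer; `code/chat/firstToken` is a placeholder mark name, not necessarily one the workbench emits:

```js
// Prefer an internal performance mark; otherwise fall back to wall-clock time.
// 'code/chat/firstToken' is a hypothetical mark name for illustration.
function timeToFirstToken(requestStartMs) {
  const marks = performance.getEntriesByName('code/chat/firstToken', 'mark');
  if (marks.length > 0) {
    return marks[marks.length - 1].startTime - requestStartMs;
  }
  return performance.now() - requestStartMs; // client-side fallback
}
```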

### Statistics

Results use **IQR-based outlier removal** and **median** (not mean) to handle startup jitter. The **coefficient of variation (cv)** is reported — under 15% is stable, over 15% gets a ⚠ warning. Baseline comparison uses **Welch's t-test** on raw run values to determine statistical significance before flagging regressions. Use 5+ runs to get stable results.
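
Put together, the aggregation looks roughly like this (standard 1.5×IQR fencing; a sketch of the approach, not the script's implementation):

```js
// Aggregate raw runs: drop 1.5×IQR outliers, report median and cv.
function aggregate(runs) {
  const sorted = [...runs].sort((x, y) => x - y);
  const q = p => sorted[Math.floor(p * (sorted.length - 1))]; // crude quantile
  const iqr = q(0.75) - q(0.25);
  const kept = sorted.filter(x => x >= q(0.25) - 1.5 * iqr && x <= q(0.75) + 1.5 * iqr);
  const median = kept[Math.floor(kept.length / 2)];
  const mean = kept.reduce((s, x) => s + x, 0) / kept.length;
  const sd = Math.sqrt(kept.reduce((s, x) => s + (x - mean) ** 2, 0) / (kept.length - 1));
  return { median, cv: sd / mean }; // cv > 0.15 earns the ⚠ warning
}
```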

## Memory leak check

**Script:** `scripts/chat-simulation/test-chat-mem-leaks.js`
**npm:** `npm run perf:chat-leak`

Launches one VS Code session, sends N messages sequentially, forces GC between each, and measures renderer heap and DOM node count. Uses **linear regression** on the samples to compute per-message growth rate, which is compared against a threshold.
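
The growth rate is an ordinary least-squares slope over (message index, heap MB) pairs; a minimal sketch:

```js
// Least-squares slope of heap usage vs. message index (MB per message).
// A sustained positive slope beyond the threshold suggests a leak.
function heapGrowthSlope(heapMB) {
  const n = heapMB.length;
  const xMean = (n - 1) / 2; // mean of indices 0..n-1
  const yMean = heapMB.reduce((s, y) => s + y, 0) / n;
  let num = 0, den = 0;
  for (let i = 0; i < n; i++) {
    num += (i - xMean) * (heapMB[i] - yMean);
    den += (i - xMean) ** 2;
  }
  return num / den;
}
```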

### Key flags

| Flag | Default | Description |
|---|---|---|
| `--messages <n>` / `-n` | `10` | Number of messages to send. More = more accurate slope. |
| `--build <path\|ver>` / `-b` | local dev | Build to test. |
| `--threshold <MB>` | `2` | Max per-message heap growth in MB. |
| `--verbose` || Print per-message heap/DOM counts. |

### What it measures

- **Heap growth slope** (MB/message) — linear regression over forced-GC heap samples. A leak shows as sustained positive slope.
- **DOM node growth** (nodes/message) — catches rendering leaks where elements aren't cleaned up. Healthy chat virtualizes old messages so node count plateaus.

### Interpreting results

- `0.3–1.0 MB/msg` — normal (V8 internal overhead, string interning)
- `>2.0 MB/msg` — likely leak, investigate retained objects
- DOM nodes stable after first message — normal (chat list virtualization working)
- DOM nodes growing linearly — rendering leak, check disposable cleanup

## Architecture

```
scripts/chat-simulation/
├── common/
│   ├── mock-llm-server.js    # Mock CAPI server matching @vscode/copilot-api URL structure
│   ├── perf-scenarios.js     # Built-in scenario definitions (content, tool-call, multi-turn)
│   └── utils.js              # Shared: paths, env setup, stats, launch helpers
├── config.jsonc              # Default config (baseline version, runs, thresholds)
├── fixtures/                 # TypeScript fixture files used by tool-call scenarios
├── test-chat-perf-regression.js
└── test-chat-mem-leaks.js
```

### Mock server

The mock LLM server (`common/mock-llm-server.js`) implements the full CAPI URL structure from `@vscode/copilot-api`'s `DomainService`:

- `GET /models` — returns model metadata
- `POST /models/session` — returns `AutoModeAPIResponse` with `available_models` and `session_token`
- `POST /models/session/intent` — model router
- `POST /chat/completions` — SSE streaming response matching the scenario
- Agent, session, telemetry, and token endpoints

The copilot extension connects to this server via `IS_SCENARIO_AUTOMATION=1` mode with `overrideCapiUrl` and `overrideProxyUrl` settings. The `vscode-api-tests` extension must be disabled (`--disable-extension=vscode.vscode-api-tests`) because it contributes a duplicate `copilot` vendor that blocks the real extension's language model provider registration.
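
For orientation, the SSE side of `POST /chat/completions` might look roughly like the sketch below; the chunk payload shape is illustrative, while the real server matches the CAPI wire format:

```js
// Sketch: stream a scenario's chunks as Server-Sent Events.
// The JSON payload shape here is illustrative, not the exact CAPI schema.
function streamCompletion(res, chunks) {
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
  });
  for (const text of chunks) {
    res.write(`data: ${JSON.stringify({ choices: [{ delta: { content: text } }] })}\n\n`);
  }
  res.write('data: [DONE]\n\n');
  res.end();
}
```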

### Adding a scenario

1. Add a new entry to the appropriate object (`CONTENT_SCENARIOS`, `TOOL_CALL_SCENARIOS`, or `MULTI_TURN_SCENARIOS`) in `common/perf-scenarios.js` using the `ScenarioBuilder` API from `common/mock-llm-server.js` (see the sketch after this list)
2. The scenario is auto-registered by `registerPerfScenarios()` — no manual ID list to update
3. Run: `npm run perf:chat -- --scenario your-new-scenario --runs 1 --no-baseline --verbose`
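
Since the `ScenarioBuilder` API isn't spelled out in this document, treat the following as a hypothetical shape for step 1; every method name on the builder is an assumption:

```js
// Hypothetical sketch of a new content scenario; the ScenarioBuilder
// method names below are assumptions, not the documented API.
const { ScenarioBuilder } = require('./mock-llm-server');

const CONTENT_SCENARIOS = {
  'your-new-scenario': new ScenarioBuilder()
    .streamText('Here is the fix:\n')
    .streamCodeBlock('ts', 'export const answer = 42;')
    .build(),
};
```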
209 changes: 209 additions & 0 deletions .github/workflows/chat-perf.yml
@@ -0,0 +1,209 @@
name: Chat Performance Comparison

on:
  pull_request:
    paths:
      - '.github/workflows/chat-perf.yml'
  schedule:
    # Nightly at 12:00 AM PT (07:00 UTC)
    - cron: '0 7 * * *'
  workflow_dispatch:
    inputs:
      baseline_commit:
        description: 'Baseline commit SHA or version (e.g. "1.115.0", "abc1234")'
        required: true
        type: string
      test_commit:
        description: 'Test commit SHA or version (e.g. "1.115.0", "abc1234")'
        required: true
        type: string
      runs:
        description: 'Runs per scenario (default: 7 for statistical significance)'
        required: false
        type: number
        default: 7
      scenarios:
        description: 'Comma-separated scenario list (empty = all)'
        required: false
        type: string
        default: ''
      threshold:
        description: 'Regression threshold fraction (default: 0.2 = 20%)'
        required: false
        type: number
        default: 0.2
      skip_leak_check:
        description: 'Skip the memory leak check step'
        required: false
        type: boolean
        default: true

permissions:
  contents: read

concurrency:
  group: chat-perf-${{ github.run_id }}
  cancel-in-progress: true

env:
  # Only set when explicitly provided; otherwise scripts read config.jsonc
  BASELINE_COMMIT: ${{ inputs.baseline_commit || '' }}
  TEST_COMMIT: ${{ inputs.test_commit || '' }}
  PERF_RUNS: ${{ inputs.runs || '' }}
  PERF_THRESHOLD: ${{ inputs.threshold || '' }}

jobs:
  chat-perf:
    name: Chat Perf
    runs-on: ubuntu-latest
    timeout-minutes: 120
    steps:
      - name: Checkout test commit
        uses: actions/checkout@v6

      - name: Setup Node.js
        uses: actions/setup-node@v6
        with:
          node-version-file: .nvmrc

      - name: Install system dependencies
        run: |
          sudo apt update -y
          sudo apt install -y \
            build-essential pkg-config \
            libx11-dev libx11-xcb-dev libxkbfile-dev \
            libnotify-bin libkrb5-dev \
            xvfb sqlite3 \
            libnss3 libatk1.0-0 libatk-bridge2.0-0 \
            libcups2t64 libdrm2 libxcomposite1 libxdamage1 \
            libxrandr2 libgbm1 libpango-1.0-0 libcairo2 \
            libasound2t64 libxshmfence1 libgtk-3-0

      - name: Install dependencies
        run: npm ci
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      - name: Install build dependencies
        run: npm ci
        working-directory: build

      - name: Transpile source
        run: npm run transpile-client

      - name: Build copilot extension
        run: npm run compile
        working-directory: extensions/copilot

      - name: Download Electron
        run: node build/lib/preLaunch.ts

      - name: Install Playwright Chromium
        run: npx playwright install chromium

      - name: Run chat perf comparison
        id: perf
        run: |
          SCENARIO_ARGS=""
          if [[ -n "${{ inputs.scenarios }}" ]]; then
            IFS=',' read -ra SCENS <<< "${{ inputs.scenarios }}"
            for s in "${SCENS[@]}"; do
              SCENARIO_ARGS="$SCENARIO_ARGS --scenario $(echo "$s" | xargs)"
            done
          fi

          PERF_ARGS="--ci"
          if [[ -n "$BASELINE_COMMIT" ]]; then
            PERF_ARGS="$PERF_ARGS --baseline-build $BASELINE_COMMIT"
          fi
          if [[ -n "$TEST_COMMIT" ]]; then
            PERF_ARGS="$PERF_ARGS --build $TEST_COMMIT"
          fi
          if [[ -n "$PERF_RUNS" ]]; then
            PERF_ARGS="$PERF_ARGS --runs $PERF_RUNS"
          fi
          if [[ -n "$PERF_THRESHOLD" ]]; then
            PERF_ARGS="$PERF_ARGS --threshold $PERF_THRESHOLD"
          fi

          xvfb-run node scripts/chat-simulation/test-chat-perf-regression.js \
            $PERF_ARGS \
            $SCENARIO_ARGS \
            2>&1 | tee perf-output.log

          # Extract exit code from the script (tee masks it)
          exit ${PIPESTATUS[0]}
        continue-on-error: true

      - name: Run memory leak check
        id: leak
        if: inputs.skip_leak_check != true
        run: |
          LEAK_ARGS="--verbose"
          if [[ -n "$TEST_COMMIT" ]]; then
            LEAK_ARGS="$LEAK_ARGS --build $TEST_COMMIT"
          fi

          xvfb-run node scripts/chat-simulation/test-chat-mem-leaks.js \
            $LEAK_ARGS \
            2>&1 | tee leak-output.log

          exit ${PIPESTATUS[0]}
        continue-on-error: true

      - name: Write job summary
        if: always()
        run: |
          if [[ -f .chat-simulation-data/ci-summary.md ]]; then
            cat .chat-simulation-data/ci-summary.md >> "$GITHUB_STEP_SUMMARY"
          else
            echo "⚠️ No summary file generated. Check perf-output.log artifact." >> "$GITHUB_STEP_SUMMARY"
          fi

          if [[ "${{ inputs.skip_leak_check }}" != "true" && -f .chat-simulation-data/chat-simulation-leak-results.json ]]; then
            echo "" >> "$GITHUB_STEP_SUMMARY"
            echo "## Memory Leak Check" >> "$GITHUB_STEP_SUMMARY"
            echo "" >> "$GITHUB_STEP_SUMMARY"
            echo '```json' >> "$GITHUB_STEP_SUMMARY"
            cat .chat-simulation-data/chat-simulation-leak-results.json >> "$GITHUB_STEP_SUMMARY"
            echo '```' >> "$GITHUB_STEP_SUMMARY"
          fi

      - name: Zip diagnostic outputs
        if: always()
        run: |
          # Find the most recent timestamped run directory
          RUN_DIR=$(ls -td .chat-simulation-data/20*/ 2>/dev/null | head -1)
          if [[ -n "$RUN_DIR" ]]; then
            # Zip everything: results JSON, CPU profiles, traces, heap snapshots
            cd .chat-simulation-data
            zip -r ../chat-perf-artifacts.zip \
              "$(basename "$RUN_DIR")"/ \
              ci-summary.md \
              baseline-*.json \
              chat-simulation-leak-results.json \
              2>/dev/null || true
            cd ..
          fi

      - name: Upload perf artifacts
        if: always()
        uses: actions/upload-artifact@v7
        with:
          name: chat-perf-${{ env.BASELINE_COMMIT || 'default-baseline' }}-vs-${{ env.TEST_COMMIT }}
          path: |
            chat-perf-artifacts.zip
            perf-output.log
            leak-output.log
          retention-days: 30

      - name: Fail on regression
        if: steps.perf.outcome == 'failure' || (inputs.skip_leak_check != true && steps.leak.outcome == 'failure')
        run: |
          if [[ "${{ steps.perf.outcome }}" == "failure" ]]; then
            echo "::error::Chat performance regression detected. See job summary for details."
          fi
          if [[ "${{ inputs.skip_leak_check }}" != "true" && "${{ steps.leak.outcome }}" == "failure" ]]; then
            echo "::error::Chat memory leak detected. See leak-output.log for details."
          fi
          exit 1
1 change: 1 addition & 0 deletions .gitignore
@@ -25,6 +25,7 @@ product.overrides.json
*.snap.actual
*.tsbuildinfo
.vscode-test
.chat-simulation-data
vscode-telemetry-docs/
test-output.json
test/componentFixtures/.screenshots/*
1 change: 1 addition & 0 deletions build/filters.ts
@@ -162,6 +162,7 @@ export const copyrightFilter = Object.freeze<string[]>([
'**',
'!**/*.desktop',
'!**/*.json',
'!**/*.jsonc',
'!**/*.jsonl',
'!**/*.html',
'!**/*.template',