# Add performance tests (#309700)

Draft PR by pwang347: 13 commits merging `pawang/perfTesting` into `main`.
---
name: chat-perf
description: Run chat perf benchmarks and memory leak checks against the local dev build or any published VS Code version. Use when investigating chat rendering regressions, validating perf-sensitive changes to chat UI, or checking for memory leaks in the chat response pipeline.
---

# Chat Performance Testing

## When to use

- Before/after modifying chat rendering code (`chatListRenderer.ts`, `chatInputPart.ts`, markdown rendering)
- When changing the streaming response pipeline or SSE processing
- When modifying disposable/lifecycle patterns in chat components
- To compare performance between two VS Code releases
- In CI to gate PRs that touch chat UI code

## Quick start

```bash
# Run the perf regression test (compares local dev build vs VS Code 1.115.0):
npm run perf:chat -- --scenario text-only --runs 3

# Run all scenarios with no baseline (just measure):
npm run perf:chat -- --no-baseline --runs 3

# Run the memory leak check (10 messages in one session):
npm run perf:chat-leak

# Run the leak check with more messages for accuracy:
npm run perf:chat-leak -- --messages 20 --verbose
```

## Perf regression test

**Script:** `scripts/chat-simulation/test-chat-perf-regression.js`
**npm:** `npm run perf:chat`

Launches VS Code via Playwright Electron, opens the chat panel, sends a message with a mock LLM response, and measures timing, layout, and rendering metrics. By default, it downloads VS Code 1.115.0 as a baseline, benchmarks it, then benchmarks the local dev build and compares the two.

### Key flags

| Flag | Default | Description |
|---|---|---|
| `--runs <n>` | `5` | Runs per scenario. More = more stable. Use 5+ for CI. |
| `--scenario <id>` / `-s` | all | Scenario to test (repeatable). See `common/perf-scenarios.js`. |
| `--build <path\|ver>` / `-b` | local dev | Build to test. Accepts a path or a version (`1.110.0`, `insiders`, commit hash). |
| `--baseline <path>` | — | Compare against a previously saved baseline JSON file. |
| `--baseline-build <ver>` | `1.115.0` | Version to download and benchmark as the baseline. |
| `--no-baseline` | — | Skip baseline comparison entirely. |
| `--save-baseline` | — | Save results as the new baseline (requires `--baseline <path>`). |
| `--resume <path>` | — | Resume a previous run, adding more iterations to increase confidence. |
| `--threshold <frac>` | `0.2` | Regression threshold (0.2 = flag if 20% slower). |
| `--no-cache` | — | Ignore cached baseline data; always run fresh. |
| `--ci` | — | CI mode: write a Markdown summary to `ci-summary.md` (implies `--no-cache`). |
| `--verbose` | — | Print per-run details, including response content. |

### Comparing two remote builds

```bash
# Compare 1.110.0 against 1.115.0 (no local build needed):
npm run perf:chat -- --build 1.110.0 --baseline-build 1.115.0 --runs 5
```

### Resuming a run for more confidence

When results exceed the threshold but aren't statistically significant, the tool prints a `--resume` hint. Use it to add more iterations to an existing run:

```bash
# Initial run with 3 iterations — may be inconclusive:
npm run perf:chat -- --scenario text-only --runs 3

# Add 3 more runs to the same results file (both test + baseline):
npm run perf:chat -- --resume .chat-simulation-data/2026-04-14T02-15-14/results.json --runs 3

# Keep adding until confidence is reached:
npm run perf:chat -- --resume .chat-simulation-data/2026-04-14T02-15-14/results.json --runs 5
```

`--resume` loads the previous `results.json` and its associated `baseline-*.json`, runs N more iterations for both builds, merges rawRuns, recomputes stats, and re-runs the comparison. The updated files are written back in place. You can resume multiple times — samples accumulate.
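The merge step amounts to concatenating the old and new raw samples per metric and recomputing summary statistics over the union. A minimal sketch of that idea (the results-file shape and the `mergeRawRuns` helper are hypothetical, not the script's actual code):

```javascript
// Hypothetical sketch of what a resume-style merge does per metric:
// concatenate old and new raw samples, then recompute stats on the union.
function mergeRawRuns(previous, fresh) {
  const merged = {};
  const metrics = new Set([...Object.keys(previous), ...Object.keys(fresh)]);
  for (const metric of metrics) {
    const samples = [...(previous[metric] ?? []), ...(fresh[metric] ?? [])];
    const sorted = [...samples].sort((a, b) => a - b);
    merged[metric] = {
      rawRuns: samples,                            // accumulated samples
      n: samples.length,                           // grows on every resume
      median: sorted[Math.floor(sorted.length / 2)], // simplistic odd-n median
    };
  }
  return merged;
}
```

Because samples only ever accumulate, each resume tightens the statistics rather than replacing them.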
### Statistical significance

Regression detection uses **Welch's t-test** to avoid false positives from noisy measurements. A metric is only flagged as `REGRESSION` when it both exceeds the threshold AND is statistically significant (p < 0.05). Otherwise it's reported as `(likely noise — p=X, not significant)`.

With typical variance (cv ≈ 20%), you need:

- **n ≥ 5** per build to detect a 35% regression at 95% confidence
- **n ≥ 10** per build to detect a 20% regression reliably

Confidence levels reported: `high` (p < 0.01), `medium` (p < 0.05), `low` (p < 0.1), `none`.
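The core of Welch's test is the t statistic and the Welch–Satterthwaite degrees of freedom. The sketch below is an illustrative re-implementation (not the script's actual code); turning `t` and `df` into a p-value would additionally require a Student-t CDF:

```javascript
// Welch's t-test ingredients: t statistic and degrees of freedom for two
// samples with unequal variances. Illustrative only.
function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function sampleVariance(xs) {
  const m = mean(xs);
  return xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1);
}

function welchT(baseline, test) {
  const [va, vb] = [sampleVariance(baseline), sampleVariance(test)];
  const [na, nb] = [baseline.length, test.length];
  const se2 = va / na + vb / nb; // squared standard error of the difference
  const t = (mean(baseline) - mean(test)) / Math.sqrt(se2);
  // Welch–Satterthwaite approximation for degrees of freedom:
  const df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1));
  return { t, df };
}
```

This is also why small `--runs` values are often inconclusive: with n per build in the single digits, `df` is tiny and only large regressions clear p < 0.05.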
### Exit codes

- `0` — all metrics within threshold, or exceeding threshold but not statistically significant
- `1` — statistically significant regression detected, or all runs failed

### Scenarios

Scenarios are defined in `scripts/chat-simulation/common/perf-scenarios.js` and registered via `registerPerfScenarios()`. There are three categories:

- **Content-only** — plain streaming responses (e.g. `text-only`, `large-codeblock`, `rapid-stream`)
- **Tool-call** — multi-turn scenarios with tool invocations (e.g. `tool-read-file`, `tool-edit-file`)
- **Multi-turn user** — multi-turn conversations with user follow-ups and thinking blocks (e.g. `thinking-response`, `multi-turn-user`, `long-conversation`)

Run `npm run perf:chat -- --help` to see the full list of registered scenario IDs.

### Metrics collected

- **Timing:** time to first token, time to complete (prefers internal `code/chat/*` perf marks, falls back to client-side measurement)
- **Rendering:** layout count, style recalculation count, forced reflows, long tasks (>50ms)
- **Memory:** heap before/after (informational, noisy for single requests)

### Statistics

Results use **IQR-based outlier removal** and the **median** (not the mean) to handle startup jitter. The **coefficient of variation (cv)** is reported — under 15% is stable, over 15% gets a ⚠ warning. Baseline comparison uses **Welch's t-test** on raw run values to determine statistical significance before flagging regressions. Use 5+ runs to get stable results.
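The robust-stats pipeline described above can be sketched as follows. This is an illustrative re-implementation, assuming the conventional 1.5×IQR fences and linear-interpolation quantiles:

```javascript
// IQR-based outlier removal + median + coefficient of variation, as used to
// summarize a metric's raw runs. Illustrative only.
function quantile(sorted, p) {
  const idx = (sorted.length - 1) * p;
  const lo = Math.floor(idx);
  const hi = Math.ceil(idx);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (idx - lo);
}

function robustStats(runs) {
  const sorted = [...runs].sort((a, b) => a - b);
  const [q1, q3] = [quantile(sorted, 0.25), quantile(sorted, 0.75)];
  const iqr = q3 - q1;
  // Drop anything outside the 1.5×IQR fences (startup jitter, GC pauses):
  const kept = sorted.filter(x => x >= q1 - 1.5 * iqr && x <= q3 + 1.5 * iqr);
  const median = quantile(kept, 0.5);
  const mean = kept.reduce((a, b) => a + b, 0) / kept.length;
  const sd = Math.sqrt(
    kept.reduce((a, x) => a + (x - mean) ** 2, 0) / (kept.length - 1)
  );
  return { median, cv: sd / mean, dropped: runs.length - kept.length };
}
```

A cv above 0.15 is the sort of result that earns the ⚠ warning; the fix is usually more runs, not a different metric.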
## Memory leak check

**Script:** `scripts/chat-simulation/test-chat-mem-leaks.js`
**npm:** `npm run perf:chat-leak`

Launches one VS Code session, sends N messages sequentially, forces GC between each, and measures renderer heap and DOM node count. Uses **linear regression** on the samples to compute a per-message growth rate, which is compared against a threshold.

### Key flags

| Flag | Default | Description |
|---|---|---|
| `--messages <n>` / `-n` | `10` | Number of messages to send. More = more accurate slope. |
| `--build <path\|ver>` / `-b` | local dev | Build to test. |
| `--threshold <MB>` | `2` | Max per-message heap growth in MB. |
| `--verbose` | — | Print per-message heap/DOM counts. |

### What it measures

- **Heap growth slope** (MB/message) — linear regression over forced-GC heap samples. A leak shows as a sustained positive slope.
- **DOM node growth** (nodes/message) — catches rendering leaks where elements aren't cleaned up. A healthy chat virtualizes old messages, so the node count plateaus.
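The growth rate is just the ordinary least-squares slope over (message index, heap MB) samples. A minimal sketch of that computation (illustrative, not the script's actual code):

```javascript
// OLS slope over paired samples, e.g. xs = message indices, ys = heap sizes
// in MB after forced GC. A sustained positive slope suggests a leak.
function slope(xs, ys) {
  const n = xs.length;
  const mx = xs.reduce((a, b) => a + b, 0) / n;
  const my = ys.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (xs[i] - mx) * (ys[i] - my);
    den += (xs[i] - mx) ** 2;
  }
  return num / den; // units: MB per message
}
```

Using the regression slope rather than `(last - first) / N` makes the estimate robust to a single noisy GC sample in the middle of the run.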
### Interpreting results

- `0.3–1.0 MB/msg` — normal (V8 internal overhead, string interning)
- `>2.0 MB/msg` — likely leak, investigate retained objects
- DOM nodes stable after the first message — normal (chat list virtualization working)
- DOM nodes growing linearly — rendering leak, check disposable cleanup

## Architecture

```
scripts/chat-simulation/
├── common/
│   ├── mock-llm-server.js   # Mock CAPI server matching @vscode/copilot-api URL structure
│   ├── perf-scenarios.js    # Built-in scenario definitions (content, tool-call, multi-turn)
│   └── utils.js             # Shared: paths, env setup, stats, launch helpers
├── config.jsonc             # Default config (baseline version, runs, thresholds)
├── fixtures/                # TypeScript fixture files used by tool-call scenarios
├── test-chat-perf-regression.js
└── test-chat-mem-leaks.js
```

### Mock server

The mock LLM server (`common/mock-llm-server.js`) implements the full CAPI URL structure from `@vscode/copilot-api`'s `DomainService`:

- `GET /models` — returns model metadata
- `POST /models/session` — returns `AutoModeAPIResponse` with `available_models` and `session_token`
- `POST /models/session/intent` — model router
- `POST /chat/completions` — SSE streaming response matching the scenario
- Agent, session, telemetry, and token endpoints
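The streaming endpoint speaks the standard server-sent-events wire format. A hypothetical helper that chunks a scenario's response text into OpenAI-style `chat.completion.chunk` frames might look like this (the field shapes and the `toSseFrames` name are illustrative, not copied from the real mock server):

```javascript
// Hypothetical sketch: turn a scenario's response text into the SSE frames
// that POST /chat/completions streams back. Field names follow the common
// OpenAI-style chunk shape and are illustrative only.
function toSseFrames(text, chunkSize = 8) {
  const frames = [];
  for (let i = 0; i < text.length; i += chunkSize) {
    const payload = {
      object: 'chat.completion.chunk',
      choices: [{ index: 0, delta: { content: text.slice(i, i + chunkSize) } }],
    };
    // Each SSE event is a "data: <json>" line followed by a blank line:
    frames.push(`data: ${JSON.stringify(payload)}\n\n`);
  }
  frames.push('data: [DONE]\n\n'); // conventional stream terminator
  return frames;
}
```

Scenarios like `rapid-stream` effectively vary the chunk size and pacing of exactly this kind of frame sequence.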
The copilot extension connects to this server via `IS_SCENARIO_AUTOMATION=1` mode with the `overrideCapiUrl` and `overrideProxyUrl` settings. The `vscode-api-tests` extension must be disabled (`--disable-extension=vscode.vscode-api-tests`) because it contributes a duplicate `copilot` vendor that blocks the real extension's language model provider registration.

### Adding a scenario

1. Add a new entry to the appropriate object (`CONTENT_SCENARIOS`, `TOOL_CALL_SCENARIOS`, or `MULTI_TURN_SCENARIOS`) in `common/perf-scenarios.js` using the `ScenarioBuilder` API from `common/mock-llm-server.js`
2. The scenario is auto-registered by `registerPerfScenarios()` — no manual ID list to update
3. Run: `npm run perf:chat -- --scenario your-new-scenario --runs 1 --no-baseline --verbose`
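To illustrate the fluent-builder shape such a scenario definition takes, here is a self-contained toy. The `ToyScenarioBuilder` class below is defined purely for demonstration; the real `ScenarioBuilder` API lives in `common/mock-llm-server.js` and its method names may differ:

```javascript
// Toy illustration of a fluent scenario builder. Not the real API.
class ToyScenarioBuilder {
  constructor(id) {
    this.scenario = { id, turns: [] };
  }
  // Append a plain streamed-content turn:
  respondWith(text) {
    this.scenario.turns.push({ kind: 'content', text });
    return this; // fluent chaining
  }
  build() {
    return this.scenario;
  }
}

const demo = new ToyScenarioBuilder('my-scenario')
  .respondWith('Here is a streamed answer.')
  .build();
```

The fluent shape is what makes auto-registration work: each entry in the scenario objects evaluates to a complete scenario value that `registerPerfScenarios()` can pick up without a separate ID list.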
# CI workflow: .github/workflows/chat-perf.yml
````yaml
name: Chat Performance Comparison

on:
  pull_request:
    paths:
      - '.github/workflows/chat-perf.yml'
  schedule:
    # Nightly at 12:00 AM PT (07:00 UTC)
    - cron: '0 7 * * *'
  workflow_dispatch:
    inputs:
      baseline_commit:
        description: 'Baseline commit SHA or version (e.g. "1.115.0", "abc1234")'
        required: true
        type: string
      test_commit:
        description: 'Test commit SHA or version (e.g. "1.115.0", "abc1234")'
        required: true
        type: string
      runs:
        description: 'Runs per scenario (default: 7 for statistical significance)'
        required: false
        type: number
        default: 7
      scenarios:
        description: 'Comma-separated scenario list (empty = all)'
        required: false
        type: string
        default: ''
      threshold:
        description: 'Regression threshold fraction (default: 0.2 = 20%)'
        required: false
        type: number
        default: 0.2
      skip_leak_check:
        description: 'Skip the memory leak check step'
        required: false
        type: boolean
        default: true

permissions:
  contents: read

concurrency:
  group: chat-perf-${{ github.run_id }}
  cancel-in-progress: true

env:
  # Only set when explicitly provided; otherwise scripts read config.jsonc
  BASELINE_COMMIT: ${{ inputs.baseline_commit || '' }}
  TEST_COMMIT: ${{ inputs.test_commit || '' }}
  PERF_RUNS: ${{ inputs.runs || '' }}
  PERF_THRESHOLD: ${{ inputs.threshold || '' }}

jobs:
  chat-perf:
    name: Chat Perf
    runs-on: ubuntu-latest
    timeout-minutes: 120
    steps:
      - name: Checkout test commit
        uses: actions/checkout@v6

      - name: Setup Node.js
        uses: actions/setup-node@v6
        with:
          node-version-file: .nvmrc

      - name: Install system dependencies
        run: |
          sudo apt update -y
          sudo apt install -y \
            build-essential pkg-config \
            libx11-dev libx11-xcb-dev libxkbfile-dev \
            libnotify-bin libkrb5-dev \
            xvfb sqlite3 \
            libnss3 libatk1.0-0 libatk-bridge2.0-0 \
            libcups2t64 libdrm2 libxcomposite1 libxdamage1 \
            libxrandr2 libgbm1 libpango-1.0-0 libcairo2 \
            libasound2t64 libxshmfence1 libgtk-3-0

      - name: Install dependencies
        run: npm ci
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      - name: Install build dependencies
        run: npm ci
        working-directory: build

      - name: Transpile source
        run: npm run transpile-client

      - name: Build copilot extension
        run: npm run compile
        working-directory: extensions/copilot

      - name: Download Electron
        run: node build/lib/preLaunch.ts

      - name: Install Playwright Chromium
        run: npx playwright install chromium

      - name: Run chat perf comparison
        id: perf
        run: |
          SCENARIO_ARGS=""
          if [[ -n "${{ inputs.scenarios }}" ]]; then
            IFS=',' read -ra SCENS <<< "${{ inputs.scenarios }}"
            for s in "${SCENS[@]}"; do
              SCENARIO_ARGS="$SCENARIO_ARGS --scenario $(echo "$s" | xargs)"
            done
          fi

          PERF_ARGS="--ci"
          if [[ -n "$BASELINE_COMMIT" ]]; then
            PERF_ARGS="$PERF_ARGS --baseline-build $BASELINE_COMMIT"
          fi
          if [[ -n "$TEST_COMMIT" ]]; then
            PERF_ARGS="$PERF_ARGS --build $TEST_COMMIT"
          fi
          if [[ -n "$PERF_RUNS" ]]; then
            PERF_ARGS="$PERF_ARGS --runs $PERF_RUNS"
          fi
          if [[ -n "$PERF_THRESHOLD" ]]; then
            PERF_ARGS="$PERF_ARGS --threshold $PERF_THRESHOLD"
          fi

          xvfb-run node scripts/chat-simulation/test-chat-perf-regression.js \
            $PERF_ARGS \
            $SCENARIO_ARGS \
            2>&1 | tee perf-output.log

          # Extract exit code from the script (tee masks it)
          exit ${PIPESTATUS[0]}
        continue-on-error: true

      - name: Run memory leak check
        id: leak
        if: inputs.skip_leak_check != true
        run: |
          LEAK_ARGS="--verbose"
          if [[ -n "$TEST_COMMIT" ]]; then
            LEAK_ARGS="$LEAK_ARGS --build $TEST_COMMIT"
          fi

          xvfb-run node scripts/chat-simulation/test-chat-mem-leaks.js \
            $LEAK_ARGS \
            2>&1 | tee leak-output.log

          exit ${PIPESTATUS[0]}
        continue-on-error: true

      - name: Write job summary
        if: always()
        run: |
          if [[ -f .chat-simulation-data/ci-summary.md ]]; then
            cat .chat-simulation-data/ci-summary.md >> "$GITHUB_STEP_SUMMARY"
          else
            echo "⚠️ No summary file generated. Check perf-output.log artifact." >> "$GITHUB_STEP_SUMMARY"
          fi

          if [[ "${{ inputs.skip_leak_check }}" != "true" && -f .chat-simulation-data/chat-simulation-leak-results.json ]]; then
            echo "" >> "$GITHUB_STEP_SUMMARY"
            echo "## Memory Leak Check" >> "$GITHUB_STEP_SUMMARY"
            echo "" >> "$GITHUB_STEP_SUMMARY"
            echo '```json' >> "$GITHUB_STEP_SUMMARY"
            cat .chat-simulation-data/chat-simulation-leak-results.json >> "$GITHUB_STEP_SUMMARY"
            echo '```' >> "$GITHUB_STEP_SUMMARY"
          fi

      - name: Zip diagnostic outputs
        if: always()
        run: |
          # Find the most recent timestamped run directory
          RUN_DIR=$(ls -td .chat-simulation-data/20*/ 2>/dev/null | head -1)
          if [[ -n "$RUN_DIR" ]]; then
            # Zip everything: results JSON, CPU profiles, traces, heap snapshots
            cd .chat-simulation-data
            zip -r ../chat-perf-artifacts.zip \
              "$(basename "$RUN_DIR")"/ \
              ci-summary.md \
              baseline-*.json \
              chat-simulation-leak-results.json \
              2>/dev/null || true
            cd ..
          fi

      - name: Upload perf artifacts
        if: always()
        uses: actions/upload-artifact@v7
        with:
          name: chat-perf-${{ env.BASELINE_COMMIT || 'default-baseline' }}-vs-${{ env.TEST_COMMIT }}
          path: |
            chat-perf-artifacts.zip
            perf-output.log
            leak-output.log
          retention-days: 30

      - name: Fail on regression
        if: steps.perf.outcome == 'failure' || (inputs.skip_leak_check != true && steps.leak.outcome == 'failure')
        run: |
          if [[ "${{ steps.perf.outcome }}" == "failure" ]]; then
            echo "::error::Chat performance regression detected. See job summary for details."
          fi
          if [[ "${{ inputs.skip_leak_check }}" != "true" && "${{ steps.leak.outcome }}" == "failure" ]]; then
            echo "::error::Chat memory leak detected. See leak-output.log for details."
          fi
          exit 1
````