
update eval-driven-dev skill#1333

Closed
yiouli wants to merge 0 commits into github:staged from yiouli:staged

Conversation

@yiouli
Contributor

@yiouli yiouli commented Apr 8, 2026

Pull Request Checklist

  • I have read and followed the CONTRIBUTING.md guidelines.
  • I have read and followed the Guidance for submissions involving paid services.
  • My contribution adds a new instruction, prompt, agent, skill, or workflow file in the correct directory.
  • The file follows the required naming convention.
  • The content is clearly structured and follows the example format.
  • I have tested my instructions, prompt, agent, skill, or workflow with GitHub Copilot.
  • I have run npm start and verified that README.md is up to date.
  • I am targeting the staged branch for this pull request.

Description

Update eval-driven-dev skill with streamlined workflow.

Shorten the workflow steps and reduce the amount of setup required.


Type of Contribution

  • New instruction file.
  • New prompt file.
  • New agent file.
  • New plugin.
  • New skill file.
  • New agentic workflow.
  • Update to existing instruction, prompt, agent, plugin, skill, or workflow.
  • Other (please specify):

Additional Notes


By submitting this pull request, I confirm that my contribution abides by the Code of Conduct and will be licensed under the MIT License.

@yiouli yiouli requested a review from aaronpowell as a code owner April 8, 2026 15:21
Copilot AI review requested due to automatic review settings April 8, 2026 15:21
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

🔍 Skill Validator Results

2 resource(s) checked | ✅ All checks passed

Full output
Found 1 skill(s)
[eval-driven-dev] 📊 eval-driven-dev: 2,484 BPE tokens [chars/4: 2,730] (detailed ✓), 12 sections, 3 code blocks
[eval-driven-dev]    ⚠  No numbered workflow steps — agents follow sequenced procedures more reliably.
✅ All checks passed (1 skill(s))

Contributor

Copilot AI left a comment


Pull request overview

This PR updates the eval-driven-dev skill to a more streamlined, wrap()-based workflow aligned with newer pixie-qa concepts, adding runnable/evaluator/dataset references and helper scripts to reduce manual setup.

Changes:

  • Updates the core skill workflow from @observe/manual mocking guidance to wrap() + Runnable + dataset-driven injection, with new step references and artifacts.
  • Adds resources/setup.sh and resources/stop-server.sh to automate environment/package setup and web UI lifecycle.
  • Replaces older reference docs with new auto-generated API references (wrap-api.md, testing-api.md, evaluators.md) and step-specific guides (Steps 1–6).

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| skills/eval-driven-dev/SKILL.md | Updates the skill’s workflow steps, checkpoints, and setup guidance for wrap()/Runnable-based evals. |
| skills/eval-driven-dev/resources/setup.sh | Adds a setup script to update the skill, install/upgrade pixie-qa, init pixie_qa/, and start the web UI server. |
| skills/eval-driven-dev/resources/stop-server.sh | Adds a stop script intended to kill the web UI server and clean up lock state. |
| skills/eval-driven-dev/resources/run-with-timeout.sh | Adds a generic helper to run a command in the background and auto-kill it after a timeout. |
| skills/eval-driven-dev/references/wrap-api.md | Adds auto-generated reference for wrap(), Runnable, and related CLI commands/errors. |
| skills/eval-driven-dev/references/testing-api.md | Adds auto-generated reference for dataset/test execution APIs and dataset JSON format. |
| skills/eval-driven-dev/references/evaluators.md | Adds auto-generated catalog of built-in evaluators and create_llm_evaluator guidance. |
| skills/eval-driven-dev/references/1-a-entry-point.md | Adds Step 1a reference for documenting app entry point and execution flow. |
| skills/eval-driven-dev/references/1-b-eval-criteria.md | Adds Step 1b reference for defining use cases and eval criteria. |
| skills/eval-driven-dev/references/2-wrap-and-trace.md | Adds Step 2 reference for adding wrap() points, implementing a Runnable, and capturing a reference trace. |
| skills/eval-driven-dev/references/3-define-evaluators.md | Adds Step 3 reference for mapping criteria to evaluators and producing an evaluator mapping artifact. |
| skills/eval-driven-dev/references/4-build-dataset.md | Adds Step 4 reference for building dataset JSON using entry_kwargs + test_case.eval_input + evaluator assignment. |
| skills/eval-driven-dev/references/5-run-tests.md | Adds Step 5 reference for running pixie test, fixing dataset/harness errors, and running pixie analyze. |
| skills/eval-driven-dev/references/6-investigate.md | Updates investigation guidance (including a new “stop/continue” gate and analysis-first workflow). |
| skills/eval-driven-dev/references/understanding-app.md | Removes the older Step 1 reference doc (superseded by new Step 1a/1b docs). |
| skills/eval-driven-dev/references/instrumentation.md | Removes the older @observe/enable_storage() instrumentation guidance (superseded by wrap() docs). |
| skills/eval-driven-dev/references/run-harness-patterns.md | Removes older run-harness examples (superseded by Runnable guidance). |
| skills/eval-driven-dev/references/pixie-api.md | Removes older pixie API reference (superseded by new auto-generated refs). |
| skills/eval-driven-dev/references/eval-tests.md | Removes older eval test writing reference (superseded by Step 5 + API refs). |
| skills/eval-driven-dev/references/dataset-generation.md | Removes older dataset generation reference (superseded by Step 4 + API refs). |
| docs/README.skills.md | Updates the skill listing to reflect the new wrap() workflow and reference/resource files. |
Comments suppressed due to low confidence (2)

skills/eval-driven-dev/references/6-investigate.md:40

  • Minor wording issue: “The the captured output” has an extra “the”. Suggest changing to “The captured output” (or “eval_output / captured output”) for clarity.

skills/eval-driven-dev/references/6-investigate.md:110

  • Minor wording issue in the example: “the captured output” has an extra “the”. Suggest changing to “captured output” (or “eval_output / captured output”) to avoid confusing readers.

**What you're testing is the app itself** — its request handling, context assembly (how it gathers data, builds prompts, manages conversation state), routing, and response formatting. The app uses an LLM, which makes outputs non-deterministic — that's why you use evaluators (LLM-as-judge, similarity scores) instead of `assertEqual` — but the thing under test is the app's code, not the LLM.

```diff
- **What's in scope**: the app's entire code path from entry point to response — never mock or skip any part of it. **What's out of scope**: external data sources the app reads from (databases, caches, third-party APIs, voice streams) — mock these to control inputs and reduce flakiness.
+ **What's in scope**: the app's entire code path from entry point to response — never mock or skip any part of it. **What's out of scope**: external data sources the app reads from (databases, caches, third-party APIs, voice streams) — these are handled automatically by `wrap(purpose="input")` at test time.
```

Copilot AI Apr 8, 2026


The statement that external data sources are “handled automatically by wrap(purpose="input") at test time” is a bit misleading: this only works for dependencies that have been explicitly wrapped and for which each dataset entry provides the corresponding test_case.eval_input items. Consider rewording to clarify that external dependencies must be wrapped (and populated via the dataset) for injection to happen; otherwise they still need traditional mocking or will be called for real.

Suggested change
```diff
- **What's in scope**: the app's entire code path from entry point to response — never mock or skip any part of it. **What's out of scope**: external data sources the app reads from (databases, caches, third-party APIs, voice streams) — these are handled automatically by `wrap(purpose="input")` at test time.
+ **What's in scope**: the app's entire code path from entry point to response — never mock or skip any part of it. **What's out of scope**: external data sources the app reads from (databases, caches, third-party APIs, voice streams) — these are only injected automatically at test time if the dependency has been explicitly wrapped with `wrap(purpose="input")` and each dataset entry provides the corresponding `test_case.eval_input` values. If an external dependency is not wrapped and populated this way, you still need traditional mocking or it will be called for real.
```

Comment on lines 148 to 162
Removed:

### Imports

```python
from pixie import enable_storage, observe, assert_dataset_pass, ScoreThreshold, last_llm_call
from pixie import FactualityEval, ClosedQAEval, create_llm_evaluator
```

Only `from pixie import ...` — never subpackages (`pixie.storage`, `pixie.evals`, etc.). There is no `pixie.qa` module.

### CLI commands

```bash
uv run pixie test pixie_qa/tests/ -v    # Run eval tests (NOT pytest)
uv run pixie trace list                 # List captured traces
uv run pixie trace last                 # Show most recent trace
uv run pixie trace show <id> --verbose  # Show specific trace
uv run pixie dataset create <name>      # Create a new dataset
```

### Directory layout

Added:

## Web Server Management

pixie-qa runs a web server in the background for displaying context, traces, and eval results to the user. It's automatically started by the setup script, and needs to be explicitly cleaned up when display is no longer needed.

When the user is done with the eval-driven-dev workflow, inform them the web server is still running and you can clean it up with the following command:

```bash
bash resources/stop-server.sh
```

And whenever you restart the workflow, always run the setup script again to ensure the web server is running:

```bash
bash resources/setup.sh
```

Copilot AI Apr 8, 2026


resources/run-with-timeout.sh is added to the skill but isn’t referenced anywhere in SKILL.md or the step reference docs. Either document when/how to use it (so the bundled asset is discoverable) or remove it to avoid carrying unused tooling in the skill bundle.

Comment on lines +6 to +19
```bash
PIXIE_ROOT="${PIXIE_ROOT:-pixie_qa}"
LOCK_FILE="${PIXIE_ROOT}/server.lock"

if [ ! -f "$LOCK_FILE" ]; then
  echo "No server.lock found at ${LOCK_FILE} — server may not be running."
  exit 0
fi

PORT=$(cat "$LOCK_FILE")
echo "Found server.lock with port ${PORT}"

# Find the process listening on that port and kill it
PID=$(lsof -ti "tcp:${PORT}" 2>/dev/null || ss -tlnp "sport = :${PORT}" 2>/dev/null | grep -oP 'pid=\K[0-9]+' || true)
```
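As an aside, the PCRE-only `grep -oP` fallback in the snippet above could be replaced with POSIX `sed`, which behaves the same on macOS and Linux. A sketch (`extract_pid` is an invented helper name, not part of the script):

```shell
# Portable replacement for `grep -oP 'pid=\K[0-9]+'`: pull the first
# pid=NNN out of ss output using a POSIX basic-regex capture group.
extract_pid() {
  sed -n 's/.*pid=\([0-9][0-9]*\).*/\1/p' | head -n 1
}
```

Usage would then be `PID=$(ss -tlnp "sport = :${PORT}" 2>/dev/null | extract_pid)`, with no dependence on GNU grep.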


Copilot AI Apr 8, 2026


stop-server.sh assumes ${PIXIE_ROOT}/server.lock exists and contains the server port, but nothing in this skill writes that file (and there are no other references to it in-repo). Unless pixie start itself creates server.lock, this script will be a no-op. Consider switching to a PID-file approach (write $! to ${PIXIE_ROOT}/server.pid in setup.sh and kill that PID here), or explicitly create/maintain server.lock in setup.sh.

Also, grep -oP (PCRE) is not portable across environments (e.g., macOS default grep), so the ss | grep -oP ... fallback is likely to fail on some platforms.
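The PID-file approach suggested here might look like the following sketch. Note that `server.pid` is an assumed filename, and setup.sh would also have to be changed to write it; nothing in the current PR does so.

```shell
# Sketch: stop the server via a PID file instead of parsing port listeners.
# Assumes setup.sh wrote the background PID to ${PIXIE_ROOT}/server.pid.
stop_pixie_server() {
  pid_file="${1:-pixie_qa/server.pid}"
  if [ ! -f "$pid_file" ]; then
    echo "No server.pid at ${pid_file}; server may not be running."
    return 0
  fi
  pid=$(cat "$pid_file")
  # kill -0 checks process existence without sending a signal (POSIX,
  # so it works the same on Linux and macOS, unlike lsof/ss/grep -P).
  if kill -0 "$pid" 2>/dev/null; then
    kill "$pid"
    echo "Stopped server (PID ${pid})."
  else
    echo "Stale server.pid (PID ${pid} not running)."
  fi
  rm -f "$pid_file"
}
```

This sidesteps both problems at once: no port parsing, and no GNU-only grep flags.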

Comment on lines +33 to +42
```bash
echo "=== Starting web UI server (background) ==="
PIXIE_ROOT="${PIXIE_ROOT:-pixie_qa}"
if [ -f uv.lock ]; then
  nohup uv run pixie start > "${PIXIE_ROOT}/server.log" 2>&1 &
elif [ -f poetry.lock ]; then
  nohup poetry run pixie start > "${PIXIE_ROOT}/server.log" 2>&1 &
else
  nohup pixie start > "${PIXIE_ROOT}/server.log" 2>&1 &
fi
echo "Web UI server started (PID $!, log: ${PIXIE_ROOT}/server.log)"
```

Copilot AI Apr 8, 2026


The setup script starts the web UI server in the background but doesn’t write any metadata that stop-server.sh can use to stop it (it currently looks for ${PIXIE_ROOT}/server.lock). Consider persisting the background PID (e.g., ${PIXIE_ROOT}/server.pid) and/or the chosen port so the stop script can reliably clean up the correct process across platforms/shell sessions.
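A minimal setup-side counterpart to the PID-file suggestion could look like this sketch. The function wrapper and the `server.pid` filename are assumptions for illustration, not part of pixie or the current scripts.

```shell
# Sketch: have setup.sh persist the background PID so a stop script can
# find and kill the right process later. server.pid is an assumed name.
start_server_background() {
  # "$@" is the server command; in the real script this would be one of
  # `uv run pixie start`, `poetry run pixie start`, or `pixie start`.
  root="${PIXIE_ROOT:-pixie_qa}"
  mkdir -p "$root"
  nohup "$@" > "${root}/server.log" 2>&1 &
  echo $! > "${root}/server.pid"   # metadata the stop script can rely on
  echo "Web UI server started (PID $(cat "${root}/server.pid"), log: ${root}/server.log)"
}
```

Persisting `$!` immediately after the `&` is the only reliable way to capture the background PID; reconstructing it later from the port is exactly the fragile step the review flags.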

Contributor

@utkarsh232005 utkarsh232005 left a comment


stop-server.sh is broken: it looks for a server.lock file that setup.sh never creates. Also, grep -oP isn't portable and will fail on macOS. Finally, SKILL.md implies data injection is "automatic," but it requires explicit wrap() calls and dataset setup. Please clarify this.

Contributor

@aaronpowell aaronpowell left a comment


Per the skills spec the scripts should be in the scripts folder, not resources.

That said, I don't think the scripts are a good idea because they aren't cross-OS compatible: if you want to use this Skill on Windows, a Bash script isn't going to work.


4 participants