Conversation
🔍 Skill Validator Results — 2 resource(s) checked | ✅ All checks passed
Pull request overview
This PR updates the eval-driven-dev skill to a more streamlined, wrap()-based workflow aligned with newer pixie-qa concepts, adding runnable/evaluator/dataset references and helper scripts to reduce manual setup.
Changes:
- Updates the core skill workflow from `@observe`/manual-mocking guidance to `wrap()` + `Runnable` + dataset-driven injection, with new step references and artifacts.
- Adds `resources/setup.sh` and `resources/stop-server.sh` to automate environment/package setup and the web UI lifecycle.
- Replaces older reference docs with new auto-generated API references (`wrap-api.md`, `testing-api.md`, `evaluators.md`) and step-specific guides (Steps 1–6).
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| skills/eval-driven-dev/SKILL.md | Updates the skill’s workflow steps, checkpoints, and setup guidance for wrap()/Runnable-based evals. |
| skills/eval-driven-dev/resources/setup.sh | Adds a setup script to update the skill, install/upgrade pixie-qa, init pixie_qa/, and start the web UI server. |
| skills/eval-driven-dev/resources/stop-server.sh | Adds a stop script intended to kill the web UI server and clean up lock state. |
| skills/eval-driven-dev/resources/run-with-timeout.sh | Adds a generic helper to run a command in the background and auto-kill it after a timeout. |
| skills/eval-driven-dev/references/wrap-api.md | Adds auto-generated reference for wrap(), Runnable, and related CLI commands/errors. |
| skills/eval-driven-dev/references/testing-api.md | Adds auto-generated reference for dataset/test execution APIs and dataset JSON format. |
| skills/eval-driven-dev/references/evaluators.md | Adds auto-generated catalog of built-in evaluators and create_llm_evaluator guidance. |
| skills/eval-driven-dev/references/1-a-entry-point.md | Adds Step 1a reference for documenting app entry point and execution flow. |
| skills/eval-driven-dev/references/1-b-eval-criteria.md | Adds Step 1b reference for defining use cases and eval criteria. |
| skills/eval-driven-dev/references/2-wrap-and-trace.md | Adds Step 2 reference for adding wrap() points, implementing a Runnable, and capturing a reference trace. |
| skills/eval-driven-dev/references/3-define-evaluators.md | Adds Step 3 reference for mapping criteria to evaluators and producing an evaluator mapping artifact. |
| skills/eval-driven-dev/references/4-build-dataset.md | Adds Step 4 reference for building dataset JSON using entry_kwargs + test_case.eval_input + evaluator assignment. |
| skills/eval-driven-dev/references/5-run-tests.md | Adds Step 5 reference for running pixie test, fixing dataset/harness errors, and running pixie analyze. |
| skills/eval-driven-dev/references/6-investigate.md | Updates investigation guidance (including a new “stop/continue” gate and analysis-first workflow). |
| skills/eval-driven-dev/references/understanding-app.md | Removes the older Step 1 reference doc (superseded by new Step 1a/1b docs). |
| skills/eval-driven-dev/references/instrumentation.md | Removes the older @observe/enable_storage() instrumentation guidance (superseded by wrap() docs). |
| skills/eval-driven-dev/references/run-harness-patterns.md | Removes older run-harness examples (superseded by Runnable guidance). |
| skills/eval-driven-dev/references/pixie-api.md | Removes older pixie API reference (superseded by new auto-generated refs). |
| skills/eval-driven-dev/references/eval-tests.md | Removes older eval test writing reference (superseded by Step 5 + API refs). |
| skills/eval-driven-dev/references/dataset-generation.md | Removes older dataset generation reference (superseded by Step 4 + API refs). |
| docs/README.skills.md | Updates the skill listing to reflect the new wrap() workflow and reference/resource files. |
Comments suppressed due to low confidence (2)
skills/eval-driven-dev/references/6-investigate.md:40
- Minor wording issue: “The the captured output” has an extra “the”. Suggest changing to “The captured output” (or “eval_output/ captured output”) for clarity.

skills/eval-driven-dev/references/6-investigate.md:110
- Minor wording issue in the example: “the captured output” has an extra “the”. Suggest changing to “captured output” (or “eval_output / captured output”) to avoid confusing readers.
skills/eval-driven-dev/SKILL.md
Outdated
```diff
  | **What you're testing is the app itself** — its request handling, context assembly (how it gathers data, builds prompts, manages conversation state), routing, and response formatting. The app uses an LLM, which makes outputs non-deterministic — that's why you use evaluators (LLM-as-judge, similarity scores) instead of `assertEqual` — but the thing under test is the app's code, not the LLM. |
- | **What's in scope**: the app's entire code path from entry point to response — never mock or skip any part of it. **What's out of scope**: external data sources the app reads from (databases, caches, third-party APIs, voice streams) — mock these to control inputs and reduce flakiness. |
+ | **What's in scope**: the app's entire code path from entry point to response — never mock or skip any part of it. **What's out of scope**: external data sources the app reads from (databases, caches, third-party APIs, voice streams) — these are handled automatically by `wrap(purpose="input")` at test time. |
```
The statement that external data sources are “handled automatically by wrap(purpose="input") at test time” is a bit misleading: this only works for dependencies that have been explicitly wrapped and for which each dataset entry provides the corresponding test_case.eval_input items. Consider rewording to clarify that external dependencies must be wrapped (and populated via the dataset) for injection to happen; otherwise they still need traditional mocking or will be called for real.
Suggested change:

```diff
- | **What's in scope**: the app's entire code path from entry point to response — never mock or skip any part of it. **What's out of scope**: external data sources the app reads from (databases, caches, third-party APIs, voice streams) — these are handled automatically by `wrap(purpose="input")` at test time. |
+ | **What's in scope**: the app's entire code path from entry point to response — never mock or skip any part of it. **What's out of scope**: external data sources the app reads from (databases, caches, third-party APIs, voice streams) — these are only injected automatically at test time if the dependency has been explicitly wrapped with `wrap(purpose="input")` and each dataset entry provides the corresponding `test_case.eval_input` values. If an external dependency is not wrapped and populated this way, you still need traditional mocking or it will be called for real. |
```
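To make the injection requirement concrete, here is a hypothetical sketch of a dataset entry. The names `entry_kwargs` and `test_case.eval_input` come from the Step 4 reference in this PR; the exact JSON shape and the `orders_db.lookup` dependency are illustrative assumptions, not the documented pixie-qa format:

```json
{
  "test_cases": [
    {
      "entry_kwargs": { "user_message": "What is my order status?" },
      "eval_input": {
        "orders_db.lookup": { "order_id": "A-123", "status": "shipped" }
      },
      "evaluators": ["FactualityEval"]
    }
  ]
}
```

Under this sketch, only a dependency wrapped with `wrap(purpose="input")` and keyed in `eval_input` receives the injected value; an unwrapped dependency would still be called for real.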
skills/eval-driven-dev/SKILL.md
Outdated
````diff
-### Imports
-
-```python
-from pixie import enable_storage, observe, assert_dataset_pass, ScoreThreshold, last_llm_call
-from pixie import FactualityEval, ClosedQAEval, create_llm_evaluator
-```
-
-Only `from pixie import ...` — never subpackages (`pixie.storage`, `pixie.evals`, etc.). There is no `pixie.qa` module.
-
-### CLI commands
-
-```bash
-uv run pixie test pixie_qa/tests/ -v   # Run eval tests (NOT pytest)
-uv run pixie trace list                # List captured traces
-uv run pixie trace last                # Show most recent trace
-uv run pixie trace show <id> --verbose # Show specific trace
-uv run pixie dataset create <name>     # Create a new dataset
-```
-
-### Directory layout
+## Web Server Management
+
+pixie-qa runs a web server in the background for displaying context, traces, and eval results to the user. It's automatically started by the setup script, and needs to be explicitly cleaned up when display is no longer needed.
+
+When the user is done with the eval-driven-dev workflow, inform them the web server is still running and you can clean it up with the following command:
+
+```bash
+bash resources/stop-server.sh
+```
+
+And whenever you restart the workflow, always run the setup script again to ensure the web server is running:
+
+```bash
+bash resources/setup.sh
+```
````
resources/run-with-timeout.sh is added to the skill but isn’t referenced anywhere in SKILL.md or the step reference docs. Either document when/how to use it (so the bundled asset is discoverable) or remove it to avoid carrying unused tooling in the skill bundle.
skills/eval-driven-dev/resources/stop-server.sh

```bash
PIXIE_ROOT="${PIXIE_ROOT:-pixie_qa}"
LOCK_FILE="${PIXIE_ROOT}/server.lock"

if [ ! -f "$LOCK_FILE" ]; then
  echo "No server.lock found at ${LOCK_FILE} — server may not be running."
  exit 0
fi

PORT=$(cat "$LOCK_FILE")
echo "Found server.lock with port ${PORT}"

# Find the process listening on that port and kill it
PID=$(lsof -ti "tcp:${PORT}" 2>/dev/null || ss -tlnp "sport = :${PORT}" 2>/dev/null | grep -oP 'pid=\K[0-9]+' || true)
```
stop-server.sh assumes ${PIXIE_ROOT}/server.lock exists and contains the server port, but nothing in this skill writes that file (and there are no other references to it in-repo). Unless pixie start itself creates server.lock, this script will be a no-op. Consider switching to a PID-file approach (write $! to ${PIXIE_ROOT}/server.pid in setup.sh and kill that PID here), or explicitly create/maintain server.lock in setup.sh.
Also, grep -oP (PCRE) is not portable across environments (e.g., macOS default grep), so the ss | grep -oP ... fallback is likely to fail on some platforms.
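A portability-minded sketch of that port lookup, avoiding GNU-only `grep -oP` by falling back to POSIX awk (`match()`/`RSTART`/`RLENGTH`). The port value is a placeholder; the real script would read it from its lock/metadata file:

```shell
PORT="${PORT:-65533}"   # placeholder; stop-server.sh reads this from server.lock
# lsof works on macOS and most Linux distros; suppress errors if absent
PID=$(lsof -ti "tcp:${PORT}" 2>/dev/null || true)
if [ -z "$PID" ]; then
  # POSIX awk fallback for Linux hosts without lsof: extract pid=NNN from ss output
  PID=$(ss -tlnp 2>/dev/null | awk -v port=":${PORT}" \
    '$4 ~ port { if (match($0, /pid=[0-9]+/)) print substr($0, RSTART + 4, RLENGTH - 4); exit }')
fi
if [ -z "$PID" ]; then
  echo "no process found on port ${PORT}"
else
  echo "would kill PID ${PID}"
fi
```

This keeps the fast path (`lsof -t` prints bare PIDs) while making the fallback rely only on tools and awk features guaranteed by POSIX.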
skills/eval-driven-dev/resources/setup.sh

```bash
echo "=== Starting web UI server (background) ==="
PIXIE_ROOT="${PIXIE_ROOT:-pixie_qa}"
if [ -f uv.lock ]; then
  nohup uv run pixie start > "${PIXIE_ROOT}/server.log" 2>&1 &
elif [ -f poetry.lock ]; then
  nohup poetry run pixie start > "${PIXIE_ROOT}/server.log" 2>&1 &
else
  nohup pixie start > "${PIXIE_ROOT}/server.log" 2>&1 &
fi
echo "Web UI server started (PID $!, log: ${PIXIE_ROOT}/server.log)"
```
The setup script starts the web UI server in the background but doesn’t write any metadata that stop-server.sh can use to stop it (it currently looks for ${PIXIE_ROOT}/server.lock). Consider persisting the background PID (e.g., ${PIXIE_ROOT}/server.pid) and/or the chosen port so the stop script can reliably clean up the correct process across platforms/shell sessions.
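One way to implement the reviewers' suggestion, sketched under the assumption that a `server.pid` file is acceptable skill-local state (it is not an existing pixie-qa convention). A long-running `sleep` stands in for `pixie start` so the sketch is self-contained:

```shell
PIXIE_ROOT="${PIXIE_ROOT:-pixie_qa}"
mkdir -p "$PIXIE_ROOT"

# setup.sh side: launch in the background and persist the child PID
nohup sleep 300 > "${PIXIE_ROOT}/server.log" 2>&1 &   # stand-in for `pixie start`
echo $! > "${PIXIE_ROOT}/server.pid"
echo "Web UI server started (PID $(cat "${PIXIE_ROOT}/server.pid"))"

# stop-server.sh side: kill the recorded PID, no port parsing needed
if [ -f "${PIXIE_ROOT}/server.pid" ]; then
  kill "$(cat "${PIXIE_ROOT}/server.pid")" 2>/dev/null || true
  rm -f "${PIXIE_ROOT}/server.pid"
  echo "server stopped"
fi
```

A PID file sidesteps both problems the reviewers raised: it works even if `pixie start` never writes `server.lock`, and it needs neither `lsof` nor `grep -oP`.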
utkarsh232005
left a comment
stop-server.sh is broken: it looks for a server.lock file that setup.sh never creates. Also, grep -oP isn't portable and will fail on macOS. Finally, SKILL.md implies data injection is "automatic," but it actually requires explicit wrap() calls and dataset setup. Please clarify this.
aaronpowell
left a comment
Per the skills spec, the scripts should be in the scripts folder, not resources.
That said, I don't think the scripts are a good idea, because they are not cross-OS compatible: if you want to use this Skill on Windows, a Bash script isn't going to work.
Pull Request Checklist

- Ran `npm start` and verified that `README.md` is up to date.
- Targeted the `staged` branch for this pull request.

Description
Update eval-driven-dev skill with streamlined workflow.
Shorten the workflow steps and reduce the amount of setup required.
Type of Contribution
Additional Notes
By submitting this pull request, I confirm that my contribution abides by the Code of Conduct and will be licensed under the MIT License.