
update eval-driven-dev skill#1333

Closed
yiouli wants to merge 0 commits into github:staged from yiouli:staged

Conversation

@yiouli
Contributor

@yiouli yiouli commented Apr 8, 2026

Pull Request Checklist

  • I have read and followed the CONTRIBUTING.md guidelines.
  • I have read and followed the Guidance for submissions involving paid services.
  • My contribution adds a new instruction, prompt, agent, skill, or workflow file in the correct directory.
  • The file follows the required naming convention.
  • The content is clearly structured and follows the example format.
  • I have tested my instructions, prompt, agent, skill, or workflow with GitHub Copilot.
  • I have run npm start and verified that README.md is up to date.
  • I am targeting the staged branch for this pull request.

Description

Update eval-driven-dev skill with streamlined workflow.

Shorten the workflow steps and reduce the amount of setup required.


Type of Contribution

  • New instruction file.
  • New prompt file.
  • New agent file.
  • New plugin.
  • New skill file.
  • New agentic workflow.
  • Update to existing instruction, prompt, agent, plugin, skill, or workflow.
  • Other (please specify):

Additional Notes


By submitting this pull request, I confirm that my contribution abides by the Code of Conduct and will be licensed under the MIT License.

@yiouli yiouli requested a review from aaronpowell as a code owner April 8, 2026 15:21
Copilot AI review requested due to automatic review settings April 8, 2026 15:21
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

🔍 Skill Validator Results

2 resource(s) checked | ✅ All checks passed

Full output
Found 1 skill(s)
[eval-driven-dev] 📊 eval-driven-dev: 2,484 BPE tokens [chars/4: 2,730] (detailed ✓), 12 sections, 3 code blocks
[eval-driven-dev]    ⚠  No numbered workflow steps — agents follow sequenced procedures more reliably.
✅ All checks passed (1 skill(s))

Contributor

Copilot AI left a comment


Pull request overview

This PR updates the eval-driven-dev skill to a more streamlined, wrap()-based workflow aligned with newer pixie-qa concepts, adding runnable/evaluator/dataset references and helper scripts to reduce manual setup.

Changes:

  • Updates the core skill workflow from @observe/manual mocking guidance to wrap() + Runnable + dataset-driven injection, with new step references and artifacts.
  • Adds resources/setup.sh and resources/stop-server.sh to automate environment/package setup and web UI lifecycle.
  • Replaces older reference docs with new auto-generated API references (wrap-api.md, testing-api.md, evaluators.md) and step-specific guides (Steps 1–6).

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| skills/eval-driven-dev/SKILL.md | Updates the skill’s workflow steps, checkpoints, and setup guidance for wrap()/Runnable-based evals. |
| skills/eval-driven-dev/resources/setup.sh | Adds a setup script to update the skill, install/upgrade pixie-qa, init pixie_qa/, and start the web UI server. |
| skills/eval-driven-dev/resources/stop-server.sh | Adds a stop script intended to kill the web UI server and clean up lock state. |
| skills/eval-driven-dev/resources/run-with-timeout.sh | Adds a generic helper to run a command in the background and auto-kill it after a timeout. |
| skills/eval-driven-dev/references/wrap-api.md | Adds auto-generated reference for wrap(), Runnable, and related CLI commands/errors. |
| skills/eval-driven-dev/references/testing-api.md | Adds auto-generated reference for dataset/test execution APIs and dataset JSON format. |
| skills/eval-driven-dev/references/evaluators.md | Adds auto-generated catalog of built-in evaluators and create_llm_evaluator guidance. |
| skills/eval-driven-dev/references/1-a-entry-point.md | Adds Step 1a reference for documenting app entry point and execution flow. |
| skills/eval-driven-dev/references/1-b-eval-criteria.md | Adds Step 1b reference for defining use cases and eval criteria. |
| skills/eval-driven-dev/references/2-wrap-and-trace.md | Adds Step 2 reference for adding wrap() points, implementing a Runnable, and capturing a reference trace. |
| skills/eval-driven-dev/references/3-define-evaluators.md | Adds Step 3 reference for mapping criteria to evaluators and producing an evaluator mapping artifact. |
| skills/eval-driven-dev/references/4-build-dataset.md | Adds Step 4 reference for building dataset JSON using entry_kwargs + test_case.eval_input + evaluator assignment. |
| skills/eval-driven-dev/references/5-run-tests.md | Adds Step 5 reference for running pixie test, fixing dataset/harness errors, and running pixie analyze. |
| skills/eval-driven-dev/references/6-investigate.md | Updates investigation guidance (including a new “stop/continue” gate and analysis-first workflow). |
| skills/eval-driven-dev/references/understanding-app.md | Removes the older Step 1 reference doc (superseded by new Step 1a/1b docs). |
| skills/eval-driven-dev/references/instrumentation.md | Removes the older @observe/enable_storage() instrumentation guidance (superseded by wrap() docs). |
| skills/eval-driven-dev/references/run-harness-patterns.md | Removes older run-harness examples (superseded by Runnable guidance). |
| skills/eval-driven-dev/references/pixie-api.md | Removes older pixie API reference (superseded by new auto-generated refs). |
| skills/eval-driven-dev/references/eval-tests.md | Removes older eval test writing reference (superseded by Step 5 + API refs). |
| skills/eval-driven-dev/references/dataset-generation.md | Removes older dataset generation reference (superseded by Step 4 + API refs). |
| docs/README.skills.md | Updates the skill listing to reflect the new wrap() workflow and reference/resource files. |
Comments suppressed due to low confidence (2)

skills/eval-driven-dev/references/6-investigate.md:40

  • Minor wording issue: “The the captured output” has an extra “the”. Suggest changing to “The captured output” (or “eval_output / captured output”) for clarity.

skills/eval-driven-dev/references/6-investigate.md:110

  • Minor wording issue in the example: “the captured output” has an extra “the”. Suggest changing to “captured output” (or “eval_output / captured output”) to avoid confusing readers.

**What you're testing is the app itself** — its request handling, context assembly (how it gathers data, builds prompts, manages conversation state), routing, and response formatting. The app uses an LLM, which makes outputs non-deterministic — that's why you use evaluators (LLM-as-judge, similarity scores) instead of `assertEqual` — but the thing under test is the app's code, not the LLM.

```diff
- **What's in scope**: the app's entire code path from entry point to response — never mock or skip any part of it. **What's out of scope**: external data sources the app reads from (databases, caches, third-party APIs, voice streams) — mock these to control inputs and reduce flakiness.
+ **What's in scope**: the app's entire code path from entry point to response — never mock or skip any part of it. **What's out of scope**: external data sources the app reads from (databases, caches, third-party APIs, voice streams) — these are handled automatically by `wrap(purpose="input")` at test time.
```

Copilot AI Apr 8, 2026


The statement that external data sources are “handled automatically by wrap(purpose="input") at test time” is a bit misleading: this only works for dependencies that have been explicitly wrapped and for which each dataset entry provides the corresponding test_case.eval_input items. Consider rewording to clarify that external dependencies must be wrapped (and populated via the dataset) for injection to happen; otherwise they still need traditional mocking or will be called for real.

Suggested change
```diff
- **What's in scope**: the app's entire code path from entry point to response — never mock or skip any part of it. **What's out of scope**: external data sources the app reads from (databases, caches, third-party APIs, voice streams) — these are handled automatically by `wrap(purpose="input")` at test time.
+ **What's in scope**: the app's entire code path from entry point to response — never mock or skip any part of it. **What's out of scope**: external data sources the app reads from (databases, caches, third-party APIs, voice streams) — these are only injected automatically at test time if the dependency has been explicitly wrapped with `wrap(purpose="input")` and each dataset entry provides the corresponding `test_case.eval_input` values. If an external dependency is not wrapped and populated this way, you still need traditional mocking or it will be called for real.
```

Comment on lines 148 to 162
Removed:

### Imports

```python
from pixie import enable_storage, observe, assert_dataset_pass, ScoreThreshold, last_llm_call
from pixie import FactualityEval, ClosedQAEval, create_llm_evaluator
```

Only `from pixie import ...` — never subpackages (`pixie.storage`, `pixie.evals`, etc.). There is no `pixie.qa` module.

### CLI commands

```bash
uv run pixie test pixie_qa/tests/ -v    # Run eval tests (NOT pytest)
uv run pixie trace list                 # List captured traces
uv run pixie trace last                 # Show most recent trace
uv run pixie trace show <id> --verbose  # Show specific trace
uv run pixie dataset create <name>      # Create a new dataset
```

### Directory layout

Added:

## Web Server Management

pixie-qa runs a web server in the background for displaying context, traces, and eval results to the user. It's automatically started by the setup script, and needs to be explicitly cleaned up when display is no longer needed.

When the user is done with the eval-driven-dev workflow, inform them the web server is still running and you can clean it up with the following command:

```bash
bash resources/stop-server.sh
```

And whenever you restart the workflow, always run the setup script again to ensure the web server is running:

```bash
bash resources/setup.sh
```

Copilot AI Apr 8, 2026


resources/run-with-timeout.sh is added to the skill but isn’t referenced anywhere in SKILL.md or the step reference docs. Either document when/how to use it (so the bundled asset is discoverable) or remove it to avoid carrying unused tooling in the skill bundle.

Comment on lines +6 to +19
```bash
PIXIE_ROOT="${PIXIE_ROOT:-pixie_qa}"
LOCK_FILE="${PIXIE_ROOT}/server.lock"

if [ ! -f "$LOCK_FILE" ]; then
  echo "No server.lock found at ${LOCK_FILE} — server may not be running."
  exit 0
fi

PORT=$(cat "$LOCK_FILE")
echo "Found server.lock with port ${PORT}"

# Find the process listening on that port and kill it
PID=$(lsof -ti "tcp:${PORT}" 2>/dev/null || ss -tlnp "sport = :${PORT}" 2>/dev/null | grep -oP 'pid=\K[0-9]+' || true)
```
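As an aside, the PCRE-only `grep -oP` fallback in the snippet above could be replaced with POSIX `sed`, which behaves the same on macOS and Linux. A sketch (`extract_pid` is an invented helper name, not part of the script):

```shell
# Portable replacement for `grep -oP 'pid=\K[0-9]+'`: pull the first
# pid=NNN out of ss output using a POSIX basic-regex capture group.
extract_pid() {
  sed -n 's/.*pid=\([0-9][0-9]*\).*/\1/p' | head -n 1
}
```

Usage would then be `PID=$(ss -tlnp "sport = :${PORT}" 2>/dev/null | extract_pid)`, with no dependence on GNU grep.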


Copilot AI Apr 8, 2026


stop-server.sh assumes ${PIXIE_ROOT}/server.lock exists and contains the server port, but nothing in this skill writes that file (and there are no other references to it in-repo). Unless pixie start itself creates server.lock, this script will be a no-op. Consider switching to a PID-file approach (write $! to ${PIXIE_ROOT}/server.pid in setup.sh and kill that PID here), or explicitly create/maintain server.lock in setup.sh.

Also, grep -oP (PCRE) is not portable across environments (e.g., macOS default grep), so the ss | grep -oP ... fallback is likely to fail on some platforms.
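The PID-file approach suggested here might look like the following sketch. Note that `server.pid` is an assumed filename, and setup.sh would also have to be changed to write it; nothing in the current PR does so.

```shell
# Sketch: stop the server via a PID file instead of parsing port listeners.
# Assumes setup.sh wrote the background PID to ${PIXIE_ROOT}/server.pid.
stop_pixie_server() {
  pid_file="${1:-pixie_qa/server.pid}"
  if [ ! -f "$pid_file" ]; then
    echo "No server.pid at ${pid_file}; server may not be running."
    return 0
  fi
  pid=$(cat "$pid_file")
  # kill -0 checks process existence without sending a signal (POSIX,
  # so it works the same on Linux and macOS, unlike lsof/ss/grep -P).
  if kill -0 "$pid" 2>/dev/null; then
    kill "$pid"
    echo "Stopped server (PID ${pid})."
  else
    echo "Stale server.pid (PID ${pid} not running)."
  fi
  rm -f "$pid_file"
}
```

This sidesteps both problems at once: no port parsing, and no GNU-only grep flags.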

Comment on lines +33 to +42
```bash
echo "=== Starting web UI server (background) ==="
PIXIE_ROOT="${PIXIE_ROOT:-pixie_qa}"
if [ -f uv.lock ]; then
  nohup uv run pixie start > "${PIXIE_ROOT}/server.log" 2>&1 &
elif [ -f poetry.lock ]; then
  nohup poetry run pixie start > "${PIXIE_ROOT}/server.log" 2>&1 &
else
  nohup pixie start > "${PIXIE_ROOT}/server.log" 2>&1 &
fi
echo "Web UI server started (PID $!, log: ${PIXIE_ROOT}/server.log)"
```

Copilot AI Apr 8, 2026


The setup script starts the web UI server in the background but doesn’t write any metadata that stop-server.sh can use to stop it (it currently looks for ${PIXIE_ROOT}/server.lock). Consider persisting the background PID (e.g., ${PIXIE_ROOT}/server.pid) and/or the chosen port so the stop script can reliably clean up the correct process across platforms/shell sessions.
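A minimal setup-side counterpart to the PID-file suggestion could look like this sketch. The function wrapper and the `server.pid` filename are assumptions for illustration, not part of pixie or the current scripts.

```shell
# Sketch: have setup.sh persist the background PID so a stop script can
# find and kill the right process later. server.pid is an assumed name.
start_server_background() {
  # "$@" is the server command; in the real script this would be one of
  # `uv run pixie start`, `poetry run pixie start`, or `pixie start`.
  root="${PIXIE_ROOT:-pixie_qa}"
  mkdir -p "$root"
  nohup "$@" > "${root}/server.log" 2>&1 &
  echo $! > "${root}/server.pid"   # metadata the stop script can rely on
  echo "Web UI server started (PID $(cat "${root}/server.pid"), log: ${root}/server.log)"
}
```

Persisting `$!` immediately after the `&` is the only reliable way to capture the background PID; reconstructing it later from the port is exactly the fragile step the review flags.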

Contributor

@utkarsh232005 utkarsh232005 left a comment


stop-server.sh is broken: it looks for a server.lock file that setup.sh never creates. Also, grep -oP isn't portable and will fail on macOS. Finally, SKILL.md implies data injection is "automatic," but it requires explicit wrap() calls and dataset setup. Please clarify this.

Contributor

@aaronpowell aaronpowell left a comment


Per the skills spec the scripts should be in the scripts folder, not resources.

That said, I don't think the scripts are a good idea because they aren't cross-OS compatible: if you want to use this Skill on Windows, a Bash script isn't going to work.


4 participants