fix: benchmark fixtures unusable — wrong format and missing is_attack labels (issue #48) by ksek87 · Pull Request #49 · ksek87/fuzzd

ksek87 · 2026-05-25T16:24:59Z

Closes #48

Summary

- Converts all three `bench/` fixture files from wrapped objects `{"_meta":…,"tools":[…]}` to flat JSON arrays, the format `fuzzd benchmark` expects, with one tool per line
- Adds `"is_attack": true` to each tool in `mcptox_representative.json` and `mcptox_actual.json` (these are all attack samples)
- Adds `"is_attack": false` to each tool in `clean_tools.json`
- Adds five regression tests in `main.rs` that parse the fixtures at test time and assert: representative and actual fixtures have attack labels, clean has none, recall is 1.0 against the representative set, and combined precision stays ≥ 0.90

Why it was never caught

Two failure modes stacked, and both went unnoticed:

- The benchmark command expects a flat array but the fixtures were wrapped objects. That is a parse error at runtime, but no test ever ran `fuzzd benchmark --schema bench/*.json`, so nothing surfaced it.
- `LabelledTool.meta.is_attack` has `#[serde(default)]`, so a missing field silently becomes `false`: no error, just wrong numbers.
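
The second failure mode can be modelled without serde. The sketch below is a minimal, std-only illustration and not the code in `main.rs`: `Sample`, `Confusion`, `score_lenient`, and `score_strict` are hypothetical names, and `Option<bool>` stands in for an `is_attack` field that may be absent from a fixture. It shows how defaulting a missing label to `false` turns every detection into a false positive, and how a strict variant could reject the fixture instead.

```rust
// Hypothetical, std-only model of the label-defaulting failure mode.
// None of these names come from fuzzd; they exist only for illustration.

#[derive(Debug, Default, PartialEq)]
struct Confusion {
    tp: u32,
    fp: u32,
    tn: u32,
    fn_: u32,
}

/// One scanned tool: the label as found in the fixture (None = field missing)
/// and whether the scanner flagged it.
struct Sample {
    label: Option<bool>,
    flagged: bool,
}

/// Lenient scoring: a missing label silently becomes `false`,
/// which is the effect `#[serde(default)]` has on a `bool` field.
fn score_lenient(samples: &[Sample]) -> Confusion {
    let mut c = Confusion::default();
    for s in samples {
        let is_attack = s.label.unwrap_or(false);
        match (is_attack, s.flagged) {
            (true, true) => c.tp += 1,
            (false, true) => c.fp += 1,
            (false, false) => c.tn += 1,
            (true, false) => c.fn_ += 1,
        }
    }
    c
}

/// Strict scoring: a missing label is an error that names the offending index.
fn score_strict(samples: &[Sample]) -> Result<Confusion, String> {
    if let Some(i) = samples.iter().position(|s| s.label.is_none()) {
        return Err(format!("sample {} has no is_attack label", i));
    }
    Ok(score_lenient(samples))
}

fn main() {
    // Three attack tools, all flagged by the scanner, but the labels are missing.
    let unlabelled: Vec<Sample> = (0..3)
        .map(|_| Sample { label: None, flagged: true })
        .collect();

    // Lenient scoring counts all three detections as false positives.
    println!("lenient: {:?}", score_lenient(&unlabelled));
    // Strict scoring refuses to produce numbers at all.
    println!("strict:  {:?}", score_strict(&unlabelled));
}
```

The regression tests in this PR take the other route: they keep the default and assert on the parsed labels, which catches the same gap at test time.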

Test plan

- `fuzzd benchmark --schema bench/mcptox_representative.json` → Precision: 1.000, Recall: 1.000, F1: 1.000
- `fuzzd benchmark --schema bench/clean_tools.json` → 20 true negatives, 0 false positives
- `cargo test benchmark_fixture_tests` → 5 tests pass
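
For reference, the figures above follow from the usual confusion-matrix definitions. The block below is a minimal sketch under stated assumptions: `Counts` and the four functions are hypothetical names, and every count except the 20 true negatives reported for `clean_tools.json` is made up for illustration. fuzzd's own implementation may differ.

```rust
// Hypothetical helper types; not taken from fuzzd.
#[derive(Debug, Clone, Copy)]
struct Counts {
    tp: u32,
    fp: u32,
    tn: u32,
    fn_: u32,
}

/// Precision = TP / (TP + FP); undefined (None) when nothing was flagged.
fn precision(c: Counts) -> Option<f64> {
    let denom = c.tp + c.fp;
    if denom == 0 { None } else { Some(c.tp as f64 / denom as f64) }
}

/// Recall = TP / (TP + FN); undefined (None) when there are no attack samples.
fn recall(c: Counts) -> Option<f64> {
    let denom = c.tp + c.fn_;
    if denom == 0 { None } else { Some(c.tp as f64 / denom as f64) }
}

/// F1 = harmonic mean of precision and recall.
fn f1(c: Counts) -> Option<f64> {
    let (p, r) = (precision(c)?, recall(c)?);
    if p + r == 0.0 { None } else { Some(2.0 * p * r / (p + r)) }
}

/// False positive rate = FP / (FP + TN); the relevant figure for a benign-only set.
fn false_positive_rate(c: Counts) -> Option<f64> {
    let denom = c.fp + c.tn;
    if denom == 0 { None } else { Some(c.fp as f64 / denom as f64) }
}

fn main() {
    // Attack-only fixture where every tool is flagged (the count of 10 is made up).
    let attack = Counts { tp: 10, fp: 0, tn: 0, fn_: 0 };
    println!(
        "Precision: {:.3}, Recall: {:.3}, F1: {:.3}",
        precision(attack).unwrap(),
        recall(attack).unwrap(),
        f1(attack).unwrap()
    );

    // Benign-only fixture: 20 true negatives, 0 false positives.
    let clean = Counts { tp: 0, fp: 0, tn: 20, fn_: 0 };
    println!("FPR on clean set: {:?}", false_positive_rate(clean));

    // Combined run with one made-up false positive, checked against the 0.90 bound.
    let combined = Counts { tp: 10, fp: 1, tn: 19, fn_: 0 };
    assert!(precision(combined).unwrap() >= 0.90);
}
```

Precision is undefined on a benign-only set when nothing is flagged, so a combined attack+clean run is the natural place for a precision bound.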

https://claude.ai/code/session_01G4f8mN9SeSHSGY1dWfFzih

… labels (issue #48)

Fixtures were wrapped objects `{"_meta":…,"tools":[…]}` and lacked is_attack labels on each tool, making `fuzzd benchmark --schema bench/*.json` either crash with a parse error or silently report garbage results (all detections counted as false positives because is_attack defaulted to false).

- Convert all three bench fixtures to flat JSON arrays
- Add is_attack:true to each tool in mcptox_representative/actual (attack corpus)
- Add is_attack:false to each tool in clean_tools (benign corpus)
- Add three regression tests in main.rs that parse the fixture files directly and assert: representative has attack labels, clean has none, and recall is 1.0 against the representative set, so this format gap cannot regress silently

https://claude.ai/code/session_01G4f8mN9SeSHSGY1dWfFzih

…ecision bound

- actual_fixture_parses_and_has_attack_labels: verifies mcptox_actual.json (485 tools) parses correctly and every entry has is_attack=true; previously the largest fixture had zero test coverage
- combined_benchmark_precision_within_bounds: runs the full attack+clean benchmark and asserts precision >= 0.90, locking in the current FP count so a regression that adds new false positives is caught immediately

https://claude.ai/code/session_01G4f8mN9SeSHSGY1dWfFzih

…(v0.9 done)

- README.md: mark v0.8 and v0.9 as Done in roadmap; remove their now-stale milestone detail sections; update signal table from 13 to 21 entries, adding the 8 new v0.9 signals; update architecture diagram to 21 variants / 155 AC patterns; note inputSchema scanning in the scanner description
- bench/README.md: update signal distribution header to 21 signals / 155 patterns; replace stale coverage gap notes for #34/#35 with a short done-status note; add the 8 new signals to the signal table; fix the "Adding to the benchmark" _meta example to use the new is_attack:true format instead of the old taxonomy fields

https://claude.ai/code/session_01G4f8mN9SeSHSGY1dWfFzih

ksek87 force-pushed the claude/roadmap-ticket-planning-GEbPu branch from 64d3d06 to 649932e on May 25, 2026 18:42

claude added 2 commits May 25, 2026 18:50

ksek87 merged commit 6ec3baa into main May 25, 2026
8 checks passed

ksek87 merged 3 commits into main from claude/roadmap-ticket-planning-GEbPu

2 participants
