fix: ABC audit fixes for MCP-unique suites — test.sh and R.2 false positive

sjarmak · claude · sjarmak · commit b9fee1614cf1 · 2026-02-28T22:19:07.000Z
- Add tests/test.sh, the Harbor compatibility wrapper, to the 4 pre-existing
  MCP-unique tasks that lacked it: ccx-compliance-057, ccx-platform-091,
  ccx-onboard-042, ccx-onboard-050
- Fix abc_audit.py to skip the R.2 check for ccb_mcp_* suites; an
  instruction.md that references Sourcegraph MCP tools is correct behavior
  for MCP-unique tasks, not contamination
- All 11 ccb_mcp_* suites now grade A (were D/F) with 0 critical failures
- 10/10 smoke tests pass across all newly expanded suites (Docker build + verifier)
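
The R.2 skip rule described above can be sketched as a standalone predicate.
This is a minimal illustration only: `r2_applies` and `MCP_SUITE_PREFIX` are
hypothetical names that do not appear in abc_audit.py, and the non-MCP suite
name in the usage lines is made up. The actual change lives inline in
`audit_suite`, as the diff shows.

```python
# Hypothetical standalone version of the R.2 skip rule. The names below are
# illustrative assumptions, not identifiers from abc_audit.py.
MCP_SUITE_PREFIX = "ccb_mcp_"


def r2_applies(criterion_id: str, suite: str) -> bool:
    """Return False only when R.2 is evaluated against an MCP-unique suite."""
    if criterion_id == "R.2" and suite.startswith(MCP_SUITE_PREFIX):
        return False
    return True


# "ccb_example" is a made-up non-MCP suite name used only for contrast.
print(r2_applies("R.2", "ccb_mcp_compliance"))  # False
print(r2_applies("R.2", "ccb_example"))         # True
print(r2_applies("R.1", "ccb_mcp_compliance"))  # True
```

Matching on the suite prefix keeps the exemption narrow: every other
criterion still runs for MCP-unique suites, and R.2 still runs everywhere else.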

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
diff --git a/benchmarks/ccb_mcp_compliance/ccx-compliance-057/tests/test.sh b/benchmarks/ccb_mcp_compliance/ccx-compliance-057/tests/test.sh
@@ -0,0 +1,9 @@
+#!/bin/bash
+# test.sh — Harbor compatibility wrapper
+# Harbor requires tests/test.sh for task discovery (TaskPaths.is_valid() check).
+# The actual evaluation logic lives in eval.sh (SWE-Factory exit-code-first pattern).
+
+# sg_only_env: restore full repo before verification (no-op for regular runs)
+[ -f /tmp/.sg_only_mode ] && [ -f /tests/sgonly_verifier_wrapper.sh ] && source /tests/sgonly_verifier_wrapper.sh
+
+exec bash "$(dirname "$0")/eval.sh" "$@"
diff --git a/benchmarks/ccb_mcp_onboarding/ccx-onboard-042/tests/test.sh b/benchmarks/ccb_mcp_onboarding/ccx-onboard-042/tests/test.sh
@@ -0,0 +1,9 @@
+#!/bin/bash
+# test.sh — Harbor compatibility wrapper
+# Harbor requires tests/test.sh for task discovery (TaskPaths.is_valid() check).
+# The actual evaluation logic lives in eval.sh (SWE-Factory exit-code-first pattern).
+
+# sg_only_env: restore full repo before verification (no-op for regular runs)
+[ -f /tmp/.sg_only_mode ] && [ -f /tests/sgonly_verifier_wrapper.sh ] && source /tests/sgonly_verifier_wrapper.sh
+
+exec bash "$(dirname "$0")/eval.sh" "$@"
diff --git a/benchmarks/ccb_mcp_onboarding/ccx-onboard-050/tests/test.sh b/benchmarks/ccb_mcp_onboarding/ccx-onboard-050/tests/test.sh
@@ -0,0 +1,9 @@
+#!/bin/bash
+# test.sh — Harbor compatibility wrapper
+# Harbor requires tests/test.sh for task discovery (TaskPaths.is_valid() check).
+# The actual evaluation logic lives in eval.sh (SWE-Factory exit-code-first pattern).
+
+# sg_only_env: restore full repo before verification (no-op for regular runs)
+[ -f /tmp/.sg_only_mode ] && [ -f /tests/sgonly_verifier_wrapper.sh ] && source /tests/sgonly_verifier_wrapper.sh
+
+exec bash "$(dirname "$0")/eval.sh" "$@"
diff --git a/benchmarks/ccb_mcp_platform/ccx-platform-091/tests/test.sh b/benchmarks/ccb_mcp_platform/ccx-platform-091/tests/test.sh
@@ -0,0 +1,9 @@
+#!/bin/bash
+# test.sh — Harbor compatibility wrapper
+# Harbor requires tests/test.sh for task discovery (TaskPaths.is_valid() check).
+# The actual evaluation logic lives in eval.sh (SWE-Factory exit-code-first pattern).
+
+# sg_only_env: restore full repo before verification (no-op for regular runs)
+[ -f /tmp/.sg_only_mode ] && [ -f /tests/sgonly_verifier_wrapper.sh ] && source /tests/sgonly_verifier_wrapper.sh
+
+exec bash "$(dirname "$0")/eval.sh" "$@"
diff --git a/scripts/abc_audit.py b/scripts/abc_audit.py
@@ -955,6 +955,15 @@ def audit_suite(suite: str, dimension: Optional[Dimension] = None) -> AuditRepor
             ))
             continue
 
+        # R.2 doesn't apply to MCP-unique suites: instructions intentionally
+        # reference Sourcegraph MCP tools (that's the point of these tasks).
+        if cid == "R.2" and suite.startswith("ccb_mcp_"):
+            report.results.append(CriterionResult(
+                criterion_id=cid, status=Status.SKIP,
+                evidence="MCP-unique suite: MCP tool references in instructions are by design",
+            ))
+            continue
+
         # Run automated check
         if cid in TASK_CHECKS:
             result = TASK_CHECKS[cid](tasks)