Commit 2a4408e

sjarmak and claude committed
fix: paired baseline+MCP execution and correct repo names in instructions
Runner (run_selected_tasks.sh):
- Replace mode-sequential passes with per-task paired execution (_launch_task_pair) matching _common.sh's run_paired_configs pattern: baseline + MCP start simultaneously per task so timing is comparable and resource utilization is maximized
- Dockerfile swap/restore moved to task-pair level: swap once before launching both configs, background watcher restores after BOTH complete (no race condition)

Instructions (10 tasks):
- Replace generic "org/repo-name" placeholder with exact repo identifiers the oracle uses (e.g. "sg-benchmarks/kubernetes-client-go", "sg-benchmarks/grafana-loki")
- Add "Important: Use exact repo identifiers" note to all affected tasks
- Fix stale commit SHAs: 8020fc4 → v0.32.0 (client-go), a3af38d → v3.3.4 (loki)
- Root cause of 0.0 scores: baseline agents wrote GitHub repo names instead of sg-benchmarks mirror names that the oracle file_set_match check expects

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent e5b05f4 commit 2a4408e
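
The paired execution described in the commit message can be pictured with a short sketch. This is an illustration in Python, not the shell code in `run_selected_tasks.sh`. The names `launch_task_pair`, `run_config`, `swap`, and `restore` are hypothetical stand-ins, and a thread pool replaces the background watcher the commit mentions. The point it shows is the ordering: swap once, start both configs together, restore only after both have finished.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# The two configurations the commit message says are launched per task.
CONFIGS = ("baseline", "mcp")


def launch_task_pair(
    task: str,
    run_config: Callable[[str, str], float],  # hypothetical: runs one config, returns a score
    swap: Callable[[str], None],              # hypothetical: swaps the task's Dockerfile in
    restore: Callable[[str], None],           # hypothetical: puts the original Dockerfile back
) -> Dict[str, float]:
    """Run both configs of one task concurrently and restore afterwards."""
    swap(task)  # done once per task pair, before either config starts
    try:
        with ThreadPoolExecutor(max_workers=len(CONFIGS)) as pool:
            futures = {c: pool.submit(run_config, task, c) for c in CONFIGS}
            # Leaving the `with` block waits for every submitted run,
            # even if one of the result() calls raises.
            return {c: f.result() for c, f in futures.items()}
    finally:
        restore(task)  # reached only after both runs have completed
```

Because the restore step sits in `finally` outside the pool's `with` block, it cannot run while either config is still using the swapped Dockerfile, which is the race the commit message says was removed.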

11 files changed: +141 -89 lines changed


benchmarks/ccb_mcp_crossorg/ccx-crossorg-061/instruction.md

Lines changed: 3 additions & 1 deletion
@@ -38,7 +38,7 @@ Create a file at `/workspace/answer.json` with your findings:
 {
   "symbols": [
     {
-      "repo": "org/repo-name",
+      "repo": "kubernetes/kubernetes",
       "path": "relative/path/to/file.go",
       "symbol": "StructName"
     }
@@ -47,6 +47,8 @@ Create a file at `/workspace/answer.json` with your findings:
 }
 ```
 
+**Important**: Use exact repo identifiers as they appear in Sourcegraph. The oracle expects entries for `kubernetes/kubernetes` and `grafana/grafana`. The `repo` field must match these exactly.
+
 ## Evaluation
 
 Your answer is evaluated on:

benchmarks/ccb_mcp_crossrepo_tracing/ccx-config-trace-010/instruction.md

Lines changed: 4 additions & 2 deletions
@@ -28,7 +28,7 @@ directories, but the authoritative source lives in separate repositories accessi
 
 The local `/workspace/` directory contains all repositories:
 - `kubernetes/kubernetes` at v1.32.0 → `/workspace/kubernetes`
-- `kubernetes/client-go` at 8020fc4 → `/workspace/client-go`
+- `kubernetes/client-go` at v0.32.0 → `/workspace/client-go`
 - `kubernetes/api` at fa23dd3 → `/workspace/api`
 - `etcd-io/etcd` at v3.5.17 → `/workspace/etcd`
 
@@ -39,12 +39,14 @@ Create a file at `/workspace/answer.json` with your findings in the following st
 ```json
 {
   "symbols": [
-    {"repo": "org/repo-name", "path": "relative/path/to/file.go", "symbol": "SymbolName"}
+    {"repo": "sg-benchmarks/kubernetes-client-go", "path": "relative/path/to/file.go", "symbol": "SymbolName"}
   ],
   "text": "Explanation of where Config is defined, the package import path, and why this is the authoritative source."
 }
 ```
 
+**Important**: The local `/workspace/client-go` directory contains the `kubernetes/client-go` source, but in Sourcegraph it is indexed as `sg-benchmarks/kubernetes-client-go`. Use `sg-benchmarks/kubernetes-client-go` as the `repo` value in your answer — the oracle checks for this exact identifier.
+
 Your answer is evaluated against a closed-world oracle — the exact repo, path, and symbol name matter.
 
 ## Evaluation

benchmarks/ccb_mcp_crossrepo_tracing/ccx-dep-trace-001/instruction.md

Lines changed: 4 additions & 2 deletions
@@ -22,7 +22,7 @@ that directly imports it (not just subpackages) will be affected by a breaking A
 
 The local `/workspace/` directory contains all repositories:
 - `kubernetes/kubernetes` at v1.32.0 → `/workspace/kubernetes`
-- `kubernetes/client-go` at 8020fc4 → `/workspace/client-go`
+- `kubernetes/client-go` at v0.32.0 → `/workspace/client-go`
 
 ## Output Format
 
@@ -31,12 +31,14 @@ Create a file at `/workspace/answer.json` with your findings in the following st
 ```json
 {
   "files": [
-    {"repo": "org/repo-name", "path": "relative/path/to/file.go"}
+    {"repo": "sg-benchmarks/kubernetes-client-go", "path": "relative/path/to/file.go"}
   ],
   "text": "Narrative explanation of your findings, citing repos and file paths."
 }
 ```
 
+**Important**: Use `"repo": "sg-benchmarks/kubernetes-client-go"` exactly — this is the canonical repo identifier used by the evaluation oracle. The local checkout at `/workspace/client-go` corresponds to this repo.
+
 Include only the `files` field. Your answer is evaluated against a closed-world oracle — completeness matters.
 
 ## Evaluation

benchmarks/ccb_mcp_crossrepo_tracing/ccx-dep-trace-004/instruction.md

Lines changed: 8 additions & 2 deletions
@@ -24,7 +24,7 @@ adding observability, or extending the query pipeline.
 
 The local `/workspace/` directory contains all repositories:
 - `grafana/grafana` at v11.4.0 → `/workspace/grafana`
-- `grafana/loki` at a3af38d → `/workspace/loki`
+- `grafana/loki` at v3.3.4 → `/workspace/loki`
 
 ## Output Format
 
@@ -33,12 +33,18 @@ Create a file at `/workspace/answer.json` with your findings in the following st
 ```json
 {
   "chain": [
-    {"repo": "org/repo-name", "path": "relative/path/to/file.go", "symbol": "TypeOrFunctionName"}
+    {"repo": "grafana/grafana", "path": "relative/path/to/file.go", "symbol": "TypeOrFunctionName"}
   ],
   "text": "Narrative explanation of the call chain, citing specific repos and file paths."
 }
 ```
 
+**Important**: Use exact repo identifiers as they appear in the oracle:
+- For Grafana: `"repo": "grafana/grafana"`
+- For Loki: `"repo": "sg-benchmarks/grafana-loki"`
+
+The local checkout at `/workspace/loki` corresponds to `sg-benchmarks/grafana-loki`.
+
 List the chain steps in order from Grafana (caller) to Loki (callee). Your answer is evaluated
 against a closed-world oracle — precision matters.
 

benchmarks/ccb_mcp_incident/ccx-incident-031/instruction.md

Lines changed: 3 additions & 1 deletion
@@ -50,7 +50,7 @@ Create a file at `/workspace/answer.json` with your findings:
 {
   "files": [
     {
-      "repo": "org/repo-name",
+      "repo": "etcd-io/etcd",
       "path": "relative/path/to/file.go",
       "function": "FunctionName"
     }
@@ -59,6 +59,8 @@ Create a file at `/workspace/answer.json` with your findings:
 }
 ```
 
+**Important**: Use `etcd-io/etcd` as the exact `repo` identifier in your answer. The oracle checks for files `server/storage/mvcc/kvstore.go` and `server/storage/mvcc/kvstore_txn.go` in `etcd-io/etcd`. Do not cite vendored copies in `kubernetes/kubernetes`.
+
 ## Evaluation
 
 Your answer will be scored on:

benchmarks/ccb_mcp_onboarding/ccx-explore-042-ds/instruction.md

Lines changed: 3 additions & 1 deletion
@@ -41,7 +41,7 @@ Create a file at `/workspace/answer.json` with your findings:
 {
   "chain": [
     {
-      "repo": "org/repo-name",
+      "repo": "numpy/numpy",
       "path": "relative/path/to/file.py",
       "symbol": "FunctionOrClassName",
       "description": "What role this plays in the data flow"
@@ -51,6 +51,8 @@ Create a file at `/workspace/answer.json` with your findings:
 }
 ```
 
+**Important**: Use exact repo identifiers as they appear in Sourcegraph. The oracle expects `repo` values of `numpy/numpy` (array layer), `pandas-dev/pandas` (data structure layer), and `scipy/scipy` (scientific computation layer). The `repo` field must match these exactly.
+
 The `chain` should contain at least 3 steps representing the 3 layers described above.
 
 ## Evaluation

benchmarks/ccb_mcp_onboarding/ccx-onboard-050-ds/instruction.md

Lines changed: 3 additions & 1 deletion
@@ -36,7 +36,7 @@ Create a file at `/workspace/answer.json` with your findings:
 {
   "chain": [
     {
-      "repo": "org/repo-name",
+      "repo": "sg-benchmarks/kubernetes-client-go",
       "path": "relative/path/to/file.go",
       "symbol": "FunctionOrTypeName",
       "description": "What this step does in the flow"
@@ -46,6 +46,8 @@ Create a file at `/workspace/answer.json` with your findings:
 }
 ```
 
+**Important**: Use exact repo identifiers as they appear in Sourcegraph. The oracle expects `repo` values of `sg-benchmarks/kubernetes-client-go` (client layer), `kubernetes/kubernetes` (API server layer), and `etcd-io/etcd` (storage layer). The `repo` field must match these exactly.
+
 The `chain` should contain at least 3 steps representing the 3 layers described above.
 
 ## Evaluation

benchmarks/ccb_mcp_platform/ccx-explore-091-ds/instruction.md

Lines changed: 3 additions & 1 deletion
@@ -41,7 +41,7 @@ Create a file at `/workspace/answer.json` with your findings:
 {
   "files": [
     {
-      "repo": "org/repo-name",
+      "repo": "sg-benchmarks/kubernetes-api",
       "path": "relative/path/to/file.go",
       "description": "What this file contains and its role in the deployment pattern"
     }
@@ -50,6 +50,8 @@ Create a file at `/workspace/answer.json` with your findings:
 }
 ```
 
+**Important**: Use exact repo identifiers as they appear in Sourcegraph. The oracle expects entries for `sg-benchmarks/kubernetes-api` (API type definitions) and `sg-benchmarks/kubernetes-client-go` (client examples and docs). The `repo` field must match these exactly.
+
 The `files` list should include at least 3 files across 2+ repos that together define
 the canonical service deployment pattern.
 

benchmarks/ccb_mcp_security/ccx-vuln-remed-011/instruction.md

Lines changed: 4 additions & 2 deletions
@@ -11,7 +11,7 @@ that declare **`cookie` as a direct runtime dependency** (listed under
 `"dependencies"`, not `"devDependencies"`).
 
 For each match, report:
-- The repository (`org/repo-name`)
+- The repository (e.g., `sg-benchmarks/expressjs-express`)
 - The file path within the repository
 - The version constraint declared for `cookie`
 
@@ -38,7 +38,7 @@ Create a file at `/workspace/answer.json` with your findings:
 {
   "files": [
     {
-      "repo": "org/repo-name",
+      "repo": "sg-benchmarks/expressjs-express",
       "path": "relative/path/to/package.json",
       "version": "the-version-constraint"
     }
@@ -47,6 +47,8 @@ Create a file at `/workspace/answer.json` with your findings:
 }
 ```
 
+**Important**: Use exact repo identifiers as they appear in Sourcegraph. The repos to search are `nodejs/node`, `sg-benchmarks/expressjs-express`, `sg-benchmarks/lodash`, and `sg-benchmarks/prisma-prisma`. Note: the local `/workspace/express` directory maps to `sg-benchmarks/expressjs-express` in Sourcegraph — use `sg-benchmarks/expressjs-express` as the `repo` value in your answer.
+
 Include only entries where `cookie` appears under `"dependencies"` (not `"devDependencies"`
 or `"scripts"`). Your answer is evaluated against a closed-world oracle — completeness matters.
 

benchmarks/ccb_mcp_security/ccx-vuln-remed-014/instruction.md

Lines changed: 5 additions & 3 deletions
@@ -12,7 +12,7 @@ and/or Mimir) that register HTTP handlers or routes **without applying
 authentication middleware** such as `middleware.AuthenticateUser`.
 
 For each file found, report:
-- The repository (`org/repo-name`)
+- The repository (e.g., `sg-benchmarks/grafana-loki`)
 - The file path within the repository
 - The specific endpoint path(s) registered without authentication
 
@@ -38,7 +38,7 @@ to surface.
 
 The local `/workspace/` directory contains all repositories:
 - `grafana/grafana` at v11.4.0 → `/workspace/grafana`
-- `grafana/loki` at a3af38d → `/workspace/loki`
+- `grafana/loki` at v3.3.4 → `/workspace/loki`
 
 ## Output Format
 
@@ -48,7 +48,7 @@ Create a file at `/workspace/answer.json` with your findings:
 {
   "files": [
     {
-      "repo": "org/repo-name",
+      "repo": "sg-benchmarks/grafana-loki",
       "path": "relative/path/to/file.go",
       "endpoints": ["/endpoint/path/1", "/endpoint/path/2"]
     }
@@ -57,6 +57,8 @@ Create a file at `/workspace/answer.json` with your findings:
 }
 ```
 
+**Important**: Use exact repo identifiers as they appear in Sourcegraph. The MCP-only repos to audit are `sg-benchmarks/grafana-loki` and `sg-benchmarks/grafana-mimir`. Note: the local `/workspace/loki` directory maps to `sg-benchmarks/grafana-loki` in Sourcegraph — use `sg-benchmarks/grafana-loki` as the `repo` value in your answer. The oracle checks for `pkg/dataobj/explorer/service.go` in `sg-benchmarks/grafana-loki`.
+
 Your answer is evaluated against a closed-world oracle — completeness matters.
 
 ## Evaluation
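
To see why a GitHub-style repo name can score 0.0 against an oracle keyed on the mirror name, consider a minimal sketch of an exact-match file-set check. The oracle's `file_set_match` implementation is not shown in this commit, so the recall formula and the names `file_set` and `file_set_recall` are assumptions made for illustration. The only thing taken from the instructions above is the oracle entry `pkg/dataobj/explorer/service.go` in `sg-benchmarks/grafana-loki`.

```python
from typing import Dict, Iterable, Set, Tuple

FileKey = Tuple[str, str]  # (repo identifier, path inside the repo)


def file_set(entries: Iterable[Dict[str, str]]) -> Set[FileKey]:
    """Collapse answer or oracle entries into a set of exact (repo, path) keys."""
    return {(e["repo"], e["path"]) for e in entries}


def file_set_recall(answer: Iterable[Dict[str, str]],
                    oracle: Iterable[Dict[str, str]]) -> float:
    """Fraction of oracle files present in the answer (assumed scoring rule)."""
    expected = file_set(oracle)
    if not expected:
        return 0.0
    return len(expected & file_set(answer)) / len(expected)


oracle = [{"repo": "sg-benchmarks/grafana-loki",
           "path": "pkg/dataobj/explorer/service.go"}]
github_named = [{"repo": "grafana/loki",
                 "path": "pkg/dataobj/explorer/service.go"}]
mirror_named = [{"repo": "sg-benchmarks/grafana-loki",
                 "path": "pkg/dataobj/explorer/service.go"}]

print(file_set_recall(github_named, oracle))  # 0.0: right file, wrong repo string
print(file_set_recall(mirror_named, oracle))  # 1.0
```

Under any check of this exact-match shape, the path alone is not enough: the `repo` string is part of the key, so an answer that finds the right file under the upstream name still matches nothing.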
