
Commit f350cd4

xyliugo and zjwu0522 authored
✨ feat: introduce mcpmark-agent (#178)
Co-authored-by: zjwu0522 <[email protected]>
Co-authored-by: Zijian Wu <[email protected]>
1 parent 8e7cf10 commit f350cd4

File tree

23 files changed: +1770 −1279 lines

README.md

Lines changed: 5 additions & 3 deletions
````diff
@@ -84,7 +84,7 @@ python -m pipeline \
   --tasks file_property/size_classification
 ```
 
-Results are saved to `./results/{exp_name}/{mcp}__{model}/{task}` (in this example `./results/test-run/filesystem__gpt-5/file_property__size_classification`).
+Results are saved to `./results/{exp_name}/{model}__{mcp}/run-*/...` (e.g., `./results/test-run/gpt-5__filesystem/run-1/...`).
 
 ---
 
@@ -152,7 +152,7 @@ You can also follow `docs/quickstart.md` for the shortest end-to-end path.
 
 ## Results and metrics
 
-- Results are written to `./results/` (JSON + CSV).
+- Results are organized under `./results/{exp_name}/{model}__{mcp}/run-*/` (JSON + CSV per task).
 - Generate a summary with:
 ```bash
 python -m src.aggregators.aggregate_results --exp-name exp
@@ -162,7 +162,9 @@ python -m src.aggregators.aggregate_results --exp-name exp
 ---
 
 ## Model and Tasks
-- See `docs/introduction.md` for models supported in MCPMark.
+- **Model support**: MCPMark calls models via LiteLLM; see the [`LiteLLM Doc`](https://docs.litellm.ai/docs/). For Anthropic (Claude) extended thinking mode (enabled via `--reasoning-effort`), we use Anthropic's native API.
+- See `docs/introduction.md` for details and configuration of supported models in MCPMark.
+- To add a new model, edit `src/model_config.py`. Before adding, check the supported models/providers in the [`LiteLLM Doc`](https://docs.litellm.ai/docs/).
 - Task design principles in `docs/datasets/task.md`. Each task ships with an automated `verify.py` for objective, reproducible evaluation, see `docs/task.md` for details.
 
 ---
````
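The results layout described in the README change can be sketched as a small path helper. This is an illustration only: `results_dir` and its argument names are hypothetical, not part of MCPMark's API.

```python
from pathlib import Path

def results_dir(output_root: Path, exp_name: str, model: str, mcp: str, run_idx: int) -> Path:
    # Mirrors the documented layout: ./results/{exp_name}/{model}__{mcp}/run-N
    return output_root / exp_name / f"{model}__{mcp}" / f"run-{run_idx}"

print(results_dir(Path("results"), "test-run", "gpt-5", "filesystem", 1).as_posix())
# → results/test-run/gpt-5__filesystem/run-1
```

The model name comes first in the `{model}__{mcp}` component, reversing the old `{mcp}__{model}` order.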

pipeline.py

Lines changed: 7 additions & 8 deletions
```diff
@@ -63,10 +63,10 @@ def main():
         "--timeout", type=int, default=3600, help="Timeout in seconds for agent execution"
     )
     parser.add_argument(
-        "--stream",
-        action="store_true",
-        default=False,
-        help="Use streaming execution (default: False, uses non-streaming)",
+        "--reasoning-effort",
+        default="default",
+        choices=["default", "minimal", "low", "medium", "high"],
+        help="Reasoning effort level for supported models (default: default)",
     )
 
     # Output configuration
@@ -113,12 +113,11 @@ def main():
         logger.info(f"Starting Run {run_idx}/{args.k}")
         logger.info(f"{'=' * 80}\n")
 
-            # For k-runs, create run-N subdirectory
+            # For k-runs, results/{exp}/{mcp}__{model}/run-N
             run_exp_name = f"run-{run_idx}"
             run_output_dir = args.output_dir / args.exp_name
         else:
-            # For single run (k=1), maintain backward compatibility
-            # Use run-1 subdirectory for consistency
+            # For single run, still use run-1 under service_model
             run_exp_name = "run-1"
             run_output_dir = args.output_dir / args.exp_name
 
@@ -138,7 +137,7 @@ def main():
         timeout=args.timeout,
         exp_name=run_exp_name,
         output_dir=run_output_dir,
-        stream=args.stream,
+        reasoning_effort=args.reasoning_effort,
     )
 
     pipeline.run_evaluation(args.tasks)
```
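The new `--reasoning-effort` flag can be reproduced in isolation with a minimal argparse sketch (a standalone parser for illustration, not the full `pipeline.py` setup):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--reasoning-effort",
    default="default",
    choices=["default", "minimal", "low", "medium", "high"],
    help="Reasoning effort level for supported models",
)

# Valid values parse cleanly; anything outside `choices` makes argparse
# exit with a usage error instead of passing bad input downstream.
args = parser.parse_args(["--reasoning-effort", "high"])
print(args.reasoning_effort)  # → high
```

Because the old `--stream` flag was removed rather than deprecated, callers that passed `--stream` will now fail at argument parsing.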

pyproject.toml

Lines changed: 7 additions & 2 deletions
```diff
@@ -14,8 +14,13 @@ dependencies = [
     "python-dotenv>=1.1.1,<2",
     "ruff>=0.12.4,<0.13",
     "psycopg2-binary>=2.9.10,<3",
-    "pyyaml>=6.0.2,<7"
-    , "nest-asyncio>=1.6.0,<2", "pixi", "pipx>=1.7.1,<2", "pgdumplib>=3.1.0,<4"]
+    "pyyaml>=6.0.2,<7",
+    "nest-asyncio>=1.6.0,<2",
+    "pixi",
+    "pipx>=1.7.1,<2",
+    "pgdumplib>=3.1.0,<4",
+    "litellm==1.76.0"
+]
 
 [build-system]
 build-backend = "hatchling.build"
```

requirements.txt

Lines changed: 2 additions & 1 deletion
```diff
@@ -8,4 +8,5 @@ matplotlib>=3.7.0
 numpy>=1.23.0
 psycopg2
 pyyaml
-nest_asyncio
+nest_asyncio
+litellm==1.76.0
```
