WebDetective is an evaluation framework for assessing large language models' ability to perform multi-hop question answering using Wikipedia as the knowledge source. The benchmark evaluates models on their search capabilities, knowledge utilization, and answer generation quality.
- Multi-hop Question Answering: Evaluate models on questions requiring multiple reasoning steps across Wikipedia articles
- Two Evaluation Modes: `simple` (accuracy-focused) and `holistic` (comprehensive metrics)
- Flexible Wikipedia Access: Support for MediaWiki API, local LMDB database, or online wiki services
- Function Calling Support: Test models with and without function calling capabilities
- Comprehensive Metrics: Track search scores, knowledge utilization, refusal behavior, and more
- Installation
- Quick Start
- Model Configuration
- Environment Variables
- Wikipedia Access Methods
- Running Evaluation
- Understanding Results
- Converting Results to Excel
- Evaluation Metrics
- Python 3.10
- Conda (recommended)
- Create a conda environment:
  ```bash
  conda create -n WebDetective python=3.10
  conda activate WebDetective
  ```
- Install dependencies:
  ```bash
  pip install -r requirement.txt
  ```
- Download NLTK data (if needed; see the check after this list):
  ```python
  import nltk
  nltk.download('punkt')
  ```
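Optionally, you can confirm the punkt data is visible to NLTK with a quick one-liner (this uses NLTK's standard lookup, which raises a `LookupError` if the data is missing):

```bash
python -c "import nltk; nltk.data.find('tokenizers/punkt')"
```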
Here's a minimal example to get started:
```bash
# Set up your evaluation
bash scripts/run.sh \
"your-model-path" \
"http://your-model-endpoint/v1" \
"your-api-key" \
"mask_wiki_200.jsonl" \
"./output" \
"holistic"
```

Before evaluating your model, you need to configure it in `utils/constant.py`. Add your model to the appropriate category:
- `FC_MODELS` (Function Calling Models)
  - Models that support function calling
  - Examples: GPT-4, Claude-3, etc.
  ```python
  FC_MODELS = [
      "gpt-4o-2024-11-20",
      "claude-3-7-sonnet-20250219",
      "your-model-name",  # Add your model here
  ]
  ```
- `NONTEMP_MODELS` (Non-Temperature Models)
  - Models that don't support the temperature parameter
  - Examples: o1, o3, reasoning models
  ```python
  NONTEMP_MODELS = [
      "o1-2024-12-17",
      "o3-mini-2025-01-31",
      "your-reasoning-model",  # Add your model here
  ]
  ```
- `REASONING_MODELS` (Reasoning Effort Models)
  - Models that support the `reasoning_effort` parameter
  - Examples: o-series models
  ```python
  REASONING_MODELS = [
      "o1-2024-12-17",
      "o3-2025-04-16",
      "your-reasoning-model",  # Add your model here
  ]
  ```
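As a rough illustration of why the categories matter, a hypothetical request builder might consult them like this. Only the three list names come from `utils/constant.py`; the helper itself is not part of the repo:

```python
from utils.constant import FC_MODELS, NONTEMP_MODELS, REASONING_MODELS

def build_generate_cfg(model_name: str) -> dict:
    """Illustrative only: pick request parameters based on the model category."""
    cfg = {"use_function_calling": model_name in FC_MODELS}
    if model_name not in NONTEMP_MODELS:
        cfg["temperature"] = 0.6          # skipped for o1/o3-style models
    if model_name in REASONING_MODELS:
        cfg["reasoning_effort"] = "high"  # "low", "medium", or "high"
    return cfg
```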
Configure these variables in `scripts/run.sh`:

| Variable | Description | Example |
|---|---|---|
| `MODEL_PATH` | Path or identifier of your model | `"gpt-4o-2024-11-20"` |
| `MODEL_ADDRESS` | API endpoint for your model | `"http://0.0.0.0:6001/v1"` |
| `MODEL_APIKEY` | API key for authentication | `"your-api-key"` |
| `DATASET` | Evaluation dataset file | `"mask_wiki_200.jsonl"` |
| `OUTPUT_PATH` | Directory to save results | `"./eval_output"` |
| `EVAL_MODE` | Evaluation mode | `"simple"` or `"holistic"` |
| Variable | Default | Description |
|---|---|---|
| `MAX_CONTEXT_SIZE` | `31*1024-500` | Maximum context window size |
| `TEMPERATURE` | `0.6` | Generation temperature (ignored for `NONTEMP_MODELS`) |
| `REASONING_EFFORT` | `"high"` | Reasoning effort level: `"low"`, `"medium"`, or `"high"` |
| `WEBCONTENT_MAXLENGTH` | `150000` | Maximum length of web content to process |
| `MAX_WORKERS` | `10` | Number of parallel workers (be careful with a local wiki) |
| `ROLLOUT_COUNT` | `1` | Number of evaluation rounds |
| `NLTK_DATA` | `"path/to/nltk_data"` | Path to NLTK data directory |
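If you want to override any of these defaults, one option is to export them before launching the script. This sketch assumes `scripts/run.sh` picks the names up from the environment; if they are set directly inside the script, edit them there instead:

```bash
export MAX_CONTEXT_SIZE=$((31*1024-500))
export TEMPERATURE=0.6
export REASONING_EFFORT="high"
export WEBCONTENT_MAXLENGTH=150000
export MAX_WORKERS=10
export ROLLOUT_COUNT=1
export NLTK_DATA="path/to/nltk_data"
```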
| Variable | Description | Required |
|---|---|---|
| `SERPER_API_KEY` | API key for the Serper search service | Yes |
| `SEARCH_API_URL` | Search API endpoint | Default: `"http://google.serper.dev/search"` |
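For example, assuming these are read from the environment like the Wikipedia variables below (the key value is a placeholder; the URL is the default from the table):

```bash
export SERPER_API_KEY="your-serper-api-key"
export SEARCH_API_URL="http://google.serper.dev/search"
```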
| Variable | Description | Default |
|---|---|---|
| `WIKI_URL_PREFIX` | Wikipedia URL prefix | `"https://en.wikipedia.org/wiki/"` |
| `WIKI_LMDB_PATH` | Path to local LMDB database | `"EMPTY"` |
| `WIKI_REQUEST_URL` | Online wiki service URL | `"EMPTY"` |
| `WIKI_REQUEST_AUTH_TOKEN` | Auth token for online service | `"EMPTY"` |
| Variable | Description | Default |
|---|---|---|
| `SUMMARY_MODEL_PATH` | Model for summarization | `"path/to/Qwen2.5-72B-Instruct"` |
| `SUMMARY_MODEL_ADDRESS` | Summary model endpoint | `"http://0.0.0.0:6002/v1"` |
| `SUMMARY_MODEL_APIKEY` | Summary model API key | `"EMPTY"` |
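A typical setup, assuming these are exported as environment variables like the Wikipedia settings below (values mirror the defaults in the table):

```bash
export SUMMARY_MODEL_PATH="path/to/Qwen2.5-72B-Instruct"
export SUMMARY_MODEL_ADDRESS="http://0.0.0.0:6002/v1"
export SUMMARY_MODEL_APIKEY="EMPTY"
```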
WebDetective supports three methods to access Wikipedia content, each with different trade-offs:
When to use: Quick testing, small-scale evaluation
Setup: None required; this method is used automatically if both `WIKI_LMDB_PATH` and `WIKI_REQUEST_URL` are set to `"EMPTY"`.
Pros:
- No setup or downloads required
- Always uses latest Wikipedia content
Cons:
- Rate-limited by Wikipedia
- Slow for large-scale evaluations
- May experience timeouts
export WIKI_LMDB_PATH="EMPTY"
export WIKI_REQUEST_URL="EMPTY"When to use: Large-scale evaluations, best performance
Setup:
- Download the December 2024 Wikipedia dump
- Convert it to LMDB format (a preprocessing script is required)
- Point `WIKI_LMDB_PATH` to the LMDB directory
Pros:
- Very fast read speeds
- No rate limiting
- Offline access
Cons:
- Initial setup time required
- Large disk space needed (~50GB+)
- Static snapshot (December 2024)
export WIKI_LMDB_PATH="/path/to/wiki_lmdb_database"
export WIKI_REQUEST_URL="EMPTY"When to use: Medium-scale evaluations, when local setup is not feasible
Setup: Deploy or subscribe to an online Wikipedia API service
Pros:
- Stable and reliable
- No local storage required
- Can be updated
Cons:
- Costs money (pay per request)
- Requires internet connection
- Depends on service availability
export WIKI_LMDB_PATH="EMPTY"
export WIKI_REQUEST_URL="https://your-wiki-service.com/api"
export WIKI_REQUEST_AUTH_TOKEN="your-auth-token"bash scripts/run.sh \
<MODEL_PATH> \
<MODEL_ADDRESS> \
<MODEL_APIKEY> \
<DATASET> \
<OUTPUT_PATH> \
<EVAL_MODE>
```

Example with a hosted API (e.g., OpenAI):

```bash
bash scripts/run.sh \
"gpt-4o-2024-11-20" \
"https://api.openai.com/v1" \
"sk-your-openai-key" \
"mask_wiki_200.jsonl" \
"./eval_output" \
"simple"bash scripts/run.sh \
"/path/to/local-model" \
"http://0.0.0.0:6001/v1" \
"EMPTY" \
"mask_wiki_200.jsonl" \
"./eval_output" \
"holistic"The script automatically detects vLLM format addresses and starts the server if needed:
bash scripts/run.sh \
"/path/to/model" \
"http://0.0.0.0:6001/v1" \
"EMPTY" \
"mask_wiki_200.jsonl" \
"./eval_output" \
"holistic"After evaluation, results are organized as follows:
```
OUTPUT_PATH/
├── <model_name>/
│   └── <dataset_name>/
│       ├── iter1.jsonl          # First rollout results
│       ├── iter2.jsonl          # Second rollout results (if ROLLOUT_COUNT > 1)
│       ├── iter3.jsonl          # Third rollout results (if ROLLOUT_COUNT > 1)
│       └── eval_results.json    # Detailed evaluation metrics
└── <dataset_name>_summary.jsonl # Summary of all evaluations
```
- `iter{N}.jsonl`: Raw model outputs for each rollout
  - Contains full conversation history
  - Tool calls and responses
  - Final predictions
  - Metadata for each question
- `eval_results.json`: Detailed metrics per round
  - Per-question evaluation results
  - Correctness judgments
  - Search and generation scores
  - Knowledge utilization metrics
- `{dataset}_summary.jsonl`: Aggregated metrics
  - Overall performance statistics
  - Averaged across all rollouts
  - Ready for analysis and comparison
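For a quick look at the aggregated numbers you can load the summary file directly. The path below follows the defaults used elsewhere in this README, and the available fields depend on the evaluation mode:

```python
import json

summary_path = "./eval_output/mask_wiki_200.jsonl_summary.jsonl"
with open(summary_path) as f:
    for line in f:
        record = json.loads(line)
        print(sorted(record.keys()))  # inspect which metrics were recorded
```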
For easier analysis, convert the summary to Excel format:
- Edit `wikieval_analysis.py`:
  ```python
  dataset = "mask_wiki_200.jsonl"
  summary_path = f"./eval_output/{dataset}_summary.jsonl"
  excel_path = f"./eval_output/{dataset}_summary.xlsx"
  ```
- Run the conversion:
  ```bash
  python wikieval_analysis.py
  ```
- View results: Open the generated Excel file to see formatted metrics with:
  - Model comparisons
  - Performance across difficulty levels
  - Detailed statistics
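If you only need a raw dump rather than the formatted output of `wikieval_analysis.py`, a minimal one-off conversion with pandas (requires `openpyxl`; paths mirror the defaults above, and the output name is deliberately different so it won't overwrite the script's file) looks like this:

```python
import json
import pandas as pd

dataset = "mask_wiki_200.jsonl"
summary_path = f"./eval_output/{dataset}_summary.jsonl"

# Read the JSONL summary and write one row per record to an Excel sheet.
rows = [json.loads(line) for line in open(summary_path)]
pd.DataFrame(rows).to_excel(f"./eval_output/{dataset}_summary_raw.xlsx", index=False)
```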
- Pass@1: Percentage of questions answered correctly on first attempt
- Pass@K: Percentage answered correctly in any of K attempts (see the sketch after this list)
- Best Pass@1: Best single-round performance
- Avg Pass@1: Average performance across all rounds
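A minimal sketch of how Pass@1 and Pass@K can be computed from per-rollout correctness judgments; the data layout here is illustrative, not the framework's internal format:

```python
def pass_at_k(per_question_correct: list[list[bool]], k: int) -> float:
    """per_question_correct[i][r]: was question i judged correct in rollout r?"""
    hits = sum(any(rounds[:k]) for rounds in per_question_correct)
    return 100.0 * hits / len(per_question_correct)

# Two questions, three rollouts each: Pass@1 = 50%, Pass@3 = 100%
results = [[True, False, True], [False, False, True]]
print(pass_at_k(results, 1), pass_at_k(results, 3))
```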
- Search Score (0-100%)
  - Measures quality of information retrieval
  - Whether the model found sufficient evidence to answer
  - Higher is better
- Generation Score (0-100%)
  - Combined measure of knowledge utilization and appropriate refusal
  - Formula: `0.5 * (Knowledge_Util_F1 + Good_Refusal_F1) * Knowledge_Sufficient_Rate` (see the sketch below)
  - Balances correct answering with knowing when to refuse
- Pass@K: Overall accuracy metrics
  - Same as simple mode but with additional context
- Good Refusal Metrics:
  - Recall: % of insufficient-knowledge cases where the model correctly refused
  - Precision: % of refusals that were appropriate
  - F1: Harmonic mean of recall and precision
- Knowledge Utilization Metrics:
  - Recall: % of sufficient-knowledge cases answered correctly
  - Precision: % of non-refusal cases that were correct
  - F1: Harmonic mean of recall and precision
- Knowledge Sufficiency (0-100%):
  - % of questions where retrieved information was sufficient
  - Combines search results and parametric knowledge
- RAG Knowledge Sufficiency (0-100%):
  - % of questions answerable from retrieved information alone
  - Excludes parametric knowledge
- Knowledge Forget to Use (0-100%):
  - % of cases where the model had sufficient knowledge but failed to use it
  - Lower is better
- Knowledge Lead Astray (0-100%):
  - % of cases where the model had correct information but answered incorrectly
  - Lower is better
- Avg Tool Use: Average number of tool calls per question
- Avg Visit Tool Use: Average number of Wikipedia page visits
- Avg Optimal Hop: Average number of reasoning hops in ground truth
Results are also broken down by difficulty:
- Easy: 2-3 hops required
- Medium: 3-5 hops required
- Hard: 5+ hops required
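As a concrete reading of the Generation Score formula above, here is a minimal sketch; the rates are plain fractions in [0, 1] and the example numbers are made up:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (used for both F1 metrics above)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def generation_score(knowledge_util_f1: float,
                     good_refusal_f1: float,
                     knowledge_sufficient_rate: float) -> float:
    return 0.5 * (knowledge_util_f1 + good_refusal_f1) * knowledge_sufficient_rate

# Made-up example values:
util_f1 = f1(precision=0.85, recall=0.75)     # knowledge utilization F1 ≈ 0.797
refusal_f1 = f1(precision=0.60, recall=0.50)  # good refusal F1 ≈ 0.545
print(generation_score(util_f1, refusal_f1, knowledge_sufficient_rate=0.90))
```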
You can also run evaluation directly with Python:
```python
from utils.react_agent import MultiTurnReactAgent
from utils.prompt import SYSTEM_PROMPT_MULTI

# Configure your model
llm_cfg = {
    'model': 'your-model-path',
    'generate_cfg': {
        'temperature': 0.6,
        'top_p': 0.95,
        'num_retries': 10
    }
}

# Create agent
agent = MultiTurnReactAgent(
    llm=llm_cfg,
    function_list=["search", "visit"],
    system_message=SYSTEM_PROMPT_MULTI
)

# Run on your data
result = agent.run(data_item, user_prompt)
```

To analyze individual questions:
```python
import json

# Load results
with open('eval_output/model/dataset/iter1.jsonl', 'r') as f:
    results = [json.loads(line) for line in f]

# Find specific question
for item in results:
    if 'your search term' in item['question']:
        print(f"Question: {item['question']}")
        print(f"Prediction: {item['prediction']}")
        print(f"Tool calls: {len([m for m in item['messages'] if m.get('tool_calls')])}")
```

- Rate Limiting: If using the MediaWiki API, reduce `MAX_WORKERS` or use a local LMDB database
- Timeouts: Increase timeout values in the code or reduce `WEBCONTENT_MAXLENGTH`
- Memory Issues: Reduce `MAX_CONTEXT_SIZE` or `MAX_WORKERS`
- vLLM Server Not Starting: Check CUDA availability and the model path
For issues or questions:
- Check the error messages in console output
- Review `eval_results.json` for detailed error information
- Examine individual `iter*.jsonl` files for failed cases
If you use this benchmark in your research, please cite:
```bibtex
@misc{song2025demystifyingdeepsearchholistic,
      title={Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics},
      author={Maojia Song and Renhang Liu and Xinyu Wang and Yong Jiang and Pengjun Xie and Fei Huang and Soujanya Poria and Jingren Zhou},
      year={2025},
      eprint={2510.05137},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.05137},
}
```