WebDetective: Benchmark for Multi-hop Information Retrieval

WebDetective is an evaluation framework for assessing large language models' ability to perform multi-hop question answering using Wikipedia as the knowledge source. The benchmark evaluates models on their search capabilities, knowledge utilization, and answer generation quality.

Features

Multi-hop Question Answering: Evaluate models on questions requiring multiple reasoning steps across Wikipedia articles
Two Evaluation Modes: Simple (accuracy-focused) and holistic (comprehensive metrics)
Flexible Wikipedia Access: Support for MediaWiki API, local LMDB database, or online wiki services
Function Calling Support: Test models with and without function calling capabilities
Comprehensive Metrics: Track search scores, knowledge utilization, refusal behavior, and more

Installation

Prerequisites

Python 3.10
Conda (recommended)

Steps

Create a conda environment:

conda create -n WebDetective python=3.10
conda activate WebDetective

Install dependencies:
```
pip install -r requirement.txt
```
Download NLTK data (if needed):
```
import nltk
nltk.download('punkt')
```

Quick Start

Here's a minimal example to get started:

# Set up your evaluation
bash scripts/run.sh \
    "your-model-path" \
    "http://your-model-endpoint/v1" \
    "your-api-key" \
    "mask_wiki_200.jsonl" \
    "./output" \
    "holistic"

Model Configuration

Before evaluating your model, you need to configure it in utils/constant.py. Add your model to the appropriate category:

Model Categories

FC_MODELS (Function Calling Models)

Models that support function calling
Examples: GPT-4, Claude-3, etc.

FC_MODELS = [
    "gpt-4o-2024-11-20",
    "claude-3-7-sonnet-20250219",
    "your-model-name",  # Add your model here
]

NONTEMP_MODELS (Non-Temperature Models)

Models that don't support temperature parameter
Examples: o1, o3, reasoning models

NONTEMP_MODELS = [
    "o1-2024-12-17",
    "o3-mini-2025-01-31",
    "your-reasoning-model",  # Add your model here
]

REASONING_MODELS (Reasoning Effort Models)

Models that support reasoning_effort parameter
Examples: o-series models

REASONING_MODELS = [
    "o1-2024-12-17",
    "o3-2025-04-16",
    "your-reasoning-model",  # Add your model here
]

Environment Variables

Configure these variables in scripts/run.sh:

Required Variables

Variable	Description	Example
`MODEL_PATH`	Path or identifier of your model	`"gpt-4o-2024-11-20"`
`MODEL_ADDRESS`	API endpoint for your model	`"http://0.0.0.0:6001/v1"`
`MODEL_APIKEY`	API key for authentication	`"your-api-key"`
`DATASET`	Evaluation dataset file	`"mask_wiki_200.jsonl"`
`OUTPUT_PATH`	Directory to save results	`"./eval_output"`
`EVAL_MODE`	Evaluation mode	`"simple"` or `"holistic"`

Optional Variables

Variable	Default	Description
`MAX_CONTEXT_SIZE`	`31*1024-500`	Maximum context window size
`TEMPERATURE`	`0.6`	Generation temperature (ignored for NONTEMP_MODELS)
`REASONING_EFFORT`	`"high"`	Reasoning effort level: `"low"`, `"medium"`, or `"high"`
`WEBCONTENT_MAXLENGTH`	`150000`	Maximum length of web content to process
`MAX_WORKERS`	`10`	Number of parallel workers (be careful with local wiki)
`ROLLOUT_COUNT`	`1`	Number of evaluation rounds
`NLTK_DATA`	`"path/to/nltk_data"`	Path to NLTK data directory

Search Configuration

Variable	Description	Required
`SERPER_API_KEY`	API key for Serper search service	Yes
`SEARCH_API_URL`	Search API endpoint	Default: `"http://google.serper.dev/search"`

Wikipedia Access Configuration

Variable	Description	Default
`WIKI_URL_PREFIX`	Wikipedia URL prefix	`"https://en.wikipedia.org/wiki/"`
`WIKI_LMDB_PATH`	Path to local LMDB database	`"EMPTY"`
`WIKI_REQUEST_URL`	Online wiki service URL	`"EMPTY"`
`WIKI_REQUEST_AUTH_TOKEN`	Auth token for online service	`"EMPTY"`

Summary Model Configuration

Variable	Description	Default
`SUMMARY_MODEL_PATH`	Model for summarization	`"path/to/Qwen2.5-72B-Instruct"`
`SUMMARY_MODEL_ADDRESS`	Summary model endpoint	`"http://0.0.0.0:6002/v1"`
`SUMMARY_MODEL_APIKEY`	Summary model API key	`"EMPTY"`

Wikipedia Access Methods

WebDetective supports three methods to access Wikipedia content, each with different trade-offs:

1. MediaWiki API (Default)

When to use: Quick testing, small-scale evaluation

Setup: No setup required. Used automatically if both WIKI_LMDB_PATH and WIKI_REQUEST_URL are set to "EMPTY".

Pros:

No setup or downloads required
Always uses latest Wikipedia content

Cons:

Rate-limited by Wikipedia
Slow for large-scale evaluations
May experience timeouts

export WIKI_LMDB_PATH="EMPTY"
export WIKI_REQUEST_URL="EMPTY"

2. Local LMDB Database (Recommended)

When to use: Large-scale evaluations, best performance

Setup:

Download December 2024 Wikipedia dump
Convert to LMDB format (preprocessing script required)
Point to the LMDB directory

Pros:

Very fast read speeds
No rate limiting
Offline access

Cons:

Initial setup time required
Large disk space needed (~50GB+)
Static snapshot (December 2024)

export WIKI_LMDB_PATH="/path/to/wiki_lmdb_database"
export WIKI_REQUEST_URL="EMPTY"

3. Online Wiki Service

When to use: Medium-scale evaluations, when local setup is not feasible

Setup: Deploy or subscribe to an online Wikipedia API service

Pros:

Stable and reliable
No local storage required
Can be updated

Cons:

Costs money (pay per request)
Requires internet connection
Depends on service availability

export WIKI_LMDB_PATH="EMPTY"
export WIKI_REQUEST_URL="https://your-wiki-service.com/api"
export WIKI_REQUEST_AUTH_TOKEN="your-auth-token"

Running Evaluation

Basic Usage

bash scripts/run.sh \
    <MODEL_PATH> \
    <MODEL_ADDRESS> \
    <MODEL_APIKEY> \
    <DATASET> \
    <OUTPUT_PATH> \
    <EVAL_MODE>

Example 1: Simple Evaluation with GPT-4

bash scripts/run.sh \
    "gpt-4o-2024-11-20" \
    "https://api.openai.com/v1" \
    "sk-your-openai-key" \
    "mask_wiki_200.jsonl" \
    "./eval_output" \
    "simple"

Example 2: Holistic Evaluation with Local Model

bash scripts/run.sh \
    "/path/to/local-model" \
    "http://0.0.0.0:6001/v1" \
    "EMPTY" \
    "mask_wiki_200.jsonl" \
    "./eval_output" \
    "holistic"

Example 3: Evaluation with vLLM

The script automatically detects vLLM format addresses and starts the server if needed:

bash scripts/run.sh \
    "/path/to/model" \
    "http://0.0.0.0:6001/v1" \
    "EMPTY" \
    "mask_wiki_200.jsonl" \
    "./eval_output" \
    "holistic"

Understanding Results

Output Structure

After evaluation, results are organized as follows:

OUTPUT_PATH/
├── <model_name>/
│   └── <dataset_name>/
│       ├── iter1.jsonl       # First rollout results
│       ├── iter2.jsonl       # Second rollout results (if ROLLOUT_COUNT > 1)
│       ├── iter3.jsonl       # Third rollout results (if ROLLOUT_COUNT > 1)
│       └── eval_results.json # Detailed evaluation metrics
└── <dataset_name>_summary.jsonl  # Summary of all evaluations

Result Files

iter{N}.jsonl: Raw model outputs for each rollout
- Contains full conversation history
- Tool calls and responses
- Final predictions
- Metadata for each question
eval_results.json: Detailed metrics per round
- Per-question evaluation results
- Correctness judgments
- Search and generation scores
- Knowledge utilization metrics
{dataset}_summary.jsonl: Aggregated metrics
- Overall performance statistics
- Averaged across all rollouts
- Ready for analysis and comparison

Converting Results to Excel

For easier analysis, convert the summary to Excel format:

Edit wikieval_analysis.py:

dataset = "mask_wiki_200.jsonl"
summary_path = f"./eval_output/{dataset}_summary.jsonl"
excel_path = f"./eval_output/{dataset}_summary.xlsx"

Run the conversion:
```
python wikieval_analysis.py
```
View results: Open the generated Excel file to see formatted metrics with:
- Model comparisons
- Performance across difficulty levels
- Detailed statistics

Evaluation Metrics

Simple Mode Metrics

Pass@1: Percentage of questions answered correctly on first attempt
Pass@K: Percentage answered correctly in any of K attempts
Best Pass@1: Best single-round performance
Avg Pass@1: Average performance across all rounds

Holistic Mode Metrics

Primary Metrics

Search Score (0-100%)
- Measures quality of information retrieval
- Whether model found sufficient evidence to answer
- Higher is better
Generation Score (0-100%)
- Combined measure of knowledge utilization and appropriate refusal
- Formula: 0.5 * (Knowledge_Util_F1 + Good_Refusal_F1) * Knowledge_Sufficient_Rate
- Balances correct answering with knowing when to refuse
Pass@K: Overall accuracy metrics
- Same as simple mode but with additional context

Secondary Metrics

Good Refusal Metrics:
- Recall: % of insufficient-knowledge cases where model correctly refused
- Precision: % of refusals that were appropriate
- F1: Harmonic mean of recall and precision
Knowledge Utilization Metrics:
- Recall: % of sufficient-knowledge cases answered correctly
- Precision: % of non-refusal cases that were correct
- F1: Harmonic mean of recall and precision
Knowledge Sufficiency (0-100%):
- % of questions where retrieved information was sufficient
- Combines search results and parametric knowledge
RAG Knowledge Sufficiency (0-100%):
- % of questions answerable from retrieved info alone
- Excludes parametric knowledge
Knowledge Forget to Use (0-100%):
- % of cases where model had sufficient knowledge but failed to use it
- Lower is better
Knowledge Lead Astray (0-100%):
- % of cases where model had correct info but answered incorrectly
- Lower is better

Statistics

Avg Tool Use: Average number of tool calls per question
Avg Visit Tool Use: Average number of Wikipedia page visits
Avg Optimal Hop: Average number of reasoning hops in ground truth

Difficulty Breakdown

Results are also broken down by difficulty:

Easy: 2-3 hops required
Medium: 3-5 hops required
Hard: 5+ hops required

Advanced Usage

Custom Evaluation Script

You can also run evaluation directly with Python:

from utils.react_agent import MultiTurnReactAgent
from utils.prompt import SYSTEM_PROMPT_MULTI

# Configure your model
llm_cfg = {
    'model': 'your-model-path',
    'generate_cfg': {
        'temperature': 0.6,
        'top_p': 0.95,
        'num_retries': 10
    }
}

# Create agent
agent = MultiTurnReactAgent(
    llm=llm_cfg,
    function_list=["search", "visit"],
    system_message=SYSTEM_PROMPT_MULTI
)

# Run on your data
result = agent.run(data_item, user_prompt)

Analyzing Specific Questions

To analyze individual questions:

import json

# Load results
with open('eval_output/model/dataset/iter1.jsonl', 'r') as f:
    results = [json.loads(line) for line in f]

# Find specific question
for item in results:
    if 'your search term' in item['question']:
        print(f"Question: {item['question']}")
        print(f"Prediction: {item['prediction']}")
        print(f"Tool calls: {len([m for m in item['messages'] if m.get('tool_calls')])}")

Troubleshooting

Common Issues

Rate Limiting: If using MediaWiki API, reduce MAX_WORKERS or use local LMDB
Timeouts: Increase timeout values in the code or reduce WEBCONTENT_MAXLENGTH
Memory Issues: Reduce MAX_CONTEXT_SIZE or MAX_WORKERS
vLLM Server Not Starting: Check CUDA availability and model path

Getting Help

For issues or questions:

Check the error messages in console output
Review eval_results.json for detailed error information
Examine individual iter*.jsonl files for failed cases

Citation

If you use this benchmark in your research, please cite:

@misc{song2025demystifyingdeepsearchholistic,
    title={Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics}, 
    author={Maojia Song and Renhang Liu and Xinyu Wang and Yong Jiang and Pengjun Xie and Fei Huang and Soujanya Poria and Jingren Zhou},
    year={2025},
    eprint={2510.05137},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2510.05137}, 
}

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
eval_data		eval_data
eval_example		eval_example
scripts		scripts
utils		utils
README.md		README.md
evaluate.py		evaluate.py
requirement.txt		requirement.txt
run_multi_react.py		run_multi_react.py
wikieval_analysis.py		wikieval_analysis.py

Alibaba-NLP/WebDetective

Folders and files

Latest commit

History

Repository files navigation

WebDetective: Benchmark for Multi-hop Information Retrieval

Features

Table of Contents

Installation

Prerequisites

Steps

Quick Start

Model Configuration

Model Categories

Environment Variables

Required Variables

Optional Variables

Search Configuration

Wikipedia Access Configuration

Summary Model Configuration

Wikipedia Access Methods

1. MediaWiki API (Default)

2. Local LMDB Database (Recommended)

3. Online Wiki Service

Running Evaluation

Basic Usage

Example 1: Simple Evaluation with GPT-4

Example 2: Holistic Evaluation with Local Model

Example 3: Evaluation with vLLM

Understanding Results

Output Structure

Result Files

Converting Results to Excel

Evaluation Metrics

Simple Mode Metrics

Holistic Mode Metrics

Primary Metrics

Secondary Metrics

Statistics

Difficulty Breakdown

Advanced Usage

Custom Evaluation Script

Analyzing Specific Questions

Troubleshooting

Common Issues

Getting Help

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages