WebDetective: Benchmark for Multi-hop Information Retrieval

WebDetective is an evaluation framework for assessing large language models' ability to perform multi-hop question answering using Wikipedia as the knowledge source. The benchmark evaluates models on their search capabilities, knowledge utilization, and answer generation quality.

Features

  • Multi-hop Question Answering: Evaluate models on questions requiring multiple reasoning steps across Wikipedia articles
  • Two Evaluation Modes: Simple (accuracy-focused) and holistic (comprehensive metrics)
  • Flexible Wikipedia Access: Support for MediaWiki API, local LMDB database, or online wiki services
  • Function Calling Support: Test models with and without function calling capabilities
  • Comprehensive Metrics: Track search scores, knowledge utilization, refusal behavior, and more

Table of Contents

  • Installation
  • Quick Start
  • Model Configuration
  • Environment Variables
  • Wikipedia Access Methods
  • Running Evaluation
  • Understanding Results
  • Evaluation Metrics
  • Advanced Usage
  • Troubleshooting
  • Citation

Installation

Prerequisites

  • Python 3.10
  • Conda (recommended)

Steps

  1. Create a conda environment:

    conda create -n WebDetective python=3.10
    conda activate WebDetective
  2. Install dependencies:

    pip install -r requirement.txt
  3. Download NLTK data (if needed):

    import nltk
    nltk.download('punkt')
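
The same download can also be run from the shell in one line:

    python -c "import nltk; nltk.download('punkt')"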

Quick Start

Here's a minimal example to get started:

# Run an evaluation in holistic mode
bash scripts/run.sh \
    "your-model-path" \
    "http://your-model-endpoint/v1" \
    "your-api-key" \
    "mask_wiki_200.jsonl" \
    "./output" \
    "holistic"

Model Configuration

Before evaluating your model, you need to configure it in utils/constant.py. Add your model to the appropriate category:

Model Categories

  1. FC_MODELS (Function Calling Models)

    • Models that support function calling
    • Examples: GPT-4, Claude-3, etc.
    FC_MODELS = [
        "gpt-4o-2024-11-20",
        "claude-3-7-sonnet-20250219",
        "your-model-name",  # Add your model here
    ]
  2. NONTEMP_MODELS (Non-Temperature Models)

    • Models that do not support the temperature parameter
    • Examples: o1, o3, reasoning models
    NONTEMP_MODELS = [
        "o1-2024-12-17",
        "o3-mini-2025-01-31",
        "your-reasoning-model",  # Add your model here
    ]
  3. REASONING_MODELS (Reasoning Effort Models)

    • Models that support the reasoning_effort parameter
    • Examples: o-series models
    REASONING_MODELS = [
        "o1-2024-12-17",
        "o3-2025-04-16",
        "your-reasoning-model",  # Add your model here
    ]

Environment Variables

Configure these variables in scripts/run.sh:

Required Variables

| Variable | Description | Example |
| --- | --- | --- |
| MODEL_PATH | Path or identifier of your model | "gpt-4o-2024-11-20" |
| MODEL_ADDRESS | API endpoint for your model | "http://0.0.0.0:6001/v1" |
| MODEL_APIKEY | API key for authentication | "your-api-key" |
| DATASET | Evaluation dataset file | "mask_wiki_200.jsonl" |
| OUTPUT_PATH | Directory to save results | "./eval_output" |
| EVAL_MODE | Evaluation mode | "simple" or "holistic" |

Optional Variables

| Variable | Default | Description |
| --- | --- | --- |
| MAX_CONTEXT_SIZE | 31*1024-500 | Maximum context window size |
| TEMPERATURE | 0.6 | Generation temperature (ignored for NONTEMP_MODELS) |
| REASONING_EFFORT | "high" | Reasoning effort level: "low", "medium", or "high" |
| WEBCONTENT_MAXLENGTH | 150000 | Maximum length of web content to process |
| MAX_WORKERS | 10 | Number of parallel workers (be careful with a local wiki) |
| ROLLOUT_COUNT | 1 | Number of evaluation rounds |
| NLTK_DATA | "path/to/nltk_data" | Path to the NLTK data directory |

Search Configuration

| Variable | Description | Required / Default |
| --- | --- | --- |
| SERPER_API_KEY | API key for the Serper search service | Required |
| SEARCH_API_URL | Search API endpoint | Default: "http://google.serper.dev/search" |

Wikipedia Access Configuration

| Variable | Description | Default |
| --- | --- | --- |
| WIKI_URL_PREFIX | Wikipedia URL prefix | "https://en.wikipedia.org/wiki/" |
| WIKI_LMDB_PATH | Path to a local LMDB database | "EMPTY" |
| WIKI_REQUEST_URL | Online wiki service URL | "EMPTY" |
| WIKI_REQUEST_AUTH_TOKEN | Auth token for the online service | "EMPTY" |

Summary Model Configuration

| Variable | Description | Default |
| --- | --- | --- |
| SUMMARY_MODEL_PATH | Model used for summarization | "path/to/Qwen2.5-72B-Instruct" |
| SUMMARY_MODEL_ADDRESS | Summary model endpoint | "http://0.0.0.0:6002/v1" |
| SUMMARY_MODEL_APIKEY | Summary model API key | "EMPTY" |
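
As a concrete illustration, the search and summary-model settings from the two tables above could be set like this before launching the script (the key, paths, and endpoints are placeholders; depending on your setup you may prefer to edit the corresponding assignments inside scripts/run.sh rather than exporting them):

export SERPER_API_KEY="your-serper-key"
export SEARCH_API_URL="http://google.serper.dev/search"
export SUMMARY_MODEL_PATH="path/to/Qwen2.5-72B-Instruct"
export SUMMARY_MODEL_ADDRESS="http://0.0.0.0:6002/v1"
export SUMMARY_MODEL_APIKEY="EMPTY"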

Wikipedia Access Methods

WebDetective supports three methods to access Wikipedia content, each with different trade-offs:

1. MediaWiki API (Default)

When to use: Quick testing, small-scale evaluation

Setup: No setup required. Used automatically if both WIKI_LMDB_PATH and WIKI_REQUEST_URL are set to "EMPTY".

Pros:

  • No setup or downloads required
  • Always uses latest Wikipedia content

Cons:

  • Rate-limited by Wikipedia
  • Slow for large-scale evaluations
  • May experience timeouts

Configuration:

export WIKI_LMDB_PATH="EMPTY"
export WIKI_REQUEST_URL="EMPTY"
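
For reference, the kind of lookup this backend performs can be reproduced with the public MediaWiki API in a few lines of Python. This is an illustrative sketch of the public API, not the exact request WebDetective issues internally:

import requests

# Fetch the plain-text extract of one article via the public MediaWiki API.
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,       # plain text instead of HTML
        "titles": "Alan Turing",
        "format": "json",
    },
    timeout=30,
)
page = next(iter(resp.json()["query"]["pages"].values()))
print(page["title"], len(page["extract"]), "characters retrieved")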

2. Local LMDB Database (Recommended)

When to use: Large-scale evaluations, best performance

Setup:

  1. Download the December 2024 Wikipedia dump
  2. Convert it to LMDB format (preprocessing script required)
  3. Point WIKI_LMDB_PATH to the resulting LMDB directory

Pros:

  • Very fast read speeds
  • No rate limiting
  • Offline access

Cons:

  • Initial setup time required
  • Large disk space needed (~50GB+)
  • Static snapshot (December 2024)

Configuration:

export WIKI_LMDB_PATH="/path/to/wiki_lmdb_database"
export WIKI_REQUEST_URL="EMPTY"
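
Once the database is built, a quick sanity check is to open it read-only with the lmdb package. The key encoding below (UTF-8 page titles) is an assumption made for illustration and may differ from what the preprocessing script actually produces:

import lmdb

# Open the converted database read-only; lock=False avoids writer locks
# when many evaluation workers read concurrently.
env = lmdb.open("/path/to/wiki_lmdb_database", readonly=True, lock=False)

with env.begin() as txn:
    print("entries stored:", txn.stat()["entries"])
    # Hypothetical lookup: assumes articles are keyed by their UTF-8 page title.
    value = txn.get("Alan Turing".encode("utf-8"))
    print("sample title found:", value is not None)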

3. Online Wiki Service

When to use: Medium-scale evaluations, when local setup is not feasible

Setup: Deploy or subscribe to an online Wikipedia API service

Pros:

  • Stable and reliable
  • No local storage required
  • Can be updated

Cons:

  • Costs money (pay per request)
  • Requires internet connection
  • Depends on service availability

Configuration:

export WIKI_LMDB_PATH="EMPTY"
export WIKI_REQUEST_URL="https://your-wiki-service.com/api"
export WIKI_REQUEST_AUTH_TOKEN="your-auth-token"

Running Evaluation

Basic Usage

bash scripts/run.sh \
    <MODEL_PATH> \
    <MODEL_ADDRESS> \
    <MODEL_APIKEY> \
    <DATASET> \
    <OUTPUT_PATH> \
    <EVAL_MODE>

Example 1: Simple Evaluation with GPT-4

bash scripts/run.sh \
    "gpt-4o-2024-11-20" \
    "https://api.openai.com/v1" \
    "sk-your-openai-key" \
    "mask_wiki_200.jsonl" \
    "./eval_output" \
    "simple"

Example 2: Holistic Evaluation with Local Model

bash scripts/run.sh \
    "/path/to/local-model" \
    "http://0.0.0.0:6001/v1" \
    "EMPTY" \
    "mask_wiki_200.jsonl" \
    "./eval_output" \
    "holistic"

Example 3: Evaluation with vLLM

The script automatically detects vLLM format addresses and starts the server if needed:

bash scripts/run.sh \
    "/path/to/model" \
    "http://0.0.0.0:6001/v1" \
    "EMPTY" \
    "mask_wiki_200.jsonl" \
    "./eval_output" \
    "holistic"

Understanding Results

Output Structure

After evaluation, results are organized as follows:

OUTPUT_PATH/
├── <model_name>/
│   └── <dataset_name>/
│       ├── iter1.jsonl       # First rollout results
│       ├── iter2.jsonl       # Second rollout results (if ROLLOUT_COUNT > 1)
│       ├── iter3.jsonl       # Third rollout results (if ROLLOUT_COUNT > 2)
│       └── eval_results.json # Detailed evaluation metrics
└── <dataset_name>_summary.jsonl  # Summary of all evaluations

Result Files

  1. iter{N}.jsonl: Raw model outputs for each rollout

    • Contains full conversation history
    • Tool calls and responses
    • Final predictions
    • Metadata for each question
  2. eval_results.json: Detailed metrics per round

    • Per-question evaluation results
    • Correctness judgments
    • Search and generation scores
    • Knowledge utilization metrics
  3. {dataset}_summary.jsonl: Aggregated metrics

    • Overall performance statistics
    • Averaged across all rollouts
    • Ready for analysis and comparison
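
As a starting point for your own analysis, the summary file can be read like any other JSONL. The exact metric field names depend on the evaluation mode, so this sketch simply inspects whatever keys each entry carries (the path mirrors the one used in wikieval_analysis.py below):

import json

summary_path = "./eval_output/mask_wiki_200.jsonl_summary.jsonl"

# Each line summarizes one evaluated model/dataset run.
with open(summary_path, "r") as f:
    summaries = [json.loads(line) for line in f]

for entry in summaries:
    print(sorted(entry.keys()))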

Converting Results to Excel

For easier analysis, convert the summary to Excel format:

  1. Edit wikieval_analysis.py:

    dataset = "mask_wiki_200.jsonl"
    summary_path = f"./eval_output/{dataset}_summary.jsonl"
    excel_path = f"./eval_output/{dataset}_summary.xlsx"
  2. Run the conversion:

    python wikieval_analysis.py
  3. View results: Open the generated Excel file to see formatted metrics with:

    • Model comparisons
    • Performance across difficulty levels
    • Detailed statistics

Evaluation Metrics

Simple Mode Metrics

  • Pass@1: Percentage of questions answered correctly on first attempt
  • Pass@K: Percentage answered correctly in any of K attempts
  • Best Pass@1: Best single-round performance
  • Avg Pass@1: Average performance across all rounds
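
Concretely, these aggregates could be computed from a boolean correctness table (questions × rollouts) as in the sketch below; this mirrors the definitions above, not necessarily the benchmark's internal implementation:

from statistics import mean

# correct[q][r] is True if question q was answered correctly in rollout r.
correct = [
    [True, False, True],
    [False, False, True],
    [True, True, True],
]
n_rollouts = len(correct[0])

pass_at_1 = mean(row[0] for row in correct)        # first attempt only
pass_at_k = mean(any(row) for row in correct)      # correct in any of K rollouts
per_round = [mean(row[r] for row in correct) for r in range(n_rollouts)]
best_pass_at_1 = max(per_round)                    # best single round
avg_pass_at_1 = mean(per_round)                    # averaged across rounds

print(pass_at_1, pass_at_k, best_pass_at_1, avg_pass_at_1)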

Holistic Mode Metrics

Primary Metrics

  1. Search Score (0-100%)

    • Measures quality of information retrieval
    • Whether model found sufficient evidence to answer
    • Higher is better
  2. Generation Score (0-100%)

    • Combined measure of knowledge utilization and appropriate refusal
    • Formula: 0.5 * (Knowledge_Util_F1 + Good_Refusal_F1) * Knowledge_Sufficient_Rate (see the sketch after this list)
    • Balances correct answering with knowing when to refuse
  3. Pass@K: Overall accuracy metrics

    • Same as simple mode but with additional context
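
A minimal sketch of how the Generation Score can be assembled from its components, assuming the two F1 scores and the knowledge-sufficiency rate are already expressed as fractions in [0, 1]; the function and variable names are illustrative, not the benchmark's internal ones:

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def generation_score(knowledge_util_f1: float,
                     good_refusal_f1: float,
                     knowledge_sufficient_rate: float) -> float:
    """0.5 * (Knowledge_Util_F1 + Good_Refusal_F1) * Knowledge_Sufficient_Rate."""
    return 0.5 * (knowledge_util_f1 + good_refusal_f1) * knowledge_sufficient_rate

# Example with illustrative numbers.
ku_f1 = f1(precision=0.80, recall=0.70)   # knowledge utilization F1
gr_f1 = f1(precision=0.60, recall=0.75)   # good refusal F1
print(generation_score(ku_f1, gr_f1, knowledge_sufficient_rate=0.85))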

Secondary Metrics

  1. Good Refusal Metrics:

    • Recall: % of insufficient-knowledge cases where model correctly refused
    • Precision: % of refusals that were appropriate
    • F1: Harmonic mean of recall and precision
  2. Knowledge Utilization Metrics:

    • Recall: % of sufficient-knowledge cases answered correctly
    • Precision: % of non-refusal cases that were correct
    • F1: Harmonic mean of recall and precision
  3. Knowledge Sufficiency (0-100%):

    • % of questions where retrieved information was sufficient
    • Combines search results and parametric knowledge
  4. RAG Knowledge Sufficiency (0-100%):

    • % of questions answerable from retrieved info alone
    • Excludes parametric knowledge
  5. Knowledge Forget to Use (0-100%):

    • % of cases where model had sufficient knowledge but failed to use it
    • Lower is better
  6. Knowledge Lead Astray (0-100%):

    • % of cases where model had correct info but answered incorrectly
    • Lower is better

Statistics

  • Avg Tool Use: Average number of tool calls per question
  • Avg Visit Tool Use: Average number of Wikipedia page visits
  • Avg Optimal Hop: Average number of reasoning hops in ground truth

Difficulty Breakdown

Results are also broken down by difficulty:

  • Easy: 2-3 hops required
  • Medium: 3-5 hops required
  • Hard: 5+ hops required

Advanced Usage

Custom Evaluation Script

You can also run evaluation directly with Python:

from utils.react_agent import MultiTurnReactAgent
from utils.prompt import SYSTEM_PROMPT_MULTI

# Configure your model
llm_cfg = {
    'model': 'your-model-path',
    'generate_cfg': {
        'temperature': 0.6,
        'top_p': 0.95,
        'num_retries': 10
    }
}

# Create agent
agent = MultiTurnReactAgent(
    llm=llm_cfg,
    function_list=["search", "visit"],
    system_message=SYSTEM_PROMPT_MULTI
)

# Run on a single item; data_item is one entry from the evaluation dataset
# and user_prompt is the corresponding question prompt
result = agent.run(data_item, user_prompt)

Analyzing Specific Questions

To analyze individual questions:

import json

# Load results
with open('eval_output/model/dataset/iter1.jsonl', 'r') as f:
    results = [json.loads(line) for line in f]

# Find specific question
for item in results:
    if 'your search term' in item['question']:
        print(f"Question: {item['question']}")
        print(f"Prediction: {item['prediction']}")
        print(f"Tool calls: {len([m for m in item['messages'] if m.get('tool_calls')])}")

Troubleshooting

Common Issues

  1. Rate Limiting: If using the MediaWiki API, reduce MAX_WORKERS or switch to a local LMDB database
  2. Timeouts: Increase timeout values in the code or reduce WEBCONTENT_MAXLENGTH
  3. Memory Issues: Reduce MAX_CONTEXT_SIZE or MAX_WORKERS
  4. vLLM Server Not Starting: Check CUDA availability and model path

Getting Help

For issues or questions:

  1. Check the error messages in console output
  2. Review eval_results.json for detailed error information
  3. Examine individual iter*.jsonl files for failed cases

Citation

If you use this benchmark in your research, please cite:

@misc{song2025demystifyingdeepsearchholistic,
    title={Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics}, 
    author={Maojia Song and Renhang Liu and Xinyu Wang and Yong Jiang and Pengjun Xie and Fei Huang and Soujanya Poria and Jingren Zhou},
    year={2025},
    eprint={2510.05137},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2510.05137}, 
}

About

A new evaluation paradigm for deep search that identifies specific LLM failure sources, introduces challenging hint-free datasets with holistic evaluation, and offers a strong baseline incorporating memory and verification.
