@akshathmangudi akshathmangudi commented Jan 19, 2026

ISSUE SUMMARY

Resolves #1109

This PR implements evaluation of RAG systems in LightEval. It provides a flexible adapter pattern that allows users to plug in any retriever and generator combination, enabling evaluation of RAG systems on LightEval benchmarks.

The implementation provides:

  1. RAGAdapterModel: a base class that implements the LightevalModel interface.
  2. Interfaces to support any retriever/generator implementation.
  3. A working example using sentence transformers for retrieval and T5 for generation.

The RAG adapter works as follows:

  1. Receives standard Doc objects.
  2. Performs retrieval internally using the query.
  3. Augments the prompt with the retrieved context.
  4. Generates a response using the generator.
  5. Returns standard ModelResponse objects.

This allows RAG systems to be evaluated on benchmarks like TriviaQA, MMLU, etc. using the same metrics (exact_match, F1, ROUGE) as traditional language models.
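The retrieve → augment → generate flow can be sketched as a small standalone function. This is illustrative only: the function and the toy retriever/generator below are hypothetical stand-ins, not the actual RAGAdapterModel internals.

```python
# Illustrative sketch of the adapter's retrieve -> augment -> generate flow.
# All names here are hypothetical, not the real RAGAdapterModel implementation.

def rag_answer(query, retrieve, generate, top_k=3):
    docs = retrieve(query, top_k)                        # step 2: retrieval
    context = "\n\n".join(d["text"] for d in docs)       # step 3: augmentation
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)                              # step 4: generation

# Toy components: a one-document corpus and a generator that "reads" the context.
corpus = [{"text": "Paris is the capital of France."}]
answer = rag_answer(
    "What is the capital of France?",
    retrieve=lambda q, k: corpus[:k],
    generate=lambda p: "Paris" if "Paris" in p else "unknown",
)
```

The real adapter does the same thing, but receives the query from a Doc object and wraps the output in a ModelResponse.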

You can take a look at the example provided in examples/custom_models/rag_model_example.py.

Quick Start (using the provided example)

lighteval custom \
    "rag-flan" \
    "examples/custom_models/rag_model_example.py" \
    "triviaqa" \
    --max-samples 10 \
    --save-details

To implement your own RAG Model

Step 1. Implement Retriever

from lighteval.models.rag.rag_model import RetrieverProtocol, RetrievedDocument

class MyRetriever(RetrieverProtocol):
    def retrieve(self, query: str, top_k: int = 5) -> list[RetrievedDocument]:
        # Your retrieval logic here (FAISS, BM25, etc.)
        return [
            RetrievedDocument(text="...", score=0.95, metadata={"doc_id": 123})
        ]
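As a concrete illustration, here is a toy keyword-overlap retriever. It defines a local stand-in for RetrievedDocument so the snippet runs standalone; in practice you would import it from lighteval.models.rag.rag_model as shown above.

```python
from dataclasses import dataclass, field

@dataclass
class RetrievedDocument:  # local stand-in for lighteval's RetrievedDocument
    text: str
    score: float
    metadata: dict = field(default_factory=dict)

class KeywordRetriever:
    """Toy retriever: scores each document by the fraction of query words it contains."""

    def __init__(self, corpus: list[str]):
        self.corpus = corpus

    def retrieve(self, query: str, top_k: int = 5) -> list[RetrievedDocument]:
        words = set(query.lower().split())
        scored = [
            RetrievedDocument(
                text=doc,
                score=len(words & set(doc.lower().split())) / len(words),
                metadata={"doc_id": i},
            )
            for i, doc in enumerate(self.corpus)
        ]
        scored.sort(key=lambda d: d.score, reverse=True)
        return scored[:top_k]

retriever = KeywordRetriever([
    "The Eiffel Tower is in Paris.",
    "Mount Everest is the tallest mountain.",
])
top = retriever.retrieve("Where is the Eiffel Tower?", top_k=1)
```

A production retriever would replace the scoring loop with FAISS, BM25, or a dense embedding index, but the interface stays the same.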

Step 2. Implement Generator

from typing import Optional

from lighteval.models.rag.rag_model import GeneratorProtocol

class MyGenerator(GeneratorProtocol):
    def generate(
        self, 
        prompt: str, 
        max_new_tokens: Optional[int] = None, 
        stop_sequences: Optional[list[str]] = None, 
        **kwargs
    ) -> str:
        # Your generation logic here (Transformers, vLLM, TGI, etc.)
        return "Generated answer"
    
    # Optional: provide tokenizer for token counting
    @property
    def tokenizer(self):
        return self._tokenizer
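To make the contract concrete, here is a toy generator that shows one way to honor max_new_tokens and stop_sequences. The class and its word-level "token" budget are illustrative assumptions, not lighteval code; a real generator would delegate to Transformers, vLLM, or TGI.

```python
from typing import Optional

class CannedGenerator:
    """Toy generator illustrating how max_new_tokens and stop_sequences
    might be honored. Returns a canned completion instead of sampling."""

    def __init__(self, completion: str):
        self.completion = completion

    def generate(
        self,
        prompt: str,
        max_new_tokens: Optional[int] = None,
        stop_sequences: Optional[list[str]] = None,
        **kwargs,
    ) -> str:
        text = self.completion
        # Truncate at the first stop sequence that occurs, if any.
        for stop in stop_sequences or []:
            idx = text.find(stop)
            if idx != -1:
                text = text[:idx]
        # Crude token budget: cap by whitespace-delimited words.
        if max_new_tokens is not None:
            text = " ".join(text.split()[:max_new_tokens])
        return text

gen = CannedGenerator("Paris is the capital.\nQuestion: next one")
out = gen.generate("ignored", max_new_tokens=3, stop_sequences=["\nQuestion:"])
```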

Step 3. Create RAG Model

from lighteval.models.rag.rag_model import RAGAdapterModel
from lighteval.models.custom.custom_model import CustomModelConfig

class MyRAGModel(RAGAdapterModel):
    def __init__(self, config: CustomModelConfig):
        retriever = MyRetriever()
        generator = MyGenerator()
        super().__init__(config, retriever, generator, top_k=5)

Step 4. Evaluate

lighteval custom "my-rag-model" "path/to/my_rag_model.py" "triviaqa"

Or, using the Python API:

from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.models.custom.custom_model import CustomModelConfig
from lighteval.pipeline import Pipeline, PipelineParameters, ParallelismManager

evaluation_tracker = EvaluationTracker(output_dir="results", save_details=True)
pipeline_params = PipelineParameters(launcher_type=ParallelismManager.CUSTOM)

model_config = CustomModelConfig(
    model_name="my-rag-model",
    model_definition_file_path="path/to/my_rag_model.py"
)

pipeline = Pipeline(
    tasks="triviaqa",
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    model_config=model_config
)

pipeline.evaluate()
pipeline.save_and_push_results()

Limitations

  1. The current implementation focuses on open-ended QA. Multiple-choice tasks would need additional logic to map generated text to choices.
  2. Different benchmarks may require different normalization strategies. The example provides a TriviaQA-compatible normalization.
  3. The example above uses simple cosine similarity. Production systems might use more sophisticated retrieval (reranking, hybrid search, etc.).
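For limitation 1, one possible approach (a sketch, not part of this PR) is to map generated text to the closest choice by normalized string matching:

```python
import string

def map_to_choice(generated: str, choices: list[str]) -> int:
    """Map free-form generated text to a choice index via normalized matching.
    Hypothetical helper; returns -1 when no choice matches."""

    def norm(s: str) -> str:
        return s.lower().strip().translate(str.maketrans("", "", string.punctuation))

    g = norm(generated)
    # Prefer an exact normalized match, then fall back to substring containment.
    for i, c in enumerate(choices):
        if norm(c) == g:
            return i
    for i, c in enumerate(choices):
        if norm(c) in g or g in norm(c):
            return i
    return -1  # no match; the caller may mark the sample incorrect

idx = map_to_choice("The answer is Paris.", ["London", "Paris", "Berlin"])
```

More robust variants could use log-likelihood scoring of each choice, which is how LightEval handles multiple choice for standard models.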

@akshathmangudi akshathmangudi marked this pull request as ready for review January 23, 2026 13:55
Copilot AI review requested due to automatic review settings January 23, 2026 13:55

Copilot AI left a comment


Pull request overview

This PR implements support for evaluating Retrieval-Augmented Generation (RAG) systems within LightEval, addressing issue #1109. It introduces a flexible adapter pattern that allows users to plug in any retriever and generator combination to evaluate RAG systems on existing LightEval benchmarks.

Changes:

  • Added RAGAdapterModel base class implementing the LightevalModel interface with protocols for retriever and generator components
  • Extended ModelResponse dataclass with an optional metadata field for storing retrieval information
  • Provided a working example implementation using sentence transformers for retrieval and T5 for generation with a TriviaQA-focused document corpus

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 17 comments.

Changed files:

  • src/lighteval/models/rag/rag_model.py — Core RAG adapter implementation with RetrieverProtocol, GeneratorProtocol, the ContextFormatter utility class, and the RAGAdapterModel base class
  • src/lighteval/models/model_output.py — Added optional metadata field to ModelResponse for storing retrieval information and other model-specific data
  • examples/custom_models/rag_model_example.py — Complete working example with SimpleVectorRetriever and SimpleGenerator demonstrating RAG evaluation on TriviaQA-style tasks
  • src/lighteval/models/custom/rag_adapters.py — Placeholder file marked "TO BE IMPLEMENTED"



Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.




Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 11 comments.



@akshathmangudi

cc: @NathanHB



Development

Successfully merging this pull request may close these issues.

[FT] Support for retriever-augmented and latent-memory models.
