@akshathmangudi akshathmangudi commented Jan 19, 2026

ISSUE SUMMARY

Resolves #1109

This PR implements evaluation of RAG systems in LightEval. It provides a flexible adapter pattern that allows users to plug in any retriever and generator combination, enabling evaluation of RAG systems on LightEval benchmarks.

The implementation provides:

  1. RAGAdapterModel: a base class that implements the LightevalModel interface.
  2. Interfaces to support any retriever/generator implementation.
  3. A working example using sentence transformers for retrieval and T5 for generation.

The RAG adapter works as follows:

  1. Receives standard Doc objects.
  2. Performs retrieval internally using the query.
  3. Augments the prompt with the retrieved context.
  4. Generates a response using the generator.
  5. Returns standard ModelResponse objects.

This allows RAG systems to be evaluated on benchmarks like TriviaQA, MMLU, etc. using the same metrics (exact_match, F1, ROUGE) as traditional language models.
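The retrieve → augment → generate flow can be sketched as a small standalone function. This is illustrative only: the function and the toy retriever/generator below are hypothetical stand-ins, not the actual RAGAdapterModel internals.

```python
# Illustrative sketch of the adapter's retrieve -> augment -> generate flow.
# All names here are hypothetical, not the real RAGAdapterModel implementation.

def rag_answer(query, retrieve, generate, top_k=3):
    docs = retrieve(query, top_k)                        # step 2: retrieval
    context = "\n\n".join(d["text"] for d in docs)       # step 3: augmentation
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)                              # step 4: generation

# Toy components: a one-document corpus and a generator that "reads" the context.
corpus = [{"text": "Paris is the capital of France."}]
answer = rag_answer(
    "What is the capital of France?",
    retrieve=lambda q, k: corpus[:k],
    generate=lambda p: "Paris" if "Paris" in p else "unknown",
)
```

The real adapter does the same thing, but receives the query from a Doc object and wraps the output in a ModelResponse.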

You can take a look at the example provided in examples/custom_models/rag_model_example.py.

Quick Start (using the provided example)

lighteval custom \
    "rag-flan" \
    "examples/custom_models/rag_model_example.py" \
    "triviaqa" \
    --max-samples 10 \
    --save-details

To implement your own RAG Model

Step 1. Implement Retriever

from lighteval.models.rag.rag_model import RetrieverProtocol, RetrievedDocument

class MyRetriever(RetrieverProtocol):
    def retrieve(self, query: str, top_k: int = 5) -> list[RetrievedDocument]:
        # Your retrieval logic here (FAISS, BM25, etc.)
        return [
            RetrievedDocument(text="...", score=0.95, metadata={"doc_id": 123})
        ]
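As a concrete illustration, here is a toy keyword-overlap retriever. It defines a local stand-in for RetrievedDocument so the snippet runs standalone; in practice you would import it from lighteval.models.rag.rag_model as shown above.

```python
from dataclasses import dataclass, field

@dataclass
class RetrievedDocument:  # local stand-in for lighteval's RetrievedDocument
    text: str
    score: float
    metadata: dict = field(default_factory=dict)

class KeywordRetriever:
    """Toy retriever: scores each document by the fraction of query words it contains."""

    def __init__(self, corpus: list[str]):
        self.corpus = corpus

    def retrieve(self, query: str, top_k: int = 5) -> list[RetrievedDocument]:
        words = set(query.lower().split())
        scored = [
            RetrievedDocument(
                text=doc,
                score=len(words & set(doc.lower().split())) / len(words),
                metadata={"doc_id": i},
            )
            for i, doc in enumerate(self.corpus)
        ]
        scored.sort(key=lambda d: d.score, reverse=True)
        return scored[:top_k]

retriever = KeywordRetriever([
    "The Eiffel Tower is in Paris.",
    "Mount Everest is the tallest mountain.",
])
top = retriever.retrieve("Where is the Eiffel Tower?", top_k=1)
```

A production retriever would replace the scoring loop with FAISS, BM25, or a dense embedding index, but the interface stays the same.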

Step 2. Implement Generator

from typing import Optional

from lighteval.models.rag.rag_model import GeneratorProtocol

class MyGenerator(GeneratorProtocol):
    def generate(
        self, 
        prompt: str, 
        max_new_tokens: Optional[int] = None, 
        stop_sequences: Optional[list[str]] = None, 
        **kwargs
    ) -> str:
        # Your generation logic here (Transformers, vLLM, TGI, etc.)
        return "Generated answer"
    
    # Optional: provide tokenizer for token counting
    @property
    def tokenizer(self):
        return self._tokenizer
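To make the contract concrete, here is a toy generator that shows one way to honor max_new_tokens and stop_sequences. The class and its word-level "token" budget are illustrative assumptions, not lighteval code; a real generator would delegate to Transformers, vLLM, or TGI.

```python
from typing import Optional

class CannedGenerator:
    """Toy generator illustrating how max_new_tokens and stop_sequences
    might be honored. Returns a canned completion instead of sampling."""

    def __init__(self, completion: str):
        self.completion = completion

    def generate(
        self,
        prompt: str,
        max_new_tokens: Optional[int] = None,
        stop_sequences: Optional[list[str]] = None,
        **kwargs,
    ) -> str:
        text = self.completion
        # Truncate at the first stop sequence that occurs, if any.
        for stop in stop_sequences or []:
            idx = text.find(stop)
            if idx != -1:
                text = text[:idx]
        # Crude token budget: cap by whitespace-delimited words.
        if max_new_tokens is not None:
            text = " ".join(text.split()[:max_new_tokens])
        return text

gen = CannedGenerator("Paris is the capital.\nQuestion: next one")
out = gen.generate("ignored", max_new_tokens=3, stop_sequences=["\nQuestion:"])
```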

Step 3. Create RAG Model

from lighteval.models.rag.rag_model import RAGAdapterModel
from lighteval.models.custom.custom_model import CustomModelConfig

class MyRAGModel(RAGAdapterModel):
    def __init__(self, config: CustomModelConfig):
        retriever = MyRetriever()
        generator = MyGenerator()
        super().__init__(config, retriever, generator, top_k=5)

Step 4. Evaluate

lighteval custom "my-rag-model" "path/to/my_rag_model.py" "triviaqa"

Or, using the Python API:

from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.models.custom.custom_model import CustomModelConfig
from lighteval.pipeline import Pipeline, PipelineParameters, ParallelismManager

evaluation_tracker = EvaluationTracker(output_dir="results", save_details=True)
pipeline_params = PipelineParameters(launcher_type=ParallelismManager.CUSTOM)

model_config = CustomModelConfig(
    model_name="my-rag-model",
    model_definition_file_path="path/to/my_rag_model.py"
)

pipeline = Pipeline(
    tasks="triviaqa",
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    model_config=model_config
)

pipeline.evaluate()
pipeline.save_and_push_results()

Limitations

  1. The current implementation focuses on open-ended QA. Multiple-choice tasks would need additional logic to map generated text to choices.
  2. Different benchmarks may require different normalization strategies. The example provides a TriviaQA-compatible normalization.
  3. The example above uses simple cosine similarity. Production systems might use more sophisticated retrieval (reranking, hybrid search, etc.).
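For limitation 1, one possible approach (a sketch, not part of this PR) is to map generated text to the closest choice by normalized string matching:

```python
import string

def map_to_choice(generated: str, choices: list[str]) -> int:
    """Map free-form generated text to a choice index via normalized matching.
    Hypothetical helper; returns -1 when no choice matches."""

    def norm(s: str) -> str:
        return s.lower().strip().translate(str.maketrans("", "", string.punctuation))

    g = norm(generated)
    # Prefer an exact normalized match, then fall back to substring containment.
    for i, c in enumerate(choices):
        if norm(c) == g:
            return i
    for i, c in enumerate(choices):
        if norm(c) in g or g in norm(c):
            return i
    return -1  # no match; the caller may mark the sample incorrect

idx = map_to_choice("The answer is Paris.", ["London", "Paris", "Berlin"])
```

More robust variants could use log-likelihood scoring of each choice, which is how LightEval handles multiple choice for standard models.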

@akshathmangudi akshathmangudi marked this pull request as ready for review January 23, 2026 13:55
Copilot AI review requested due to automatic review settings January 23, 2026 13:55

Copilot AI left a comment


Pull request overview

This PR implements support for evaluating Retrieval-Augmented Generation (RAG) systems within LightEval, addressing issue #1109. It introduces a flexible adapter pattern that allows users to plug in any retriever and generator combination to evaluate RAG systems on existing LightEval benchmarks.

Changes:

  • Added RAGAdapterModel base class implementing the LightevalModel interface with protocols for retriever and generator components
  • Extended ModelResponse dataclass with an optional metadata field for storing retrieval information
  • Provided a working example implementation using sentence transformers for retrieval and T5 for generation with a TriviaQA-focused document corpus

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 17 comments.

Changed files:

  • src/lighteval/models/rag/rag_model.py — Core RAG adapter implementation with RetrieverProtocol, GeneratorProtocol, the ContextFormatter utility class, and the RAGAdapterModel base class
  • src/lighteval/models/model_output.py — Added optional metadata field to ModelResponse for storing retrieval information and other model-specific data
  • examples/custom_models/rag_model_example.py — Complete working example with SimpleVectorRetriever and SimpleGenerator demonstrating RAG evaluation on TriviaQA-style tasks
  • src/lighteval/models/custom/rag_adapters.py — Placeholder file marked "TO BE IMPLEMENTED"



Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.




Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 11 comments.



@akshathmangudi

cc: @NathanHB



Development

Successfully merging this pull request may close these issues.

[FT] Support for retriever-augmented and latent-memory models.
