A project demonstrating Iterative Completion Testing for LLMs using DeepEval for managing evaluations and SentenceTransformer for semantic similarity checks.

LLM Evaluation with DeepEval and Semantic Similarity

Disclaimer: This script is provided "as is" without any warranty of any kind, either express or implied. Use at your own risk.


Overview

This Python tool evaluates the performance of Large Language Models (LLMs) by comparing their outputs against expected responses using semantic similarity. It combines DeepEval's LLMTestCase for basic string matching with the SentenceTransformer library for a more advanced, meaning-based comparison.

How It Works

  • LLMTestCase: Provides a straightforward method for comparing the actual LLM output with an expected response using exact string matching.

  • SentenceTransformer: Generates embeddings (dense vector representations) of text, allowing the tool to calculate the semantic similarity between the model's output and the expected responses using cosine similarity.

  • Semantic Similarity: Instead of requiring an exact textual match, the tool assesses how close the outputs are in meaning by comparing the embeddings of the texts. The cosine similarity score, ranging from 0.0 (unrelated) to 1.0 (identical meaning), determines how closely the model’s output aligns with the expected meaning; a minimal sketch of this check follows the list.
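A minimal sketch of how the two checks fit together, assuming the all-MiniLM-L6-v2 embedding model and made-up example strings (the actual script may use a different model, prompt, or structure):

    from deepeval.test_case import LLMTestCase
    from sentence_transformers import SentenceTransformer, util

    # Hypothetical example values; the real script derives these from the prompt and the model's reply.
    prompt = "Why did the chicken cross the road?"
    actual_output = "To get to the other side."
    expected_output = "Because it wanted to reach the other side."
    threshold = 0.5

    # DeepEval test case: pairs the actual output with the expected response for string-level checks.
    test_case = LLMTestCase(
        input=prompt,
        actual_output=actual_output,
        expected_output=expected_output,
    )

    # Meaning-based comparison: embed both texts and take the cosine similarity.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode([actual_output, expected_output], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    print(f"Cosine similarity: {similarity:.2f} -> {'Pass' if similarity >= threshold else 'Fail'}")

Exact string matching would fail this pair outright, while the embedding comparison credits the shared meaning.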

Setup

  1. Install Dependencies: Ensure Python 3.6 or later is installed, then install the required packages:

    pip install openai sentence-transformers deepeval colorama
  2. Set the Environment Variable: Export your OpenAI API key so the script can call the OpenAI API:

    export OPENAI_API_KEY='your_openai_api_key'
  3. Run the Script: Execute the evaluation script using Python:

    python3 deepeval_test_semantic_similarity.py

Configuration

  • Context: Defines the background or scenario for the model’s response. This helps tailor the model’s output to a specific type of inquiry.

  • Dynamic Responses: When enabled, the expected responses are generated on the fly by the model; when disabled, predefined responses are used (see the sketch after this list).

  • Threshold: Controls the strictness of the similarity comparison. Higher thresholds require more precise matches, while lower thresholds allow for greater variation in wording.
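A rough sketch of what the dynamic mode might look like, assuming the current openai Python client and the gpt-4o-mini model (neither is confirmed by the script, which may use a different client version, model, or prompt format):

    from openai import OpenAI

    context = "Humor"
    prompt = "Why did the chicken cross the road?"
    use_dynamic_responses = True

    # Illustrative fallback answers used when dynamic generation is disabled.
    static_expected_responses = [
        "To get to the other side.",
        "Because it wanted to reach the other side.",
    ]

    if use_dynamic_responses:
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        completion = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed model; the script may target another
            messages=[{"role": "user", "content": f"Context: {context}. Answer briefly: {prompt}"}],
        )
        expected_responses = [completion.choices[0].message.content]
    else:
        expected_responses = static_expected_responses

Either way, the resulting expected_responses list feeds the similarity comparison described above.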

How to Use

  1. Set Configuration Parameters:

    • Modify the script's configuration at the top to suit your testing needs:
      context = "Humor"
      use_dynamic_responses = True
      threshold = 0.5
      prompt = "Why did the chicken cross the road?"
  2. Run the Evaluation:

    • Run the script:

      python3 deepeval_test_semantic_similarity.py
    • The script will execute the evaluation, comparing the model’s output to the expected responses and generating a report.

Use Cases

  • Model Performance Testing: Engineers can assess how well an LLM performs across various contexts, ensuring that it meets the desired standards of accuracy and relevance.

  • Threshold Adjustment: The threshold can be tuned to trade precision against flexibility, depending on the application’s requirements (see the example after this list).

  • Dynamic vs. Static Testing: Choose between dynamically generating expected responses or using predefined ones to test the model’s versatility and consistency.
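To make the precision/flexibility trade-off concrete, here is a small illustration; the sentences and threshold values are made up, and the embedding model is assumed:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model
    actual = "To get to the other side."
    expected = "Because it wanted to reach the other side."
    embeddings = model.encode([actual, expected], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

    # A loose threshold tolerates rewording; a strict one demands near-identical phrasing.
    for threshold in (0.4, 0.6, 0.8):
        verdict = "Pass" if similarity >= threshold else "Fail"
        print(f"threshold={threshold:.1f}  similarity={similarity:.2f}  -> {verdict}")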

Interpreting the Results

  • The evaluation generates a report detailing the context, similarity threshold, cosine similarity scores, and whether the model's output passed the test.
  • A Pass indicates that the cosine similarity met or exceeded the threshold, while a Fail means the output did not align closely enough with the expected meaning.

License

This project is licensed under the MIT License.
