Disclaimer: This script is provided "as is" without any warranty of any kind, either express or implied. Use at your own risk.
- LinkedIn Article: https://www.linkedin.com/pulse/testing-ai-models-example-iterative-completion-jacob-adm-yhvpe/
- Copy of working script with full logging and comments: https://github.com/jadm11/LLM-iterative-completion-deepeval-test/blob/698dbc93b2f9c5f526ef9f2de3a5a281867a7968/deepeval_test_semantic_similarity.py
This Python tool evaluates the performance of Large Language Models (LLMs) by comparing their outputs against expected responses using semantic similarity. It combines deepeval's LLMTestCase, used here for basic string matching, with the SentenceTransformer library for a more advanced, meaning-based comparison.
- LLMTestCase: Provides a straightforward way to compare the actual LLM output with an expected response using exact string matching.
- SentenceTransformer: Generates embeddings (dense vector representations) of text, allowing the tool to calculate the semantic similarity between the model's output and the expected responses using cosine similarity.
- Semantic Similarity: Instead of checking only for an exact match, the tool assesses how similar the meanings of the outputs are by comparing the embeddings of the texts. The similarity score, ranging from 0.0 (no similarity) to 1.0 (perfect match), determines how closely the model's output aligns with the expected meaning (see the sketch after this list).
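The core of that comparison can be sketched in a few lines. This is a minimal illustration, not the script itself, and the embedding model name all-MiniLM-L6-v2 is an assumed choice rather than necessarily the one the script uses:

  # Minimal sketch: compare two texts by the cosine similarity of their embeddings.
  from sentence_transformers import SentenceTransformer, util

  embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

  actual = "To get to the other side."
  expected = "Because it wanted to reach the other side."

  # encode() returns dense vectors; convert_to_tensor=True lets util.cos_sim consume them directly.
  embeddings = embedder.encode([actual, expected], convert_to_tensor=True)
  score = util.cos_sim(embeddings[0], embeddings[1]).item()

  print(f"Cosine similarity: {score:.3f}")  # closer to 1.0 means closer in meaning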
- Install Dependencies: Ensure Python 3.6 or later is installed, then install the required packages:
pip install openai sentence-transformers deepeval colorama
- Set Up Environment Variables: Make sure your OpenAI API key is set:
export OPENAI_API_KEY='your_openai_api_key'
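If the key is missing, the OpenAI calls will fail at runtime. A small guard like the one below can be added near the top of the script; this assumes the script reads the key from the OPENAI_API_KEY environment variable, as the OpenAI client does by default:

  import os

  # Fail fast with a clear message if the API key was not exported.
  if not os.environ.get("OPENAI_API_KEY"):
      raise SystemExit(
          "OPENAI_API_KEY is not set. Export it before running the script, e.g. "
          "export OPENAI_API_KEY='your_openai_api_key'"
      )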
- Run the Script: Execute the evaluation script with Python:
python3 deepeval_test_semantic_similarity.py
- Context: Defines the background or scenario for the model's response. This helps tailor the model's output to a specific type of inquiry.
- Dynamic Responses: When enabled, the expected responses are dynamically generated by the model. If disabled, predefined responses are used.
- Threshold: Controls the strictness of the similarity comparison. Higher thresholds require closer matches, while lower thresholds allow for greater variation in wording.
- Set Configuration Parameters: Modify the script's configuration at the top to suit your testing needs:
  context = "Humor"
  use_dynamic_responses = True
  threshold = 0.5
  prompt = "Why did the chicken cross the road?"
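How these values gate the result is defined in the script itself; as a rough illustration (the function below is hypothetical, not part of the script), the threshold is typically applied to the cosine similarity score like this:

  # Illustrative only: how a similarity threshold typically gates pass/fail.
  def passes(similarity_score, threshold):
      """Return True when the output is semantically close enough to an expected response."""
      return similarity_score >= threshold

  print(passes(0.72, 0.5))  # True  -> reported as a Pass
  print(passes(0.31, 0.5))  # False -> reported as a Fail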
- Run the Evaluation: Run the script:
python3 deepeval_test_semantic_similarity.py
The script will execute the evaluation, comparing the model's output to the expected responses and generating a report.
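For orientation, the end-to-end flow is roughly the following. This is a simplified sketch rather than the script itself: the openai>=1.0 client, the gpt-4o-mini model name, the embedding model, and the best-match scoring rule are all assumptions made for illustration.

  # Simplified sketch of the flow: prompt -> model output -> semantic comparison -> verdict.
  from openai import OpenAI
  from sentence_transformers import SentenceTransformer, util
  from deepeval.test_case import LLMTestCase

  prompt = "Why did the chicken cross the road?"
  expected_responses = ["To get to the other side.", "Because it wanted to reach the other side."]
  threshold = 0.5

  client = OpenAI()  # reads OPENAI_API_KEY from the environment
  completion = client.chat.completions.create(
      model="gpt-4o-mini",  # assumed model choice
      messages=[{"role": "user", "content": prompt}],
  )
  actual_output = completion.choices[0].message.content

  # Record the exchange as a deepeval test case.
  test_case = LLMTestCase(
      input=prompt,
      actual_output=actual_output,
      expected_output=expected_responses[0],
  )

  # Score the output against each expected response and keep the best match.
  embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
  actual_emb = embedder.encode(actual_output, convert_to_tensor=True)
  expected_embs = embedder.encode(expected_responses, convert_to_tensor=True)
  best_score = util.cos_sim(actual_emb, expected_embs).max().item()

  verdict = "Pass" if best_score >= threshold else "Fail"
  print(f"Best cosine similarity: {best_score:.3f} -> {verdict}")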
- Model Performance Testing: Engineers can assess how well an LLM performs across various contexts, ensuring that it meets the desired standards of accuracy and relevance.
- Threshold Adjustment: The threshold can be tuned to balance precision against flexibility, depending on the application's requirements.
- Dynamic vs. Static Testing: Choose between dynamically generating expected responses and using predefined ones to test the model's versatility and consistency (a sketch of this toggle follows below).
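The toggle might look roughly like this in code; the helper below is hypothetical and only illustrates the idea of generating reference answers with the model itself:

  # Illustrative sketch of the use_dynamic_responses toggle (helper name is hypothetical).
  from openai import OpenAI

  use_dynamic_responses = True
  prompt = "Why did the chicken cross the road?"

  # Static baseline: predefined answers kept in the script.
  static_expected_responses = ["To get to the other side."]

  def generate_expected_responses(prompt, n=2):
      """Ask the model for several candidate reference answers."""
      client = OpenAI()
      completion = client.chat.completions.create(
          model="gpt-4o-mini",  # assumed model choice
          messages=[{"role": "user", "content": prompt}],
          n=n,  # request n candidate completions
      )
      return [choice.message.content for choice in completion.choices]

  expected_responses = (
      generate_expected_responses(prompt) if use_dynamic_responses
      else static_expected_responses
  )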
- The evaluation generates a report detailing the context, similarity threshold, cosine similarity scores, and whether the model's output passed the test.
- A Pass indicates that the output met the similarity criteria, while a Fail suggests it did not align closely enough with the expected meaning.
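Because colorama is among the dependencies, the Pass/Fail line is presumably colorized; a sketch of what that might look like (not the script's exact report format):

  # Sketch of a colorized verdict line; the exact report format is defined by the script.
  from colorama import Fore, init

  init(autoreset=True)  # reset colors after each print

  def print_verdict(score, threshold):
      if score >= threshold:
          print(Fore.GREEN + f"PASS  (cosine similarity {score:.3f} >= threshold {threshold})")
      else:
          print(Fore.RED + f"FAIL  (cosine similarity {score:.3f} < threshold {threshold})")

  print_verdict(0.72, 0.5)
  print_verdict(0.31, 0.5)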
This script is licensed under the MIT License.