Disclaimer: This script is provided "as is" without any warranty of any kind, either express or implied. Use at your own risk.
- LinkedIn Article: https://www.linkedin.com/pulse/testing-ai-models-example-iterative-completion-jacob-adm-yhvpe/
- Copy of working script with full logging and comments: https://github.com/jadm11/LLM-iterative-completion-deepeval-test/blob/698dbc93b2f9c5f526ef9f2de3a5a281867a7968/deepeval_test_semantic_similarity.py
This Python tool evaluates the performance of Large Language Models (LLMs) by comparing their outputs against expected responses using semantic similarity. It combines deepeval's LLMTestCase, used here for basic string matching, with the SentenceTransformer library for a more advanced, meaning-based comparison.
- LLMTestCase: Provides a straightforward way to compare the actual LLM output with an expected response using exact string matching.
- SentenceTransformer: Generates embeddings (dense vector representations) of text, allowing the tool to calculate the semantic similarity between the model's output and the expected responses using cosine similarity.
- Semantic Similarity: Instead of checking only for an exact match, the tool assesses how similar the meanings of the outputs are by comparing the embeddings of the texts. The similarity score, ranging from 0.0 (no similarity) to 1.0 (perfect match), determines how closely the model's output aligns with the expected meaning (see the sketch after this list).
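The core of that comparison can be sketched in a few lines. This is a minimal illustration, not the script itself, and the embedding model name all-MiniLM-L6-v2 is an assumed choice rather than necessarily the one the script uses:

  # Minimal sketch: compare two texts by the cosine similarity of their embeddings.
  from sentence_transformers import SentenceTransformer, util

  embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

  actual = "To get to the other side."
  expected = "Because it wanted to reach the other side."

  # encode() returns dense vectors; convert_to_tensor=True lets util.cos_sim consume them directly.
  embeddings = embedder.encode([actual, expected], convert_to_tensor=True)
  score = util.cos_sim(embeddings[0], embeddings[1]).item()

  print(f"Cosine similarity: {score:.3f}")  # closer to 1.0 means closer in meaning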
- Install Dependencies: Ensure Python 3.6 or later is installed, then install the required packages:
pip install openai sentence-transformers deepeval colorama
- Set Up Environment Variables: Make sure your OpenAI API key is set:
export OPENAI_API_KEY='your_openai_api_key'
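If the key is missing, the OpenAI calls will fail at runtime. A small guard like the one below can be added near the top of the script; this assumes the script reads the key from the OPENAI_API_KEY environment variable, as the OpenAI client does by default:

  import os

  # Fail fast with a clear message if the API key was not exported.
  if not os.environ.get("OPENAI_API_KEY"):
      raise SystemExit(
          "OPENAI_API_KEY is not set. Export it before running the script, e.g. "
          "export OPENAI_API_KEY='your_openai_api_key'"
      )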
- Run the Script: Execute the evaluation script with Python:
python3 deepeval_test_semantic_similarity.py
- Context: Defines the background or scenario for the model's response. This helps tailor the model's output to a specific type of inquiry.
- Dynamic Responses: When enabled, the expected responses are dynamically generated by the model. If disabled, predefined responses are used.
- Threshold: Controls the strictness of the similarity comparison. Higher thresholds require closer matches, while lower thresholds allow for greater variation in wording.
- Set Configuration Parameters: Modify the script's configuration at the top to suit your testing needs:
  context = "Humor"
  use_dynamic_responses = True
  threshold = 0.5
  prompt = "Why did the chicken cross the road?"
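How these values gate the result is defined in the script itself; as a rough illustration (the function below is hypothetical, not part of the script), the threshold is typically applied to the cosine similarity score like this:

  # Illustrative only: how a similarity threshold typically gates pass/fail.
  def passes(similarity_score, threshold):
      """Return True when the output is semantically close enough to an expected response."""
      return similarity_score >= threshold

  print(passes(0.72, 0.5))  # True  -> reported as a Pass
  print(passes(0.31, 0.5))  # False -> reported as a Fail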
- Run the Evaluation: Run the script:
python3 deepeval_test_semantic_similarity.py
The script will execute the evaluation, comparing the model's output to the expected responses and generating a report.
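For orientation, the end-to-end flow is roughly the following. This is a simplified sketch rather than the script itself: the openai>=1.0 client, the gpt-4o-mini model name, the embedding model, and the best-match scoring rule are all assumptions made for illustration.

  # Simplified sketch of the flow: prompt -> model output -> semantic comparison -> verdict.
  from openai import OpenAI
  from sentence_transformers import SentenceTransformer, util
  from deepeval.test_case import LLMTestCase

  prompt = "Why did the chicken cross the road?"
  expected_responses = ["To get to the other side.", "Because it wanted to reach the other side."]
  threshold = 0.5

  client = OpenAI()  # reads OPENAI_API_KEY from the environment
  completion = client.chat.completions.create(
      model="gpt-4o-mini",  # assumed model choice
      messages=[{"role": "user", "content": prompt}],
  )
  actual_output = completion.choices[0].message.content

  # Record the exchange as a deepeval test case.
  test_case = LLMTestCase(
      input=prompt,
      actual_output=actual_output,
      expected_output=expected_responses[0],
  )

  # Score the output against each expected response and keep the best match.
  embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
  actual_emb = embedder.encode(actual_output, convert_to_tensor=True)
  expected_embs = embedder.encode(expected_responses, convert_to_tensor=True)
  best_score = util.cos_sim(actual_emb, expected_embs).max().item()

  verdict = "Pass" if best_score >= threshold else "Fail"
  print(f"Best cosine similarity: {best_score:.3f} -> {verdict}")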
- Model Performance Testing: Engineers can assess how well an LLM performs across various contexts, ensuring that it meets the desired standards of accuracy and relevance.
- Threshold Adjustment: The threshold can be tuned to balance precision against flexibility, depending on the application's requirements.
- Dynamic vs. Static Testing: Choose between dynamically generating expected responses and using predefined ones to test the model's versatility and consistency (a sketch of this toggle follows below).
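The toggle might look roughly like this in code; the helper below is hypothetical and only illustrates the idea of generating reference answers with the model itself:

  # Illustrative sketch of the use_dynamic_responses toggle (helper name is hypothetical).
  from openai import OpenAI

  use_dynamic_responses = True
  prompt = "Why did the chicken cross the road?"

  # Static baseline: predefined answers kept in the script.
  static_expected_responses = ["To get to the other side."]

  def generate_expected_responses(prompt, n=2):
      """Ask the model for several candidate reference answers."""
      client = OpenAI()
      completion = client.chat.completions.create(
          model="gpt-4o-mini",  # assumed model choice
          messages=[{"role": "user", "content": prompt}],
          n=n,  # request n candidate completions
      )
      return [choice.message.content for choice in completion.choices]

  expected_responses = (
      generate_expected_responses(prompt) if use_dynamic_responses
      else static_expected_responses
  )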
- The evaluation generates a report detailing the context, similarity threshold, cosine similarity scores, and whether the model's output passed the test.
- A Pass indicates that the output met the similarity criteria, while a Fail suggests it did not align closely enough with the expected meaning.
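Because colorama is among the dependencies, the Pass/Fail line is presumably colorized; a sketch of what that might look like (not the script's exact report format):

  # Sketch of a colorized verdict line; the exact report format is defined by the script.
  from colorama import Fore, init

  init(autoreset=True)  # reset colors after each print

  def print_verdict(score, threshold):
      if score >= threshold:
          print(Fore.GREEN + f"PASS  (cosine similarity {score:.3f} >= threshold {threshold})")
      else:
          print(Fore.RED + f"FAIL  (cosine similarity {score:.3f} < threshold {threshold})")

  print_verdict(0.72, 0.5)
  print_verdict(0.31, 0.5)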
This script is licensed under the MIT License.