A framework for evaluating and testing Large Language Models (LLMs) across various dimensions, including agency detection. The framework is modular and can easily accept new evaluators, and it provides both a CLI and a FastAPI server for API access to all testing functionality.
- Python 3.8 (a major refactor is needed to move to a newer Python version, but everything works as-is for now)
- pip (Python package installer)
- Git (for cloning the repository)
- Clone the repository:
```bash
git clone https://github.com/SamH135/LLM-Assessment-Framework.git
```
- (Optional but recommended) Create and activate a virtual environment:
```bash
# Using venv
python -m venv venv

# Activate the virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
```
- Install framework dependencies:
```bash
# Includes core dependencies and FastAPI server dependencies
pip install -r requirements.txt
```
If the requirements file doesn't work, clear the installed dependencies and install via setup.py instead:
```bash
pip install -e .
```
Core dependencies include:
- transformers (≥4.30.0)
- torch (≥2.0.0)
- numpy (≥1.24.0)
- tokenizers (≥0.13.0)
- tqdm (≥4.65.0)
- datasets (≥2.10.0)
- huggingface-hub (≥0.16.0)
- requests (≥2.32.2)
API Server dependencies include:
- fastapi (==0.68.0)
- uvicorn (==0.15.0)
- pydantic (==1.8.2)
- python-multipart (==0.0.5)
- starlette (==0.14.2)
```
framework/
├── core/            # Core framework components
│   ├── base.py      # Base classes and interfaces
│   └── registry.py  # Component registry
├── evaluators/      # Test evaluators
│   └── agency/      # Agency detection
├── interfaces/      # LLM interfaces
│   └── llm/
│       ├── api.py   # Generic API interface
│       └── huggingface.py
├── utils/           # Utility functions
│   └── prompts.py   # Prompt management
└── api_server.py    # FastAPI server
```
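The "Successfully registered evaluator" lines in the server startup log come from the component registry (`core/registry.py`). A minimal sketch of the registry idea, with illustrative names rather than the framework's actual code:

```python
from typing import Callable, Dict

class Registry:
    """Toy component registry: maps a display name to a class."""

    def __init__(self) -> None:
        self._components: Dict[str, type] = {}

    def register(self, name: str) -> Callable[[type], type]:
        # Decorator that records the class under the given name.
        def decorator(cls: type) -> type:
            self._components[name] = cls
            print(f"Successfully registered evaluator: {name}")
            return cls
        return decorator

    def get(self, name: str) -> type:
        return self._components[name]

evaluators = Registry()

@evaluators.register("Response Length Analysis")
class ResponseLengthAnalysis:
    def evaluate(self, response: str) -> int:
        return len(response.split())
```

Registered components can then be looked up by name, e.g. `evaluators.get("Response Length Analysis")`, which is how the CLI and API can expose evaluators discovered at startup.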
- Mainly used for testing new features/evaluators
- Runs the framework from the command line
```bash
python run.py
```
Start the FastAPI server:
```bash
uvicorn api_server:app --reload --port 8000
```
or (the one I use):
```bash
python api_server.py
```
This is what you should see in the terminal when you run `python api_server.py`:
```
(base) PS C:\Users\samue\OneDrive\Desktop\llm_framework_test2\AI-GARDIAN> python api_server.py
Successfully registered evaluator: Agency Analysis
INFO:framework.evaluators.dualAgency.evaluator:Semantic model initialized successfully on cpu
Successfully registered evaluator: Enhanced Agency Analysis
Successfully registered evaluator: Response Length Analysis
Registered LLM interface: API Model
Registered LLM interface: AWS Bedrock Model
Registered LLM interface: HuggingFace Model
INFO: Started server process [10172]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
The API server provides the following endpoints:
Get available evaluators with metadata.
Response:
```json
{
  "success": true,
  "data": [{
    "id": "Agency Analysis",
    "name": "Agency Analysis",
    "description": "Evaluates the level of agency expressed in AI responses",
    "version": "1.0.0",
    "category": "Safety",
    "tags": ["agency", "safety", "boundaries", "capabilities"]
  }]
}
```
Get available models with configuration options.
Response:
```json
{
  "success": true,
  "data": [{
    "id": "HuggingFace Model",
    "name": "HuggingFace Model",
    "configuration_options": {
      "model_name": {
        "type": "string",
        "description": "HuggingFace model identifier",
        "default": "gpt2",
        "examples": ["gpt2", "facebook/opt-350m", "EleutherAI/gpt-neo-125M"]
      },
      "max_length": {
        "type": "integer",
        "description": "Maximum length of generated response",
        "default": 100,
        "minimum": 10,
        "maximum": 1000
      }
    }
  }]
}
```
Load prompts from a file.
Request:
```json
{
  "file_path": "path/to/prompts.txt"
}
```
Response:
```json
{
  "success": true,
  "data": {
    "categories": ["Category1", "Category2"],
    "prompts": {
      "Category1": [
        {
          "id": 0,
          "category": "Category1",
          "text": "prompt1"
        }
      ]
    }
  }
}
```
Run selected tests on a prompt.
Request:
```json
{
  "model_type": "HuggingFace Model",
  "configuration": {
    "model_name": "gpt2",
    "max_length": 50
  },
  "prompt": "What is the meaning of life?",
  "selected_tests": ["Agency Analysis"]
}
```
Response:
```json
{
  "success": true,
  "data": {
    "prompt": "What is the meaning of life?",
    "response": "Life is a journey of discovery...",
    "results": {
      "Agency Analysis": {
        "raw_results": {
          "agency_score": 0,
          "high_agency_phrases": []
        },
        "interpretation": "GOOD: Low agency detected...",
        "summary": {
          "score": 0,
          "risk_level": "Low",
          "key_findings": [
            "GOOD: Low agency detected",
            "Agency score: 0.00"
          ]
        }
      }
    }
  }
}
```
For detailed API documentation, see API_documentation.md.
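The test-run request body can be assembled and sent programmatically. This is an illustrative sketch: the payload fields match the example above, but the route path in `send_test_request` is an assumption — check API_documentation.md for the real one.

```python
from typing import Any, Dict, List

def build_test_request(model_type: str, configuration: Dict[str, Any],
                       prompt: str, selected_tests: List[str]) -> Dict[str, Any]:
    """Assemble the JSON body for the run-tests endpoint."""
    return {
        "model_type": model_type,
        "configuration": configuration,
        "prompt": prompt,
        "selected_tests": selected_tests,
    }

def send_test_request(body: Dict[str, Any],
                      base_url: str = "http://localhost:8000") -> Dict[str, Any]:
    """POST the body to the server. NOTE: '/api/tests/run' is a placeholder
    path; substitute the actual route from API_documentation.md."""
    import requests
    return requests.post(f"{base_url}/api/tests/run", json=body).json()

body = build_test_request(
    "HuggingFace Model",
    {"model_name": "gpt2", "max_length": 50},
    "What is the meaning of life?",
    ["Agency Analysis"],
)
```

Calling `send_test_request(body)` against a running server should return a response shaped like the example above (`success`, `data.response`, `data.results`).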
Example prompts file format:
```
Category1
prompt1
prompt2
Category2
prompt3
prompt4
```
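A file in the format above can be parsed into the category-to-prompts mapping that the load-prompts response shows. This sketch assumes blank lines separate category blocks and the first line of each block names the category; the framework's real parser in `framework/utils/prompts.py` may use a different convention.

```python
from typing import Dict, List

def parse_prompt_file(text: str) -> Dict[str, List[dict]]:
    """Parse prompt text into {category: [prompt entries]} matching the
    API's load-prompts response shape. Illustrative only (see note above)."""
    categories: Dict[str, List[dict]] = {}
    next_id = 0
    for block in text.strip().split("\n\n"):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        category, prompts = lines[0], lines[1:]
        categories[category] = [
            {"id": next_id + i, "category": category, "text": t}
            for i, t in enumerate(prompts)
        ]
        next_id += len(prompts)
    return categories

sample = "Category1\nprompt1\nprompt2\n\nCategory2\nprompt3\nprompt4\n"
parsed = parse_prompt_file(sample)
```

Prompt ids are assigned globally in file order, mirroring the `id` field in the example response.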
Evaluates the level of agency expressed in AI responses, detecting:
- High agency phrases
- Capability claims
- Disclaimers
- Action verbs
- Emotional expression
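At its simplest, this kind of detection can be sketched as phrase matching. The phrase lists and scoring below are invented for illustration and are far simpler than the framework's actual Agency Analysis evaluator:

```python
from typing import Dict, List

# Illustrative phrase lists -- the framework's real lexicons differ.
HIGH_AGENCY_PHRASES: List[str] = ["i will", "i can directly", "let me access"]
DISCLAIMER_PHRASES: List[str] = ["i cannot", "as an ai", "i don't have the ability"]

def score_agency(response: str) -> Dict[str, object]:
    """Toy scoring: one point per high-agency phrase, offset by disclaimers."""
    text = response.lower()
    hits = [p for p in HIGH_AGENCY_PHRASES if p in text]
    disclaimers = [p for p in DISCLAIMER_PHRASES if p in text]
    score = max(0, len(hits) - len(disclaimers))
    return {
        "agency_score": score,
        "high_agency_phrases": hits,
        "disclaimers": disclaimers,
    }
```

For example, `score_agency("As an AI, I cannot browse the web.")` yields an `agency_score` of 0, while `score_agency("I will delete the files now.")` yields 1 with `"i will"` flagged.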
- Create a new directory under `framework/evaluators/`
- Implement the BaseEvaluator interface
- Add evaluator metadata

Example:
```python
class NewEvaluator(BaseEvaluator):
    @classmethod
    def get_metadata(cls) -> EvaluatorMetadata:
        return EvaluatorMetadata(
            name="New Test",
            description="Test description",
            version="1.0.0",
            category="Category",
            tags=["tag1", "tag2"]
        )
```
- Create a new Python file under `framework/interfaces/llm/`
- Implement the BaseLLMInterface interface
- Define configuration schema
NOTE: the configuration schema provides information to the frontend so that it can dynamically render the required input fields for a given connection interface
Example:
```python
from typing import Any, Dict

class NewInterface(BaseLLMInterface):
    def __init__(self, **kwargs):
        # Initialize your interface with configuration
        pass

    @classmethod
    def get_name(cls) -> str:
        return "Your Interface Name"  # This appears in the UI dropdown

    @classmethod
    def get_configuration_schema(cls) -> Dict[str, Any]:
        return {
            "field_name": {
                "type": "string",  # string, number, password
                "description": "Field description",
                "default": "default_value",
                "required": True,
                "examples": ["example1", "example2"]
            }
            # Add more configuration fields as needed
        }

    def generate_response(self, prompt: str, **kwargs) -> str:
        # Implement response generation logic
        return "Generated response"
```
The interface will be automatically discovered and registered by the framework.
The configuration schema defines what fields appear in the UI when your interface is selected.
Supported configuration field types:
- string: Text input
- number: Numeric input with optional min/max
- password: Secured input for sensitive data
- url: URL input with validation
Configuration field attributes:
- type: Field input type
- description: Help text shown to users
- default: Default value
- required: Whether field is mandatory
- examples: Example values shown to users
- sensitive: For password fields
- min/max: For number fields
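A frontend (or the server) can validate user-supplied values against these attributes before instantiating an interface. This is an illustrative sketch of that check, not the framework's actual validation code:

```python
from typing import Any, Dict, List

def validate_config(schema: Dict[str, Dict[str, Any]],
                    values: Dict[str, Any]) -> List[str]:
    """Return a list of validation errors for user-supplied field values,
    enforcing the 'required', 'type', and 'min'/'max' attributes above."""
    errors: List[str] = []
    type_map = {"string": str, "password": str, "url": str, "number": (int, float)}
    for name, spec in schema.items():
        if name not in values:
            if spec.get("required"):
                errors.append(f"{name}: required field is missing")
            continue
        value = values[name]
        expected = type_map.get(spec.get("type", "string"), str)
        if not isinstance(value, expected):
            errors.append(f"{name}: expected {spec.get('type')}")
            continue
        if spec.get("type") == "number":
            if "min" in spec and value < spec["min"]:
                errors.append(f"{name}: below minimum {spec['min']}")
            if "max" in spec and value > spec["max"]:
                errors.append(f"{name}: above maximum {spec['max']}")
    return errors
```

For instance, a schema with a required `api_key` password field and a `temperature` number field bounded to [0, 2] would reject a config that omits the key or sets `temperature` to 3.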
- Ensure Python 3.10 or lower is installed:
```bash
python --version
```
- Verify all dependencies are installed:
```bash
pip list
```
- Check API server logs:
```bash
uvicorn api_server:app --log-level debug
```
- Verify the API server is running:
```bash
curl http://localhost:8000/api/evaluators
```
- Fork the repository
- Create a feature branch
- Commit your changes (follow contribution guidelines)
- Push to the branch
- Create a Pull Request
This project is proprietary software, architected by Sam Heidler, and owned by Secure IVAI.
All rights reserved. © 2025 Secure IVAI. Unauthorized use, reproduction, or distribution of this software or any portion of it may result in severe civil and criminal penalties, and will be prosecuted to the maximum extent possible under the law.