This directory contains examples of using the Inspect AI framework to evaluate LLM agentic capabilities using web browsing tasks.
The browser task tests a model's ability to:
- Use tools (specifically web_browser)
- Navigate websites and follow multi-step instructions
- Extract relevant information from web pages
- Synthesize and summarize information
Complete directory layout:
inspect-examples/
├── 📁 examples/ # Python package with evaluation tasks
│ ├── __init__.py # Package initialization
│ └── browser/ # Browser example
│ ├── __init__.py
│ ├── browser.py # Browser task
│ ├── compose.yaml # Docker configuration
│ └── README.md # Browser example documentation
│
├── 📁 logs/ # Evaluation logs (auto-created)
│ └── *.json # Individual log files
│
├── 📁 .venv/ # Virtual environment (optional)
│ └── ... # Python packages
│
├── 📄 pyproject.toml # Python project configuration & dependencies
├── 📄 uv.lock # uv lockfile (auto-generated)
├── 📄 .python-version # Python version specification (3.11)
│
├── 📖 README.md # Complete documentation (this file)
├── 📖 QUICKSTART.md # 5-minute quick start guide
├── 📖 PLAN.md # Implementation plan & architecture
├── 📖 CHANGELOG.md # Project history and changes
│
├── 🔧 setup_with_uv.sh # Interactive setup script
├── 🔧 run_comparison.sh # Multi-model comparison script
└── 🔧 example_commands.sh # Collection of example commands
Get started in 5 minutes:
- Install uv: curl -LsSf https://astral.sh/uv/install.sh | sh
- Install dependencies: uv pip install -e ".[openai]"
- Set API key: export OPENAI_API_KEY=your-key-here
- Run evaluation: inspect eval examples/browser/browser.py --model openai/gpt-4o-mini
- View results: inspect view
For detailed instructions, see the sections below.
| File | Purpose |
|---|---|
| README.md | Complete documentation (you are here) - installation, usage, creating examples |
| QUICKSTART.md | Step-by-step beginner guide - get running in 5 minutes |
| PLAN.md | Implementation plan & architecture - design decisions and evaluation methodology |
| CHANGELOG.md | Project history and changes - what changed and when |
| examples/browser/README.md | Browser example documentation - task details and customization |
| Script | Purpose |
|---|---|
| setup_with_uv.sh | Interactive setup - checks/installs uv, selects providers, installs dependencies |
| run_comparison.sh | Multi-model testing - runs evaluation across multiple models systematically |
| example_commands.sh | Command reference - copy/paste individual commands, troubleshooting examples |
Requirements:
- Python 3.10+ (3.11 recommended)
- Docker (for sandboxing)
- uv package manager
- API key for at least one model provider
First, install uv - a fast Python package manager written in Rust.
Why uv?
- ⚡ 10-100x faster than pip for installations
- 🔒 Better dependency resolution with modern lockfiles
- 🎯 Single tool for package management, virtual environments, and Python version management
- 🔄 Drop-in replacement for pip commands
Installation:
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# Or with Homebrew
brew install uv
# Or with pip
pip install uv
# Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

Option A: Using the Automated Setup Script (Recommended)
cd inspect-examples
./setup_with_uv.sh

This interactive script will:
- Check if uv is installed (and install it if needed)
- Let you select which providers to install
- Install all dependencies
- Provide instructions for setting API keys
- Check if Docker is running
Option B: Manual Installation
# Install base inspect-ai
uv pip install inspect-ai
# Or use the project file to install with specific providers
cd inspect-examples
# Install with all providers
uv pip install -e ".[all-providers]"
# Or install with specific provider(s)
uv pip install -e ".[openai]"
uv pip install -e ".[openai,anthropic,mistral]"Using Virtual Environments:
# Create a virtual environment
uv venv
# Activate it (macOS/Linux)
source .venv/bin/activate
# Activate it (Windows)
.venv\Scripts\activate
# Install dependencies in the venv
uv pip install -e ".[all-providers]"OpenAI:
export OPENAI_API_KEY=your-openai-api-keyAnthropic:
export ANTHROPIC_API_KEY=your-anthropic-api-keyAI21:
export AI21_API_KEY=your-ai21-api-keyMistral:
export MISTRAL_API_KEY=your-mistral-api-keyGoogle:
export GOOGLE_API_KEY=your-google-api-keyCurrently configured:
- OpenAI: GPT-4, GPT-4o, GPT-4 Turbo
- Anthropic: Claude 3.5 Sonnet, Claude 3 Opus
- Mistral: Mistral Large, Mistral Medium
- AI21: Jamba 1.5 Large
- Google: Gemini 2.0 Flash, Gemini Pro
Install providers as needed:
# Single provider
uv pip install -e ".[openai]"
# Multiple providers
uv pip install -e ".[openai,anthropic,mistral]"
# All providers
uv pip install -e ".[all-providers]"The browser task uses Docker for sandboxing to securely execute model-generated code:
# Install Docker if you haven't already
# Visit: https://docs.docker.com/get-docker/
# Verify Docker is running
docker ps

This project uses pyproject.toml for dependency management:
[project]
name = "inspect-examples"
dependencies = [
"inspect-ai>=0.3.0",
]
[project.optional-dependencies]
openai = ["openai>=1.0.0"]
anthropic = ["anthropic>=0.18.0"]
mistral = ["mistralai>=1.0.0"]
ai21 = ["ai21>=2.0.0"]
google = ["google-genai>=0.2.0"]
all-providers = [
# All provider packages
]

Run the task with a specific model:
# Test with OpenAI GPT-4o-mini
inspect eval examples/browser/browser.py --model openai/gpt-4o-mini
# Test with Anthropic Claude Haiku
inspect eval examples/browser/browser.py --model anthropic/claude-haiku-4-5
# Test with Mistral Small
inspect eval examples/browser/browser.py --model mistral/mistral-small-latest

This project includes convenience scripts for easier setup and comparison:
# Guided setup with uv
./setup_with_uv.sh
# Compare multiple models automatically
./run_comparison.sh

# List available tasks in browser example
inspect list examples/browser/browser.py
# Run evaluation with specific model
inspect eval examples/browser/browser.py --model openai/gpt-4o-mini
# Run the comparison script (evaluates multiple models)
./run_comparison.sh
# View all evaluation results
inspect view
# Check evaluation history
inspect history

After running evaluations, view the results in a web browser:
inspect view

This opens a web-based interface showing:
- Task summaries
- Individual sample results
- Message histories
- Tool usage
- Scores and metrics
If you're using VS Code, install the Inspect VS Code Extension for integrated log viewing:
- Open VS Code
- Go to Extensions
- Search for "Inspect AI"
- Install the extension
By default, logs are saved to ./logs/ directory with timestamped filenames:
logs/
├── 2024-11-03T12-00-00_browser_gpt-4o-mini.json
├── 2024-11-03T12-15-00_browser_claude-haiku.json
└── ...
Each log contains:
- Complete message history
- Tool calls and responses
- Model outputs
- Scores and metrics
- Timing information
- Metadata
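Beyond the inspect view UI, logs can also be inspected programmatically. A minimal sketch, assuming the read_eval_log API and the log fields (status, eval.model, samples, messages) exposed by recent inspect-ai releases, and using one of the example filenames above:

```python
from inspect_ai.log import read_eval_log

# Load one evaluation log (use a real filename from ./logs/)
log = read_eval_log("logs/2024-11-03T12-00-00_browser_gpt-4o-mini.json")

print(log.status)      # e.g. "success"
print(log.eval.model)  # model that was evaluated

# Walk the first sample's message history to see tool calls and responses
for message in log.samples[0].messages:
    print(message.role, str(message.content)[:80])
```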
View logs with:
inspect view # Web interface
inspect view logs/specific-log.json # Specific log

- Fast Setup: 5-minute installation with uv
- Multi-Model Support: Compare OpenAI, Anthropic, Mistral, Google, and AI21 models
- Sandboxed Execution: Docker isolation for secure code execution
- Visual Interface: Web-based log viewer with detailed insights
- Extensible: Easy to add custom evaluation tasks
- Comprehensive Documentation: Multiple guides for different use cases
Located in examples/browser/browser.py:
@task
def browser():
return Task(
dataset=[...], # Samples to evaluate
solver=[...], # How to solve (tools + generation)
scorer=model_graded_qa(), # How to score the results
sandbox="docker" # Security sandbox
)

- Dataset: Contains input prompts for the model
- Solver: Chain of operations (use_tools + generate)
  - use_tools(web_browser()): provides web browsing capability
  - generate(): generates the final response
- Scorer: model_graded_qa() checks whether the key information is present in the output
- Sandbox: Docker container for secure execution
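Putting these pieces together, a self-contained sketch of a browser-style task might look like the following; the task name, sample prompt, and target are illustrative placeholders, not the repository's actual dataset:

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import model_graded_qa
from inspect_ai.solver import generate, use_tools
from inspect_ai.tool import web_browser

@task
def browser_sketch():
    return Task(
        # Dataset: prompts the model must answer by browsing
        dataset=[
            Sample(
                input="Visit https://example.com and summarize what the page says.",
                target="A short summary noting that the page is a placeholder example domain.",
            )
        ],
        # Solver: give the model browser tools, then let it generate a final answer
        solver=[use_tools(web_browser()), generate()],
        # Scorer: grade the answer against the target with a model grader
        scorer=model_graded_qa(),
        # Sandbox: run the browser tooling inside Docker
        sandbox="docker",
    )
```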
When comparing models, consider:
- Task Success Rate: Did the model complete the task?
- Tool Usage: How effectively did the model use the web_browser tool?
- Information Quality: How accurate and complete was the summary?
- Efficiency: How many steps/tokens did it take?
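One way to pull these numbers together after a comparison run is to read the headline metric out of each log file. A rough sketch, assuming JSON logs in ./logs/ and that the first scorer reports an accuracy metric (exact field names can vary across inspect-ai versions):

```python
from pathlib import Path

from inspect_ai.log import read_eval_log

# Print the headline metric for each logged evaluation, grouped by model
for log_file in sorted(Path("logs").glob("*.json")):
    log = read_eval_log(str(log_file))
    scores = log.results.scores if log.results else []
    metrics = scores[0].metrics if scores else {}
    accuracy = metrics.get("accuracy")
    print(f"{log.eval.model}: {accuracy.value if accuracy else 'n/a'}")
```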
# 1. Activate environment (if using venv)
source .venv/bin/activate
# 2. Ensure API key is set
echo $OPENAI_API_KEY
# 3. Run evaluation
inspect eval examples/browser/browser.py --model openai/gpt-4o-mini
# 4. View results
inspect view

Each example should be organized in its own subfolder within examples/:
Example Structure:
examples/
└── example_name/
├── __init__.py # Package initialization, exports tasks
├── task_file.py # Main task definition(s)
├── compose.yaml # Docker config (if using sandbox)
├── README.md # Example-specific documentation
└── [supporting files] # Data files, helpers, etc.
Step-by-Step Guide:
- Create Example Folder
mkdir examples/my_example
cd examples/my_example

- Create Task File (my_task.py)
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate
@task
def my_task():
return Task(
dataset=[
Sample(
input="Your evaluation prompt",
target="Expected answer"
)
],
solver=[generate()],
scorer=match()
)

- Create Package Init (__init__.py)
"""My Example - Brief description"""
from .my_task import my_task
__all__ = ["my_task"]

- Create Documentation (README.md)
# My Example
## What This Tests
Description of capabilities evaluated.
## Running the Task
\`\`\`bash
inspect eval examples/my_example/my_task.py --model openai/gpt-4o-mini
\`\`\`
## Customization
How to modify and extend the task.

- Test Your Example
inspect list examples/my_example/my_task.py
inspect eval examples/my_example/my_task.py --model openai/gpt-4o-mini
inspect view

Example Categories:
Examples can cover various domains:
- Agentic Tasks: Tool usage, multi-step problem solving (e.g., browser navigation)
- Knowledge & Reasoning: Factual knowledge, logic, math problems
- Coding Tasks: Code generation, understanding, debugging
- Multimodal: Image, audio, video understanding
- Safety & Alignment: Content moderation, bias evaluation, safety testing
- Update pyproject.toml:
[project.optional-dependencies]
newprovider = ["newprovider-sdk>=1.0.0"]

- Install it:
uv pip install -e ".[newprovider]"

- Set API key:
export NEWPROVIDER_API_KEY=your-key

- Run evaluation:
inspect eval examples/browser/browser.py --model openai/gpt-4o-mini

To modify the browser task:

- Edit the Python file:
nano examples/browser/browser.py

- Run to test:
inspect eval examples/browser/browser.py --model openai/gpt-4o-mini

- View results:
inspect view

When you change functionality, update these files:

- README.md - If changing features or usage
- QUICKSTART.md - If changing setup steps
- PLAN.md - If changing architecture
- CHANGELOG.md - Document the change
Location: examples/browser/
Tests LLM ability to use web browsing tools for navigation and information extraction.
What it evaluates:
- Tool usage (web_browser)
- Website navigation
- Information extraction
- Question answering from web content
Running:
inspect eval examples/browser/browser.py --model openai/gpt-4o-mini

Documentation: See examples/browser/README.md
Location: examples/custom_scorer/
Demonstrates how to create custom scorers using RAGChecker for fine-grained evaluation with precision, recall, and F1 metrics.
What it evaluates:
- Response accuracy (precision)
- Response completeness (recall)
- Balanced quality (F1 score)
- Fine-grained claim-level analysis
Running:
# Install dependencies first
pip install ragchecker litellm
python -m spacy download en_core_web_sm
# Run evaluation
inspect eval examples/custom_scorer/custom_scorer.py --model openai/gpt-4o-mini

Documentation: See examples/custom_scorer/README.md and examples/custom_scorer/QUICKSTART.md
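For orientation, the sketch below shows the general shape of a custom Inspect scorer. The custom_scorer example plugs RAGChecker's precision/recall/F1 computation into this pattern; the keyword-recall logic here is just an illustrative stand-in, and keyword_recall is a hypothetical name:

```python
from inspect_ai.scorer import Score, Target, mean, scorer
from inspect_ai.solver import TaskState

@scorer(metrics=[mean()])
def keyword_recall():
    """Illustrative scorer: fraction of target strings found in the answer."""

    async def score(state: TaskState, target: Target) -> Score:
        answer = state.output.completion
        targets = list(target)  # Target is a sequence of reference strings
        hits = [t for t in targets if t.lower() in answer.lower()]
        return Score(
            value=len(hits) / max(len(targets), 1),
            answer=answer,
            explanation=f"Matched {len(hits)} of {len(targets)} target strings",
        )

    return score
```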
Planned examples (contributions welcome):
- Coding evaluation
- Multi-agent collaboration
- Reasoning chains
- Safety testing
- Multimodal tasks
- Long-context understanding
- Organize by example: Each example in its own subfolder (browser/, coding/, etc.)
- Document thoroughly: Include README.md with each example
- Use clear naming: Task names should describe what they evaluate
- Include samples: Provide diverse test cases in your dataset
- Set appropriate limits: Configure timeouts and token limits
- Test multiple models: Verify tasks work across providers
- Follow structure: Use the example template above
- Update docs: Add new examples to this README
- Use the setup script: For initial configuration
- Commit project files: pyproject.toml and .python-version to version control
- Use uv for development: uv pip install -e ".[provider]"
- Mix unrelated tasks: Keep examples focused on specific capabilities
- Skip documentation: Always include README.md in example folders
- Put files in root: Keep all evaluation code in examples/
- Hardcode paths: Make examples portable
- Commit secrets: Never commit API keys to version control
- Commit .venv/: Add to .gitignore
- Commit uv.lock: Unless you want exact versions for collaboration
- Edit uv.lock manually: Let uv manage it
- Mix pip and uv: Choose one package manager per environment
- Forget dependencies: Document any special requirements
- Python >= 3.10 (3.11 recommended)
- Docker - For sandboxing
- inspect-ai >= 0.3.0 - Core framework
- openai >= 1.0.0 - For OpenAI models
- anthropic >= 0.18.0 - For Anthropic Claude
- mistralai >= 1.0.0 - For Mistral models
- ai21 >= 2.0.0 - For AI21 Jamba
- google-genai >= 0.2.0 - For Google Gemini
| Variable | Required For | Format |
|---|---|---|
| OPENAI_API_KEY | OpenAI models | sk-... |
| ANTHROPIC_API_KEY | Anthropic models | sk-ant-... |
| MISTRAL_API_KEY | Mistral models | String |
| AI21_API_KEY | AI21 models | String |
| GOOGLE_API_KEY | Google models | String |
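A quick way to check which of these variables are set in the current environment (purely illustrative):

```python
import os

# Report which provider API keys are present in the environment
for var in ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "MISTRAL_API_KEY", "AI21_API_KEY", "GOOGLE_API_KEY"]:
    print(f"{var}: {'set' if os.environ.get(var) else 'missing'}")
```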
If you get Docker-related errors:
# Check if Docker is running
docker ps
# If not, start Docker Desktop (macOS/Windows)
# Or start Docker daemon (Linux)
sudo systemctl start docker

If you get authentication errors:
# Verify your API key is set
echo $OPENAI_API_KEY # or other provider
# Re-export if needed
export OPENAI_API_KEY=your-key-here

Make sure you've installed the provider package:
uv pip install openai # or anthropic, mistralai, etc.
# Or use the project extras: uv pip install -e ".[openai]"

If uv is not recognized after installation:
# Restart your terminal or source your profile
source ~/.zshrc # or ~/.bashrc
# Or reinstall uv
curl -LsSf https://astral.sh/uv/install.sh | sh

If you get import errors:
# Reinstall the package with dependencies
uv pip install -e ".[openai]"
# Verify installation
uv pip list | grep inspect-ai

If you get permission errors with uv:
# Don't use sudo with uv
uv pip install -e ".[openai]"If you encounter dependency conflicts:
# Try reinstalling in a fresh environment
rm -rf .venv
uv venv
source .venv/bin/activate
uv pip install -e ".[all-providers]"A: Recommended but not required. You can still use pip, but uv is much faster and better for modern Python projects.
A: Technically yes, but not recommended. Choose one and stick with it in a given environment.
A: No need to change anything. But if you want to migrate, just install uv and reinstall dependencies with uv pip install -e ".[all-providers]"
A: No! All inspect eval commands remain exactly the same.
A: uv works great with venvs. Use uv venv to create one, or use your existing venv with uv pip.
A: Yes! uv is built by the creators of Ruff and is production-ready. It's used by many large projects.
A: You can always fall back to pip. The pyproject.toml works with both pip and uv.
Before considering an example complete:
- ✅ Run with multiple models: Test OpenAI, Anthropic, etc.
- ✅ Verify scores: Ensure scoring works as expected
- ✅ Check logs: Review output in inspect view
- ✅ Test edge cases: Try with limited samples, timeouts
- ✅ Document limitations: Note any known issues in README
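Parts of this checklist can be scripted: inspect-ai exposes an eval() function for running tasks from Python, so a quick multi-model smoke test might look roughly like this (the limit value and model list are illustrative):

```python
from inspect_ai import eval

from examples.browser.browser import browser

# Smoke test: run a couple of samples against two providers
for model in ["openai/gpt-4o-mini", "anthropic/claude-haiku-4-5"]:
    logs = eval(browser(), model=model, limit=2)
    print(model, logs[0].status)
```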
- Try the browser example: Run your first evaluation
- Create your own example: Follow the structure guide above
- Modify existing tasks: Edit dataset samples to test different scenarios
- Experiment with solvers: Try different solver chains (e.g., add chain_of_thought; see the sketch after this list)
- Custom scoring: Implement more sophisticated scorers for detailed evaluation
- Add metadata: Track additional metrics in your evaluations
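For example, adding chain_of_thought() ahead of generation in a solver chain might look like this sketch (task name and sample are placeholders):

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import chain_of_thought, generate

@task
def my_task_with_cot():
    return Task(
        dataset=[Sample(input="What is 17 * 24?", target="408")],
        # Ask for step-by-step reasoning before the final answer
        solver=[chain_of_thought(), generate()],
        scorer=match(),
    )
```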
# Installation
curl -LsSf https://astral.sh/uv/install.sh | sh
# Project setup
cd inspect-examples
uv pip install -e ".[openai]"
export OPENAI_API_KEY=your-key
# Run evaluation
inspect eval examples/browser/browser.py --model openai/gpt-4o-mini
# View results
inspect view
# List available tasks
inspect list examples/browser/browser.py
# Check evaluation history
inspect history
# Common uv commands
uv pip list # List installed packages
uv pip tree # Show dependency tree
uv pip freeze # Freeze current environment
uv venv # Create virtual environment

- Inspect AI Documentation
- Inspect AI GitHub
- Task Creation Guide
- Tool Development
- Scoring Methods
- Browser Tools Guide
- Model Providers
- Review existing examples in examples/ for patterns
- Check this README for project structure
- Read Inspect AI documentation
- Open issues on GitHub for questions