AI Evaluations Cookbooks

A collection of practical notebooks demonstrating systematic approaches to building, evaluating, and improving AI applications. These cookbooks provide hands-on guidance for developing more effective AI systems through data-driven methodologies.

Overview

Building effective AI applications requires more than just connecting to the latest LLM API. This repository provides structured approaches to developing systems that are reliable, efficient, and continuously improving. Each notebook in this collection focuses on a specific technique and walks through a methodical process for:

  1. Establishing evaluation frameworks - Creating robust metrics to measure performance
  2. Systematic improvement - Using data-driven approaches to enhance capabilities
  3. Performance visualization - Tracking improvements and identifying bottlenecks

Notebooks

A step-by-step guide to building and improving a Retrieval-Augmented Generation (RAG) application. This notebook covers:

  • Implementing effective retrieval strategies
  • Evaluating RAG performance with meaningful metrics (see the sketch after this list)
  • Systematically improving retrieval and generation quality
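To make "meaningful metrics" concrete, here is a minimal recall@k sketch of the kind of retrieval evaluation the notebook builds up; the model name, the two-document corpus, and the relevance labels are illustrative assumptions rather than the notebook's actual data:

```python
# Minimal sketch: scoring a retriever with recall@k.
# The embedding model, corpus, queries, and relevant_ids are placeholders --
# swap in your own retriever and labelled evaluation set.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = ["LangWatch traces and evaluates LLM calls.",
          "RAG retrieves supporting documents before generation."]
queries = ["What does RAG do?"]
relevant_ids = [[1]]  # for each query, indices of the documents that should be retrieved

corpus_emb = model.encode(corpus, normalize_embeddings=True)
query_emb = model.encode(queries, normalize_embeddings=True)

def recall_at_k(query_emb, corpus_emb, relevant_ids, k=1):
    scores = query_emb @ corpus_emb.T           # cosine similarity (embeddings are normalized)
    top_k = np.argsort(-scores, axis=1)[:, :k]  # best k document indices per query
    hits = [len(set(top) & set(rel)) / len(rel) for top, rel in zip(top_k, relevant_ids)]
    return float(np.mean(hits))

print(f"recall@1: {recall_at_k(query_emb, corpus_emb, relevant_ids, k=1):.2f}")
```

The same harness extends naturally to other ranking metrics such as MRR or NDCG once a labelled query set is in place.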

Learn how to fine-tune embedding models to significantly improve retrieval performance. This notebook covers:

  • Fine-tuning open-source embedding models using triplet loss (see the sketch after this list)
  • Evaluating and visualizing performance improvements
  • Applying techniques from industry case studies (like Ramp's transaction categorization)
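As a rough illustration of the fine-tuning step, here is a minimal triplet-loss sketch using the sentence-transformers training API; the two transaction-style triplets are made-up placeholders (loosely echoing the Ramp-style categorization setting), and real fine-tuning needs many mined (anchor, positive, negative) examples:

```python
# Minimal sketch: fine-tuning an open-source embedding model with triplet loss.
# Each InputExample holds (anchor, positive, negative); the data below is invented.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

triplets = [
    InputExample(texts=["coffee shop purchase", "cafe latte 4.50", "monthly rent payment"]),
    InputExample(texts=["airline ticket to SFO", "flight booking fee", "grocery store run"]),
]

train_dataloader = DataLoader(triplets, shuffle=True, batch_size=2)
train_loss = losses.TripletLoss(model=model)  # pulls anchor/positive together, pushes negatives away

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("finetuned-embedding-model")
```

Re-running the same retrieval metrics on the saved model is what makes the improvement measurable rather than anecdotal.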

Explore how to enhance retrieval performance by implementing metadata filtering in RAG applications. This notebook covers:

  • Implementing both semantic search and metadata-filtered search approaches
  • Evaluating and comparing approaches using industry-standard metrics (see the sketch after this list)
  • Drawing data-driven insights to optimize your own retrieval systems
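A minimal sketch of the two approaches side by side, kept library-light for clarity; the documents, metadata fields, and query below are illustrative assumptions:

```python
# Minimal sketch: plain semantic search vs. metadata-filtered semantic search.
# Documents, metadata fields, and the query are invented placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    {"text": "Refund policy for enterprise plans", "category": "billing",  "year": 2024},
    {"text": "How to rotate API keys",             "category": "security", "year": 2023},
    {"text": "Billing cycle and invoicing dates",  "category": "billing",  "year": 2023},
]
doc_emb = model.encode([d["text"] for d in docs], normalize_embeddings=True)

def search(query, top_k=2, filters=None):
    # Metadata filtering: keep only documents whose fields match, then rank the rest semantically.
    idx = [i for i, d in enumerate(docs)
           if not filters or all(d.get(key) == value for key, value in filters.items())]
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_emb[idx] @ q
    ranked = [idx[i] for i in np.argsort(-scores)[:top_k]]
    return [docs[i]["text"] for i in ranked]

print(search("invoice questions"))                                   # pure semantic search
print(search("invoice questions", filters={"category": "billing"}))  # metadata-filtered search
```

Comparing the two variants on the same labelled query set makes the benefit (or cost) of filtering measurable.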

Learn how to measure and improve tool calling capabilities in AI assistants using precision and recall metrics. This notebook covers:

  • Creating a framework for evaluating tool selection decisions (see the sketch after this list)
  • Analyzing per-tool performance to identify specific improvement areas
  • Systematically enhancing multi-tool coordination for complex tasks
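A minimal sketch of the precision/recall framing for tool selection; the tool names and the single evaluation case are hypothetical:

```python
# Minimal sketch: scoring tool selection with precision and recall.
# Expected = tools the assistant should have called; predicted = tools it actually called.
def tool_selection_scores(expected: set[str], predicted: set[str]) -> dict[str, float]:
    true_positives = len(expected & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0  # fraction of calls that were needed
    recall = true_positives / len(expected) if expected else 1.0       # fraction of needed calls that were made
    return {"precision": precision, "recall": recall}

# One hypothetical case: the assistant misses one required tool and adds an unnecessary one.
expected = {"search_flights", "get_weather"}
predicted = {"search_flights", "convert_currency"}

print(tool_selection_scores(expected, predicted))
# {'precision': 0.5, 'recall': 0.5}
```

Aggregating these scores per tool, rather than only overall, is what surfaces which specific tools the assistant over- or under-uses.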

Getting Started

  1. Clone this repository
  2. Install the required dependencies: pip install -r requirements.txt
  3. Open the notebooks in Jupyter or your preferred notebook environment
  4. Follow along with the step-by-step instructions

Contributing

Contributions are welcome! If you have ideas for new notebooks or improvements to existing ones, please open an issue or submit a pull request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
