A hands-on workshop exploring how to work with text embeddings for search and retrieval, using modern Python tools and libraries.
Companion to the talk "How to teach new things to your AI".
This workshop teaches the fundamentals of working with text embeddings through a practical Jupyter notebook that guides participants through:
- Text extraction from PDFs
- Semantic text chunking
- Creating and working with embeddings
- Vector similarity search
- Reranking search results
- Building a simple RAG (Retrieval Augmented Generation) system
## Prerequisites

- Python 3.12
- Basic familiarity with Python and Jupyter notebooks
- Understanding of basic NLP concepts
- A text editor (VS Code recommended)
## Setup

- Install Python 3.12 using a version manager of your choice.
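  For example, assuming a recent uv is already installed, one way to get Python 3.12 is:

  ```bash
  uv python install 3.12
  ```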
- Clone this repository and navigate to the project directory:

  ```bash
  git clone [repository-url]
  cd [repository-name]
  ```
- Create and activate a virtual environment:

  ```bash
  uv venv
  source .venv/bin/activate  # On Unix/macOS
  # or
  .venv\Scripts\activate     # On Windows
  ```
- Install dependencies:

  ```bash
  uv pip install -r requirements.txt
  ```
- Launch Jupyter Notebook:

  ```bash
  jupyter notebook
  ```

- Open `embeddings.ipynb` and follow along with the tutorial.
## What You'll Learn

- How to extract and process text from PDF documents
- Techniques for semantic text chunking
- Creating and working with text embeddings
- Implementing vector similarity search using DuckDB
- Using rerankers to improve search results
- Building a simple question-answering system

The short sketches below give a feel for each of these steps; the notebook walks through them in full.
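As a taste of the first two steps, here is a minimal sketch of pulling text out of a PDF with PyMuPDF and cutting it into chunks. The file name `example.pdf` is a placeholder, and the simple character-budget chunker below stands in for the semantic chunking the notebook actually covers.

```python
import fitz  # PyMuPDF


def extract_text(pdf_path: str) -> str:
    """Concatenate the plain text of every page in the PDF."""
    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Greedy paragraph packing: start a new chunk once max_chars is reached."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += paragraph + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks


chunks = chunk_text(extract_text("example.pdf"))  # "example.pdf" is a placeholder path
```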
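Embedding the chunks and searching them in DuckDB might look roughly like this. It assumes the `all-MiniLM-L6-v2` sentence-transformers model and a DuckDB release recent enough to ship `list_cosine_similarity`; the table layout and model choice are illustrative, not necessarily the notebook's exact setup.

```python
import duckdb
from sentence_transformers import SentenceTransformer

# Placeholder chunks; in the notebook these come from the PDF extraction/chunking step.
chunks = [
    "Embeddings map text to points in a vector space.",
    "DuckDB is an in-process analytical database.",
    "Reranking rescoring improves retrieval quality.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model name
vectors = model.encode(chunks, normalize_embeddings=True)

con = duckdb.connect()  # in-memory database
con.execute("CREATE TABLE docs (id INTEGER, text VARCHAR, embedding FLOAT[])")
con.executemany(
    "INSERT INTO docs VALUES (?, ?, ?)",
    [(i, text, vec.tolist()) for i, (text, vec) in enumerate(zip(chunks, vectors))],
)

query = "What does an embedding do?"
query_vec = model.encode(query, normalize_embeddings=True).tolist()

# Rank stored chunks by cosine similarity to the query embedding.
hits = con.execute(
    """
    SELECT text, list_cosine_similarity(embedding, ?::FLOAT[]) AS score
    FROM docs
    ORDER BY score DESC
    LIMIT 3
    """,
    [query_vec],
).fetchall()
print(hits)
```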
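Reranking can be sketched with a sentence-transformers `CrossEncoder`: the vector search retrieves candidates cheaply, and the cross-encoder then rescores each query-candidate pair. The `cross-encoder/ms-marco-MiniLM-L-6-v2` checkpoint is a common choice, not necessarily the one used in the workshop.

```python
from sentence_transformers import CrossEncoder

# Placeholder query and candidates; normally these are the top hits from the vector search.
query = "What does an embedding do?"
candidates = [
    "Embeddings map text to points in a vector space.",
    "DuckDB is an in-process analytical database.",
    "Reranking rescoring improves retrieval quality.",
]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed checkpoint
scores = reranker.predict([(query, text) for text in candidates])

# Re-order candidates by cross-encoder score, highest first.
reranked = [
    text
    for _, text in sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
]
print(reranked)
```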
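Finally, a bare-bones RAG step just packs the best chunks into a prompt and hands it to an LLM. The prompt template and the `ask_llm` call below are placeholders for the notebook's own generation step.

```python
def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble a grounded prompt from the retrieved (and reranked) chunks."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Placeholder inputs; in practice the chunks are the top reranked search results.
prompt = build_prompt(
    "What does an embedding do?",
    ["Embeddings map text to points in a vector space."],
)
# answer = ask_llm(prompt)  # hypothetical call to whatever LLM the notebook uses
```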
## Resources

- [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) - Compare embedding models
- [Sentence Transformers Documentation](https://www.sbert.net/)
- [DuckDB Documentation](https://duckdb.org/docs/)
- [PyMuPDF Documentation](https://pymupdf.readthedocs.io/)