flowchart TD
subgraph "1. Document Processing"
A[Documents] --> B[Split Text into Chunks]
B --> C1[Chunk-1]
B --> C2[Chunk-2]
B --> C3[Chunk-n]
end
subgraph "2. Document Embedding"
EM1{{Embedding Model}}
C1 & C2 & C3 --> EM1
EM1 --> D1[Embedding-1] & D2[Embedding-2] & D3[Embedding-n]
end
subgraph "3. Indexing"
D1 & D2 & D3 --> E[(VectorDB)]
end
subgraph "4. Query Processing"
F[Query] --> EM2{{Embedding Model}}
EM2 --> G[Query Embedding]
end
subgraph "5. Retrieval"
G -->|Similarity Search| E
E -->|Top-K Retrieval| H[Relevant Chunks]
end
subgraph "6. Context Formation"
H --> I[Query + Relevant Chunks]
end
subgraph "7. Generation"
I --> J[LLM]
J --> K[Response]
end
F --> I
This guide walks you through creating a simple Retrieval-Augmented Generation (RAG) system using pure Python. We will use an embedding model and a language model (LLM) to retrieve relevant documents and generate responses based on a user's query.
The whole process can be broken down into two big steps:
- Knowledge base creation
- Retrieval and generation
To get started, you'll first need a knowledge base (documents, PDFs, wiki pages). This is the fodder for your language models (LLMs). The process involves:
- Chunking: Split the text into chunks of sub-documents to simplify ingestion.
- Embedding: Compute numerical embeddings for each chunk to understand the semantic similarity to queries.
- Storage: Store the embeddings in a way that allows quick retrieval. While a vector store/DB is often used, this tutorial shows that it's not essential.
When a user query comes in, an embedding is computed for the query, and we retrieve the most relevant chunks from the knowledge base. These relevant chunks are appended to the initial user query, forming a context that is fed into the LLM to generate a response.
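Before diving in, here is that flow condensed into a single hypothetical function. The names answer, embed_fn, and generate_fn are placeholders for the embedding model and LLM calls we set up step by step in the rest of this guide; treat it as a sketch of the shape of the system, not the implementation.
import numpy as np

def answer(query, paragraphs, embed_fn, generate_fn, k=3):
    chunk_embeddings = embed_fn(paragraphs)        # embed each chunk
    query_embedding = embed_fn([query])[0]         # embed the query
    scores = chunk_embeddings @ query_embedding    # similarity search
    top_idx = np.argsort(scores)[-k:][::-1]        # top-k retrieval
    context = "\n\n".join(paragraphs[i] for i in top_idx)
    return generate_fn(f"CONTEXT: {context}\nQUESTION: {query}")  # generation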
A few packages need to be installed before we get going:
- sentence-transformers: For embedding the documents and queries.
- numpy: For similarity comparisons.
- scipy: For advanced similarity computations.
- wikipedia-api: For loading a Wikipedia page as a knowledge base.
- textwrap: For formatting output text (part of the Python standard library, so it needs no installation).
!pip install -q sentence-transformers
!pip install -q wikipedia-api
!pip install -q numpy
!pip install -q scipy
Let's load the embedding model of our choice. In this tutorial, we are using gte-base-en-v1.5.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Alibaba-NLP/gte-base-en-v1.5", trust_remote_code=True)
The gte-base-en-v1.5 model is an open-source English model provided by Alibaba's NLP team. It is part of the GTE (General Text Embeddings) family, designed for generating high-quality embeddings suitable for various natural language processing tasks. The model is optimized for capturing semantic meaning in English text, making it useful for tasks like sentence similarity, semantic search, and clustering. The trust_remote_code=True parameter allows the use of the custom code associated with the model, ensuring that it loads and functions as intended.
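As a quick, optional sanity check, you can ask the model for its embedding dimensionality (for gte-base-en-v1.5 this should be 768):
# Optional sanity check on the embedding size produced by the model.
print(model.get_sentence_embedding_dimension())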
Start by loading a Wikipedia article as your knowledge base. The text will be split into manageable chunks (sub-documents), usually by paragraphs.
from wikipediaapi import Wikipedia

wiki = Wikipedia('RAGBot/0.0', 'en')
doc = wiki.page('Hayao_Miyazaki').text
paragraphs = doc.split('\n\n')  # chunking
While there are a ton of chunking strategies available, many of them don't work as expected. It's best to review your knowledge base (KB) and determine which strategy suits it best. In this case, we'll chunk the document by paragraphs; one common alternative is sketched below.
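For comparison, here is a minimal sketch of fixed-size chunking with overlap. The fixed_size_chunks helper is hypothetical and isn't used in the rest of this tutorial, which sticks with paragraph-based chunks:
def fixed_size_chunks(text, chunk_size=500, overlap=100):
    # Slide a window of `chunk_size` characters over the text,
    # stepping forward by `chunk_size - overlap` characters each time.
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

# e.g. fixed_chunks = fixed_size_chunks(doc)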
If you want to view how these chunks look, import the textwrap library and enumerate over each paragraph to print them.
import textwrap

for i, p in enumerate(paragraphs):
    wrapped_text = textwrap.fill(p, width=100)
    print("-----------------------------------------------------------------")
    print(wrapped_text)
    print("-----------------------------------------------------------------")
If your document contains images and tables, it is recommended to extract them separately and embed them using vision models.
Next, encode the text data (in this case, the paragraphs) into embeddings by calling the encode method on our model.
docs_embed = model.encode(paragraphs, normalize_embeddings=True)
These embeddings are dense vector representations of text that capture semantic meaning, allowing the model to understand and process text in a mathematical form. Note that we are normalizing the embeddings here.
What is normalization? It's a process that adjusts the values of an embedding so that the vector has unit norm (i.e., its length is 1).
Why normalize? Normalized embeddings ensure that the distance between vectors is primarily due to differences in direction rather than magnitude. This can improve the performance of models in tasks like similarity search, where you want to compare how "close" or "similar" different pieces of text are.
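To make normalization concrete, here is a small NumPy sketch on a toy vector; this is essentially what normalize_embeddings=True does for each embedding:
import numpy as np

v = np.array([3.0, 4.0])       # toy "embedding"
norm = np.linalg.norm(v)       # vector length: 5.0
v_unit = v / norm              # unit vector: [0.6, 0.8]
print(np.linalg.norm(v_unit))  # 1.0 -> unit norm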
The result, docs_embed, is a collection of vector representations of your text data, where each vector corresponds to a paragraph in the paragraphs list.
The shape attribute gives the number of chunks and the dimension of each embedded vector. (Note that the size of the embedding vector depends on the embedding model used.)
docs_embed.shape
You can then view what the actual embeddings look like: each one is a vector of floats, normalized to unit length.
docs_embed[0]
Let's embed a sample user query in a similar fashion to how we embedded the document.
query = "What was Studio Ghibli's first film?"
query_embed = model.encode(query, normalize_embeddings=True)
You can check the shape of query_embed to confirm the dimensionality of the embedded query.
query_embed.shape
One of the simplest ways to retrieve the most relevant chunks is to compute the dot product of the document embeddings and the query embedding.
The dot product is a mathematical operation that multiplies corresponding elements of two vectors (or matrices) and sums the results. It is commonly used to measure the similarity between two vectors.
(Notice that the transpose of the query_embed vector is taken when computing the dot product.)
import numpy as np
similarities = np.dot(docs_embed, query_embed.T)
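As a toy illustration of the operation itself (unrelated to our real embeddings), the dot product multiplies matching elements and sums them:
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
print(np.dot(a, b))  # 1*4 + 2*5 + 3*6 = 32.0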
The .shape attribute of a NumPy array returns a tuple representing the dimensions of the array.
similarities.shape
The expected shapes in this code would be:
- docs_embed has a shape of (n_docs, n_dim), where n_docs is the number of documents (chunks) and n_dim is the dimensionality of each document embedding.
- query_embed has a shape of (n_dim,); since it is a 1-D array, taking its transpose leaves it unchanged, and it simply acts as a single query vector.
- The resulting similarities array, after the dot product, has a shape of (n_docs,), meaning it's a 1-dimensional array (a vector) with n_docs elements. Each element represents the similarity score between the query and one of the documents.
- Why check the shape? Ensuring that the shape is (n_docs,) confirms that the dot product was performed correctly and that each document's similarity score was computed, as in the quick check below.
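To make that check explicit in code, a simple (optional) assertion works:
# Optional: fail loudly if the shapes don't line up as described above.
assert similarities.shape == (len(paragraphs),)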
You can print the similarities array to inspect the similarity scores, where each value corresponds to a dot product result:
print(similarities)
The dot product between two vectors (embeddings) measures their similarity: higher values indicate greater similarity between the query and the document. If the embeddings are normalized, the dot product is exactly the cosine similarity between the vectors. If they are not normalized, it still indicates similarity, but also reflects the magnitudes of the embeddings.
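Since we installed scipy earlier for similarity computations, you can verify this relationship directly: for normalized embeddings, the dot product and the cosine similarity should agree up to floating-point error.
from scipy.spatial.distance import cosine

# scipy's `cosine` returns the cosine *distance*, so similarity = 1 - distance.
cos_sim = 1 - cosine(docs_embed[0], query_embed)
print(cos_sim, similarities[0])  # the two values should match closely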
To identify the top 3 most similar documents based on the similarity scores, you can use the following code:
top_3_idx = np.argsort(similarities, axis=0)[-3:][::-1].tolist()
- np.argsort(similarities, axis=0): This function returns the indices that would sort the similarities array in ascending order. For example, if similarities = [0.1, 0.7, 0.4], np.argsort would return [0, 2, 1], where 0 is the index of the smallest value and 1 is the index of the largest value (see the snippet after this list).
- [-3:]: This slice selects the indices of the top 3 highest similarity scores (the last 3 elements after sorting).
- [::-1]: This reverses the order, so the indices are now in descending order of similarity.
- .tolist(): Converts the array of indices into a Python list.
Result: top_3_idx contains the indices of the top 3 most similar documents, in descending order of similarity.
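To see each step's output, here is the same chain of operations applied to the toy array from the first bullet:
toy = np.array([0.1, 0.7, 0.4])
print(np.argsort(toy))             # [0 2 1]  -> indices, ascending by value
print(np.argsort(toy)[-3:][::-1])  # [1 2 0]  -> top 3 indices, descending by value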
most_similar_documents = [paragraphs[idx] for idx in top_3_idx]
- List Comprehension: This line creates a list called most_similar_documents, which contains the actual paragraphs from the paragraphs list that correspond to the indices in top_3_idx.
- paragraphs[idx]: For each index in top_3_idx, this retrieves the corresponding paragraph.
The CONTEXT variable is initialized as an empty string and is then built up, inside an enumerate loop, by appending the wrapped text of the most similar documents.
CONTEXT = ""
for i, p in enumerate(most_similar_documents):
    wrapped_text = textwrap.fill(p, width=100)
    print("-----------------------------------------------------------------")
    print(wrapped_text)
    print("-----------------------------------------------------------------")
    CONTEXT += wrapped_text + "\n\n"
So, now we have a query + relevant chunks, which together will be fed to the LLM.
query = "What was Studio Ghibli's first film?"
prompt = f"""
Use the following CONTEXT to answer the QUESTION at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
CONTEXT: {CONTEXT}
QUESTION: {query}
"""
Install OpenAI to access and use LLMs.
!pip install -q openai
Enable access to the OpenAI API key (it can be set up in the Secrets tab on Google Colab).
from google.colab import userdata
userdata.get('openai')
import openai
Create an OpenAI client.
from openai import OpenAI

client = OpenAI(api_key=userdata.get('openai'))
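If you're not running on Colab, a common alternative is to read the key from an environment variable instead of Colab secrets. This is only a sketch and assumes you've already exported OPENAI_API_KEY in your shell:
import os
from openai import OpenAI

# Assumes the key was exported beforehand, e.g. `export OPENAI_API_KEY=...`
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])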
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": prompt},
    ],
)
- client.chat.completions.create: A method call that creates a new completion (response) using a chat-based language model.
- client: Represents the API client object connected to the service (here, OpenAI).
- chat.completions.create: Specifies that you're making a request to create a chat-based completion.
- model="gpt-4o": Specifies the model used to generate the response; "gpt-4o" refers to a specific variant of the GPT-4 family. Different models can have different behaviors, fine-tuning, or capabilities, so specifying the model is important to ensure you get the desired output.
- messages: A list of message objects that represents the conversation history and lets the model understand the context of the chat. In this case, we provide only one message, {"role": "user", "content": prompt}, where:
  - role: "user" indicates the role of the message sender, the person interacting with the model.
  - content: The actual text of the message sent by the user. The variable prompt holds this text, which the model uses as input to generate a response. (A multi-message variant is sketched after this list.)
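Because messages is a list, you can pass more than one message. As an optional variant (not part of the original setup), adding a system message is one way to steer the model toward using only the retrieved context:
# Optional variant: a system message constrains how the model answers.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer using only the provided CONTEXT."},
        {"role": "user", "content": prompt},
    ],
)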
When you make a request to an API like OpenAI's GPT models to generate a chat completion, the response usually comes back in a structured format, often as a dictionary.
This structure typically includes:
- choices: A list (array) containing different possible completions generated by the model. Each item in this list represents one possible completion or response.
- message: An object or dictionary within each choice that contains the actual content of the message generated by the model.
- content: The text content of the message, which is the actual response or completion generated by the model.
print(response.choices[0].message.content)
We select the first item in the choices list, then access the message object within it. Finally, we access the content field within the message, which contains the actual text generated by the model.
And that brings us to the end of building a RAG system from scratch. It's highly recommended to build your initial RAG setup in pure Python to get a better understanding of how these systems work.