refactor: get rid of Langchain dependency for document chunking and querying the Vector Database (#9)

* feat: add document loader

* refactor: splits returns documents

* feat: add text splitter

* refactor: move to unstructured

* chore: comment

* refactor: refactor references

* chore: update README.md

* chore: update README.md

* refactor: get rid of langchain fully

* chore: update the README.md

* refactor: refactored embedder and chroma client

* refactor: refactored chroma client and text splitter

* chore: updated todo

* refactor: move vector database to memory

* refactor: move vector database to memory

* refactor: add Chroma unit tests

* refactor: drop vector memory class

* chore: update README

* chore: reformat

* chore: reformat

* chore: bump version
umbertogriffo authored Dec 7, 2024
1 parent 58a3e5a commit f91e37a
Showing 28 changed files with 1,205 additions and 1,197 deletions.
32 changes: 13 additions & 19 deletions README.md
@@ -16,10 +16,8 @@
> GitHub [issue](https://github.com/abetlen/llama-cpp-python/issues).
> [!WARNING]
> llama-cpp-python doesn't use GPU on M1 if you are running an x86 version of Python. More info [here](https://github.com/abetlen/llama-cpp-python/issues/756#issuecomment-1870324323)
> [!WARNING]
> Note: it's important to note that the large language model sometimes generates hallucinations or false information.
> - `llama-cpp-python` doesn't use `GPU` on `M1` if you are running an `x86` version of `Python`. More info [here](https://github.com/abetlen/llama-cpp-python/issues/756#issuecomment-1870324323).
> - It's important to note that the large language model sometimes generates hallucinations or false information.
## Table of contents

@@ -40,13 +38,14 @@

## Introduction

This project combines the power
of [Lama.cpp](https://github.com/abetlen/llama-cpp-python), [LangChain](https://python.langchain.com/docs/get_started/introduction.html) (only used for document chunking and querying the Vector Database, and we plan to
eliminate it entirely), [Chroma](https://github.com/chroma-core/chroma) and [Streamlit](https://discuss.streamlit.io/) to build:
This project combines the power of [Lama.cpp](https://github.com/abetlen/llama-cpp-python), [Chroma](https://github.com/chroma-core/chroma) and [Streamlit](https://discuss.streamlit.io/) to build:

* a Conversation-aware Chatbot (ChatGPT like experience).
* a RAG (Retrieval-augmented generation) ChatBot.

> [!NOTE]
> We decided to utilize and refactor the `RecursiveCharacterTextSplitter` class from `LangChain` to properly chunk Markdown.
The RAG Chatbot works by taking a collection of Markdown files as input and, when asked a question, providing the corresponding answer based on the context provided by those files.
@@ -162,15 +161,15 @@ and put them under `docs`.
Run:

```shell
python chatbot/memory_builder.py --chunk-size 1000
python chatbot/memory_builder.py --chunk-size 1000 --chunk-overlap 50
```
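As a rough illustration of what `--chunk-size` and `--chunk-overlap` control, here is a minimal sketch of recursive, Markdown-aware splitting with overlap. It is a sketch under assumptions: the function name, separator order, and merging details are illustrative and are not the repository's actual splitter.

```python
# Illustrative sketch only; not the repository's splitter implementation.
def split_markdown(text: str, chunk_size: int = 1000, chunk_overlap: int = 50) -> list[str]:
    """Recursively split on coarser-to-finer separators until chunks fit chunk_size."""
    separators = ["\n## ", "\n### ", "\n\n", "\n", " ", ""]  # assumed Markdown-aware order

    def _split(segment: str, seps: list[str]) -> list[str]:
        if len(segment) <= chunk_size or not seps:
            return [segment]
        sep, finer = seps[0], seps[1:]
        pieces = segment.split(sep) if sep else list(segment)
        chunks: list[str] = []
        current = ""
        for piece in pieces:
            candidate = f"{current}{sep}{piece}" if current else piece
            if len(candidate) <= chunk_size:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                if len(piece) > chunk_size:
                    # Piece is still too big: recurse with finer separators.
                    chunks.extend(_split(piece, finer))
                    current = ""
                else:
                    current = piece
        if current:
            chunks.append(current)
        return chunks

    chunks = _split(text, separators)
    # Prepend a small tail of the previous chunk so neighbouring chunks share context.
    return [
        (chunks[i - 1][-chunk_overlap:] if i and chunk_overlap else "") + chunk
        for i, chunk in enumerate(chunks)
    ]
```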

## Run the Chatbot

To interact with a GUI type:

```shell
streamlit run chatbot/chatbot_app.py -- --model openchat-3.6 --max-new-tokens 1024
streamlit run chatbot/chatbot_app.py -- --model llama-3 --max-new-tokens 1024
```

![conversation-aware-chatbot.gif](images/conversation-aware-chatbot.gif)
@@ -180,7 +179,7 @@ streamlit run chatbot/chatbot_app.py -- --model openchat-3.6 --max-new-tokens 10
To interact with a GUI type:

```shell
streamlit run chatbot/rag_chatbot_app.py -- --model openchat-3.6 --k 2 --synthesis-strategy async-tree-summarization
streamlit run chatbot/rag_chatbot_app.py -- --model llama-3 --k 2 --synthesis-strategy async-tree-summarization
```

![rag_chatbot_example.gif](images%2Frag_chatbot_example.gif)
@@ -193,21 +192,13 @@ streamlit run chatbot/rag_chatbot_app.py -- --model openchat-3.6 --k 2 --synthes

* LLMs:
* [Calculating GPU memory for serving LLMs](https://www.substratus.ai/blog/calculating-gpu-memory-for-llm/)
* [Building Response Synthesis from Scratch](https://gpt-index.readthedocs.io/en/latest/examples/low_level/response_synthesis.html#)
* [GPT in 60 Lines of NumPy](https://jaykmody.com/blog/gpt-from-scratch/)
* [Introduction to Weight Quantization](https://towardsdatascience.com/introduction-to-weight-quantization-2494701b9c0c)
* LLM integration and Modules:
* [LangChain](https://python.langchain.com/docs/get_started/introduction.html):
* [MarkdownTextSplitter](https://api.python.langchain.com/en/latest/_modules/langchain/text_splitter.html#MarkdownTextSplitter)
* [Chroma Integration](https://python.langchain.com/docs/modules/data_connection/vectorstores/integrations/chroma)
* [The Problem With LangChain](https://minimaxir.com/2023/07/langchain-problem/#:~:text=The%20problem%20with%20LangChain%20is,don't%20start%20with%20LangChain)
* Embeddings:
* [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
* This is a `sentence-transformers` model: It maps sentences & paragraphs to a 384 dimensional dense vector
space and can be used for tasks like clustering or semantic search.
* Vector Databases:
* [Chroma](https://www.trychroma.com/)
* [Food Discovery with Qdrant](https://qdrant.tech/articles/new-recommendation-api/#)
* Indexing algorithms:
* There are many algorithms for building indexes to optimize vector search. Most vector databases
implement `Hierarchical Navigable Small World (HNSW)` and/or `Inverted File Index (IVF)`. Here are some great
@@ -218,13 +209,16 @@ streamlit run chatbot/rag_chatbot_app.py -- --model openchat-3.6 --k 2 --synthes
* [From NVIDIA - Accelerating Vector Search: Fine-Tuning GPU Index Algorithms](https://developer.nvidia.com/blog/accelerating-vector-search-fine-tuning-gpu-index-algorithms/)
* > PS: Flat indexes (i.e. no optimisation) can be used to maintain 100% recall and precision, at the
expense of speed.
* [Chroma](https://www.trychroma.com/)
* [Food Discovery with Qdrant](https://qdrant.tech/articles/new-recommendation-api/#)
* Retrieval Augmented Generation (RAG):
* [Building A Generative AI Platform](https://huyenchip.com/2024/07/25/genai-platform.html)
* [Rewrite-Retrieve-Read](https://github.com/langchain-ai/langchain/blob/master/cookbook/rewrite.ipynb)
* > Because the original query can not be always optimal to retrieve for the LLM, especially in the real world,
we first prompt an LLM to rewrite the queries, then conduct retrieval-augmented reading.
* [Rerank](https://txt.cohere.com/rag-chatbot/#implement-reranking)
* [Building Response Synthesis from Scratch](https://gpt-index.readthedocs.io/en/latest/examples/low_level/response_synthesis.html#)
* [Conversational awareness](https://langstream.ai/2023/10/13/rag-chatbot-with-conversation/)
* [Summarization: Improving RAG quality in LLM apps while minimizing vector storage costs](https://www.ninetack.io/post/improving-rag-quality-by-summarization)
* [RAG is Dead, Again?](https://jina.ai/news/rag-is-dead-again/)
* Chatbot Development:
* [Streamlit](https://discuss.streamlit.io/):
24 changes: 15 additions & 9 deletions chatbot/bot/client/lama_cpp_client.py
@@ -1,6 +1,6 @@
import os
from pathlib import Path
from typing import Any, Iterator, Union
from typing import Any, Iterator

import requests
from llama_cpp import CreateCompletionResponse, CreateCompletionStreamResponse, Llama
@@ -158,7 +158,7 @@ def stream_answer(self, prompt: str, max_new_tokens: int = 512) -> str:

def start_answer_iterator_streamer(
self, prompt: str, max_new_tokens: int = 512
) -> Union[CreateCompletionResponse, Iterator[CreateCompletionStreamResponse]]:
) -> CreateCompletionResponse | Iterator[CreateCompletionStreamResponse]:
"""
Abstract method to start an answer iterator streamer for a given prompt.
@@ -181,7 +181,7 @@ def start_answer_iterator_streamer(

async def async_start_answer_iterator_streamer(
self, prompt: str, max_new_tokens: int = 512
) -> Union[CreateCompletionResponse, Iterator[CreateCompletionStreamResponse]]:
) -> CreateCompletionResponse | Iterator[CreateCompletionStreamResponse]:
"""
This abstract method should be implemented to asynchronously start an answer iterator streamer,
providing a flexible way to generate answers in a streaming fashion based on the given prompt.
@@ -203,10 +203,12 @@ async def async_start_answer_iterator_streamer(

return stream

def parse_token(self, token):
@staticmethod
def parse_token(token):
return token["choices"][0]["delta"].get("content", "")

def generate_qa_prompt(self, question: str) -> str:
@staticmethod
def generate_qa_prompt(question: str) -> str:
"""
Generates a question-answering (QA) prompt using predefined templates.
@@ -222,7 +224,8 @@ def generate_qa_prompt(self, question: str) -> str:
question=question,
)

def generate_ctx_prompt(self, question: str, context: str) -> str:
@staticmethod
def generate_ctx_prompt(question: str, context: str) -> str:
"""
Generates a context-based prompt using predefined templates.
@@ -240,7 +243,8 @@ def generate_ctx_prompt(self, question: str, context: str) -> str:
context=context,
)

def generate_refined_ctx_prompt(self, question: str, context: str, existing_answer: str) -> str:
@staticmethod
def generate_refined_ctx_prompt(question: str, context: str, existing_answer: str) -> str:
"""
Generates a refined prompt for question-answering with existing answer.
@@ -260,15 +264,17 @@ def generate_refined_ctx_prompt(self, question: str, context: str, existing_answ
existing_answer=existing_answer,
)

def generate_refined_question_conversation_awareness_prompt(self, question: str, chat_history: str) -> str:
@staticmethod
def generate_refined_question_conversation_awareness_prompt(question: str, chat_history: str) -> str:
return generate_conversation_awareness_prompt(
template=REFINED_QUESTION_CONVERSATION_AWARENESS_PROMPT_TEMPLATE,
system=SYSTEM_TEMPLATE,
question=question,
chat_history=chat_history,
)

def generate_refined_answer_conversation_awareness_prompt(self, question: str, chat_history: str) -> str:
@staticmethod
def generate_refined_answer_conversation_awareness_prompt(question: str, chat_history: str) -> str:
return generate_conversation_awareness_prompt(
template=REFINED_ANSWER_CONVERSATION_AWARENESS_PROMPT_TEMPLATE,
system=SYSTEM_TEMPLATE,
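For context, here is a hypothetical usage sketch of the streaming path touched above. It assumes an already-constructed `LamaCppClient` instance (its constructor arguments are not shown in this diff) and that the call runs in streaming mode, so the return value is the `Iterator[CreateCompletionStreamResponse]` branch of the union type.

```python
# Hypothetical usage sketch; how `llm` is built and that it streams are assumptions.
from bot.client.lama_cpp_client import LamaCppClient


def stream_qa(llm: LamaCppClient, question: str) -> str:
    prompt = LamaCppClient.generate_qa_prompt(question)  # now a @staticmethod
    stream = llm.start_answer_iterator_streamer(prompt, max_new_tokens=512)
    answer = ""
    for token in stream:  # assumes the iterator branch of the union return type
        answer += LamaCppClient.parse_token(token)  # reads choices[0]["delta"]["content"]
    return answer
```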
12 changes: 6 additions & 6 deletions chatbot/bot/conversation/conversation_retrieval.py
@@ -1,8 +1,8 @@
from asyncio import get_event_loop
from typing import Any, List, Tuple
from typing import Any

from entities.document import Document
from helpers.log import get_logger
from langchain_core.documents import Document

from bot.client.lama_cpp_client import LamaCppClient
from bot.conversation.ctx_strategy import AsyncTreeSummarizationStrategy, BaseSynthesisStrategy
@@ -30,7 +30,7 @@ def __init__(self, llm: LamaCppClient) -> None:
self.llm = llm
self.chat_history = []

def get_chat_history(self) -> List[Tuple[str, str]]:
def get_chat_history(self) -> list[tuple[str, str]]:
"""
Gets the chat history.
@@ -40,7 +40,7 @@ def get_chat_history(self) -> List[Tuple[str, str]]:
"""
return self.chat_history

def update_chat_history(self, question: str, answer: str) -> List[Tuple[str, str]]:
def update_chat_history(self, question: str, answer: str) -> list[tuple[str, str]]:
"""
Updates the chat history.
@@ -57,7 +57,7 @@ def update_chat_history(self, question: str, answer: str) -> List[Tuple[str, str

return self.chat_history

def keep_chat_history_size(self, max_size: int = 2) -> List[Tuple[str, str]]:
def keep_chat_history_size(self, max_size: int = 2) -> list[tuple[str, str]]:
"""
Keeps the list of chat history at the specified maximum size by popping out the oldest elements.
@@ -160,7 +160,7 @@ def answer(self, question: str, max_new_tokens: int = 512) -> Any:
def context_aware_answer(
ctx_synthesis_strategy: BaseSynthesisStrategy,
question: str,
retrieved_contents: List[Document],
retrieved_contents: list[Document],
max_new_tokens: int = 512,
):
if isinstance(ctx_synthesis_strategy, AsyncTreeSummarizationStrategy):
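As a stand-alone illustration of the sliding-window behaviour behind `update_chat_history` and `keep_chat_history_size`, here is a minimal sketch; it is illustrative, not the `ConversationRetrieval` class itself.

```python
# Stand-alone sketch of the chat-history sliding window; not ConversationRetrieval itself.
chat_history: list[tuple[str, str]] = []


def update_chat_history(question: str, answer: str) -> list[tuple[str, str]]:
    chat_history.append((question, answer))
    return chat_history


def keep_chat_history_size(max_size: int = 2) -> list[tuple[str, str]]:
    # Evict the oldest (question, answer) pairs until at most `max_size` remain.
    while len(chat_history) > max_size:
        chat_history.pop(0)
    return chat_history
```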
22 changes: 11 additions & 11 deletions chatbot/bot/conversation/ctx_strategy.py
@@ -1,10 +1,10 @@
import asyncio
from enum import Enum
from typing import Any, List, Union
from typing import Any

import nest_asyncio
from entities.document import Document
from helpers.log import get_logger
from langchain_core.documents import Document

from bot.client.lama_cpp_client import LamaCppClient

@@ -35,7 +35,7 @@ def __init__(self, llm: LamaCppClient) -> None:
"""
self.llm = llm

def generate_response(self, retrieved_contents: List[Document], question: str, max_new_tokens: int = 512):
def generate_response(self, retrieved_contents: list[Document], question: str, max_new_tokens: int = 512):
"""
Generate a response using the synthesis strategy.
@@ -61,8 +61,8 @@ def __init__(self, llm: LamaCppClient):
super().__init__(llm)

def generate_response(
self, retrieved_contents: List[Document], question: str, max_new_tokens: int = 512
) -> Union[str, Any]:
self, retrieved_contents: list[Document], question: str, max_new_tokens: int = 512
) -> str | Any:
"""
Generate a response using create and refine strategy.
@@ -126,7 +126,7 @@ def __init__(self, llm: LamaCppClient):
super().__init__(llm)

def generate_response(
self, retrieved_contents: List[Document], question: str, max_new_tokens: int = 512, num_children: int = 2
self, retrieved_contents: list[Document], question: str, max_new_tokens: int = 512, num_children: int = 2
) -> Any:
"""
Generate a response using hierarchical summarization strategy.
@@ -170,9 +170,9 @@ def generate_response(

def combine_results(
self,
texts: List[str],
texts: list[str],
question: str,
cur_prompt_list: List[str],
cur_prompt_list: list[str],
max_new_tokens: int = 512,
num_children: int = 2,
) -> Any:
@@ -227,7 +227,7 @@ def __init__(self, llm: LamaCppClient):

async def generate_response(
self,
retrieved_contents: List[Document],
retrieved_contents: list[Document],
question: str,
max_new_tokens: int = 512,
num_children: int = 2,
@@ -278,9 +278,9 @@ async def generate_response(

async def combine_results(
self,
texts: List[str],
texts: list[str],
question: str,
cur_prompt_list: List[str],
cur_prompt_list: list[str],
max_new_tokens: int = 512,
num_children: int = 2,
):
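The hierarchical and async strategies above merge partial answers bottom-up. A minimal sketch of that tree-summarization idea follows, with `num_children` controlling the fan-in; it is illustrative, and `summarize` is a placeholder for the LLM call rather than the class API.

```python
# Illustrative tree-summarization sketch; `summarize` stands in for an LLM call.
from typing import Callable


def tree_summarize(
    summarize: Callable[[list[str], str], str],
    texts: list[str],
    question: str,
    num_children: int = 2,
) -> str:
    # Repeatedly condense groups of `num_children` partial answers until one remains.
    while len(texts) > 1:
        texts = [
            summarize(texts[i : i + num_children], question)
            for i in range(0, len(texts), num_children)
        ]
    return texts[0]
```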
51 changes: 42 additions & 9 deletions chatbot/bot/memory/embedder.py
@@ -1,16 +1,49 @@
from abc import ABC
from typing import Any

from langchain.embeddings import HuggingFaceEmbeddings
import sentence_transformers


class Embedder(ABC):
embedder: Any

def get_embedding(self):
return self.embedder


class EmbedderHuggingFace(Embedder):
def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
self.embedder = HuggingFaceEmbeddings(model_name=model_name)
class Embedder:
def __init__(self, model_name: str = "all-MiniLM-L6-v2", cache_folder: str | None = None, **kwargs: Any):
"""
Initialize the Embedder class with the specified parameters.

Args:
**kwargs (Any): Additional keyword arguments to pass to the SentenceTransformer model.
"""
self.client = sentence_transformers.SentenceTransformer(model_name, cache_folder=cache_folder, **kwargs)

def embed_documents(self, texts: list[str], multi_process: bool = False, **encode_kwargs: Any) -> list[list[float]]:
"""
Compute document embeddings using a transformer model.

Args:
texts (list[str]): The list of texts to embed.
multi_process (bool): If True, use multiple processes to compute embeddings.
**encode_kwargs (Any): Additional keyword arguments to pass when calling the `encode` method of the model.

Returns:
list[list[float]]: A list of embeddings, one for each text.
"""

texts = list(map(lambda x: x.replace("\n", " "), texts))
if multi_process:
pool = self.client.start_multi_process_pool()
embeddings = self.client.encode_multi_process(texts, pool)
sentence_transformers.SentenceTransformer.stop_multi_process_pool(pool)
else:
embeddings = self.client.encode(texts, show_progress_bar=True, **encode_kwargs)

return embeddings.tolist()

def embed_query(self, text: str) -> list[float]:
"""
Compute query embeddings using a transformer model.
Args:
text (str): The text to embed.
Returns:
list[float]: Embeddings for the text.
"""
return self.embed_documents([text])[0]
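A short usage sketch of the refactored, LangChain-free `Embedder`. The import path follows the repo layout shown above and the sample strings are illustrative.

```python
# Usage sketch; sample texts and import path are assumptions based on the diff above.
from bot.memory.embedder import Embedder

embedder = Embedder(model_name="all-MiniLM-L6-v2")

doc_vectors = embedder.embed_documents(["First Markdown chunk.", "Second Markdown chunk."])
query_vector = embedder.embed_query("What does the first chunk say?")

# all-MiniLM-L6-v2 maps text to a 384-dimensional dense vector space.
assert len(query_vector) == 384
```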