refactor: get rid of Langchain dependency for document chunking and querying the Vector Database (#9)

* feat: add document loader

* refactor: splits returns documents

* feat: add text splitter

* refactor: move to unstructured

* chore: comment

* refactor: refactor references

* chore: update README.md

* chore: update README.md

* refactor: get rid of langchain fully

* chore: update the README.md

* refactor: refactored embedder and chroma client

* refactor: refactored chroma client and text splitter

* chore: updated todo

* refactor: move vector database to memory

* refactor: move vector database to memory

* refactor: add Chroma unit tests

* refactor: drop vector memory class

* chore: update README

* chore: reformat

* chore: reformat

* chore: bump version
umbertogriffo authored Dec 7, 2024
1 parent 58a3e5a commit f91e37a
Showing 28 changed files with 1,205 additions and 1,197 deletions.
32 changes: 13 additions & 19 deletions README.md
@@ -16,10 +16,8 @@
> GitHub [issue](https://github.com/abetlen/llama-cpp-python/issues).
> [!WARNING]
> llama-cpp-python doesn't use GPU on M1 if you are running an x86 version of Python. More info [here](https://github.com/abetlen/llama-cpp-python/issues/756#issuecomment-1870324323)
> [!WARNING]
> Note: it's important to note that the large language model sometimes generates hallucinations or false information.
> - `llama-cpp-python` doesn't use `GPU` on `M1` if you are running an `x86` version of `Python`. More info [here](https://github.com/abetlen/llama-cpp-python/issues/756#issuecomment-1870324323).
> - It's important to note that the large language model sometimes generates hallucinations or false information.
## Table of contents

@@ -40,13 +38,14 @@

## Introduction

This project combines the power
of [Lama.cpp](https://github.com/abetlen/llama-cpp-python), [LangChain](https://python.langchain.com/docs/get_started/introduction.html) (only used for document chunking and querying the Vector Database, and we plan to
eliminate it entirely), [Chroma](https://github.com/chroma-core/chroma) and [Streamlit](https://discuss.streamlit.io/) to build:
This project combines the power of [Lama.cpp](https://github.com/abetlen/llama-cpp-python), [Chroma](https://github.com/chroma-core/chroma) and [Streamlit](https://discuss.streamlit.io/) to build:

* a Conversation-aware Chatbot (ChatGPT like experience).
* a RAG (Retrieval-augmented generation) ChatBot.

> [!NOTE]
> We decided to utilize and refactor the `RecursiveCharacterTextSplitter` class from `LangChain` to properly chunk Markdown.
The RAG Chatbot works by taking a collection of Markdown files as input and, when asked a question, providing the corresponding answer based on the context provided by those files.
@@ -162,15 +161,15 @@ and put them under `docs`.
Run:

```shell
python chatbot/memory_builder.py --chunk-size 1000
python chatbot/memory_builder.py --chunk-size 1000 --chunk-overlap 50
```
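As a rough illustration of what `--chunk-size` and `--chunk-overlap` control, here is a minimal sketch of recursive, Markdown-aware splitting with overlap. It is a sketch under assumptions: the function name, separator order, and merging details are illustrative and are not the repository's actual splitter.

```python
# Illustrative sketch only; not the repository's splitter implementation.
def split_markdown(text: str, chunk_size: int = 1000, chunk_overlap: int = 50) -> list[str]:
    """Recursively split on coarser-to-finer separators until chunks fit chunk_size."""
    separators = ["\n## ", "\n### ", "\n\n", "\n", " ", ""]  # assumed Markdown-aware order

    def _split(segment: str, seps: list[str]) -> list[str]:
        if len(segment) <= chunk_size or not seps:
            return [segment]
        sep, finer = seps[0], seps[1:]
        pieces = segment.split(sep) if sep else list(segment)
        chunks: list[str] = []
        current = ""
        for piece in pieces:
            candidate = f"{current}{sep}{piece}" if current else piece
            if len(candidate) <= chunk_size:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                if len(piece) > chunk_size:
                    # Piece is still too big: recurse with finer separators.
                    chunks.extend(_split(piece, finer))
                    current = ""
                else:
                    current = piece
        if current:
            chunks.append(current)
        return chunks

    chunks = _split(text, separators)
    # Prepend a small tail of the previous chunk so neighbouring chunks share context.
    return [
        (chunks[i - 1][-chunk_overlap:] if i and chunk_overlap else "") + chunk
        for i, chunk in enumerate(chunks)
    ]
```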

## Run the Chatbot

To interact with a GUI type:

```shell
streamlit run chatbot/chatbot_app.py -- --model openchat-3.6 --max-new-tokens 1024
streamlit run chatbot/chatbot_app.py -- --model llama-3 --max-new-tokens 1024
```

![conversation-aware-chatbot.gif](images/conversation-aware-chatbot.gif)
@@ -180,7 +179,7 @@ streamlit run chatbot/chatbot_app.py -- --model openchat-3.6 --max-new-tokens 10
To interact with a GUI type:

```shell
streamlit run chatbot/rag_chatbot_app.py -- --model openchat-3.6 --k 2 --synthesis-strategy async-tree-summarization
streamlit run chatbot/rag_chatbot_app.py -- --model llama-3 --k 2 --synthesis-strategy async-tree-summarization
```

![rag_chatbot_example.gif](images%2Frag_chatbot_example.gif)
@@ -193,21 +192,13 @@ streamlit run chatbot/rag_chatbot_app.py -- --model openchat-3.6 --k 2 --synthes

* LLMs:
* [Calculating GPU memory for serving LLMs](https://www.substratus.ai/blog/calculating-gpu-memory-for-llm/)
* [Building Response Synthesis from Scratch](https://gpt-index.readthedocs.io/en/latest/examples/low_level/response_synthesis.html#)
* [GPT in 60 Lines of NumPy](https://jaykmody.com/blog/gpt-from-scratch/)
* [Introduction to Weight Quantization](https://towardsdatascience.com/introduction-to-weight-quantization-2494701b9c0c)
* LLM integration and Modules:
* [LangChain](https://python.langchain.com/docs/get_started/introduction.html):
* [MarkdownTextSplitter](https://api.python.langchain.com/en/latest/_modules/langchain/text_splitter.html#MarkdownTextSplitter)
* [Chroma Integration](https://python.langchain.com/docs/modules/data_connection/vectorstores/integrations/chroma)
* [The Problem With LangChain](https://minimaxir.com/2023/07/langchain-problem/#:~:text=The%20problem%20with%20LangChain%20is,don't%20start%20with%20LangChain)
* Embeddings:
* [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
* This is a `sentence-transformers` model: It maps sentences & paragraphs to a 384 dimensional dense vector
space and can be used for tasks like clustering or semantic search.
* Vector Databases:
* [Chroma](https://www.trychroma.com/)
* [Food Discovery with Qdrant](https://qdrant.tech/articles/new-recommendation-api/#)
* Indexing algorithms:
* There are many algorithms for building indexes to optimize vector search. Most vector databases
implement `Hierarchical Navigable Small World (HNSW)` and/or `Inverted File Index (IVF)`. Here are some great
@@ -218,13 +209,16 @@ streamlit run chatbot/rag_chatbot_app.py -- --model openchat-3.6 --k 2 --synthes
* [From NVIDIA - Accelerating Vector Search: Fine-Tuning GPU Index Algorithms](https://developer.nvidia.com/blog/accelerating-vector-search-fine-tuning-gpu-index-algorithms/)
* > PS: Flat indexes (i.e. no optimisation) can be used to maintain 100% recall and precision, at the
expense of speed.
* [Chroma](https://www.trychroma.com/)
* [Food Discovery with Qdrant](https://qdrant.tech/articles/new-recommendation-api/#)
* Retrieval Augmented Generation (RAG):
* [Building A Generative AI Platform](https://huyenchip.com/2024/07/25/genai-platform.html)
* [Rewrite-Retrieve-Read](https://github.com/langchain-ai/langchain/blob/master/cookbook/rewrite.ipynb)
* > Because the original query can not be always optimal to retrieve for the LLM, especially in the real world,
we first prompt an LLM to rewrite the queries, then conduct retrieval-augmented reading.
* [Rerank](https://txt.cohere.com/rag-chatbot/#implement-reranking)
* [Building Response Synthesis from Scratch](https://gpt-index.readthedocs.io/en/latest/examples/low_level/response_synthesis.html#)
* [Conversational awareness](https://langstream.ai/2023/10/13/rag-chatbot-with-conversation/)
* [Summarization: Improving RAG quality in LLM apps while minimizing vector storage costs](https://www.ninetack.io/post/improving-rag-quality-by-summarization)
* [RAG is Dead, Again?](https://jina.ai/news/rag-is-dead-again/)
* Chatbot Development:
* [Streamlit](https://discuss.streamlit.io/):
24 changes: 15 additions & 9 deletions chatbot/bot/client/lama_cpp_client.py
@@ -1,6 +1,6 @@
import os
from pathlib import Path
from typing import Any, Iterator, Union
from typing import Any, Iterator

import requests
from llama_cpp import CreateCompletionResponse, CreateCompletionStreamResponse, Llama
@@ -158,7 +158,7 @@ def stream_answer(self, prompt: str, max_new_tokens: int = 512) -> str:

def start_answer_iterator_streamer(
self, prompt: str, max_new_tokens: int = 512
) -> Union[CreateCompletionResponse, Iterator[CreateCompletionStreamResponse]]:
) -> CreateCompletionResponse | Iterator[CreateCompletionStreamResponse]:
"""
Abstract method to start an answer iterator streamer for a given prompt.
@@ -181,7 +181,7 @@ def start_answer_iterator_streamer(

async def async_start_answer_iterator_streamer(
self, prompt: str, max_new_tokens: int = 512
) -> Union[CreateCompletionResponse, Iterator[CreateCompletionStreamResponse]]:
) -> CreateCompletionResponse | Iterator[CreateCompletionStreamResponse]:
"""
This abstract method should be implemented to asynchronously start an answer iterator streamer,
providing a flexible way to generate answers in a streaming fashion based on the given prompt.
@@ -203,10 +203,12 @@ async def async_start_answer_iterator_streamer(

return stream

def parse_token(self, token):
@staticmethod
def parse_token(token):
return token["choices"][0]["delta"].get("content", "")

def generate_qa_prompt(self, question: str) -> str:
@staticmethod
def generate_qa_prompt(question: str) -> str:
"""
Generates a question-answering (QA) prompt using predefined templates.
@@ -222,7 +224,8 @@ def generate_qa_prompt(self, question: str) -> str:
question=question,
)

def generate_ctx_prompt(self, question: str, context: str) -> str:
@staticmethod
def generate_ctx_prompt(question: str, context: str) -> str:
"""
Generates a context-based prompt using predefined templates.
@@ -240,7 +243,8 @@ def generate_ctx_prompt(self, question: str, context: str) -> str:
context=context,
)

def generate_refined_ctx_prompt(self, question: str, context: str, existing_answer: str) -> str:
@staticmethod
def generate_refined_ctx_prompt(question: str, context: str, existing_answer: str) -> str:
"""
Generates a refined prompt for question-answering with existing answer.
@@ -260,15 +264,17 @@ def generate_refined_ctx_prompt(self, question: str, context: str, existing_answ
existing_answer=existing_answer,
)

def generate_refined_question_conversation_awareness_prompt(self, question: str, chat_history: str) -> str:
@staticmethod
def generate_refined_question_conversation_awareness_prompt(question: str, chat_history: str) -> str:
return generate_conversation_awareness_prompt(
template=REFINED_QUESTION_CONVERSATION_AWARENESS_PROMPT_TEMPLATE,
system=SYSTEM_TEMPLATE,
question=question,
chat_history=chat_history,
)

def generate_refined_answer_conversation_awareness_prompt(self, question: str, chat_history: str) -> str:
@staticmethod
def generate_refined_answer_conversation_awareness_prompt(question: str, chat_history: str) -> str:
return generate_conversation_awareness_prompt(
template=REFINED_ANSWER_CONVERSATION_AWARENESS_PROMPT_TEMPLATE,
system=SYSTEM_TEMPLATE,
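For context, here is a hypothetical usage sketch of the streaming path touched above. It assumes an already-constructed `LamaCppClient` instance (its constructor arguments are not shown in this diff) and that the call runs in streaming mode, so the return value is the `Iterator[CreateCompletionStreamResponse]` branch of the union type.

```python
# Hypothetical usage sketch; how `llm` is built and that it streams are assumptions.
from bot.client.lama_cpp_client import LamaCppClient


def stream_qa(llm: LamaCppClient, question: str) -> str:
    prompt = LamaCppClient.generate_qa_prompt(question)  # now a @staticmethod
    stream = llm.start_answer_iterator_streamer(prompt, max_new_tokens=512)
    answer = ""
    for token in stream:  # assumes the iterator branch of the union return type
        answer += LamaCppClient.parse_token(token)  # reads choices[0]["delta"]["content"]
    return answer
```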
12 changes: 6 additions & 6 deletions chatbot/bot/conversation/conversation_retrieval.py
@@ -1,8 +1,8 @@
from asyncio import get_event_loop
from typing import Any, List, Tuple
from typing import Any

from entities.document import Document
from helpers.log import get_logger
from langchain_core.documents import Document

from bot.client.lama_cpp_client import LamaCppClient
from bot.conversation.ctx_strategy import AsyncTreeSummarizationStrategy, BaseSynthesisStrategy
@@ -30,7 +30,7 @@ def __init__(self, llm: LamaCppClient) -> None:
self.llm = llm
self.chat_history = []

def get_chat_history(self) -> List[Tuple[str, str]]:
def get_chat_history(self) -> list[tuple[str, str]]:
"""
Gets the chat history.
@@ -40,7 +40,7 @@ def get_chat_history(self) -> List[Tuple[str, str]]:
"""
return self.chat_history

def update_chat_history(self, question: str, answer: str) -> List[Tuple[str, str]]:
def update_chat_history(self, question: str, answer: str) -> list[tuple[str, str]]:
"""
Updates the chat history.
@@ -57,7 +57,7 @@ def update_chat_history(self, question: str, answer: str) -> List[Tuple[str, str

return self.chat_history

def keep_chat_history_size(self, max_size: int = 2) -> List[Tuple[str, str]]:
def keep_chat_history_size(self, max_size: int = 2) -> list[tuple[str, str]]:
"""
Keeps the list of chat history at the specified maximum size by popping out the oldest elements.
@@ -160,7 +160,7 @@ def answer(self, question: str, max_new_tokens: int = 512) -> Any:
def context_aware_answer(
ctx_synthesis_strategy: BaseSynthesisStrategy,
question: str,
retrieved_contents: List[Document],
retrieved_contents: list[Document],
max_new_tokens: int = 512,
):
if isinstance(ctx_synthesis_strategy, AsyncTreeSummarizationStrategy):
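As a stand-alone illustration of the sliding-window behaviour behind `update_chat_history` and `keep_chat_history_size`, here is a minimal sketch; it is illustrative, not the `ConversationRetrieval` class itself.

```python
# Stand-alone sketch of the chat-history sliding window; not ConversationRetrieval itself.
chat_history: list[tuple[str, str]] = []


def update_chat_history(question: str, answer: str) -> list[tuple[str, str]]:
    chat_history.append((question, answer))
    return chat_history


def keep_chat_history_size(max_size: int = 2) -> list[tuple[str, str]]:
    # Evict the oldest (question, answer) pairs until at most `max_size` remain.
    while len(chat_history) > max_size:
        chat_history.pop(0)
    return chat_history
```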
22 changes: 11 additions & 11 deletions chatbot/bot/conversation/ctx_strategy.py
@@ -1,10 +1,10 @@
import asyncio
from enum import Enum
from typing import Any, List, Union
from typing import Any

import nest_asyncio
from entities.document import Document
from helpers.log import get_logger
from langchain_core.documents import Document

from bot.client.lama_cpp_client import LamaCppClient

@@ -35,7 +35,7 @@ def __init__(self, llm: LamaCppClient) -> None:
"""
self.llm = llm

def generate_response(self, retrieved_contents: List[Document], question: str, max_new_tokens: int = 512):
def generate_response(self, retrieved_contents: list[Document], question: str, max_new_tokens: int = 512):
"""
Generate a response using the synthesis strategy.
@@ -61,8 +61,8 @@ def __init__(self, llm: LamaCppClient):
super().__init__(llm)

def generate_response(
self, retrieved_contents: List[Document], question: str, max_new_tokens: int = 512
) -> Union[str, Any]:
self, retrieved_contents: list[Document], question: str, max_new_tokens: int = 512
) -> str | Any:
"""
Generate a response using create and refine strategy.
@@ -126,7 +126,7 @@ def __init__(self, llm: LamaCppClient):
super().__init__(llm)

def generate_response(
self, retrieved_contents: List[Document], question: str, max_new_tokens: int = 512, num_children: int = 2
self, retrieved_contents: list[Document], question: str, max_new_tokens: int = 512, num_children: int = 2
) -> Any:
"""
Generate a response using hierarchical summarization strategy.
@@ -170,9 +170,9 @@ def generate_response(

def combine_results(
self,
texts: List[str],
texts: list[str],
question: str,
cur_prompt_list: List[str],
cur_prompt_list: list[str],
max_new_tokens: int = 512,
num_children: int = 2,
) -> Any:
@@ -227,7 +227,7 @@ def __init__(self, llm: LamaCppClient):

async def generate_response(
self,
retrieved_contents: List[Document],
retrieved_contents: list[Document],
question: str,
max_new_tokens: int = 512,
num_children: int = 2,
@@ -278,9 +278,9 @@ async def generate_response(

async def combine_results(
self,
texts: List[str],
texts: list[str],
question: str,
cur_prompt_list: List[str],
cur_prompt_list: list[str],
max_new_tokens: int = 512,
num_children: int = 2,
):
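The hierarchical and async strategies above merge partial answers bottom-up. A minimal sketch of that tree-summarization idea follows, with `num_children` controlling the fan-in; it is illustrative, and `summarize` is a placeholder for the LLM call rather than the class API.

```python
# Illustrative tree-summarization sketch; `summarize` stands in for an LLM call.
from typing import Callable


def tree_summarize(
    summarize: Callable[[list[str], str], str],
    texts: list[str],
    question: str,
    num_children: int = 2,
) -> str:
    # Repeatedly condense groups of `num_children` partial answers until one remains.
    while len(texts) > 1:
        texts = [
            summarize(texts[i : i + num_children], question)
            for i in range(0, len(texts), num_children)
        ]
    return texts[0]
```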
51 changes: 42 additions & 9 deletions chatbot/bot/memory/embedder.py
@@ -1,16 +1,49 @@
from abc import ABC
from typing import Any

from langchain.embeddings import HuggingFaceEmbeddings
import sentence_transformers


class Embedder(ABC):
embedder: Any

def get_embedding(self):
return self.embedder


class EmbedderHuggingFace(Embedder):
def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
self.embedder = HuggingFaceEmbeddings(model_name=model_name)
class Embedder:
def __init__(self, model_name: str = "all-MiniLM-L6-v2", cache_folder: str | None = None, **kwargs: Any):
"""
Initialize the Embedder class with the specified parameters.

Args:
**kwargs (Any): Additional keyword arguments to pass to the SentenceTransformer model.
"""
self.client = sentence_transformers.SentenceTransformer(model_name, cache_folder=cache_folder, **kwargs)

def embed_documents(self, texts: list[str], multi_process: bool = False, **encode_kwargs: Any) -> list[list[float]]:
"""
Compute document embeddings using a transformer model.

Args:
texts (list[str]): The list of texts to embed.
multi_process (bool): If True, use multiple processes to compute embeddings.
**encode_kwargs (Any): Additional keyword arguments to pass when calling the `encode` method of the model.

Returns:
list[list[float]]: A list of embeddings, one for each text.
"""

texts = list(map(lambda x: x.replace("\n", " "), texts))
if multi_process:
pool = self.client.start_multi_process_pool()
embeddings = self.client.encode_multi_process(texts, pool)
sentence_transformers.SentenceTransformer.stop_multi_process_pool(pool)
else:
embeddings = self.client.encode(texts, show_progress_bar=True, **encode_kwargs)

return embeddings.tolist()

def embed_query(self, text: str) -> list[float]:
"""
Compute query embeddings using a transformer model.
Args:
text (str): The text to embed.
Returns:
list[float]: Embeddings for the text.
"""
return self.embed_documents([text])[0]
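A short usage sketch of the refactored, LangChain-free `Embedder`. The import path follows the repo layout shown above and the sample strings are illustrative.

```python
# Usage sketch; sample texts and import path are assumptions based on the diff above.
from bot.memory.embedder import Embedder

embedder = Embedder(model_name="all-MiniLM-L6-v2")

doc_vectors = embedder.embed_documents(["First Markdown chunk.", "Second Markdown chunk."])
query_vector = embedder.embed_query("What does the first chunk say?")

# all-MiniLM-L6-v2 maps text to a 384-dimensional dense vector space.
assert len(query_vector) == 384
```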