RAG (Retrieval-augmented generation) ChatBot

CI pre-commit Code style: Ruff

Important

Disclaimer: The code has been tested on:

  • Ubuntu 22.04.2 LTS running on a Lenovo Legion 5 Pro with a 12th Gen Intel® Core™ i7-12700H (20 logical cores) and an NVIDIA GeForce RTX 3060.
  • macOS Sonoma 14.3.1 running on a MacBook Pro M1 (2020).

If you are using another operating system or different hardware and you can't load the models, please take a look at the official llama-cpp-python GitHub issues.

Warning

  • llama_cpp_python doesn't use the GPU on M1 if you are running an x86 version of Python. More info here.
  • It's important to note that the large language model sometimes generates hallucinations or false information.
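A quick way to see which architecture the running interpreter reports is sketched below. It relies only on the standard library; the helper name `is_native_arm64` is hypothetical and not part of this project. A native Apple Silicon build of Python typically reports `arm64`, while an x86 build running under Rosetta reports `x86_64`.

```python
import platform
import sys


def is_native_arm64(machine: str) -> bool:
    """Return True if the architecture string denotes an ARM64 interpreter."""
    return machine.lower() in {"arm64", "aarch64"}


machine = platform.machine()
if sys.platform == "darwin" and not is_native_arm64(machine):
    # On macOS this usually means an x86 Python (Rosetta or an Intel Mac).
    print(f"Python reports '{machine}': GPU offload on M1 is likely unavailable.")
else:
    print(f"Python reports '{machine}' on platform '{sys.platform}'.")
```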


Introduction

This project combines the power of llama.cpp, Chroma and Streamlit to build:

  • a Conversation-aware Chatbot (ChatGPT-like experience).
  • a RAG (Retrieval-augmented generation) ChatBot.

The RAG Chatbot works by taking a collection of Markdown files as input and, when asked a question, provides the corresponding answer based on the context provided by those files.

rag-chatbot-architecture-1.png

Note

We decided to grab and refactor the RecursiveCharacterTextSplitter class from LangChain to effectively chunk Markdown files without adding LangChain as a dependency.

The Memory Builder component of the project loads Markdown pages from the docs folder. It then divides these pages into smaller sections, calculates the embeddings (a numerical representation) of these sections with the all-MiniLM-L6-v2 sentence-transformer, and saves them in an embedding database called Chroma for later use.
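To illustrate the chunking step, here is a minimal sketch of splitting text into fixed-size chunks with overlap. It is a simpler sliding-window splitter, not the refactored RecursiveCharacterTextSplitter used by the project, and the function name `chunk_text` is hypothetical. The embedding and Chroma storage steps are omitted.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters so that a sentence
    cut at a boundary still appears whole in one of the two chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

A character-based window ignores Markdown structure; a recursive splitter instead tries larger separators (headings, paragraphs) first and only falls back to smaller ones when a piece is still too long.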

When a user asks a question, the RAG ChatBot retrieves the most relevant sections from the embedding database. Since the original question is not always well suited for retrieval, we first prompt an LLM to rewrite the question, then conduct retrieval-augmented reading. The most relevant sections are then used as context to generate the final answer using a local large language model (LLM). Additionally, the chatbot is designed to remember previous interactions. It saves the chat history and considers the relevant context from previous conversations to provide more accurate answers.
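The retrieval step can be sketched as a top-k similarity search over stored embeddings. The example below uses toy 3-dimensional vectors and cosine similarity in plain Python. Both are assumptions for illustration: in the project the vectors come from the sentence-transformer and the search is delegated to Chroma, whose configured distance metric is not stated here. The names `cosine_similarity` and `retrieve_top_k` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec: list[float],
                   index: list[tuple[str, list[float]]],
                   k: int) -> list[str]:
    """Return the k section texts whose embeddings are closest to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy index: (section text, made-up embedding).
index = [
    ("Vacation policy", [0.9, 0.1, 0.0]),
    ("Expense reports", [0.1, 0.9, 0.0]),
    ("Sick leave", [0.7, 0.2, 0.1]),
]
print(retrieve_top_k([1.0, 0.0, 0.0], index, k=2))
# → ['Vacation policy', 'Sick leave']
```

The value of `k` plays the same role as the `--k` option of the RAG chatbot command: it bounds how many sections are passed to the LLM as context.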

To deal with context overflows, we implemented three approaches:

  • Create And Refine the Context: synthesize a response sequentially through all retrieved contents.
    • create-and-refine-the-context.png
  • Hierarchical Summarization of Context: generate an answer for each relevant section independently, and then hierarchically combine the answers.
    • hierarchical-summarization.png
  • Async Hierarchical Summarization of Context: a parallelized version of the Hierarchical Summarization of Context, which leads to large speedups in response synthesis.
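The control flow of the three approaches can be sketched as follows. This is not the project's implementation: `combine` is a hypothetical stand-in for a single LLM call and simply joins its inputs, so the nesting of parentheses shows the order in which the sections would be merged. The `fan_in` parameter is also an assumption.

```python
import asyncio


def combine(question: str, texts: list[str]) -> str:
    """Stand-in for one LLM call that answers `question` from `texts`."""
    return "(" + " + ".join(texts) + ")"


def create_and_refine(question: str, chunks: list[str]) -> str:
    """Answer from the first chunk, then refine the answer with each next chunk."""
    if not chunks:
        raise ValueError("at least one chunk is required")
    answer = combine(question, [chunks[0]])
    for chunk in chunks[1:]:
        answer = combine(question, [answer, chunk])
    return answer


def tree_summarize(question: str, chunks: list[str], fan_in: int = 2) -> str:
    """Answer each chunk independently, then merge answers level by level."""
    if not chunks or fan_in < 2:
        raise ValueError("need at least one chunk and fan_in >= 2")
    level = [combine(question, [c]) for c in chunks]
    while len(level) > 1:
        level = [combine(question, level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
    return level[0]


async def combine_async(question: str, texts: list[str]) -> str:
    await asyncio.sleep(0)  # placeholder for awaiting a real model call
    return combine(question, texts)


async def async_tree_summarize(question: str, chunks: list[str], fan_in: int = 2) -> str:
    """Same tree as tree_summarize, but calls within a level run concurrently."""
    if not chunks or fan_in < 2:
        raise ValueError("need at least one chunk and fan_in >= 2")
    level = await asyncio.gather(*(combine_async(question, [c]) for c in chunks))
    while len(level) > 1:
        groups = [level[i:i + fan_in] for i in range(0, len(level), fan_in)]
        level = await asyncio.gather(*(combine_async(question, g) for g in groups))
    return level[0]


chunks = ["A", "B", "C"]
print(create_and_refine("q", chunks))  # → (((A) + B) + C)
print(tree_summarize("q", chunks))     # → (((A) + (B)) + ((C)))
print(asyncio.run(async_tree_summarize("q", chunks)))  # same as tree_summarize
```

Create-and-refine needs one call per chunk and each call depends on the previous one, so it cannot be parallelized. In the tree variant the calls within a level are independent, which is what the async version exploits.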

Prerequisites

  • Python 3.10+
  • GPU supporting CUDA 12.1+
  • Poetry 1.7.0

Install Poetry

Install Poetry with the official installer by following this link.

You must use the currently adopted version of Poetry defined here.

If you already have Poetry installed and it is not the right version, you can downgrade (or upgrade) it with:

poetry self update <version>

Bootstrap Environment

To easily install the dependencies we created a Makefile.

How to use the Makefile

Important

Run Setup as your init command (or after Clean).

  • Check: make check
    • Use it to check that which pip3 and which python3 point to the right path.
  • Setup:
    • Setup with NVIDIA CUDA acceleration: make setup_cuda
      • Creates an environment and installs all dependencies with NVIDIA CUDA acceleration.
    • Setup with Metal GPU acceleration: make setup_metal
      • Creates an environment and installs all dependencies with Metal GPU acceleration (macOS only).
  • Update: make update
    • Updates the environment and installs all updated dependencies.
  • Tidy up the code: make tidy
    • Runs Ruff check and format.
  • Clean: make clean
    • Removes the environment and all cached files.
  • Test: make test
    • Runs all tests using pytest.

Using the Open-Source Models Locally

We use the open-source library llama-cpp-python, a Python binding for llama.cpp. llama.cpp is a C/C++ backend designed to run transformer-based models efficiently. Running these LLMs at full precision on a typical local PC is often impractical because of their large number of parameters (~7 billion). This library enables us to run them on either a CPU or a GPU. Additionally, we use quantization to 4-bit precision to reduce the number of bits required to represent each weight. The quantized models are stored in GGML/GGUF format.
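A back-of-the-envelope calculation shows why quantization matters. The sketch below estimates the memory needed for the weights alone; the helper `weight_memory_gb` is hypothetical, and the figures ignore the KV cache, activations and the per-block metadata that quantized formats add, so real files are somewhat larger than the 4-bit estimate.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


params = 7e9  # ~7 billion parameters
print(f"16-bit: {weight_memory_gb(params, 16):.1f} GB")  # 16-bit: 14.0 GB
print(f" 4-bit: {weight_memory_gb(params, 4):.1f} GB")   #  4-bit: 3.5 GB
```

Going from 16 to 4 bits per weight cuts the weight footprint by a factor of four, which is what brings a 7B model within reach of a consumer GPU or laptop RAM.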

Supported Models

| 🤖 Model | Supported | Model Size | Max Context Window | Notes and link to the model card |
|---|---|---|---|---|
| llama-3.2 - Meta Llama 3.2 Instruct | ✅ | 1B | 128k | Optimized to run locally on a mobile or edge device - Card |
| llama-3.2 - Meta Llama 3.2 Instruct | ✅ | 3B | 128k | Optimized to run locally on a mobile or edge device - Card |
| llama-3.1 - Meta Llama 3.1 Instruct | ✅ | 8B | 128k | Recommended model - Card |
| openchat-3.6 - OpenChat 3.6 | ✅ | 8B | 8192 | Card |
| openchat-3.5 - OpenChat 3.5 | ✅ | 7B | 8192 | Card |
| starling - Starling Beta | ✅ | 7B | 8192 | Trained from Openchat-3.5-0106. Recommended if you prefer more verbosity than OpenChat - Card |
| phi-3.5 - Phi-3.5 Mini Instruct | ✅ | 3.8B | 128k | Card |
| stablelm-zephyr - StableLM Zephyr OpenOrca | ✅ | 3B | 4096 | Card |

Supported Response Synthesis strategies

| ✨ Response Synthesis strategy | Supported | Notes |
|---|---|---|
| create-and-refine - Create and Refine | ✅ | |
| tree-summarization - Tree Summarization | ✅ | |
| async-tree-summarization - Recommended - Async Tree Summarization | ✅ | |

Example Data

You can download some Markdown pages from the Blendle Employee Handbook and put them under docs.

Build the memory index

Run:

python chatbot/memory_builder.py --chunk-size 1000 --chunk-overlap 50

Run the Chatbot

To interact with the conversation-aware chatbot through a GUI, run:

streamlit run chatbot/chatbot_app.py -- --model llama-3.1 --max-new-tokens 1024

conversation-aware-chatbot.gif

Run the RAG Chatbot

To interact with the RAG chatbot through a GUI, run:

streamlit run chatbot/rag_chatbot_app.py -- --model llama-3.1 --k 2 --synthesis-strategy async-tree-summarization

rag_chatbot_example.gif

How to debug the Streamlit app in PyCharm

debug_streamlit.png

References