This is the repository for the HamiltonBot app. It is a `streamlit` app that utilizes `langchain` and `chromadb` to create a chat-with-PDF/DOCX app via OpenAI's GPT series of models. This is a fairly standard RAG QnA app. If you want to see just the RAG code, see `RAG-implementation.ipynb`.
You will need to install Python (>= 3.11) and `virtualenv`.
```sh
git clone https://github.com/DarkHawk727/HamiltonBot
cd HamiltonBot
python -m virtualenv venv
.\venv\Scripts\activate
pip install -r requirements.txt
streamlit run app.py
```
Credits: Greg Kamradt from fullstackretrieval
The notebook is intended for my future self, for employers who just want to see the RAG code, and for any future interns who want just the core functionality to improve upon. I opted to use the following packages for this project:
- `langchain`: I like the simplicity and elegance its abstractions provide. Our application is also not super niche, so it saves reinventing the wheel in a lot of cases.
- `langchain_openai`: As of writing this (January 2024), OpenAI has the current best models. There is also a partnership between Microsoft and OpenAI; this is important because we currently have the Microsoft suite and must use models from them.
- `chromadb`: Chroma lets me have a local vectorstore for storing the embeddings, which simplifies a lot of the security concerns and drives down the cost.
- `unstructured`: Given that the RFPs are fairly complex documents with tables and images, I need a way to parse them into HTML and base64 formats to feed into the LLMs. (Check issues of tenancy: they say that they don't store data; if that's not allowed, try the hosted SaaS API.)
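To show how these packages fit together, here is a minimal ingestion sketch, assuming an `UnstructuredFileLoader`, OpenAI embeddings, and a local Chroma store; the file path and persist directory are hypothetical, and the real wiring lives in `app.py`.

```python
# Minimal ingestion sketch: parse a document with unstructured, embed it,
# and persist the embeddings in a local Chroma vectorstore.
from langchain_community.document_loaders import UnstructuredFileLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Hypothetical input file; RFPs would be loaded the same way.
docs = UnstructuredFileLoader("example_rfp.docx").load()

vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(),
    persist_directory="./chroma_db",  # assumed location for the local store
)
```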
As you may know, the first step of any RAG pipeline is to transform the query; this improves the quality and relevance of the documents retrieved. There are a couple of techniques for this:
- Rewrite-Retrieve-Read: Tell an LLM to improve the query by rephrasing it.
- Multi-Query: Have an LLM generate 2-3 queries that ask the same thing in different ways.
- Step-back Prompting: Have the LLM ask some "more basic" questions, such as what principles underlie the original question.
- RAG-Fusion: Generate multiple queries as in Multi-Query, retrieve documents for each, then merge the ranked results with Reciprocal Rank Fusion.
I have currently selected Multi-Query as the query transformation; a sketch of the setup follows the note below. I believe this balances cost with performance: Rewrite-Retrieve-Read would be too simple, while Step-back Prompting would be too expensive and slow.
ℹ️ The questions that are used to query the vectorstore are not accessible as regular strings; they are only emitted as logs, so some code will be required should you want to capture them as strings.
Article on Query Transformations on the LangChain Blog.
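Here is a minimal sketch of the Multi-Query setup, including the logging hook needed to see the generated queries mentioned in the note above. The vectorstore and the sample question are assumptions carried over from the ingestion sketch.

```python
# Sketch of Multi-Query retrieval (assumes the `vectorstore` built earlier).
import logging

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

# The generated queries are only surfaced through this logger (see note above).
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=ChatOpenAI(temperature=0),
)

# Hypothetical RFP question; the generated variants appear in the log output.
docs = retriever.get_relevant_documents("What are the bonding requirements?")
```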
Since the eventual use case for this system will be QnA over large (200+ page) documents, it's important to split them into more manageable chunks. I would like to experiment (i.e., use a different prebuilt function) with Semantic Splitting. Semantic Splitting works by moving through the document text 3 consecutive sentences at a time; if the embeddings of two adjacent groups of sentences are similar, it merges both groups into a single chunk. This way it groups sentences with similar semantic content.
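A minimal sketch of that experiment using LangChain's experimental `SemanticChunker`, assuming OpenAI embeddings and the `docs` list from the ingestion sketch; thresholds and buffer sizes are left at their defaults.

```python
# Sketch of Semantic Splitting (SemanticChunker lives in langchain_experimental).
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(OpenAIEmbeddings())

# `docs` is the list of parsed documents from the ingestion step above.
chunks = splitter.split_documents(docs)
```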
When getting documents from the vectorstore that relate to a certain query, there are a couple of options for how to select them.
- The naive approach is to return the $k$ most similar document embeddings to the query. This is fine, but for more complex documents (like RFPs), it can be helpful to maximize the diversity of the documents.
- Maximal Marginal Relevance (MMR): This works by finding the embeddings with the greatest cosine similarity to the query while penalizing them for similarity to already-selected documents (see the formula below).
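Concretely, given query $q$, candidate documents $D$, and the already-selected set $S$, MMR picks the next document via (here $\lambda$ trades off relevance against diversity; LangChain exposes it as `lambda_mult`):

$$\text{MMR} = \arg\max_{d_i \in D \setminus S} \left[ \lambda \cdot \text{sim}(d_i, q) - (1 - \lambda) \max_{d_j \in S} \text{sim}(d_i, d_j) \right]$$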
Given the nature of RFPs, I will be choosing MMR (it's just an option you can select in the `.as_retriever()` method, as sketched below), as it performs better on more complex queries.
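A minimal sketch of turning on MMR; the `k`, `fetch_k`, and `lambda_mult` values are illustrative, not tuned settings.

```python
# Sketch: switch the retriever to MMR (assumes the `vectorstore` from above).
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 5,              # number of documents to return
        "fetch_k": 20,       # candidate pool fetched before MMR re-ranking
        "lambda_mult": 0.5,  # 1 = pure similarity, 0 = maximum diversity
    },
)
```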