From 41d9bb4111595e93d6d069e27f6629be0794eafb Mon Sep 17 00:00:00 2001 From: Umberto Griffo <1609440+umbertogriffo@users.noreply.github.com> Date: Sat, 25 May 2024 14:06:32 +0100 Subject: [PATCH] feat: add support to Starling, Llama3 and Phi-3 models --- README.md | 200 ++++++++++-------- chatbot/bot/model/model_settings.py | 21 +- chatbot/bot/model/settings/__init__.py | 0 chatbot/bot/model/{ => settings}/dolphin.py | 0 chatbot/bot/model/settings/llama_3.py | 68 ++++++ chatbot/bot/model/{ => settings}/mistral.py | 0 .../bot/model/{ => settings}/neural_beagle.py | 0 chatbot/bot/model/{ => settings}/openchat.py | 0 chatbot/bot/model/settings/phi_3.py | 65 ++++++ .../model/{ => settings}/stablelm_zephyr.py | 0 chatbot/bot/model/settings/starling.py | 65 ++++++ chatbot/bot/model/{ => settings}/zephyr.py | 0 todo.md | 10 +- version/llama_cpp | 2 +- 14 files changed, 334 insertions(+), 97 deletions(-) create mode 100644 chatbot/bot/model/settings/__init__.py rename chatbot/bot/model/{ => settings}/dolphin.py (100%) create mode 100644 chatbot/bot/model/settings/llama_3.py rename chatbot/bot/model/{ => settings}/mistral.py (100%) rename chatbot/bot/model/{ => settings}/neural_beagle.py (100%) rename chatbot/bot/model/{ => settings}/openchat.py (100%) create mode 100644 chatbot/bot/model/settings/phi_3.py rename chatbot/bot/model/{ => settings}/stablelm_zephyr.py (100%) create mode 100644 chatbot/bot/model/settings/starling.py rename chatbot/bot/model/{ => settings}/zephyr.py (100%) diff --git a/README.md b/README.md index 466ba75..23b1adf 100644 --- a/README.md +++ b/README.md @@ -1,16 +1,19 @@ # RAG (Retrieval-augmented generation) ChatBot + [![CI](https://github.com/umbertogriffo/rag-chatbot/workflows/CI/badge.svg)](https://github.com/umbertogriffo/rag-chatbot/actions/workflows/ci.yaml) [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit)](https://github.com/pre-commit/pre-commit) [![Code style: Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff) > [!IMPORTANT] > Disclaimer: -> The code has been tested on -> * `Ubuntu 22.04.2 LTS` running on a Lenovo Legion 5 Pro with twenty `12th Gen Intel® Core™ i7-12700H` and an `NVIDIA GeForce RTX 3060`. +> The code has been tested on: +> * `Ubuntu 22.04.2 LTS` running on a Lenovo Legion 5 Pro with twenty `12th Gen Intel® Core™ i7-12700H` and + an `NVIDIA GeForce RTX 3060`. > * `MacOS Sonoma 14.3.1` running on a MacBook Pro M1 (2020). > > If you are using another Operating System or different hardware, and you can't load the models, please -> take a look either at the official Llama Cpp Python's GitHub [issue](https://github.com/abetlen/llama-cpp-python/issues). +> take a look either at the official Llama Cpp Python's +> GitHub [issue](https://github.com/abetlen/llama-cpp-python/issues). 
> or at the official CTransformers's GitHub [issue](https://github.com/marella/ctransformers/issues) > [!WARNING] @@ -20,11 +23,11 @@ - [Introduction](#introduction) - [Prerequisites](#prerequisites) - - [Install Poetry](#install-poetry) + - [Install Poetry](#install-poetry) - [Bootstrap Environment](#bootstrap-environment) - - [How to use the make file](#how-to-use-the-make-file) + - [How to use the make file](#how-to-use-the-make-file) - [Using the Open-Source Models Locally](#using-the-open-source-models-locally) - - [Supported Models](#supported-models) + - [Supported Models](#supported-models) - [Example Data](#example-data) - [Build the memory index](#build-the-memory-index) - [Run the Chatbot](#run-the-chatbot) @@ -34,13 +37,17 @@ ## Introduction -This project combines the power of [CTransformers](https://github.com/marella/ctransformers), [Lama.cpp](https://github.com/abetlen/llama-cpp-python), -[LangChain](https://python.langchain.com/docs/get_started/introduction.html) (only used for document chunking and querying the Vector Database, and we plan to eliminate it entirely), +This project combines the power +of [Llama.cpp](https://github.com/abetlen/llama-cpp-python), [CTransformers](https://github.com/marella/ctransformers), +[LangChain](https://python.langchain.com/docs/get_started/introduction.html) (only used for document chunking and +querying the Vector Database, and we plan to eliminate it entirely), [Chroma](https://github.com/chroma-core/chroma) and [Streamlit](https://discuss.streamlit.io/) to build: + * a Conversation-aware Chatbot (ChatGPT-like experience). * a RAG (Retrieval-augmented generation) ChatBot. -The RAG Chatbot works by taking a collection of Markdown files as input and, when asked a question, provides the corresponding answer +The RAG Chatbot works by taking a collection of Markdown files as input and, when asked a question, provides the +corresponding answer based on the context provided by those files. ![rag-chatbot-architecture-1.png](images/rag-chatbot-architecture-1.png) @@ -48,21 +55,24 @@ based on the context provided by those files. The `Memory Builder` component of the project loads Markdown pages from the `docs` folder. It then divides these pages into smaller sections, calculates the embeddings (a numerical representation) of these sections with the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) -`sentence-transformer`, and saves them in an embedding database called [Chroma](https://github.com/chroma-core/chroma) for later use. +`sentence-transformer`, and saves them in an embedding database called [Chroma](https://github.com/chroma-core/chroma) +for later use. When a user asks a question, the RAG ChatBot retrieves the most relevant sections from the Embedding database. -Since the original question can't be always optimal to retrieve for the LLM, we first prompt an LLM to rewrite the question, +Since the original question isn't always an optimal retrieval query for the LLM, we first prompt an LLM to rewrite the +question, then conduct retrieval-augmented reading. The most relevant sections are then used as context to generate the final answer using a local language model (LLM). Additionally, the chatbot is designed to remember previous interactions. It saves the chat history and considers the relevant context from previous conversations to provide more accurate answers.
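As an illustration of the `Memory Builder` flow described above, the following is a minimal sketch only (not the project's actual code): the fixed-size chunking, storage path and collection name are assumptions, and it talks to Chroma and the `all-MiniLM-L6-v2` sentence-transformer directly.

```python
# Minimal sketch of the Memory Builder flow (illustrative only): chunk the Markdown
# files under docs/, embed the chunks with all-MiniLM-L6-v2 and persist them in Chroma.
from pathlib import Path

import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="vector_store")  # assumed storage location
embedder = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
collection = client.get_or_create_collection("docs", embedding_function=embedder)

for doc_path in Path("docs").glob("**/*.md"):
    text = doc_path.read_text(encoding="utf-8")
    # Naive fixed-size chunking stands in for the project's Markdown-aware splitter.
    chunks = [text[i : i + 1000] for i in range(0, len(text), 1000)]
    if not chunks:
        continue
    collection.add(
        documents=chunks,
        ids=[f"{doc_path.stem}-{i}" for i in range(len(chunks))],
        metadatas=[{"source": str(doc_path)} for _ in chunks],
    )

# At question time, the most relevant sections are retrieved and passed to the LLM as context.
hits = collection.query(query_texts=["How does onboarding work?"], n_results=2)
print(hits["documents"])
```

In the project itself this step is performed by `chat/memory_builder.py`, which uses a Markdown-aware splitter with a configurable `--chunk-size`.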
To deal with context overflows, we implemented two approaches: -* `Create And Refine the Context`: synthesize a responses sequentially through all retrieved contents. - * ![create-and-refine-the-context.png](images/create-and-refine-the-context.png) -* `Hierarchical Summarization of Context`: generate an answer for each relevant section independently, and then hierarchically combine the answers. - * ![hierarchical-summarization.png](images/hierarchical-summarization.png) +* `Create And Refine the Context`: synthesize a response sequentially through all retrieved contents. + * ![create-and-refine-the-context.png](images/create-and-refine-the-context.png) +* `Hierarchical Summarization of Context`: generate an answer for each relevant section independently, and then + hierarchically combine the answers. + * ![hierarchical-summarization.png](images/hierarchical-summarization.png) ## Prerequisites @@ -72,11 +82,14 @@ To deal with context overflows, we implemented two approaches: ### Install Poetry -Install Poetry with the official installer by following this [link](https://python-poetry.org/docs/#installing-with-the-official-installer). +Install Poetry with the official installer by following +this [link](https://python-poetry.org/docs/#installing-with-the-official-installer). -You must use the current adopted version of Poetry defined [here](https://github.com/umbertogriffo/rag-chatbot/blob/main/version/poetry). +You must use the currently adopted version of Poetry +defined [here](https://github.com/umbertogriffo/rag-chatbot/blob/main/version/poetry). If you already have Poetry installed and it is not the right version, you can downgrade (or upgrade) it through: + ``` poetry self update ``` @@ -91,48 +104,57 @@ To easily install the dependencies we created a make file. > Run `Setup` as your init command (or after `Clean`). * Check: ```make check``` - * Use it to check that `which pip3` and `which python3` points to the right path. + * Use it to check that `which pip3` and `which python3` point to the right path. * Setup: - * Setup with NVIDIA CUDA acceleration: ```make setup_cuda``` - * Creates an environment and installs all dependencies with NVIDIA CUDA acceleration. - * Setup with Metal GPU acceleration: ```make setup_metal``` - * Creates an environment and installs all dependencies with Metal GPU acceleration for macOS system only. + * Setup with NVIDIA CUDA acceleration: ```make setup_cuda``` + * Creates an environment and installs all dependencies with NVIDIA CUDA acceleration. + * Setup with Metal GPU acceleration: ```make setup_metal``` + * Creates an environment and installs all dependencies with Metal GPU acceleration for macOS systems only. * Update: ```make update``` - * Update an environment and installs all updated dependencies. + * Updates the environment and installs all updated dependencies. * Tidy up the code: ```make tidy``` - * Run Ruff check and format. + * Runs Ruff check and format. * Clean: ```make clean``` - * Removes the environment and all cached files. + * Removes the environment and all cached files. * Test: ```make test``` - * Runs all tests. - * Using [pytest](https://pypi.org/project/pytest/) - + * Runs all tests.
+ * Using [pytest](https://pypi.org/project/pytest/) ## Using the Open-Source Models Locally -We utilize two open-source libraries, [CTransformers](https://github.com/marella/ctransformers) and [Lama.cpp](https://github.com/abetlen/llama-cpp-python), +We utilize two open-source libraries, [Llama.cpp](https://github.com/abetlen/llama-cpp-python) +and [CTransformers](https://github.com/marella/ctransformers), which allow us to work efficiently with transformer-based models. Running full-precision LLMs on a local PC would be impractical due to their large number of parameters (~7 billion). These libraries enable us to run them on either a `CPU` or a `GPU`. Additionally, we use quantization and 4-bit precision to reduce the number of bits required to represent the numbers. -The quantized models are stored in [GGML/GGUF](https://medium.com/@phillipgimmi/what-is-gguf-and-ggml-e364834d241c) format. +The quantized models are stored in [GGML/GGUF](https://medium.com/@phillipgimmi/what-is-gguf-and-ggml-e364834d241c) +format. ### Supported Models -* [(Recommended) OpenChat 3.5 7B - GGUF](https://huggingface.co/TheBloke/openchat_3.5-GGUF) -* [NeuralBeagle14 7B - GGUF](https://huggingface.co/TheBloke/NeuralBeagle14-7B-GGUF) -* [Dolphin 2.6 Mistral 7B DPO Laser - GGUF](https://huggingface.co/TheBloke/dolphin-2.6-mistral-7B-dpo-laser-GGUF) -* [Zephyr 7B Beta - GGUF](https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF) -* [Mistral 7B OpenOrca - GGUF](https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GGUF) -* [StableLM Zephyr 3B - GGUF](https://huggingface.co/TheBloke/stablelm-zephyr-3b-GGUF) + +| 🤖 Model | Supported | Model Size | Notes and link to the model | +|------------------------------------------------|-----------|------------|------------------------------| +| `llama-3` Meta Llama 3 Instruct | ✅ | 8B | Less accurate than OpenChat - [link](https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF) | +| `openchat` **Recommended** - OpenChat 3.5 0106 | ✅ | 7B | [link](https://huggingface.co/TheBloke/openchat-3.5-0106-GGUF) | +| `starling` Starling Beta | ✅ | 7B | Trained from `Openchat-3.5-0106`. Recommended if you prefer more verbosity than OpenChat - [link](https://huggingface.co/bartowski/Starling-LM-7B-beta-GGUF) | +| `neural-beagle` NeuralBeagle14 | ✅ | 7B | [link](https://huggingface.co/TheBloke/NeuralBeagle14-7B-GGUF) | +| `dolphin` Dolphin 2.6 Mistral DPO Laser | ✅ | 7B | [link](https://huggingface.co/TheBloke/dolphin-2.6-mistral-7B-dpo-laser-GGUF) | +| `zephyr` Zephyr Beta | ✅ | 7B | [link](https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF) | +| `mistral` Mistral OpenOrca | ✅ | 7B | [link](https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GGUF) | +| `phi-3` Phi-3 Mini 4K Instruct | ✅ | 3.8B | [link](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf) | +| `stablelm-zephyr` StableLM Zephyr | ✅ | 3B | [link](https://huggingface.co/TheBloke/stablelm-zephyr-3b-GGUF) | ## Example Data -You could download some Markdown pages from the [Blendle Employee Handbook](https://blendle.notion.site/Blendle-s-Employee-Handbook-7692ffe24f07450785f093b94bbe1a09) +You could download some Markdown pages from +the [Blendle Employee Handbook](https://blendle.notion.site/Blendle-s-Employee-Handbook-7692ffe24f07450785f093b94bbe1a09) and put them under `docs`.
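The names in the first column of the supported-models table above are the values accepted by the `--model` flag used in the commands below; they are resolved to a settings class through the `SUPPORTED_MODELS` registry in `chatbot/bot/model/model_settings.py`. The following is a minimal sketch of that lookup (illustrative only, not necessarily the exact code path the chatbot uses):

```python
# Illustrative sketch: resolve a --model value to its settings class through the
# SUPPORTED_MODELS registry defined in chatbot/bot/model/model_settings.py.
from bot.model.model_settings import SUPPORTED_MODELS, ModelType

model_name = ModelType.LLAMA_3.value  # "llama-3", as passed via --model
settings = SUPPORTED_MODELS.get(model_name)
if settings is None:
    raise ValueError(f"Unknown model '{model_name}'. Supported models: {list(SUPPORTED_MODELS)}")

# Each settings class bundles the GGUF download URL and file name, the llama.cpp load
# parameters (n_ctx, n_threads, n_gpu_layers) and the model-specific prompt templates.
print(settings.url)
print(settings.file_name)
print(settings.config["n_gpu_layers"])
```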
## Build the memory index Run: + ```shell python chat/memory_builder.py --chunk-size 1000 ``` @@ -140,14 +162,17 @@ python chat/memory_builder.py --chunk-size 1000 ## Run the Chatbot To interact with a GUI type: + ```shell streamlit run chatbot/chatbot_app.py -- --model openchat ``` + ![conversation-aware-chatbot.gif](images/conversation-aware-chatbot.gif) ## Run the RAG Chatbot To interact with a GUI type: + ```shell streamlit run chatbot/rag_chatbot_app.py -- --model openchat --k 2 --synthesis-strategy async_tree_summarization ``` @@ -161,58 +186,63 @@ streamlit run chatbot/rag_chatbot_app.py -- --model openchat --k 2 --synthesis-s ## References * LLMs: - * [Calculating GPU memory for serving LLMs](https://www.substratus.ai/blog/calculating-gpu-memory-for-llm/) - * [Building Response Synthesis from Scratch](https://gpt-index.readthedocs.io/en/latest/examples/low_level/response_synthesis.html#) - * [Attention Sinks in LLMs for endless fluency](https://huggingface.co/blog/tomaarsen/attention-sinks) - * [GPT in 60 Lines of NumPy](https://jaykmody.com/blog/gpt-from-scratch/) - * [Introduction to Weight Quantization](https://towardsdatascience.com/introduction-to-weight-quantization-2494701b9c0c) + * [Calculating GPU memory for serving LLMs](https://www.substratus.ai/blog/calculating-gpu-memory-for-llm/) + * [Building Response Synthesis from Scratch](https://gpt-index.readthedocs.io/en/latest/examples/low_level/response_synthesis.html#) + * [Attention Sinks in LLMs for endless fluency](https://huggingface.co/blog/tomaarsen/attention-sinks) + * [GPT in 60 Lines of NumPy](https://jaykmody.com/blog/gpt-from-scratch/) + * [Introduction to Weight Quantization](https://towardsdatascience.com/introduction-to-weight-quantization-2494701b9c0c) * LLM integration and Modules: - * [LangChain](https://python.langchain.com/docs/get_started/introduction.html): - * [MarkdownTextSplitter](https://api.python.langchain.com/en/latest/_modules/langchain/text_splitter.html#MarkdownTextSplitter) - * [Chroma Integration](https://python.langchain.com/docs/modules/data_connection/vectorstores/integrations/chroma) - * [The Problem With LangChain](https://minimaxir.com/2023/07/langchain-problem/#:~:text=The%20problem%20with%20LangChain%20is,don't%20start%20with%20LangChain) + * [LangChain](https://python.langchain.com/docs/get_started/introduction.html): + * [MarkdownTextSplitter](https://api.python.langchain.com/en/latest/_modules/langchain/text_splitter.html#MarkdownTextSplitter) + * [Chroma Integration](https://python.langchain.com/docs/modules/data_connection/vectorstores/integrations/chroma) + * [The Problem With LangChain](https://minimaxir.com/2023/07/langchain-problem/#:~:text=The%20problem%20with%20LangChain%20is,don't%20start%20with%20LangChain) * Embeddings: - * [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) - * This is a `sentence-transformers` model: It maps sentences & paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search. + * [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) + * This is a `sentence-transformers` model: It maps sentences & paragraphs to a 384 dimensional dense vector + space and can be used for tasks like clustering or semantic search. 
* Vector Databases: - * [Chroma](https://www.trychroma.com/) - * [Food Discovery with Qdrant](https://qdrant.tech/articles/new-recommendation-api/#) - * Indexing algorithms: - * There are many algorithms for building indexes to optimize vector search. Most vector databases implement `Hierarchical Navigable Small World (HNSW)` and/or `Inverted File Index (IVF)`. Here are some great articles explaining them, and the trade-off between `speed`, `memory` and `quality`: - * [Nearest Neighbor Indexes for Similarity Search](https://www.pinecone.io/learn/series/faiss/vector-indexes/) - * [Hierarchical Navigable Small World (HNSW)](https://towardsdatascience.com/similarity-search-part-4-hierarchical-navigable-small-world-hnsw-2aad4fe87d37) - * [From NVIDIA - Accelerating Vector Search: Using GPU-Powered Indexes with RAPIDS RAFT](https://developer.nvidia.com/blog/accelerating-vector-search-using-gpu-powered-indexes-with-rapids-raft/) - * [From NVIDIA - Accelerating Vector Search: Fine-Tuning GPU Index Algorithms](https://developer.nvidia.com/blog/accelerating-vector-search-fine-tuning-gpu-index-algorithms/) - * > PS: Flat indexes (i.e. no optimisation) can be used to maintain 100% recall and precision, at the expense of speed. + * [Chroma](https://www.trychroma.com/) + * [Food Discovery with Qdrant](https://qdrant.tech/articles/new-recommendation-api/#) + * Indexing algorithms: + * There are many algorithms for building indexes to optimize vector search. Most vector databases + implement `Hierarchical Navigable Small World (HNSW)` and/or `Inverted File Index (IVF)`. Here are some great + articles explaining them, and the trade-off between `speed`, `memory` and `quality`: + * [Nearest Neighbor Indexes for Similarity Search](https://www.pinecone.io/learn/series/faiss/vector-indexes/) + * [Hierarchical Navigable Small World (HNSW)](https://towardsdatascience.com/similarity-search-part-4-hierarchical-navigable-small-world-hnsw-2aad4fe87d37) + * [From NVIDIA - Accelerating Vector Search: Using GPU-Powered Indexes with RAPIDS RAFT](https://developer.nvidia.com/blog/accelerating-vector-search-using-gpu-powered-indexes-with-rapids-raft/) + * [From NVIDIA - Accelerating Vector Search: Fine-Tuning GPU Index Algorithms](https://developer.nvidia.com/blog/accelerating-vector-search-fine-tuning-gpu-index-algorithms/) + * > PS: Flat indexes (i.e. no optimisation) can be used to maintain 100% recall and precision, at the + expense of speed. * Retrieval Augmented Generation (RAG): - * [Rewrite-Retrieve-Read](https://github.com/langchain-ai/langchain/blob/master/cookbook/rewrite.ipynb) - * > Because the original query can not be always optimal to retrieve for the LLM, especially in the real world, we first prompt an LLM to rewrite the queries, then conduct retrieval-augmented reading. - * [Rerank](https://txt.cohere.com/rag-chatbot/#implement-reranking) - * [Conversational awareness](https://langstream.ai/2023/10/13/rag-chatbot-with-conversation/) - * [Summarization: Improving RAG quality in LLM apps while minimizing vector storage costs](https://www.ninetack.io/post/improving-rag-quality-by-summarization) + * [Rewrite-Retrieve-Read](https://github.com/langchain-ai/langchain/blob/master/cookbook/rewrite.ipynb) + * > Because the original query can not be always optimal to retrieve for the LLM, especially in the real world, + we first prompt an LLM to rewrite the queries, then conduct retrieval-augmented reading. 
+ * [Rerank](https://txt.cohere.com/rag-chatbot/#implement-reranking) + * [Conversational awareness](https://langstream.ai/2023/10/13/rag-chatbot-with-conversation/) + * [Summarization: Improving RAG quality in LLM apps while minimizing vector storage costs](https://www.ninetack.io/post/improving-rag-quality-by-summarization) * Chatbot Development: - * [Streamlit](https://discuss.streamlit.io/): - * [Build a basic LLM chat app](https://docs.streamlit.io/knowledge-base/tutorials/build-conversational-apps#build-a-chatgpt-like-app) - * [Layouts and Containers](https://docs.streamlit.io/library/api-reference/layout) - * [st.chat_message](https://docs.streamlit.io/library/api-reference/chat/st.chat_message) - * [Add statefulness to apps](https://docs.streamlit.io/library/advanced-features/session-state) - * [Why session state is not persisting between refresh?](https://discuss.streamlit.io/t/why-session-state-is-not-persisting-between-refresh/32020) - * [st.cache_resource](https://docs.streamlit.io/library/api-reference/performance/st.cache_resource) - * [Handling External Command Line Arguments](https://github.com/streamlit/streamlit/issues/337) - * [(Investigate) FastServe - Serve Llama-cpp with FastAPI](https://github.com/aniketmaurya/fastserve) - * [(Investigate) Chat Templates to standardise the format](https://huggingface.co/blog/chat-templates) - * [(Investigate) Ollama](https://github.com/ollama/ollama) + * [Streamlit](https://discuss.streamlit.io/): + * [Build a basic LLM chat app](https://docs.streamlit.io/knowledge-base/tutorials/build-conversational-apps#build-a-chatgpt-like-app) + * [Layouts and Containers](https://docs.streamlit.io/library/api-reference/layout) + * [st.chat_message](https://docs.streamlit.io/library/api-reference/chat/st.chat_message) + * [Add statefulness to apps](https://docs.streamlit.io/library/advanced-features/session-state) + * [Why session state is not persisting between refresh?](https://discuss.streamlit.io/t/why-session-state-is-not-persisting-between-refresh/32020) + * [st.cache_resource](https://docs.streamlit.io/library/api-reference/performance/st.cache_resource) + * [Handling External Command Line Arguments](https://github.com/streamlit/streamlit/issues/337) + * [(Investigate) FastServe - Serve Llama-cpp with FastAPI](https://github.com/aniketmaurya/fastserve) + * [(Investigate) Chat Templates to standardise the format](https://huggingface.co/blog/chat-templates) + * [(Investigate) Ollama](https://github.com/ollama/ollama) * Text Processing and Cleaning: - * [clean-text](https://github.com/jfilter/clean-text/tree/main) + * [clean-text](https://github.com/jfilter/clean-text/tree/main) * Open Source Repositories: - * [CTransformers](https://github.com/marella/ctransformers) - * [GPT4All](https://github.com/nomic-ai/gpt4all) - * [llama.cpp](https://github.com/ggerganov/llama.cpp) - * [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) - * [pyllamacpp](https://github.com/abdeladim-s/pyllamacpp) - * [chroma](https://github.com/chroma-core/chroma) - * Inspirational repos: - * [lit-gpt](https://github.com/Lightning-AI/lit-gpt) - * [api-for-open-llm](https://github.com/xusenlinzy/api-for-open-llm) - * [PrivateDocBot](https://github.com/Abhi5h3k/PrivateDocBot) - * [Rag_bot - Adaptive Intelligence Chatbot](https://github.com/kylejtobin/rag_bot) + * [CTransformers](https://github.com/marella/ctransformers) + * [GPT4All](https://github.com/nomic-ai/gpt4all) + * [llama.cpp](https://github.com/ggerganov/llama.cpp) + * 
[llama-cpp-python](https://github.com/abetlen/llama-cpp-python) + * [pyllamacpp](https://github.com/abdeladim-s/pyllamacpp) + * [chroma](https://github.com/chroma-core/chroma) + * Inspirational repos: + * [lit-gpt](https://github.com/Lightning-AI/lit-gpt) + * [api-for-open-llm](https://github.com/xusenlinzy/api-for-open-llm) + * [PrivateDocBot](https://github.com/Abhi5h3k/PrivateDocBot) + * [Rag_bot - Adaptive Intelligence Chatbot](https://github.com/kylejtobin/rag_bot) diff --git a/chatbot/bot/model/model_settings.py b/chatbot/bot/model/model_settings.py index 991bd66..ce4a051 100644 --- a/chatbot/bot/model/model_settings.py +++ b/chatbot/bot/model/model_settings.py @@ -1,11 +1,14 @@ from enum import Enum -from bot.model.dolphin import DolphinSettings -from bot.model.mistral import MistralSettings -from bot.model.neural_beagle import NeuralBeagleSettings -from bot.model.openchat import OpenChatSettings -from bot.model.stablelm_zephyr import StableLMZephyrSettings -from bot.model.zephyr import ZephyrSettings +from bot.model.settings.dolphin import DolphinSettings +from bot.model.settings.llama_3 import LlamaThreeSettings +from bot.model.settings.mistral import MistralSettings +from bot.model.settings.neural_beagle import NeuralBeagleSettings +from bot.model.settings.openchat import OpenChatSettings +from bot.model.settings.phi_3 import PhiThreeSettings +from bot.model.settings.stablelm_zephyr import StableLMZephyrSettings +from bot.model.settings.starling import StarlingSettings +from bot.model.settings.zephyr import ZephyrSettings class ModelType(Enum): @@ -14,7 +17,10 @@ class ModelType(Enum): DOLPHIN = "dolphin" STABLELM_ZEPHYR = "stablelm-zephyr" OPENCHAT = "openchat" + STARLING = "starling" NEURAL_BEAGLE = "neural-beagle" + PHI_3 = "phi-3" + LLAMA_3 = "llama-3" SUPPORTED_MODELS = { @@ -23,7 +29,10 @@ class ModelType(Enum): ModelType.DOLPHIN.value: DolphinSettings, ModelType.STABLELM_ZEPHYR.value: StableLMZephyrSettings, ModelType.OPENCHAT.value: OpenChatSettings, + ModelType.STARLING.value: StarlingSettings, ModelType.NEURAL_BEAGLE.value: NeuralBeagleSettings, + ModelType.PHI_3.value: PhiThreeSettings, + ModelType.LLAMA_3.value: LlamaThreeSettings, } diff --git a/chatbot/bot/model/settings/__init__.py b/chatbot/bot/model/settings/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/chatbot/bot/model/dolphin.py b/chatbot/bot/model/settings/dolphin.py similarity index 100% rename from chatbot/bot/model/dolphin.py rename to chatbot/bot/model/settings/dolphin.py diff --git a/chatbot/bot/model/settings/llama_3.py b/chatbot/bot/model/settings/llama_3.py new file mode 100644 index 0000000..8cfc22f --- /dev/null +++ b/chatbot/bot/model/settings/llama_3.py @@ -0,0 +1,68 @@ +from bot.client.llm_client import LlmClientType +from bot.model.model import Model + + +class LlamaThreeSettings(Model): + url = "https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf" + file_name = "Meta-Llama-3-8B-Instruct-Q4_K_M.gguf" + clients = [LlmClientType.LAMA_CPP] + config = { + "n_ctx": 4096, # The max sequence length to use - note that longer sequence lengths require much more resources + "n_threads": 8, # The number of CPU threads to use, tailor to your system and the resulting performance + "n_gpu_layers": 50, # The number of layers to offload to GPU, if you have GPU acceleration available + } + config_answer = {"temperature": 0.7, "stop": []} + system_template = ( + "<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are a 
helpful, respectful and " + "honest assistant. <|eot_id|><|start_header_id|>user<|end_header_id|>" + ) + qa_prompt_template = """{system}\n +Answer the question below: +{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|> +""" + ctx_prompt_template = """{system}\n +Context information is below. +--------------------- +{context} +--------------------- +Given the context information and not prior knowledge, answer the question below: +{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|> +""" + refined_ctx_prompt_template = """{system}\n +{question} +We have provided an existing answer: {existing_answer} +We have the opportunity to refine the existing answer +(only if needed) with some more context below. +--------------------- +{context} +--------------------- +Given the new context, refine the original answer to better answer the query. +If the context isn't useful, return the original answer. +Refined Answer:<|eot_id|><|start_header_id|>assistant<|end_header_id|> +""" + refined_question_conversation_awareness_prompt_template = """{system}\n +Chat History: +--------------------- +{chat_history} +--------------------- +Follow Up Question: {question} +Given the above conversation and a follow up question, rephrase the follow up question to be a standalone question. +Standalone question:<|eot_id|><|start_header_id|>assistant<|end_header_id|> +""" + + refined_answer_conversation_awareness_prompt_template = """ +You are engaging in a conversation with a human participant who is unaware that they might be +interacting with a machine. \n +Your goal is to respond in a way that convincingly simulates human-like intelligence and behavior. \n +The conversation should be natural, coherent, and contextually relevant. \n +Chat History: +--------------------- +{chat_history} +--------------------- +Follow Up Question: {question}\n +Given the context provided in the Chat History and the follow up question, please answer the follow up question above. +If the follow up question isn't correlated to the context provided in the Chat History, please just answer the follow up +question, ignoring the context provided in the Chat History. +Please also don't reformulate the follow up question, and write just a concise answer. 
+<|eot_id|><|start_header_id|>assistant<|end_header_id|> +""" diff --git a/chatbot/bot/model/mistral.py b/chatbot/bot/model/settings/mistral.py similarity index 100% rename from chatbot/bot/model/mistral.py rename to chatbot/bot/model/settings/mistral.py diff --git a/chatbot/bot/model/neural_beagle.py b/chatbot/bot/model/settings/neural_beagle.py similarity index 100% rename from chatbot/bot/model/neural_beagle.py rename to chatbot/bot/model/settings/neural_beagle.py diff --git a/chatbot/bot/model/openchat.py b/chatbot/bot/model/settings/openchat.py similarity index 100% rename from chatbot/bot/model/openchat.py rename to chatbot/bot/model/settings/openchat.py diff --git a/chatbot/bot/model/settings/phi_3.py b/chatbot/bot/model/settings/phi_3.py new file mode 100644 index 0000000..e94d3ee --- /dev/null +++ b/chatbot/bot/model/settings/phi_3.py @@ -0,0 +1,65 @@ +from bot.client.llm_client import LlmClientType +from bot.model.model import Model + + +class PhiThreeSettings(Model): + url = "https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf" + file_name = "Phi-3-mini-4k-instruct-q4.gguf" + clients = [LlmClientType.LAMA_CPP] + config = { + "n_ctx": 4096, # The max sequence length to use - note that longer sequence lengths require much more resources + "n_threads": 8, # The number of CPU threads to use, tailor to your system and the resulting performance + "n_gpu_layers": 50, # The number of layers to offload to GPU, if you have GPU acceleration available + } + config_answer = {"temperature": 0.7, "stop": []} + system_template = "You are a helpful, respectful and honest assistant. " + qa_prompt_template = """{system}\n +<|user|>\n Answer the question below: +{question}<|end|>\n<|assistant|> +""" + ctx_prompt_template = """{system}\n +<|user|>\n Context information is below. +--------------------- +{context} +--------------------- +Given the context information and not prior knowledge, answer the question below: +{question}<|end|>\n<|assistant|> +""" + refined_ctx_prompt_template = """{system}\n +<|user|>\n {question} +We have provided an existing answer: {existing_answer} +We have the opportunity to refine the existing answer +(only if needed) with some more context below. +--------------------- +{context} +--------------------- +Given the new context, refine the original answer to better answer the query. +If the context isn't useful, return the original answer. +Refined Answer:<|end|>\n<|assistant|> +""" + refined_question_conversation_awareness_prompt_template = """{system}\n +<|user|>\n Chat History: +--------------------- +{chat_history} +--------------------- +Follow Up Question: {question} +Given the above conversation and a follow up question, rephrase the follow up question to be a standalone question. +Standalone question:<|end|>\n<|assistant|> +""" + + refined_answer_conversation_awareness_prompt_template = """ +<|user|>\n You are engaging in a conversation with a human participant who is unaware that they might be +interacting with a machine. \n +Your goal is to respond in a way that convincingly simulates human-like intelligence and behavior. \n +The conversation should be natural, coherent, and contextually relevant. \n +Chat History: +--------------------- +{chat_history} +--------------------- +Follow Up Question: {question}\n +Given the context provided in the Chat History and the follow up question, please answer the follow up question above. 
+If the follow up question isn't correlated to the context provided in the Chat History, please just answer the follow up +question, ignoring the context provided in the Chat History. +Please also don't reformulate the follow up question, and write just a concise answer. +<|end|>\n<|assistant|> +""" diff --git a/chatbot/bot/model/stablelm_zephyr.py b/chatbot/bot/model/settings/stablelm_zephyr.py similarity index 100% rename from chatbot/bot/model/stablelm_zephyr.py rename to chatbot/bot/model/settings/stablelm_zephyr.py diff --git a/chatbot/bot/model/settings/starling.py b/chatbot/bot/model/settings/starling.py new file mode 100644 index 0000000..05584a7 --- /dev/null +++ b/chatbot/bot/model/settings/starling.py @@ -0,0 +1,65 @@ +from bot.client.llm_client import LlmClientType +from bot.model.model import Model + + +class StarlingSettings(Model): + url = "https://huggingface.co/bartowski/Starling-LM-7B-beta-GGUF/resolve/main/Starling-LM-7B-beta-Q4_K_M.gguf" + file_name = "Starling-LM-7B-beta-Q4_K_M.gguf" + clients = [LlmClientType.LAMA_CPP] + config = { + "n_ctx": 4096, # The max sequence length to use - note that longer sequence lengths require much more resources + "n_threads": 8, # The number of CPU threads to use, tailor to your system and the resulting performance + "n_gpu_layers": 50, # The number of layers to offload to GPU, if you have GPU acceleration available + } + config_answer = {"temperature": 0.7, "stop": []} + system_template = "You are a helpful, respectful and honest assistant. " + qa_prompt_template = """{system}\n +GPT4 Correct User: Answer the question below: +{question}<|end_of_turn|>GPT4 Correct Assistant: +""" + ctx_prompt_template = """{system}\n +GPT4 Correct User: Context information is below. +--------------------- +{context} +--------------------- +Given the context information and not prior knowledge, answer the question below: +{question}<|end_of_turn|>GPT4 Correct Assistant: +""" + refined_ctx_prompt_template = """{system}\n +GPT4 Correct User: The original query is as follows: {question} +We have provided an existing answer: {existing_answer} +We have the opportunity to refine the existing answer +(only if needed) with some more context below. +--------------------- +{context} +--------------------- +Given the new context, refine the original answer to better answer the query. +If the context isn't useful, return the original answer. +Refined Answer:<|end_of_turn|>GPT4 Correct Assistant: +""" + refined_question_conversation_awareness_prompt_template = """{system}\n +GPT4 Correct User: Chat History: +--------------------- +{chat_history} +--------------------- +Follow Up Question: {question} +Given the above conversation and a follow up question, rephrase the follow up question to be a standalone question. +Standalone question:<|end_of_turn|>GPT4 Correct Assistant: +""" + + refined_answer_conversation_awareness_prompt_template = """ +GPT4 Correct User: You are engaging in a conversation with a human participant who is unaware that they might be +interacting with a machine. \n +Your goal is to respond in a way that convincingly simulates human-like intelligence and behavior. \n +The conversation should be natural, coherent, and contextually relevant. \n +Chat History: +--------------------- +{chat_history} +--------------------- +Follow Up Question: {question}\n +Given the context provided in the Chat History and the follow up question, please answer the follow up question above. 
+If the follow up question isn't correlated to the context provided in the Chat History, please just answer the follow up +question, ignoring the context provided in the Chat History. +Please also don't reformulate the follow up question, and write just a concise answer. +<|end_of_turn|>GPT4 Correct Assistant: +""" diff --git a/chatbot/bot/model/zephyr.py b/chatbot/bot/model/settings/zephyr.py similarity index 100% rename from chatbot/bot/model/zephyr.py rename to chatbot/bot/model/settings/zephyr.py diff --git a/todo.md b/todo.md index de56533..e3d9d03 100644 --- a/todo.md +++ b/todo.md @@ -1,6 +1,6 @@ # Todo -- [ ] `llama-cpp-python` version `0.2.29` has a serious issue https://github.com/abetlen/llama-cpp-python/issues/1089 - Introduce unit tests to update to newer `llama-cpp-python` versions confidently. -- [ ] try https://huggingface.co/TheBloke/Starling-LM-7B-alpha-GGUF (also the beta version). -- [ ] try https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf -- [ ] try Chat Templates https://medium.com/@ahmet_celebi/demystifying-chat-templates-of-llm-using-llama-cpp-and-ctransformers-f17871569cd6 -- [ ] make docker container +- Test `openchat-3.6-8b-20240522`: + - https://huggingface.co/openchat/openchat-3.6-8b-20240522 + - https://huggingface.co/bartowski/openchat-3.6-8b-20240522-GGUF +- Try Chat Templates https://medium.com/@ahmet_celebi/demystifying-chat-templates-of-llm-using-llama-cpp-and-ctransformers-f17871569cd6 +- Make docker container diff --git a/version/llama_cpp b/version/llama_cpp index 8bc53d5..1e8d670 100644 --- a/version/llama_cpp +++ b/version/llama_cpp @@ -1 +1 @@ -0.2.28 +0.2.76