diff --git a/README.md b/README.md
index 0f4704b..adabda6 100644
--- a/README.md
+++ b/README.md
@@ -141,7 +141,8 @@ format.
 | 🤖 Model | Supported | Model Size | Max Context Window | Notes and link to the model card |
 |--------------------------------------------|-----------|------------|--------------------|--------------------------------------------------------------------------------------------------------------------|
-| `llama-3.2` Meta Llama 3.2 Instruct | ✅ | 3B | 128k | **Recommended model** optimized to run locally on a mobile or edge device - [Card](https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF) |
+| `llama-3.2:1b` Meta Llama 3.2 Instruct | ✅ | 1B | 128k | Optimized to run locally on a mobile or edge device - [Card](https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF) |
+| `llama-3.2` Meta Llama 3.2 Instruct | ✅ | 3B | 128k | Optimized to run locally on a mobile or edge device - [Card](https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF) |
 | `llama-3.1` Meta Llama 3.1 Instruct | ✅ | 8B | 128k | **Recommended model** [Card](https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF) |
 | `openchat-3.6` - OpenChat 3.6 | ✅ | 8B | 8192 | [Card](https://huggingface.co/bartowski/openchat-3.6-8b-20240522-GGUF) |
 | `openchat-3.5` - OpenChat 3.5 | ✅ | 7B | 8192 | [Card](https://huggingface.co/TheBloke/openchat-3.5-0106-GGUF) |
@@ -198,64 +199,64 @@ streamlit run chatbot/rag_chatbot_app.py -- --model llama-3.2 --k 2 --synthesis-
 ## References

 * Large Language Models (LLMs):
-  * [Calculating GPU memory for serving LLMs](https://www.substratus.ai/blog/calculating-gpu-memory-for-llm/)
-  * [GPT in 60 Lines of NumPy](https://jaykmody.com/blog/gpt-from-scratch/)
-  * [Introduction to Weight Quantization](https://towardsdatascience.com/introduction-to-weight-quantization-2494701b9c0c)
-  * [Uncensor any LLM with abliteration](https://huggingface.co/blog/mlabonne/abliteration)
+  * [Calculating GPU memory for serving LLMs](https://www.substratus.ai/blog/calculating-gpu-memory-for-llm/)
+  * [GPT in 60 Lines of NumPy](https://jaykmody.com/blog/gpt-from-scratch/)
+  * [Introduction to Weight Quantization](https://towardsdatascience.com/introduction-to-weight-quantization-2494701b9c0c)
+  * [Uncensor any LLM with abliteration](https://huggingface.co/blog/mlabonne/abliteration)
 * LLM Frameworks:
-  * llama.cpp:
-    * [llama.cpp](https://github.com/ggerganov/llama.cpp)
-    * [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
-  * Ollama:
-    * [Ollama](https://github.com/ollama/ollama/tree/main)
-    * [Ollama Python Library](https://github.com/ollama/ollama-python/tree/main)
-    * [On the architecture of ollama](https://blog.inoki.cc/2024/04/15/Ollama/)
-    * [Analysis of Ollama Architecture and Conversation Processing Flow for AI LLM Tool](https://medium.com/@rifewang/analysis-of-ollama-architecture-and-conversation-processing-flow-for-ai-llm-tool-ead4b9f40975)
-    * [How to Customize Ollama’s Storage Directory](https://medium.com/@chhaybunsy/unleash-your-machine-learning-models-how-to-customize-ollamas-storage-directory-c9ea1ea2961a#:~:text=By%20default%2C%20Ollama%20saves%20its,making%20predictions%20or%20further%20training)
+  * llama.cpp:
+    * [llama.cpp](https://github.com/ggerganov/llama.cpp)
+    * [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
+  * Ollama:
+    * [Ollama](https://github.com/ollama/ollama/tree/main)
+    * [Ollama Python Library](https://github.com/ollama/ollama-python/tree/main)
+    * [On the architecture of ollama](https://blog.inoki.cc/2024/04/15/Ollama/)
+    * [Analysis of Ollama Architecture and Conversation Processing Flow for AI LLM Tool](https://medium.com/@rifewang/analysis-of-ollama-architecture-and-conversation-processing-flow-for-ai-llm-tool-ead4b9f40975)
+    * [How to Customize Ollama’s Storage Directory](https://medium.com/@chhaybunsy/unleash-your-machine-learning-models-how-to-customize-ollamas-storage-directory-c9ea1ea2961a#:~:text=By%20default%2C%20Ollama%20saves%20its,making%20predictions%20or%20further%20training)
 * Agent Frameworks:
-  * [PydanticAI](https://ai.pydantic.dev/)
+  * [PydanticAI](https://ai.pydantic.dev/)
 * Embeddings:
-  * [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
-    * This is a `sentence-transformers` model: It maps sentences & paragraphs to a 384 dimensional dense vector
-      space and can be used for tasks like clustering or semantic search.
+  * [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
+    * This is a `sentence-transformers` model: It maps sentences & paragraphs to a 384 dimensional dense vector
+      space and can be used for tasks like clustering or semantic search.
 * Vector Databases:
-  * Indexing algorithms:
-    * There are many algorithms for building indexes to optimize vector search. Most vector databases
-      implement `Hierarchical Navigable Small World (HNSW)` and/or `Inverted File Index (IVF)`. Here are some great
-      articles explaining them, and the trade-off between `speed`, `memory` and `quality`:
-      * [Nearest Neighbor Indexes for Similarity Search](https://www.pinecone.io/learn/series/faiss/vector-indexes/)
-      * [Hierarchical Navigable Small World (HNSW)](https://towardsdatascience.com/similarity-search-part-4-hierarchical-navigable-small-world-hnsw-2aad4fe87d37)
-      * [From NVIDIA - Accelerating Vector Search: Using GPU-Powered Indexes with RAPIDS RAFT](https://developer.nvidia.com/blog/accelerating-vector-search-using-gpu-powered-indexes-with-rapids-raft/)
-      * [From NVIDIA - Accelerating Vector Search: Fine-Tuning GPU Index Algorithms](https://developer.nvidia.com/blog/accelerating-vector-search-fine-tuning-gpu-index-algorithms/)
-      * > PS: Flat indexes (i.e. no optimisation) can be used to maintain 100% recall and precision, at the
-        expense of speed.
-  * [Chroma](https://www.trychroma.com/)
-    * [chroma](https://github.com/chroma-core/chroma)
-  * [Food Discovery with Qdrant](https://qdrant.tech/articles/new-recommendation-api/#)
+  * Indexing algorithms:
+    * There are many algorithms for building indexes to optimize vector search. Most vector databases
+      implement `Hierarchical Navigable Small World (HNSW)` and/or `Inverted File Index (IVF)`. Here are some great
+      articles explaining them, and the trade-off between `speed`, `memory` and `quality`:
+      * [Nearest Neighbor Indexes for Similarity Search](https://www.pinecone.io/learn/series/faiss/vector-indexes/)
+      * [Hierarchical Navigable Small World (HNSW)](https://towardsdatascience.com/similarity-search-part-4-hierarchical-navigable-small-world-hnsw-2aad4fe87d37)
+      * [From NVIDIA - Accelerating Vector Search: Using GPU-Powered Indexes with RAPIDS RAFT](https://developer.nvidia.com/blog/accelerating-vector-search-using-gpu-powered-indexes-with-rapids-raft/)
+      * [From NVIDIA - Accelerating Vector Search: Fine-Tuning GPU Index Algorithms](https://developer.nvidia.com/blog/accelerating-vector-search-fine-tuning-gpu-index-algorithms/)
+      * > PS: Flat indexes (i.e. no optimisation) can be used to maintain 100% recall and precision, at the
+        expense of speed.
+  * [Chroma](https://www.trychroma.com/)
+    * [chroma](https://github.com/chroma-core/chroma)
+  * [Food Discovery with Qdrant](https://qdrant.tech/articles/new-recommendation-api/#)
 * Retrieval Augmented Generation (RAG):
-  * [Building A Generative AI Platform](https://huyenchip.com/2024/07/25/genai-platform.html)
-  * [Rewrite-Retrieve-Read](https://github.com/langchain-ai/langchain/blob/master/cookbook/rewrite.ipynb)
-    * > Because the original query can not be always optimal to retrieve for the LLM, especially in the real world,
-      we first prompt an LLM to rewrite the queries, then conduct retrieval-augmented reading.
-  * [Rerank](https://txt.cohere.com/rag-chatbot/#implement-reranking)
-  * [Building Response Synthesis from Scratch](https://gpt-index.readthedocs.io/en/latest/examples/low_level/response_synthesis.html#)
-  * [Conversational awareness](https://langstream.ai/2023/10/13/rag-chatbot-with-conversation/)
-  * [RAG is Dead, Again?](https://jina.ai/news/rag-is-dead-again/)
+  * [Building A Generative AI Platform](https://huyenchip.com/2024/07/25/genai-platform.html)
+  * [Rewrite-Retrieve-Read](https://github.com/langchain-ai/langchain/blob/master/cookbook/rewrite.ipynb)
+    * > Because the original query can not be always optimal to retrieve for the LLM, especially in the real world,
+      we first prompt an LLM to rewrite the queries, then conduct retrieval-augmented reading.
+  * [Rerank](https://txt.cohere.com/rag-chatbot/#implement-reranking)
+  * [Building Response Synthesis from Scratch](https://gpt-index.readthedocs.io/en/latest/examples/low_level/response_synthesis.html#)
+  * [Conversational awareness](https://langstream.ai/2023/10/13/rag-chatbot-with-conversation/)
+  * [RAG is Dead, Again?](https://jina.ai/news/rag-is-dead-again/)
 * Chatbot UI:
-  * [Streamlit](https://discuss.streamlit.io/):
-    * [Build a basic LLM chat app](https://docs.streamlit.io/knowledge-base/tutorials/build-conversational-apps#build-a-chatgpt-like-app)
-    * [Layouts and Containers](https://docs.streamlit.io/library/api-reference/layout)
-    * [st.chat_message](https://docs.streamlit.io/library/api-reference/chat/st.chat_message)
-    * [Add statefulness to apps](https://docs.streamlit.io/library/advanced-features/session-state)
-    * [Why session state is not persisting between refresh?](https://discuss.streamlit.io/t/why-session-state-is-not-persisting-between-refresh/32020)
-    * [st.cache_resource](https://docs.streamlit.io/library/api-reference/performance/st.cache_resource)
-    * [Handling External Command Line Arguments](https://github.com/streamlit/streamlit/issues/337)
-  * [Open WebUI](https://github.com/open-webui/open-webui)
-  * [Running AI Locally Using Ollama on Ubuntu Linux](https://itsfoss.com/ollama-setup-linux/)
+  * [Streamlit](https://discuss.streamlit.io/):
+    * [Build a basic LLM chat app](https://docs.streamlit.io/knowledge-base/tutorials/build-conversational-apps#build-a-chatgpt-like-app)
+    * [Layouts and Containers](https://docs.streamlit.io/library/api-reference/layout)
+    * [st.chat_message](https://docs.streamlit.io/library/api-reference/chat/st.chat_message)
+    * [Add statefulness to apps](https://docs.streamlit.io/library/advanced-features/session-state)
+    * [Why session state is not persisting between refresh?](https://discuss.streamlit.io/t/why-session-state-is-not-persisting-between-refresh/32020)
+    * [st.cache_resource](https://docs.streamlit.io/library/api-reference/performance/st.cache_resource)
+    * [Handling External Command Line Arguments](https://github.com/streamlit/streamlit/issues/337)
+  * [Open WebUI](https://github.com/open-webui/open-webui)
+  * [Running AI Locally Using Ollama on Ubuntu Linux](https://itsfoss.com/ollama-setup-linux/)
 * Text Processing and Cleaning:
-  * [clean-text](https://github.com/jfilter/clean-text/tree/main)
+  * [clean-text](https://github.com/jfilter/clean-text/tree/main)
 * Inspirational Open Source Repositories:
-  * [lit-gpt](https://github.com/Lightning-AI/lit-gpt)
-  * [api-for-open-llm](https://github.com/xusenlinzy/api-for-open-llm)
-  * [AnythingLLM](https://useanything.com/)
-  * [FastServe - Serve Llama-cpp with FastAPI](https://github.com/aniketmaurya/fastserve)
+  * [lit-gpt](https://github.com/Lightning-AI/lit-gpt)
+  * [api-for-open-llm](https://github.com/xusenlinzy/api-for-open-llm)
+  * [AnythingLLM](https://useanything.com/)
+  * [FastServe - Serve Llama-cpp with FastAPI](https://github.com/aniketmaurya/fastserve)
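Reviewer note on the Embeddings and Vector Databases references in the README hunk above: `all-MiniLM-L6-v2` maps text to 384-dimensional vectors, and a "flat" index is simply an exhaustive scan over them. A minimal sketch of flat (exact) semantic search, assuming `sentence-transformers` and `numpy` are installed; the documents and query are placeholder examples, not project data:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Encode a toy corpus into 384-dimensional unit-norm vectors.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
docs = [
    "Employees accrue days off every year.",
    "The office is closed on public holidays.",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)  # shape: (len(docs), 384)

# Flat search: compare the query against every document vector.
query_vec = model.encode(["How many days off do I get?"], normalize_embeddings=True)
scores = doc_vecs @ query_vec.T  # cosine similarity, since vectors are unit-norm
print(docs[int(np.argmax(scores))])
```

This is exact by construction, which is the 100% recall and precision the "PS" bullet refers to; the cost grows linearly with corpus size, which is the trade-off HNSW and IVF indexes are built to avoid.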
diff --git a/chatbot/bot/model/model_registry.py b/chatbot/bot/model/model_registry.py
index 16be89b..fb00eea 100644
--- a/chatbot/bot/model/model_registry.py
+++ b/chatbot/bot/model/model_registry.py
@@ -1,33 +1,32 @@
 from enum import Enum

-from bot.model.settings.llama import Llama31Settings, Llama32Settings
+from bot.model.settings.llama import Llama31Settings, Llama32OneSettings, Llama32ThreeSettings
 from bot.model.settings.openchat import OpenChat35Settings, OpenChat36Settings
 from bot.model.settings.phi import Phi35Settings
 from bot.model.settings.stablelm_zephyr import StableLMZephyrSettings
 from bot.model.settings.starling import StarlingSettings


-class ModelType(Enum):
-    ZEPHYR = "zephyr"
-    MISTRAL = "mistral"
-    DOLPHIN = "dolphin"
+class Model(Enum):
     STABLELM_ZEPHYR = "stablelm-zephyr"
     OPENCHAT_3_5 = "openchat-3.5"
     OPENCHAT_3_6 = "openchat-3.6"
     STARLING = "starling"
     PHI_3_5 = "phi-3.5"
     LLAMA_3_1 = "llama-3.1"
-    LLAMA_3_2 = "llama-3.2"
+    LLAMA_3_2_one = "llama-3.2:1b"
+    LLAMA_3_2_three = "llama-3.2"


 SUPPORTED_MODELS = {
-    ModelType.STABLELM_ZEPHYR.value: StableLMZephyrSettings,
-    ModelType.OPENCHAT_3_5.value: OpenChat35Settings,
-    ModelType.OPENCHAT_3_6.value: OpenChat36Settings,
-    ModelType.STARLING.value: StarlingSettings,
-    ModelType.PHI_3_5.value: Phi35Settings,
-    ModelType.LLAMA_3_1.value: Llama31Settings,
-    ModelType.LLAMA_3_2.value: Llama32Settings,
+    Model.STABLELM_ZEPHYR.value: StableLMZephyrSettings,
+    Model.OPENCHAT_3_5.value: OpenChat35Settings,
+    Model.OPENCHAT_3_6.value: OpenChat36Settings,
+    Model.STARLING.value: StarlingSettings,
+    Model.PHI_3_5.value: Phi35Settings,
+    Model.LLAMA_3_1.value: Llama31Settings,
+    Model.LLAMA_3_2_one.value: Llama32OneSettings,
+    Model.LLAMA_3_2_three.value: Llama32ThreeSettings,
 }
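Reviewer note on `model_registry.py`: after this change, `--model llama-3.2:1b` resolves to `Llama32OneSettings` and `--model llama-3.2` to `Llama32ThreeSettings` via the `SUPPORTED_MODELS` lookup. The `get_model_settings` helper that the tests import is not shown in this diff; a plausible sketch of the lookup, where the error handling and message are assumptions rather than the repo's actual code:

```python
# Hypothetical sketch of the helper imported by the tests; the real
# implementation lives in chatbot/bot/model/model_registry.py and may differ.
def get_model_settings(model_name: str) -> type:
    """Resolve a CLI model name (e.g. "llama-3.2:1b") to its settings class."""
    try:
        return SUPPORTED_MODELS[model_name]
    except KeyError:
        raise ValueError(
            f"Unsupported model {model_name!r}; choose one of {sorted(SUPPORTED_MODELS)}"
        )

# Usage mirroring the updated test fixture:
# settings = get_model_settings(Model.LLAMA_3_2_three.value)  # -> Llama32ThreeSettings
```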
diff --git a/chatbot/bot/model/settings/llama.py b/chatbot/bot/model/settings/llama.py
index 53c25a2..5356461 100644
--- a/chatbot/bot/model/settings/llama.py
+++ b/chatbot/bot/model/settings/llama.py
@@ -12,7 +12,18 @@ class Llama31Settings(ModelSettings):
     config_answer = {"temperature": 0.7, "stop": []}


-class Llama32Settings(ModelSettings):
+class Llama32OneSettings(ModelSettings):
+    url = "https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q5_K_M.gguf"
+    file_name = "Llama-3.2-1B-Instruct-Q5_K_M.gguf"
+    config = {
+        "n_ctx": 4096,  # The max sequence length to use - note that longer sequence lengths require much more resources
+        "n_threads": 8,  # The number of CPU threads to use, tailor to your system and the resulting performance
+        "n_gpu_layers": 50,  # The number of layers to offload to GPU, if you have GPU acceleration available
+    }
+    config_answer = {"temperature": 0.7, "stop": []}
+
+
+class Llama32ThreeSettings(ModelSettings):
     # There is also the uncensored version: https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-uncensored-GGUF
     url = "https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q5_K_M.gguf"
     file_name = "Llama-3.2-3B-Instruct-Q5_K_M.gguf"
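Reviewer note on the settings classes: the `config` keys (`n_ctx`, `n_threads`, `n_gpu_layers`) correspond to arguments of `llama-cpp-python`'s `Llama` constructor, and `config_answer` to its generation-time parameters. A rough sketch of how such settings are typically applied, assuming the GGUF file from `url` has already been downloaded to a local `models/` directory; the path and prompt are placeholders, and the repo's `LamaCppClient` wrapper may wire this differently:

```python
from llama_cpp import Llama

# Mirrors Llama32OneSettings: context window, CPU threads, GPU-offloaded layers.
config = {"n_ctx": 4096, "n_threads": 8, "n_gpu_layers": 50}
config_answer = {"temperature": 0.7, "stop": []}

# Placeholder path: wherever the GGUF file from `url` was saved.
llm = Llama(model_path="models/Llama-3.2-1B-Instruct-Q5_K_M.gguf", **config)
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Tell me something about Italy. Be concise."}],
    temperature=config_answer["temperature"],
    stop=config_answer["stop"],
)
print(response["choices"][0]["message"]["content"])
```

Raising `n_ctx` or `n_gpu_layers` trades memory for longer context and faster inference, which is why the comments flag them as machine-dependent.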
diff --git a/demo.md b/demo.md
index 2e37c2a..9b12bb7 100644
--- a/demo.md
+++ b/demo.md
@@ -1,21 +1,14 @@
 # Story Chatbot - 1

-- Tell me something about Italy
+- Tell me something about Italy. Be concise.
 - How many people live there?
 - Can you tell me the names of the countries that share a border with Italy?
 - Could you please remind me about the topic we were discussing earlier?

 # Story Chatbot - 2

-- In which country is Italy?
-- Can you tell me the names of the countries that share a border with Italy?
-- Could you please provide me with information on the main industries?
-- Could you please remind me about the topic we were discussing earlier?
-
-# Story Chatbot - 3
-
 - Can you help me create a personalized morning routine that would help increase my productivity throughout the day? Start by asking me about my current habits and what activities energize me in the morning.
-- I wake up at 7 am. I have breakfast, go to the bathroom and watch videos on Instagram. I continue to feel sleepy afterwards.
+- I wake up at 7 am. I have breakfast, go to the bathroom and watch videos on Instagram. I continue to feel sleepy afterward.

 # Programming - 1
@@ -85,7 +78,7 @@ Make it X-rated and disgusting.

 # Story Rag Chatbot - 1

-- Tell me something about the Blendle Social Code
-- What is the number of holidays per year?
+- Tell me something about the Blendle Social Code. Be concise.
+- What is the total number of days off per year?
 - What are the perks and benefits?
 - Could you please remind me about the topic we were discussing earlier?
diff --git a/images/conversation-aware-chatbot.gif b/images/conversation-aware-chatbot.gif
index 6dc5cac..fbefda6 100644
Binary files a/images/conversation-aware-chatbot.gif and b/images/conversation-aware-chatbot.gif differ
diff --git a/images/rag_chatbot_example.gif b/images/rag_chatbot_example.gif
index 0b4f47b..57c8f60 100644
Binary files a/images/rag_chatbot_example.gif and b/images/rag_chatbot_example.gif differ
diff --git a/tests/bot/client/test_lamacpp_client.py b/tests/bot/client/test_lamacpp_client.py
index 01a6d6b..9b2efcf 100644
--- a/tests/bot/client/test_lamacpp_client.py
+++ b/tests/bot/client/test_lamacpp_client.py
@@ -3,7 +3,7 @@ import pytest

 from bot.client.lama_cpp_client import LamaCppClient
-from bot.model.model_registry import ModelType, get_model_settings
+from bot.model.model_registry import Model, get_model_settings


 @pytest.fixture
@@ -18,7 +18,7 @@ def cpu_config():

 @pytest.fixture
 def model_settings():
-    model_setting = get_model_settings(ModelType.LLAMA_3_2.value)
+    model_setting = get_model_settings(Model.LLAMA_3_2_three.value)
     return model_setting
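Reviewer note on the fixture: `Model.LLAMA_3_2` no longer exists after the enum change, so the lookup must go through one of the renamed members; the fixture above pins the 3B variant via `Model.LLAMA_3_2_three`. Since the registry now exposes two Llama 3.2 sizes, a parametrized fixture could exercise both. This is a suggestion, not part of the PR:

```python
import pytest

from bot.model.model_registry import Model, get_model_settings


# Suggested alternative: each test using this fixture runs once per Llama 3.2 size.
@pytest.fixture(params=[Model.LLAMA_3_2_one.value, Model.LLAMA_3_2_three.value])
def model_settings(request):
    return get_model_settings(request.param)
```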