feat: add Llama 3.2 1B
umbertogriffo committed Dec 19, 2024
1 parent cc6e828 commit d017097
Showing 7 changed files with 85 additions and 81 deletions.
109 changes: 55 additions & 54 deletions README.md
```diff
@@ -141,7 +141,8 @@ format.
 | 🤖 Model                            | Supported | Model Size | Max Context Window | Notes and link to the model card                                                |
 |-------------------------------------|-----------|------------|--------------------|---------------------------------------------------------------------------------|
-| `llama-3.2` Meta Llama 3.2 Instruct | ✅        | 3B         | 128k               | **Recommended model** optimized to run locally on a mobile or edge device - [Card](https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF) |
+| `llama-3.2` Meta Llama 3.2 Instruct | ✅        | 1B         | 128k               | Optimized to run locally on a mobile or edge device - [Card](https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF) |
+| `llama-3.2` Meta Llama 3.2 Instruct | ✅        | 3B         | 128k               | Optimized to run locally on a mobile or edge device - [Card](https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF) |
 | `llama-3.1` Meta Llama 3.1 Instruct | ✅        | 8B         | 128k               | **Recommended model** [Card](https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF) |
 | `openchat-3.6` - OpenChat 3.6       | ✅        | 8B         | 8192               | [Card](https://huggingface.co/bartowski/openchat-3.6-8b-20240522-GGUF) |
 | `openchat-3.5` - OpenChat 3.5       | ✅        | 7B         | 8192               | [Card](https://huggingface.co/TheBloke/openchat-3.5-0106-GGUF) |
```
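Every GGUF build in the table loads the same way through [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), which this project wraps in `LamaCppClient`. A minimal standalone sketch; the local path is an assumption, so download the file from the 1B card first:

```python
from llama_cpp import Llama

# Assumed local path to the Q5_K_M build linked in the 1B model card above.
llm = Llama(
    model_path="models/Llama-3.2-1B-Instruct-Q5_K_M.gguf",
    n_ctx=4096,       # working context; the model itself supports up to 128k
    n_threads=8,      # tune to your CPU
    n_gpu_layers=50,  # offload layers if GPU acceleration is available
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Tell me something about Italy. Be concise."}],
    temperature=0.7,
)
print(out["choices"][0]["message"]["content"])
```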
@@ -198,64 +199,64 @@ streamlit run chatbot/rag_chatbot_app.py -- --model llama-3.2 --k 2 --synthesis-
## References

* Large Language Models (LLMs):
    * [Calculating GPU memory for serving LLMs](https://www.substratus.ai/blog/calculating-gpu-memory-for-llm/)
    * [GPT in 60 Lines of NumPy](https://jaykmody.com/blog/gpt-from-scratch/)
    * [Introduction to Weight Quantization](https://towardsdatascience.com/introduction-to-weight-quantization-2494701b9c0c)
    * [Uncensor any LLM with abliteration](https://huggingface.co/blog/mlabonne/abliteration)
* LLM Frameworks:
    * llama.cpp:
        * [llama.cpp](https://github.com/ggerganov/llama.cpp)
        * [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
    * Ollama:
        * [Ollama](https://github.com/ollama/ollama/tree/main)
        * [Ollama Python Library](https://github.com/ollama/ollama-python/tree/main)
        * [On the architecture of ollama](https://blog.inoki.cc/2024/04/15/Ollama/)
        * [Analysis of Ollama Architecture and Conversation Processing Flow for AI LLM Tool](https://medium.com/@rifewang/analysis-of-ollama-architecture-and-conversation-processing-flow-for-ai-llm-tool-ead4b9f40975)
        * [How to Customize Ollama’s Storage Directory](https://medium.com/@chhaybunsy/unleash-your-machine-learning-models-how-to-customize-ollamas-storage-directory-c9ea1ea2961a#:~:text=By%20default%2C%20Ollama%20saves%20its,making%20predictions%20or%20further%20training)
* Agent Frameworks:
    * [PydanticAI](https://ai.pydantic.dev/)
* Embeddings:
    * [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
        * A `sentence-transformers` model: it maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search (see the embedding sketch after this list).
* Vector Databases:
    * Indexing algorithms:
        * There are many algorithms for building indexes to optimize vector search. Most vector databases implement `Hierarchical Navigable Small World (HNSW)` and/or `Inverted File Index (IVF)`. These articles explain them and the trade-offs between `speed`, `memory`, and `quality` (a flat-vs-HNSW comparison is sketched after this list):
            * [Nearest Neighbor Indexes for Similarity Search](https://www.pinecone.io/learn/series/faiss/vector-indexes/)
            * [Hierarchical Navigable Small World (HNSW)](https://towardsdatascience.com/similarity-search-part-4-hierarchical-navigable-small-world-hnsw-2aad4fe87d37)
            * [From NVIDIA - Accelerating Vector Search: Using GPU-Powered Indexes with RAPIDS RAFT](https://developer.nvidia.com/blog/accelerating-vector-search-using-gpu-powered-indexes-with-rapids-raft/)
            * [From NVIDIA - Accelerating Vector Search: Fine-Tuning GPU Index Algorithms](https://developer.nvidia.com/blog/accelerating-vector-search-fine-tuning-gpu-index-algorithms/)
            * > PS: Flat indexes (i.e. no optimisation) can be used to maintain 100% recall and precision, at the expense of speed.
    * [Chroma](https://www.trychroma.com/)
        * [chroma](https://github.com/chroma-core/chroma)
    * [Food Discovery with Qdrant](https://qdrant.tech/articles/new-recommendation-api/#)
* Retrieval Augmented Generation (RAG):
    * [Building A Generative AI Platform](https://huyenchip.com/2024/07/25/genai-platform.html)
    * [Rewrite-Retrieve-Read](https://github.com/langchain-ai/langchain/blob/master/cookbook/rewrite.ipynb) (sketched after this list)
        * > Because the original query can not be always optimal to retrieve for the LLM, especially in the real world, we first prompt an LLM to rewrite the queries, then conduct retrieval-augmented reading.
    * [Rerank](https://txt.cohere.com/rag-chatbot/#implement-reranking)
    * [Building Response Synthesis from Scratch](https://gpt-index.readthedocs.io/en/latest/examples/low_level/response_synthesis.html#)
    * [Conversational awareness](https://langstream.ai/2023/10/13/rag-chatbot-with-conversation/)
    * [RAG is Dead, Again?](https://jina.ai/news/rag-is-dead-again/)
* Chatbot UI:
    * [Streamlit](https://discuss.streamlit.io/) (a minimal chat-app sketch follows this list):
        * [Build a basic LLM chat app](https://docs.streamlit.io/knowledge-base/tutorials/build-conversational-apps#build-a-chatgpt-like-app)
        * [Layouts and Containers](https://docs.streamlit.io/library/api-reference/layout)
        * [st.chat_message](https://docs.streamlit.io/library/api-reference/chat/st.chat_message)
        * [Add statefulness to apps](https://docs.streamlit.io/library/advanced-features/session-state)
        * [Why session state is not persisting between refresh?](https://discuss.streamlit.io/t/why-session-state-is-not-persisting-between-refresh/32020)
        * [st.cache_resource](https://docs.streamlit.io/library/api-reference/performance/st.cache_resource)
        * [Handling External Command Line Arguments](https://github.com/streamlit/streamlit/issues/337)
    * [Open WebUI](https://github.com/open-webui/open-webui)
    * [Running AI Locally Using Ollama on Ubuntu Linux](https://itsfoss.com/ollama-setup-linux/)
* Text Processing and Cleaning:
    * [clean-text](https://github.com/jfilter/clean-text/tree/main)
* Inspirational Open Source Repositories:
    * [lit-gpt](https://github.com/Lightning-AI/lit-gpt)
    * [api-for-open-llm](https://github.com/xusenlinzy/api-for-open-llm)
    * [AnythingLLM](https://useanything.com/)
    * [FastServe - Serve Llama-cpp with FastAPI](https://github.com/aniketmaurya/fastserve)
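The embedding sketch referenced above: encode a few sentences with `all-MiniLM-L6-v2` and compare them by cosine similarity (the sentences are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentences = [
    "What is the total amount of days off per year?",
    "How many holidays do employees get annually?",
    "The pasta was delicious.",
]
embeddings = model.encode(sentences)  # shape: (3, 384)
# Paraphrases score high; the unrelated sentence scores low.
print(util.cos_sim(embeddings[0], embeddings[1:]))
```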
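The flat-vs-HNSW comparison referenced above, as a small FAISS sketch. FAISS stands in for the index structures under discussion (this project itself uses Chroma), and random vectors keep it self-contained:

```python
import faiss  # pip install faiss-cpu
import numpy as np

d = 384  # all-MiniLM-L6-v2 embedding size
xb = np.random.rand(10_000, d).astype("float32")
query = np.random.rand(1, d).astype("float32")

flat = faiss.IndexFlatL2(d)        # exact scan: 100% recall, slow at scale
hnsw = faiss.IndexHNSWFlat(d, 32)  # graph index: much faster, approximate
flat.add(xb)
hnsw.add(xb)

_, exact_ids = flat.search(query, 5)
_, approx_ids = hnsw.search(query, 5)
print("overlap@5:", len(set(exact_ids[0]) & set(approx_ids[0])))
```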
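The Rewrite-Retrieve-Read pattern referenced above, reduced to its control flow. `ask_llm` and `retrieve` are hypothetical stand-ins for an LLM client and a vector-store lookup:

```python
def rewrite_retrieve_read(ask_llm, retrieve, question: str, k: int = 2) -> str:
    # 1. Rewrite: the raw user query is often a poor retrieval key.
    rewritten = ask_llm(f"Rewrite this question to improve document retrieval: {question}")
    # 2. Retrieve: fetch the top-k chunks for the rewritten query.
    context = "\n\n".join(retrieve(rewritten, k=k))
    # 3. Read: answer the *original* question grounded in the retrieved context.
    return ask_llm(f"Using only this context:\n{context}\n\nAnswer: {question}")
```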
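The Streamlit chat sketch referenced above: `st.session_state` keeps history across reruns, while `st.chat_message` and `st.chat_input` render the UI. The echo reply is a placeholder for a real model call:

```python
import streamlit as st

if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay the conversation so far on each rerun.
for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])

if prompt := st.chat_input("Ask something"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)
    reply = f"Echo: {prompt}"  # placeholder for an LLM call
    st.session_state.messages.append({"role": "assistant", "content": reply})
    with st.chat_message("assistant"):
        st.markdown(reply)
```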
25 changes: 12 additions & 13 deletions chatbot/bot/model/model_registry.py
```diff
@@ -1,33 +1,32 @@
 from enum import Enum
 
-from bot.model.settings.llama import Llama31Settings, Llama32Settings
+from bot.model.settings.llama import Llama31Settings, Llama32OneSettings, Llama32ThreeSettings
 from bot.model.settings.openchat import OpenChat35Settings, OpenChat36Settings
 from bot.model.settings.phi import Phi35Settings
 from bot.model.settings.stablelm_zephyr import StableLMZephyrSettings
 from bot.model.settings.starling import StarlingSettings
 
 
-class ModelType(Enum):
-    ZEPHYR = "zephyr"
-    MISTRAL = "mistral"
-    DOLPHIN = "dolphin"
+class Model(Enum):
     STABLELM_ZEPHYR = "stablelm-zephyr"
     OPENCHAT_3_5 = "openchat-3.5"
     OPENCHAT_3_6 = "openchat-3.6"
     STARLING = "starling"
     PHI_3_5 = "phi-3.5"
     LLAMA_3_1 = "llama-3.1"
-    LLAMA_3_2 = "llama-3.2"
+    LLAMA_3_2_one = "llama-3.2:1b"
+    LLAMA_3_2_three = "llama-3.2"
 
 
 SUPPORTED_MODELS = {
-    ModelType.STABLELM_ZEPHYR.value: StableLMZephyrSettings,
-    ModelType.OPENCHAT_3_5.value: OpenChat35Settings,
-    ModelType.OPENCHAT_3_6.value: OpenChat36Settings,
-    ModelType.STARLING.value: StarlingSettings,
-    ModelType.PHI_3_5.value: Phi35Settings,
-    ModelType.LLAMA_3_1.value: Llama31Settings,
-    ModelType.LLAMA_3_2.value: Llama32Settings,
+    Model.STABLELM_ZEPHYR.value: StableLMZephyrSettings,
+    Model.OPENCHAT_3_5.value: OpenChat35Settings,
+    Model.OPENCHAT_3_6.value: OpenChat36Settings,
+    Model.STARLING.value: StarlingSettings,
+    Model.PHI_3_5.value: Phi35Settings,
+    Model.LLAMA_3_1.value: Llama31Settings,
+    Model.LLAMA_3_2_one.value: Llama32OneSettings,
+    Model.LLAMA_3_2_three.value: Llama32ThreeSettings,
 }
```


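For illustration, a hypothetical lookup through the renamed enum, assuming `get_model_settings` (defined in the elided part of this file) resolves its argument via `SUPPORTED_MODELS`:

```python
from bot.model.model_registry import Model, get_model_settings

# "llama-3.2:1b" now resolves to the new 1B settings class.
settings = get_model_settings(Model.LLAMA_3_2_one.value)
print(settings.file_name)  # e.g. Llama-3.2-1B-Instruct-Q5_K_M.gguf
```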
13 changes: 12 additions & 1 deletion chatbot/bot/model/settings/llama.py
```diff
@@ -12,7 +12,18 @@ class Llama31Settings(ModelSettings):
     config_answer = {"temperature": 0.7, "stop": []}
 
 
-class Llama32Settings(ModelSettings):
+class Llama32OneSettings(ModelSettings):
+    url = "https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q5_K_M.gguf"
+    file_name = "Llama-3.2-1B-Instruct-Q5_K_M.gguf"
+    config = {
+        "n_ctx": 4096,  # The max sequence length to use - note that longer sequence lengths require much more resources
+        "n_threads": 8,  # The number of CPU threads to use, tailor to your system and the resulting performance
+        "n_gpu_layers": 50,  # The number of layers to offload to GPU, if you have GPU acceleration available
+    }
+    config_answer = {"temperature": 0.7, "stop": []}
+
+
+class Llama32ThreeSettings(ModelSettings):
     # There is also the uncensored version: https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-uncensored-GGUF
     url = "https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q5_K_M.gguf"
     file_name = "Llama-3.2-3B-Instruct-Q5_K_M.gguf"
```
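A sketch of how a settings class like this is typically consumed: download `url` to `file_name` once, then forward `config` to `llama_cpp.Llama`. The `models/` directory and the direct `llama_cpp` usage are assumptions; the project's actual client code is outside this diff:

```python
import urllib.request
from pathlib import Path

from llama_cpp import Llama

from bot.model.settings.llama import Llama32OneSettings

settings = Llama32OneSettings  # class attributes are enough here
model_path = Path("models") / settings.file_name
if not model_path.exists():
    model_path.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(settings.url, str(model_path))  # one-time download

llm = Llama(model_path=str(model_path), **settings.config)
```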
15 changes: 4 additions & 11 deletions demo.md
```diff
@@ -1,21 +1,14 @@
 # Story Chatbot - 1
 
-- Tell me something about Italy
+- Tell me something about Italy. Be concise.
 - How many people live there?
 - Can you tell me the names of the countries that share a border with Italy?
 - Could you please remind me about the topic we were discussing earlier?
 
-# Story Chatbot - 2
-
-- In which country is Italy?
-- Can you tell me the names of the countries that share a border with Italy?
-- Could you please provide me with information on the main industries?
-- Could you please remind me about the topic we were discussing earlier?
-
 # Story Chatbot - 3
 
 - Can you help me create a personalized morning routine that would help increase my productivity throughout the day? Start by asking me about my current habits and what activities energize me in the morning.
-- I wake up at 7 am. I have breakfast, go to the bathroom and watch videos on Instagram. I continue to feel sleepy afterwards.
+- I wake up at 7 am. I have breakfast, go to the bathroom and watch videos on Instagram. I continue to feel sleepy afterward.
 
 # Programming - 1
 
@@ -85,7 +78,7 @@ Make it X-rated and disgusting.
 
 # Story Rag Chatbot - 1
 
-- Tell me something about the Blendle Social Code
-- What is the number of holidays per year?
+- Tell me something about the Blendle Social Code. Be concise.
+- What is the total amount of days off per year?
 - What are the perks and benefits?
 - Could you please remind me about the topic we were discussing earlier?
```
Binary file modified images/conversation-aware-chatbot.gif
Binary file modified images/rag_chatbot_example.gif
4 changes: 2 additions & 2 deletions tests/bot/client/test_lamacpp_client.py
```diff
@@ -3,7 +3,7 @@
 
 import pytest
 from bot.client.lama_cpp_client import LamaCppClient
-from bot.model.model_registry import ModelType, get_model_settings
+from bot.model.model_registry import Model, get_model_settings
 
 
 @pytest.fixture
@@ -18,7 +18,7 @@ def cpu_config():
 
 @pytest.fixture
 def model_settings():
-    model_setting = get_model_settings(ModelType.LLAMA_3_2.value)
+    model_setting = get_model_settings(Model.LLAMA_3_2_three.value)
     return model_setting
```


