feat: add Llama 3.2 1B
umbertogriffo committed Dec 19, 2024
1 parent cc6e828 commit d017097
Showing 7 changed files with 85 additions and 81 deletions.
109 changes: 55 additions & 54 deletions README.md
```diff
@@ -141,7 +141,8 @@ format.
 | 🤖 Model                            | Supported | Model Size | Max Context Window | Notes and link to the model card                                                |
 |-------------------------------------|-----------|------------|--------------------|---------------------------------------------------------------------------------|
-| `llama-3.2` Meta Llama 3.2 Instruct | ✅        | 3B         | 128k               | **Recommended model** optimized to run locally on a mobile or edge device - [Card](https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF) |
+| `llama-3.2` Meta Llama 3.2 Instruct | ✅        | 1B         | 128k               | Optimized to run locally on a mobile or edge device - [Card](https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF) |
+| `llama-3.2` Meta Llama 3.2 Instruct | ✅        | 3B         | 128k               | Optimized to run locally on a mobile or edge device - [Card](https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF) |
 | `llama-3.1` Meta Llama 3.1 Instruct | ✅        | 8B         | 128k               | **Recommended model** [Card](https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF) |
 | `openchat-3.6` - OpenChat 3.6       | ✅        | 8B         | 8192               | [Card](https://huggingface.co/bartowski/openchat-3.6-8b-20240522-GGUF) |
 | `openchat-3.5` - OpenChat 3.5       | ✅        | 7B         | 8192               | [Card](https://huggingface.co/TheBloke/openchat-3.5-0106-GGUF) |
```
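Every GGUF build in the table loads the same way through [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), which this project wraps in `LamaCppClient`. A minimal standalone sketch; the local path is an assumption, so download the file from the 1B card first:

```python
from llama_cpp import Llama

# Assumed local path to the Q5_K_M build linked in the 1B model card above.
llm = Llama(
    model_path="models/Llama-3.2-1B-Instruct-Q5_K_M.gguf",
    n_ctx=4096,       # working context; the model itself supports up to 128k
    n_threads=8,      # tune to your CPU
    n_gpu_layers=50,  # offload layers if GPU acceleration is available
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Tell me something about Italy. Be concise."}],
    temperature=0.7,
)
print(out["choices"][0]["message"]["content"])
```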
@@ -198,64 +199,64 @@ streamlit run chatbot/rag_chatbot_app.py -- --model llama-3.2 --k 2 --synthesis-
## References

* Large Language Models (LLMs):
    * [Calculating GPU memory for serving LLMs](https://www.substratus.ai/blog/calculating-gpu-memory-for-llm/)
    * [GPT in 60 Lines of NumPy](https://jaykmody.com/blog/gpt-from-scratch/)
    * [Introduction to Weight Quantization](https://towardsdatascience.com/introduction-to-weight-quantization-2494701b9c0c)
    * [Uncensor any LLM with abliteration](https://huggingface.co/blog/mlabonne/abliteration)
* LLM Frameworks:
    * llama.cpp:
        * [llama.cpp](https://github.com/ggerganov/llama.cpp)
        * [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
    * Ollama:
        * [Ollama](https://github.com/ollama/ollama/tree/main)
        * [Ollama Python Library](https://github.com/ollama/ollama-python/tree/main)
        * [On the architecture of ollama](https://blog.inoki.cc/2024/04/15/Ollama/)
        * [Analysis of Ollama Architecture and Conversation Processing Flow for AI LLM Tool](https://medium.com/@rifewang/analysis-of-ollama-architecture-and-conversation-processing-flow-for-ai-llm-tool-ead4b9f40975)
        * [How to Customize Ollama’s Storage Directory](https://medium.com/@chhaybunsy/unleash-your-machine-learning-models-how-to-customize-ollamas-storage-directory-c9ea1ea2961a#:~:text=By%20default%2C%20Ollama%20saves%20its,making%20predictions%20or%20further%20training)
* Agent Frameworks:
    * [PydanticAI](https://ai.pydantic.dev/)
* Embeddings:
    * [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
        * A `sentence-transformers` model: it maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search (see the embedding sketch after this list).
* Vector Databases:
    * Indexing algorithms:
        * There are many algorithms for building indexes to optimize vector search. Most vector databases implement `Hierarchical Navigable Small World (HNSW)` and/or `Inverted File Index (IVF)`. These articles explain them and the trade-offs between `speed`, `memory`, and `quality` (a flat-vs-HNSW comparison is sketched after this list):
            * [Nearest Neighbor Indexes for Similarity Search](https://www.pinecone.io/learn/series/faiss/vector-indexes/)
            * [Hierarchical Navigable Small World (HNSW)](https://towardsdatascience.com/similarity-search-part-4-hierarchical-navigable-small-world-hnsw-2aad4fe87d37)
            * [From NVIDIA - Accelerating Vector Search: Using GPU-Powered Indexes with RAPIDS RAFT](https://developer.nvidia.com/blog/accelerating-vector-search-using-gpu-powered-indexes-with-rapids-raft/)
            * [From NVIDIA - Accelerating Vector Search: Fine-Tuning GPU Index Algorithms](https://developer.nvidia.com/blog/accelerating-vector-search-fine-tuning-gpu-index-algorithms/)
            * > PS: Flat indexes (i.e. no optimisation) can be used to maintain 100% recall and precision, at the expense of speed.
    * [Chroma](https://www.trychroma.com/)
        * [chroma](https://github.com/chroma-core/chroma)
    * [Food Discovery with Qdrant](https://qdrant.tech/articles/new-recommendation-api/#)
* Retrieval Augmented Generation (RAG):
    * [Building A Generative AI Platform](https://huyenchip.com/2024/07/25/genai-platform.html)
    * [Rewrite-Retrieve-Read](https://github.com/langchain-ai/langchain/blob/master/cookbook/rewrite.ipynb) (sketched after this list)
        * > Because the original query can not be always optimal to retrieve for the LLM, especially in the real world, we first prompt an LLM to rewrite the queries, then conduct retrieval-augmented reading.
    * [Rerank](https://txt.cohere.com/rag-chatbot/#implement-reranking)
    * [Building Response Synthesis from Scratch](https://gpt-index.readthedocs.io/en/latest/examples/low_level/response_synthesis.html#)
    * [Conversational awareness](https://langstream.ai/2023/10/13/rag-chatbot-with-conversation/)
    * [RAG is Dead, Again?](https://jina.ai/news/rag-is-dead-again/)
* Chatbot UI:
    * [Streamlit](https://discuss.streamlit.io/) (a minimal chat-app sketch follows this list):
        * [Build a basic LLM chat app](https://docs.streamlit.io/knowledge-base/tutorials/build-conversational-apps#build-a-chatgpt-like-app)
        * [Layouts and Containers](https://docs.streamlit.io/library/api-reference/layout)
        * [st.chat_message](https://docs.streamlit.io/library/api-reference/chat/st.chat_message)
        * [Add statefulness to apps](https://docs.streamlit.io/library/advanced-features/session-state)
        * [Why session state is not persisting between refresh?](https://discuss.streamlit.io/t/why-session-state-is-not-persisting-between-refresh/32020)
        * [st.cache_resource](https://docs.streamlit.io/library/api-reference/performance/st.cache_resource)
        * [Handling External Command Line Arguments](https://github.com/streamlit/streamlit/issues/337)
    * [Open WebUI](https://github.com/open-webui/open-webui)
    * [Running AI Locally Using Ollama on Ubuntu Linux](https://itsfoss.com/ollama-setup-linux/)
* Text Processing and Cleaning:
    * [clean-text](https://github.com/jfilter/clean-text/tree/main)
* Inspirational Open Source Repositories:
    * [lit-gpt](https://github.com/Lightning-AI/lit-gpt)
    * [api-for-open-llm](https://github.com/xusenlinzy/api-for-open-llm)
    * [AnythingLLM](https://useanything.com/)
    * [FastServe - Serve Llama-cpp with FastAPI](https://github.com/aniketmaurya/fastserve)
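The embedding sketch referenced above: encode a few sentences with `all-MiniLM-L6-v2` and compare them by cosine similarity (the sentences are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentences = [
    "What is the total amount of days off per year?",
    "How many holidays do employees get annually?",
    "The pasta was delicious.",
]
embeddings = model.encode(sentences)  # shape: (3, 384)
# Paraphrases score high; the unrelated sentence scores low.
print(util.cos_sim(embeddings[0], embeddings[1:]))
```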
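The flat-vs-HNSW comparison referenced above, as a small FAISS sketch. FAISS stands in for the index structures under discussion (this project itself uses Chroma), and random vectors keep it self-contained:

```python
import faiss  # pip install faiss-cpu
import numpy as np

d = 384  # all-MiniLM-L6-v2 embedding size
xb = np.random.rand(10_000, d).astype("float32")
query = np.random.rand(1, d).astype("float32")

flat = faiss.IndexFlatL2(d)        # exact scan: 100% recall, slow at scale
hnsw = faiss.IndexHNSWFlat(d, 32)  # graph index: much faster, approximate
flat.add(xb)
hnsw.add(xb)

_, exact_ids = flat.search(query, 5)
_, approx_ids = hnsw.search(query, 5)
print("overlap@5:", len(set(exact_ids[0]) & set(approx_ids[0])))
```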
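The Rewrite-Retrieve-Read pattern referenced above, reduced to its control flow. `ask_llm` and `retrieve` are hypothetical stand-ins for an LLM client and a vector-store lookup:

```python
def rewrite_retrieve_read(ask_llm, retrieve, question: str, k: int = 2) -> str:
    # 1. Rewrite: the raw user query is often a poor retrieval key.
    rewritten = ask_llm(f"Rewrite this question to improve document retrieval: {question}")
    # 2. Retrieve: fetch the top-k chunks for the rewritten query.
    context = "\n\n".join(retrieve(rewritten, k=k))
    # 3. Read: answer the *original* question grounded in the retrieved context.
    return ask_llm(f"Using only this context:\n{context}\n\nAnswer: {question}")
```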
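The Streamlit chat sketch referenced above: `st.session_state` keeps history across reruns, while `st.chat_message` and `st.chat_input` render the UI. The echo reply is a placeholder for a real model call:

```python
import streamlit as st

if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay the conversation so far on each rerun.
for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])

if prompt := st.chat_input("Ask something"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)
    reply = f"Echo: {prompt}"  # placeholder for an LLM call
    st.session_state.messages.append({"role": "assistant", "content": reply})
    with st.chat_message("assistant"):
        st.markdown(reply)
```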
25 changes: 12 additions & 13 deletions chatbot/bot/model/model_registry.py
```diff
@@ -1,33 +1,32 @@
 from enum import Enum
 
-from bot.model.settings.llama import Llama31Settings, Llama32Settings
+from bot.model.settings.llama import Llama31Settings, Llama32OneSettings, Llama32ThreeSettings
 from bot.model.settings.openchat import OpenChat35Settings, OpenChat36Settings
 from bot.model.settings.phi import Phi35Settings
 from bot.model.settings.stablelm_zephyr import StableLMZephyrSettings
 from bot.model.settings.starling import StarlingSettings
 
 
-class ModelType(Enum):
-    ZEPHYR = "zephyr"
-    MISTRAL = "mistral"
-    DOLPHIN = "dolphin"
+class Model(Enum):
     STABLELM_ZEPHYR = "stablelm-zephyr"
     OPENCHAT_3_5 = "openchat-3.5"
     OPENCHAT_3_6 = "openchat-3.6"
     STARLING = "starling"
     PHI_3_5 = "phi-3.5"
     LLAMA_3_1 = "llama-3.1"
-    LLAMA_3_2 = "llama-3.2"
+    LLAMA_3_2_one = "llama-3.2:1b"
+    LLAMA_3_2_three = "llama-3.2"
 
 
 SUPPORTED_MODELS = {
-    ModelType.STABLELM_ZEPHYR.value: StableLMZephyrSettings,
-    ModelType.OPENCHAT_3_5.value: OpenChat35Settings,
-    ModelType.OPENCHAT_3_6.value: OpenChat36Settings,
-    ModelType.STARLING.value: StarlingSettings,
-    ModelType.PHI_3_5.value: Phi35Settings,
-    ModelType.LLAMA_3_1.value: Llama31Settings,
-    ModelType.LLAMA_3_2.value: Llama32Settings,
+    Model.STABLELM_ZEPHYR.value: StableLMZephyrSettings,
+    Model.OPENCHAT_3_5.value: OpenChat35Settings,
+    Model.OPENCHAT_3_6.value: OpenChat36Settings,
+    Model.STARLING.value: StarlingSettings,
+    Model.PHI_3_5.value: Phi35Settings,
+    Model.LLAMA_3_1.value: Llama31Settings,
+    Model.LLAMA_3_2_one.value: Llama32OneSettings,
+    Model.LLAMA_3_2_three.value: Llama32ThreeSettings,
 }
```


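For illustration, a hypothetical lookup through the renamed enum, assuming `get_model_settings` (defined in the elided part of this file) resolves its argument via `SUPPORTED_MODELS`:

```python
from bot.model.model_registry import Model, get_model_settings

# "llama-3.2:1b" now resolves to the new 1B settings class.
settings = get_model_settings(Model.LLAMA_3_2_one.value)
print(settings.file_name)  # e.g. Llama-3.2-1B-Instruct-Q5_K_M.gguf
```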
13 changes: 12 additions & 1 deletion chatbot/bot/model/settings/llama.py
```diff
@@ -12,7 +12,18 @@ class Llama31Settings(ModelSettings):
     config_answer = {"temperature": 0.7, "stop": []}
 
 
-class Llama32Settings(ModelSettings):
+class Llama32OneSettings(ModelSettings):
+    url = "https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q5_K_M.gguf"
+    file_name = "Llama-3.2-1B-Instruct-Q5_K_M.gguf"
+    config = {
+        "n_ctx": 4096,  # The max sequence length to use - note that longer sequence lengths require much more resources
+        "n_threads": 8,  # The number of CPU threads to use, tailor to your system and the resulting performance
+        "n_gpu_layers": 50,  # The number of layers to offload to GPU, if you have GPU acceleration available
+    }
+    config_answer = {"temperature": 0.7, "stop": []}
+
+
+class Llama32ThreeSettings(ModelSettings):
     # There is also the uncensored version: https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-uncensored-GGUF
     url = "https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q5_K_M.gguf"
     file_name = "Llama-3.2-3B-Instruct-Q5_K_M.gguf"
```
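A sketch of how a settings class like this is typically consumed: download `url` to `file_name` once, then forward `config` to `llama_cpp.Llama`. The `models/` directory and the direct `llama_cpp` usage are assumptions; the project's actual client code is outside this diff:

```python
import urllib.request
from pathlib import Path

from llama_cpp import Llama

from bot.model.settings.llama import Llama32OneSettings

settings = Llama32OneSettings  # class attributes are enough here
model_path = Path("models") / settings.file_name
if not model_path.exists():
    model_path.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(settings.url, str(model_path))  # one-time download

llm = Llama(model_path=str(model_path), **settings.config)
```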
15 changes: 4 additions & 11 deletions demo.md
```diff
@@ -1,21 +1,14 @@
 # Story Chatbot - 1
 
-- Tell me something about Italy
+- Tell me something about Italy. Be concise.
 - How many people live there?
 - Can you tell me the names of the countries that share a border with Italy?
 - Could you please remind me about the topic we were discussing earlier?
 
-# Story Chatbot - 2
-
-- In which country is Italy?
-- Can you tell me the names of the countries that share a border with Italy?
-- Could you please provide me with information on the main industries?
-- Could you please remind me about the topic we were discussing earlier?
-
 # Story Chatbot - 3
 
 - Can you help me create a personalized morning routine that would help increase my productivity throughout the day? Start by asking me about my current habits and what activities energize me in the morning.
-- I wake up at 7 am. I have breakfast, go to the bathroom and watch videos on Instagram. I continue to feel sleepy afterwards.
+- I wake up at 7 am. I have breakfast, go to the bathroom and watch videos on Instagram. I continue to feel sleepy afterward.
 
 # Programming - 1
 
@@ -85,7 +78,7 @@ Make it X-rated and disgusting.
 
 # Story Rag Chatbot - 1
 
-- Tell me something about the Blendle Social Code
-- What is the number of holidays per year?
+- Tell me something about the Blendle Social Code. Be concise.
+- What is the total amount of days off per year?
 - What are the perks and benefits?
 - Could you please remind me about the topic we were discussing earlier?
```
Binary file modified images/conversation-aware-chatbot.gif
Binary file modified images/rag_chatbot_example.gif
4 changes: 2 additions & 2 deletions tests/bot/client/test_lamacpp_client.py
```diff
@@ -3,7 +3,7 @@
 
 import pytest
 from bot.client.lama_cpp_client import LamaCppClient
-from bot.model.model_registry import ModelType, get_model_settings
+from bot.model.model_registry import Model, get_model_settings
 
 
 @pytest.fixture
@@ -18,7 +18,7 @@ def cpu_config():
 
 @pytest.fixture
 def model_settings():
-    model_setting = get_model_settings(ModelType.LLAMA_3_2.value)
+    model_setting = get_model_settings(Model.LLAMA_3_2_three.value)
     return model_setting
```


