
Tool call support (Llama 3.x, Functionary v3, Hermes 2 Pro, Mistral Nemo, generic) w/ lazy grammars & minimalist Jinja engine #9639

Draft - wants to merge 124 commits into master

Conversation

ochafik (Collaborator) commented Sep 25, 2024

This supersedes #6389 (now using a fully C++ approach), #5695 (first attempt at supporting Functionary) and #9592 (more recent Python wrapper).

Background

This PR tackles two main problems related to tool calling:

  • Lazy grammars: Helping / forcing the model to follow the tool schemas w/ grammar constraints is tricky, as in most cases the model may also output normal, unconstrained content (unless "tool_choice": "required" is specified in the request). It's not currently possible to say .* "<tool_call>" constrained "</tool_call>", as the leading .* will match eagerly. In #6389 ([WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python)) I avoided this issue with the thoughtful_steps style, but the native tool call styles were still problematic.

    • Solved w/ lazy grammars activated by trigger words (similar to stop words, and refactored into the same implementation). Output is completely unconstrained before a trigger, and completely constrained after, which allows for content vs. tool_call outputs, and even mixes of the two (for the few models that support that); see the sketch after this list.

      • For Llama3.1-Instruct (cf. llama-stack-apps repo) for instance, triggers are <|python_tag|> and {"name": "toolN" (for each toolN in the list of tools in the request).
      • For Llama3.2-Instruct, we eagerly trigger on {" which isn't quite right but helps steer the 1B & 3B models. Will try to detect model size to keep a more specific trigger for the bigger 3.2 models.
      • For Hermes Pro (cf. Hermes-Function-Calling repo), it's <tool_call>.
      • For Functionary v3.llama3, it's >>>toolN\n for each toolN.
      • For Functionary v3-llama3.1, it's <function= and <|python_tag|>.
      • For Mistral Nemo, the trigger ought to be [TOOL_CALLS] but it doesn't seem to (ever?) be emitted, so we're triggering on {" instead for now.
      • For other models ("generic" tool call style), no lazy grammars are used - just a normal JSON schema that can contain schema-constrained tool calls or content (unless tool_choice is required).
  • Jinja chat templates for tool-call-able models are getting increasingly complex, and implementing each of them in C++ is a maintenance hazard.

    • Solved by implementing a minimal Jinja engine (minja.hpp), with just enough to render all the templates I could find in the wild. That's still a lot of code (2.5k LOC), but about 10x less than Jinja2Cpp (not even counting its dependencies - it needs a subset of Boost and some C++ backfills). It's trivial to extend (say, to add support for a new filter / test), it comes with decent error reporting and simple tests, and we could always switch to another implementation in the future.
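To make the lazy-grammar idea concrete, here is a minimal Python sketch of the control flow. It is illustrative only - the PR implements this in C++ inside the sampling loop, and sample_free / sample_constrained are hypothetical stand-ins for one step of unconstrained vs. grammar-constrained decoding:

    # Sketch of lazy grammars: unconstrained until a trigger word appears,
    # grammar-constrained from then on. Names are illustrative, not the PR's API.
    def generate_with_lazy_grammar(sample_free, sample_constrained, triggers,
                                   max_tokens=512):
        out = ""
        constrained = False
        for _ in range(max_tokens):
            piece = (sample_constrained if constrained else sample_free)()
            if piece is None:  # end of generation
                break
            out += piece
            # Naive substring scan for clarity; the PR matches triggers
            # incrementally with an Aho-Corasick automaton instead.
            if not constrained and any(t in out for t in triggers):
                # From here on, every token must follow the tool-call grammar
                # (the matched trigger itself is part of the grammar's start).
                constrained = True
        return out

Before a trigger such as <tool_call> or {"name": "toolN" fires, the model is free to produce plain content; once it fires, decoding is fully constrained, so tool calls always parse.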

With this intro out of the way, here are the parts of this PR that could be sent separately (currently itemized, to be re-itemized as commits):

  • grammar_trigger_words + llama_antiprompts: refactors the stop logic (a barebones Aho-Corasick automaton to handle multiple stop words efficiently - with grammar trigger words we may have many), aligning cli & server (e.g. single-token stop logic) and handling grammar trigger words; see the sketch after this list.

  • minja.hpp + test/{test-minja.cpp,update_jinja_goldens.py,chat/{contexts,templates,goldens}}: minimal Jinja templating engine and its tests against actual templates & a few test contexts

  • Tool call grammar generation + output parsing logic for Llama 3.1, Functionary v3 (2 variants) and Hermes 2 Pro

  • Integration in llama-server (fenced by --jinja) w/ tools, tool_choice support + updated response_format compliance.

  • Minimal examples/agent with a tool call / action loop, barebones tools and instructions / support to run them in a siloed docker container (see usage below)
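For the antiprompt refactor mentioned above, the core data structure is an Aho-Corasick automaton. Below is a self-contained Python sketch of the idea (the actual implementation is C++ in llama_antiprompts; this is just to show how many stop / trigger words can be matched in one pass over streamed characters):

    from collections import deque

    class MultiStopMatcher:
        """Barebones Aho-Corasick: feed characters one at a time, get back any
        stop / trigger words that end at the current position."""
        def __init__(self, words):
            self.next = [{}]   # trie edges: node -> {char: child node}
            self.fail = [0]    # fallback link: longest proper suffix state
            self.hits = [[]]   # words ending at each node
            for w in words:
                node = 0
                for ch in w:
                    if ch not in self.next[node]:
                        self.next.append({})
                        self.fail.append(0)
                        self.hits.append([])
                        self.next[node][ch] = len(self.next) - 1
                    node = self.next[node][ch]
                self.hits[node].append(w)
            queue = deque(self.next[0].values())
            while queue:  # BFS to fill in failure links
                node = queue.popleft()
                for ch, child in self.next[node].items():
                    queue.append(child)
                    f = self.fail[node]
                    while f and ch not in self.next[f]:
                        f = self.fail[f]
                    self.fail[child] = self.next[f].get(ch, 0)
                    self.hits[child] += self.hits[self.fail[child]]
            self.state = 0

        def feed(self, ch):
            """Advance by one character; return the words that just matched."""
            while self.state and ch not in self.next[self.state]:
                self.state = self.fail[self.state]
            self.state = self.next[self.state].get(ch, 0)
            return self.hits[self.state]

Usage, with triggers from the styles above:

    matcher = MultiStopMatcher(["<tool_call>", "<|python_tag|>"])
    for ch in 'Sure!<tool_call>{"name": ...':
        if matcher.feed(ch):
            print("matched - stop, or switch to the tool-call grammar")
            break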

How to use / test

  • Run llama-server w/ Jinja templates. Note that most models need a chat template override (the HF to GGUF conversion only retains a single chat_template, but some models only support tool calls in an alternative chat template).

    make -j LLAMA_CURL=1 llama-server
    
    # Nous Hermes 2 Pro Llama 3 8B (recommended at that size)
    ./llama-server --jinja -fa --verbose \
      -hfr NousResearch/Hermes-2-Pro-Llama-3-8B-GGUF -hff Hermes-2-Pro-Llama-3-8B-Q8_0.gguf \
      --chat-template-file tests/chat/templates/NousResearch-Hermes-2-Pro-Llama-3-8B-tool_use.jinja
    
    # Mistral Nemo
    ./llama-server --jinja -fa --verbose \
      -hfr bartowski/Mistral-Nemo-Instruct-2407-GGUF -hff Mistral-Nemo-Instruct-2407-Q8_0.gguf \
      --chat-template-file tests/chat/templates/mistralai-Mistral-Nemo-Instruct-2407.jinja
    
    # Llama 3.1 8B
    ./llama-server --jinja -fa --verbose \
      -hfr lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF -hff Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf
    
    # functionary-small-v3.2 (served with the medium-v3.2 chat template)
    ./llama-server --jinja -fa --verbose \
      -hfr meetkai/functionary-small-v3.2-GGUF -hff functionary-small-v3.2.Q4_0.gguf \
      --chat-template-file tests/chat/templates/meetkai-functionary-medium-v3.2.jinja
    
    # Llama 3.2 3B (poor compliance)
    ./llama-server --jinja -fa --verbose \
      -hfr lmstudio-community/Llama-3.2-3B-Instruct-GGUF -hff Llama-3.2-3B-Instruct-Q6_K_L.gguf \
      --chat-template-file tests/chat/templates/meta-llama-Llama-3.2-3B-Instruct.jinja
  • Expose the functions in examples/agent/tools as a FastAPI service inside a docker container for some level of isolation (+ sneaky logging of outgoing http and https traffic: you'll want to watch over those agents' shoulders for the time being 🧐):

    examples/agent/serve_tools_inside_docker.sh

    [!WARNING]
    The command above gives tools (and your agent) access to the web (and read-only access to examples/agent/**). You can loosen / restrict web access in examples/agent/squid/conf/squid.conf.

  • Run the example agent with a simple goal and access to the tools service (define OPENAI_API_KEY and add --openai to compare to OpenAI):

    uv run examples/agent/run.py --tools http://localhost:8088 \
      "What is the sum of 2535 squared and 32222000403?"
    
    > 🛠️  fetch_page, wait_for_date, wait_for_duration, python, brave_search
    > ⚙️  python(code="print(2535**2 + 32222000403)") → 32228426628
    > 
    > The function "python" executes the given Python code and returns the output. In this case, the code is printing the sum of 2535 squared and 32222000403.
    uv run examples/agent/run.py --tools http://localhost:8088 \
      "What is the best BBQ join in Laguna Beach?"
    
    > 🛠️  Tools: python, fetch_page, brave_search
    > ⚙️  brave_search(query="best bbq joint in laguna beach")
    > → 4283 chars
    > Based on the search results, Beach Pit BBQ seems to be a popular and highly-rated BBQ joint in Laguna Beach. They offer a variety of BBQ options, including ribs, pulled pork, brisket, salads, wings, and more. They have dine-in, take-out, and catering options available.
    uv run examples/agent/run.py --tools http://localhost:8088 \
      "Search for, fetch and summarize the homepage of llama.cpp"
    
    🛠️  Tools: python, fetch_page, brave_search
    ⚙️  brave_search(query="summary of homepage of llama.cpp")
     → 10 items
    ⚙️  brave_search(query="homepage of llama.cpp")
     → 10 items
     ⚙️  fetch_page(url="https://github.com/ggerganov/llama.cpp")
     → 47397 chars
      "Skip to content\n\n## Navigation Menu\n\nToggle navigation\n\n[ ](/)\n\n[ Sign in ](/login?return_to=https%3A%2F%2Fgithub.com%2Fggerganov%2Fllama.cpp)\n\n  * Product \n\n    * [ Actions Automate any workflow  ](https://github.com/features/actions)\n    * [ Security Find and fix vulnerabilities  ](https://github.com/features/security)\n    * [ Codespaces Instant dev environments  ](https://github.com/features/codespaces)\n    * [ GitHub Copilot Write better code with AI  ](https://github.com/features/copilot)\n    * [ Code review Manage code changes  ](https://github.com/features/code-review)\n    * [ Issues Plan and track work  ](https://github.com/features/issues)\n    * [ Discussions Collaborate outside of code  ](https://github.com/features/discussions)\n\nExplore\n\n    * [ All features ](https://github.com/features)\n    * [ Documentation  ](https://docs.github.com)\n    * [ GitHub Skills  ](https://skills.github.com)\n    * [ Blog  ](https://github.blog)\n\n  * Solutions \n\nBy 
    The home page of llama.cpp is a GitHub repository that provides a C/C++ implementation of LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware. It includes support for Apple silicon, AVX, AVX2, and AVX512 architectures, as well as custom CUDA kernels for running LLMs on NVIDIA GPUs. The repository also includes a web server that can be used to serve local models and easily connect them to existing clients.
    
    The home page also provides a list of supported models, including LLaMA, LLaMA 2, LLaMA 3, Mistral 7B, and others. It also includes information on how to build and install llama.cpp, as well as how to use the web server and the command-line interface.
    
    The home page also includes a list of tools and resources, including a GitHub repository for the llama.cpp project, a Hugging Face space for preparing and quantizing models, and a browser app for writing GBNF grammars.
    

Overall, the home page of llama.cpp provides a wealth of information on how to use and contribute to the project, as well as a list of resources and tools that can be used to work with the project.
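For a quick sanity check of the server's tool support without the agent, you can also POST an OpenAI-style request directly. A hedged Python sketch (assumptions: llama-server is listening on its default port 8080, and get_weather is a made-up tool; with --jinja and a tool-call-capable template, the assistant message should come back with either content or a tool_calls array):

    # Hypothetical direct request to llama-server's OpenAI-compatible endpoint.
    # Port 8080 (the server default) and the get_weather tool are assumptions.
    import requests

    response = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "messages": [
                {"role": "user", "content": "What's the weather in Paris?"},
            ],
            "tools": [{
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "description": "Get the current weather in a given city",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }],
        },
    )
    print(response.json()["choices"][0]["message"])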

TODOs before undrafting:

  • Move minja to its own location w/ fuller testing (fuzzing, etc) or at least its own PR
  • Fix CI build (tests still failing on windows)
  • Nemo: investigate why the [TOOL_CALLS] prefix is never generated by the model (expected, or a bug in the GGUF conversion / llama.cpp?)
  • Bring back generic thoughtful_steps tool support from [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389 (using JSON structured output even with models not trained for tool calling)
  • Support streaming (of content - as long as it doesn't trigger any partial antiprompt match - and of individual tool calls)
  • Strip leading "all\n" in non-tool-call outputs for Functionary v3.2
  • Implement strftime_now in minja (for Llama 3.2), also update today's date for Llama 3.1
  • Test w/ more models
  • Add grammar trigger words support to llama-cli
  • Support regexps as antiprompts? Would allow triggering tool call grammar for small Llama 3.2 models (1B, 3B) on (^|\n)?{" and otherwise not trigger spuriously elsewhere.
  • Add support for broken templates (GLM3..., Command R Plus, DeepSeek)
  • Add more tests (heavy e2e w/ actual models, tool_choice = none, parallel tool call, etc)
  • Add support for {"type": "code_interpreter"} (special-cased by functionary-medium-v3.1's template), maybe using ipython automatically for llama 3.1
  • Add configurable network isolation of tools w/ a proxy (also caches pip & deb packages & limits access to host)
  • KV cache saving / reuse (within session & beyond) in agent (--cache-prompt defaults to true; a follow-up will be to allow in-slot restoration and saving of the cache, see this branch for instance)
  • Add tool call grammar tests (although indirectly covered by server "required" test cases)
  • Add more tools (brave search) + agent examples
  • Refactorings?
    • Ideally we'd pass some kind of ChatHandler between the OAI init & final callback, and have it handle both streaming and non-streaming cases (should parallel tool calls be streamed?)
    • chat_template should maybe be resolved earlier? (now a llama_chat_template class)
    • llama_apply_chat_template would benefit from a massive facelift. Maybe passing in a struct? (have introduced a new C++ API llama_chat_template::apply)
    • llama_token_to_piece(ctx, token) should really take (model, token) instead, but that's a breaking API change
      • calls common-local _llama_token_to_piece that takes model. Moved llama_chat_template_from_model helper to common.cpp
  • Fix functionary-medium-* templates' golden generation
  • Add examples to server readme
  • Support key-value overrides for templates (e.g. builtin_tools and todays_date in llama3.1's template)
    • Done by tool call handler, not user-configurable
  • Unify test-chat-templates & test-minja (write each test case in a .jinja file)
    • Fix a couple of missing bos_token in the current chat template logic
  • Bring back agent / tool call loop example + python tools isolation in docker (examples/tool-call) from [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389
  • Test w/ meetkai/functionary-small-v3.2

Possible follow ups:

  • Add tool call loop to the default web chat using Pyodide as a python interpreter?

The github-actions bot added the testing, examples, python, and server labels on Sep 25, 2024
@ochafik changed the title from "Tool call support (Llama 3.1, Functionary 3.2, Hermes 2 Pro) & Minimalist Jinja template engine" to "Tool call support (Llama 3.1, Functionary v3, Hermes 2 Pro) & Minimalist Jinja template engine" on Sep 25, 2024
@ochafik then changed the title to "Tool call support (Llama 3.1, Functionary v3, Hermes 2 Pro) w/ lazy grammars & minimalist Jinja engine" on Sep 25, 2024
Maximilian-Winter (Contributor) commented

@ochafik I really like your idea of using lazy grammars, and I would love to help. I'm the developer of llama-cpp-agent. Let me know if we can collaborate somehow.

ochafik (Collaborator, Author) commented Oct 17, 2024

@Maximilian-Winter thanks / sorry for the slow reply! (frantically busy few weeks 😅)

I'd love help on this - anything from testing the instructions above, to finding cool new examples / bugs, reporting on other models' tool call styles, or new ideas. I'm trying to release minja in its own mini-repo w/ better testing, but the lazy grammar part is probably what needs the most work next.

Depending on your timezone, happy to jump into a video chat too :-) (DM on x?)

(Also, llama-cpp-agent looks suuuper cool! 💜)

Maximilian-Winter (Contributor) commented

@ochafik Sure, that would be great. I live in Germany. I actually tried to get verified on X by buying Premium so I could write to you, but I'm still waiting for verification. If you want to reach me by email or Discord, feel free! My email is [email protected]

@ochafik changed the title to "Tool call support (Llama 3.x, Functionary v3, Hermes 2 Pro, Mistral Nemo, generic) w/ lazy grammars & minimalist Jinja engine" on Oct 24, 2024