
Tool call support (Llama 3.x, Functionary v3, Hermes 2 Pro, Mistral Nemo, generic) w/ lazy grammars & minimalist Jinja engine #9639

Draft - wants to merge 124 commits into master

Conversation

ochafik (Collaborator) commented Sep 25, 2024

This supersedes #6389 (now using a fully C++ approach), #5695 (first attempt at supporting Functionary) and #9592 (more recent Python wrapper).

Background

This PR tackles two main problems related to tool calling:

  • Lazy grammars: Helping / forcing the model to follow the tool schemas w/ grammar constraints is tricky, as in most cases the model may also output normal, unconstrained content (unless "tool_choice": "required" is specified in the request). It's not currently possible to say .* "<tool_call>" constrained "</tool_call>", as the leading .* will match eagerly. In #6389 ([WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python)) I avoided this issue with the thoughtful_steps style, but the native tool call styles were still problematic.

    • Solved w/ lazy grammars activated by trigger words (similar to stop words, and refactored into the same implementation). Output is completely unconstrained before a trigger, and completely constrained after, which allows for content vs. tool_call outputs, and even mixes of the two (for the few models that support that); see the sketch after this list.

      • For Llama3.1-Instruct (cf. llama-stack-apps repo) for instance, triggers are <|python_tag|> and {"name": "toolN" (for each toolN in the list of tools in the request).
      • For Llama3.2-Instruct, we eagerly trigger on {" which isn't quite right but helps steer the 1B & 3B models. Will try to detect model size to keep a more specific trigger for the bigger 3.2 models.
      • For Hermes Pro (cf. Hermes-Function-Calling repo), it's <tool_call>.
      • For Functionary v3.llama3, it's >>>toolN\n for each toolN.
      • For Functionary v3-llama3.1, it's <function= and <|python_tag|>.
      • For Mistral Nemo, the trigger ought to be [TOOL_CALLS] but it doesn't seem to (ever?) be emitted, so we're triggering on {" instead for now.
      • For other models ("generic" tool call style), no lazy grammars are used - just a normal JSON schema that can contain schema-constrained tool calls or content (unless tool_choice is required).
  • Jinja chat templates for tool-call-able models are getting increasingly complex, and implementing each of them in C++ is a maintenance hazard.

    • Solved by implementing a minimal Jinja engine (minja.hpp), with just enough to render all the templates I could find in the wild. That's still a lot of code (2.5k LOC), but about 10x less than Jinja2Cpp (not even counting its dependencies - it needs a subset of Boost and some C++ backfills). It's trivial to extend (say, to add support for a new filter / test), it comes with decent error reporting and simple tests, and we could always switch to another implementation in the future.
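To make the lazy-grammar idea concrete, here is a minimal Python sketch of the control flow. It is illustrative only - the PR implements this in C++ inside the sampling loop, and sample_free / sample_constrained are hypothetical stand-ins for one step of unconstrained vs. grammar-constrained decoding:

    # Sketch of lazy grammars: unconstrained until a trigger word appears,
    # grammar-constrained from then on. Names are illustrative, not the PR's API.
    def generate_with_lazy_grammar(sample_free, sample_constrained, triggers,
                                   max_tokens=512):
        out = ""
        constrained = False
        for _ in range(max_tokens):
            piece = (sample_constrained if constrained else sample_free)()
            if piece is None:  # end of generation
                break
            out += piece
            # Naive substring scan for clarity; the PR matches triggers
            # incrementally with an Aho-Corasick automaton instead.
            if not constrained and any(t in out for t in triggers):
                # From here on, every token must follow the tool-call grammar
                # (the matched trigger itself is part of the grammar's start).
                constrained = True
        return out

Before a trigger such as <tool_call> or {"name": "toolN" fires, the model is free to produce plain content; once it fires, decoding is fully constrained, so tool calls always parse.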

With this intro out of the way, here are the parts of this PR that could be sent separately (currently itemized, to be re-itemized as commits):

  • grammar_trigger_words + llama_antiprompts: refactors the stop logic (a barebones Aho-Corasick automaton to handle multiple stop words efficiently - with grammar trigger words we may have many), aligning cli & server (e.g. single-token stop logic) and handling grammar trigger words; see the sketch after this list.

  • minja.hpp + test/{test-minja.cpp,update_jinja_goldens.py,chat/{contexts,templates,goldens}}: minimal Jinja templating engine and its tests against actual templates & a few test contexts

  • Tool call grammar generation + output parsing logic for Llama 3.1, Functionary v3 (2 variants) and Hermes 2 Pro

  • Integration in llama-server (fenced by --jinja) w/ tools, tool_choice support + updated response_format compliance.

  • Minimal examples/agent with a tool call / action loop, barebones tools and instructions / support to run them in a siloed docker container (see usage below)
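For the antiprompt refactor mentioned above, the core data structure is an Aho-Corasick automaton. Below is a self-contained Python sketch of the idea (the actual implementation is C++ in llama_antiprompts; this is just to show how many stop / trigger words can be matched in one pass over streamed characters):

    from collections import deque

    class MultiStopMatcher:
        """Barebones Aho-Corasick: feed characters one at a time, get back any
        stop / trigger words that end at the current position."""
        def __init__(self, words):
            self.next = [{}]   # trie edges: node -> {char: child node}
            self.fail = [0]    # fallback link: longest proper suffix state
            self.hits = [[]]   # words ending at each node
            for w in words:
                node = 0
                for ch in w:
                    if ch not in self.next[node]:
                        self.next.append({})
                        self.fail.append(0)
                        self.hits.append([])
                        self.next[node][ch] = len(self.next) - 1
                    node = self.next[node][ch]
                self.hits[node].append(w)
            queue = deque(self.next[0].values())
            while queue:  # BFS to fill in failure links
                node = queue.popleft()
                for ch, child in self.next[node].items():
                    queue.append(child)
                    f = self.fail[node]
                    while f and ch not in self.next[f]:
                        f = self.fail[f]
                    self.fail[child] = self.next[f].get(ch, 0)
                    self.hits[child] += self.hits[self.fail[child]]
            self.state = 0

        def feed(self, ch):
            """Advance by one character; return the words that just matched."""
            while self.state and ch not in self.next[self.state]:
                self.state = self.fail[self.state]
            self.state = self.next[self.state].get(ch, 0)
            return self.hits[self.state]

Usage, with triggers from the styles above:

    matcher = MultiStopMatcher(["<tool_call>", "<|python_tag|>"])
    for ch in 'Sure!<tool_call>{"name": ...':
        if matcher.feed(ch):
            print("matched - stop, or switch to the tool-call grammar")
            break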

How to use / test

  • Run llama-server w/ Jinja templates. Note that most models need a chat template override (the HF to GGUF conversion only retains a single chat_template, but some models only support tool calls in an alternative chat template).

    make -j LLAMA_CURL=1 llama-server
    
    # Nous Hermes 2 Pro Llama 3 8B (recommended at that size)
    ./llama-server --jinja -fa --verbose \
      -hfr NousResearch/Hermes-2-Pro-Llama-3-8B-GGUF -hff Hermes-2-Pro-Llama-3-8B-Q8_0.gguf \
      --chat-template-file tests/chat/templates/NousResearch-Hermes-2-Pro-Llama-3-8B-tool_use.jinja
    
    # Mistral Nemo
    ./llama-server --jinja -fa --verbose \
      -hfr bartowski/Mistral-Nemo-Instruct-2407-GGUF -hff Mistral-Nemo-Instruct-2407-Q8_0.gguf \
      --chat-template-file tests/chat/templates/mistralai-Mistral-Nemo-Instruct-2407.jinja
    
    # Llama 3.1 8B
    ./llama-server --jinja -fa --verbose \
      -hfr lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF -hff Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf
    
    # functionary-small-v3.2 (served with the medium-v3.2 chat template)
    ./llama-server --jinja -fa --verbose \
      -hfr meetkai/functionary-small-v3.2-GGUF -hff functionary-small-v3.2.Q4_0.gguf \
      --chat-template-file tests/chat/templates/meetkai-functionary-medium-v3.2.jinja
    
    # Llama 3.2 3B (poor compliance)
    ./llama-server --jinja -fa --verbose \
      -hfr lmstudio-community/Llama-3.2-3B-Instruct-GGUF -hff Llama-3.2-3B-Instruct-Q6_K_L.gguf \
      --chat-template-file tests/chat/templates/meta-llama-Llama-3.2-3B-Instruct.jinja
  • Expose the functions in examples/agent/tools as a FastAPI service inside a docker container for some level of isolation (+ sneaky logging of outgoing http and https traffic: you'll want to watch over those agents' shoulders for the time being 🧐):

    examples/agent/serve_tools_inside_docker.sh

    [!WARNING]
    The command above gives tools (and your agent) access to the web (and read-only access to examples/agent/**). You can loosen / restrict web access in examples/agent/squid/conf/squid.conf.

  • Run the example agent with a simple goal and access to the tools service (define OPENAI_API_KEY and add --openai to compare to OpenAI):

    uv run examples/agent/run.py --tools http://localhost:8088 \
      "What is the sum of 2535 squared and 32222000403?"
    
    > 🛠️  fetch_page, wait_for_date, wait_for_duration, python, brave_search
    > ⚙️  python(code="print(2535**2 + 32222000403)") → 32228426628
    > 
    > The function "python" executes the given Python code and returns the output. In this case, the code is printing the sum of 2535 squared and 32222000403.
    uv run examples/agent/run.py --tools http://localhost:8088 \
      "What is the best BBQ join in Laguna Beach?"
    
    > 🛠️  Tools: python, fetch_page, brave_search
    > ⚙️  brave_search(query="best bbq joint in laguna beach")
    > → 4283 chars
    > Based on the search results, Beach Pit BBQ seems to be a popular and highly-rated BBQ joint in Laguna Beach. They offer a variety of BBQ options, including ribs, pulled pork, brisket, salads, wings, and more. They have dine-in, take-out, and catering options available.
    uv run examples/agent/run.py --tools http://localhost:8088 \
      "Search for, fetch and summarize the homepage of llama.cpp"
    
    🛠️  Tools: python, fetch_page, brave_search
    ⚙️  brave_search(query="summary of homepage of llama.cpp")
     → 10 items
    ⚙️  brave_search(query="homepage of llama.cpp")
     → 10 items
     ⚙️  fetch_page(url="https://github.com/ggerganov/llama.cpp")
     → 47397 chars
      "Skip to content\n\n## Navigation Menu\n\nToggle navigation\n\n[ ](/)\n\n[ Sign in ](/login?return_to=https%3A%2F%2Fgithub.com%2Fggerganov%2Fllama.cpp)\n\n  * Product \n\n    * [ Actions Automate any workflow  ](https://github.com/features/actions)\n    * [ Security Find and fix vulnerabilities  ](https://github.com/features/security)\n    * [ Codespaces Instant dev environments  ](https://github.com/features/codespaces)\n    * [ GitHub Copilot Write better code with AI  ](https://github.com/features/copilot)\n    * [ Code review Manage code changes  ](https://github.com/features/code-review)\n    * [ Issues Plan and track work  ](https://github.com/features/issues)\n    * [ Discussions Collaborate outside of code  ](https://github.com/features/discussions)\n\nExplore\n\n    * [ All features ](https://github.com/features)\n    * [ Documentation  ](https://docs.github.com)\n    * [ GitHub Skills  ](https://skills.github.com)\n    * [ Blog  ](https://github.blog)\n\n  * Solutions \n\nBy 
    The home page of llama.cpp is a GitHub repository that provides a C/C++ implementation of LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware. It includes support for Apple silicon, AVX, AVX2, and AVX512 architectures, as well as custom CUDA kernels for running LLMs on NVIDIA GPUs. The repository also includes a web server that can be used to serve local models and easily connect them to existing clients.
    
    The home page also provides a list of supported models, including LLaMA, LLaMA 2, LLaMA 3, Mistral 7B, and others. It also includes information on how to build and install llama.cpp, as well as how to use the web server and the command-line interface.
    
    The home page also includes a list of tools and resources, including a GitHub repository for the llama.cpp project, a Hugging Face space for preparing and quantizing models, and a browser app for writing GBNF grammars.
    

Overall, the home page of llama.cpp provides a wealth of information on how to use and contribute to the project, as well as a list of resources and tools that can be used to work with the project.
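For a quick sanity check of the server's tool support without the agent, you can also POST an OpenAI-style request directly. A hedged Python sketch (assumptions: llama-server is listening on its default port 8080, and get_weather is a made-up tool; with --jinja and a tool-call-capable template, the assistant message should come back with either content or a tool_calls array):

    # Hypothetical direct request to llama-server's OpenAI-compatible endpoint.
    # Port 8080 (the server default) and the get_weather tool are assumptions.
    import requests

    response = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "messages": [
                {"role": "user", "content": "What's the weather in Paris?"},
            ],
            "tools": [{
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "description": "Get the current weather in a given city",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }],
        },
    )
    print(response.json()["choices"][0]["message"])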

TODOs before undrafting:

  • Move minja to its own location w/ fuller testing (fuzzing, etc) or at least its own PR
  • Fix CI build (tests still failing on windows)
  • Nemo: investigate why the [TOOL_CALLS] prefix is never generated by the model (expected, or a bug in the GGUF conversion / llama.cpp?)
  • Bring back generic thoughtful_steps tool support from [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389 (using JSON structured output even with models not trained for tool calling)
  • Support streaming (of content - as long as it doesn't trigger any partial antiprompt match - and of individual tool calls)
  • Strip leading "all\n" in non-tool-call outputs for Functionary v3.2
  • Implement strftime_now in minja (for Llama 3.2), also update today's date for Llama 3.1
  • Test w/ more models
  • Add grammar trigger words support to llama-cli
  • Support regexps as antiprompts? Would allow triggering tool call grammar for small Llama 3.2 models (1B, 3B) on (^|\n)?{" and otherwise not trigger spuriously elsewhere.
  • Add support for broken templates (GLM3..., Command R Plus, DeepSeek)
  • Add more tests (heavy e2e w/ actual models, tool_choice = none, parallel tool call, etc)
  • Add support for {"type": "code_interpreter"} (special-cased by functionary-medium-v3.1's template), maybe using ipython automatically for llama 3.1
  • Add configurable network isolation of tools w/ a proxy (also caches pip & deb packages & limits access to host)
  • KV cache saving / reuse (within session & beyond) in agent (--cache-prompt defaults to true; a follow-up will be to allow in-slot restoration and saving of the cache, see this branch for instance)
  • Add tool call grammar tests (although indirectly covered by server "required" test cases)
  • Add more tools (brave search) + agent examples
  • Refactorings?
    • Ideally we'd pass some kind of ChatHandler between the OAI init & final callback, and have it handle both streaming and non-streaming cases (should parallel tool calls be streamed?)
    • chat_template should maybe be resolved earlier? (now a llama_chat_template class)
    • llama_apply_chat_template would benefit from a massive facelift. Maybe passing in a struct? (have introduced a new C++ API llama_chat_template::apply)
    • llama_token_to_piece(ctx, token) should really take (model, token) instead, but that's a breaking API change
      • calls common-local _llama_token_to_piece that takes model. Moved llama_chat_template_from_model helper to common.cpp
  • Fix functionary-medium-* templates' golden generation
  • Add examples to server readme
  • Support key-value overrides for templates (e.g. builtin_tools and todays_date in llama3.1's template)
    • Done by tool call handler, not user-configurable
  • Unify test-chat-templates & test-minja (write each test case in a .jinja file)
    • Fix a couple of missing bos_token in the current chat template logic
  • Bring back agent / tool call loop example + python tools isolation in docker (examples/tool-call) from [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389
  • Test w/ meetkai/functionary-small-v3.2

Possible follow ups:

  • Add tool call loop to the default web chat using Pyodide as a python interpreter?

The github-actions bot added the testing, examples, python, and server labels on Sep 25, 2024
@ochafik changed the title from "Tool call support (Llama 3.1, Functionary 3.2, Hermes 2 Pro) & Minimalist Jinja template engine" to "Tool call support (Llama 3.1, Functionary v3, Hermes 2 Pro) & Minimalist Jinja template engine" on Sep 25, 2024
@ochafik then changed the title to "Tool call support (Llama 3.1, Functionary v3, Hermes 2 Pro) w/ lazy grammars & minimalist Jinja engine" on Sep 25, 2024
Maximilian-Winter (Contributor) commented

@ochafik I really like your idea of using lazy grammars, and I would love to help. I'm the developer of llama-cpp-agent. Let me know if we can collaborate somehow.

ochafik (Collaborator, Author) commented Oct 17, 2024

@Maximilian-Winter thanks / sorry for the slow reply! (frantically busy few weeks 😅)

I'd love help on this - anything from testing the instructions above, to finding cool new examples / bugs, reporting on other models' tool call styles, or new ideas. I'm trying to release minja in its own mini-repo w/ better testing, but the lazy grammar part is probably what needs the most work next.

Depending on your timezone, happy to jump into a video chat too :-) (DM on x?)

(Also, llama-cpp-agent looks suuuper cool! 💜)

Maximilian-Winter (Contributor) commented

@ochafik Sure, that would be great. I live in Germany. I actually tried to get verified on X by buying Premium so I could write to you, but I'm still waiting for verification. If you want to reach me by email or Discord, feel free! My email is [email protected]

@ochafik changed the title to "Tool call support (Llama 3.x, Functionary v3, Hermes 2 Pro, Mistral Nemo, generic) w/ lazy grammars & minimalist Jinja engine" on Oct 24, 2024