Tool call support (Llama 3.x, Functionary v3, Hermes 2 Pro, Mistral Nemo, generic) w/ lazy grammars & minimalist Jinja engine #9639
Conversation
@ochafik I really like your idea of using lazy grammar, I would love to help you. I'm the developer of llama-cpp-agent. Let me know if we can collaborate somehow.
@Maximilian-Winter thanks / sorry for the slow reply! (frantically busy few weeks 😅) I'd love help on this, anything from just testing out instructions above, to finding new cool examples / bugs, reporting on any other model's tool call styles, or new ideas. I'm trying to release minja in its own mini-repo w/ better testing, but the lazy grammar part is probably going to be what needs most work on next. Depending on your timezone, happy to jump into a video chat too :-) (DM on x?) (Also, llama-cpp-agent looks suuuper cool! 💜)
@ochafik Sure, that would be great. I'm living in Germany. I actually tried to verify on X, by buying premium to write you, but I still have to wait for verification. If you want to reach out to me by email or discord, feel free! My email is [email protected]
This supersedes #6389 (now using a fully C++ approach), #5695 (first attempt at supporting Functionary) and #9592 (more recent Python wrapper).
Background
This PR tackles two main problems related to tool calling:
**Lazy grammars**: Helping / forcing the model to follow the tool schemas w/ grammar constraints is tricky, as in most cases the model may also output normal, unconstrained content (except if `"tool_choice": "required"` is specified in the request). It's not currently possible to say `.* "<tool_call>" constrained "</tool_call>"`, as the leading `.*` will match eagerly. In [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389 I was avoiding this issue in the `thoughtful_steps` style, but the native tool call styles were still problematic.

Solved w/ lazy grammars activated by trigger words (similar to stop words, refactored into the same implementation). Output is completely unconstrained before triggers, and completely constrained after, which allows for `content` vs. `tool_call` outputs, and even mixes of the two (for the few models that support that). Per-model triggers (a minimal sketch of the activation logic follows this list):

- Llama 3.x: `<|python_tag|>` and `{"name": "toolN"` (for each `toolN` in the list of `tools` in the request), plus `{"`, which isn't quite right but helps steer 1B & 3B models. Will try and detect model size to keep a more specific trigger for the bigger 3.2 models.
- Hermes 2 Pro: `<tool_call>`.
- Functionary v3.2: `>>>toolN\n` for each `toolN`.
- Functionary v3.1: `<function=` and `<|python_tag|>`.
- Mistral Nemo: should be `[TOOL_CALLS]`, but it doesn't seem to (ever?) be emitted, so we're triggering on `{"` instead for now.
- Generic: no triggers; output is fully constrained (as when `tool_choice` is `required`).
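To make the mechanism concrete, here's a minimal sketch of the activation logic. This is illustrative only: the names are made up, and it uses a naive substring search where the PR refactors matching into the Aho–Corasick-based `llama_antiprompts` described further down.

```cpp
// Illustrative sketch of lazy grammar activation (not the PR's actual code).
// Output is unconstrained until a trigger word appears; from then on, the
// caller should filter every candidate token through the tool-call grammar.
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

struct lazy_grammar_state {
    std::vector<std::string> triggers; // e.g. {"<tool_call>", "<|python_tag|>"}
    bool constrained = false;          // flips to true once, never back
    std::string tail;                  // sliding window of recent output
};

// Call once per decoded token piece.
static void accept_piece(lazy_grammar_state & st, const std::string & piece) {
    if (st.constrained) return;
    st.tail += piece;
    for (const auto & t : st.triggers) {
        if (st.tail.find(t) != std::string::npos) {
            st.constrained = true; // grammar applies to all further tokens
            return;
        }
    }
    // keep just enough context to catch triggers split across token pieces
    size_t keep = 0;
    for (const auto & t : st.triggers) keep = std::max(keep, t.size());
    if (st.tail.size() > keep) st.tail.erase(0, st.tail.size() - keep);
}

int main() {
    lazy_grammar_state st;
    st.triggers = {"<tool_call>"};
    for (const char * piece : {"Let me look that up.", "<tool_", "call>"}) {
        accept_piece(st, piece);
    }
    std::printf("constrained: %s\n", st.constrained ? "yes" : "no"); // yes
}
```

The key property is that nothing before the trigger is constrained, so regular `content` stays free-form, while everything after it can be forced to match the tool-call schema.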
**Jinja chat templates** for tool-call-able models are getting increasingly complex, and implementing each of them in C++ is a maintenance hazard. Solved by implementing a minimal Jinja templating engine (`minja.hpp`), with just enough to render all the templates I could find in the wild. That's still a lot of code (2.5k LOC), but about 10x less so than Jinja2Cpp (not even counting its dependencies - it needs a subset of Boost and some C++ backfills). It's trivial to extend (say, to add support for a new filter / test), and it comes with decent error reporting and simple tests. And we could always switch to another implementation in the future.
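For a flavor of what a minimal Jinja engine has to do, here's a toy renderer that handles only `{{ var }}` substitution from a flat map. This is made-up code, far short of what minja supports (loops, filters, tests, error reporting), but it shows the general shape of the problem:

```cpp
// Toy template renderer: substitutes {{ var }} from a flat string map.
// Not minja's API; just an illustration of the rendering task.
#include <iostream>
#include <map>
#include <string>

static std::string render(const std::string & tmpl,
                          const std::map<std::string, std::string> & vars) {
    std::string out;
    size_t pos = 0;
    while (pos < tmpl.size()) {
        size_t open = tmpl.find("{{", pos);
        if (open == std::string::npos) { out += tmpl.substr(pos); break; }
        size_t close = tmpl.find("}}", open);
        if (close == std::string::npos) { out += tmpl.substr(pos); break; }
        out += tmpl.substr(pos, open - pos);
        // extract and trim the variable name between the braces
        std::string key = tmpl.substr(open + 2, close - open - 2);
        key.erase(0, key.find_first_not_of(" \t"));
        key.erase(key.find_last_not_of(" \t") + 1);
        auto it = vars.find(key);
        out += (it != vars.end()) ? it->second : "";
        pos = close + 2;
    }
    return out;
}

int main() {
    std::cout << render("<|user|>{{ content }}<|end|>", {{"content", "Hi!"}}) << "\n";
}
```

Real chat templates additionally iterate over `messages`, apply filters, and branch on roles, which is presumably where most of minja's 2.5k LOC goes.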
With this intro out of the way, here are the parts of this PR that could possibly be sent separately (currently itemized, to be re-itemized as commits):

- `grammar_trigger_words` + `llama_antiprompts`: refactors the stop logic (barebones Aho–Corasick algorithm to handle multiple stop words efficiently - with grammar trigger words we may have many), aligning `cli` & `server` (e.g. single-token stop logic) and handling grammar trigger words. A toy sketch of the matching idea follows this list.
- `minja.hpp` + `test/{test-minja.cpp,update_jinja_goldens.py,chat/{contexts,templates,goldens}}`: minimal Jinja templating engine and its tests against actual templates & a few test contexts.
- Tool call grammar generation + output parsing logic for Llama 3.1, Functionary v3 (2 variants) and Hermes 2 Pro.
- Integration in `llama-server` (fenced by `--jinja`) w/ `tools` & `tool_choice` support + updated `response_format` compliance.
- Minimal `examples/agent` with a tool call / action loop, barebones tools and instructions / support to run them in a siloed docker container (see usage below).
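Here's a toy version of the multi-pattern matching behind the `llama_antiprompts` refactor mentioned above: a textbook Aho–Corasick automaton, fed one character at a time, reporting whichever stop / trigger word just completed. (The PR's actual implementation is integrated with tokenization and partial matches; this is just the textbook algorithm.)

```cpp
// Textbook Aho-Corasick multi-pattern matcher (toy, not the PR's code).
#include <iostream>
#include <map>
#include <queue>
#include <string>
#include <vector>

struct Matcher {
    struct Node { std::map<char, int> next; int fail = 0; int word = -1; };
    std::vector<Node> nodes = {Node{}}; // node 0 is the root

    void add(const std::string & w, int id) {
        int cur = 0;
        for (char c : w) {
            auto it = nodes[cur].next.find(c);
            if (it == nodes[cur].next.end()) {
                nodes[cur].next[c] = (int) nodes.size();
                nodes.emplace_back();
                cur = (int) nodes.size() - 1;
            } else {
                cur = it->second;
            }
        }
        nodes[cur].word = id;
    }

    // standard BFS construction of failure links
    void build() {
        std::queue<int> q;
        for (auto & kv : nodes[0].next) q.push(kv.second);
        while (!q.empty()) {
            int u = q.front(); q.pop();
            for (auto & kv : nodes[u].next) {
                char c = kv.first; int v = kv.second;
                int f = nodes[u].fail;
                while (f != 0 && !nodes[f].next.count(c)) f = nodes[f].fail;
                auto it = nodes[f].next.find(c);
                nodes[v].fail = (it != nodes[f].next.end() && it->second != v) ? it->second : 0;
                q.push(v);
            }
        }
    }

    // feed one character; returns a matched word id, or -1
    int step(int & state, char c) const {
        while (state != 0 && !nodes[state].next.count(c)) state = nodes[state].fail;
        auto it = nodes[state].next.find(c);
        state = (it != nodes[state].next.end()) ? it->second : 0;
        for (int s = state; s != 0; s = nodes[s].fail)
            if (nodes[s].word >= 0) return nodes[s].word;
        return -1;
    }
};

int main() {
    Matcher m;
    m.add("<tool_call>", 0);
    m.add("<|python_tag|>", 1);
    m.build();
    int state = 0;
    for (char c : std::string("some text then <tool_call>")) {
        int hit = m.step(state, c);
        if (hit >= 0) std::cout << "matched trigger " << hit << "\n";
    }
}
```

A single pass over the stream handles any number of stop words at once, which matters here since every tool name can contribute its own trigger word.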
How to use / test
- Run `llama-server` w/ jinja templates. Note that most models need a template override (the HF to GGUF conversion only retains a single `chat_template`, but sometimes the models only support tool calls in an alternative chat template).
- Expose the functions in `examples/agent/tools` as a FastAPI service inside a docker container for some level of isolation (+ sneaky logging of outgoing http and https traffic: you wanna watch over those agents' shoulders for the time being 🧐).
- Run the example agent with a simple goal and access to the tools service (define `OPENAI_API_KEY` and add `--openai` to compare to OpenAI). Sample output:

> Overall, the home page of llama.cpp provides a wealth of information on how to use and contribute to the project, as well as a list of resources and tools that can be used to work with the project.
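For reference, requests exercising the new `tools` / `tool_choice` support follow the standard OpenAI chat-completions shape. Below is a sketch that builds such a body with nlohmann::json (which llama.cpp already vendors as `common/json.hpp`); the `get_weather` tool is a made-up example, and exact field coverage in `llama-server` may vary:

```cpp
// Builds an OpenAI-style chat completion request body with a tool definition,
// of the kind POSTed to llama-server's /v1/chat/completions when run w/ --jinja.
#include <iostream>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

int main() {
    json body = {
        {"messages", json::array({
            {{"role", "user"}, {"content", "What is the weather in Tokyo?"}},
        })},
        {"tools", json::array({
            {
                {"type", "function"},
                {"function", {
                    {"name", "get_weather"}, // made-up example tool
                    {"description", "Get the current weather for a city"},
                    {"parameters", {
                        {"type", "object"},
                        {"properties", {{"city", {{"type", "string"}}}}},
                        {"required", json::array({"city"})},
                    }},
                }},
            },
        })},
        {"tool_choice", "auto"}, // or "required" to force a tool call
    };
    std::cout << body.dump(2) << std::endl;
}
```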
TODOs before undrafting:
- `[TOOL_CALLS]` prefix never generated by model (expected, or bug in gguf/llama.cpp?)
- `thoughtful_steps` tool support from [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389 (using JSON structured output even with models not trained for tool calling)
- `"all\n"` in non-tool-call outputs for Functionary v3.2
- `llama-cli` support
- Trigger on `(^|\n)?{"` and otherwise not trigger spuriously elsewhere
- Command R Plus, DeepSeek
- `--cache-prompt` defaults to true; follow up will be to allow in-slot restoration and saving of cache, see this branch for instance
- `chat_template` should maybe be resolved earlier? (now a `llama_chat_template` class)
- `llama_apply_chat_template` would benefit from a massive facelift. Maybe passing in a struct? (Have introduced a new C++ API, `llama_chat_template::apply`.)
- `llama_token_to_piece(ctx, token)` should really take `(model, token)` instead, but that's a breaking API change; added a `_llama_token_to_piece` that takes a model. Moved the `llama_chat_template_from_model` helper to `common.cpp`
- `builtin_tools` and `todays_date` in llama3.1's template
- `test-chat-templates` & `test-minja` (write each test case in a `.jinja` file)
- `bos_token` in the current chat template logic
- `examples/tool-call` from [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389

Possible follow ups: