llamafile v0.8.9

@jart released this 01 Jul 19:11

This release gets Gemma2 working closer to how Google intended.

  • af22695 Make gemma2-27b-it the same as aistudio.google.com
  • 41678c8 Add sliding window mask for Gemma2
  • 140eed5 Add soft-capping to Gemma2
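For context, soft-capping squashes logits through tanh so they stay within
a fixed bound, and the sliding window mask limits how far back each token
can attend. Below is a minimal C sketch of both ideas, assuming Gemma2's
published constants (attention logit cap 50.0, final logit cap 30.0, and a
4096-token window on alternating layers); the actual implementation lives
in the llama.cpp graph code, not in these helpers.

#include <math.h>
#include <stdbool.h>

// Soft-capping: rescale a logit into (-cap, +cap) using tanh.
// Gemma2 is reported to use cap = 50.0 for attention logits and
// cap = 30.0 for the final output logits (assumed values here).
static float soft_cap(float logit, float cap) {
    return cap * tanhf(logit / cap);
}

// Sliding window mask: token i may attend to token j only if j is
// causal (j <= i) and within the last `window` positions.
static bool may_attend(int i, int j, int window) {
    return j <= i && i - j < window;
}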

This release fixes Android support. You can now run LLMs on your phone
using Cosmopolitan software like llamafile. Thank you @aj47 (techfren.net)
for bug reports and testing efforts. See also the other bug fixes described
in the Cosmopolitan v3.5.4 and v3.5.3 release notes.

Our future replacement for the server now has an /embedding endpoint. On
my workstation, it's currently able to serve 851 requests per second for
a prompt with 52 tokens, using the all-MiniLM-L6-v2.Q6_K.gguf embeddings
model. None of the requests fail, and the 99th percentile latency is 56.74 ms.

  • 1346ef4 Create /embedding endpoint in new server
  • 263d39b Use float to string conversion
  • 0d62d05 Reclaim llama_decode() memory on cancelation
  • 617d841 Remove ggml_context cache
  • 46dda4f Refactor new server and get leak checker working
  • cd73243 Prevent vector overflow in llama.cpp

You can try the new embedding server as follows:

make -j o//llamafile/server/main
o//llamafile/server/main -m /weights/all-MiniLM-L6-v2.F32.gguf
curl 'http://127.0.0.1:8080/embedding?prompt=orange'
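
A typical next step is comparing the returned vectors. As an illustration
(this routine is not part of llamafile), here is a small C function that
computes cosine similarity between two embedding vectors:

#include <math.h>
#include <stddef.h>

// Cosine similarity between two embedding vectors of length n.
// Returns a value in [-1, 1]; values nearer 1 mean the prompts are
// semantically closer according to the embedding model.
static float cosine_similarity(const float *a, const float *b, size_t n) {
    float dot = 0, norm_a = 0, norm_b = 0;
    for (size_t i = 0; i < n; ++i) {
        dot += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    return dot / (sqrtf(norm_a) * sqrtf(norm_b) + 1e-12f);
}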

Compatibility with the old server's API of posting JSON content will be
added in upcoming changes. The same goes for the OpenAI API. The goal is
to be compatible with everything.