llamafile v0.8.9

@jart released this 01 Jul 19:11

This release gets Gemma2 working closer to how Google intended.

  • af22695 Make gemma2-27b-it the same as aistudio.google.com
  • 41678c8 Add sliding window mask for Gemma2
  • 140eed5 Add soft-capping to Gemma2
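For context, soft-capping squashes logits through tanh so they stay within
a fixed bound, and the sliding window mask limits how far back each token
can attend. Below is a minimal C sketch of both ideas, assuming Gemma2's
published constants (attention logit cap 50.0, final logit cap 30.0, and a
4096-token window on alternating layers); the actual implementation lives
in the llama.cpp graph code, not in these helpers.

#include <math.h>
#include <stdbool.h>

// Soft-capping: rescale a logit into (-cap, +cap) using tanh.
// Gemma2 is reported to use cap = 50.0 for attention logits and
// cap = 30.0 for the final output logits (assumed values here).
static float soft_cap(float logit, float cap) {
    return cap * tanhf(logit / cap);
}

// Sliding window mask: token i may attend to token j only if j is
// causal (j <= i) and within the last `window` positions.
static bool may_attend(int i, int j, int window) {
    return j <= i && i - j < window;
}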

This release fixes Android support. You can now run LLMs on your phone
using Cosmopolitan software like llamafile. Thank you @aj47 (techfren.net)
for bug reports and testing efforts. See also the other bug fixes described
in the Cosmopolitan v3.5.4 and v3.5.3 release notes.

Our future replacement for the server now has an /embedding endpoint. On
my workstation, it's currently able to serve 851 requests per second for
a prompt with 52 tokens, using the all-MiniLM-L6-v2.Q6_K.gguf embeddings
model. None of the requests fail, and the 99th percentile latency is 56.74 ms.

  • 1346ef4 Create /embedding endpoint in new server
  • 263d39b Use float to string conversion
  • 0d62d05 Reclaim llama_decode() memory on cancelation
  • 617d841 Remove ggml_context cache
  • 46dda4f Refactor new server and get leak checker working
  • cd73243 Prevent vector overflow in llama.cpp

You can try the new embedding server as follows:

make -j o//llamafile/server/main
o//llamafile/server/main -m /weights/all-MiniLM-L6-v2.F32.gguf
curl 'http://127.0.0.1:8080/embedding?prompt=orange'
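
A typical next step is comparing the returned vectors. As an illustration
(this routine is not part of llamafile), here is a small C function that
computes cosine similarity between two embedding vectors:

#include <math.h>
#include <stddef.h>

// Cosine similarity between two embedding vectors of length n.
// Returns a value in [-1, 1]; values nearer 1 mean the prompts are
// semantically closer according to the embedding model.
static float cosine_similarity(const float *a, const float *b, size_t n) {
    float dot = 0, norm_a = 0, norm_b = 0;
    for (size_t i = 0; i < n; ++i) {
        dot += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    return dot / (sqrtf(norm_a) * sqrtf(norm_b) + 1e-12f);
}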

Compatibility with the old server's API of posting JSON content will be
added in upcoming changes. The same goes for the OpenAI API. The goal is
to be compatible with everything.